AI & ML interests
Privacy, Large Language Models, Explainability
Recent Activity
Reacted to a post with ❤️ about 2 hours ago:
This new preprint fine-tunes T5-small and Mistral-7B on the AI4Privacy PII-Masking-200K dataset and shows that lightweight models can rival, and sometimes match, much larger LLMs for privacy tasks.
The study tackles a real deployment question many teams face:
Is PII masking a model-size problem, or a data-quality problem?
Using AI4Privacy’s large-scale, standardized PII annotations, the authors systematically compare:
Encoder–decoder models (T5) vs.
Decoder-only models (Mistral)
across accuracy, robustness, latency, and real-world conversational text.
What stood out:
Mistral-7B achieved higher recall and robustness on noisy, informal inputs, but at 10× higher latency
T5-small, trained on the same AI4Privacy data, delivered fast, structured, low-cost masking, making it viable for real-time systems
Dataset normalization (not model size) was one of the biggest drivers of performance gains
The models were then deployed in a live Discord bot, where performance dropped under real-world conditions, a reminder that benchmarks alone aren’t enough.
The takeaway is hard to ignore:
Privacy-preserving AI scales through data design, not just bigger models.
This work reinforces why open, well-curated datasets like AI4Privacy PII-Masking-200K are becoming foundational infrastructure for privacy-first AI, especially for teams that need self-hosted, transparent solutions.
📄 Read the paper: https://arxiv.org/abs/2512.18608
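Neither the post nor this page links code, so here is a minimal sketch of the setup the preprint appears to describe: fine-tuning T5-small as a text-to-text PII masker on the AI4Privacy data. The dataset id and the source_text/target_text column names are assumptions about the dataset schema, not confirmed by the post; check the dataset card before running.

```python
# Hedged sketch: treat PII masking as seq2seq generation and fine-tune
# T5-small on a slice of the AI4Privacy data. Column names below are
# assumptions, not confirmed by the post.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

ds = load_dataset("ai4privacy/pii-masking-200k", split="train[:1%]")
tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def preprocess(batch):
    # Input: raw text; target: the same text with PII spans replaced by labels.
    x = tok(["mask pii: " + t for t in batch["source_text"]],
            truncation=True, max_length=512)
    y = tok(text_target=batch["target_text"], truncation=True, max_length=512)
    x["labels"] = y["input_ids"]
    return x

ds = ds.map(preprocess, batched=True, remove_columns=ds.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="t5-small-pii-mask",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=1, logging_steps=50),
    train_dataset=ds,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```

The same data could instead be formatted as instruction-style prompts for a decoder-only model like Mistral-7B; the post's 10× latency gap reflects that difference in model size at inference time.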
Posted an update about 2 hours ago (the same summary and paper link as above).

Models
MikeDoes/mmbert-multilingual-20250916-212213
0.1B • Updated • 2
MikeDoes/mmbert-multilingual-20250916-202535
Updated
MikeDoes/mmbert-multilingual-20250916-170430
0.1B • Updated • 2
MikeDoes/mmbert-multilingual-20250916-173350
0.3B • Updated • 8
MikeDoes/mmbert-multilingual-20250916-170450
Updated
MikeDoes/mmbert-multilingual-20250916-155621
0.3B • Updated • 6
MikeDoes/mmbert-multilingual-20250916-155528
Fill-Mask • 0.1B • Updated • 2
MikeDoes/mmbert-multilingual-20250916-145114
0.3B • Updated • 1
MikeDoes/mmbert-multilingual-20250916-143043
Updated
MikeDoes/mmbert-multilingual-20250916-133611
0.3B • Updated • 2
MikeDoes/mmbert-multilingual-20250916-130537
Fill-Mask • 0.3B • Updated • 6
MikeDoes/mmbert-multilingual-20250916-120850
Fill-Mask • 0.3B • Updated • 5
MikeDoes/mmbert-multilingual-20250916-114740
Fill-Mask • 0.3B • Updated • 3
MikeDoes/mmbert-multilingual-20250916-103748
Fill-Mask • 0.3B • Updated • 3
MikeDoes/modernbert-english-ner-20250808-034913
Token Classification • 0.1B • Updated • 1
MikeDoes/modernbert-english-ner-20250806-110517
0.1B • Updated • 1
MikeDoes/quick-ner-model-20250726-011948
Token Classification • 0.1B • Updated • 1
MikeDoes/eurobert-ner-model-20250726-134739
Token Classification • 0.2B • Updated • 3
MikeDoes/eurobert-ner-model-20250726-082438
Updated
MikeDoes/quick-ner-model-20250726-004735
Updated
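For the token-classification checkpoints listed above, a hypothetical usage sketch, assuming the repo is public and loads with the standard transformers pipeline (the repo name is taken from the list; the entity label set is not documented here):

```python
# Hedged sketch: load one of the NER checkpoints above with the standard
# transformers pipeline. That this exact repo is public and
# pipeline-compatible is an assumption based on its task tag.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="MikeDoes/modernbert-english-ner-20250808-034913",
    aggregation_strategy="simple",  # merge sub-token predictions into spans
)

text = "Contact Jane Doe at jane.doe@example.com before Friday."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```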