PosterSentry: Multimodal Scientific Poster Classifier
Model Description
PosterSentry is a lightweight, CPU-optimized multimodal classifier that determines whether a PDF is a scientific poster or a non-poster (paper, proceedings, newsletter, abstract book, etc.).
It is part of the quality control pipeline for posters.science, a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).
Developed by the FAIR Data Innovations Hub at the California Medical Innovations Institute (CalMI²).
Related Models & Tools
| Resource | Description | Link |
|---|---|---|
| PosterSentry | Multimodal poster classifier (this model) | fairdataihub/poster-sentry |
| Llama-3.1-8B-Poster-Extraction | Poster → structured JSON extraction | fairdataihub/Llama-3.1-8B-Poster-Extraction |
| poster2json | Python library for poster extraction | PyPI · Docs · GitHub |
| poster-json-schema | DataCite-based poster metadata schema | GitHub |
| Platform | posters.science | posters.science |
Pipeline Position
PosterSentry sits at the front of the posters.science pipeline, screening incoming PDFs before the expensive Llama-based extraction:
PDF Input
    │
    ▼
┌──────────────┐     ┌────────────────────────────────┐     ┌──────────────┐
│ PosterSentry │ ──► │ Llama-3.1-8B-Poster-Extraction │ ──► │ poster2json  │
│ (classify)   │     │ (extract structured metadata)  │     │ (validate)   │
└──────────────┘     └────────────────────────────────┘     └──────────────┘
    poster?      →       raw text → JSON schema                FAIR output
Architecture
Three feature channels are concatenated into a 542-dimensional vector (512 + 15 + 15) and fed to a single LogisticRegression:
| Channel | Features | Dimension | Signal |
|---|---|---|---|
| Text | model2vec (potion-base-32M) embedding | 512 | Semantic content |
| Visual | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
| Structural | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |
Each classifier head is a single linear layer stored as a numpy .npz file (10 KB). Inference is pure numpy; no torch is required at prediction time.
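The inference path described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the packaged implementation: the function name `predict_poster_proba`, the parameter names (`mean`, `scale`, `coef`, `intercept`), and the random stand-in weights are all assumptions, and the real `.npz` layout may differ.

```python
import numpy as np

TEXT_DIM, VISUAL_DIM, STRUCT_DIM = 512, 15, 15  # channel sizes from the table above


def predict_poster_proba(text_emb, visual, structural, mean, scale, coef, intercept):
    """Return P(poster) for one document using a standardized linear head."""
    # Concatenate the three channels into one 542-dimensional vector.
    x = np.concatenate([text_emb, visual, structural]).astype(np.float32)
    # StandardScaler-style normalization, then a single linear layer.
    z = (x - mean) / scale
    logit = float(z @ coef + intercept)
    # The logistic sigmoid maps the logit to a probability.
    return 1.0 / (1.0 + np.exp(-logit))


# Stand-in parameters (random, for illustration only; not the trained weights).
rng = np.random.default_rng(0)
dim = TEXT_DIM + VISUAL_DIM + STRUCT_DIM
mean = np.zeros(dim, dtype=np.float32)
scale = np.ones(dim, dtype=np.float32)
coef = rng.normal(size=dim).astype(np.float32)
intercept = 0.0

proba = predict_poster_proba(
    rng.normal(size=TEXT_DIM),
    rng.normal(size=VISUAL_DIM),
    rng.normal(size=STRUCT_DIM),
    mean, scale, coef, intercept,
)
print(f"P(poster) = {proba:.3f}")
```

With a head this small, the per-document cost is one vector subtraction, one division, and one dot product, which fits the CPU-only design.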
Performance
Validated on 3,606 real scientific documents:
| Metric | Value |
|---|---|
| Accuracy | 87.3% |
| F1 (poster) | 87.1% |
| F1 (non-poster) | 87.4% |
| Precision (poster) | 88.2% |
| Recall (poster) | 85.9% |
| Inference speed | ~300 docs/sec (CPU) |
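As a consistency check on the table, the poster F1 score can be recomputed from the reported precision and recall using the standard harmonic-mean formula. The helper name `f1_score_from` is hypothetical. The result, roughly 87.0%, sits within rounding distance of the reported 87.1%, which is consistent with the table values having been rounded from more precise inputs.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall for the poster class, taken from the table above.
f1_poster = f1_score_from(0.882, 0.859)
print(f"{f1_poster:.3f}")  # 0.870
```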
Top Features by Importance
| Rank | Feature | Coefficient | Signal |
|---|---|---|---|
| 1 | size_per_page_kb | +7.65 | Posters are dense, high-res single pages |
| 2 | page_count | -5.49 | More pages = not a poster |
| 3 | file_size_kb | -5.44 | Multi-page docs are bigger overall |
| 4 | img_height | +1.38 | Posters are large-format |
| 5 | page_height_pt | +1.38 | Large physical dimensions |
| 6 | avg_font_size | -1.10 | Papers use smaller fonts |
| 7 | is_landscape | +0.98 | Some posters are landscape |
| 8 | color_diversity | +0.95 | Posters are visually rich |
| 9 | edge_density | +0.79 | More visual edges in posters |
| 10 | text_block_count | +0.75 | Multi-column poster layouts |
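Because the classifier is linear, each feature's contribution to the logit is its coefficient times its feature value. The sketch below assumes the coefficients apply to standardized features (suggested by the StandardScaler listed under Model Specifications, but not confirmed here), and the z-scores are made-up values for a hypothetical single-page PDF.

```python
# Coefficients copied from the top three rows of the table above.
coefficients = {
    "size_per_page_kb": 7.65,
    "page_count": -5.49,
    "file_size_kb": -5.44,
}

# Hypothetical standardized feature values (z-scores) for one document.
z_scores = {
    "size_per_page_kb": 1.2,
    "page_count": -0.4,
    "file_size_kb": 0.1,
}

# Per-feature contribution to the logit: coefficient * standardized value.
contributions = {name: coefficients[name] * z_scores[name] for name in coefficients}

# Rank by absolute contribution, largest first.
ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in ranked:
    print(f"{name:>18}: {value:+.2f}")
```

Note that a below-average page count (negative z-score) combined with a negative coefficient yields a positive contribution, pushing the prediction toward "poster".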
Training Data
Trained on 3,606 real documents, with zero synthetic data:
| Class | Count | Source |
|---|---|---|
| Poster | 1,803 | Verified scientific posters from Zenodo & Figshare |
| Non-poster | 1,803 | Multi-page papers, proceedings, newsletters, abstract books |
Sampled from the posters.science corpus of 30,000+ classified PDFs (28,111 posters, 2,036 non-posters from Zenodo and Figshare).
Training data: fairdataihub/poster-sentry-training-data
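One way to draw a balanced training set from such an imbalanced corpus is to sample the same number of documents from each class. The sketch below is an assumption about the general approach, not the project's actual sampling code; `balanced_sample` and the placeholder file names are hypothetical.

```python
import random


def balanced_sample(posters, non_posters, n_per_class, seed=42):
    """Sample up to n_per_class items from each class, keeping both classes equal in size."""
    # Cap at the size of the smaller class so the classes stay balanced.
    n = min(n_per_class, len(posters), len(non_posters))
    rng = random.Random(seed)
    return rng.sample(posters, n), rng.sample(non_posters, n)


# Placeholder identifiers sized like the corpus described above.
posters = [f"poster_{i}.pdf" for i in range(28_111)]
non_posters = [f"doc_{i}.pdf" for i in range(2_036)]

pos, neg = balanced_sample(posters, non_posters, n_per_class=1_803)
print(len(pos), len(neg))  # 1803 1803
```

Fixing the seed keeps the sample reproducible across runs.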
Usage
Python API
from poster_sentry import PosterSentry
sentry = PosterSentry()
sentry.initialize()
# Classify a PDF (uses text + visual + structural features)
result = sentry.classify("document.pdf")
print(f"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}")
# {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}
# Batch classification
results = sentry.classify_batch(["poster1.pdf", "paper.pdf", "newsletter.pdf"])
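In a pipeline, the classification results can gate which PDFs proceed to the more expensive extraction step. The sketch below works on plain dictionaries shaped like the example output above, so it runs without the package installed. The helper `select_for_extraction`, the thresholds, the sample results, and the reading of `confidence` as confidence in the predicted label are all illustrative assumptions, not part of the package.

```python
def select_for_extraction(results, min_confidence=0.5):
    """Return paths of PDFs classified as posters with sufficient confidence."""
    return [
        r["path"]
        for r in results
        if r["is_poster"] and r["confidence"] >= min_confidence
    ]


# Hand-written stand-ins for classify_batch() output.
results = [
    {"is_poster": True, "confidence": 0.97, "path": "poster1.pdf"},
    {"is_poster": False, "confidence": 0.91, "path": "paper.pdf"},
    {"is_poster": True, "confidence": 0.55, "path": "newsletter.pdf"},
]

print(select_for_extraction(results, min_confidence=0.8))  # ['poster1.pdf']
```

Raising `min_confidence` trades recall for fewer wasted extraction calls; the right value depends on how costly a missed poster is relative to a wasted extraction.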
Installation
pip install git+https://github.com/fairdataihub/poster-repo-qc.git
# Or install from source
git clone https://github.com/fairdataihub/poster-repo-qc.git
cd poster-repo-qc
pip install -e ".[train]"
Training
python scripts/train_poster_sentry.py --n-per-class 2000
Training completes in ~40 minutes on CPU (PDF rendering is the bottleneck, not the classifier).
Model Specifications
| Attribute | Value |
|---|---|
| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
| Embedding dimension | 512 |
| Visual features | 15 (color, edge, FFT, whitespace) |
| Structural features | 15 (page geometry, fonts, text blocks) |
| Total input dimension | 542 |
| Classifier | LogisticRegression (sklearn) + StandardScaler |
| Head file size | 10 KB (.npz) |
| Precision | float32 |
| GPU required | No (CPU-only) |
| License | MIT |
System Requirements
- CPU: Any modern CPU (no GPU needed)
- RAM: ≥4 GB
- Python: ≥3.10
- Dependencies: numpy, model2vec, scikit-learn, PyMuPDF, Pillow
Citation
@software{poster_sentry_2026,
title = {PosterSentry: Multimodal Scientific Poster Classifier},
author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
year = {2026},
url = {https://huggingface.co/fairdataihub/poster-sentry},
note = {Part of the posters.science initiative}
}
License
This model is released under the MIT License.
Acknowledgments
- FAIR Data Innovations Hub at California Medical Innovations Institute (CalMI²)
- posters.science platform
- MinishLab for the model2vec embedding backbone
- HuggingFace for model hosting infrastructure
- Funded by The Navigation Fund (10.71707/rk36-9x79), "Poster Sharing and Discovery Made Easy"