PosterSentry: Multimodal Scientific Poster Classifier

Model Description

PosterSentry is a lightweight, CPU-optimized multimodal classifier that determines whether a PDF is a scientific poster or a non-poster (paper, proceedings, newsletter, abstract book, etc.).

PosterSentry is part of the quality control pipeline for posters.science, a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).

It is developed by the FAIR Data Innovations Hub at the California Medical Innovations Institute (CalMI²).

Related Models & Tools

| Resource | Description | Link |
|---|---|---|
| PosterSentry | Multimodal poster classifier (this model) | fairdataihub/poster-sentry |
| Llama-3.1-8B-Poster-Extraction | Poster → structured JSON extraction | fairdataihub/Llama-3.1-8B-Poster-Extraction |
| poster2json | Python library for poster extraction | PyPI · Docs · GitHub |
| poster-json-schema | DataCite-based poster metadata schema | GitHub |
| Platform | posters.science | posters.science |

Pipeline Position

PosterSentry sits at the front of the posters.science pipeline, where it screens incoming PDFs before the expensive Llama-based extraction:

PDF Input
   │
   ▼
┌──────────────┐     ┌───────────────────────────────────┐     ┌──────────────┐
│ PosterSentry │ ──► │ Llama-3.1-8B-Poster-Extraction    │ ──► │ poster2json  │
│ (classify)   │     │ (extract structured metadata)     │     │ (validate)   │
└──────────────┘     └───────────────────────────────────┘     └──────────────┘
   poster? ✓              raw text → JSON schema                  FAIR output

Architecture

Three feature channels are concatenated into a 542-dimensional vector (512 + 15 + 15) and fed to a single LogisticRegression:

| Channel | Features | Dimension | Signal |
|---|---|---|---|
| Text | model2vec (potion-base-32M) embedding | 512 | Semantic content |
| Visual | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
| Structural | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |

The classifier head is a single linear layer stored as a numpy .npz file (10 KB). Inference is pure numpy, so torch is not required at prediction time.
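
The concatenate, standardize, and score arithmetic described above can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions, not the project's code: the helper name `predict_poster_proba`, its argument names, and the parameter layout (scaler mean and scale, one weight vector, one bias) are hypothetical, the scaler is assumed to cover the full 542-dimensional vector, and the random weights are placeholders rather than the trained head.

```python
import numpy as np

TEXT_DIM, VISUAL_DIM, STRUCT_DIM = 512, 15, 15  # channel sizes stated in this section


def predict_poster_proba(text_emb, visual, structural, mean, scale, weights, bias):
    """Hypothetical helper: concatenate channels, standardize, apply one linear layer + sigmoid."""
    x = np.concatenate([text_emb, visual, structural]).astype(np.float32)
    assert x.shape == (TEXT_DIM + VISUAL_DIM + STRUCT_DIM,)  # 542 features
    z = (x - mean) / scale             # same arithmetic as a fitted StandardScaler
    logit = float(z @ weights + bias)  # single linear layer
    return 1.0 / (1.0 + np.exp(-logit))


# Placeholder parameters (random, NOT the trained head).
rng = np.random.default_rng(0)
dim = TEXT_DIM + VISUAL_DIM + STRUCT_DIM
mean = np.zeros(dim, dtype=np.float32)
scale = np.ones(dim, dtype=np.float32)
weights = rng.normal(scale=0.01, size=dim).astype(np.float32)
bias = 0.0

proba = predict_poster_proba(
    rng.normal(size=TEXT_DIM),
    rng.normal(size=VISUAL_DIM),
    rng.normal(size=STRUCT_DIM),
    mean, scale, weights, bias,
)
print(f"P(poster) = {proba:.3f}")
```

Because the head reduces to one dot product, the 542 weights plus scaler statistics fit comfortably in a file of a few kilobytes.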

Performance

Validated on 3,606 real scientific documents:

| Metric | Value |
|---|---|
| Accuracy | 87.3% |
| F1 (poster) | 87.1% |
| F1 (non-poster) | 87.4% |
| Precision (poster) | 88.2% |
| Recall (poster) | 85.9% |
| Inference speed | ~300 docs/sec (CPU) |
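
As a quick consistency check, the poster F1 can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The helper name `f1_score` is illustrative only, and the inputs are the rounded figures from the table, so the result can differ from the reported value in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall for the poster class, as reported above.
poster_f1 = f1_score(0.882, 0.859)
print(f"{poster_f1:.3f}")  # prints 0.870
```

The recomputed value agrees with the reported 87.1% to within the rounding of the inputs.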

Top Features by Importance

| Rank | Feature | Coefficient | Signal |
|---|---|---|---|
| 1 | size_per_page_kb | +7.65 | Posters are dense, high-res single pages |
| 2 | page_count | -5.49 | More pages = not a poster |
| 3 | file_size_kb | -5.44 | Multi-page docs are bigger overall |
| 4 | img_height | +1.38 | Posters are large-format |
| 5 | page_height_pt | +1.38 | Large physical dimensions |
| 6 | avg_font_size | -1.10 | Papers use smaller fonts |
| 7 | is_landscape | +0.98 | Some posters are landscape |
| 8 | color_diversity | +0.95 | Posters are visually rich |
| 9 | edge_density | +0.79 | More visual edges in posters |
| 10 | text_block_count | +0.75 | Multi-column poster layouts |
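
To read these coefficients: in a logistic regression, each feature adds its coefficient times its input value to the log-odds of the "poster" class. The sketch below assumes the coefficients apply to standardized features (the classifier is paired with a StandardScaler); the z-scores are invented for illustration, and `log_odds_contributions` is a hypothetical helper, not part of the library.

```python
# Top three coefficients copied from the table above.
COEFFICIENTS = {
    "size_per_page_kb": 7.65,
    "page_count": -5.49,
    "file_size_kb": -5.44,
}


def log_odds_contributions(z_scores):
    """Hypothetical helper: per-feature contribution to the poster log-odds."""
    return {name: COEFFICIENTS[name] * z for name, z in z_scores.items()}


# Invented standardized values for a one-page, image-heavy PDF (assumption).
z_scores = {"size_per_page_kb": 1.0, "page_count": -0.5, "file_size_kb": 0.2}

contributions = log_odds_contributions(z_scores)
partial_logit = sum(contributions.values())
for name, value in contributions.items():
    print(f"{name:>18}: {value:+.3f}")
print(f"{'partial log-odds':>18}: {partial_logit:+.3f}")
```

A below-average page count multiplied by a negative coefficient yields a positive contribution, which is how "fewer pages" pushes a document toward the poster class.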

Training Data

Trained on 3,606 real documents, with no synthetic data:

| Class | Count | Source |
|---|---|---|
| Poster | 1,803 | Verified scientific posters from Zenodo & Figshare |
| Non-poster | 1,803 | Multi-page papers, proceedings, newsletters, abstract books |

The training set was sampled from the posters.science corpus of 30,000+ classified PDFs (28,111 posters and 2,036 non-posters from Zenodo and Figshare).

Training data: fairdataihub/poster-sentry-training-data
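
Since the corpus is heavily skewed toward posters, a balanced training set requires drawing the same number of documents from each class. The following is a minimal sketch of one way to do that, not the project's actual sampling code: `balanced_sample`, the seed, and the toy file names are all assumptions.

```python
import random


def balanced_sample(posters, non_posters, n_per_class, seed=0):
    """Hypothetical helper: draw an equal number of items from each class."""
    rng = random.Random(seed)
    # Never request more items than the smaller class can supply.
    n = min(n_per_class, len(posters), len(non_posters))
    return rng.sample(posters, n), rng.sample(non_posters, n)


# Toy file lists standing in for the real corpus.
toy_posters = [f"poster_{i}.pdf" for i in range(10)]
toy_non_posters = [f"paper_{i}.pdf" for i in range(4)]

picked_posters, picked_non_posters = balanced_sample(toy_posters, toy_non_posters, n_per_class=6)
print(len(picked_posters), len(picked_non_posters))  # prints 4 4
```

Capping the sample at the size of the smaller class keeps the two classes equal even when the requested count is not available.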

Usage

Python API

from poster_sentry import PosterSentry

sentry = PosterSentry()
sentry.initialize()

# Classify a PDF (uses text + visual + structural features)
result = sentry.classify("document.pdf")
print(f"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}")
# {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}

# Batch classification
results = sentry.classify_batch(["poster1.pdf", "paper.pdf", "newsletter.pdf"])
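
A common follow-up is to forward only confident poster predictions to the extraction step. The sketch below assumes that `classify_batch` returns a list of dictionaries shaped like the single-document result shown above (`is_poster`, `confidence`, `path`); `split_by_confidence` and the 0.9 threshold are hypothetical choices, not part of the library, and the sample results are invented.

```python
def split_by_confidence(results, threshold=0.9):
    """Hypothetical helper: separate confident posters from everything else."""
    accepted, held_back = [], []
    for result in results:
        if result["is_poster"] and result["confidence"] >= threshold:
            accepted.append(result["path"])
        else:
            held_back.append(result["path"])
    return accepted, held_back


# Invented results in the same shape as the single-document output.
sample_results = [
    {"is_poster": True, "confidence": 0.97, "path": "poster1.pdf"},
    {"is_poster": False, "confidence": 0.97, "path": "paper.pdf"},
    {"is_poster": True, "confidence": 0.62, "path": "newsletter.pdf"},
]

accepted, held_back = split_by_confidence(sample_results)
print("extract:", accepted)
print("review: ", held_back)
```

The threshold trades recall for precision, so it is worth tuning against a labeled sample before relying on it.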

Installation

pip install git+https://github.com/fairdataihub/poster-repo-qc.git

# Or install from source
git clone https://github.com/fairdataihub/poster-repo-qc.git
cd poster-repo-qc
pip install -e ".[train]"

Training

python scripts/train_poster_sentry.py --n-per-class 2000

Training completes in ~40 minutes on CPU (PDF rendering is the bottleneck, not the classifier).

Model Specifications

| Attribute | Value |
|---|---|
| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
| Embedding dimension | 512 |
| Visual features | 15 (color, edge, FFT, whitespace) |
| Structural features | 15 (page geometry, fonts, text blocks) |
| Total input dimension | 542 |
| Classifier | LogisticRegression (sklearn) + StandardScaler |
| Head file size | 10 KB (.npz) |
| Precision | float32 |
| GPU required | No (CPU-only) |
| License | MIT |

System Requirements

  • CPU: Any modern CPU (no GPU needed)
  • RAM: ≥4 GB
  • Python: ≥3.10
  • Dependencies: numpy, model2vec, scikit-learn, PyMuPDF, Pillow
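
The Python version and package requirements above can be checked before installation with a short standard-library script. This is an illustrative sketch: `check_environment` is a hypothetical helper, and it does not check available RAM. The import names used are `sklearn` for scikit-learn, `fitz` for PyMuPDF, and `PIL` for Pillow.

```python
import sys
from importlib.util import find_spec

# Distribution name -> importable module name.
REQUIRED_MODULES = {
    "numpy": "numpy",
    "model2vec": "model2vec",
    "scikit-learn": "sklearn",
    "PyMuPDF": "fitz",
    "Pillow": "PIL",
}


def check_environment(min_python=(3, 10)):
    """Hypothetical helper: return a list of unmet requirements (empty if none)."""
    problems = []
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")
    for dist_name, module_name in REQUIRED_MODULES.items():
        # find_spec locates a top-level module without importing it.
        if find_spec(module_name) is None:
            problems.append(f"missing package: {dist_name}")
    return problems


problems = check_environment()
print("OK" if not problems else "\n".join(problems))
```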

Citation

@software{poster_sentry_2026,
  title = {PosterSentry: Multimodal Scientific Poster Classifier},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  url = {https://huggingface.co/fairdataihub/poster-sentry},
  note = {Part of the posters.science initiative}
}

License

This model is released under the MIT License.

Acknowledgments
