PosterSentry — Multimodal Scientific Poster Classifier

Model Description

PosterSentry is a lightweight, CPU-optimized multimodal classifier that determines whether a PDF is a scientific poster or a non-poster (paper, proceedings, newsletter, abstract book, etc.).

PosterSentry is part of the quality control pipeline for posters.science, a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).

Developed by the FAIR Data Innovations Hub at the California Medical Innovations Institute (CalMI²).

Related Models & Tools

Resource	Description	Link
PosterSentry	Multimodal poster classifier (this model)	fairdataihub/poster-sentry
Llama-3.1-8B-Poster-Extraction	Poster → structured JSON extraction	fairdataihub/Llama-3.1-8B-Poster-Extraction
poster2json	Python library for poster extraction	PyPI · Docs · GitHub
poster-json-schema	DataCite-based poster metadata schema	GitHub
Platform	posters.science	posters.science

Pipeline Position

PosterSentry sits at the front of the posters.science pipeline — it screens incoming PDFs before the expensive Llama-based extraction:

PDF Input
   │
   ▼
┌──────────────┐     ┌───────────────────────────────────┐     ┌──────────────┐
│ PosterSentry │ ──► │ Llama-3.1-8B-Poster-Extraction    │ ──► │ poster2json  │
│ (classify)   │     │ (extract structured metadata)      │     │ (validate)   │
└──────────────┘     └───────────────────────────────────┘     └──────────────┘
   poster? ✓              raw text → JSON schema                  FAIR output
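The gating idea in the diagram can be sketched as follows. This is only an illustration: `screen_then_extract`, `fake_classify`, and `fake_extract` are hypothetical names, and the stand-in callables replace the real classifier and extractor so the snippet runs on its own.

```python
from typing import Callable, Optional


def screen_then_extract(
    pdf_path: str,
    classify: Callable[[str], dict],
    extract: Callable[[str], dict],
) -> Optional[dict]:
    """Run the expensive extractor only when the classifier reports a poster."""
    verdict = classify(pdf_path)
    if not verdict["is_poster"]:
        return None  # rejected before any extraction work is done
    return extract(pdf_path)


# Stand-in callables (hypothetical) so the sketch runs without the real models.
def fake_classify(path: str) -> dict:
    return {"is_poster": path.endswith("poster.pdf"), "confidence": 0.9, "path": path}


def fake_extract(path: str) -> dict:
    return {"source": path, "title": "(extracted title)"}


accepted = screen_then_extract("my_poster.pdf", fake_classify, fake_extract)
rejected = screen_then_extract("paper.pdf", fake_classify, fake_extract)
print(accepted, rejected)
```

Passing the classifier and extractor in as callables keeps the gate independent of how either stage is implemented.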

Architecture

Three feature channels are concatenated into a 542-dimensional vector (512 + 15 + 15) and fed to a single scikit-learn LogisticRegression:

Channel	Features	Dimension	Signal
Text	model2vec (potion-base-32M) embedding	512	Semantic content
Visual	Color stats, edge density, FFT spatial complexity, whitespace	15	Visual layout
Structural	Page count, area, font diversity, text blocks, density	15	PDF geometry

The classifier head is a single linear layer stored as a NumPy .npz file (10 KB). Inference is pure NumPy; torch is not required at prediction time.
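A minimal sketch of what such a NumPy-only forward pass might look like, assuming the head holds scaler statistics plus a weight vector and bias. The dictionary keys (`mean`, `scale`, `coef`, `intercept`) and the random values are placeholders, not the layout of the released file.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder features with the documented dimensions (512 + 15 + 15 = 542).
text_emb = rng.normal(size=512).astype(np.float32)
visual = rng.normal(size=15).astype(np.float32)
structural = rng.normal(size=15).astype(np.float32)

# Hypothetical head parameters; a real head would be loaded from the .npz file.
head = {
    "mean": np.zeros(542, dtype=np.float32),   # assumed scaler mean
    "scale": np.ones(542, dtype=np.float32),   # assumed scaler std
    "coef": rng.normal(scale=0.1, size=542).astype(np.float32),
    "intercept": np.float32(0.0),
}


def predict_proba(x: np.ndarray, head: dict) -> float:
    """Standardize, apply the linear layer, and squash with a sigmoid."""
    z = (x - head["mean"]) / head["scale"]
    logit = float(z @ head["coef"] + head["intercept"])
    return float(1.0 / (1.0 + np.exp(-logit)))


x = np.concatenate([text_emb, visual, structural])
p = predict_proba(x, head)
print(x.shape, p)
```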

Performance

Validated on 3,606 real scientific documents:

Metric	Value
Accuracy	87.3%
F1 (poster)	87.1%
F1 (non-poster)	87.4%
Precision (poster)	88.2%
Recall (poster)	85.9%
Inference speed	~300 docs/sec (CPU)
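As a quick consistency check, F1 is the harmonic mean of precision and recall. The sketch below recomputes the poster F1 from the rounded precision and recall in the table; the result agrees with the reported 87.1% to within rounding of the inputs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values from the table above.
f1_poster = f1_score(0.882, 0.859)
print(f"{f1_poster:.3f}")  # prints 0.870
```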

Top Features by Importance

Rank	Feature	Coefficient	Signal
1	`size_per_page_kb`	+7.65	Posters are dense, high-res single pages
2	`page_count`	-5.49	More pages = not a poster
3	`file_size_kb`	-5.44	Multi-page docs are bigger overall
4	`img_height`	+1.38	Posters are large-format
5	`page_height_pt`	+1.38	Large physical dimensions
6	`avg_font_size`	-1.10	Papers use smaller fonts
7	`is_landscape`	+0.98	Some posters are landscape
8	`color_diversity`	+0.95	Posters are visually rich
9	`edge_density`	+0.79	More visual edges in posters
10	`text_block_count`	+0.75	Multi-column poster layouts
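One way to read these coefficients, assuming they apply to standardized features (the model specification lists a StandardScaler): with all other features held fixed, a one-standard-deviation increase in a feature multiplies the odds of "poster" by exp(coefficient). The helper name `odds_multiplier` is illustrative only.

```python
import math

# A few coefficients copied from the table above.
coefficients = {
    "size_per_page_kb": 7.65,
    "page_count": -5.49,
    "avg_font_size": -1.10,
}


def odds_multiplier(coef: float, delta_std: float = 1.0) -> float:
    """Factor by which the poster odds change for a shift of delta_std standard deviations."""
    return math.exp(coef * delta_std)


for name, coef in coefficients.items():
    print(f"{name}: x{odds_multiplier(coef):.4g} odds per +1 std")
```

Multipliers above 1 push a document toward the poster class, and multipliers below 1 push it toward non-poster.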

Training Data

Trained on 3,606 real documents — zero synthetic data:

Class	Count	Source
Poster	1,803	Verified scientific posters from Zenodo & Figshare
Non-poster	1,803	Multi-page papers, proceedings, newsletters, abstract books

Sampled from the posters.science corpus of 30,000+ classified PDFs (28,111 posters, 2,036 non-posters from Zenodo and Figshare).
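Balanced sampling from an imbalanced corpus can be sketched as below. This is a hypothetical illustration using the corpus sizes quoted above with made-up file names; the project's actual sampling script may work differently.

```python
import random


def balanced_sample(posters, non_posters, n_per_class, seed=0):
    """Draw the same number of items from each class, capped by the smaller class."""
    n = min(n_per_class, len(posters), len(non_posters))
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(posters, n), rng.sample(non_posters, n)


# Placeholder file names sized like the corpus described above.
posters = [f"poster_{i}.pdf" for i in range(28111)]
non_posters = [f"doc_{i}.pdf" for i in range(2036)]

poster_sample, non_poster_sample = balanced_sample(posters, non_posters, 1803)
print(len(poster_sample), len(non_poster_sample))  # prints 1803 1803
```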

Training data: fairdataihub/poster-sentry-training-data

Usage

Python API

from poster_sentry import PosterSentry

sentry = PosterSentry()
sentry.initialize()

# Classify a PDF (uses text + visual + structural features)
result = sentry.classify("document.pdf")
print(f"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}")
# result is a dict such as:
# {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}

# Batch classification
results = sentry.classify_batch(["poster1.pdf", "paper.pdf", "newsletter.pdf"])

Installation

pip install git+https://github.com/fairdataihub/poster-repo-qc.git

# Or install from source
git clone https://github.com/fairdataihub/poster-repo-qc.git
cd poster-repo-qc
pip install -e ".[train]"

Training

python scripts/train_poster_sentry.py --n-per-class 2000

Training completes in ~40 minutes on CPU (PDF rendering is the bottleneck, not the classifier).

Model Specifications

Attribute	Value
Embedding backbone	minishlab/potion-base-32M (model2vec StaticModel)
Embedding dimension	512
Visual features	15 (color, edge, FFT, whitespace)
Structural features	15 (page geometry, fonts, text blocks)
Total input dimension	542
Classifier	LogisticRegression (sklearn) + StandardScaler
Head file size	10 KB (.npz)
Numeric precision	float32
GPU required	No (CPU-only)
License	MIT
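A back-of-the-envelope check on the head file size, assuming the .npz holds one float32 weight vector, a bias, and the StandardScaler mean and scale for all 542 inputs. The array names are hypothetical, and file-format overhead is not counted.

```python
N_FEATURES = 542
BYTES_PER_FLOAT32 = 4

# Assumed contents of the head file (names are illustrative).
array_lengths = {
    "coef": N_FEATURES,
    "intercept": 1,
    "scaler_mean": N_FEATURES,
    "scaler_scale": N_FEATURES,
}

raw_bytes = sum(array_lengths.values()) * BYTES_PER_FLOAT32
print(raw_bytes, round(raw_bytes / 1024, 1))  # prints 6508 6.4
```

Under these assumptions the raw parameters take about 6.4 KB, the same order of magnitude as the listed 10 KB.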

System Requirements

CPU: Any modern CPU (no GPU needed)
RAM: ≥4GB
Python: ≥3.10
Dependencies: numpy, model2vec, scikit-learn, PyMuPDF, Pillow
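A small pre-flight check for these requirements might look like the following. The helper names are hypothetical; the import names (`sklearn` for scikit-learn, `fitz` for PyMuPDF, `PIL` for Pillow) are the usual ones for those packages.

```python
import sys
from importlib.util import find_spec

# Package name -> top-level module name used at import time.
REQUIRED_MODULES = {
    "numpy": "numpy",
    "model2vec": "model2vec",
    "scikit-learn": "sklearn",
    "PyMuPDF": "fitz",
    "Pillow": "PIL",
}


def python_ok(min_version=(3, 10)) -> bool:
    """True if the running interpreter meets the minimum version."""
    return sys.version_info >= min_version


def missing_dependencies() -> list:
    """Return the package names whose modules cannot be found."""
    return [pkg for pkg, mod in REQUIRED_MODULES.items() if find_spec(mod) is None]


print("Python OK:", python_ok())
print("Missing packages:", missing_dependencies())
```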

Citation

@software{poster_sentry_2026,
  title = {PosterSentry: Multimodal Scientific Poster Classifier},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  url = {https://huggingface.co/fairdataihub/poster-sentry},
  note = {Part of the posters.science initiative}
}

License

This model is released under the MIT License.

Acknowledgments

FAIR Data Innovations Hub at California Medical Innovations Institute (CalMI²)
posters.science platform
MinishLab for the model2vec embedding backbone
HuggingFace for model hosting infrastructure
Funded by The Navigation Fund (10.71707/rk36-9x79) — "Poster Sharing and Discovery Made Easy"
