Key design decisions
Pre-aggregated Parquets, not raw CSV
The raw CSV is 4–5 GB. Instead, the app serves pre-aggregated data (under 100 MB total) covering every dimension combination across 16 years. Query time: <50 ms. The Parquets are generated by a Python script that runs the ETL daily, the same pipeline I built for the earlier PDF report system.
Triage before synthesis
A smaller GPT model (fast, cheap) handles intent classification and parameter extraction. A larger GPT model then runs only the final synthesis step, on clean structured data. Cost: ~$0.01–0.02 per query.
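The triage → lookup → synthesis flow can be sketched as below. Model names are placeholders, and `call_llm` / `query_parquet` stand in for whatever chat-completion client and data-access helper the backend actually uses:

```python
import json

TRIAGE_PROMPT = (
    "Classify the user's question and extract query parameters as JSON "
    'with keys "intent", "year", "region". Respond with JSON only.'
)

def answer(question: str, call_llm, query_parquet) -> str:
    # Step 1: small, cheap model does intent classification + parameter extraction.
    params = json.loads(call_llm(model="small-model", system=TRIAGE_PROMPT, user=question))
    # Step 2: deterministic Parquet lookup -- no LLM involved here.
    rows = query_parquet(**{k: v for k, v in params.items() if k != "intent"})
    # Step 3: larger model synthesises prose from clean, structured data only.
    return call_llm(model="large-model",
                    system="Summarise this data for the user.",
                    user=f"Question: {question}\nData: {rows}")
```

Keeping the data lookup between the two LLM calls deterministic is what keeps the cost and the hallucination surface small.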
Hard rules in synthesis prompt, not soft suggestions
LLMs reliably ignore "try not to..." instructions when they think context justifies an exception. For data accuracy I use hard rules instead: "NEVER cite X unless the user's question contains the word Y". A/B testing on edge cases confirmed this phrasing is more reliable.
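An illustrative contrast between the two phrasings (the wording is a hypothetical example, not the actual prompt), plus the same rule enforced in code as a backstop:

```python
# Soft instruction: models treat this as negotiable when context "justifies" it.
SOFT = "Try not to mention provisional 2026 figures unless relevant."

# Hard rule: an absolute condition the model can check mechanically.
HARD = (
    "NEVER cite 2026 figures unless the user's question contains the word "
    "'2026'. There are no exceptions, even if context seems to justify one."
)

def allow_partial_year(question: str) -> bool:
    # Belt and braces: enforce the same condition in code, not only in the prompt.
    return "2026" in question
```

Mirroring the prompt rule in code means a prompt failure degrades to a filtered answer rather than a wrong one.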
Public/private Space split
Private backend Space (data, agents, pipeline) + public frontend Space (a thin Gradio UI that talks to the backend via gradio_client, supporting both UI and API access). The frontend's requirements.txt contains only Pillow.
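A minimal sketch of the thin-frontend pattern. The Space name and `api_name` are placeholders, and `backend` is expected to behave like a `gradio_client.Client`:

```python
def make_ask(backend):
    """Wrap a gradio_client.Client pointing at the private backend Space."""
    def ask(question: str) -> str:
        # All heavy lifting (data, agents, pipeline) stays in the private Space.
        return backend.predict(question, api_name="/ask")
    return ask

# Usage in the public Space's app.py (assumed names; the token is needed
# because the backend Space is private):
#   from gradio_client import Client
#   import gradio as gr, os
#   ask = make_ask(Client("user/private-backend", hf_token=os.environ["HF_TOKEN"]))
#   gr.Interface(fn=ask, inputs="text", outputs="text").launch()
```

The split means the public repo exposes no data, prompts, or pipeline code, only a forwarding function.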
Partial-year handling
Data includes 2026 (a partial year). Charts always show it; surprisingly, this was one of the hardest problems to crack.
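One plausible approach, sketched under the assumption that rows carry a `year` column and that 2025 is the last complete year: tag partial years so charts can label them distinctly and comparisons can skip them.

```python
import pandas as pd

LATEST_FULL_YEAR = 2025  # 2026 exists in the data but is incomplete

def label_partial(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["partial"] = df["year"] > LATEST_FULL_YEAR
    # Charts render all rows but suffix partial years, e.g. "2026 (YTD)",
    # so a year-to-date figure is never mistaken for a full-year total.
    df["year_label"] = df["year"].astype(str) + df["partial"].map({True: " (YTD)", False: ""})
    return df
```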
Automated data pipeline
A scheduled notebook checks daily whether the raw gov.uk dataset has been updated → if so, it regenerates the Parquets → uploads them to the data server → the Space detects the git commit and auto-restarts. Zero manual effort once deployed.
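A hedged sketch of the daily update check; the URL and state-file path are placeholders. One common pattern (assumed here, not confirmed by the source) is to compare the dataset's `Last-Modified` HTTP header against the value stored on the previous run:

```python
import os
import urllib.request

def fetch_last_modified(url: str) -> str:
    """HEAD request: cheap way to read Last-Modified without downloading 4-5 GB."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("Last-Modified", "")

def updated_since_last_run(last_modified: str, state_file: str) -> bool:
    """Return True (and persist the new value) if the dataset changed."""
    prev = ""
    if os.path.exists(state_file):
        with open(state_file) as f:
            prev = f.read().strip()
    if last_modified and last_modified != prev:
        with open(state_file, "w") as f:
            f.write(last_modified)
        return True  # caller regenerates the Parquets and pushes to the data server
    return False
```

Splitting the fetch from the comparison keeps the decision logic testable without network access.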