DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints Paper • 2601.18137 • Published 3 days ago • 19
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows Paper • 2512.13168 • Published Dec 15, 2025 • 50
RoboChallenge: Large-scale Real-robot Evaluation of Embodied Policies Paper • 2510.17950 • Published Oct 20, 2025 • 8