CORE-bench v1.1
Benchmark for AI agents on scientific reproducibility, with mainline (39) and OOD (19) splits derived from Code Ocean capsules.
agent-evals/core-bench-v1.1-mainline
agent-evals/core-bench-v1.1-ood