BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers? Paper • 2510.18003 • Published Oct 20, 2025
Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty? Paper • 2605.12684 • Published 10 days ago • 11
TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments Paper • 2510.01179 • Published Oct 1, 2025 • 28
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL Paper • 2505.23977 • Published May 29, 2025 • 10
VisualSphinx-V1 Collection VisualSphinx-V1 is the largest fully-synthetic open-source dataset providing vision logic puzzles. • 7 items • Updated Jun 3, 2025 • 1