beans of sloar

solarbeams

AI & ML interests

None yet

Recent Activity

reacted to SeaWolf-AI's post with 🔥 about 5 hours ago
ALL Bench Leaderboard — Structural Problems in AI Benchmarking and the Case for Unified Evaluation
https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard

The AI benchmark ecosystem has three structural problems. Major benchmarks like MMLU have surpassed 90%, losing discriminative power. Most leaderboards publish unverified self-reported scores: our cross-verification found Claude Opus 4.6's ARC-AGI-2 listed as 37.6% (actual: 68.8%) and Gemini 3.1 Pro as 88.1% (actual: 77.1%). OpenAI's own audit confirmed 59.4% of SWE-bench Verified tasks are defective, yet the benchmark remains widely used.

ALL Bench addresses this by comparing 91 models across 6 modalities (LLM · VLM · Agent · Image · Video · Music) with 3-tier confidence badges (✓✓ cross-verified · ✓ single-source · ~ self-reported). Composite scoring uses a 5-Axis Framework and replaces SWE-bench Verified with contamination-resistant LiveCodeBench.

Key finding: metacognition is the largest blind spot. FINAL Bench shows Error Recovery explains 94.8% of self-correction variance, yet only 9 of 42 models are even measured on it. The 9.2-point spread (Kimi K2.5: 68.71 → rank 9: 59.5) is 3× the GPQA top-model spread, suggesting metacognition may be the single biggest differentiator among frontier models today. VLM cross-verification revealed rank reversals: Claude Opus 4.6 leads MMMU-Pro (85.1%) while Gemini 3 Flash leads MMMU (87.6%), producing contradictory rankings between the two benchmarks.

📊 Article: https://huggingface.co/blog/FINAL-Bench/all-bench
📦 Dataset: https://huggingface.co/datasets/FINAL-Bench/ALL-Bench-Leaderboard
⚡ GitHub: https://github.com/final-bench/ALL-Bench-Leaderboard
🏆 Leaderboard: https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard
🧬 FINAL Bench: https://huggingface.co/datasets/FINAL-Bench/Metacognitive
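The leaderboard data is published as the Hub dataset linked above; a minimal sketch of loading it with the standard datasets library follows (the split name and the column names "model", "confidence", and "composite" are assumptions for illustration, not taken from the post):

    # Minimal sketch: assumes the leaderboard is a plain Hub dataset with a
    # default "train" split; the column names used below are hypothetical.
    from datasets import load_dataset

    ds = load_dataset("FINAL-Bench/ALL-Bench-Leaderboard", split="train")

    # Keep only rows carrying the cross-verified badge (hypothetical "confidence" column).
    verified = [row for row in ds if row.get("confidence") == "✓✓"]

    # Rank by the hypothetical composite score and print the top ten models.
    for row in sorted(verified, key=lambda r: r["composite"], reverse=True)[:10]:
        print(row["model"], row["composite"])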
reacted to SeaWolf-AI's post with ❤️ about 5 hours ago (same post as above)

Organizations

None yet