Why it had to be done 👇
PyTorch's Dynamo compiler is increasingly becoming the default interoperability layer for ML systems. Anything that relies on torch.export or torch.compile, from model optimization to cross-framework integrations, benefits directly when models can be captured as a single Dynamo-traced graph!
Transformers models are now easier to:
⚙️ Compile end-to-end with torch.compile backends
📦 Export reliably via torch.export and torch.onnx.export
🚀 Deploy to ONNX / ONNX Runtime, Intel's OpenVINO, NVIDIA AutoDeploy (TRT-LLM), AMD's Quark, Meta's ExecuTorch, and other hardware-specific runtimes.
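For illustration, here is a minimal sketch (not from the post) of the two capture paths on a small Hugging Face model. The model choice, inputs, and config tweaks are assumptions, and whether a given architecture exports without graph breaks depends on your transformers / PyTorch versions:

```python
# Hypothetical sketch: capturing a Transformers model as a single Dynamo-traced graph.
# Model name and inputs are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
model.config.use_cache = False  # simpler forward graph for capture (illustrative choice)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Exporting Transformers with Dynamo", return_tensors="pt")

# Path 1: compile end-to-end; fullgraph=True fails loudly on any graph break.
compiled = torch.compile(model, fullgraph=True)
with torch.no_grad():
    _ = compiled(**inputs)

# Path 2: export the whole forward pass as one ExportedProgram,
# which downstream runtimes (ONNX, OpenVINO, ExecuTorch, ...) can consume.
exported = torch.export.export(
    model,
    args=(),
    kwargs={"input_ids": inputs["input_ids"],
            "attention_mask": inputs["attention_mask"]},
)
print(exported.graph_module.graph)
```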
This work aims to unblock entire TorchDynamo-based toolchains that rely on exporting Transformers models across runtimes and accelerators.
We are doubling down on Transformers' commitment to being a first-class citizen of the PyTorch ecosystem: more exportable, more optimizable, and easier to deploy everywhere.
There are definitely some edge cases we haven't addressed yet, so don't hesitate to try compiling / exporting your favorite Transformers models and to open issues / PRs.
PR in the comments! More updates coming soon!
Good model! I understand it's very difficult to train a diffusion model.
Introducing Waypoint-1: Real-time interactive video diffusion from Overworld
Differential Transformer V2
AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality
This is an exceptionally important and well-executed benchmark. The shift in focus from "did the task succeed?" to "how and why did the process fail?" is precisely what's needed to move AI agents from research demos into high-stakes industrial environments.
The six-dimensional evaluation framework and the TrajFM pipeline for analyzing failure modes are standout contributions. The data you've shared is striking—particularly that no tested model, including the top performers, could meet the 85-point deployment readiness threshold. This honest result highlights a critical maturity gap and sets a clear, high bar for the community.
The findings around multi-agent coordination are especially valuable. The significant accuracy drop from single-agent (68%) to multi-agent (47%) workflows quantifies a major challenge many have anecdotally observed but rarely measured so clearly.
I have a couple of questions based on the thoughtful analysis:
Evolving Failure Taxonomy: You mention the system is designed to discover new failure patterns beyond the predefined taxonomy. Have you observed any novel, recurrent failure modes emerging from the community evaluations that are now being considered for inclusion in the core taxonomy?
Measuring Coordination Quality: The benchmark effectively captures that multi-agent coordination fails. Are there plans to develop more granular metrics to diagnose the quality of coordination itself (e.g., communication efficiency, conflict resolution) as a distinct dimension?
Congratulations to the team on this crucial work. By providing a rigorous, feedback-driven, and privacy-preserving evaluation platform, AssetOpsBench doesn't just measure progress—it actively guides the field toward building more robust and trustworthy industrial agents.
This is a fascinating and thorough update on the Differential Transformer architecture. The transition from DIFF V1 to V2 addresses some critical practical hurdles in a very elegant way.
The key design choice of doubling query heads within shared GQA groups is clever. It successfully decouples the innovative "differential" attention operation from the need for custom kernels, making it a much more viable drop-in replacement for standard attention. The analysis of how this design overcomes the softmax magnitude constraint and helps eliminate attention sinks is particularly convincing.
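For intuition, here is a minimal, hedged sketch of the differential-attention idea as described in the original DIFF Transformer paper; the function name, the fixed λ, and the single shared key/value tensors standing in for the GQA grouping are my simplifying assumptions, not the V2 implementation:

```python
# Hedged sketch of differential attention: two softmax maps from two query sets,
# with the second subtracted to cancel common-mode "noise" attention.
import torch
import torch.nn.functional as F

def diff_attention(q1, q2, k, v, lam=0.5):
    """q1, q2: (batch, heads, seq, d); k, v are shared by both query maps."""
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k.transpose(-2, -1) / d**0.5, dim=-1)
    # The subtraction lets the effective attention weights escape the softmax's
    # non-negativity constraint, which is what helps suppress attention sinks.
    return (a1 - lam * a2) @ v

b, h, s, d = 2, 8, 16, 64
q1, q2, k, v = (torch.randn(b, h, s, d) for _ in range(4))
out = diff_attention(q1, q2, k, v)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```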
The reported early results—lower loss, reduced gradient spikes, and better control of activation outliers, especially at large learning rates—are highly promising. They suggest DIFF V2 isn't just a parameter-saving trick but may offer fundamental improvements in training dynamics and stability.
I have a couple of questions out of curiosity:
Long-Context Performance: You mention exploring "context rot" alleviation in later stages. Given the modified attention output dynamics, do you have any early hypotheses on whether DIFF V2 might inherently improve performance on very long sequences compared to a baseline Transformer with similar parameter budgets?
Broader Application: The principle seems powerful yet simple. Beyond the dense and MoE models tested here, do you see potential for applying this differential attention mechanism in other architectures, like state-space models or multimodal transformers?
This is fantastic news! Fine-tuning embedding models is a game-changer for improving RAG performance, and making it faster and more accessible is a huge win for the community.
Fine-tuning embedding models improves retrieval & RAG by aligning vectors to your domain-specific notion of similarity, improving search, clustering, and recommendations on your data.
⭐ Blog + Notebooks: https://unsloth.ai/docs/new/embedding-finetuning
Unsloth trains embedding models 1.8-3.3x faster with 20% less VRAM, 2x longer context & no accuracy loss vs. FA2 setups.
We'd like to thank Hugging Face and Unsloth contributor electroglyph for making this possible!
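As a rough illustration of what "aligning vectors to your domain-specific notion of similarity" looks like in practice, here is a minimal contrastive fine-tuning sketch using the plain Sentence Transformers trainer (not the Unsloth API; see the blog above for the accelerated path). The model name and the tiny inline dataset are placeholders:

```python
# Hedged sketch: fine-tuning an embedding model on (anchor, positive) pairs.
# Model and data are illustrative placeholders, not from the post.
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# In practice this would be thousands of domain-specific query/document pairs.
train_dataset = Dataset.from_dict({
    "anchor": ["How do I reset the pump?", "Error code E42 meaning"],
    "positive": ["Hold the reset button for 5 seconds to restart the pump.",
                 "E42 indicates a blocked intake filter."],
})

# MultipleNegativesRankingLoss uses other in-batch positives as negatives,
# pulling matching pairs together and pushing non-matching ones apart.
loss = losses.MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
model.save_pretrained("my-domain-embedder")
```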