Cosmos 3, VANTAGE-Bench, and TAR at Computex 2026
Published:
At Computex 2026, NVIDIA announced Cosmos 3, an open world foundation model for physical AI that brings vision reasoning, multimodal generation, and action prediction together for robots, autonomous vehicles, smart spaces, and vision AI agents.
One of the most meaningful parts for me was seeing two benchmarks I supported, VANTAGE-Bench and Traffic Anomaly Reasoning (TAR), highlighted as part of the Cosmos 3 release story. Both benchmarks focus on the kind of operational video understanding that matters for physical AI: fixed cameras, long videos, small but important events, spatial-temporal reasoning, and explanations that go beyond object detection.
VANTAGE-Bench evaluates vision-language models (VLMs) on real-world fixed-camera footage across warehouse/logistics, transportation, and smart-space domains. The benchmark contains 3,346 image and video assets with 35,027 expert-curated annotations, organized around semantic, spatial, temporal, and spatio-temporal understanding. I helped prepare VANTAGE-Bench as a NeurIPS 2026 competition effort, supporting how the benchmark and its leaderboard are framed around the “Infrastructure AI Gap”: how well VLMs can produce physically grounded insights from fixed infrastructure cameras under realistic deployment constraints.
TAR is the foundation of AI City Challenge 2026 Track 3: Anomalous Events in Transportation, part of the 10th AI City Challenge at ECCV 2026. TAR moves traffic anomaly evaluation from binary detection toward multi-task reasoning, with 44,040 training annotations across 10 task types covering 3,670 CCTV transportation videos. The track asks models to detect, reason about, and explain traffic anomalies through question answering, temporal reasoning, causal linkage, scene description, and video summarization. I supported TAR's launch as the Track 3 benchmark, helping keep its evaluation and leaderboard plans focused and ready in time for the Cosmos 3 public launch.
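For a rough sense of annotation density, the counts quoted above can be turned into simple averages. This is only a back-of-the-envelope sketch: the `BenchmarkStats` class and variable names are hypothetical and not part of either benchmark's tooling, and it assumes the 3,670 TAR videos are the ones covered by the 44,040 training annotations. Averages say nothing about how annotations are actually distributed across assets or task types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkStats:
    """Hypothetical container for the headline counts quoted in this post."""
    name: str
    assets: int        # images and/or videos
    annotations: int   # total annotations reported for those assets

    @property
    def annotations_per_asset(self) -> float:
        # Plain average; the real per-asset distribution is not given here.
        return self.annotations / self.assets


# Counts taken directly from the text above.
VANTAGE = BenchmarkStats("VANTAGE-Bench", assets=3_346, annotations=35_027)
# Assumption: the 44,040 training annotations cover all 3,670 videos.
TAR_TRAIN = BenchmarkStats("TAR (training)", assets=3_670, annotations=44_040)

for bench in (VANTAGE, TAR_TRAIN):
    print(f"{bench.name}: {bench.annotations_per_asset:.2f} annotations per asset")
```

Under those assumptions, both benchmarks average on the order of ten or more annotations per asset, which fits the emphasis on multi-question reasoning over each clip instead of a single label.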
Together, these efforts show a broader shift in video AI evaluation: from recognizing visible objects to understanding what happened, why it happened, and what might happen next. They also connect to my broader MetroAI work on multimodal AI for cities, transportation, and infrastructure, where video summarization, event understanding, and embedding-based video search are becoming central tools for operational intelligence. It was exciting to see that work connected to the Cosmos 3 launch and to the larger push toward physical AI systems that can reason over real-world environments.

