UrbanAI at NeurIPS 2025: Advancing Multi-Camera Tracking & Multimodal Spatial AI with NVIDIA Metropolis

I was honored to present at the UrbanAI Workshop at NeurIPS 2025, where we shared NVIDIA’s recent advancements in multi-camera 3D perception, cloud-native tracking workflows, and multimodal spatial AI.

We highlighted the nine-year evolution of the AI City Challenge, which now spans multi-camera 3D perception, traffic safety reasoning, warehouse spatial intelligence, and fisheye object detection—reflecting the growing needs of smart cities and intelligent infrastructure.

Our talk also introduced NVIDIA's cloud-native, streaming multi-camera tracking workflow, combining DeepStream-based perception, multi-target multi-camera (MTMC) fusion, a real-time location system (RTLS), and a full Sim2Deploy pipeline powered by Omniverse, the TAO Toolkit, and Metropolis microservices.
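
To make the fusion step concrete, here is a minimal sketch, assuming each camera's tracker already projects its detections onto a shared ground plane (e.g., via a calibrated homography): same-frame detections from different cameras are greedily clustered into global identities. Every name and threshold below is hypothetical; this is not the DeepStream or Metropolis API.

```python
# Toy sketch of multi-camera fusion on a shared ground plane.
# All names and thresholds are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class Detection:
    camera_id: str
    local_track_id: int
    world_xy: tuple[float, float]  # ground-plane (x, y) in meters

def fuse_frame(detections: list[Detection], radius: float = 0.75) -> list[list[Detection]]:
    """Greedily cluster same-frame detections from different cameras
    whose ground-plane positions fall within `radius` meters."""
    clusters: list[list[Detection]] = []
    for det in detections:
        for cluster in clusters:
            cx = sum(d.world_xy[0] for d in cluster) / len(cluster)
            cy = sum(d.world_xy[1] for d in cluster) / len(cluster)
            near = ((det.world_xy[0] - cx) ** 2 + (det.world_xy[1] - cy) ** 2) ** 0.5 < radius
            # At most one detection per camera per global identity.
            if near and all(d.camera_id != det.camera_id for d in cluster):
                cluster.append(det)
                break
        else:
            clusters.append([det])
    return clusters

# Example: two cameras observe the same person near (3.0, 4.0).
frame = [
    Detection("cam0", 7, (3.02, 4.01)),
    Detection("cam1", 2, (2.95, 3.97)),
    Detection("cam1", 3, (10.0, 1.0)),
]
for gid, cluster in enumerate(fuse_frame(frame)):
    print(gid, [(d.camera_id, d.local_track_id) for d in cluster])
```

A real MTMC fusion stage would also weigh appearance (ReID) features and temporal smoothing; proximity alone is just the simplest possible cue.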

We shared progress on MCBLT, a bird's-eye-view (BEV) multi-camera 3D detection and hierarchical graph neural network (GNN) tracking framework that achieves state-of-the-art results on the AI City Challenge and WildTrack benchmarks.
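
MCBLT's actual pipeline detects in BEV space and associates with GNNs; the hierarchical idea, though, can be shown with a toy two-level linker: strict thresholds first build short tracklets, then looser thresholds bridge tracklets across occlusion gaps. All names and thresholds below are made up for illustration.

```python
# Toy two-level (hierarchical) association. MCBLT itself uses BEV
# detections and GNNs; every name and number here is hypothetical.

def link(items, gap, dist, max_dist):
    """Greedy chaining: connect items whose time gap and distance are small."""
    chains = []
    for it in sorted(items, key=lambda x: x["t0"]):
        for chain in chains:
            last = chain[-1]
            if 0 < it["t0"] - last["t1"] <= gap and dist(last, it) < max_dist:
                chain.append(it)
                break
        else:
            chains.append([it])
    return chains

def centroid_dist(a, b):
    return ((a["xy"][0] - b["xy"][0]) ** 2 + (a["xy"][1] - b["xy"][1]) ** 2) ** 0.5

# Level 1: per-frame detections -> short tracklets (strict thresholds).
detections = [
    {"t0": t, "t1": t, "xy": (0.5 * t, 1.0)} for t in range(5)
] + [{"t0": t, "t1": t, "xy": (0.5 * t, 1.1)} for t in range(8, 12)]
tracklets = link(detections, gap=1, dist=centroid_dist, max_dist=1.0)

# Level 2: tracklets -> long tracks (looser thresholds bridge the gap).
summaries = [
    {"t0": tr[0]["t0"], "t1": tr[-1]["t1"],
     "start": tr[0]["xy"], "end": tr[-1]["xy"]} for tr in tracklets
]
def bridge_dist(a, b):
    return ((a["end"][0] - b["start"][0]) ** 2 + (a["end"][1] - b["start"][1]) ** 2) ** 0.5

tracks = link(summaries, gap=5, dist=bridge_dist, max_dist=3.0)
print(f"{len(detections)} detections -> {len(tracklets)} tracklets -> {len(tracks)} tracks")
```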

Finally, we presented Sparse4D, an end-to-end multi-camera 3D perception model that jointly predicts 3D bounding boxes, track identities, velocities, and re-identification (ReID) embeddings, ranking #1 on the AICity'25 vision-only leaderboard.
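
To illustrate structurally what "jointly predicts" means, here is a schematic PyTorch module where one shared per-object feature feeds separate heads for boxes, velocity, and ReID embeddings. This is only a sketch of multi-task output heads, not the actual Sparse4D architecture; all dimensions and names are placeholders.

```python
# Schematic of a multi-task head layout: one shared per-object feature
# drives separate heads. Not the actual Sparse4D architecture.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, feat_dim: int = 256, embed_dim: int = 128):
        super().__init__()
        self.box_head = nn.Linear(feat_dim, 9)       # x, y, z, w, l, h, sin/cos yaw, score
        self.velocity_head = nn.Linear(feat_dim, 2)  # ground-plane vx, vy
        self.reid_head = nn.Linear(feat_dim, embed_dim)

    def forward(self, object_feats: torch.Tensor) -> dict[str, torch.Tensor]:
        # object_feats: (num_queries, feat_dim) per-object decoder features
        return {
            "boxes": self.box_head(object_feats),
            "velocity": self.velocity_head(object_feats),
            # L2-normalized embeddings can be matched across frames for tracking.
            "reid": nn.functional.normalize(self.reid_head(object_feats), dim=-1),
        }

head = MultiTaskHead()
outputs = head(torch.randn(300, 256))
print({k: tuple(v.shape) for k, v in outputs.items()})
```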

It was great connecting with researchers from Google Mobility AI, DeepMind, Columbia University, and many others working to build the next generation of intelligent urban systems. Looking forward to continued collaboration across the UrbanAI and AI City Challenge communities.

UrbanAI Workshop at NeurIPS 2025