Joint Multi-View People Tracking and Pose Estimation for 3D Scene Reconstruction

[Paper] [Slides]

Abstract

The goal of data analytics in surveillance videos is to fully understand and reconstruct the 3D scene, i.e., to recover the trajectory and action of each object. In a surveillance system with camera arrays of overlapping views, we propose a novel video scene reconstruction framework to collaboratively track multiple human objects and estimate their 3D poses. First, tracklets are extracted from each single view following the tracking-by-detection paradigm. We propose an effective integration of visual and semantic object attributes, i.e., appearance models, geometry information and poses/actions, to associate tracklets across different views. Based on the optimum viewing perspectives derived from tracking, a hierarchical estimation of human poses is introduced to generate the 3D skeleton of each object. The estimated body joint points are fed back to the tracking stage to enhance tracklet association. Experiments on benchmarks of multi-view tracking and 3D pose estimation validate the effectiveness of the proposed method.
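
To make the cross-view association step more concrete, the following is a minimal sketch of how appearance, geometry and pose cues could be fused into a single cost and used to match tracklets between two views. It is an illustration under stated assumptions, not the method from the paper: the `Tracklet` fields, the distance functions, the equal weights, the `max_cost` threshold and the greedy matching are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Tracklet:
    """Hypothetical per-view tracklet summary (not the paper's data structure)."""
    track_id: str
    appearance: List[float]          # e.g. a colour histogram
    ground_xy: Tuple[float, float]   # mean position on a shared ground plane
    pose: List[float]                # flattened, normalised joint coordinates


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def association_cost(t1, t2, w_app=1.0, w_geo=1.0, w_pose=1.0):
    """Weighted sum of the three cues; the weights are assumed, not tuned."""
    return (w_app * cosine_distance(t1.appearance, t2.appearance)
            + w_geo * euclidean(t1.ground_xy, t2.ground_xy)
            + w_pose * euclidean(t1.pose, t2.pose))


def associate(view_a, view_b, max_cost=1.0):
    """Greedy one-to-one matching: cheapest pairs first, stop above max_cost."""
    pairs = sorted(
        (association_cost(a, b), a.track_id, b.track_id)
        for a in view_a for b in view_b
    )
    used_a, used_b, matches = set(), set(), []
    for cost, id_a, id_b in pairs:
        if cost > max_cost:
            break
        if id_a in used_a or id_b in used_b:
            continue
        used_a.add(id_a)
        used_b.add(id_b)
        matches.append((id_a, id_b))
    return matches


# Toy data: two people seen from two cameras, listed in different orders.
view_a = [Tracklet("a1", [1.0, 0.0], (0.0, 0.0), [0.1, 0.2]),
          Tracklet("a2", [0.0, 1.0], (5.0, 5.0), [0.8, 0.9])]
view_b = [Tracklet("b1", [0.0, 1.0], (5.1, 5.0), [0.8, 0.9]),
          Tracklet("b2", [1.0, 0.0], (0.0, 0.1), [0.1, 0.2])]
matches = associate(view_a, view_b)
```

On this toy input, `a1` pairs with `b2` and `a2` with `b1`, because those pairs agree on all three cues. A greedy matcher is used here only for brevity; a globally optimal assignment solver could be substituted without changing the cost definition.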

Citation

@inproceedings{Tang18JointTrackHPE,
  author    = {Zheng Tang and Renshu Gu and Jenq-Neng Hwang},
  title     = {Joint multi-view people tracking and pose estimation for {3D} scene reconstruction},
  booktitle = {Proc. ICME},
  address   = {San Diego, CA, USA},
  year      = {2018}
}