Universidad Pública de Navarra - Campus de Excelencia Internacional

Introduction

We have created a public database of videos for head tracking and pose estimation. The database consists of 120 videos acquired with a standard webcam, corresponding to 10 different subjects (6 males and 4 females) with 12 videos each. Every set of 12 videos is composed of 6 guided-movement sequences and 6 free-movement sequences. In the guided sequences, the user follows a specific movement pattern: three pure translations (X, Y, Z) and three pure rotations (roll, yaw, pitch). In the free sequences, the user moves the head freely, combining translations and rotations along the three spatial axes. Translation and rotation axes are defined in the figure below. Every video begins and ends with the head in a frontal position, at a working distance of 55-60 cm from the camera. Movement ranges are large: translations reach more than 200 mm along any axis from the starting point, and rotations reach up to 30°.

Definition of translation and rotation axes

The videos are provided in MPEG-4 format, re-encoded with a loss of approximately 1% with respect to the original recording. They have a resolution of 1280×720 pixels and were acquired at 30 frames per second. Every video is 10 seconds long and contains 300 frames. Each video is associated with three ground-truth text files. One contains automatically annotated 2D facial points (the 2D ground-truth), following a model of 54 facial landmarks. The other two contain the head pose (the 3D ground-truth). One corresponds to the originally recorded head pose, and the other corresponds to the same head pose sequence transformed so that the rotation in the initial frame is exactly zero. This transformation is done by multiplying the pose of each frame by the inverse of the rotation matrix of the initial pose. Getting an exact zero initial rotation is not feasible during the recordings, and applying this small transformation to every video is equivalent to moving the headband slightly at the beginning of each video so that the sensor gives an exact zero rotation for the initial frame. The average deviation from zero in the original ground-truth is 0.83°, 0.86°, and 1.05° in roll, yaw, and pitch respectively. In the 3D ground-truths, translations are given in millimeters and rotations in degrees; in the 2D ground-truth, landmark positions are given in pixels.
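The zero-initial-rotation transformation described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the use of left-multiplication, and the synthetic rotation about a single axis are assumptions made for illustration only, and the sketch works directly on 3×3 rotation matrices because the angle convention of the ground-truth files is not stated here.

```python
import numpy as np

def rotation_about_z(angle_deg):
    """Hypothetical helper: 3x3 rotation matrix for a rotation about the Z axis."""
    a = np.radians(angle_deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def zero_initial_rotation(rotations):
    """Re-express a sequence of rotation matrices relative to the first frame.

    Assumption: the inverse of the first-frame rotation is applied by
    left-multiplication; the source does not state the multiplication order.
    """
    rotations = np.asarray(rotations, dtype=float)  # shape (n_frames, 3, 3)
    r0_inv = rotations[0].T  # the inverse of a rotation matrix is its transpose
    return r0_inv @ rotations  # matmul broadcasts over the frame axis

# Synthetic sequence: the first frame is off by 0.83 degrees instead of exactly zero.
recorded = np.stack([rotation_about_z(a) for a in (0.83, 5.83, 10.83)])
corrected = zero_initial_rotation(recorded)

# Recover the single-axis angle of each corrected frame, in degrees.
recovered_deg = np.degrees(np.arctan2(corrected[:, 1, 0], corrected[:, 0, 0]))
print(np.round(recovered_deg, 2))
```

After the correction, the first frame is the identity rotation and the remaining frames are expressed relative to it, which is the effect the transformed 3D ground-truth file is described as having.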

Sample images taken from the database:

Reference

Mikel Ariz, José J. Bengoechea, Arantxa Villanueva, Rafael Cabeza, A novel 2D/3D database with automatic face annotation for head tracking and pose estimation, Computer Vision and Image Understanding, Volume 148, July 2016, Pages 201-210, ISSN 1077-3142

Download the database

This database is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The data are to be used only for non-commercial scientific purposes. If you use this dataset in a scientific publication, please cite the aforementioned paper.

You can download ...

Public University of Navarre

GI4E - Gaze Interaction for Everybody

UPNA Head Pose Database
