Speaker Localization


The dataset contains videos with three and four speakers in a talk-show panel setting. Two training and two test videos have been recorded with each setup. The speakers engage in impromptu conversation, and the speaker transitions are fully annotated. In some cases, speakers overlap.

The recordings were made with an HD video camera in a wide shot, as can be seen in the preview screenshots above. The audio was recorded with two microphones in an A-B setup in order to capture the spatial characteristics of the sound.
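An A-B stereo pair makes the inter-channel time difference usable as a localization cue: a speaker closer to one microphone arrives at the other slightly later. A common way to estimate this delay is GCC-PHAT (generalized cross-correlation with phase transform). The sketch below is illustrative only and is not part of the dataset tooling; the sample rate and signals are synthetic.

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the delay (in seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12   # phase transform: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    # Rearrange so index max_shift corresponds to zero delay.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Toy example: the right channel is the left channel delayed by 10 samples.
fs = 16000
rng = np.random.default_rng(0)
left = rng.standard_normal(4096)
right = np.roll(left, 10)
tau = gcc_phat(right, left, fs)   # ≈ 10 / 16000 s
```

A positive delay means the signal reached the reference microphone first; with a known microphone spacing, the delay can be converted to an angle of arrival.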

The videos are intended for training and validating models for Speaker Diarization, Audio and Visual Speaker Localization, and Speaker Indexing, with the broader motivation of creating automations for video broadcasting.


Table 1 Specifications of the created AVLab Speaker Localization dataset.

         Speakers  Duration (s)               Class participation (%)
                                 Speaker 1  Speaker 2  Speaker 3  Speaker 4  Silence
Video A     3          180          9.8       35.7       29.5        —        24.9
Video B     3          205         13.0       35.7       18.6        —        32.8
Video C     4          443         10.3       26.3       25.2       17.2      21.0
Video D     4          226          9.5       34.5       19.4        7.6      29.0
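The class-participation figures in Table 1 can be reproduced from the speaker annotations by summing the duration of each label and normalizing by the total. The segment format below is a hypothetical illustration (the actual AVLab annotation files may use a different layout), and overlapping segments are not handled in this sketch.

```python
# Hypothetical annotation format: (label, start_s, end_s).
segments = [
    ("speaker1", 0.0, 12.0),
    ("silence", 12.0, 15.0),
    ("speaker2", 15.0, 40.0),
    ("speaker1", 40.0, 50.0),
]

def class_participation(segments):
    """Return each label's share of total annotated time, in percent."""
    totals = {}
    for label, start, end in segments:
        totals[label] = totals.get(label, 0.0) + (end - start)
    duration = sum(totals.values())
    return {label: 100.0 * t / duration for label, t in totals.items()}

shares = class_participation(segments)
# e.g. speaker1 → 44.0 %, speaker2 → 50.0 %, silence → 6.0 %
```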



[in review process]


The dataset can be downloaded here.