Speaker Localization

The dataset contains videos of three- and four-speaker panels in a talk-show setting. Two training and two test videos were recorded with each setup. The speakers hold an impromptu conversation, and all speaker transitions are fully annotated. In some cases, speech from different speakers overlaps.

The videos were recorded with an HD video camera in a wide shot, as can be seen in the relevant preview screenshots above. The audio was recorded with two microphones in an A-B setup, in order to capture the spatial characteristics of the sound.
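One common way to use the spatial information in a two-microphone recording is to estimate the time difference of arrival between the channels. The following is a minimal sketch of that idea using plain cross-correlation. It is not the method used by the dataset authors, and the function name `estimate_delay`, its parameters, and the left/right channel ordering are assumptions made for illustration.

```python
import numpy as np

def estimate_delay(left, right, sample_rate):
    """Estimate how much `right` lags `left`, in seconds (hypothetical helper).

    A positive value means the sound reached the left microphone first.
    """
    # Remove the DC offset so the correlation peak reflects signal shape only.
    left = np.asarray(left, dtype=float) - np.mean(left)
    right = np.asarray(right, dtype=float) - np.mean(right)
    # Full cross-correlation; index i corresponds to lag i - (len(left) - 1).
    corr = np.correlate(right, left, mode="full")
    lag = int(np.argmax(corr)) - (len(left) - 1)
    return lag / sample_rate
```

Direct cross-correlation scales quadratically with signal length, so in practice it would be applied to short frames rather than to a whole recording.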

The videos are intended for training and validating models for Speaker Diarization, Audio and Visual Speaker Localization, and Speaker Indexing, with the broader motivation of automating video broadcasting.

Table 1. Specifications of the created AVLab Speaker Localization dataset.

Video    Speakers  Duration (s)  Class participation (%)
                                 Speaker 1  Speaker 2  Speaker 3  Speaker 4  Silence
Video A  3         180           9.8        35.7       29.5       –          24.9
Video B  3         205           13.0       35.7       18.6       –          32.8
Video C  4         443           10.3       26.3       25.2       17.2       21.0
Video D  4         226           9.5        34.5       19.4       7.6        29.0
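Class participation figures like those in Table 1 can be derived from the speaker annotations. The sketch below shows one way to do that, assuming the annotations are available as non-overlapping `(start, end, label)` tuples in seconds. The annotation format, the function name `class_participation`, and the treatment of unlabeled time as silence are all assumptions, since the page does not describe the file layout or how overlapping speech is counted.

```python
from collections import defaultdict

def class_participation(segments, total_duration):
    """Return the percentage of `total_duration` covered by each label.

    `segments` is an iterable of (start, end, label) tuples in seconds
    (a hypothetical format). Time not covered by any segment is reported
    as "Silence". Segments are assumed not to overlap.
    """
    seconds = defaultdict(float)
    for start, end, label in segments:
        seconds[label] += end - start
    # Whatever is left after summing all labeled time counts as silence.
    seconds["Silence"] = total_duration - sum(seconds.values())
    return {label: round(100.0 * s / total_duration, 1)
            for label, s in seconds.items()}
```

For example, `class_participation([(0, 10, "Speaker 1"), (10, 40, "Speaker 2")], 100)` attributes 10% to Speaker 1, 30% to Speaker 2, and the remaining 60% to silence.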

Publications

[in review process]

Download

The dataset can be downloaded here.
