Audiovisual Recognition Datasets

A comprehensive list of all the audiovisual recognition datasets created within the research activities of the M3C Research Group. All datasets are publicly available for research purposes.

 

Acted Emotional Speech Dynamic Database (AESDD)

The Acted Emotional Speech Dynamic Database is a collection of audio utterances in the Greek language, spoken by professional actors in different emotional states. The dataset can be used for Speech Emotion Recognition, as well as for Speaker Recognition, Speaker Diarization, and related tasks.

 

Audio Tampering Dataset and Generator

A dataset generator script that automatically produces tampered audio files. The user provides a set of source files, and the script performs audio splicing on them to generate the tampered files. A dataset of tampered audio files containing segments with different audio encodings, created using this script, is also provided.
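
To make the splicing idea concrete, here is a minimal sketch (not the group's actual generator script) that concatenates randomly chosen segments from several source files into one spliced output. It assumes NumPy and the soundfile library for array handling and WAV I/O; the function and parameter names are illustrative.

    import random
    import numpy as np
    import soundfile as sf  # assumed I/O library; any WAV reader would do

    def splice(source_paths, out_path, segment_seconds=2.0, n_segments=4, seed=None):
        """Concatenate random segments from several sources into one 'tampered' file."""
        rng = random.Random(seed)
        pieces, sample_rate = [], None
        for _ in range(n_segments):
            path = rng.choice(source_paths)
            audio, sr = sf.read(path, dtype="float32")
            if audio.ndim > 1:                      # mix down to mono for simplicity
                audio = audio.mean(axis=1)
            if sample_rate is None:
                sample_rate = sr
            elif sr != sample_rate:                 # skip files with mismatched rates
                continue
            seg_len = int(segment_seconds * sr)
            if len(audio) <= seg_len:
                pieces.append(audio)
            else:
                start = rng.randrange(len(audio) - seg_len)
                pieces.append(audio[start:start + seg_len])
        sf.write(out_path, np.concatenate(pieces), sample_rate)

    # Example (hypothetical file names):
    # splice(["a.wav", "b.wav", "c.wav"], "tampered.wav", seed=0)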

 

AVLab for Speaker Localization

This dataset contains videos with 3 and 4 speakers speaking impromptu in a talk-show panel simulation. The recordings were captured with an A-B microphone array to allow the extraction of Sound Source Localization information. The videos were recorded in the AV Laboratory of the Engineering Faculty and are fully annotated.

 

BDLib v1 for Environmental Sound Recognition

BDLib v1 is a collection of sounds for the task of Environmental Sound Recognition. The library is organized in the following 10 classes: airplanes, alarms, applause, birds, dogs, motorcycles, rain, rivers, sea waves, and thunders.

 

BDLib v2 for Environmental Sound Recognition

BDLib v2 (an extension to BDLib v1) is a collection of sounds for the task of Environmental Sound Recognition. The library is organized in the following 10 classes: airplanes, alarms, applause, birds, dogs, motorcycles, rain, rivers, sea waves, and thunders.

 

LVLib v1 for General Audio Detection

LVLib v1 is a lightweight dataset for general audio detection and classification. It includes continuous real-world recordings with speech, music, silence and ambient noise classes with a total duration of 10 minutes.

 

LVLib v2 for General Audio Detection

LVLib v2 includes continuous real-world recordings covering speech, music, silence, and ambient noise classes. The content is varied, combining speech, radio and TV broadcasts, music, environmental and household sounds, electronic device noise, ringing, etc., along with its semantic annotation. Most of the recordings were made with smartphones.

 

LVLib SMO v1 for General Audio Classification

This dataset follows a three-class classification scheme, according to the Speech/Music/Other (SMO) taxonomy.

 

LVLib SMO v2 for General Audio Classification

To create a large yet compact database that meets the demands of deep learning, the original LVLib SMO v1 has been expanded by embedding the GTZAN Music Genre and AESDD repositories into the Music and Speech classes, respectively.

 

LVLib SMO v3 for General Audio Classification

This dataset is formulated to expose data normalization issues in CNN topologies. It is an exact copy of LVLib SMO v1, but with partially modified gain.

 

LVLib SMO v4 for General Audio Classification

LVLib SMO v4 follows the same 3-class scheme as the rest of the LVLib SMO family, but features data deformations: variable gain and additive noise.
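
As a rough illustration of such deformations, the sketch below applies a random gain change and additive white noise at a chosen signal-to-noise ratio to a single recording. The parameter ranges and library choices (NumPy, soundfile) are assumptions for illustration, not the settings actually used to build LVLib SMO v3 or v4.

    import numpy as np
    import soundfile as sf  # assumed I/O library

    def deform(in_path, out_path, gain_db_range=(-12.0, 12.0), snr_db=20.0, seed=None):
        """Apply a random gain change plus additive white noise to one recording."""
        rng = np.random.default_rng(seed)
        audio, sr = sf.read(in_path, dtype="float32")
        gain_db = rng.uniform(*gain_db_range)          # variable gain, in decibels
        audio = audio * (10.0 ** (gain_db / 20.0))
        signal_power = np.mean(audio ** 2) + 1e-12     # avoid division by zero
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        audio = audio + rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
        sf.write(out_path, np.clip(audio, -1.0, 1.0).astype(np.float32), sr)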

 

UrANP for Pollution Detection

UrANP follows a 2-class scheme for discriminating between polluting and non-polluting sounds. The dataset contains data drawn from the UrbanSound8K, TUT Acoustic Scenes 2017, and ESC-50 datasets.

 

iTrackEmo

iTrackEmo is a collection of eye-tracking data from 20 subjects watching emotionally charged videos and images, accompanied by additional information about the material shown and the subjects' heart rate. The eye-tracking data were collected with two eye trackers: a Tobii Pro Nano and a custom webcam-based eye tracker.