Datasets

Acted Emotional Speech Dynamic Database (AESDD)

The Acted Emotional Speech Dynamic Database (AESDD) is a collection of audio utterances in the Greek language, spoken by professional actors in different emotional states. The dataset can be used for Speech Emotion Recognition, as well as for tasks such as Speaker Recognition and Diarization.

 

AVLab for Speaker Localization

This audiovisual dataset contains videos with 3 and 4 speakers speaking impromptu in a talk-show panel simulation. The audio is recorded with an A-B microphone array to allow the extraction of Sound Source Localization information. The videos were recorded in the AV Laboratory of the Engineering Faculty and are fully annotated.

 

BDLib v1 for Environmental Sound Recognition

BDLib v1 is a collection of sounds for the task of Environmental Sound Recognition. The library is organized in the following 10 classes: airplanes, alarms, applause, birds, dogs, motorcycles, rain, rivers, sea waves, and thunders.

 

BDLib v2 for Environmental Sound Recognition

BDLib v2 (an extension of BDLib v1) is a collection of sounds for the task of Environmental Sound Recognition. The library is organized in the following 10 classes: airplanes, alarms, applause, birds, dogs, motorcycles, rain, rivers, sea waves, and thunders.

 

LVLib v1 for Generic Audio Detection

LVLib v1 is a lightweight dataset for generic audio detection and classification. It includes continuous real-world recordings covering speech, music, silence, and ambient noise classes, with a total duration of 10 minutes, and is recommended for preliminary tests and the evaluation of semantic audio analysis engines. The dataset is split into four subsets with slightly different content: Subset #1 is a compilation of environmental sounds derived from sound libraries, Subset #2 mostly consists of household sounds, speech, and TV and radio recordings, Subset #3 contains audio content from an office working environment, and Subset #4 includes outdoor audio events in an urban environment.

Publications
  • Vrysis, L., Tsipas, N., Dimoulas, C., & Papanikolaou, G. (2015). Mobile audio intelligence: From real time segmentation to crowd sourced semantics. In Proceedings of the Audio Mostly 2015 on Interaction With Sound (pp. 1-6).
Licence

If you wish to use the dataset for academic purposes, please cite the aforementioned publication.

 

LVLib v2 for Generic Audio Detection

LVLib v2 includes continuous real-world recordings covering speech, music, silence, and ambient noise classes. Uncompressed audio is used (44,100 Hz, 16 bit, mono), and the majority of the captures were made with mobile terminals (smartphones) rather than professional recording equipment. The dataset offers a variety of audio content, combining speech, radio and TV broadcasts, music, environmental and household sounds, noise from electronic devices, ringing, etc., along with semantic annotations that define class-labeled segments. It is split into two subsets: the first is a sixty-minute compilation of sounds collected from libraries and recordings and is recommended for training; the second can be used for measuring system performance and is annotated to highlight audio events in terms of transition points.

Publications
  • Vrysis, L., Tsipas, N., Dimoulas, C., & Papanikolaou, G. (2016). Crowdsourcing audio semantics by means of hybrid bimodal segmentation with hierarchical classification. Journal of the Audio Engineering Society, 64(12), 1042-1054.
Licence

If you wish to use the dataset for academic purposes, please cite the aforementioned publication.

 

LVLib SMO v1 for Generic Audio Classification

This dataset follows a three-class classification scheme, according to the Speech/Music/Other (SMO) taxonomy. Since Speech/Music classification is a relatively easy task, a more demanding scenario is set up by introducing a third class (Other), which contains a large number of audio signals not classified as speech or music (i.e. environmental sounds, human and animal bioacoustics, weather phenomena, engines, motors, other machinery, and many types of noise), making the dataset better simulate real-world classification scenarios. To make results comparable across algorithms from different creators, linear data splitting and 3-fold cross-validation are recommended, with split points at 0.33 and 0.66.
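As an illustration, a minimal Python sketch of the recommended linear split and 3-fold setup is given below. The directory layout (one folder per class) and the .wav file extension are assumptions made for the example, not part of the dataset specification.

```python
from pathlib import Path

# Hypothetical layout: one sub-folder per class (speech/, music/, other/).
DATASET_ROOT = Path("LVLib-SMO-v1")
CLASSES = ["speech", "music", "other"]
SPLIT_POINTS = (0.33, 0.66)  # linear split points recommended for this dataset

def linear_folds(files, split_points=SPLIT_POINTS):
    """Split an ordered file list linearly into three folds."""
    files = sorted(files)  # keep the linear (file-order) arrangement
    n = len(files)
    a, b = int(n * split_points[0]), int(n * split_points[1])
    return files[:a], files[a:b], files[b:]

# Build the three folds, keeping each class represented in every fold.
folds = [[], [], []]
for cls in CLASSES:
    for i, part in enumerate(linear_folds((DATASET_ROOT / cls).glob("*.wav"))):
        folds[i].extend((f, cls) for f in part)

# 3-fold cross-validation: each fold serves once as the test set.
for k in range(3):
    test = folds[k]
    train = [item for j in range(3) if j != k for item in folds[j]]
    print(f"fold {k}: {len(train)} training files, {len(test)} test files")
```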

Publications
  • Vrysis, L., Tsipas, N., Dimoulas, C., & Papanikolaou, G. (2017, May). Extending Temporal Feature Integration for Semantic Audio Analysis. In Audio Engineering Society Convention 142. Audio Engineering Society.
Licence

If you wish to use the dataset for academic purposes, please cite the aforementioned publication. Data from the GTZAN Music-Speech dataset is also included, so the corresponding reference should be added as well.

 

LVLib SMO v2 for Generic Audio Classification

Aiming at a larger, yet still compact, database that can meet the demands of deep learning, the original LVLib SMO v1 has been expanded by embedding the GTZAN Music Genre and AESDD repositories into the Music and Speech classes, respectively. To keep the classes balanced and the recording specifications consistent, the Other class was augmented with new samples as well. The dataset reaches almost 450 minutes in duration (uncompressed audio, 22,050 Hz/16 bit/mono) and offers more than 35,000 samples when split into 750 ms-long blocks. Additionally, to mitigate data scarcity, applying two data augmentation techniques to the whole dataset is recommended, as both have been reported to increase model accuracy in sound classification tasks: time stretching by factors {0.85, 1.15} and pitch shifting by {-4, +4} semitones. The deformation factors are selected so that the processed data preserve their semantic meaning; the idea is that a model trained on additional deformed data can generalize better to unseen data. The amount of available data then becomes five times the original, ending up at roughly 175,000 750 ms-long samples. The augmentations can be applied with the command-line tool Sound eXchange (SoX), as sketched below. To make results comparable across algorithms from different creators, linear data splitting and 3-fold cross-validation are mandatory: split the data of each class linearly at points 0.33 and 0.66 to form the folds.
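A minimal sketch of the augmentation step is given below. It assumes SoX is installed and on the PATH and that each class folder contains .wav files; the file layout and output naming are illustrative assumptions, and SoX's tempo and pitch effects are used here as one reasonable way to realize the recommended deformations.

```python
import subprocess
from pathlib import Path

DATASET_ROOT = Path("LVLib-SMO-v2")      # assumed layout: one folder per class
OUTPUT_ROOT = Path("LVLib-SMO-v2-aug")   # hypothetical output location

# Recommended deformations: time stretching {0.85, 1.15}, pitch shifting {-4, +4} semitones.
TEMPO_FACTORS = [0.85, 1.15]
PITCH_SEMITONES = [-4, 4]

for wav in DATASET_ROOT.rglob("*.wav"):
    out_dir = OUTPUT_ROOT / wav.parent.relative_to(DATASET_ROOT)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Time stretching with SoX's tempo effect (changes duration, preserves pitch).
    for factor in TEMPO_FACTORS:
        out = out_dir / f"{wav.stem}_tempo{factor}.wav"
        subprocess.run(["sox", str(wav), str(out), "tempo", str(factor)], check=True)

    # Pitch shifting with SoX's pitch effect (its argument is given in cents).
    for semitones in PITCH_SEMITONES:
        out = out_dir / f"{wav.stem}_pitch{semitones:+d}.wav"
        subprocess.run(["sox", str(wav), str(out), "pitch", str(semitones * 100)], check=True)
```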

Publications
  • Vrysis, L., Tsipas, N., Thoidis, I., & Dimoulas, C. (2020). 1D/2D Deep CNNs vs. Temporal Feature Integration for General Audio Classification. Journal of the Audio Engineering Society, 68(1/2), 66-77.
Licence

If you wish to use the dataset for academic purposes, please cite the aforementioned publication. Furthermore, the dataset contains data from the GTZAN Music-Speech, GTZAN Genre Collection, and AESDD datasets, so the corresponding references should also be included.

 

LVLib SMO v3 for Generic Audio Classification

This dataset is formulated to highlight data normalization issues within CNN topologies. It is an exact copy of LVLib-SMO-v1, but with modified gain. In particular, LVLib-SMO-v3 was generated by applying a random gain ({0, -10, -20, -30} dBFS) to each fold and class, following a 3-fold setup. For this reason, and to make results comparable across algorithms from different creators, linear data splitting and 3-fold cross-validation are mandatory: split the data of each class linearly at split points 0.33 and 0.66 to form the folds.
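For illustration only, the sketch below shows how such a per-fold, per-class gain modification could be reproduced with SoX. Interpreting the listed values as gain offsets applied via SoX's gain effect, as well as the directory layout and output naming, are assumptions; the released LVLib-SMO-v3 already contains the gain-modified audio.

```python
import random
import subprocess
from pathlib import Path

SOURCE_ROOT = Path("LVLib-SMO-v1")       # assumed layout: one folder per class
OUTPUT_ROOT = Path("LVLib-SMO-v3-like")  # hypothetical output location
CLASSES = ["speech", "music", "other"]
GAIN_DB = [0, -10, -20, -30]             # candidate gain values from the dataset description

def linear_folds(files):
    """Linear 3-fold split at 0.33 and 0.66, as mandated for this dataset."""
    files = sorted(files)
    n = len(files)
    return files[: int(n * 0.33)], files[int(n * 0.33): int(n * 0.66)], files[int(n * 0.66):]

for cls in CLASSES:
    for fold_idx, fold_files in enumerate(linear_folds((SOURCE_ROOT / cls).glob("*.wav"))):
        gain = random.choice(GAIN_DB)    # one random gain per (class, fold) pair
        out_dir = OUTPUT_ROOT / cls / f"fold{fold_idx}"
        out_dir.mkdir(parents=True, exist_ok=True)
        for wav in fold_files:
            # Apply the gain offset with SoX's gain effect (value in dB).
            subprocess.run(["sox", str(wav), str(out_dir / wav.name), "gain", str(gain)],
                           check=True)
```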

Publications
  • Vrysis, L., Tsipas, N., Thoidis, I., & Dimoulas, C. (2020). 1D/2D Deep CNNs vs. Temporal Feature Integration for General Audio Classification. Journal of the Audio Engineering Society, 68(1/2), 66-77.
Licence

If you wish to use the dataset for academic purposes, please cite the aforementioned publication. Data from the GTZAN Music-Speech dataset is also included, so the corresponding reference should be added as well.