LVLib SMO v2 for General Audio Classification

To create a larger, yet compact, database that meets the demands of deep learning, the original LVLib SMO v1 has been expanded by merging the GTZAN Music Genre and AESDD repositories into the Music and Speech classes, respectively. To preserve class balance and common recording specifications, the Other class was augmented with new samples as well. The dataset now reaches almost 450 minutes of uncompressed audio (22,050 Hz/16 bit/mono) and offers more than 35,000 samples when split into 750 ms-long blocks.

To further mitigate data scarcity, two data augmentation techniques that have been reported to increase model accuracy in sound classification tasks are also recommended for the whole dataset: time stretching by factors {0.85, 1.15} and pitch shifting by {-4, 4} semitones. The deformation factors are selected so that the processed data preserve their semantic meaning; the idea is that a model trained on the additional deformed data generalizes better to unseen data. Augmentation quintuples the amount of available data, yielding about 175,000 750 ms-long samples. The augmentations can be applied using the command-line tool Sound eXchange (SoX).

To make results comparable across algorithms of different creators, linear data splitting with 3-fold cross-validation is mandatory: split the data of each class linearly at points 0.33 and 0.66 to form the folds.
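As a minimal sketch of the recommended protocol, the helper below builds the SoX command lines for the four augmented variants of one file (SoX's `tempo` effect stretches time while preserving pitch, and its `pitch` effect expects a shift in cents, i.e. 100 per semitone) and computes the linear 3-fold boundaries for one class. File names and function names are illustrative, not part of the dataset distribution, and the exact stretch/shift conventions should be checked against the cited publication.

```python
from pathlib import Path

# Recommended deformation values from the dataset description.
TEMPO_FACTORS = (0.85, 1.15)   # time stretching (SoX "tempo" keeps pitch fixed)
PITCH_SEMITONES = (-4, 4)      # pitch shifting; SoX "pitch" takes cents

def sox_commands(src: str) -> list[list[str]]:
    """Build the four SoX command lines that augment one audio file.

    Output file names are hypothetical; pass each list to subprocess.run()
    on a system where SoX is installed.
    """
    stem = Path(src).stem
    cmds = []
    for f in TEMPO_FACTORS:
        cmds.append(["sox", src, f"{stem}_tempo{f}.wav", "tempo", str(f)])
    for s in PITCH_SEMITONES:
        cmds.append(["sox", src, f"{stem}_pitch{s}.wav", "pitch", str(s * 100)])
    return cmds

def linear_folds(n_blocks: int) -> list[range]:
    """Linear 3-fold split of one class, cut at 0.33 and 0.66 of its timeline.

    n_blocks is the number of 750 ms blocks the class yields; each returned
    range holds the block indices of one fold.
    """
    a, b = int(0.33 * n_blocks), int(0.66 * n_blocks)
    return [range(0, a), range(a, b), range(b, n_blocks)]
```

Running all four SoX commands per file, plus the originals, gives the five-fold increase in data described above; `linear_folds` then assigns contiguous (linear) block spans to folds rather than shuffling, which avoids leakage between overlapping or adjacent blocks of the same recording.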


Vrysis, L., Tsipas, N., Thoidis, I., & Dimoulas, C. (2020). 1D/2D Deep CNNs vs. Temporal Feature Integration for General Audio Classification. Journal of the Audio Engineering Society, 68(1/2), 66-77.


If you wish to use the dataset for academic purposes, please cite the aforementioned publication. Furthermore, since the dataset contains data from the GTZAN Music-Speech, GTZAN Genre Collection, and AESDD datasets, you should also include the corresponding references.