With the development of big data, audio content analysis is becoming increasingly important, especially for the storage and utilization of large-scale multimedia data. However, as audio data grows more complex, existing audio features can no longer represent the audio content well, and classifiers built on these features cannot meet the requirement of high classification accuracy. Based on a comprehensive investigation of state-of-the-art methods for audio classification and feature transformation, this thesis proposes a series of algorithms based on task-related feature modeling and conducts the corresponding experiments. Through feature modeling, low-level features are mapped into another feature space, yielding high-level features that are better suited to the target classification task. The main contributions and novelties of this thesis include:
(1) For the environmental sound classification task, we propose a feature modeling method based on learning audio ``element units''. We assume that audio content can be described by a set of micro descriptors, called ``element units'', and that different combinations of these ``element units'' represent different kinds of audio. Audio data can therefore be classified by recognizing the distribution of these ``element units''. To discover the ``element units'' automatically, we propose a generation method based on the bag-of-words model. We use the neurons of a self-organizing feature map (SOFM) network to represent the ``element units'', which works better than using the cluster centers produced by the k-means algorithm. With enough training steps and a good training method, the SOFM can alleviate the local optimum problem. We then propose an n-competition encoding strategy to form a probability histogram that describes the distribution of the ``element units''. Compared with the vector quantization encoding strategy, the n-competition encoding strategy is more robust to boundary points.
(2) For low-resource classification tasks, we propose a feature modeling method that improves classification accuracy on low-resource databases. Training a large and robust deep neural network usually requires a large amount of data, but in practice only a small amount of labeled data can be acquired. Although a lot of unlabeled data is available, it cannot be used directly, because annotation costs too much time and money. Unsupervised methods are an option, but they do not work for many tasks. The lack of data leads to overfitting. To solve this problem, we propose a model that combines transfer learning and a bottleneck DNN for feature modeling. We use a large amount of out-of-domain unlabeled data for the semi-supervised training of a transfer neural network, and then combine the idea of transfer learning with the bottleneck DNN to extract bottleneck features. The proposed method yields a substantial improvement and outperforms the best previously reported performance on the test database.
(3) For the music genre classification task, we propose a segment feature modeling method that combines sequential knowledge with statistical features. Many researchers consider sequential knowledge to be an important characteristic of audio data that benefits the representation of audio content. To obtain a representation that contains sequence information, we use a Long Short-Term Memory recurrent neural network (LSTM RNN) for sequence modeling. However, because of the drawbacks of sequence training, segment-level performance improves little when majority voting is used to obtain the segment labels. To solve this problem, we combine the frame-level sequential knowledge features with the frame-level initial features. Because the initial features contain no sequence information while the sequence features do, the two cannot simply be fused. We therefore propose a fused segment feature method: we compute statistical segment features from the frame features and then fuse them. Experimental results show that the fused feature improves classification accuracy and outperforms the sequence modeling method using the LSTM RNN.
(4) For the music genre classification task, we propose a segment feature modeling method using ``element units''. A perfect representation of music signals must capture many characteristics, such as beat, rhythm, loudness, singer information, and musical instrument information, but these characteristics are difficult to capture. In this thesis, we capture them with micro descriptors called ``element units''. We use a Gaussian mixture model (GMM) to learn and generate the ``element units''. In this model, a GMM represents a kind of music, and each Gaussian component of the GMM represents one ``element unit''. The model is based on the I-vector framework: we use a GMM-based universal background model (GMM-UBM) for the unsupervised learning of the ``element units''. In theory, given enough data and enough Gaussian mixture components, a GMM can fit any data distribution. Therefore, in order to learn enough ``element units'', we propose the multilingual I-vector model. The experimental results show that the multilingual I-vector based ``element unit'' learning method represents music signals better.
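
As an illustration of the encoding step in contribution (1), the following minimal Python sketch contrasts hard vector quantization with one possible reading of n-competition encoding. It is not the exact formulation used in the thesis: the function names, the equal split of each frame's vote among its n nearest ``element units'', and the toy one-dimensional units are all assumptions made for illustration. The units stand in for trained SOFM neurons.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode_histogram(frames, units, n=1):
    """Return a normalised histogram over the units.

    Each frame votes for its n nearest units (assumption: equal weights).
    With n=1 this reduces to hard vector quantization.
    """
    hist = [0.0] * len(units)
    for frame in frames:
        ranked = sorted(range(len(units)),
                        key=lambda k: sq_dist(frame, units[k]))
        for k in ranked[:n]:
            hist[k] += 1.0 / n
    return [h / len(frames) for h in hist]


# Hypothetical "element units" (stand-ins for trained SOFM neurons).
units = [[0.0], [1.0], [5.0]]

# Two frames on either side of the boundary between unit 0 and unit 1.
left, right = [[0.49]], [[0.51]]

vq_left = encode_histogram(left, units, n=1)
vq_right = encode_histogram(right, units, n=1)
nc_left = encode_histogram(left, units, n=2)
nc_right = encode_histogram(right, units, n=2)
```

In this toy case the two nearly identical frames receive completely different histograms under vector quantization, while the n=2 encoding assigns both the same histogram, which is the kind of boundary robustness the abstract refers to.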
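
The fusion step in contribution (3) can be sketched in a similar way. The abstract does not state which segment statistics are used, so the sketch below assumes the per-dimension mean and standard deviation, and fusion by simple concatenation. The function names are hypothetical, and the sequence features are assumed to have been produced beforehand by an LSTM RNN.

```python
from statistics import fmean, pstdev


def segment_stats(frames):
    """Per-dimension mean and population standard deviation over a segment.

    Assumption: mean and standard deviation are the segment statistics.
    """
    dims = list(zip(*frames))  # transpose: one tuple per feature dimension
    return [fmean(d) for d in dims] + [pstdev(d) for d in dims]


def fuse_segment(initial_frames, sequence_frames):
    """Fuse the two feature streams at segment level by concatenation."""
    return segment_stats(initial_frames) + segment_stats(sequence_frames)


# Toy segment with two frames: 2-D initial features and 1-D sequence
# features (the latter standing in for LSTM RNN outputs).
initial = [[1.0, 2.0], [3.0, 4.0]]
sequence = [[0.5], [1.5]]

fused = fuse_segment(initial, sequence)
```

Because both streams are reduced to fixed-length statistics before concatenation, the fused vector has the same length for every segment regardless of how many frames it contains.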