Building feature extraction approaches that can effectively characterize natural environment sounds is challenging due to the dynamic nature. In this paper, we develop a framework for feature extraction and obtaining semantic inferences from such data. In particular, we propose a new pooling strategy for deep architectures, that can preserve the temporal dynamics in the resulting representation. By constructing an ensemble of semantic embeddings, we employ an l-reconstruction based prediction algorithm for estimating the relevant tags. We evaluate our approach on challenging environmental sound recognition datasets, and show that the proposed features outperform traditional spectral features.
展开▼