
Characterizing Audio Events for Video Soundtrack Analysis.


Abstract

There is an entire emerging ecosystem of amateur video recordings on the internet today, in addition to the abundance of more professionally produced content. The ability to automatically scan and evaluate the content of these recordings would be very useful for search and indexing, especially as amateur content tends to be more poorly labeled and tagged than professional content. Although the visual content is often considered to be of primary importance, the audio modality contains rich information which may be very helpful in the context of video search and understanding. Any technology that could help to interpret video soundtrack data would also be applicable in a number of other scenarios, such as mobile device audio awareness, surveillance, and robotics. In this thesis we approach the problem of extracting information from these kinds of unconstrained audio recordings. Specifically, we focus on techniques for characterizing discrete audio events within the soundtrack (e.g. a dog bark or a door slam), since we expect events to be particularly informative about content. Our task is made more complicated by the extremely variable recording quality and noise present in this type of audio.

Initially we explore the idea of using the matching pursuit algorithm to decompose and isolate components of audio events. Using these components we develop an approach to non-exact (approximate) fingerprinting as a way to search audio data for similar recurring events, and we demonstrate a proof of concept for this idea.

Subsequently we extend the use of matching pursuit to build a full audio fingerprinting system, with the goal of identifying simultaneously recorded amateur videos (i.e. videos taken in the same place at the same time by different people, which therefore contain overlapping audio). Automatic discovery of these simultaneous recordings is one particularly interesting facet of general video indexing. We evaluate this fingerprinting system on a database of 733 internet videos.

Next we return to searching for features that directly characterize soundtrack events. We develop a system to detect transient sounds and represent each audio clip as a histogram of the transients it contains. We use this representation for video classification over a database of 1873 internet videos. When we combine these features with a spectral-feature baseline system, we achieve a relative improvement of 7.5% in mean average precision over the baseline.

In another attempt to devise features that better describe and compare events, we investigate decomposing audio using a convolutional form of non-negative matrix factorization, which yields event-like spectro-temporal patches. We use the resulting representation to build an event detection system that is more robust to additive noise than a comparative baseline system.

Lastly we investigate a promising feature representation that has previously been used by others to describe event-like sound-effect clips. These features derive from an auditory model and are meant to capture fine time structure in sound events. We compare these features and a related but simpler feature set on the task of video classification over 9317 internet videos. We find that combinations of these features with baseline spectral features produce a significant improvement in mean average precision over the baseline.
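The abstract does not give implementation details, but the greedy matching pursuit decomposition it refers to can be sketched in a few lines. The following is a minimal illustration over a hypothetical dictionary of windowed cosine atoms, not the actual dictionary or parameters used in the thesis: at each step the atom most correlated with the residual is selected and its projection subtracted.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy matching pursuit: repeatedly pick the dictionary atom most
    correlated with the residual and subtract its projection.
    `dictionary` holds one unit-norm atom per row."""
    residual = signal.astype(float).copy()
    decomposition = []  # list of (atom_index, coefficient) pairs
    for _ in range(n_atoms):
        correlations = dictionary @ residual
        k = int(np.argmax(np.abs(correlations)))
        coef = correlations[k]
        residual -= coef * dictionary[k]
        decomposition.append((k, coef))
    return decomposition, residual

# Toy dictionary: unit-norm, Hann-windowed cosines at integer frequencies.
n, n_dict = 256, 64
t = np.arange(n)
atoms = np.array([np.cos(2 * np.pi * f * t / n) * np.hanning(n)
                  for f in range(1, n_dict + 1)])
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)

# A synthetic "event" built from two atoms; MP should recover both.
signal = 3.0 * atoms[10] + 0.5 * atoms[40]
decomp, res = matching_pursuit(signal, atoms, n_atoms=2)
```

In the thesis context, the selected atoms (their times, frequencies, and scales) would serve as the event components from which approximate fingerprints are built; this sketch only shows the core greedy decomposition.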
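The non-negative matrix factorization step can likewise be illustrated on a toy "spectrogram". The thesis uses a convolutive variant that learns spectro-temporal patches; the sketch below shows only the basic decomposition V ≈ WH with multiplicative updates that the convolutive form extends, and the toy data and rank are chosen for illustration only.

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Plain NMF via multiplicative updates minimizing squared error.
    Factorizes non-negative V (freq x time) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_time)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy spectrogram: two spectral patterns active at different times,
# standing in for two distinct audio "events".
V = np.zeros((8, 20))
V[1, :10] = 1.0   # low-frequency event in the first half
V[6, 10:] = 2.0   # high-frequency event in the second half

W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Here the columns of W recover the two spectral patterns and the rows of H their activations in time; the convolutive extension replaces each column of W with a multi-frame spectro-temporal patch.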

Bibliographic record

  • Author: Cotton, Courtenay V.
  • Affiliation: Columbia University
  • Degree grantor: Columbia University
  • Subjects: Engineering, Electronics and Electrical; Computer Science
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 95 p.
  • Format: PDF
  • Language: English
