Journal: International journal on digital libraries

From subtitles to substantial metadata: examining characteristics of named entities and their role in indexing



Abstract

This paper explores the possible role of named entities extracted from subtitle text in the automatic indexing of TV programs. It does so by analyzing entity types, name density and name frequencies in subtitles and metadata records from different genres of TV programs. The name density in metadata records is much higher than in subtitles, and named entities that occur frequently in the subtitles are more likely to be mentioned in the metadata records. Further analysis indicates that the use of a named entity in the metadata increases with its frequency in the subtitles. The most substantial difference was between frequencies of one and two: named entities occurring twice in the subtitles were twice as likely to be present in the metadata records as those occurring once. Personal names, geographical names and names of organizations were the most prominent entity types in both the news subtitles and the news metadata, while persons, creative works and locations were the most prominent in culture programs. It is not possible to extract all the named entities in the manually created metadata records by applying named entity recognition to the subtitles of the same programs, but it is possible to find a large subset of the named entities for some categories in certain genres. The results reported in this paper show that subtitles are a good source of personal names for all the genres covered in the study, and of creative works in literature programs. In total, it was possible to find 38% of the named entities in metadata records for news programs and 32% for literature programs, while 21% of the named entities in metadata records for talk shows were also present in the subtitles for the programs.
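The measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the definition of name density as entity mentions per token, the exact string matching between subtitle and metadata names, the function names, and the toy data are all assumptions made for illustration, and the entity mentions are assumed to come from some prior named entity recognition step.

```python
from collections import Counter


def name_density(entity_mentions, token_count):
    """Entity mentions per token (an assumed definition of name density)."""
    return len(entity_mentions) / token_count if token_count else 0.0


def metadata_coverage(subtitle_mentions, metadata_entities):
    """Share of distinct metadata entities that also occur in the subtitles."""
    meta = set(metadata_entities)
    if not meta:
        return 0.0
    return len(meta & set(subtitle_mentions)) / len(meta)


def presence_by_frequency(subtitle_mentions, metadata_entities):
    """Map each subtitle frequency to the fraction of entities with that
    frequency that are also present in the metadata record."""
    freq = Counter(subtitle_mentions)
    meta = set(metadata_entities)
    totals, hits = Counter(), Counter()
    for name, n in freq.items():
        totals[n] += 1
        if name in meta:
            hits[n] += 1
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Hypothetical toy data for one program; not taken from the study.
subtitle_mentions = ["Oslo", "Ibsen", "Ibsen", "NRK", "Bergen", "Ibsen", "NRK"]
metadata_entities = ["Ibsen", "NRK", "Peer Gynt"]

density = name_density(subtitle_mentions, token_count=100)
coverage = metadata_coverage(subtitle_mentions, metadata_entities)
by_freq = presence_by_frequency(subtitle_mentions, metadata_entities)
```

In this sketch, coverage corresponds to the kind of percentage reported per genre (the share of metadata entities also found in the subtitles), and presence_by_frequency corresponds to the comparison between entities occurring once, twice, or more often in the subtitles. A real analysis would likely need name normalization and entity typing before matching.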
