Annual Meeting of the Association for Computational Linguistics

Feature Hashing for Language and Dialect Identification

Abstract

We evaluate feature hashing for language identification (LID), a method not previously used for this task. Using a standard dataset, we first show that while feature performance is high, LID data is highly dimensional and mostly sparse (>99.5%) as it includes large vocabularies for many languages; memory requirements grow as languages are added. Next we apply hashing using various hash sizes, demonstrating that there is no performance loss with dimensionality reductions of up to 86%. We also show that using an ensemble of low-dimension hash-based classifiers further boosts performance. Feature hashing is highly useful for LID and holds great promise for future work in this area.
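
The hashing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the use of character trigrams, the CRC32 hash, the signed-count variant, and the names `char_ngrams` and `hash_features` are all assumptions made for illustration. The point it shows is that the feature vector has a fixed length chosen in advance (`n_buckets`), so memory does not grow as new languages and their vocabularies are added.

```python
import zlib
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of `text` (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hash_features(text, n_buckets=2 ** 10, n=3):
    """Map character n-gram counts into a fixed-size vector (hashing trick).

    Assumptions for illustration only: CRC32 as the hash function, and the
    top hash bit used as a sign so that colliding features tend to cancel
    instead of accumulating. `n_buckets` should be a power of two no larger
    than 2**31 so the index bits and the sign bit do not overlap.
    """
    vec = [0] * n_buckets
    for gram, count in Counter(char_ngrams(text, n)).items():
        h = zlib.crc32(gram.encode("utf-8"))
        index = h % n_buckets                 # bucket chosen by the low bits
        sign = 1 if (h >> 31) == 0 else -1    # sign chosen by the top bit
        vec[index] += sign * count
    return vec


if __name__ == "__main__":
    # Two texts in different languages share one 64-dimensional feature space.
    english = hash_features("the quick brown fox", n_buckets=64)
    german = hash_features("der schnelle braune Fuchs", n_buckets=64)
    print(len(english), len(german))
```

No vocabulary is stored: an unseen n-gram from a newly added language is simply hashed into one of the existing buckets. Smaller values of `n_buckets` cause more collisions, which is the trade-off the abstract evaluates across hash sizes.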