Text mining meets community curation: a newly designed curation platform to improve author experience and participation at WormBase

Valerio Arnaboldi; Daniela Raciti; Kimberly Van Auken; Juancarlos N Chan; Hans-Michael Müller; Paul W Sternberg

Abstract

Biological knowledgebases rely on expert biocuration of the research literature to maintain up-to-date collections of data organized in machine-readable form. To enter information into knowledgebases, curators need to follow three steps: (i) identify papers containing relevant data, a process called triaging; (ii) recognize named entities; and (iii) extract and curate data in accordance with the underlying data models. WormBase (WB), the authoritative repository for research data on Caenorhabditis elegans and other nematodes, uses text mining (TM) to semi-automate its curation pipeline. In addition, WB engages its community, via an Author First Pass (AFP) system, to help recognize entities and classify data types in their recently published papers. In this paper, we present a new WB AFP system that combines TM and AFP into a single application to enhance community curation. The system employs string-searching algorithms and statistical methods (e.g. support vector machines (SVMs)) to extract biological entities and classify data types, and it presents the results to authors in a web form where they validate the extracted information, rather than enter it de novo as the previous form required. With this new system, we lessen the burden on authors while at the same time receiving valuable feedback on the performance of our TM tools. The new user interface also links out to specific structured data submission forms, e.g. for phenotype or expression pattern data, giving authors the opportunity to contribute more detailed curation that can be incorporated into WB with minimal curator review. Our approach is generalizable and could be applied to other knowledgebases that would like to engage their user communities in assisting with curation. In the five months following the launch of the new system, the response rate has been comparable with that of the previous AFP version, but the quality and quantity of the data received have greatly improved.
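As a rough illustration of the string-searching step described above, the following minimal sketch matches a small lexicon of entity names against paper text and keeps only names that occur often enough to be worth showing to an author for validation. It is not the WB implementation: the lexicon contents, the function name `extract_entities` and the `min_count` threshold are all assumptions made for this example, and the SVM-based data type classification is not covered.

```python
import re
from collections import Counter

# Hypothetical lexicon (illustrative only): entity type -> known names.
LEXICON = {
    "gene": ["daf-16", "unc-54", "lin-12"],
    "strain": ["N2", "CB1370"],
}


def extract_entities(text, lexicon=LEXICON, min_count=2):
    """Return {entity_type: {name: count}} for names found at least min_count times.

    min_count is an assumed threshold, used here to suppress passing mentions.
    """
    found = {}
    for entity_type, names in lexicon.items():
        counts = Counter()
        for name in names:
            # Match the name only as a whole token, so "N2" does not match "N20"
            # and "daf-16" does not match inside a longer hyphenated name.
            pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
            counts[name] = len(re.findall(pattern, text))
        found[entity_type] = {n: c for n, c in counts.items() if c >= min_count}
    return found
```

Calling `extract_entities(paper_text)` yields a per-type dictionary of candidate entities that could pre-populate a validation form; an author would then confirm or remove each entry rather than typing it in from scratch.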

