iScience

Pergola: Boosting Visualization and Analysis of Longitudinal Data by Unlocking Genomic Analysis Tools

Introduction

Fine-grained longitudinal recordings are one of the fastest-growing collections of biological and societal data. Behavior is a prime target for these types of measurements since it constitutes a high-level phenotype that links genetics, development, neurobiology, evolution, environment, and social influences. Novel high-throughput platforms for monitoring behavior have recently enabled unprecedented amounts of data to be collected. However, these data present novel challenges, and their effective comparison and reproducible analysis across different systems and platforms will require improved interoperability. Here, we show how standards established for genomics can be repurposed to handle behavioral data in a fully interoperable fashion. This simple solution made it possible to analyze behavioral neuroscience datasets gathered on three model systems using standard genomic formats and software tools.

High-throughput behavior-monitoring platforms can be classified into two categories: sensor-based and video-based systems. Sensor-based systems measure selected parameters such as vocalizations, activity, feeding, and oxygen consumption. For mice, such devices include PHECOMP, PhenoMaster, and CLAMS. Video-based systems use computer vision techniques to track movements. Optimized video setups have been developed for the main model organisms, including mouse, worm, fly (Robie et al., 2017), and zebrafish (Pérez-Escudero et al., 2014). Even though the recorded signals differ across systems, the outputs are comparable in the sense that they consist of discrete or continuous time series of quantitative or qualitative behavioral readouts. These time series are usually processed using commercial software shipped with the platform, often in combination with ad hoc implemented analyses.
Although this approach is entirely suitable for establishing key biological results, the lack of common standards for longitudinal data processing hampers the ability to share established computational analysis procedures and data (Wilkinson et al., 2016), thus potentially limiting comparisons and reproducibility across different systems. Improving interoperability is not, however, an easy task, because it requires the whole community to agree on common standards, formats, and procedures. In genomics, the establishment of such standards has taken about 15 years and work is still in progress (Field et al., 2011; Karsch-Mizrachi et al., 2018). In this work, we show how mature standards can be recycled across research fields to save development time. To the best of our knowledge, we document here the first case of a standard repurposing procedure between genomics and behavioral neuroscience.

The rationale for this work was our original observation that the storage of genomics data and the way these data are subsequently analyzed support all the requirements of longitudinal data, since both data types share a sequential structure. In a genomic sequence, nucleotides are ordered as a discrete series to which various layers of qualitative and quantitative annotation can be attached. This data structure, which is essentially a large table with a position in each line and various attributes in each column, can hold any sequentially organized information, including longitudinal recordings. By simply treating each nucleotide coordinate as a unit of time, we show that it is possible to use the most common genomic data formats for storing, visualizing, and analyzing behavioral information (including metadata) in a lossless fashion.
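The idea of treating each nucleotide coordinate as a unit of time can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Pergola's own code: the helper names `time_to_coord` and `coord_to_time`, the integer timestamps in seconds, and the reference time `t0` are all invented for the example.

```python
def time_to_coord(t, t0=0, resolution=1):
    """Map an integer timestamp to a 0-based coordinate (hypothetical helper).

    `t0` is the assumed start of the recording and `resolution` the assumed
    number of time units represented by one "nucleotide".
    """
    offset = t - t0
    if offset < 0 or offset % resolution != 0:
        # Refuse timestamps that would need rounding, so the mapping stays lossless.
        raise ValueError(f"timestamp {t} does not fall on the coordinate grid")
    return offset // resolution


def coord_to_time(coord, t0=0, resolution=1):
    """Inverse mapping: recover the original timestamp from a coordinate."""
    return t0 + coord * resolution


# Invented example: three timestamps (seconds) from a recording starting at t0 = 3600.
timestamps = [3600, 3630, 3725]
coords = [time_to_coord(t, t0=3600) for t in timestamps]
restored = [coord_to_time(c, t0=3600) for c in coords]
print(coords)                  # positions along the artificial "chromosome"
print(restored == timestamps)  # the round trip recovers the original times
```

Because the mapping is a plain integer shift and scale, the original times can be recovered exactly as long as every timestamp falls on the chosen grid, which is the sense in which such an encoding is lossless.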
What makes this data structure attractive for the handling of longitudinal information is not its specification but rather its ubiquitous use in genomics. In this field, the early adoption of strict data storage standards (Kent et al., 2002) has resulted in a huge number of tools built around formats agreed upon and supported by a community of thousands of laboratories around the world. Formats developed to define the position of genomic annotations therefore provide a perfect scaffold to store discrete time series of behavior. In a similar fashion, file formats implemented to represent scores along genomic sequences allow the storage of continuous time series of behavior (Figure 1). The available genomic tools (Afgan et al., 2018; Ernst and Kellis, 2012; Gentleman et al., 2004; Quinlan and Hall, 2010; Ramírez et al., 2016; Robinson and Thorvaldsdóttir, 2011; Zerbino et al., 2014) are also suitable for time series because they allow sophisticated multi-scale visualization, the genomic arithmetic operations required to conditionally extract and combine data portions, and complex post-processing techniques such as hidden Markov model (HMM) analysis.

Figure 1. Formatting of Common Behavioral Recordings into BED as an Example of the Pergola Approach to Format Data
(A) Typical tabulated (CSV, TSV, or XLSX) behavioral recording with three events (lines 2–4), each coded with an animal tag (id column), a start time (event_start column), an end time (event_end column), a typology (type_of_event column), and a quantitative value (value column).
(B) Once mapped onto the Pergola generic ontology, these column labels are translated into the BED file format, where the nucleotide relative index positions (columns 2 and 3) are used to specify the start and termination of the event, whose nature is coded in the attribute column (#4) and quantified in the quantification column (#6). The BED format also supports specific coloring for the considered event (RGB values in the last column). A single BED file will contain all the events of an individual animal ID. Pergola will generate a FASTA file named chromosome 1 (“chr1” in the file) that is used to map the time points onto the nucleotide positions.
(C) Once formatted this way, data can be displayed and processed using popular genomics tools such as the Integrative Genomics Viewer (IGV).
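The conversion outlined in Figure 1 (A) and (B) can be approximated with the Python standard library alone. The sketch below is not Pergola's implementation. The input rows, the color table, and the function name `records_to_bed` are invented for illustration. It also assumes the standard BED9 column order, in which the quantitative value sits in the score field (column 5) and a fixed "+" strand fills column 6; the exact columns Pergola writes may differ.

```python
import csv
import io
from collections import defaultdict

# Invented recording using the column labels from Figure 1A (not real data).
RAW = """id,event_start,event_end,type_of_event,value
1,0,30,eat,0.06
1,45,60,drink,0.02
2,10,25,eat,0.08
"""

# Hypothetical color choice per event type (itemRgb field of BED9).
COLORS = {"eat": "255,0,0", "drink": "0,0,255"}


def records_to_bed(handle, chrom="chr1"):
    """Group events by animal and format each one as a BED9 line.

    Assumes event_start/event_end are integer time units, with the start
    inclusive and the end exclusive, matching BED's 0-based half-open intervals.
    """
    tracks = defaultdict(list)
    for row in csv.DictReader(handle):
        start, end = int(row["event_start"]), int(row["event_end"])
        if end <= start:
            raise ValueError(f"empty or inverted interval: {row}")
        fields = [
            chrom,                 # 1: artificial chromosome standing in for the time axis
            str(start),            # 2: event start
            str(end),              # 3: event end
            row["type_of_event"],  # 4: nature of the event
            row["value"],          # 5: quantitative value (score field)
            "+",                   # 6: strand, meaningless here, fixed to "+"
            str(start),            # 7: thickStart
            str(end),              # 8: thickEnd
            COLORS.get(row["type_of_event"], "0,0,0"),  # 9: RGB color
        ]
        tracks[row["id"]].append("\t".join(fields))
    return dict(tracks)


bed_by_animal = records_to_bed(io.StringIO(RAW))
for animal_id, lines in bed_by_animal.items():
    print(f"# animal {animal_id}")  # one BED track (file) per animal ID
    print("\n".join(lines))
```

Grouping the lines by animal mirrors the caption's statement that a single BED file contains all the events of one individual. Writing each group to its own file would be the natural next step.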
