New Public Data Set: YRI Trio
The YRI Trio Public Data Set provides complete genome sequence data for three Yoruba individuals from Ibadan, Nigeria, which represent the first human genomes sequenced using Illumina's next-generation Sequencing-by-Synthesis technology.
This data represents some of the first individual human genomes to be sequenced and peer-reviewed (the full story is here). That article contains full information about this ground-breaking effort.
The data is described as “containing paired 35-base reads of over 30x average depth.” Basically, this means that the data contains a very large number of short sequences (reads) of 35 bases each, sequenced in pairs, and that each base in the genome is covered by more than 30 separate reads on average. I asked my colleague Deepak Singh for a better explanation and this is what he told me:
In order to get better assembly and data accuracy you determine the order of bases n times. With older sequencing technologies you collected longer reads and coverage was typically in the n=4-6 range. The sequencing process also took a very long time (several months) to collect sufficient data. Modern, or next generation, sequencing technologies yield shorter reads but you get results much faster (days to weeks) and at much lower cost, so you can repeat the experiment many times to get better coverage. Higher coverage depth gives you the ability to detect low frequency common variations (which is how we are differentiated from one another, and can be characteristic of certain diseases) and improved genome assemblies.
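To make the idea of coverage depth concrete, here is a minimal Python sketch of the usual back-of-the-envelope estimate: average depth is the total number of sequenced bases divided by the length of the genome. The function names are hypothetical, and the genome size of roughly 3.1 billion bases is an assumption used for illustration, not a figure taken from the data set description.

```python
import math

# Assumption for illustration only: an approximate human genome length in bases.
ASSUMED_GENOME_SIZE = 3_100_000_000


def average_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage depth: total sequenced bases divided by genome length."""
    return num_reads * read_length / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose combined length reaches the target depth."""
    return math.ceil(target_depth * genome_size / read_length)


if __name__ == "__main__":
    # How many 35-base reads would be needed for 30x average depth?
    n = reads_needed(30, 35, ASSUMED_GENOME_SIZE)
    print(f"reads needed for 30x: {n:,}")
    print(f"resulting depth: {average_depth(n, 35, ASSUMED_GENOME_SIZE):.2f}x")
```

Under these assumptions, 30x depth with 35-base reads works out to a few billion reads, which gives a sense of why short-read data sets are so large. Note that this is an average: some positions will be covered by more reads and some by fewer.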
Suggested uses for this data include:
- The development of alignment algorithms.
- The development of de novo assembly algorithms.
- The development of algorithms that define genetic regions of interest, sequence motifs, structural variants, copy number variations, and site-specific polymorphisms.
- Testing the viability of annotation engines that start with raw sequence data.