Presentation

Student Technical Paper, Technical Paper: Searching the Sequence Read Archive using Jetstream and Wrangler
Event Type
Student Technical Paper
Technical Paper
Application Tags
Applications
HPC Applications
Time
Wednesday, July 25, 11am - 11:15am
Description
The Sequence Read Archive (SRA), the world’s largest database of sequences, hosts ~10 petabases (10^16 bp) of sequence data and is growing at the alarming rate of 10 TB of sequence per day. The data are barely curated, and the archive relies on depositors to describe the contents of the sequences they upload. This rich trove of data is inaccessible to most researchers: searching through the SRA requires large storage and computing facilities that are beyond the capacity of most laboratories. Enabling scientists to analyze existing sequence data will provide insight into ecology, medicine, and industrial applications. For example, we searched the SRA to identify a phage present in about half of human gut microbiomes. The SRA encompasses all kinds of sequencing data, including cancer data sets, eukaryotic and microbial genomes, and environmental metagenomes. In this proposal we focus specifically on metagenomic sequences (whole community data sets from different environments), which match the expertise of the team we have assembled.
We are developing a set of tools to enable biologists to mine the metagenomes in the SRA using the NSF-funded cloud computing resources, Jetstream and Wrangler. We have developed a proof-of-principle pipeline to demonstrate the feasibility of the approach. We are leveraging our existing infrastructure to enable all scientists to access the SRA metagenomes regardless of their computational ability. We are creating a stable pipeline with a science gateway front-end that is accessible to all researchers.