Building Genomics Pipelines with AWS Lambda and Apache Spark
Lynn Langit shares lessons learned and cloud data pipeline patterns via examples from work she’s doing with CSIRO Bioinformatics Australia. The team there, led by Dr. Denis Bauer, is analyzing a number of large genomic datasets.
First, Lynn examines real-time analysis with cloud-based solutions. Keeping runtime constant can be challenging for problems that vary in complexity, such as genome engineering. The CSIRO GT-Scan2 tool handles this by instantaneously recruiting additional Lambda functions as complexity increases. It was built on AWS services using a serverless microservices pattern.
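The fan-out idea can be illustrated with a minimal sketch. This is not GT-Scan2's code: the `SITES_PER_WORKER` constant, the helper names, and the toy GC-content score are all assumptions made for illustration, and a local thread pool stands in for separate Lambda invocations. The point is only that the worker count grows with the input size, so each worker's share (and hence the wall-clock time) stays roughly constant.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tuning constant: how many candidate sequences one worker
# can score within the target runtime. Not taken from GT-Scan2.
SITES_PER_WORKER = 100


def workers_needed(num_sites, sites_per_worker=SITES_PER_WORKER):
    """Number of workers so that each handles at most sites_per_worker items."""
    return max(1, math.ceil(num_sites / sites_per_worker))


def split(items, parts):
    """Split items into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, math.ceil(len(items) / parts))
    return [items[i:i + size] for i in range(0, len(items), size)]


def score_chunk(chunk):
    """Stand-in for one function invocation: a toy GC-content score.

    Assumes every sequence is a non-empty string.
    """
    return [(seq, (seq.count("G") + seq.count("C")) / len(seq)) for seq in chunk]


def fan_out(sites):
    """Score all sites, using more workers as the input grows."""
    n = workers_needed(len(sites))
    chunks = split(sites, n)
    # Threads simulate concurrent invocations; the cap only protects the demo.
    with ThreadPoolExecutor(max_workers=min(n, 32)) as pool:
        results = pool.map(score_chunk, chunks)  # map preserves chunk order
    return [item for chunk in results for item in chunk]
```

In a real serverless deployment each chunk would presumably go to its own function invocation rather than a thread, with the results gathered by a coordinating step.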
Next, Lynn demos a Jupyter notebook that shows how genomic research can use Apache Spark to massively parallelize the generation of random forests and identify disease genes efficiently. She discusses the pipeline’s use of VariantSpark, an open source library written by the team at CSIRO.
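Random forests parallelize well because each tree is trained independently on its own bootstrap sample. The sketch below shows that property on a single machine. It is not VariantSpark's API or algorithm: it swaps full trees for one-split decision stumps, uses a thread pool where Spark would distribute work across a cluster, and counts how often each feature wins a split as a crude importance measure. The data set and every function name are made up for illustration.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_feature(rows, labels, features):
    """Pick the candidate feature whose split (value == 0 vs. != 0)
    gives the lowest weighted impurity."""
    def split_impurity(f):
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] != 0]
        return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return min(features, key=split_impurity)


def train_stump(job):
    """Train one stump on a bootstrap sample with a random feature subset."""
    rows, labels, seed, n_candidates = job
    rng = random.Random(seed)
    idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
    features = rng.sample(range(len(rows[0])), n_candidates)
    return best_feature([rows[i] for i in idx], [labels[i] for i in idx], features)


def feature_votes(rows, labels, n_trees=100, n_candidates=3):
    """Train stumps concurrently and count how often each feature is chosen."""
    jobs = [(rows, labels, seed, n_candidates) for seed in range(n_trees)]
    with ThreadPoolExecutor() as pool:
        return Counter(pool.map(train_stump, jobs))


# Made-up genotypes (0/1/2 per variant); only feature 1 tracks the label.
rows = [
    [0, 0, 1, 2], [1, 0, 0, 1], [2, 0, 1, 0],
    [0, 0, 0, 0], [1, 0, 2, 1], [0, 0, 0, 2],
    [0, 1, 1, 2], [1, 2, 0, 1], [2, 1, 1, 0],
    [0, 1, 0, 0], [1, 2, 2, 1], [0, 1, 0, 2],
]
labels = [0] * 6 + [1] * 6
votes = feature_votes(rows, labels)
```

Here `votes` maps each feature index to the number of stumps that selected it, and the feature that tracks the label should collect the most votes. Because no stump depends on another, the same loop could in principle be spread over as many workers as are available.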
VariantSpark can analyze 3,000 samples with 80 million features in under 30 minutes. This pipeline enables real-time diagnosis by finding similar patients. This platform is contributing to motor neuron disease research (publicized by the Ice Bucket Challenge) in Australia.
This talk is for data analysts, software engineers, architects, and anyone with an interest in cloud-based solutions, Lambda functions, or genome research.