Reproducible Data Analysis Pipelines for Precision Medicine

Fjukstad B, Dumeaux V, Hallett M, Bongo L 2018. BioRxiv

Abstract

Precision medicine brings the promise of more precise diagnosis and individualized therapeutic strategies from analyzing a cancer’s genomic signature. Technologies such as high-throughput sequencing enable cheaper data collection at higher speed, but rely on modern data analysis platforms to extract knowledge from these high dimensional datasets. Since this is a rapidly advancing field, new diagnoses and therapies often require tailoring of the analysis. These pipelines are therefore developed iteratively, continuously modifying analysis parameters before arriving at the final results. To enable reproducible results it is important to record all these modifications and decisions made during the analysis process. We built a system, walrus, to support reproducible analyses for iteratively developed analysis pipelines. The approach is based on our experiences developing and using deep analysis pipelines to provide insights and recommendations for treatment in an actual breast cancer case. We designed walrus for the single servers or small compute clusters typically available for novel treatments in the clinical setting. walrus leverages software containers to provide reproducible execution environments, and integrates with modern version control systems to capture provenance of data and pipeline parameters. We have used walrus to analyze a patient’s primary tumor and adjacent normal tissue, including subsequent metastatic lesions. Although we have used walrus for specialized analyses of whole-exome sequencing datasets, it is a general data analysis tool that can be applied in a variety of scientific disciplines. We have open sourced walrus along with example data analysis pipelines at github.com/uit-bdps/walrus.