140.636.01
Scalable Computational Bioinformatics

Location
East Baltimore
Term
1st Term
Department
Biostatistics
Credit(s)
4
Academic Year
2024 - 2025
Instruction Method
In-person
Class Time(s)
M, W, F, 1:30 - 2:20pm
Auditors Allowed
Yes, with instructor consent
Available to Undergraduate
No
Grading Restriction
Letter Grade or Pass/Fail
Course Instructor(s)
Contact Name
Frequency Schedule
Every Year
Prerequisite

Students are recommended to have prior programming experience in at least one language and to know coding basics such as iteration, recursion, arrays, and matrices. Knowledge of Python is recommended but not required.

Description
As genomic cohorts continue to grow in size and complexity, many organizations are turning to cloud and high-performance computing (HPC) environments to alleviate the computational load. While cloud computing promises elasticity and scalability, traditional bioinformatics tools are designed for single-system, single-node architectures and cannot effectively leverage cloud computing environments at scale and speed. As a result, bioinformatics researchers spend much of their time wrangling data, crafting complex algorithmic workarounds, and building single-task pipelines that are slow to run and difficult to optimize. Researchers have therefore turned to scalable systems like Apache Spark.
Discusses a distributed programming paradigm, high-level APIs, and scalable analytics platforms that simplify implementing algorithms for analyzing large genomic datasets. Discusses tools built on Apache Spark that enable students to scale to thousands of cores, achieving the scale and speed necessary for processing genomics data. Discusses how to solve these problems by bridging bioinformatics, data science, machine learning, and the big data ecosystem. Enables students to combine the statistical methods of bioinformaticians and computational biologists with the best practices used by data engineers and data scientists across industry.
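To make the paradigm concrete, here is a minimal, illustrative PySpark sketch (not course material; the application name, column names, and data are hypothetical) showing how the DataFrame API declares a computation that Spark then plans and parallelizes:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hypothetical example; names and values are illustrative only.
    spark = SparkSession.builder.appName("scalable-bioinformatics-demo").getOrCreate()

    # A tiny variant-summary table; real workloads would load millions of rows.
    df = spark.createDataFrame(
        [("chr1", 12345, 0.12), ("chr1", 67890, 0.45), ("chr2", 13579, 0.03)],
        ["contig", "position", "allele_frequency"],
    )

    # Transformations are lazy: Spark builds an execution plan and distributes
    # the work across however many cores (or cluster nodes) are available.
    common = df.filter(F.col("allele_frequency") > 0.05)
    common.groupBy("contig").count().show()

    spark.stop()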
Learning Objectives
Upon successfully completing this course, students will be able to:
  1. Develop just enough experience with Python to begin using the Apache Spark programming APIs, including Spark SQL, SparkR, and PySpark
  2. Develop experience with Jupyter Notebook, AnVIL and Terra
  3. Describe the Apache Spark architecture, the DataFrame API, and SparkR, covering the fundamentals of the framework
  4. Describe the process of tuning Spark applications, applying best practices and avoiding common pitfalls in Spark application development
  5. Acquire knowledge of fundamental concepts in machine learning, such as linear regression, logistic regression, cross-validation, and random forests (a cross-validation sketch follows this list)
  6. Develop code to aggregate genetic variants using GATK's GenotypeGVCFs implemented on Apache Spark, and to extract, transform, and load (ETL) genomic variant data into Spark DataFrames, enabling seamless manipulation, filtering, quality control, and conversion between file formats (an ETL sketch follows this list)
  7. Use machine learning fundamentals and data science techniques to analyze healthcare and genetics/genomics datasets, and describe deep learning and how to scale it with Apache Spark
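For objective 5, the sketch below is one plausible illustration using Spark MLlib (the course may use other tooling); the data, feature names, and grid values are synthetic and hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    spark = SparkSession.builder.appName("mllib-cv-demo").getOrCreate()

    # Synthetic data: two illustrative features and a binary label.
    df = spark.createDataFrame(
        [(0.1, 1.2, 0.0), (0.9, 0.3, 1.0), (0.4, 0.8, 0.0), (1.1, 0.2, 1.0)] * 10,
        ["f1", "f2", "label"],
    )

    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, lr])

    # Cross-validate over a small (illustrative) regularization grid.
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=grid,
        evaluator=BinaryClassificationEvaluator(labelCol="label"),
        numFolds=3,
    )
    model = cv.fit(df)
    print(model.avgMetrics)  # mean area under ROC for each grid point

    spark.stop()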
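For objective 6, a minimal ETL sketch under stated assumptions: the file path, column names, and quality threshold are hypothetical, and a real VCF would typically be read with a dedicated Spark reader (for example, the open-source Glow library) rather than as delimited text:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("variant-etl-demo").getOrCreate()

    # Extract: read a (hypothetical) pre-flattened, tab-delimited variant table.
    variants = (
        spark.read
        .option("sep", "\t")
        .option("header", True)
        .option("inferSchema", True)
        .csv("variants.tsv")
    )

    # Transform: a simple quality-control filter keeping PASS calls
    # above an illustrative quality threshold.
    qc = variants.filter((F.col("FILTER") == "PASS") & (F.col("QUAL") >= 30))

    # Load: convert between file formats by writing the filtered set as
    # Parquet, a columnar format Spark reads back efficiently at scale.
    qc.write.mode("overwrite").parquet("variants_qc.parquet")

    spark.stop()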
Methods of Assessment
This course is evaluated as follows:
  • 100% Assignments