Introduction to Biomedical Data Science
and Health Informatics

June 8th, 2020 - June 12th, 2020



Join us for an introduction to basic biomedical data science knowledge and health informatics skills. This course is aimed at beginners in informatics, and no previous experience is required. However, if programming is completely new to you, we encourage you to check out the introductory lecture for Harvard's CS50 course: Computational Thinking.

Most lectures are available in advance, although Thursday's lecture and Friday's lecture will only be available as a live-stream on the respective mornings. Morning office hours (9am to 12pm) will be available via Zoom to answer questions about the lectures. Afternoons (1pm to 5pm) will begin with a brief Q&A, followed by hands-on exercises via Zoom.


We will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates, the TAs, and ourselves. Rather than emailing questions to the teaching staff, post your questions on Piazza, where we will conduct all class-related discussion. The sooner you begin asking questions on Piazza, the sooner you'll benefit from the collective knowledge of your classmates and instructors. We encourage you to ask questions when you're struggling to understand a concept; you can even do so anonymously. You can find our class signup link here:

If you have problems signing up for Piazza (e.g. if you only have a email), please write to the course instructors, and we can add you to the course manually.

Communication in Piazza is facilitated by notes. Notes are organized into folders, with each folder corresponding to a content area for the course. Please browse through the questions that have already been asked before asking a question, as someone else may have already asked your question!

Posts should be phrased as questions, each asking the class about one specific issue. If you know the answer to someone else's question, feel free to answer it and collaborate with your fellow students. Instructors will also be monitoring the Q&A to assist during the course.

Class Structure

What is a "5x5" course?

This course is structured as a 5x5: it runs over five days, with approximately five hours of instruction per day. The American Medical Informatics Association offers continuing education in the form of 10x10 courses, which are seen as equivalent to a 3-credit-hour, semester-long course. As such, this course covers approximately half as much in both breadth and depth.

Online Delivery

We are using what is known as an inverted classroom structure, in which class time is spent on the tasks that would traditionally take place outside the classroom, and vice versa. Class time is therefore spent reviewing problem sets and expanding on concepts with input from students, rather than on the didactic relaying of information. That information transfer moves to the students' non-instructional time, through pre-recorded lectures, exercises, or reading material. This shift in structure requires a different investment of effort from the student, and what is expected of the student shifts accordingly:

  1. Reviewing the material provided for each lecture, be it readings, exercises, or pre-recorded lectures, is crucial for functioning in an inverted classroom.
  2. Classroom time is predominantly an exercise in information integration: following the thread of discussion and relating it to the student's own skill level. Active learning is a must.
  3. As such, reflecting on the material, conversing with instructors, and making sure skill development keeps pace with the content delivered are key to successful learning.

Course Materials

The data for the course can be found in this Google Drive folder. To add it to your own Google Drive, click on the course folder at the top next to the "Shared with me" header, select "Add shortcut to drive" from the dropdown and then create the shortcut by selecting "My Drive" (or your subfolder of choice).

Solutions for each set of exercises will be posted in the evenings after each class.

Monday: An Introduction to Python for Data Science
Basic calculations, variables, data types Lecture (20m 2s) Colab notebook Exercises Solutions
Functions, Methods, f-strings Lecture (24m 12s) Colab notebook Exercises Solutions
Looping (for loops) and making choices (if statements) Lecture (30m 23s) Colab notebook Exercises Solutions
Loading and using libraries (modules) Lecture (9m 34s) Slides Exercises Solutions
Loading and manipulating data with pandas Lecture (36m 37s) Slides Exercises Solutions
Visualizing data with ggplot Lecture (15m 3s) Slides Exercises Solutions
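
For a rough sense of what Monday's first few topics look like in practice, here is a minimal sketch that combines variables, a function, a for loop, an if statement, a library import, and an f-string. It is not taken from the course notebooks: the values, the names (heart_rates, patient_label, classify_heart_rate), and the thresholds are hypothetical and chosen only for illustration, not as clinical guidance.

```python
import statistics  # loading a library (module) from the standard library

# Variables and basic data types (hypothetical values, not course data)
heart_rates = [62, 75, 88, 101, 54]  # a list of integers
patient_label = "example-patient"    # a string


def classify_heart_rate(bpm):
    """Return a coarse label for a heart rate given in beats per minute.

    The cut-offs are illustrative assumptions only.
    """
    if bpm < 60:
        return "low"
    elif bpm <= 100:
        return "typical"
    else:
        return "high"


# A for loop that applies the function to every value in the list
labels = []
for bpm in heart_rates:
    labels.append(classify_heart_rate(bpm))

# Calling a function provided by the imported module
mean_bpm = statistics.mean(heart_rates)

# An f-string formats values into readable text
summary = f"{patient_label}: mean {mean_bpm:.1f} bpm, labels {labels}"
print(summary)
```

The same pattern (load values, apply a function to each one, summarize) carries over to tabular data once a library such as pandas is introduced.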
Tuesday: Data Management and Databases

You can download a transcript for the recorded presentation by clicking on this next link.

Wednesday: Data Cleaning and Data Visualization
Lectures and Slides
Exploratory Data Analysis Lecture (41m 42s) Slides    
Applied Visualization Lecture (15m 4s) Slides    
The Grammar of Graphics Lecture (6m 41s) Slides    
Data Cleaning I: Overview Lecture (6m 4s) Slides    
Data Cleaning II: Common challenges Lecture (28m 17s) Slides    
Data Cleaning III: Missing variables I: What does it mean Lecture (7m 27s) Slides    
Data Cleaning IV: Missing variables II: Why is it missing and what can be done Lecture (16m 10s) Slides    
Thursday: Machine Learning and Bioinformatics

Note: lectures on Thursday are live via Zoom at 9am EDT and not pre-recorded. (recording)

Machine Learning   Slides    
Bioinformatics   Slides    
Friday: Natural Language Processing and Putting It All Together

Note: lectures on Friday are live via Zoom at 9am EDT and not pre-recorded.

Natural Language Processing Colab notebook Slides Exercises Solutions
Putting It All Together   Slides Exercises Solutions