
Physician Data Scientist Part II. The Why.

A reader of my blog (thanks, Al) recently reminded me that I never posted the promised second part of my 7.7.2015 post, “Physician Data Scientists – Why and What Type? Part I”.  Now that I am between classes, I have the time to work on it.  Looking back at that original post, I am somewhat amazed at all that has happened in the last 1 1/2 years.

I am currently the interim Chief Information Officer (CIO) and Chief Medical Information Officer (CMIO) for our integrated healthcare system.  I stepped into the interim CIO role after the departure of our previous CIO last year, helped in part by my coursework in Northwestern University’s Master of Science in Predictive Analytics (MSPA) program.  Prior to that I had been one of our system’s CMIOs, communicating to IT the need for technology that helps improve clinical outcomes, while communicating back to physicians and leadership the limitations of current technologies.  I never really aspired to become either the interim CIO or a CMIO.  These opportunities simply arose from my journey to become better educated about using data and analytics to improve clinical outcomes, i.e., to become a Physician Data Scientist.  Here is how I ended up in my current role.

My interest in data and analytics is fairly recent, and it came from a chance meeting with someone who has since become one of my closest friends, Curt Lindberg, who has a PhD in Complexity Science and is the Director of our Complexity in Healthcare Center.  I met him during a project to make our process for getting patients into our healthcare system from outside facilities more efficient.  At that time I was a practicing Emergency Physician and the Medical Director of our MedFlight Air Ambulance service.  Curt introduced me to complexity science and my life has not been the same; it was a transformational career moment for me.  I ended up being part of a small group of researchers who were trying to develop smarter patient monitoring systems.  Their work has inspired me to try to contribute in my own way to this field, called predictive monitoring.

Predictive monitoring is an unofficial term for what this group is trying to accomplish.  While the technology inside patient monitors has changed drastically since the 1970s, what the monitors do has not.  They display certain physiologic markers of interest: blood pressure, pulse rate, temperature, oxygen level, EKG pattern, etc.  You can see what is happening to the patient right now, or you can go back and review (minimally) what happened to them in the past, but the monitors say nothing about what will happen in the future.  Is the patient predicted to get better, go into sudden cardiac arrest, stop breathing, or develop an overwhelming infection called sepsis?  The goal is to incorporate predictive algorithms into these monitoring systems.

I have been fortunate to meet some giants in this field.  Dr. J. Randall Moorman of the University of Virginia developed the first commercial predictive monitoring system, the HeRO monitor.  The largest randomized clinical trial ever conducted in neonatal patients (premature babies) used this monitor.  It showed that the monitor could identify certain physiological patterns and translate them into a risk of developing an overwhelming infection (late-onset neonatal sepsis).  This risk was detected an average of 18 hours before a clinical diagnosis was made, allowing for earlier treatment and intervention, which translated into a 22% reduction in mortality.  Dr. Andrew Seely, a Thoracic Surgeon at the University of Ottawa, has developed a model to predict whether a breathing tube can be removed from a patient successfully, without having to replace it because the patient was not ready.  We got to participate in that clinical trial.  We also got to participate in a trial conducted by Ryan Arnold, now at Christiana Care in Newark, Delaware, on predicting clinical outcomes using heart rate variability analyses.

In addition to collaborating with these researchers on their projects, I became especially fascinated with a research article written by one of the country’s leading trauma surgeons, Dr. Mitchell Cohen, and his colleagues at San Francisco General Hospital and the University of California San Francisco: Identification of complex metabolic states in critically injured patients using bioinformatic cluster analysis.  I will confess that I felt frustrated when I talked with the researchers about the underlying mathematical concepts and analytical techniques they were using, because I just did not understand them well.  This ignorance ignited what I will freely admit is now an obsession to understand these concepts and techniques.

I started off trying to educate myself by reading textbooks, taking MOOCs (Massive Open Online Courses), and enrolling in other courses offered on the web.  I still felt very frustrated because these courses didn’t go into the depth that I thought I needed.  When I looked at the giants in this field of predictive analytics, these few researchers seemed to have both the clinical knowledge to understand why this research was so important and the command of the mathematical and analytical concepts and techniques needed to do it.  I wanted to be like them.

I became very interested in becoming a data scientist at that point.  I eventually enrolled in Northwestern University’s Master of Science in Predictive Analytics (MSPA) program.  I have not regretted this decision.  I currently am halfway through the program, and am finally into the especially relevant coursework.  I just finished the major foundational course – Linear Regression and Multivariate Analysis.  The courses up until then had been preparing me to take this course.   I realized I had come full circle when I re-read Mitchell Cohen’s article, and realized that I now finally understood the concepts and results.  That was an extremely satisfying moment for me.

This has been quite the educational journey for me.   I feel like I have a much better understanding of statistics. I am getting somewhat competent in a few programming languages – R, Python, and SAS.  I am using Jupyter Notebooks for my programming work.   I have dabbled with data science platforms like KNIME, and this quarter will be learning to use virtual machines, IBM Watson Analytics, ANGOSS, and Microsoft Azure machine learning – as part of my next class on Generalized Linear Models.

I finally feel as if I am able to start applying what I have been learning for the last 1 1/2 years, and to start developing predictive models to improve clinical outcomes.  A few of my goals are to help our organization become more data driven, and to continue to work on developing predictive algorithms that could be incorporated into bedside monitoring systems, further improving the outcomes of patients.

This is my journey to date, from being a practicing Emergency Physician with no interest in data or analytics to where I am now, halfway through my Master’s program.  The real journey of applying what I have learned to real-world problems has just started, and it will pick up as I learn more.






Northwestern University MSPA Predict 410, Regression and Multivariate Analysis Course Review

This was the most demanding of Northwestern University’s MSPA (Master of Science in Predictive Analytics) courses I have taken so far, and also the most rewarding.   This course is the backbone of the predictive analytics program and foundational to becoming a predictive modeler.  The course covers Linear Regression (Simple Linear Regression and Multiple Linear Regression) and Multivariate Analysis (Principal Component Analysis, Factor Analysis, and Cluster Analysis).

This is an applied course, so you have to understand the mathematics, but you don’t have to do in-depth calculations using matrix algebra (it could be much, much worse!).  This is in keeping with the philosophy that the MSPA program is an applied program, preparing students to go out and start working in this field.

I took this course from Professor Syamala Srinivasan, Ph.D.  She is at the top of my list of Northwestern professors, who are already of very high quality.  I would highly recommend her if you are considering this class.  She has gone above and beyond with her textbook lectures, additional lectures on topics of interest, SAS tutorials, SAS demos for each week’s assignments, sync sessions, and responses to questions both on the discussion boards and by e-mail.  Her level of involvement in creating the course work and in teaching the class is phenomenal.  I can’t say enough good things about her.

The course structure is as follows.  Every week there is required reading from a variety of textbooks and articles from the library.  There are PowerPoint lectures with audio for each textbook chapter.  In addition there are usually several other special topic PowerPoint/audio lectures.  A recorded video session then goes over the assignment for the week, and the SAS code used for the assignment.  Participation in the week’s discussion board is mandatory and extremely helpful.  The assignments build upon each other and get more complex.  There are intermittent quizzes.  The final exam has two parts: a take-home exam and an online proctored exam.

Now for the particulars.  This course isn’t for those who are already short on time.  I would not recommend taking a second course with this one, unless you have a lot of spare time.  I spent a good 20-30 hours per week on this course, and wished I had more time to devote to it.  I read almost every mandatory and optional reading assignment, so you could cut corners and devote less time, but I would worry about not learning this foundational material.


Regression Textbooks Required

  1. Montgomery, D.C., Peck, E.A., and Vining, G.G. (2012). Introduction to Linear Regression Analysis. (5th Edition). New York, NY: Wiley [ISBN-13: 978-0470542811]

This textbook is the main one used throughout the course.  It has sections that are difficult to get through, but the foundational material is there.

  2. Everitt, B. (2009). Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences. Boca Raton, FL: CRC Press [ISBN-13: 978-1439807699]

This is a good supplement to the Montgomery textbook, with coding examples – both in the book and on-line – using R.

Regression Textbooks Optional

  1.  Pardoe, I. (2012). Applied Regression Modeling. (2nd Edition). New York, NY: Wiley [ISBN-13: 978-1118097281]

This was my favorite textbook, and is definitely more understandable and is written from an applied standpoint.

  2. Ryan, A. G., Montgomery, D.C., Peck, E.A., and Vining, G.G. (2013). Solutions Manual to Introduction to Linear Regression Analysis. New York, NY: Wiley [ISBN-13: 978-1118471463]

To be honest, I didn’t use this a lot.

  3. Sheather, S. (2009). A Modern Approach to Regression with R. Springer [ISBN-13: 978-1441918727]

I didn’t use this at all, but it will be handy when working through real-world problems using R later.


SAS Textbooks

  1. Cody, R. (2011). SAS Statistics By Example. Cary, NC: SAS Publishing. [ISBN-13: 978-1607648000]
  2. Delwiche, L., and Slaughter, S. (2012). The Little SAS Book: A Primer. (5th Edition). Cary, NC: SAS Publishing. [ISBN-13: 978-1612903439]

I used both of these books as references a fair amount.

In addition there were quite a few reference articles in the library.  Some of these were very good, some were very detailed.


This course uses SAS for all analysis and visualization.  You could use R, but the course is built around SAS.  I will say I came into the course with a bias against SAS.  That bias came mainly from ignorance, but also from the cost of a license and from the broader move away from closed systems like this toward open ones like Python and R (I am a huge Python proponent).  However, I have come to like SAS for how easy it was to learn, and how easy it makes data analysis and visualization.

It is imperative to start learning SAS before the course starts.  You will get an early e-mail and syllabus from Dr. Srinivasan listing what you need to study.  There are SAS tutorials and readings.  I also used the training that SAS itself offers: I completed the on-line SAS Programming 1: Essentials e-Course, which was very helpful.  There are also multiple additional free courses that you can take.

You can use SAS through the SSCC – Social Sciences Computing Cluster (no additional charge), through the web based SAS Studio (no additional charge), or you can purchase a license.  I exclusively used SAS Studio and had no problems.


The Learning Goals of the course are:

  • Develop statistical modeling as a three step process consisting of: (1) exploratory data analysis, (2) model identification, and (3) model validation.
  • Understand how to use automated variable selection as a tool for model identification and as a tool for exploratory data analysis in the presence of a large number of predictor variables or a set of unlabeled predictors.
  • Develop a working understanding of the conceptual (theoretical) foundations of linear regression, principal components analysis, factor analysis, and cluster analysis with the objective of being capable of applying these techniques appropriately and validating their results.
  • Develop a conceptual and practical understanding of the difference between statistical inference and predictive modeling and how it affects our choices and actions in the statistical modeling process.
  • Learn the basics of the SAS Data Step, data manipulation with SAS, and SAS procedures (PROCS) for fitting statistical models.
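
As a rough illustration of the three-step process in the first learning goal, here is a minimal Python sketch. It is not course material (the course itself uses SAS). The data values, the alternating train/test split, and the function names are all made up for illustration, and a one-predictor model stands in for real model identification.

```python
import statistics

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2]


def summarize(values):
    """Step 1 (exploratory data analysis): basic summary statistics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def fit_simple_ols(xs, ys):
    """Step 2 (model identification): least-squares fit of y = b0 + b1 * x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_squared_error(xs, ys, intercept, slope):
    """Step 3 (model validation): average squared prediction error."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xs, ys)) / len(xs)


# Hold out every other observation so the model is checked on data it has not seen.
train_x, train_y = x[0::2], y[0::2]
test_x, test_y = x[1::2], y[1::2]

print("x summary:", summarize(train_x))
print("y summary:", summarize(train_y))

intercept, slope = fit_simple_ols(train_x, train_y)
holdout_mse = mean_squared_error(test_x, test_y, intercept, slope)
print(f"intercept={intercept:.3f} slope={slope:.3f} holdout MSE={holdout_mse:.3f}")
```

Scoring the model on held-out data rather than on the data used to fit it is the part that connects to the inference versus prediction distinction in the fourth goal: a predictive model is judged by its error on new observations.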

Weekly Reading and Video assignments

Each week there are required textbook readings, optional textbook readings, course reserve readings, and lecture videos.  The weekly videos are PowerPoint presentations with audio, and go over the textbook readings for that week.  In addition, there are other lectures on special topics.

The special topics include:

Statistical Preliminaries and Notation

Statistical Assumptions for OLS Regression

Estimation and Inference for OLS Regression

Analysis of Variance and Related topics in OLS Regression

Hat Matrix Lecture

Statistical Inference vs Predictive Modeling in OLS Regression

Special Topic: Dummy Variables Hypothesis Testing

Special Topic Lecture (Degrees of Freedom)

Special Topic Lecture (Likelihood Function)

Special Topic Lecture (Mallows’ Cp)

Hypothesis Testing Multiple Linear Regression

Factor Analysis Example Lecture
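
For readers who have not met two of these topics before, here is a minimal Python sketch of the hat matrix diagonal (leverage) and Mallows' Cp. It uses the standard textbook formulas and is not taken from the lectures. The function names and the example numbers are hypothetical, and the leverage shortcut applies only to a regression with an intercept and a single predictor.

```python
def leverages_simple_regression(xs):
    """Diagonal of the hat matrix H = X (X'X)^-1 X' for the model y = b0 + b1 * x.

    With an intercept and one predictor, the diagonal has the closed form
    h_ii = 1/n + (x_i - mean_x)^2 / Sxx.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((v - mean_x) ** 2 for v in xs)
    return [1.0 / n + (v - mean_x) ** 2 / sxx for v in xs]


def mallows_cp(sse_subset, p_subset, mse_full, n):
    """Mallows' Cp = SSE_p / MSE_full - n + 2p.

    p_subset counts the parameters in the candidate model, including the
    intercept. mse_full is the residual mean square of the full model.
    """
    return sse_subset / mse_full - n + 2 * p_subset


# Hypothetical predictor values: the last point sits far from the others.
xs = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverages_simple_regression(xs)
print("leverages:", [round(v, 3) for v in h])
print("sum of leverages (equals the number of parameters):", round(sum(h), 6))

# Hypothetical fit summaries: n = 20 observations, full model with 5 parameters.
n_obs, sse_full, p_full = 20, 30.0, 5
mse_full = sse_full / (n_obs - p_full)
print("Cp of a 3-parameter subset:", mallows_cp(40.0, 3, mse_full, n_obs))
```

The leverages always sum to the number of parameters in the model, so one isolated x value can claim most of that total. A candidate model with Cp close to its own parameter count is usually read as having little bias relative to the full model.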


Sync Sessions

There are a total of 4 Sync sessions.  These are invaluable as Dr. Srinivasan reviews the recent material, but then puts it all into a larger context.


There are a total of 8 assignments.  Each combines using SAS for analysis and visualization with a written interpretation of the output.  The code is pretty much already written.  You will have to make a few modifications, but the assignments are designed to test your ability to perform regression and multivariate analysis with SAS, not your ability to write code from scratch.  This was a very nice feature.  Each week there is a SAS demo video lecture where the Professor runs through the code and the assignment, which is extremely helpful.

Here are the titles of the assignments:

Assignment 1: Getting to know your data.

Assignment 2: Regression model building.

Assignment 3: Data analysis and regression.

Assignment 4: Statistical inference in linear regression.

Assignment 5: Automated variable selection, multicollinearity, and predictive modeling.

Assignment 6: Principal components in predictive modeling.

Assignment 7: Factor analysis.

Assignment 8: Cluster analysis.
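
To give a feel for how Assignments 5 and 6 relate, here is a minimal Python sketch of the simplest possible case, two predictors. It is not the assignment code (which is SAS). The function names and data are hypothetical, and the closed forms used here hold only for two standardized predictors.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(r):
    """Variance inflation factor when the model has exactly two predictors.

    In general VIF_j = 1 / (1 - R_j^2); with two predictors R_j^2 is r^2.
    """
    return 1.0 / (1.0 - r * r)


def pca_variance_shares(r):
    """Share of variance carried by each principal component of two
    standardized variables. The correlation matrix [[1, r], [r, 1]] has
    eigenvalues 1 + r and 1 - r, returned here as proportions, largest first.
    """
    eigenvalues = sorted([1.0 + r, 1.0 - r], reverse=True)
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


# Hypothetical, strongly related predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]

r = pearson_r(x1, x2)
print(f"correlation: {r:.3f}")
print(f"VIF: {vif_two_predictors(r):.1f}")
print("variance share per component:", [round(s, 3) for s in pca_variance_shares(r)])
```

When the two predictors are highly correlated, the VIF is large and the first principal component carries almost all of the variance. That is the motivation for replacing collinear predictors with a smaller number of components.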


Discussion Boards

These were extremely active.  You have to answer 3 questions posed by Professor Srinivasan, and then actively engage in the discussions around what other people posted.  The questions were relevant, and the back-and-forth that followed enhanced the learning.

Follow up by Dr. Srinivasan

Each week she would send out several e-mails – on how the assignments went, to clarify issues presented in the discussion boards, and to follow up on quizzes.  These were very helpful.

Quizzes and Tests

There were a total of 5 open-book quizzes.  These were very doable, but somewhat demanding.

The final examination had 2 parts: a take-home exam (1 hour) and a proctored 2-hour exam.  These were challenging but doable.

Final Thoughts

This has been the highlight of the MSPA courses so far, as this course is the foundation for building predictive models.  The other courses I have taken were a lead-up to this one.  Dr. Srinivasan has gone above and beyond and delivers a high-quality product.  My favorite course and Professor so far.