Becoming a Healthcare Data Scientist, Data Scientist, Healthcare Predictive Analytics, Northwestern University MSPA

Physician Data Scientist Part II. The Why.

I was recently reminded by a reader of my blog (thanks Al) that I had not followed up on my comment that I was going to post a second part to a blog post from 7.7.2015 – “Physician Data Scientists – Why and What Type? Part I”. Now that I am in between classes, I have the time to work on this. Looking back at this original post, I am somewhat amazed at all that has happened in the last 1 1/2 years.

I am currently the interim Chief Information Officer (CIO) and Chief Medical Information Officer (CMIO) for our integrated healthcare system. I stepped into the interim CIO role (helped in part by my coursework in Northwestern University’s Master of Science in Predictive Analytics (MSPA) program) after the departure of our previous CIO last year. Prior to that I had been one of our system’s CMIOs – communicating to IT the technology needed to help improve clinical outcomes, while communicating back to Physicians and Leadership the limitations of current technologies. I never really aspired to become either the interim CIO or a CMIO; these opportunities simply arose because of my journey to become better educated about the use of data and analytics to improve clinical outcomes – i.e., to become a Physician Data Scientist. I will explain how I ended up in my current role.

My interest in data and analytics is a fairly recent phenomenon, occurring because of a chance meeting with someone who has since become one of my closest friends – Curt Lindberg – who has a PhD in Complexity Science, and is the Director of our Complexity in Healthcare Center.  I met him during a project to improve our process for getting patients into our healthcare system from outside facilities more efficiently.  At that time I was a practicing Emergency Physician and the Medical Director of our MedFlight Air Ambulance service.  Curt introduced me to complexity science and my life has not been the same – it was a transformational career moment for me.  I ended up being part of a small group of researchers who were trying to develop smarter patient monitoring systems.  Their work has inspired me to try and contribute in my own way to this field – called predictive monitoring.

Predictive monitoring is an unofficial term for what this group is trying to accomplish. While the technology inside the monitors has changed drastically since the 1970s, what the monitors do has not. These monitors display certain physiologic markers of interest – blood pressure, pulse rate, temperature, oxygen level, EKG pattern, etc. You can see what is happening to the patient right at that time, or you can go back and review what happened to them in the past (minimally), but the monitors give no prediction of what will happen to them in the future (will they get better, go into sudden cardiac arrest, stop breathing, or develop an overwhelming infection called sepsis, etc.). The goal is to incorporate predictive algorithms into these monitoring systems.

I have been fortunate to meet some giants in this field. One is Dr. J. Randall Moorman from the University of Virginia, who developed the first commercial predictive monitoring system – the HeRO monitor. The largest ever randomized clinical trial in neonatal patients (premature babies) was conducted using this monitor. It showed that the monitor was able to identify certain physiological patterns, and translate those patterns into a risk for developing an overwhelming infection (late onset neonatal sepsis). This risk was detected an average of 18 hours before a clinical diagnosis was made, allowing for earlier treatments and interventions. This translated into a 22% reduction in mortality. Another is Dr. Andrew Seely, a Thoracic Surgeon at the University of Ottawa, who has developed a model to predict whether a breathing tube can be successfully removed from a patient without having to be replaced because the patient wasn’t ready to have it removed. We got to participate in that clinical trial. We also got to participate in a trial conducted by Ryan Arnold, now at Christiana Care in Newark, Delaware, on trying to predict clinical outcomes using heart rate variability analyses.
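To give a flavor of what a heart rate variability analysis involves, here is a minimal Python sketch of two common time-domain measures, SDNN and RMSSD, computed from the intervals between successive heartbeats (RR intervals). It is only an illustration: the function names and the RR interval values are made up for this example, and it is not the method used in any of the trials mentioned above.

```python
import math


def sdnn(rr_ms):
    """Sample standard deviation of the RR intervals, in milliseconds."""
    n = len(rr_ms)
    mean = sum(rr_ms) / n
    return math.sqrt(sum((x - mean) ** 2 for x in rr_ms) / (n - 1))


def rmssd(rr_ms):
    """Root mean square of the differences between successive RR intervals."""
    diffs = [later - earlier for earlier, later in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical RR intervals in milliseconds (not real patient data).
rr_intervals_ms = [800, 810, 790, 805, 795]

print("SDNN (ms):", round(sdnn(rr_intervals_ms), 2))
print("RMSSD (ms):", round(rmssd(rr_intervals_ms), 2))
```

A perfectly regular heartbeat would give zero for both measures. Whether low or high values matter for a given patient population is a clinical question that a sketch like this does not answer.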

In addition to collaborating with these researchers working on their projects, I became especially fascinated with a research article written by one of the country’s leading trauma surgeons, Dr. Mitchell Cohen, and his colleagues at San Francisco General Hospital and the University of California San Francisco – “Identification of complex metabolic states in critically injured patients using bioinformatic cluster analysis”. I will confess that I felt frustrated when I talked with the researchers about the underlying mathematical concepts and analytical techniques they were using, because I just did not understand them well. This ignorance ignited what I will freely admit is now an obsession to understand these concepts and techniques.

I started off trying to educate myself using textbooks, taking online MOOCs (Massive Open Online Courses), and enrolling in courses offered on the web. I still felt very frustrated because these courses didn’t really go into the depth that I thought I needed. When I looked at the giants in this field of predictive analytics, these few researchers seemed to have both the clinical knowledge and understanding of why this research was so important, and the ability to understand the mathematical and analytical concepts and techniques necessary to do research in this field. I wanted to be like them.

I became very interested in becoming a data scientist at that point.  I eventually enrolled in Northwestern University’s Master of Science in Predictive Analytics (MSPA) program.  I have not regretted this decision.  I currently am halfway through the program, and am finally into the especially relevant coursework.  I just finished the major foundational course – Linear Regression and Multivariate Analysis.  The courses up until then had been preparing me to take this course.   I realized I had come full circle when I re-read Mitchell Cohen’s article, and realized that I now finally understood the concepts and results.  That was an extremely satisfying moment for me.

This has been quite the educational journey for me.   I feel like I have a much better understanding of statistics. I am getting somewhat competent in a few programming languages – R, Python, and SAS.  I am using Jupyter Notebooks for my programming work.   I have dabbled with data science platforms like KNIME, and this quarter will be learning to use virtual machines, IBM Watson Analytics, ANGOSS, and Microsoft Azure machine learning – as part of my next class on Generalized Linear Models.

I finally feel as if I am able to start applying what I have been learning for the last 1 1/2 years – to start developing predictive models to improve clinical outcomes. A few of my goals are to help our organization become more data driven, and to continue to work on developing predictive algorithms that could be incorporated into bedside monitoring systems, further improving the outcomes of patients.

This is my journey to date, from being a practicing Emergency Physician with no interest in data or analytics to where I am now, halfway finished with my Master’s program. The real journey of applying what I have learned to real world problems has just started, but will get more robust as I learn more.

Data Science, Machine Learning

The world of machine learning algorithms – a summary infographic.

This is a very nice infographic that shows the basic categories of machine learning algorithms. It is somewhat informative to follow the path of how the infographic got posted on Twitter, where I saw it. That path was somewhat misleading (although not intentionally, I believe) about who actually created this infographic. To me this highlights the importance of making sure we are crediting our information sources correctly. This topic was also broached in the FiveThirtyEight article “Who Will Debunk The Debunkers” by Daniel Engber. The article discusses many myths, one of them being the myth about how spinach was credited with having too much iron content. It mentions that an unscholarly and unsourced article became “the ultimate authority for all the citations that followed”. I have run across this as well, when I was trying to find the source of a quotation defining a “Learning Health System”. This definition was cited by at least twenty scholarly articles, but there was no reference for the citation, only circular references to the other articles that used this definition. This highlights the importance of making sure we correctly cite the source of information, so it can be critically analyzed by other people interested in using the data.

I noticed this infographic after it had been tweeted by Evan Sinar (@EvanSinar). The tweet cited an article in @DataScienceCentral. That article, “12 Algorithms Every Data Scientist Should Know” by Emmanuelle Rieuf, mentions an article posted by Mark van Rijmenam with the same title, “12 Algorithms Every Data Scientist Should Know”, and then shows the infographic, giving the impression that this was the source of the infographic. That article mentions that the “guys from Think Big Data developed the infographic” and provided a link. That links to the article “Which are the best known machine learning algorithms? Infographic” by Anubhav Srivastava. It “mentioned over a dozen algorithms, segregated by their application intent, that should be in the repertoire of every data scientist”. The bottom line: try to be careful with your source citations so it is not hard for people to follow the source backwards in time. I was able to do this in this case; it just took a little while. But there are many times where it is impossible to do this.

Now, for the infographic.

[Infographic: 12 Algorithms Every Data Scientist Should Know]

Data Science, Data Scientist

Who is Doing What/Earning What in Data Science Infographic

Are  you confused yet about the different roles/titles that people can have in the data analytics industry?   I think this might help add to your confusion.  This is a very nicely done infographic by DataCamp (http://blog.datacamp.com/data-science-industry-infographic/).  It is presented for your viewing pleasure and consideration.   Where do you fit into this categorization?  And does your compensation match your title match your responsibilities match your usefulness to your organization?

[Infographic: DataCamp’s data science industry infographic]

Becoming a Healthcare Data Scientist

Physician Data Scientist – Why and What Type? Part I.

Why would a practicing Emergency Medicine Physician want to become a Data Scientist, and what type of Data Scientist could I become?

I will provide my answers to those two questions, starting with what type of Data Scientist in this post, followed by Why I want to become a Data Scientist in Part 2.

First – What kind of Data Scientist do I see myself becoming?

Types of Data Scientists

I am going to use the framework that Bill Voorhies referenced in his blog post “How to Become a Data Scientist” (http://data-magnum.com/how-to-become-a-data-scientist/). He used the framework developed by Harris, Murphy and Vaisman in their 2013 O’Reilly report “Analyzing the Analyzers. An Introspective Survey of Data Scientists and Their Work”, available for free at http://www.oreilly.com/data/free/analyzing-the-analyzers.csp. They describe 4 different subtypes of Data Scientists – Data Businesspeople, Data Creatives, Data Developers, and Data Researchers. Figure 3-3 shows the skill set strengths of each group. Below Figure 3-3 I will provide a synopsis of how they described each subtype.

[Figure 3-3 from “Analyzing the Analyzers”: skill set strengths of each Data Scientist subtype]

Data Businesspeople are most focused on the organization and how data projects yield profit.  They are leaders and entrepreneurs.  They have technical skills and work with real data.  They are the most likely group to have an MBA, and have an undergraduate Engineering degree.

Data Creatives are seen as the broadest of the Data Scientists, excelling at applying a wide range of tools and technologies to a problem, or creating innovative prototypes at hackathons, the quintessential Jack of All Trades.    They are seen as Artists.  They have substantial business experience.

Data Developers are focused on the technical problem of managing data – how to get it, store it, learn from it.  They are writing a lot of code, and have substantial computer science backgrounds.  They have more of the machine learning/big data skills than the other groups.

Data Researchers have a strong background in statistics, and have an academic background.

What type of Data Scientist do I see myself becoming?

I see myself fitting into two categories – a mix of the “Data Businessperson” and the “Data Creative” subtypes of data scientists. Although it will be easiest to become the Data Businessperson type, I have aspirations of becoming more of a Data Creative or Jack of All Trades type as well. I will discuss the different skill sets used in the analysis, where I see my current strengths, and where my future strengths need to be developed in order to achieve these goals.

In terms of business skills, I have a broad understanding of medicine in general, and emergency medicine in particular. I also understand the Prehospital Emergency Medical Services environment, having started my career as an EMT-Paramedic, and having served as a Medical Director for several EMS services. I am currently the Medical Director for our Air Ambulance service. In addition, as a Chief Medical Information Officer, I understand the IT needs of clinicians and health care workers, and the technical realities of what IT can deliver. I also serve as the Physician Liaison to our BI/Enterprise Analytics Division. I see my experience and knowledge as a subject matter expert for clinical medicine driving the kinds of research questions that our data science/data analytics team attempts to answer.

I already have a deep interest in developing predictive algorithms that could be incorporated into bedside monitoring technologies that would be used to predict future states and detect early clinical deterioration. This information could be used to guide triage decisions for clinicians: is the patient safe to be discharged home, or do they need to be admitted to the hospital? If they need to be admitted, do they need to be in the ICU, or is an unmonitored bed going to be ok? Is the patient predicted to recover uneventfully, or do they have a high probability of deterioration requiring high resource utilization and admission to the ICU? Does a patient at a small rural critical access hospital need to be transferred to a tertiary care facility that might be hundreds of miles away, taking them away from their family support network, and exposing them to the dangers of transfer and the costs of transfer (currently between $25,000 and $75,000), or can they be safely treated at their hometown facility? Will the Internet of Things help us to remotely monitor patients at home, or even in the hospital, to detect either improvement or deterioration before it is clinically apparent, thereby allowing earlier treatments and interventions and improving outcomes? These are some of the important unanswered questions in my mind.

I see programming, or hacking, as my weakest current skill, and I expect it to remain my weakest skill going forward. That is why I will never be a pure Data Creative type. I do want to get competent at more than a basic level, in order to be able to do some of the work myself, and hand off the really complicated code to a true programmer. I am currently working on learning Python, having finished the Codecademy course, and am almost finished with Zed A. Shaw’s “Learn Python the Hard Way”. I know some R as well, mainly for statistical analysis. Having said that, I am a novice coder at best.

I am extremely interested in machine learning and big data.  I would really like to become adept at analyzing big data because I see the potential of this approach in analyzing healthcare data.  This will be a big focus of mine.

I have a basic background in math and statistics, and am actually looking forward to relearning them. I think I will learn a tremendous amount now that I understand the importance of having this background. I am currently working my way through the textbook we will be using in the fall for the math for modelers course.

When you consider all of the factors, my largest skill set is my business or subject matter experience. I think this will allow me to be a better leader in choosing which analytics projects we pursue. Having a good background in what types of analyses are possible, and which types are good for which situations, will help me make better decisions, and understand the results. I am hopeful that I will then be able to translate the insights learned into understandable and actionable information that can be presented to the various stakeholders.

I am also hopeful that I can help drive the changes that are needed across the organization, based on the insight learned.  That is the basis for the “Learning Health System” concept.   A Learning Health System has to be able to capture important data, analyze it, gain insights, diffuse these insights, and rapidly change behavior incorporating these insights.  Our institution is currently trying to understand the meaning/basic concepts of a Learning Health System and put in place the framework and people necessary to achieve the goals of this system.  I hope to contribute to this in a meaningful manner.  There are also national initiatives on becoming Learning Health Systems.  The Learning Health Community (http://www.learninghealth.org/home/) is an excellent resource listing  core values, and some of the organizations also working on this goal.

In my next post, I will answer the question of Why I want to become a data scientist.

Becoming a Healthcare Data Scientist

My Current Baseline Data Scientist Skill Set

It will be interesting to compare my skill set once I finish the predictive analytics program to my current skill set.  I will outline my current skills so I can come back later and compare the two.

I will organize my skills using the format presented by Mitch Sanders in his blog article posted on 8.27.13 “Data Science – Capturing, Analyzing, and Presenting Data Skills”.  (http://datareality.blogspot.com/2013/08/data-scientist-core-skills.html).

1.  Capturing Data

Programming and Database skills:

I am weak in this area. I have used R a bit to do some statistical analysis in the past. I am currently learning Python as I write this. So far, I have found that Codecademy’s Python course is the best learning platform for me. My next favorite resource is Zed Shaw’s book, “Learn Python the Hard Way”. I really like his practical approach. “Introducing Python. Modern computing in simple packages” by Bill Lubanovic is also good, but a bit more advanced. Finally, the Visual Quickstart Guide “Python” by Toby Donaldson is a quick reference guide. Going past basic programming, my skills are near or below zero. I do not know how to use Hadoop, Java, SQL, Hive or Pig.

Business Domain Expertise and Knowledge

This is my strongest area of expertise. I started off in medicine in 1984 as a basic EMT, became an EMT-Paramedic, and then a Paramedic Educator. I finished medical school (University of Illinois College of Medicine in Peoria, Illinois) in 1994, and my Emergency Medicine Residency at Saint Francis Hospital in Peoria, Illinois in 1997. I have practiced academic and community based emergency medicine since then. I have been a medical director for both ground based EMS and for a flight program. I am also one of our health system’s Chief Medical Information Officers (CMIO), so I have had to learn the field of Healthcare Information Technology as well. In my current role I have a special interest in Business Intelligence and Analytics, including predictive analytics. My passion is for developing smarter systems that can provide information about a patient’s risk of developing certain diseases/conditions, risk of deterioration/death, early detection of sub-clinical illness, and information about a patient’s response to treatment and therapy. Hence my interest in predictive analytics.

Data Modeling, Warehouse, and Unstructured Data Skills.

I have minimal skills in this category.

2.  Analyzing Data

Math Skills.

I have basic math skills, but it has been a long time since I have had to do more than basic math, including calculus and linear algebra.  After I finish getting a basic foundation in Python, my next step is to refresh my knowledge of math/calculus/linear algebra before starting my “Math for Modelers” course this fall.

Statistical  and Analytical Skills

I do have a little better grasp of descriptive and inferential statistics.   But I will need to increase my knowledge of the advanced statistical techniques not commonly used in medicine today.  These would include predictive analytics, regression, multivariate analysis, linear models, time series analysis, machine learning, etc.

3.  Presenting Data

I am really excited to learn about and improve my data visualization skills. I am really pushing hard for our organization to move away from Excel and PowerPoint based presentations of data, to more relevant methods.

Storytelling Skills

I am a pretty good storyteller, but would like to improve my skills, especially in presenting the data and stories around the data. I would like to help people understand the insight created by the data analysis, and then help them move to operationalizing that insight and driving organizational change to improve patient outcomes.

In summary, my strongest skills are my love of data and analytics, my (obsessive) desire to become a data scientist, and my domain knowledge as it pertains to healthcare.  My other skills will have to be works in progress.

I would love to hear comments on what you think, and any recommendations/advice for students just starting this journey.

June 10, 2015