Big Data, Data Science, Healthcare Analytics

REMAP Clinical Trials – Combining the best of Randomized Clinical Trials and Big Data Analytics

I thought I would post some information on a new type of clinical trial that fuses the Randomized Clinical Trial (RCT) with big data analytics.  This is based on a discussion that occurred in my Northwestern University Master of Science in Predictive Analytics statistics class (PREDICT 401).  The discussion centered on understanding the importance of “correlation is not causation”.  (As an aside, if you want to see some hilarious examples of absurd correlations, go to tylervigen.com.)

I am hoping eventually to understand at a much deeper level how to go from establishing correlation to declaring causation.   This is a huge issue, not just in medicine, but across all disciplines.

The major method used to establish causation is the randomized clinical trial, or RCT.  In an RCT you attempt to control all of the variables so that you can isolate the variables of interest.  These trials are usually performed with a pre-existing hypothesis in mind, i.e., we think that A may cause a change in B, so we control for all of the other things that we think change B.  If changes in A then correspond to changes in B, A is not only correlated with B; there is a causal inference that A causes the changes in B.

There are many problems with RCTs, though.  They are very expensive and difficult to run; their findings are too broad (the average treatment effect is not representative of the benefit for any given individual); they exclude so many real-life situations that by the time the final study population is defined, it is no longer of much practical significance for real-life application; and there are long delays before the results of RCTs make it into clinical practice (Angus, 2015).

There is another way to look at data, which goes by various names including data mining.  Here you start with the data, and develop the hypothesis to be tested later, after seeing what the data show.  You perform an exploratory analysis on a data set, using advanced analytical methods, and see where correlations arise.  Once you see the correlations, you can start to determine whether they are spurious or possibly real, and whether there is a possibility that they could be causal.  At that point you could develop an RCT to study the issue and try to establish causation.
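To make that concrete, here is a minimal sketch of what such an exploratory pass might look like in Python with pandas.  The data set and column names are hypothetical; the point is simply that the correlations come first and the hypotheses come afterward.

```python
# Minimal sketch of an exploratory correlation pass on a hypothetical data set.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.normal(65, 12, 500),
    "systolic_bp": rng.normal(130, 20, 500),
    "length_of_stay": rng.gamma(2.0, 2.0, 500),
    "readmitted": rng.integers(0, 2, 500),
})

# Pairwise correlations -- a starting point for hypotheses, not causal claims.
corr = df.corr()
print(corr.round(2))

# Flag the strongest absolute correlations (excluding self-pairs) for follow-up.
pairs = corr.abs().unstack().sort_values(ascending=False)
print(pairs[pairs < 1.0].head(6))
```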

There is a new type of RCT being developed, called a REMAP trial.  This stands for Randomized, Embedded, Multi-factorial, Adaptive Platform trial.  You won’t find a lot about it in the literature yet, but I have attached a link to a podcast that describes it, and the citation below is from Dr. Derek Angus of the University of Pittsburgh, an investigator involved with these studies.

Basically, the trial combines the best of an RCT with big data analytics, using machine learning techniques to study these complex problems.  There is a study starting called REMAP Pneumonia, which is enrolling patients in Europe, Australia, and New Zealand.  This is a perpetually running platform for the study of interventions in patients with severe pneumonia who need admission to an Intensive Care Unit.  A randomizing algorithm assigns patients to one of 48 different treatment arms.  Yes, this study has 48 different questions to answer, rather than one.  The weightings of the randomization change over time as the platform “learns” which treatment arms are doing better or worse.  Arms that are showing improvement have their randomization weights increased, so more patients are allocated to them.  Once an arm reaches a certain pre-established threshold for effectiveness, that arm “graduates” and that treatment becomes standard therapy.
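The response-adaptive weighting can be pictured with a toy simulation.  The sketch below is not the actual REMAP algorithm (which is not described in detail here); it just illustrates the general idea using Thompson-sampling-style reweighting, with made-up arm response rates and only four arms instead of 48.

```python
# Toy illustration of response-adaptive randomization (NOT the actual REMAP
# algorithm): arms that accumulate better outcomes receive more patients.
import numpy as np

rng = np.random.default_rng(0)
true_response = [0.20, 0.25, 0.35, 0.30]   # hypothetical true success rates
n_arms = len(true_response)
successes = np.ones(n_arms)                # Beta(1, 1) priors for each arm
failures = np.ones(n_arms)

for patient in range(2000):
    # Thompson sampling: draw a plausible success rate for each arm, then
    # randomize the next patient to the arm with the highest draw.
    draws = rng.beta(successes, failures)
    arm = int(np.argmax(draws))
    outcome = rng.random() < true_response[arm]
    successes[arm] += outcome
    failures[arm] += 1 - outcome

patients_per_arm = successes + failures - 2
print("Share of patients per arm:", np.round(patients_per_arm / 2000, 2))
print("Estimated response rates:", np.round(successes / (successes + failures), 2))
```

Over time, most patients end up on the better-performing arms, while the platform keeps learning about the others.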

This is an exciting advancement in the field of healthcare analytics.  You can also read about the “Adaptive Trial Design” used in the I-SPY 2 trial, which studies emerging and promising new agents for the treatment of breast cancer.  Here is the link. (trial information link ).  The touted benefit of the adaptive trial design is that these trials “use patient outcomes to immediately inform treatment assignments for subsequent trial participants—I-SPY 2 can test new treatments in half the time, at a fraction of the cost and with significantly fewer participants.”

I think that once these techniques become more widely known, these types of trials will rapidly transform the face of healthcare research, and improve the capacity for healthcare organizations to become “Learning Health Systems”.

 

References

Angus D.  Fusing Randomized Trials With Big Data: The Key to Self-Learning Health Care Systems?  Journal of the American Medical Association (JAMA). 2015;314(8):767-768.

REMAP podcast link: http://www.sccm.org/Podcasts/SCCMPod306.mp3

Presentation by Dr. Angus link: http://iom.nationalacademies.org/~/media/Files/Activity%20Files/Research/DrugForum/2010-MAR-3/Session%201%20Angus.pdf

I-SPY 2 link: http://ispy2.org/

 

Data Science

Using Jupyter Notebooks to learn R, Python

I love using Jupyter Notebooks to learn R and Python.  I only wish I had discovered them when I first started learning Python.  The notebooks are a great way to take notes, run code, see the output of the code, and then visualize the output.  The notebooks can be organized by language – i.e., Python vs. R – and also by the course you are taking or the book you are working your way through.  You can then go back and view your notes and code for future reference.

Project Jupyter grew out of the IPython Project in 2014, and IPython notebooks are now Jupyter notebooks.  Jupyter Notebooks are described as “a web application for interactive data science and scientific computing”.  These notebooks support over 40 programming languages, and you can create notebooks with a Python kernel or an R kernel, among others.  They are great for learning programming languages, and several academic institutions are using them in their CS courses.  They are also great for “reproducibility” – the ability to reproduce the findings that other people report.  By publishing a notebook on GitHub, Dropbox, or Jupyter Notebook Viewer, others can see exactly what was performed and run the code themselves.

Here is how I use Jupyter Notebooks.  When I start a new course – whether an official course in my Northwestern University Master of Science in Predictive Analytics program, a web-based course like the ones I have been taking from DataCamp and Udemy, or a book that I am working my way through – I create a new Jupyter notebook.

You first launch Jupyter by typing “jupyter notebook” in your shell (I use Windows PowerShell).  This opens up a browser page, “Home”.

[Screenshot: the Jupyter “Home” page in the browser]

If I want to open up an existing notebook, I scroll down to the  notebook of interest and open it.  Here is a screen shot showing some of my notebooks.

[Screenshot: a list of my existing notebooks on the Home page]

If I want to start a new notebook, I go to the top, select “New”, and then either a Python or R notebook.  They come with the Python kernel installed (you go to IRkernel on GitHub to install the R kernel).  This opens up a new notebook.

[Screenshot: a newly created notebook]

You type commands or text into “cells” and can run the cells individually or all together.  The two most common cell types I use are “Markdown” and “Code”.  You do have to learn a few easy Markdown commands, for headers and the like.  The Markdown cells are used for taking notes and inserting text.  The Code cells are used to enter and run code.

[Screenshot: Markdown and Code cells in a notebook]

Once you have entered your code, you can run the cell in several ways.  The most convenient is to hit “Shift-Enter”, which runs the code in that cell and brings up a new blank cell.
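As an illustration, a typical Code cell might contain something like the snippet below (the contents are just an example I made up); pressing Shift-Enter prints the output directly beneath the cell.

```python
# Example contents of a single Code cell: after pressing Shift-Enter,
# the printed output appears immediately below the cell.
import numpy as np

x = np.linspace(0, 10, 11)   # 0, 1, ..., 10
y = x ** 2

print("x:", x)
print("y = x^2:", y)
print("mean of y:", y.mean())
```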

Notebooks are also great for creating and saving visualizations, since you can make minor changes and then compare plots.  Here are a few examples.

[Screenshots: example plots created in notebook cells]
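A minimal matplotlib sketch of that compare-the-plots workflow is below (the data are made up): change a parameter such as the number of bins, re-run the cell with Shift-Enter, and compare the resulting figures.

```python
# Minimal example of iterating on a plot in a notebook cell: tweak `bins`
# (or any other parameter), re-run the cell, and compare the output figures.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.normal(loc=100, scale=15, size=1000)   # made-up measurements

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, bins in zip(axes, [10, 40]):
    ax.hist(data, bins=bins, color="steelblue", edgecolor="white")
    ax.set_title(f"bins = {bins}")
    ax.set_xlabel("value")
    ax.set_ylabel("count")

plt.tight_layout()
plt.show()
```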

There are a few things that don’t run smoothly yet, like loading packages in R.  I have found the easiest way is to install the package using RStudio, and then use the library() command in Jupyter to load it into the notebook.  Alternatively, you could run the following command each time:

install.packages("package name", repos = c("https://rweb.crmda.ku.edu/cran/"))  # You can select your own CRAN mirror and insert it into the repos argument.

Overall, I love using Jupyter to take notes, run code while learning, and organize my learning so I can easily find it later.  I see its huge potential for sharing data and making results easy to reproduce.  Give it a try!

 

Data Science, Data Visualization

Data Science Skill Network Visualization

I came across this great visualization by Ferris Jumah (see the link to Ferris Jumah’s blog post) about the relationships between data science skills listed by “Data Scientists” on their LinkedIn profiles.

[Data science skill network visualization]  To view a higher-resolution image, go to: http://imgur.com/hoyFT4t

How many of these skills have you mastered?

Ferris’s conclusions highlight a few key themes:

1. Approach data with a mathematical mindset.
2. Use a common language to access, explore and model data.
3. Develop strong computer science and software engineering backgrounds.

 

Becoming a Healthcare Data Scientist, Northwestern University MSPA, Predictive Analytics

Interim Review of Northwestern University’s MSPA Math for Modelers course.

Predict 400, Math for Modelers Course, Northwestern University MSPA

I am going to summarize my experience to date with Northwestern University’s Master of Science in Predictive Analytics program. I am past the halfway point (week 7 of 9) of my first trimester in this program. I am enrolled in one course, Predict 400, Math for Modelers. This is being taught by Professor Philip Goldfeder.

I will first describe the outline of how the course works. This is an asynchronous learning experience, for the most part. We have had one live session with Prof. Goldfeder. The coursework is presented through the online platform called Canvas. There are three main components to the class, which I will describe in greater detail below. The first component is learning the actual math. The second is participating in discussions about questions posed each week by Prof. Goldfeder. The third is learning Python.

What I really love about this program is how it brings together the book work, homework, learning Python, and getting help for problems/questions, all in one place. I had been trying to do this informally on my own, and it was frustrating to try to learn math/machine learning/etc. from books or other online courses, learn Python a separate way, and then have difficulty getting my questions answered. It is 1000% easier when this is all rolled into one. Even though this is a lot more expensive than doing it on your own, to me it is worth every penny.

Professor Philip Goldfeder.  He is a great Professor for this course. He received great reviews in the CTEC (Course Teacher and Evaluation Council, these are visible when signing up for classes) and I see why. Not only is he extremely knowledgeable, he is also very engaged with the students and seems genuinely interested in making sure we learn and understand the material. He is also great at challenging the students to think of ways to apply the concepts learned to real world examples. I highly recommend him.

Canvas platform. This is where you go to do everything. It has sections for Announcements, Syllabus, Modules (which describe each week’s assignments and are where you download materials), Grades, People (a section where everyone gets to describe themselves and you get to know your classmates), and Discussions.

The math. The introductory course, Math for Modelers, is designed to be a “Review of fundamental concepts from calculus, linear algebra, and probability with a focus upon applications in statistics and predictive modeling. The topics covered will include systems of linear equations and matrices, linear programming, concepts of probability underlying both classical and Bayesian statistics, differential calculus and integration.” This is a very aggressive review of linear algebra, probability, differential calculus and integral calculus. It would be easier for someone who has taken these courses recently, but it is challenging for me since it has been decades since I learned this material (and I am not sure I really learned some of it the first time around). You are assigned 1-2 chapters a week from the textbook “Lial, Greenwell and Ritchey (2012). Student Solutions Manual for Finite Mathematics and Calculus with Applications, 9th Ed.” Prof. Goldfeder prepares a high-level video that reviews the material in the chapters. He also posts PowerPoint presentations of the material in each chapter.

Homework. You are then required to complete a homework assignment each week, which covers the material in the chapters. This is typically 20-30 questions and is completed through the Pearson educational application, which is a FANTASTIC resource. The textbook is online there. Each chapter has its own section, and you can do problems in each sub-chapter. If you struggle with a solution, you can have the application walk you step by step through the problem and show you similar problems. There are links out to the textbook that take you right to the section dealing with the problem you are working on, and there are also videos available on each topic. I almost always do all of the study problems. The homework is another section in the application, and that is how you submit it. Homework is worth 25% of your grade.

Discussions. This is a surprisingly difficult section. The NU MSPA program is designed as an applied program, built around real-world examples and learning. To that end, Prof. Goldfeder challenges us each week to come up with real-world examples or explanations of the material we are learning. Formulating a response can take a surprising amount of time if you take it seriously, but in doing so I have learned a lot. The process makes you think about how these concepts could be used in the real world. You are supposed to post your discussion response by the middle of the week so that you can participate in the discussions about what you posted, as well as what your classmates posted. The kicker is that you can’t see what other students have posted until you post your submission. I have learned a lot from these discussions. The other students in the course have such wide backgrounds that they can weigh in on the topics in a meaningful way. We have students with backgrounds in sports analytics, actuarial work, industry, medicine, computer science, etc. The discussions are worth 25% of your grade.

Python. This could be extremely challenging if you have not had any exposure to Python or programming. I knew this would be a challenge, so I did take a few Python courses (Codecademy’s Python course at http://www.codecademy.com/learn/python, How to Think Like a Computer Scientist at interactivepython.org) prior to enrolling in the class. I would still label myself a beginner in Python, and the exercises challenged me to expand my knowledge. That said, I personally think this is one of the most gratifying portions of the course. I really enjoy combining what we are learning with Python. We cover the basics of Python, creating graphs and plots, and using NumPy and SciPy. I love this part of the course. It is done through the Enthought Canopy platform, which has the interactive editor, the package manager, and, of great value, the “Training on demand” series of very comprehensive instructional videos. These cover basic and advanced functionality and are well worth the money on their own. There is no grade each week for the Python assignments; however, you need to keep up with them. There were questions on the midterm that specifically required the use of Python to analyze the question and display the results. We have a Python TA assigned to the class who is very responsive to questions. In addition, students post code and help provide input on any questions.
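For a flavor of the kind of exercise this involves (my own illustrative example, not an actual course assignment), the snippet below uses NumPy to solve a small linear system and SciPy to evaluate a definite integral, the same topics covered in the readings.

```python
# Illustrative example of applying NumPy/SciPy to the course topics
# (linear algebra and integral calculus); not an actual course assignment.
import numpy as np
from scipy import integrate

# Solve the linear system  2x + y = 5,  x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])
print("Solution (x, y):", np.linalg.solve(A, b))   # -> [1. 3.]

# Numerically integrate f(x) = x^2 from 0 to 3 (the exact answer is 9)
value, abs_error = integrate.quad(lambda x: x ** 2, 0, 3)
print("Integral of x^2 on [0, 3]:", value)
```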

Tests. The midterm is worth 25% of the grade, as is the final examination. The midterm was a take-home test and required a substantial investment of time to complete. In addition, there was the regular homework/reading for that week, although the discussion that week was optional. This is a week when you would want to cut yourself some slack and allow extra time. I had a heavy work week that week, and regretted not thinking about this ahead of time to give myself a lighter work schedule.

Time requirement. I am finding that I am devoting 20-30 hours per week to do all of this. You could devote less time if you were more up to date on the math or Python. But remember, I am doing this to learn and retain the information. So I am doing all of the reading in the textbook, doing all of the example problems and “your turn” problems, and almost all of the chapter problems in the Pearson application. I have not had time to do all the problems in the back of the textbook, however. I also try to provide meaningful input into the discussions, both in my submission and in commenting on what other students have posted. I have also been trying to continue to dive deeper into learning Python.

Typical week. I usually try to do the textbook reading on Monday and Tuesday. (All of the assignments are due midnight Sunday night, so Monday starts a new week). I don’t do a lot of problems initially as I want to get through the reading, so I can apply it to my discussion. Then on Wednesday I like to start working on my discussion submission and try to get it in by Wednesday, or Thursday at the latest. That way I can participate in the discussions in a meaningful way. After I get my discussion submitted, I go back and work through the chapter problems in Pearson. I like to get to the homework section on Saturday. Ideally I like to have Sunday to do the Python reading and assignments.

My overall assessment of this course is that I am extremely satisfied. I think this is very professionally done, I am learning the math, I am being challenged to think about applying this to the real world, and I am learning Python. There is definitely a lot going on, but that is why I signed up for this. I feel as if I am getting my money’s worth.

Healthcare Predictive Analytics

“The Formula” – great summer reading and some implications for healthcare predictive analytics.

I would like to recommend “The Formula” by Luke Dormehl for a good summer read.  I am enjoying this book so far.  I think it should be a must-read for all of those interested in predictive analytics and predictive modeling.  A couple of passages from the beginning of the book are provided below.

[Book cover: “The Formula” by Luke Dormehl]

“Algorithms sort, filter and select the information that is presented to us on a daily basis.”  “… are changing the way that we view … life, the universe, and everything.”

“To make sense of a big picture, we reduce it …  To take an abstract concept such as human intelligence and turn it into something quantifiable, we abstract it further, stripping away complexity and assigning it a seemingly arbitrary number, which becomes a person’s IQ.”

“What is new is the scale that this idea is now being enacted upon , to the point that it is difficult to think of a field of work or leisure that is not subject to algorithmization and The Formula.  This book is about how we reached this point, and how the age of the algorithm impacts and shapes subjects as varied as human creativity, human relationships, notions of identity, and matters of law.”

“Algorithms are very good at providing us with answers in all of these cases.  The real question is whether they give us the answers we want (my emphasis).”

This takes us back to George E.P. Box’s famous quote, “all models are wrong, but some are useful”.  We can create algorithms for almost anything, but how useful are they?  Accurate models can be created that work really well on deterministic systems, but are much harder to develop for complex systems.  As you strip features away from a complex system, you lose the impact of those features on the system.  You try to strip away only the features that do not have a huge impact on the system’s behavior, but which features those are is often unknowable in advance.

One of the great challenges in clinical medicine is trying to determine or predict what is going to happen to a patient in the future.  We know generally that smoking is bad, too much alcohol is bad, being overweight is bad, not exercising is bad, and not sleeping enough is bad.  We know these are bad for the overall population.  However, we do not know how each of these affects a single patient, nor how they are interrelated.  We would like to develop models that can predict what will happen if you have certain conditions (predictive modeling), and then look at what would happen if you took certain courses of action/treatments/preventive actions (prescriptive modeling).  The results of these models would allow clinicians and patients to be better informed and choose the best pathway forward.

Of particular interest to me, I would like to be able to predict in real time what is going to happen to a patient I am seeing in the emergency room.  This is a complex situation.  The patient’s current state – physiologic vital signs (level of consciousness, blood pressure, pulse, respiratory rate, temperature, blood oxygen level, respiratory variability, heart rate variability, EKG, etc.), along with the current laboratory and radiological imaging findings – will define the current problem or diagnosis.  The patient’s past medical history, medications, allergies, social support, living environment, etc., will have major impacts on how they respond to their current illness or injury.  We would like to aggregate all of this information into predictive and prescriptive models that could predict future states.  Is the patient safe to be discharged home, or do they need to be admitted?  If they need to be admitted, can they go to the short-stay unit, a bed with cardiac monitoring, or the intensive care unit?  Given the current treatment, what will their response be – will they get better or worse?  Will they develop sepsis?  Will they develop respiratory failure and require a tube to be placed down their throat and a ventilator to breathe for them?
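As a very simplified sketch of what such a predictive model might look like, the snippet below fits a logistic regression that estimates the probability of admission from a few vital signs.  The features, coefficients, and data are entirely synthetic and hypothetical; this is an illustration of the modeling idea, not a validated clinical tool.

```python
# Highly simplified, hypothetical sketch: predict admission from a few vital
# signs with logistic regression. Synthetic data only -- not a clinical model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 1000
heart_rate = rng.normal(90, 20, n)
systolic_bp = rng.normal(120, 25, n)
resp_rate = rng.normal(18, 5, n)

# Synthetic outcome: sicker-looking vitals make admission more likely.
risk = 0.03 * (heart_rate - 90) - 0.02 * (systolic_bp - 120) + 0.15 * (resp_rate - 18)
admitted = (rng.random(n) < 1 / (1 + np.exp(-risk))).astype(int)

X = np.column_stack([heart_rate, systolic_bp, resp_rate])
X_train, X_test, y_train, y_test = train_test_split(X, admitted, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out accuracy:", round(model.score(X_test, y_test), 2))
print("Admission probability for HR 130, SBP 85, RR 32:",
      round(model.predict_proba([[130.0, 85.0, 32.0]])[0, 1], 2))
```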

A particularly exciting area ripe for development is the internet of things.   The internet of things is going to revolutionize how we collect data, both at home and in the hospital.   This much-needed capability will allow us to monitor patients at home,  detect illnesses much earlier, monitor responses to therapies, etc.,  and will be useful for a whole host of things we haven’t even imagined yet.

These are some of the complex questions that face us now in medicine.  I am excited to participate in this quest to answer some of these vexing questions using all of the analytical tools that are currently available – whether “small data” with standard descriptive and inferential statistics, predictive analytics, or big data analytics.