Big Data, Data Science, Healthcare Analytics

REMAP Clinical Trials – Combining the Best of Randomized Clinical Trials and Big Data Analytics

I thought I would post some information on a new type of clinical trial that is a fusion of the Randomized Clinical Trial (RCT) and big data analytics.  This is based on a discussion that occurred in my Northwestern University Master of Science in Predictive Analytics statistics class (PREDICT 401).  The discussion centered on understanding the importance of “correlation is not causation”.  (As an aside, there are some sites with hilarious examples of absurd correlations.)

I am hoping eventually to understand at a much deeper level how to go from establishing correlation to declaring causation.   This is a huge issue, not just in medicine, but across all disciplines.

The major method used to establish causation is the randomized clinical trial, or RCT.  In an RCT you attempt to control all of the variables so that you can isolate the variables of interest.  These trials are usually performed with a pre-existing hypothesis in mind, i.e., we think that A may cause a change in B, so we control for all of the other things that we think change B.  Then, if we see changes in A that correspond to changes in B, A is not only correlated with B; there is a causal inference that A causes the changes in B.
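To make this concrete, here is a toy simulation (my own sketch, not from any real trial; the variable names and effect sizes are invented) showing why randomization lets the simple difference in group means estimate the causal effect, even when a confounder also drives the outcome:

```python
import random
import statistics

# Toy sketch of the RCT logic: randomize subjects to treatment or
# control so confounders balance out on average, then compare outcomes.
random.seed(1)
subjects = []
for _ in range(2000):
    confounder = random.gauss(0, 1)          # something else that changes B
    treated = random.random() < 0.5          # the randomization step
    # Outcome B depends on the confounder plus a true treatment effect of 1.0
    outcome = confounder + (1.0 if treated else 0.0) + random.gauss(0, 1)
    subjects.append((treated, outcome))

treated_outcomes = [o for t, o in subjects if t]
control_outcomes = [o for t, o in subjects if not t]

# Because assignment was random, the confounder is balanced between groups,
# so the difference in means estimates the causal effect (true value: 1.0)
effect = statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)
```

Without the randomization step (say, if sicker patients were more likely to be treated), that same difference in means would mix the treatment effect with the confounder, which is exactly the trap "correlation is not causation" warns about.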

There are many problems with RCTs, though.  They are very expensive and difficult to run; their findings are too broad (an average treatment effect is not representative of the benefit for any given individual); they exclude so many real-life situations that by the time the final study population is defined, it is no longer of practical significance for real-life application; and there are long delays before RCT results make it into clinical practice (Angus, 2015).

There is another way to look at data, known by various names including data mining.  Here you start with the data and develop the hypothesis to be tested later, after seeing what the data show.  You perform an exploratory analysis on a data set, using advanced analytical methods, and see where correlations arise.  Once you see the correlations, you can start to determine whether they are spurious or possibly real, and whether there is a possibility that they could be causal.  At that point you could design an RCT to study the issue and try to establish causation.
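As a toy illustration of this "data first" workflow (my own sketch; all variable names are invented), you can scan many candidate variables for correlation with an outcome and see that the real signal rises to the top, while pure-noise variables still produce weak "hits" that would need follow-up before any causal claim:

```python
import random
import statistics

def pearson(xs, ys):
    # Plain Pearson correlation coefficient
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)
n = 200
outcome = [random.gauss(0, 1) for _ in range(n)]

# One variable genuinely related to the outcome, plus pure-noise variables
variables = {"true_signal": [x + random.gauss(0, 1) for x in outcome]}
for i in range(20):
    variables[f"noise_{i}"] = [random.gauss(0, 1) for _ in range(n)]

correlations = {name: pearson(vals, outcome) for name, vals in variables.items()}

# Rank candidates by correlation strength; the top hit is the real signal,
# but the weaker hits among the noise variables are spurious by construction
ranked = sorted(correlations, key=lambda k: abs(correlations[k]), reverse=True)
```

The spurious hits are the key point: with enough candidate variables, some correlations always appear by chance, which is why the exploratory step only generates hypotheses for a later RCT.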

There is a new type of RCT being developed, called a REMAP trial.  This stands for Randomized, Embedded, Multi-factorial, Adaptive Platform trial.  You won’t find much about it in the literature yet, but I have attached a link to a podcast that describes it, and the citation below is from an investigator involved in these studies, Dr. Derek Angus of the University of Pittsburgh.

Basically, the trial combines the best of an RCT with big data analytics, using machine learning techniques to study complex problems.  A study called REMAP Pneumonia is now enrolling patients in Europe, Australia, and New Zealand.  It is a perpetually running platform for the study of interventions in patients with severe pneumonia who need admission to an Intensive Care Unit.  A randomizing algorithm assigns patients to one of 48 different treatment arms.  Yes, this study has 48 different questions to answer, rather than one.  The randomization weightings change over time as the platform “learns” which treatment arms are doing better or worse.  Arms that are showing improvement have their randomization weights increased so more patients can be studied.  Once an arm reaches a certain pre-established threshold for effectiveness, that arm “graduates” and that treatment becomes standard therapy.
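The exact REMAP algorithm is more sophisticated than anything I can reproduce here, but the general idea of response-adaptive randomization can be sketched with Thompson sampling (my own illustrative toy; the arm names and outcome counts are made up):

```python
import random

# Each arm keeps a Beta posterior over its success rate; patients are
# randomized toward arms whose posteriors currently look better.
class Arm:
    def __init__(self, name):
        self.name = name
        self.successes = 0   # good outcomes observed so far
        self.failures = 0    # poor outcomes observed so far

    def draw(self):
        # Sample a plausible success rate from the Beta posterior
        return random.betavariate(self.successes + 1, self.failures + 1)

def assign_patient(arms):
    # The arm with the highest posterior draw gets this patient, so
    # better-performing arms accrue patients at a higher rate over time
    return max(arms, key=lambda arm: arm.draw())

random.seed(0)
arms = [Arm("treatment-A"), Arm("treatment-B")]
arms[0].successes, arms[0].failures = 30, 10   # arm A doing well so far
arms[1].successes, arms[1].failures = 10, 30   # arm B doing poorly so far

counts = {arm.name: 0 for arm in arms}
for _ in range(1000):
    counts[assign_patient(arms).name] += 1
# The randomization has shifted heavily toward the better-performing arm
```

A "graduation" rule would then be a threshold check on an arm's posterior, at which point the treatment leaves the platform and becomes standard therapy.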

This is an exciting advancement in the field of healthcare analytics.  You can also read about the “Adaptive Trial Design” used in the I-SPY 2 trial, which studies emerging and promising new agents for the treatment of breast cancer (trial information link below).  The touted benefit of the adaptive trial design is that because these trials “use patient outcomes to immediately inform treatment assignments for subsequent trial participants—I-SPY 2 can test new treatments in half the time, at a fraction of the cost and with significantly fewer participants.”

I think that once these techniques become more widely known, these types of trials will rapidly transform the face of healthcare research, and improve the capacity for healthcare organizations to become “Learning Health Systems”.



Angus D.  Fusing Randomized Trials With Big Data:  The Key to Self-Learning Health Care Systems?  JAMA. 2015;314(8):767-768.

REMAP podcast link:

Presentation by Dr. Angus link:

I-SPY 2 link:



Data Science

Using Jupyter Notebooks to learn R, Python

I love using Jupyter Notebooks to learn R and Python.  I only wish I had discovered them when I first started to learn Python.  The notebooks are a great way to take notes, run code, see the output of the code, and then visualize the output.  Notebooks can be organized by language (Python vs. R), and also by the course you are taking or the book you are working your way through.  You can then go back and view your notes and code for future reference.

Project Jupyter was developed from the IPython project in 2014, and IPython notebooks are now Jupyter notebooks.  Jupyter Notebook is described as “a web application for interactive data science and scientific computing”.  The notebooks support over 40 programming languages, and you can create notebooks with a Python kernel, an R kernel, or others.  They are great for learning programming languages, and several academic institutions use them in their CS courses.  They are also great for “reproducibility” – the ability to reproduce the findings that other people report.  By publishing a notebook on GitHub, Dropbox, or Jupyter Notebook Viewer, others can see exactly what was performed and run the code themselves.

Here is how I use Jupyter Notebooks.  When I start a new course – whether an official course in my Northwestern University Master of Science in Predictive Analytics program, a web-based course like the ones I have been taking from DataCamp and Udemy, or a book that I am working my way through – I create a new Jupyter notebook.

You first have to start Jupyter by typing “jupyter notebook” in your shell (I use Windows PowerShell).  This opens a “Home” page in your browser.


If I want to open an existing notebook, I scroll down to the notebook of interest and open it.  Here is a screen shot showing some of my notebooks.


If I want to start a new notebook, I go to the top, select “New”, and then choose either a Python or an R notebook.  Notebooks come with the Python kernel installed (to add the R kernel, install IRkernel from GitHub).  This opens a new notebook.


You type commands or text into “cells” and can run the cells individually or all together.  The two cell types I use most are “Markdown” and “Code”.  You do have to learn a few easy Markdown commands, for headers and the like.  Markdown cells are used for taking notes and inserting text; Code cells are used to enter and run code.


Once you have entered your code, you can run the cell in several ways.  The most convenient is to hit “Shift-Enter”, which runs the code in that cell and brings up a new blank cell.

Notebooks are also great for creating and saving visualizations, as you can make minor changes and then compare the plots.  Here are a few examples.
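A minimal sketch of the kind of plot cell I run (my own toy example using matplotlib; the data are just for illustration):

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen; not needed inside Jupyter
import matplotlib.pyplot as plt

# Tweak a parameter, re-run the cell, and compare the resulting plots
x = [i / 10 for i in range(101)]
for power in (1, 2, 3):          # the "minor change" to compare across runs
    plt.plot(x, [xi ** power for xi in x], label=f"x^{power}")
plt.legend()
plt.title("Comparing minor variations of a plot")
plt.savefig("comparison.png")    # in a notebook the figure also displays inline
```

Inside a notebook the figure appears directly below the cell, so each saved variation sits next to the code that produced it.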



There are a few things that don’t run smoothly yet, like installing packages in R.  I have found the easiest way is to install the package using RStudio, and then use the library() command in Jupyter to load it into the notebook.  Alternatively, you can install from within Jupyter each time:

install.packages("package name", repos = "https://cran.r-project.org") # substitute your preferred CRAN mirror for the repos value

Overall, I love using Jupyter to take notes, run code while learning, and organize my learning so I can easily find it later.  I see its huge potential for sharing data and easily reproducing results.  Give it a try!