Data Science, Data Visualization

Altair – A Declarative Statistical Visualization Library for Python – Unveiled at SciPy 2016 Keynote Speech by Brian Granger.

You should check out Altair, an API designed to make data visualization much easier in Python.  Altair was introduced today in a keynote by Brian Granger on the opening day of SciPy 2016 (Scientific Computing with Python).  Brian is the leader of the IPython project and co-founder of Project Jupyter (Jupyter notebooks are my favorite way to code in Python or R).

Matplotlib has been the cornerstone of data visualization in Python, and as Brian Granger pointed out, you can do anything you want in matplotlib, but that flexibility comes at a price: time and effort.

Altair is designed as “a declarative statistical visualization library for Python”.  Here is the link to Brian Granger’s GitHub site, which houses the Altair files.  Altair is designed to be a very simple API, with minimal coding required to produce really nice visualizations.  A point Brian made in his talk was that Altair is a declarative API, which specifies what should be done, but not how it should be done.  The data source is a pandas DataFrame in a “tidy” format.  The end result is a JSON data structure that follows the Vega-Lite specification.

Here is my understanding, from a very high level, of the relationship from Altair to Vega-Lite to Vega to D3.  (For more information, follow this link.)  D3 (Data-Driven Documents) is a web-based visualization tool, but it is a low-level system.  Vega is a higher-level visualization specification language built on top of D3.  Vega-Lite is a high-level visualization grammar, and a higher-level language than Vega.  It provides a concise JSON syntax, which can be compiled to Vega specifications (link).  Altair is higher-level still, and emits JSON data structures following the Vega-Lite specification.  The idea is that the higher up you go, the less complexity and effort it takes to produce a graphic.

On the GitHub site there are a number of Jupyter notebook tutorials.  There is a somewhat restricted library of data visualizations available, and they currently list scatter charts, bar charts, line charts, area charts, layered charts, and grouped regression charts.

The fundamental object in Altair is the “Chart”, which takes a pandas DataFrame as its single argument.  You then start specifying what you want: what kind of “mark” to use and which visual encodings (X, Y, Color, Opacity, Shape, Size, etc.) to apply.  There are a variety of data transformations available, such as aggregations (values, count, valid, missing, distinct, sum, average, variance, stdev, median, min, max, etc.).  It is also easy to export the charts and publish them on the web as Vega-Lite plots.
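
To make this concrete, here is a minimal sketch of what an Altair chart might look like.  This is my own illustrative example – the DataFrame and column names are made up, and the exact import and method names may differ slightly from the released version:

import pandas as pd
from altair import Chart  # assuming Altair's top-level Chart object

# A small "tidy" DataFrame: one observation per row, one variable per column
data = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'value': [4, 7, 1, 9],
    'group': ['x', 'x', 'y', 'y'],
})

# Declare WHAT to draw: a bar mark, with visual encodings for x, y, and color.
# The 'average(value)' shorthand folds an aggregation into the encoding.
chart = Chart(data).mark_bar().encode(
    x='name',
    y='average(value)',
    color='group',
)

# The chart compiles down to a JSON data structure following the Vega-Lite spec
print(chart.to_dict())

Notice that nothing in the code says how to draw the bars – that is left to Vega-Lite and the renderer, which is the essence of the declarative approach.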

This looks like a very exciting and much easier to use data visualization API, and I look forward to exploring it more soon.

Data Science

Udemy.com has great courses for learning Python, R, Data Science.

Just a quick blog post to highlight the numerous courses available on Udemy.com.  I just completed Data Analysis in Python with Pandas, and found it very informative, especially with some of the advanced functions in DataFrames.

It is worthwhile keeping an eye on this site, because they have intermittent sales where these courses are deeply discounted.  I currently have 35 courses that cover Python, R, Data Science, MongoDB, SQL, MapReduce, Hadoop, teaching kids to code, Machine Learning, Data Vis, Time Series Analysis, Linear Modeling, Graphs, Rattle, Linear Regression, Statistics, Simulation, Monte Carlo Methods, Multivariate Analysis, Bayesian Computational Analyses, and more, most of which were purchased during these sales.

These are great courses for learning the underlying languages and concepts, and for brushing up when you have not used them for a while.

I highly recommend these courses; I just wish I had time to do more of them.

Becoming a Healthcare Data Scientist, Data Science, Data Scientist, Data Visualization, Northwestern University MSPA, Predictive Analytics

Northwestern University MSPA 402, Intro to Predictive Analytics Review

Summing this course up in one word: WOW.  This course should be taken early on because it is extremely motivating, and will help carry you through the other beginning courses such as Math for Modelers and Stats.  This course is a high-level overview of why and how analytics should be performed.  It describes not only predictive analytics but the whole analytics spectrum and what it means to be an “analytical competitor”.  While you do not perform any actual analytics, you will understand why getting good at this is so important.

I took this course from Dr. Gordon Swartz, and highly recommend him.  Interestingly, he has bachelor’s degrees in nuclear engineering and political science from MIT, an MBA from Northeastern University, and a doctorate in business administration from Harvard.  His sync sessions were very informative and practical, and he provided ongoing commentary in the discussion boards.

The course description is –  “This course introduces the field of predictive analytics, which combines business strategy, information technology, and modeling methods. The course reviews the benefits and opportunities of data science, organizational and implementation issues, ethical, regulatory, and compliance issues. It discusses business problems and solutions regarding traditional and contemporary data management systems and the selection of appropriate tools for data collection and analysis. It reviews approaches to business research, sampling, and survey design.”

The course is structured around required textbook reading, assigned articles, assigned videos, weekly discussions, one movie (Moneyball) and 4 projects.

Readings

The reading requirements are daunting, but doable.  You will (should) read 6 books in 10 weeks – a total of 1,590 pages.  There are 14 articles to read.  Each week has a short video as well.

These are the assigned books.  At first glance, this list may seem a little odd, with seemingly unrelated books.  However, they all help create the overall picture of analytics, and are all valuable.  I will provide just a brief overview of each, and plan to post more in-depth reviews of them later this summer.

Davenport TH, Harris JG.  2007.  Competing on Analytics: The New Science of Winning.  Boston, Massachusetts: Harvard Business School Publishing.

This is the first text you read, for good reason.  It provides the backbone for the course.  You will learn about what it means to be an analytical competitor, how to evaluate an organization’s analytical maturity, and then how to build an analytical capability.

Siegel E.  2013.  Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie or Die.  Hoboken, New Jersey: John Wiley &amp; Sons, Inc.

This is a must read for anyone going into predictive analytics, by one of the pioneers of this field.  It describes in detail what predictive analytics is, and gives numerous real life examples of organizations using these predictive models.

Few S.  2013.  Information Dashboard Design: Displaying Data for At-a-Glance Monitoring.  Burlingame, California: Analytics Press.

I will admit that when I first got this book I was very confused about why it was included in a course on predictive analytics.  However, this turned out to be one of the best reads of the course.  For anyone in analytics who has to display information, especially in a dashboard format, this is a must read.  It describes what dashboards are really for, and the science behind creating effective dashboards.  You will never look at a dashboard the same way again, and you will be critical of most commercially developed dashboards, as they are more about flashiness and fancy bells and whistles than about the functional display of pertinent data in the most effective format.  I can’t say enough good things about this book – a classic.

Laursen GHN, Thorlund J.  2010.  Business Analytics for Managers: Taking Business Intelligence Beyond Reporting.  Hoboken, New Jersey: John Wiley &amp; Sons, Inc.

This is a great overview of business analytics.  It is especially valuable in its explanation of how analytics needs to support the strategy of the organization.

Franks B.  2012.  Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics.  Hoboken, New Jersey: John Wiley &amp; Sons, Inc.

This was an  optional read, but I recommend reading it.  It is written in a very understandable way, and provides a great overview of the big data analytics ecosystem.

Groves RM, Fowler FJ, Couper MP, Lepkowski JM, Singer E, Tourangeau R.  2009.  Survey Methodology.  Hoboken, New Jersey: John Wiley &amp; Sons, Inc.

I will admit this was my least favorite book, but having said that, I learned a ton from it.  For anyone who will even think about using surveys to collect data, this is a must read.  However, the 419 pages make it a chore; it would be nice to have an abridged version.  What it does, though, is wake you up to how complex the process of creating, deploying, and analyzing surveys is.  I grudgingly admit this was a valuable read.

Articles

There are some really great articles included in the reading list.

Videos

There are videos, developed by another professor, that review the week’s material.  I did not find these especially helpful, but they did provide an overview of the week’s information, and might be helpful if you are having some trouble understanding the material.

Weekly Discussions

Again, the weekly discussions are where it happens.  There are one or more topics posted each week.  There are usually some really great comments, and you can gain a lot of insight if you actually think about what you are posting and what other people have posted.  If all you do is post a brief paragraph on the last day, you are missing out on some valuable information.

Moneyball

This is the first course I have taken where a movie was required.  There are discussions around the movie, and one of the assignments involves creating an analysis of the Oakland A’s and how they used analytics.  I enjoyed the movie and thinking about this.

Assignments

There are four assignments, each requiring a paper of varying length.  You must write these in proper APA format, so they are useful for refining those skills.

I found these to be challenging, fun, motivating, and extremely enlightening.  They called for applying what we learned to real-world situations.  For one of them, I performed an in-depth analysis of our organization’s analytics, which involved interviewing our senior leadership.  These interviews really started the process of moving our organization to the next analytical maturity level in a very meaningful way.

Another project involved the creation of a draft dashboard using the best practices outlined by Stephen Few in his text.  This was a great learning experience for me, and one that will translate into much better dashboards at our organization.

The last project involved creating a meaningful and valid survey.  This was informative as well, and I actually might send out my survey.

Summary

Overall, this was a fantastic course.  It makes clear why we need to do analytics well, and what doing it well looks like.  After this, the actual work of understanding and developing predictive models begins.  Again, I feel as if I got my money’s worth (not an easy thing to say since these courses are pricey!).

Summer Activities

I am taking the summer off and am trying to catch up on the projects that have been piling up.  For fun I am learning SQL (great book – Head First SQL by Lynn Beighley) and working my way through several Python Udemy courses.  I will be attending the SciPy 2016 conference in Austin, Texas in July as well, and am super excited about this.  I will be going to tutorials on Network Science, Data Science is Software, Time Series Analysis, and Pandas.  If you are attending, give me a shout out.

Data Science, Northwestern University MSPA, Predictive Analytics

Northwestern University MSPA 401, Introduction to Statistics Review

I finished this course last week, and thought I would post my thoughts before I forget them.

I was in Professor Roy Sanford’s section, and I HIGHLY recommend him.  He is an extremely experienced practitioner, and very knowledgeable of statistics and in using R for statistical analysis.

The course is focused on several aspects: learning basic statistics, learning R to perform statistical analysis, and engaging the students in discussions pertinent to the material being learned.

Learning Statistics

The core text for the course is Ken Black’s Business Statistics for Contemporary Decision Making, 8th Edition.  It is a loose-leaf binder text, so you can remove the sections you are studying, which is nice.  It is a very down-to-earth text, with plenty of examples and problems.  There is a companion website called WileyPlus that has videos to watch and a variety of problems/exercises.

A second, supplemental statistics text is Rand R. Wilcox’s Basic Statistics: Understanding Conventional Methods and Modern Insights.  There are selected readings which highlight some contemporary issues.  It is not as easy to read as Black’s text, but still informative.

Learning R

The coursework is presented using R.  You don’t HAVE to learn to use R, but you would be an idiot not to take advantage of this opportunity.  A great deal of effort was put into devising the curriculum to help you learn R.  It is well thought out, and I feel very confident that I have obtained a good working knowledge of R on which to build.  I was astounded to read a comment on the LinkedIn group – Networking Group for Northwestern University’s MS in Predictive Analytics Program – from a previous student who took this course, who said he didn’t really learn any R because he didn’t do any of the R reading or assignments.  To me, learning R was just as important as learning the statistics.  Plus, I don’t know how you could do the Data Analysis Projects without learning R.  Learning R is accomplished through reading various texts, watching weekly videos on R produced by Prof. Sanford, and then doing exercises.  There are also R resources and lessons, including links to Lynda.com.

I did the work in both RStudio and in a Jupyter Notebook using the R kernel. The Jupyter Notebook was my favorite way of doing the assignments because I could refer back to them.  But some things are way easier to do in RStudio, like installing packages and data sets, so sometimes I switched between the two.  See my other blog posts for information about Jupyter Notebooks.

The first R text is Winston Chang’s R Graphics Cookbook.  This takes you through the R basics and gets you up to speed quickly visualizing data.  There is a little bit about using the base plotting functions in R, but most of the book is about visualizing with the ggplot2 package.  If you follow the exercises, you will get good at plotting and visualizing data.  You will learn scatter plots, line graphs, bar graphs, histograms, box plots (a lot – I finally understand what to do with a box plot), functions, and QQ plots (I finally understand these as well).  All of these are extremely helpful in what you will spend a lot of time learning: Exploratory Data Analysis (EDA).

The second R text is Jared P. Lander’s R for Everyone: Advanced Analytics and Graphics.  This dives more deeply into using R for things other than data visualization and graphics, although it includes this as well.  This is a very easy to read and follow text.

The third R text is John Verzani’s Using R for Introductory Statistics: 2nd Edition.  This book is a very deep dive into R’s capability to do statistical analysis.  Although very detailed, it is understandable with great examples.

The last R text is downloadable from the site, Sarah Stowell’s Using R for Statistics.  This is also a very practical book on both statistics and visualization.

Don’t be overwhelmed by the number of texts and the amount of reading; it is doable, and I would do it all.  If you do that, you will not be able to say you did not get your money’s worth.

In addition, there are beginning videos and lessons about learning R, including links to Lynda.com.  There are weekly Calculations with R assignments, which include a video with examples.  There are exercises with these weekly assignments as well.  Finally, there are R lessons which take you through learning R in an organized manner.

Sync Sessions and Videos

Professor Sanford holds a sync session every other week.  These are extremely informative and helpful.  You don’t have to watch them live, but you do need to watch them later.  The sync sessions in Predict 400 were optional and you could get by fine without watching them.  That is not the case here.  You will learn a lot from these.

The same holds for the videos he has created to go along with the weekly R exercises.  These are must watch videos.

Data Analysis Projects

There are two data analysis projects.  You will learn how to apply what you are learning to a hypothetical data analysis project.  These are pretty challenging, but VERY worthwhile.  They show the applied focus of the MSPA program, and I found them beneficial.  The first one really focused on doing some exploratory data analysis.  The second one was twice as long as the first, and applied what you learned later in the course, including the creation of a linear regression model.  You will definitely want to start early on these, and put in the effort to do them correctly, as together they constitute two-fifths of your grade.

Bi-weekly Tests

There are 4 bi-weekly tests which are very fair and doable.  Together they constitute one-fifth of your grade.

Final Exam

The final exam is also very fair and doable.  It is much easier if you have paid attention to learning R, as you can use R to do the exam.  This is one-fifth of your grade.

Communications and Discussions

There are Communications discussion sections set up for statistics and R.  You can post a question anytime in either and get a rapid response from either Prof. Sanford or the R TA.  Our R TA was Todd Peterson, and he was extremely knowledgeable, helpful, and responsive.

Every week there are two discussions around topics you are learning.  These are student driven, and if taken seriously, you can learn a lot from each other.  There are some extremely bright and talented students in these classes who have great real world experience in a variety of sectors.   The final discussion section is a recap of what you learned that week, and Prof. Sanford participates in that discussion.

Overall

I spent between 20 and 30 hours per week doing the coursework.  You wouldn’t have to spend that much time, especially if this material is not new to you.  But I wanted to really learn the material, not just pass the class.

I really enjoyed this course on many fronts.  I found learning about statistics and R together was very complementary.  In fact, I cannot imagine doing any kind of statistical analysis without using a language such as R.  I am now trying to recreate what I learned in R using Python.  I really feel as if I got my money’s worth.

Big Data, Data Science, Healthcare Analytics

REMAP Clinical Trials – Combining the best of Randomized Clinical Trials and Big Data Analytics

I thought I would post some information on a new type of clinical trial that is a fusion of the Randomized Clinical Trial (RCT) and big data analytics.  This is based on a discussion that occurred in my Northwestern University Master of Science in Predictive Analytics statistics class (PREDICT 401).  The discussion centered around understanding the importance of “correlation is not causation”.  (As an aside, if you want to see some great examples of absurd correlations, go to tylervigen.com for some hilarious examples.)

I am hoping eventually to understand at a much deeper level how to go from establishing correlation to declaring causation.   This is a huge issue, not just in medicine, but across all disciplines.

The major method used to establish causation is the randomized clinical trial, or RCT.  In an RCT you attempt to control all of the variables so that you can look at the variables of interest.  These trials are usually performed with a pre-existing hypothesis in mind, i.e., we think that A may cause a change in B, so we control for all of the other things that we think change B.  Then, if we see changes in A that correspond to changes in B, A is not only correlated with B; there is a causal inference that A causes the changes in B.

There are many problems with RCTs, though.  They are very expensive and difficult; their findings are too broad (the average treatment effect is not representative of the benefit for any given individual); they exclude many real-life situations, so that by the time the final study population is defined, it is no longer of any practical significance for real-life application; and there are long delays before the results of RCTs make it into clinical practice (Angus, 2015).

There is another way to look at data, known by various names including data mining.  Here you start with the data, and develop the hypothesis to be tested later, after seeing what the data shows.  So you would perform an exploratory analysis on a data set, using advanced analytical methods, and see where the correlations arise.  Once you see the correlations, you can start to determine whether they are spurious or possibly real, and whether there is a possibility that they could be causal.  At that point you could develop an RCT to study the issue and try to establish causation.

There is a new type of RCT being developed, called a REMAP trial.  This stands for Randomized, Embedded, Multi-factorial, Adaptive Platform trial.  You won’t find a lot about it in the literature yet, but I have attached a link to a podcast that describes it, and the citation below is from an investigator involved with these studies, Dr. Derek Angus, MD, at the University of Pittsburgh.

Basically, the trial combines the best of an RCT with big data analytics.  It uses machine learning techniques to study these complex problems.  There is a study starting called REMAP Pneumonia that is enrolling patients in Europe, Australia, and New Zealand.  This is a perpetually running platform for the study of interventions in patients with severe pneumonia who need admission to an Intensive Care Unit.  A randomizing algorithm assigns patients to one of 48 different treatment arms.  Yes, this study has 48 different questions to answer, rather than one.  The weightings of the randomization change over time as the platform “learns” which treatment arms are doing better or worse.  Arms that are showing improvement have their randomization weights increased so more patients can be studied in them.  Once an arm reaches a certain pre-established threshold for effectiveness, that arm “graduates” and that treatment becomes standard therapy.
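
To illustrate just the core idea of shifting randomization weights – and only the idea; this is a toy sketch I put together, not the actual REMAP algorithm, and the arm names and response rates are invented – a response-adaptive scheme along the lines of Thompson sampling might look like this in Python:

import random

# Toy response-adaptive randomization: arms that accumulate more successes
# receive a larger share of future patients.  Conceptual sketch only.
arms = {
    'arm_A': {'successes': 1, 'failures': 1},  # Beta(1, 1) prior for each arm
    'arm_B': {'successes': 1, 'failures': 1},
    'arm_C': {'successes': 1, 'failures': 1},
}

def assign_patient():
    """Draw a plausible response rate for each arm from its Beta posterior
    and assign the patient to the arm with the highest draw."""
    draws = {
        name: random.betavariate(counts['successes'], counts['failures'])
        for name, counts in arms.items()
    }
    return max(draws, key=draws.get)

def record_outcome(arm, success):
    """Update the arm's counts once the patient's outcome is observed, so
    future randomization weights shift toward better-performing arms."""
    arms[arm]['successes' if success else 'failures'] += 1

# Simulate a stream of patients with hypothetical true response rates
true_rates = {'arm_A': 0.30, 'arm_B': 0.45, 'arm_C': 0.60}
for _ in range(500):
    arm = assign_patient()
    record_outcome(arm, random.random() < true_rates[arm])

print(arms)  # most patients end up in the best-performing arm

In a real platform trial the adaptation, the thresholds for “graduating” an arm, and the statistical modeling are vastly more sophisticated, but this captures the basic feedback loop between patient outcomes and randomization weights.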

This is an exciting advancement in the field of healthcare analytics.  You can also read about the “Adaptive Trial Design” used in the I-SPY 2 trial studying emerging and promising new agents for the treatment of breast cancer.  Here is the link (trial information link).  The touted benefits of the adaptive trial design are that because the trials “use patient outcomes to immediately inform treatment assignments for subsequent trial participants—I-SPY 2 can test new treatments in half the time, at a fraction of the cost and with significantly fewer participants.”

I think that once these techniques become more widely known, these types of trials will rapidly transform the face of healthcare research, and improve the capacity for healthcare organizations to become “Learning Health Systems”.

References

Angus D.  Fusing Randomized Trials With Big Data: The Key to Self-Learning Health Care Systems?  Journal of the American Medical Association (JAMA).  2015;314(8):767-768.

REMAP podcast link: http://www.sccm.org/Podcasts/SCCMPod306.mp3

Presentation by Dr. Angus link: http://iom.nationalacademies.org/~/media/Files/Activity%20Files/Research/DrugForum/2010-MAR-3/Session%201%20Angus.pdf

I-SPY 2 link: http://ispy2.org/

Data Science

Using Jupyter Notebooks to learn R, Python

I love using Jupyter Notebooks to learn R and Python.  I only wish I had discovered them when I first started to learn Python.  The notebooks are a great way to take notes, run code, see the output of the code, and then visualize the output.  The notebooks can be organized by language – i.e., Python vs. R – and also by the course you are taking or the book you are working your way through.  You can then go back and view your notes and code for future reference.

Project Jupyter grew out of the IPython project in 2014, and IPython notebooks are now Jupyter notebooks.  Jupyter Notebooks are described as “a web application for interactive data science and scientific computing”.  These notebooks support over 40 programming languages, and you can create notebooks with a Python kernel, or ones with an R kernel, among others.  They are great for learning programming languages, and several academic institutions are using them in their CS courses.  They are also great for “reproducibility” – the ability to reproduce the findings that other people report.  By publishing a notebook on GitHub, Dropbox, or the Jupyter Notebook Viewer, others can see exactly what was done and run the code themselves.

Here is how I use Jupyter Notebooks.  When I start a new course, whether an official course in my Northwestern University Master of Science in Predictive Analytics, or a web based course like the ones I have been taking from DataCamp and Udemy, or from a book that I am working my way through – I will create a new Jupyter notebook.

You first have to launch Jupyter by typing “jupyter notebook” in your shell (I use Windows PowerShell).  This then opens up a browser page, “Home”.

[Screenshot: the Jupyter “Home” page in the browser]

If I want to open up an existing notebook, I scroll down to the  notebook of interest and open it.  Here is a screen shot showing some of my notebooks.

[Screenshot: a list of my existing notebooks]

If I want to start a new notebook, I go to the top, select “New”, and then choose either a Python or an R notebook.  Jupyter comes with the Python kernel installed (you go to IRkernel on GitHub to install the R kernel).  This opens up a new notebook.

[Screenshot: creating a new notebook]

You type commands or text into “cells” and can run the cells individually or all together.  The two most common cell types I use are “Markdown” and “Code”.  You do have to learn a few easy Markdown commands, for headers and the like.  The Markdown cells are used for taking notes and inserting text.  The Code cells are used to enter and run code.

[Screenshot: Markdown and Code cells in a notebook]

Once you have entered your code, you can run the cell in several ways.  The most convenient is to hit “Shift-Enter”, which runs the code in that cell and brings up a new blank cell.
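
To give a sense of what goes into a typical code cell, here is a simple example of the kind of Python cell I might run.  The file name and column name are just placeholders, not a real dataset:

# %matplotlib inline tells the notebook to render plots directly below the cell
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

# Load a (hypothetical) data set and get a quick numeric summary
df = pd.read_csv('my_course_data.csv')
print(df.describe())

# Plot a histogram of one column; the figure appears right under the cell
df['some_column'].hist(bins=20)
plt.title('Distribution of some_column')
plt.show()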

These are great for creating and saving visualizations, as you can make minor changes and then compare plots.  Here are a few examples.

[Screenshots: example plots created and compared in a notebook]

There are a few things that don’t run smoothly yet, like loading packages in R.  I have found the easiest way is to install a package using RStudio, and then use the library command in Jupyter to load it into the notebook.  Alternatively, you could use the following command each time:

install.packages("package name", repos = c("https://rweb.crmda.ku.edu/cran/"))  # you can select your own CRAN mirror and insert it into the repos argument

Overall, I love using Jupyter to take notes, run code while learning, and organize my learning so I can easily find it later.  I see its huge potential for sharing data and easily reproducing results.  Give it a try!

Data Science, Data Visualization

Data Science Skill Network Visualization

I came across this great visualization by Ferris Jumah (see Ferris Jumah’s blog post) about the relationships between data science skills listed by “Data Scientists” on their LinkedIn profiles.

[Data science skill network visualization]  To view a higher resolution image, go to: http://imgur.com/hoyFT4t

How many of these skills have you mastered?

Ferris’s conclusions highlight a few key themes:

1. Approach data with a mathematical mindset.
2. Use a common language to access, explore, and model data.
3. Develop strong computer science and software engineering backgrounds.

Data Science, Data Scientist

Who is Doing What/Earning What in Data Science Infographic

Are you confused yet about the different roles/titles that people can have in the data analytics industry?  I think this might help add to your confusion.  This is a very nicely done infographic by DataCamp (http://blog.datacamp.com/data-science-industry-infographic/).  It is presented for your viewing pleasure and consideration.  Where do you fit into this categorization?  And does your compensation match your title, match your responsibilities, match your usefulness to your organization?

[Infographic: the data science industry, by DataCamp]