Data Science

Udemy.com has great courses for learning Python, R, Data Science.

Just a quick blog post to highlight the numerous courses available on Udemy.com.  I just completed Data Analysis in Python with Pandas, and found it very informative, especially with some of the advanced functions in DataFrames.

It is worthwhile keeping an eye on this site, because they have intermittent sales where these courses are deeply discounted.  I currently have 35 courses that cover Python, R, Data Science, MongoDB, SQL, MapReduce, Hadoop, teaching kids to code, Machine Learning, Data Vis, Time Series Analysis, Linear Modeling, Graphs, Rattle, Linear Regression, Statistics, Simulation, Monte Carlo Methods, Multivariate Analysis, Bayesian Computational Analyses, and more, most of which were purchased during these sales.

These are great courses for learning the underlying languages and concepts, and for brushing up when you have not used them for a while.

I highly recommend these courses; I just wish I had time to do more of them.

 

Data Science

Using Jupyter Notebooks to learn R, Python

I love using Jupyter Notebooks to learn R and Python.  I only wish I had discovered them when I first started to learn Python.  The notebooks are a great way to take notes, run code, see the output of the code, and then visualize the output.  The notebooks can be organized by language (i.e., Python vs. R), and also by the course you are taking or the book you are working your way through.  You can then go back and view your notes and code for future reference.

Project Jupyter was developed from the IPython Project in 2014, and IPython notebooks are now Jupyter notebooks.  Jupyter Notebooks are described as “a web application for interactive data science and scientific computing”.  These notebooks support over 40 programming languages, and you can create notebooks with a Python kernel, or ones with an R kernel, amongst others.  They are great for learning programming languages, and several academic institutions are using them in their CS courses.  They are also great for “reproducibility”, the ability to reproduce the findings that other people report.  By publishing the notebook on GitHub, Dropbox, or Jupyter Notebook Viewer, others can see exactly what was performed, and run the code themselves.

Here is how I use Jupyter Notebooks.  When I start a new course, whether an official course in my Northwestern University Master of Science in Predictive Analytics program, a web-based course like the ones I have been taking from DataCamp and Udemy, or a book that I am working my way through, I create a new Jupyter notebook.

You first have to start Jupyter Notebook by typing “jupyter notebook” in your shell (I use Windows PowerShell).  This opens a browser page titled “Home”.

[Screenshot: 2016-01-03_13-26-32]

If I want to open an existing notebook, I scroll down to the notebook of interest and open it.  Here is a screenshot showing some of my notebooks.

[Screenshot: 2016-01-03_13-29-50]

If I want to start a new notebook, I go to the top, select “New”, and then choose either a Python or R notebook.  Jupyter comes with the Python kernel installed; to add the R kernel, follow the installation instructions from the IRkernel project on GitHub.  Selecting a kernel opens up a new notebook.

[Screenshot: 2016-01-03_13-31-52]

You type commands or text into “cells” and can run the cells individually or all together.  The two cell types I use most are “Markdown” and “Code”.  The Markdown cells are used for taking notes and inserting text, so you do have to learn a few easy Markdown commands for headers and the like.  The Code cells are used to enter and run the code.

[Screenshot: 2016-01-03_13-39-43]
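To make the idea of cells more concrete, here is a minimal sketch of what a notebook looks like on disk.  It assumes the version 4 notebook format, in which an .ipynb file is plain JSON holding a list of cells.  The function names, the folder layout, and the lesson title are hypothetical and only for illustration; in practice you would create notebooks from the “New” menu rather than by hand.

```python
import json
from pathlib import Path


def make_notebook(title, code_lines):
    """Build a minimal version 4 notebook with one Markdown cell and one Code cell."""
    markdown_cell = {
        "cell_type": "markdown",
        "metadata": {},
        # Markdown source: a header line followed by a line of notes.
        "source": [f"# {title}\n", "Notes for this lesson go here."],
    }
    code_cell = {
        "cell_type": "code",
        "execution_count": None,  # the cell has not been run yet
        "metadata": {},
        "outputs": [],            # filled in by Jupyter when the cell runs
        "source": code_lines,
    }
    return {
        "cells": [markdown_cell, code_cell],
        "metadata": {},           # no kernel recorded (an assumption of this sketch)
        "nbformat": 4,
        "nbformat_minor": 0,
    }


def save_notebook(notebook, folder, name):
    """Write the notebook as <folder>/<name>.ipynb and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.ipynb"
    path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    return path
```

For example, save_notebook(make_notebook("Pandas course, lesson 1", ["print('hello')"]), "notebooks/python", "pandas_lesson_1") would write one notebook into a per-language folder, which mirrors how I organize notebooks by language and course.  Because this sketch records no kernel in the metadata, Jupyter may ask you to choose one when you open the file.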

Once you have entered your code, you can run the cell several ways.  The most convenient is to hit “Shift-Enter”, which runs the code in that cell and moves to the next cell, creating a new blank cell if you are at the end of the notebook.

These are great for creating and saving visualizations, as you can make minor changes and then compare plots.  Here are a few examples.

[Screenshot: 2016-01-03_13-45-24]

[Screenshot: 2016-01-03_13-46-56]

There are a few things that don’t run smoothly yet, like installing packages in R.  I have found the easiest way to add a package is to install it using RStudio, and then use the library command in Jupyter to load it into the notebook.  Alternatively, you can install it from inside the notebook with the following command:

install.packages("package name", repos = c("https://rweb.crmda.ku.edu/cran/"))  # Replace the URL with your preferred CRAN mirror.

Overall, I love using Jupyter to take notes, run code while learning, and organize my learning so I can easily find it later.  I see its huge potential for sharing data and for easily reproducing results.  Give it a try!

 

Becoming a Healthcare Data Scientist, Northwestern University MSPA

Northwestern University MSPA 400 Math for Modelers course, final thoughts

 

 

I had previously posted my interim thoughts on this course, and now that the course is finished, I thought I would add my final thoughts.

The final examination was fair and a mixture of the math and Python.  You could certainly pass the course without keeping up with the Python and doing the exercises, but it would be much more difficult.

The last section of the course was on calculus.  Weeks 6 and 7 were devoted to a review of differential calculus and weeks 8 and 9 were devoted to integral calculus.   Dr. Goldfeder continued to stress the real world application of the concepts learned.

We had a week off over the Thanksgiving holiday, which allowed us to catch up and review before the final examination.  I took this time to both review the math (a little), and review Python (a lot).  I went back through each week’s Python assignments to make sure I understood the concepts and could work through the code.  I HIGHLY recommend this.  Looking back, I wish I had spent more independent time applying Python and writing code to do the problems as we were going through the course.  I encourage future students in this class to attempt to do this.

After the class ended I started catching up on my to-do list, which included learning how to use Jupyter Notebooks.  After exploring Jupyter Notebooks further, I wish I had found them earlier.  They are very useful for learning code and taking notes at the same time.  I would encourage students to look at these when they start this course.  I wish Northwestern University would do what several other universities have done, and start teaching the class using these notebooks.  This would be extremely useful for the Python part of the course.  I have now been brushing up on R using the same Jupyter Notebooks, with an R kernel installed.  I plan on using a notebook as I go through my next class, statistical analysis, which uses R.  Here is the link to Project Jupyter’s webpage.

My overall assessment of the Math for Modelers course is highly positive, and I feel as if I learned what I set out to learn, and got my money’s worth.  It is a very demanding class time-wise, but for those interested in analytics, this is a foundational set of knowledge that must be learned.

Data Science

Text Cleaning Using Python Infographic

Here is an infographic about using Python for text cleaning from the Analytics Vidhya website (analyticsvidhya.com).

Here is the link: http://i2.wp.com/www.analyticsvidhya.com/wp-content/uploads/2015/06/New-Info.jpg

In addition to this information, Matt Crowson, the Python TA for my Math for Modelers course at Northwestern, suggested the following as well.

NLTK (Natural Language Tool Kit) http://www.nltk.org/

scikit-learn http://scikit-learn.org/stable/

[Image: text-cleaning-python]
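For readers who want a feel for what text cleaning involves, here is a minimal sketch using only the Python standard library.  It is not taken from the infographic; the steps shown (removing HTML-style tags and URLs, lowercasing, stripping punctuation, and dropping stop words) are my assumption of a typical pipeline, and the tiny stop-word list and the function name are made up for illustration.  NLTK provides a much fuller stop-word list and proper tokenizers.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); NLTK ships a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


def clean_text(text):
    """Return lowercase tokens with tags, URLs, punctuation, and stop words removed."""
    text = re.sub(r"<[^>]+>", " ", text)       # replace HTML-style tags with a space
    text = re.sub(r"https?://\S+", " ", text)  # replace URLs with a space
    text = text.lower()
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                      # split on any run of whitespace
    return [token for token in tokens if token not in STOP_WORDS]
```

Calling clean_text on a sentence returns a list of tokens that can then be passed on to a library such as scikit-learn for further analysis.  Note that deleting punctuation also joins contractions and hyphenated words (for example, “don't” becomes “dont”), which may or may not be what you want.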

Becoming a Healthcare Data Scientist

My Current Baseline Data Scientist Skill Set

It will be interesting to compare my skill set once I finish the predictive analytics program to my current skill set.  I will outline my current skills so I can come back later and compare the two.

I will organize my skills using the format presented by Mitch Sanders in his blog article posted on 8.27.13 “Data Science – Capturing, Analyzing, and Presenting Data Skills”.  (http://datareality.blogspot.com/2013/08/data-scientist-core-skills.html).

1.  Capturing Data

Programming and Database skills:

I am weak in this area.  I have used R a bit to do some statistical analysis in the past.  I am currently learning Python as I write this.  So far, I have found that Codecademy’s Python course is the best learning platform for me.  My next favorite resource is Zed Shaw’s book, “Learn Python the Hard Way”.  I really like his practical approach.  “Introducing Python: Modern Computing in Simple Packages” by Bill Lubanovic is also good, but a bit more advanced.  Finally, the Visual QuickStart Guide “Python” by Toby Donaldson is a quick reference guide.  Going past basic programming, my skills are near or below zero.  I do not know how to use Hadoop, Java, SQL, Hive or Pig.

Business Domain Expertise and Knowledge

This is my strongest area of expertise.  I started off in medicine in 1984 as a basic EMT, became an EMT-Paramedic, and then a Paramedic Educator.  I finished medical school (University of Illinois College of Medicine in Peoria, Illinois) in 1994, and my Emergency Medicine Residency at Saint Francis Hospital in Peoria, Illinois in 1997.  I have practiced academic and community-based emergency medicine since then.  I have been a medical director for both ground-based EMS and for a flight program.  I am also one of our health system’s Chief Medical Information Officers (CMIO), so I have had to learn the field of Healthcare Information Technology as well.  In my current role I have a special interest in Business Intelligence and Analytics, including predictive analytics.  My passion is for developing smarter systems that can provide information about a patient’s risk of developing certain diseases/conditions, risk of deterioration/death, early detection of sub-clinical illness, and information about a patient’s response to treatment and therapy.  Hence my interest in predictive analytics.

Data Modeling, Warehouse, and Unstructured Data Skills.

I have minimal skills in this category.

2.  Analyzing Data

Math Skills.

I have basic math skills, but it has been a long time since I have had to do more than basic math, including calculus and linear algebra.  After I finish getting a basic foundation in Python, my next step is to refresh my knowledge of math/calculus/linear algebra before starting my “Math for Modelers” course this fall.

Statistical  and Analytical Skills

I do have a little better grasp of descriptive and inferential statistics.   But I will need to increase my knowledge of the advanced statistical techniques not commonly used in medicine today.  These would include predictive analytics, regression, multivariate analysis, linear models, time series analysis, machine learning, etc.

3.  Presenting Data

I am really excited to learn about and improve my data visualization skills.  I am really pushing hard for our organization to move away from Excel and PowerPoint based presentations of data, to more relevant methods.

Storytelling Skills

I am a pretty good storyteller, but would like to improve my skills, especially in presenting the data and the stories around the data.  I would like to help people understand the insight created by the data analysis, and then help them move to operationalizing that insight and driving organizational change to improve patient outcomes.

In summary, my strongest skills are my love of data and analytics, my (obsessive) desire to become a data scientist, and my domain knowledge as it pertains to healthcare.  My other skills will have to be works in progress.

I would love to hear comments on what you think, and any recommendations/advice for students just starting this journey.

June 10, 2015