Data Science, Northwestern University MSPA

Python Tops KDNuggets 2017 Data Science Software Poll

The results of KDNuggets’ 18th annual Software Poll should be fascinating reading for anyone involved in data science and analytics. Some highlights: Python (52.6%) finally overtook R (52.1%), SQL remained at about 35%, and Spark and TensorFlow each rose above 20%.

[Figure: KDNuggets 2017 software poll results]

(Graph taken from http://www.kdnuggets.com/2017/05/poll-analytics-data-science-machine-learning-software-leaders.html/2)

I am about halfway through Northwestern University’s Master of Science in Predictive Analytics (MSPA) program. I am very thankful that the program has made learning different languages a priority. I have already learned Python, Jupyter Notebooks, R, SQL, some NoSQL (MongoDB), and SAS. In my current class in Generalized Linear Models, I have also started to learn Angoss, SAS Enterprise Miner, and Microsoft Azure machine learning. However, it looks like you can never stop learning new things: I still have to learn Spark and TensorFlow, to name just two.

I highly recommend you read this article.

 

Data Science, Northwestern University MSPA

Northwestern University MSPA Program – Learning R and Python resources page created

For new students coming into Northwestern University’s Master of Science in Predictive Analytics (MSPA) program, there is often considerable apprehension about learning the programming languages (mainly R and Python, and some SAS).   I have created a page on my blog site – Northwestern University MSPA Program – Learning R and Python resources – that lists some of the resources that are available, and my favorites.

I would encourage students to start learning whichever language a class requires before that class begins. There is enough time between the official courses to take some of these programming courses. That way you don’t have to learn the course content and the programming language at the same time (if you don’t prepare ahead it is still doable, but it will take more effort).

Data Science

Udemy.com has great courses for learning Python, R, and Data Science.

Just a quick blog post to highlight the numerous courses available on Udemy.com.  I just completed Data Analysis in Python with Pandas, and found it very informative, especially with some of the advanced functions in DataFrames.

It is worthwhile keeping an eye on this site, because they have intermittent sales where these courses are deeply discounted.  I currently have 35 courses that cover Python, R, Data Science, MongoDB, SQL, MapReduce, Hadoop, teaching kids to code, Machine Learning, Data Vis, Time Series Analysis, Linear Modeling, Graphs, Rattle, Linear Regression, Statistics, Simulation, Monte Carlo Methods, Multivariate Analysis, Bayesian Computational Analyses, and more, most of which were purchased during these sales.

These are great courses for learning the underlying languages and concepts, and for brushing up when you have not used them for a while.

I highly recommend these courses; I just wish I had time to do more of them.

 

Data Science, Northwestern University MSPA, Predictive Analytics

Northwestern University MSPA 401, Introduction to Statistics Review

I finished this course last week, and thought I would post my thoughts before I forget them.

I was in Professor Roy Sanford’s section, and I HIGHLY recommend him. He is an extremely experienced practitioner, and very knowledgeable about statistics and about using R for statistical analysis.

The course is focused on several aspects – learning basic statistics, learning R to perform statistical analysis, and engaging the students to participate in discussions that are pertinent to the material being learned.

Learning Statistics

The core text for the course is Ken Black’s Business Statistics For Contemporary Decision Making, 8th Edition. It is a loose-leaf binder text, so you can remove the sections you are studying, which is nice. It is a very down-to-earth text, with plenty of examples and problems. There is a companion website called WileyPlus that has videos to watch and a variety of problems/exercises.

A second supplemental statistical text is Rand R. Wilcox’s Basic Statistics: Understanding Conventional Methods and Modern Insights.  There are selected readings which highlight some contemporary issues.  Not as easy to read as Black’s text, but still informative.

Learning R

The coursework is presented using R. You don’t HAVE to learn to use R, but you would be an idiot not to take advantage of this opportunity. A great deal of effort has been put into devising the curriculum to help you learn R. It is well thought out, and I feel very confident that I have obtained a good working knowledge of R on which to build. I was astounded to read a comment in the LinkedIn group (Networking Group for Northwestern University’s MS in Predictive Analytics Program) from a previous student who took this course and said he didn’t really learn any R because he didn’t do any of the R reading or assignments. To me, learning R was just as important as learning the statistics. Plus I don’t know how you could do the Data Analysis Projects without learning R. Learning R is accomplished through reading various texts, watching weekly videos on R produced by Prof. Sanford, and then doing exercises. Plus there are R resources and lessons, including links to Lynda.com.

I did the work in both RStudio and in a Jupyter Notebook using the R kernel. The Jupyter Notebook was my favorite way of doing the assignments because I could refer back to them.  But some things are way easier to do in RStudio, like installing packages and data sets, so sometimes I switched between the two.  See my other blog posts for information about Jupyter Notebooks.

The first R text is Winston Chang’s R Graphics Cookbook.  This takes you through the R basics and gets you up to speed quickly visualizing data.  There is a little bit about using the base plotting function in R, but most of the book is about visualizing using the ggplot2 package.  If you follow the exercises, you will get good at plotting and visualizing data.  You will learn scatter plots, line graphs, bar graphs, histograms, box plots (a lot – I finally understand what to do with a box plot), functions, QQ plots (I finally understand these as well).  All of these are extremely helpful in what you will spend a lot of time learning, Exploratory Data Analysis (EDA).
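For readers who, like me, finally want to see what a box plot is actually drawing, here is a rough sketch of the numbers behind one. It is written in Python rather than R, it is not taken from the course or from Chang’s book, and the sample data and the function name box_plot_summary are made up for illustration. It assumes the common convention of whiskers that stop at the last point within 1.5 times the interquartile range.

```python
import statistics


def box_plot_summary(values):
    """Return the numbers a conventional box plot is drawn from."""
    data = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the box are drawn individually as outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),    # whiskers stop at the last non-outlier
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Made-up sample with one obviously extreme value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(sample)
print(summary)
```

In this made-up sample the value 30 falls outside the upper fence, so it would be plotted as a separate point rather than stretching the whisker.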

The second R text is Jared P. Lander’s R for Everyone: Advanced Analytics and Graphics.  This dives more deeply into using R for things other than data visualization and graphics, although it includes this as well.  This is a very easy to read and follow text.

The third R text is John Verzani’s Using R for Introductory Statistics: 2nd Edition.  This book is a very deep dive into R’s capability to do statistical analysis.  Although very detailed, it is understandable with great examples.

The last R text is downloadable from the site, Sarah Stowell’s Using R for Statistics.  This is also a very practical book on both statistics and visualization.

Don’t be overwhelmed by the number of texts and the amount of reading. It is doable, and I would do it all. If you do that, you will not be able to say you did not get your money’s worth.

In addition, there are beginning videos and lessons about learning R, including links to Lynda.com. There are weekly Calculations with R assignments, which include a video with examples. There are exercises with these weekly assignments as well. Finally, there are R lessons which take you through learning R in an organized manner.

Sync Sessions and Videos

Professor Sanford holds a sync session every other week.  These are extremely informative and helpful.  You don’t have to watch live, but you need to watch later.  The sync sessions in Predict 400 were optional and you could get by fine without watching them.  Not the case here.  You will learn a lot from these.

The same holds for the videos he has created to go along with the weekly R exercises.  These are must watch videos.

Data Analysis Projects

There are two data analysis projects. You will learn how to apply what you are learning to a hypothetical data analysis project. These are pretty challenging, but VERY worthwhile. They show the applied focus of the MSPA program, and I found them beneficial. The first one focused on doing some exploratory data analysis. The second one was twice as long as the first, and you applied what you learned later in the course, including the creation of a linear regression model. You will definitely want to start early on these, and put in the effort to do them correctly, as together they constitute 2/5 of your grade.
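To give a feel for what “creating a linear regression model” involves, here is a minimal sketch of a one-predictor least-squares fit. It is in Python, not the R used in the course (where a function such as lm() would normally do this work), and the data and the function names fit_line and r_squared are hypothetical, not part of the actual project.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variation in y that the fitted line explains."""
    mean_y = statistics.mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
fit_quality = r_squared(xs, ys, slope, intercept)
print(slope, intercept, fit_quality)
```

Real project data will not sit perfectly on a line, so the R-squared value will be below 1 and the residuals are worth plotting as part of the analysis.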

Bi-weekly Tests

There are 4 bi-weekly tests which are very fair and doable.  Together they constitute 1/5 of your grade.

Final Exam

The final exam is also very fair and doable.  Much easier if you have paid attention to learning R, as you can use R to do the exam.  This is 1/5 of your grade.

Communications and Discussions

There are Communications discussion sections set up for statistics and R. You can post a question anytime in either and get a rapid response from either Prof. Sanford or the R TA. Our R TA was Todd Peterson, and he was extremely knowledgeable, helpful, and responsive.

Every week there are two discussions around topics you are learning.  These are student driven, and if taken seriously, you can learn a lot from each other.  There are some extremely bright and talented students in these classes who have great real world experience in a variety of sectors.   The final discussion section is a recap of what you learned that week, and Prof. Sanford participates in that discussion.

Overall

I spent between 20 and 30 hours per week doing the coursework. You wouldn’t have to spend that much time, especially if this material is not new for you. But I wanted to really learn the material, not just pass the class.

I really enjoyed this course on many fronts.  I found learning about statistics and R together was very complementary.  In fact, I cannot imagine doing any kind of statistical analysis without using a language such as R.  I am now trying to recreate what I learned in R using Python.  I really feel as if I got my money’s worth.

 

 

Data Science

Using Jupyter Notebooks to learn R, Python

I love using Jupyter Notebooks to learn R and Python. I only wish I had discovered them when I first started to learn Python. The notebooks are a great way to take notes, run code, see the output of the code, and then visualize the output. The notebooks can be organized by language (i.e. Python vs. R), and also by the course you are taking or the book you are working your way through. You can then go back and view your notes and code for future reference.

Project Jupyter was developed from the IPython Project in 2014, and IPython notebooks are now Jupyter notebooks. Jupyter Notebooks are described as “a web application for interactive data science and scientific computing”. These notebooks support over 40 programming languages, and you can create notebooks with a Python kernel, or ones with an R kernel, amongst others. These are great for learning programming languages, and several academic institutions are using them in their CS courses. They are also great for “reproducibility”, that is, the ability to reproduce the findings that other people report. By publishing the notebook on GitHub, Dropbox, or Jupyter Notebook Viewer, others can see exactly what was performed, and run the code themselves.

Here is how I use Jupyter Notebooks. When I start a new course, whether an official course in my Northwestern University Master of Science in Predictive Analytics program, a web-based course like the ones I have been taking from DataCamp and Udemy, or a book that I am working my way through, I create a new Jupyter notebook.

You first have to open up a Jupyter Notebook by typing “jupyter notebook” in your shell (I use Windows PowerShell). This then opens up a browser page, “Home”.

[Screenshot: the Jupyter “Home” page]

If I want to open up an existing notebook, I scroll down to the  notebook of interest and open it.  Here is a screen shot showing some of my notebooks.

[Screenshot: some of my notebooks]

If I want to start a new notebook, I go to the top, select “New”, and then either a Python or R notebook.  They come with the Python kernel installed (you go to IRkernel on GitHub to install the R kernel).  This opens up a new notebook.

[Screenshot: a new notebook]

You type commands or text into “cells” and can run the cells individually or all together.  The two most common cells I use are “Markdown” and “Code”.  You do have to learn a few easy Markdown commands, for headers/etc.  The Markdown cells are used for taking notes, and inserting text.  The Code cells are used to input and run the code.

[Screenshot: Markdown and Code cells]

Once you have inputted your code, you can run the cell several ways.  The most convenient is to hit “Shift-Enter”, which will run the code in that cell, and bring up a new blank cell.
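If you are curious what a notebook with one Markdown cell and one Code cell looks like on disk, here is a small sketch that builds one by hand. It assumes the version 4 notebook JSON layout and uses only Python’s standard library. The helper names and the commented-out file name are hypothetical, and in practice you would simply let Jupyter create the file for you.

```python
import json


def markdown_cell(text):
    """A cell for notes, rendered as formatted text."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(code):
    """A cell holding code; outputs stay empty until the cell is run."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": code,
    }


# Assumption: a bare-bones version 4 notebook with no kernel metadata.
notebook = {
    "cells": [
        markdown_cell("# Week 1 notes\nA header plus some plain text."),
        code_cell("print('hello from a code cell')"),
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

notebook_json = json.dumps(notebook, indent=1)
print(notebook_json)

# To try it, write the string to a file ending in .ipynb, for example:
# with open("week1_notes.ipynb", "w", encoding="utf-8") as f:
#     f.write(notebook_json)
```

Because the sketch leaves out kernel metadata, Jupyter may ask which kernel to use when the file is first opened.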

These are great for creating and saving visualizations, as you can make minor changes and then compare plots.  Here are a few examples.

[Screenshots: two example plots]

There are a few things that don’t run smoothly yet, like installing packages in R. I have found the easiest way to get a package is to install it using RStudio, and then use the library command in Jupyter to load it into the Jupyter notebook. Alternatively you could use the following command each time:

install.packages("package name", repos = c("https://rweb.crmda.ku.edu/cran/"))  # You can select your own CRAN mirror and put its URL in the repos argument.

Overall, I love using Jupyter to take notes, run code while learning, and organize my learning so I can easily find it later. I see its huge potential in sharing data and in being able to easily reproduce results. Give it a try!