Data Science, Northwestern University MSPA

Python Tops KDNuggets 2017 Data Science Software Poll

The results of KDNuggets’ 18th annual Software Poll should be fascinating reading for anyone involved in data science and analytics.  Some highlights: Python (52.6%) finally overtook R (52.1%), SQL remained at about 35%, and Spark and TensorFlow both rose above 20%.

[Figure: KDNuggets 2017 software poll results]

(Graph taken from http://www.kdnuggets.com/2017/05/poll-analytics-data-science-machine-learning-software-leaders.html/2)

I am about halfway through Northwestern University’s Master of Science in Predictive Analytics (MSPA) program.  I am very thankful that the program has made learning different languages a priority.  I have already learned Python, Jupyter Notebooks, R, SQL, some NoSQL (MongoDB), and SAS.  In my current class on Generalized Linear Models, I have also started to learn Angoss, SAS Enterprise Miner, and Microsoft Azure Machine Learning.  However, it looks like you can never stop learning new things: I am going to have to learn Spark and TensorFlow, to name just two more.

I highly recommend you read this article.

 

Northwestern University MSPA, Predictive Analytics

Northwestern University MSPA Predict 410, Regression and Multivariate Analysis Course Review

This was the most demanding of Northwestern University’s MSPA (Master of Science in Predictive Analytics) courses I have taken so far, and also the most rewarding.   This course is the backbone of the predictive analytics program and foundational to becoming a predictive modeler.  The course covers Linear Regression (Simple Linear Regression and Multiple Linear Regression) and Multivariate Analysis (Principal Component Analysis, Factor Analysis, and Cluster Analysis).

This course is an applied course, so you have to understand the mathematics, but don’t have to do in-depth calculations using matrix algebra (it could be much, much worse!).  This is in keeping with the philosophy that the MSPA program is an applied program, preparing students to go out and start working in this field.

I took this course from Professor Syamala Srinivasan, Ph.D.  Northwestern’s professors are already very high quality, and she is at the top of that list.  I would highly recommend her if you are considering this class.  She has gone above and beyond with her textbook lectures, additional lectures on topics of interest, SAS tutorials, SAS demos for each week’s assignments, sync sessions, and responses to questions both on the discussion boards and by e-mail.  Her level of involvement in creating the course work and in teaching the class is phenomenal.  I can’t say enough good things about her.

The course structure is as follows.  Every week there is required reading from a variety of textbooks and articles from the library.  There are PowerPoint lectures with audio for each textbook chapter.  In addition, there are usually several other special topic PowerPoint/audio lectures.  A recorded video session then goes over the assignment for the week and the SAS code used for the assignment.  Participation in the week’s discussion board is mandatory and extremely helpful.  The assignments build upon each other and get more complex.  There are intermittent quizzes.  The final exam has two parts: a take-home exam and an online-proctored exam.

Now for the particulars.  This course isn’t for those who are already pressed for time.  I would not recommend taking a second course with this one unless you have a lot of spare time.  I spent a good 20-30 hours per week on this course, and wished I had more time to devote to it.  I read almost every mandatory and optional reading assignment, so you could cut corners and devote less time, but I would worry about not learning this foundational material.

Textbooks

Regression Textbooks Required

  1. Montgomery, D.C., Peck, E.A., and Vining, G.G. (2012). Introduction to Linear Regression Analysis. (5th Edition). New York, NY: Wiley [ISBN-13: 978-0470542811]

This textbook is the main one used throughout the course.  It has sections that are difficult to get through, but the foundational material is there.

2.  Everitt, B. (2009). Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences. Boca Raton, FL: CRC Press [ISBN-13: 978-1439807699]

This is a good supplement to the Montgomery textbook, with coding examples – both in the book and on-line – using R.

Regression Textbooks Optional

  1.  Pardoe, I. (2012). Applied Regression Modeling. (2nd Edition). New York, NY: Wiley [ISBN-13: 978-1118097281]

This was my favorite textbook.  It is definitely easier to understand than the others and is written from an applied standpoint.

2.  Ryan, A. G., Montgomery, D.C., Peck, E.A., and Vining, G.G. (2013). Solutions Manual to Introduction to Linear Regression Analysis. New York, NY: Wiley [ISBN-13: 978-1118471463]  To be honest, I didn’t use this a lot.

3.  Sheather, S. (2009). A Modern Approach to Regression with R. Springer [ISBN-13 978-1441918727]  I didn’t use this at all, but it will be handy when working through real world problems using R later.

 

SAS Textbooks

  1. Cody, R. (2011). SAS Statistics By Example. Cary, N.C.: SAS Publishing. [ISBN-13 978-1607648000]
  2. Delwiche, L., and Slaughter, S. (2012). The Little SAS Book: A Primer. (5th Edition). Cary, NC: SAS Publishing. [ISBN-13: 978-1612903439]

I used both of these books as references a fair amount.

In addition there were quite a few reference articles in the library.  Some of these were very good, some were very detailed.

SAS

This course uses SAS for all analysis and visualization.  You could use R, but the course is built around SAS.  I will say I came into the course with a bias against SAS.  This was mainly from ignorance, but also because of the cost of the license and the broader move away from closed systems like SAS toward open ones like Python and R.  (I am a huge Python proponent.)  However, I have come to like SAS for how easy it was to learn, and how easy it makes data analysis and visualization.

It is imperative to start learning SAS before the course starts.  You will get an early e-mail and syllabus from Dr. Srinivasan listing what you need to study.  There are SAS tutorials and readings.  I also used the training offered by SAS itself: I completed the on-line SAS Programming 1: Essentials e-Course, which was very helpful.  There are also multiple additional free courses that you can take.

You can use SAS through the SSCC (Social Sciences Computing Cluster, no additional charge), through the web-based SAS Studio (no additional charge), or you can purchase a license.  I exclusively used SAS Studio and had no problems.

Coursework

The Learning Goals of the course are:

  • Develop statistical modeling as a three step process consisting of: (1) exploratory data analysis, (2) model identification, and (3) model validation.
  • Understand how to use automated variable selection as a tool for model identification and as a tool for exploratory data analysis in the presence of a large number of predictor variables or a set of unlabeled predictors.
  • Develop a working understanding of the conceptual (theoretical) foundations of linear regression, principal components analysis, factor analysis, and cluster analysis with the objective of being capable of applying these techniques appropriately and validating their results.
  • Develop a conceptual and practical understanding of the difference between statistical inference and predictive modeling and how it affects our choices and actions in the statistical modeling process.
  • Learn the basics of the SAS Data Step, data manipulation with SAS, and SAS procedures (PROCS) for fitting statistical models.
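
To make the three-step process in the first learning goal concrete, here is a minimal sketch in Python rather than the SAS used in the course.  Everything in it is my own assumption for illustration: the data is synthetic (a made-up line y = 3 + 2x plus noise), the helper names are hypothetical, and a simple one-predictor regression stands in for the multiple regression models built in the assignments.

```python
import random
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mean_squared_error(x, y, intercept, slope):
    """Average squared residual of the fitted line on the given data."""
    return mean((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Synthetic data (an assumption for illustration): y = 3 + 2x + noise.
rng = random.Random(410)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [3 + 2 * xi + rng.gauss(0, 1) for xi in x]

# Step 1: exploratory data analysis (here, just a few summary statistics).
summary = {"n": len(x), "x_mean": mean(x), "y_mean": mean(y)}

# Step 2: model identification, fitting on a training split only.
train_x, train_y = x[:150], y[:150]
test_x, test_y = x[150:], y[150:]
b0, b1 = fit_simple_ols(train_x, train_y)

# Step 3: model validation, comparing in-sample and out-of-sample error.
train_mse = mean_squared_error(train_x, train_y, b0, b1)
test_mse = mean_squared_error(test_x, test_y, b0, b1)

print(summary)
print(f"intercept={b0:.2f} slope={b1:.2f}")
print(f"train MSE={train_mse:.2f} test MSE={test_mse:.2f}")
```

The point of the split is the one the course stresses: a model is judged on data it was not fit to, which is where predictive modeling parts ways with pure statistical inference.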

Weekly Reading and Video assignments

Each week there are required textbook readings, optional textbook readings, course reserve readings, and lecture videos.  The weekly videos are PowerPoint presentations with audio, and go over the textbook readings for that week.  In addition, there are other lectures on special topics.

The special topics include:

  • Statistical Preliminaries and Notation
  • Statistical Assumptions for OLS Regression
  • Estimation and Inference for OLS Regression
  • Analysis of Variance and Related Topics in OLS Regression
  • Hat Matrix Lecture
  • Statistical Inference vs Predictive Modeling in OLS Regression
  • Special Topic: Dummy Variables Hypothesis Testing
  • Special Topic Lecture (Degrees of Freedom)
  • Special Topic Lecture (Likelihood Function)
  • Special Topic Lecture (Mallows’ Cp)
  • Hypothesis Testing Multiple Linear Regression
  • Factor Analysis Example Lecture
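
As a small illustration of the hat matrix topic in the list above, here is a sketch in Python (not the course’s SAS code).  The hat matrix is H = X(X'X)^-1 X', and its diagonal entries are the leverages of the observations.  For a simple linear regression with an intercept, the diagonal has a closed form, h_ii = 1/n + (x_i - x_bar)^2 / Sxx, which is what the sketch computes.  The data values and the function name are made up for illustration.

```python
from statistics import mean


def leverages(x):
    """Diagonal of the hat matrix for a simple regression of y on x with an intercept."""
    n = len(x)
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Made-up predictor values; the last one sits far from the others.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverages(x)

# The leverages sum to the number of model parameters (intercept + slope = 2).
p = 2
cutoff = 2 * p / len(x)  # a common rule of thumb for flagging high leverage

for xi, hi in zip(x, h):
    flag = "high leverage" if hi > cutoff else ""
    print(f"x={xi:5.1f}  h={hi:.3f}  {flag}")
```

Leverage depends only on the predictor values, not on the response, so a point like x = 20 is flagged before any model is even fit.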

 

Sync Sessions

There are a total of 4 Sync sessions.  These are invaluable, as Dr. Srinivasan reviews the recent material and then puts it all into a larger context.

Assignments

There are a total of 8 assignments.  Each combines using SAS for analysis and visualization with a written analysis of the results.  The code is pretty much already written.  You will have to make a few modifications, but the focus is on using SAS: the assignments are designed to test your ability to perform regression and multivariate analysis, not to make you struggle to produce code from scratch.  This was a very nice feature.  Each week there is a SAS demo video lecture where the Professor runs through the code and the assignment, which is extremely helpful.

Here are the titles of the assignments:

Assignment 1: Getting to know your data.

Assignment 2: Regression model building.

Assignment 3: Data analysis and regression.

Assignment 4: Statistical inference in linear regression.

Assignment 5: Automated variable selection, multicollinearity, and predictive modeling.

Assignment 6: Principal components in predictive modeling.

Assignment 7: Factor analysis.

Assignment 8: Cluster analysis.

 

Discussion Boards

These were extremely active.  You have to answer 3 questions posed by Professor Srinivasan, and then actively engage in discussions around what other people posted.  The questions were relevant and helped the learning process, and the discussions that followed added to it.

Follow up by Dr. Srinivasan

Each week she would send out several e-mails – on how the assignments went, to clarify issues presented in the discussion boards, and to follow up on quizzes.  These were very helpful.

Quizzes and Tests

There were a total of 5 open-book quizzes.  These were very doable, but somewhat demanding.

There were 2 final examinations: a take-home exam (1 hour) and a proctored 2-hour exam.  These were challenging but doable.

Final Thoughts

This has been the highlight of the MSPA courses so far, as this course is the foundation for building predictive models.  The other courses I have taken were a lead-up to this one.  Dr. Srinivasan has gone above and beyond and delivers a high quality product.  My favorite course and Professor so far.