Data Scientist, Northwestern University MSDS Program, Northwestern University MSPA

Northwestern University’s Master of Science in Predictive Analytics (MSPA) becomes the Master of Science in Data Science (MSDS)

Starting in the Spring Quarter of 2018, the MSPA (Master of Science in Predictive Analytics) program became the MSDS (Master of Science in Data Science) program. This was announced in January of 2018, and the name change became official in the Spring Quarter of 2018. Existing MSPA students had the option of staying in the MSPA program with its requirements, or transferring over to the MSDS program. I elected to transfer to the MSDS program. There is a webex on the MSDS program – click here for the webex.

In the webinar, Dr. Thomas Miller, the faculty director of the MSPA and now the MSDS program, related that Northwestern University’s MSPA program started in the fall of 2011, before data science was a widely known or used term. Since then, the term has become mainstream, and data science has emerged as a discipline in its own right. This prompted the decision to change the name of the program.

Data science was described by Dr. Miller as “an emerging, integrative academic discipline” encompassing Business needs (strategy, management, leadership, communication skills), Modeling (statistics, machine learning, and model building), and Information Technology (databases, etc).  Each of these is covered in the MSDS program.

Dr. Miller also commented that the main programming language moving forward would be Python. Initially, when the program was formed, SAS and SPSS were the main languages. Python and R were brought in at a later date. R will still be used in some courses in the Analytics and Modeling Specialization. He did not make it clear whether SAS would still be an option, though.

MSDS Program Overview

You need to successfully complete 12 courses.  There are core courses, elective courses, and specialization options.

Core Courses

MSDS 400 – Math for Data Scientists

MSDS 401 – Statistical Analysis

MSDS 402 – Introduction to Data Science

MSDS 420 – Database Systems and Data Preparation

MSDS 422 – Practical Machine Learning

MSDS 460 – Decision Analytics

MSDS 475 – Project Management or MSDS 480 Business Leadership and Communications

MSDS 498 – Capstone or MSDS 590 – Thesis

 

A new elective was created for students with limited programming background:

MSDS 430 – Python for Data Science

Specializations

 

Analytics and Modeling Specialization

Designed for data scientists seeking technical roles as data analysts, applied statisticians, and modelers. Courses focus on statistical inference and applications of predictive models.

Required Courses:

MSDS 410 – Regression and Multivariate Analysis

MSDS 411 – Generalized Linear Models

Plus 2 electives

 

Data Engineering Specialization

Designed for students seeking technical positions focused on designing, developing, implementing, and maintaining systems for data science.

Required Courses:

MSDS 432 – Foundations of Data Engineering

MSDS 434 – Analytics Application Development

Plus 2 electives

 

Analytics Management Specialization

Designed for students seeking technical leadership and data science management positions.

Required Courses:

MSDS 474 – Accounting and Finance for Analytics Managers

MSDS 475 – Project Management

MSDS 480 – Business Leadership and Communications

(Students in this specialization have to take both 475 and 480)

Plus 2 electives

 

*Artificial Intelligence and Deep Learning Specialization

*This has not been officially announced – this information is from comments that Dr. Thomas Miller made during an MSDS 422 Sync session. He said that this specialization is being developed – so take these comments as being preliminary. I personally am really excited about this specialization, as I just finished MSDS 422 – Practical Machine Learning – and realize the growing importance of machine learning now and in the future.

Required Courses:

MSDS 453 – changing from Text Analytics to Natural Language Processing

MSDS 458 – Artificial Intelligence and Deep Learning

Plus 2 electives

These new electives are being created:

Computer Vision

Software Robotics

 

Listing of all current elective courses:

MSDS 410 – Regression Analysis

MSDS 411 – Generalized Linear Models

MSDS 413 – Time Series Analysis and Forecasting

MSDS 430 – Python for Data Science

MSDS 432 – Foundations of Data Engineering

MSDS 434 – Analytics Application Development

MSDS 436 – Analytics Systems Analysis

MSDS 450 – Marketing Analysis

MSDS 451 – Financial and Risk Analytics

MSDS 452 – Web and Network Data Science

MSDS 453 – Text Analytics – soon to become Natural Language Processing

MSDS 454 – Data Visualization

MSDS 456 – Sports Performance Analytics

MSDS 457 – Sports Management Analytics

MSDS 458 – Artificial Intelligence and Deep Learning

MSDS 459 – Information Retrieval and Real-Time Analytics

MSDS 470 – Analytics Entrepreneurship

MSDS 472 – Analytics Consulting

MSDS 474 – Accounting and Finance for Analytics Managers

MSDS 490 – Special Topics in Data Science

 

 

 

 

Machine Learning, Northwestern University MSDS Program, Northwestern University MSPA

Northwestern University MSDS (formerly MSPA) 422 – Practical Machine Learning Course Review

This course was taught by Dr. Thomas Miller, who is the faculty director of the Data Science program (formerly known as the Predictive Analytics program – I am going to post an article discussing the program name change from the Master of Science in Predictive Analytics (MSPA) to the Master of Science in Data Science (MSDS)).  Overall, this was an excellent review of machine learning, and is a required core course for all students in the program.  It is most definitely a foundational course for any student of data science in today’s world.  It is also a foundational course for the Artificial Intelligence and Deep Learning specialization, which is currently being developed (more on this in a subsequent post as well).  The course covers the following topics:

  • Supervised, Unsupervised, and Semi-supervised learning
  • Regression versus Classification
  • Decision Trees and Random Forests
  • Dimensionality Reduction techniques
  • Clustering Techniques
  • Feature Engineering
  • Artificial Neural Networks
  • Deep Neural Networks
  • Convolutional Neural Networks (CNN)
  • Recurrent Neural Networks (RNN)

This course uses Python and the Python Libraries Scikit-Learn and TensorFlow. In addition to using Jupyter Notebooks to run my code, I also learned how to run TensorFlow from the Command Line, which is a faster way of running neural networks through a large number of epochs. The course is currently offered in R as well, but they will be discontinuing the R course, and only offering the Python/TensorFlow course starting in the fall semester.   Dr. Miller commented that they will be using Python much more extensively going forward, especially in the AI/Deep Learning specialization courses.  R apparently will still be offered in the Analytics/Modeling courses – 410 (Regression Analysis) and 411 (Generalized Linear Models).   I did learn to use Python/Scikit-Learn/TensorFlow at an intermediate level, and feel like I have a great foundation to build upon, in terms of programming.

Course Structure

There is required reading every week, mainly from the two required textbooks, although there are a few articles to read as well.  There were a total of 5 sync sessions which reviewed various topics.   I wish the sync sessions had been a little more robust, and covered the current assignments and the coding required to complete the assignments.  I found this very helpful in previous courses.  There were weekly discussion board assignments, which covered basic concepts, and turned out to be very informative, especially since a lot of the topics covered on the final exam were covered in these discussions.  There are weekly assignments which must be completed, in which you either develop the code yourself, or use a skeletal code base provided and build upon it.   These ranged from very easy to very difficult, especially as you moved into the artificial neural networks.  There was a non-proctored final exam and a proctored final exam.

Textbooks

Primary Textbooks:

Géron, A. 2017. Hands-On Machine Learning with Scikit-Learn & TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Sebastopol, Calif.: O’Reilly. [ISBN-13 978-1-491-96229-9] Source code available at https://github.com/ageron/handson-ml  This was the primary textbook for most of the course.  It is an excellent text with lots of great coding examples.

Müller, A. C. and Guido, S. 2017. Introduction to Machine Learning with Python: A Guide for Data Scientists. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1449369415] Code examples at https://github.com/amueller/introduction_to_ml_with_python

Reference Textbook:

Izenman, A. J. 2008. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. New York: Springer. [ISBN-13: 978-0-387-78188-4] This was used very little.

Learning Outcomes (from syllabus):

Practical Machine Learning is a survey course with a long list of learning outcomes:

  • Explain the learning algorithm trade-offs, balancing performance within training data and robustness on unobserved test data.
  • Distinguish between supervised and unsupervised learning methods.
  • Distinguish between regression and classification problems
  • Explain bootstrap and cross-validation procedures
  • Explore and visualize data and perform basic statistical analysis
  • List alternative methods for evaluating classifiers.
  • List alternative methods for evaluating regression
  • Demonstrate the application of traditional statistical methods for classification and regression
  • Demonstrate the application of trees and random forests for classification and regression
  • Demonstrate principal components for dimension reduction.
  • Demonstrate principal components regression
  • Describe hierarchical and non-hierarchical clustering techniques
  • Describe how semi-supervised learning may be utilized in addressing classification and regression problems
  • Explain how measurement and feature engineering are relevant to modeling
  • Describe how artificial neural networks are constructed from logical connections of artificial neurons and activation functions
  • Demonstrate the use of artificial neural networks (including deep neural networks) in classification and regression
  • Describe how convolutional neural networks are constructed
  • Describe how recurrent neural networks are constructed
  • Distinguish between autoencoders and other forms of unsupervised learning
  • Describe applications of autoencoders
  • Explain how the results of machine learning can be useful to business managers
  • Transform data and research results into actionable insights
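One of the outcomes listed above is explaining cross-validation procedures. As a rough illustration (not taken from the course materials), here is a minimal sketch of k-fold cross-validation using only the Python standard library. The function names `k_fold_indices` and `mean_predictor_cv_error`, the toy data, and the trivial "always predict the training mean" model are all assumptions made for the example; in the course itself you would normally use Scikit-Learn's tools instead.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def mean_predictor_cv_error(y, k=5):
    """Average held-out squared error of a model that always predicts the training mean."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        train_mean = sum(y[j] for j in train_idx) / len(train_idx)
        mse = sum((y[j] - train_mean) ** 2 for j in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)


# Toy target values, made up for illustration.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
print(mean_predictor_cv_error(y, k=5))
```

The point of the sketch is that every observation is held out exactly once, so the averaged error estimates performance on data the model did not see during fitting.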

 

Weekly Assignments

Here are the weekly learning titles and assignments:

Week 1.  Introduction to Machine Learning

  • Assignment 1. Exploring and Visualizing Data

Week 2.  Supervised Learning for Classification

  • Assignment 2. Evaluating Classification Models

Week 3.  Supervised Learning for Regression

  • Assignment 3. Evaluating Regression Models

Week 4. Trees and Random Forests

  • Assignment 4. Random Forests

Week 5.  Unsupervised Learning

  • Assignment 5. Principal Components Analysis

Week 6. Neural Networks

  • Assignment 6. Neural Networks

Week 7.  Deep Learning for Computer Vision

  • Assignment 7. Deep Learning

Week 8.  Deep Learning for Natural Language Processing

  • Assignment 8. Natural Language Processing

Week 9.  Neural Networks Autoencoders

  • No assignment

 

Final Examinations

There were 2 final examinations, one being non-proctored and the other proctored.  The non-proctored exam was open book, and tested your ability to look at data and the various analytical techniques, and interpret the results of the analyses.  The proctored final exam was closed book and covered general concepts.

Final Thoughts

This was a great overview of some of the more important topics in machine learning.  I was able to get a good theoretical background in these topics, and learned the coding necessary to perform these.   This is a great foundation upon which to add more advanced and in-depth use of these techniques.  This course really challenged me to rethink what analytical techniques I should be learning and applying in the future, to the point that I am going to change my specialization to Artificial Intelligence and Deep Learning.

 

Data Science, Northwestern University MSPA

Python Tops KDNuggets 2017 Data Science Software Poll

The results of KDNuggets’ 18th annual Software Poll should be fascinating reading for anyone involved in data science and analytics. Some highlights – Python (52.6%) finally overtook R (52.1%), SQL remained at about 35%, and Spark and TensorFlow have increased to above 20%.


(Graph taken from http://www.kdnuggets.com/2017/05/poll-analytics-data-science-machine-learning-software-leaders.html/2)

I am about halfway through Northwestern University’s Master of Science in Predictive Analytics (MSPA) program. I am very thankful that the program has made learning different languages a priority. I have already learned Python, Jupyter Notebooks, R, SQL, some NoSQL (MongoDB), and SAS. In my current class in Generalized Linear Models, I have also started to learn Angoss, SAS Enterprise Miner, and Microsoft Azure machine learning. However, it looks like you can’t ever stop learning new things – and I am going to have to learn Spark and TensorFlow – to name a few more.

I highly recommend you read this article.

 

Becoming a Healthcare Data Scientist, Data Scientist, Healthcare Predictive Analytics, Northwestern University MSPA

Physician Data Scientist Part II. The Why.

I was recently reminded by a reader of my blog (thanks Al) that I had not followed up on a comment that I was going to post a second part to a blog that was posted on 7.7.2015 – “Physician Data Scientists – Why and What Type? Part I“.  Now that I am in between classes, I have the time to work on this.   Looking back at this original post, I am somewhat amazed at all that has happened in the last 1 1/2 years.

I am currently the interim Chief Information Officer (CIO) and Chief Medical Information Officer (CMIO) for our integrated healthcare system. I stepped into the interim CIO role (helped in part by my Northwestern University MSPA Master of Science in Predictive Analytics coursework) after the departure of our previous CIO last year. Prior to that I had been one of our system’s CMIOs – facilitating and communicating the needs for technology to help improve clinical outcomes to IT, while communicating back to Physicians and Leadership the limitations of current technologies. I never really aspired to become either the interim CIO or a CMIO; these opportunities simply arose because of my journey to become better educated about the use of data and analytics to improve clinical outcomes – i.e., to become a Physician Data Scientist. I will explain how I ended up in my current role.

My interest in data and analytics is a fairly recent phenomenon, occurring because of a chance meeting with someone who has since become one of my closest friends – Curt Lindberg – who has a PhD in Complexity Science, and is the Director of our Complexity in Healthcare Center.  I met him during a project to improve our process for getting patients into our healthcare system from outside facilities more efficiently.  At that time I was a practicing Emergency Physician and the Medical Director of our MedFlight Air Ambulance service.  Curt introduced me to complexity science and my life has not been the same – it was a transformational career moment for me.  I ended up being part of a small group of researchers who were trying to develop smarter patient monitoring systems.  Their work has inspired me to try and contribute in my own way to this field – called predictive monitoring.

Predictive monitoring is an unofficial term for what this group is trying to accomplish. While the technology inside the monitors has changed drastically since the 1970s, what the monitors do has not. These monitors display certain physiologic markers of interest – blood pressure, pulse rate, temperature, oxygen level, EKG pattern, etc. You can see what is happening to the patient right at that time, or you can go back and review what happened to them in the past (minimally), but there is no information predicting what will happen to them in the future (are they predicted to get better, go into sudden cardiac arrest, stop breathing, or develop an overwhelming infection called sepsis, etc). The goal is to incorporate predictive algorithms into these monitoring systems.

I have been fortunate to meet some giants in this field. Dr. J. Randall Moorman from the University of Virginia developed the first commercial predictive monitoring system – the HeRO monitor. The largest ever randomized clinical trial in neonatal patients (premature babies) was conducted using this monitor. It showed that the monitor was able to identify certain physiological patterns, and translate those patterns into a risk for developing an overwhelming infection (late onset neonatal sepsis). This risk was detected an average of 18 hours before a clinical diagnosis was made, allowing for earlier treatments and interventions. This translated into a 22% reduction in mortality. Dr. Andrew Seely is a Thoracic Surgeon at the University of Ottawa who has developed a model to predict the success of removing a breathing tube from a patient and not having to replace it because they weren’t ready to have it removed. We got to participate in that clinical trial. We also got to participate in a trial conducted by Ryan Arnold, now at Christiana Care in Newark, Delaware, on trying to predict clinical outcomes using heart rate variability analyses.

In addition to collaborating with these researchers working on their projects, I became especially fascinated with a research article written by one of the country’s leading trauma surgeons, Dr. Mitchell Cohen, and his colleagues at San Francisco General Hospital and the University of California San Francisco – Identification of complex metabolic states in critically injured patients using bioinformatic cluster analysis. I will confess that I felt frustrated when I talked with the researchers about the underlying mathematical concepts and analytical techniques they were using, because I just did not understand them well. This ignorance ignited what I will freely admit is now an obsession to understand these concepts and techniques.

I started off trying to educate myself using textbooks, taking on-line MOOCs (Massive Open Online Courses), and enrolling in courses offered on the web. I still felt very frustrated because these courses didn’t really go into the depth that I thought I needed. When I looked at the giants in this field of predictive analytics, these few researchers seemed to have both the clinical knowledge and understanding of why this research was so important, and they were also able to understand the mathematical and analytical concepts and techniques necessary to do research in this field. I wanted to be like them.

I became very interested in becoming a data scientist at that point.  I eventually enrolled in Northwestern University’s Master of Science in Predictive Analytics (MSPA) program.  I have not regretted this decision.  I currently am halfway through the program, and am finally into the especially relevant coursework.  I just finished the major foundational course – Linear Regression and Multivariate Analysis.  The courses up until then had been preparing me to take this course.   I realized I had come full circle when I re-read Mitchell Cohen’s article, and realized that I now finally understood the concepts and results.  That was an extremely satisfying moment for me.

This has been quite the educational journey for me.   I feel like I have a much better understanding of statistics. I am getting somewhat competent in a few programming languages – R, Python, and SAS.  I am using Jupyter Notebooks for my programming work.   I have dabbled with data science platforms like KNIME, and this quarter will be learning to use virtual machines, IBM Watson Analytics, ANGOSS, and Microsoft Azure machine learning – as part of my next class on Generalized Linear Models.

I finally feel as if I am able to start applying what I have been learning for the last 1 1/2 years – to start developing predictive models to improve clinical outcomes. A few of my goals are to help our organization become more data driven, and to continue to work on developing predictive algorithms that could be incorporated into bedside monitoring systems, further improving the outcomes of patients.

This is my journey to date, from being a practicing Emergency Physician with no interest in data or analytics to where I am now, halfway finished with my Master’s program. The real journey of applying what I have learned to real world problems has just started but will get more robust as I learn more.

 

 

 

 

Northwestern University MSPA, Predictive Analytics

Northwestern University MSPA Predict 410, Regression and Multivariate Analysis Course Review

This was the most demanding of Northwestern University’s MSPA (Master of Science in Predictive Analytics) courses I have taken so far, and also the most rewarding.   This course is the backbone of the predictive analytics program and foundational to becoming a predictive modeler.  The course covers Linear Regression (Simple Linear Regression and Multiple Linear Regression) and Multivariate Analysis (Principal Component Analysis, Factor Analysis, and Cluster Analysis).

This course is an applied course, so you have to understand the mathematics, but don’t have to do in-depth calculations using matrix algebra (it could be much, much worse!).  This is in keeping with the philosophy that the MSPA program is an applied program, preparing students to go out and start working in this field.

I took this course from Professor Syamala Srinivasan, Ph.D. She is at the top of the list of Northwestern Professors, who already are very high quality. I would highly recommend her if you are considering this class. She has gone above and beyond, with her textbook lectures, additional lectures on topics of interest, SAS tutorials, SAS demos for each week’s assignments, sync sessions, and responses to questions both on the discussion boards and by e-mail. Her level of involvement in creating the course work and in teaching the class is phenomenal. I can’t say enough good things about her.

The course structure is as follows. Every week there is required reading from a variety of textbooks and articles from the library. There are PowerPoint lectures with audio for each textbook chapter. In addition there are usually several other special topic PowerPoint/audio lectures. A recorded video session then goes over the assignment for the week, and the SAS code used for the assignment. Participation in the week’s discussion board is mandatory and extremely helpful. The assignments build upon each other and get more complex. There are intermittent quizzes. The final exam has two parts: a take-home exam and an online-proctored exam.

Now for the particulars.   This course isn’t for those who are time challenged already.  I would not recommend taking a second course with this one, unless you have a lot of spare time.  I spent a good 20-30 hours per week on this course, and wished I actually had more time to devote to it.  I read almost every mandatory reading assignment and optional reading assignment, so you could cut corners and devote less time, but I would worry about not learning the content of this foundational material.

Textbooks

Regression Textbooks Required

  1. Montgomery, D.C., Peck, E.A., and Vining, G.G. (2012). Introduction to Linear Regression Analysis. (5th Edition). New York, NY: Wiley [ISBN-13: 978-0470542811]

This textbook is the main one used throughout the course.  It has sections that are difficult to get through, but the foundational material is there.

2.  Everitt, B. (2009). Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences. Boca Raton, FL: CRC Press [ISBN-13: 978-1439807699]

This is a good supplement to the Montgomery textbook, with coding examples – both in the book and on-line – using R.

Regression Textbooks Optional

  1.  Pardoe, I. (2012). Applied Regression Modeling. (2nd Edition). New York, NY: Wiley [ISBN-13: 978-1118097281]

This was my favorite textbook, and is definitely more understandable and is written from an applied standpoint.

2.  Ryan, A. G., Montgomery, D.C., Peck, E.A., and Vining, G.G. (2013). Solutions Manual to Introduction to Linear Regression Analysis. New York, NY: Wiley [ISBN-13: 978-1118471463]  To be honest, I didn’t use this a lot.

3.  Sheather, S. (2009). A Modern Approach to Regression with R. Springer [ISBN-13 978-1441918727]  I didn’t use this at all, but it will be handy when working through real world problems using R later.

 

SAS Textbooks

  1. Cody, R. (2011). SAS Statistics By Example. Cary, N.C.: SAS Publishing. [ISBN-13 978-1607648000]
  2. Delwiche, L., and Slaughter, S. (2012). The Little SAS Book: A Primer. (5th Edition). Cary, NC: SAS Publishing. [ISBN-13: 978-1612903439]

I used both of these books as references a fair amount.

In addition there were quite a few reference articles in the library.  Some of these were very good, some were very detailed.

SAS

This course uses SAS for all analysis and visualization.  You could use R, but the course is built around SAS.  I will say I came into the course with a bias against SAS (from ignorance mainly – but also due to the cost of the license for this, and a move away from these closed systems to open systems like Python and R.  I am a huge Python proponent.)  However, I have come to like SAS for how easy it was to learn, and how easy it is to do data analysis and visualization.

It is imperative to start learning SAS before the course starts. You will get an email and syllabus from Dr. Srinivasan early on, listing what you need to study. There are SAS tutorials and readings. I also did the learning within SAS. I completed the on-line SAS Programming 1: Essentials e-Course, which was very helpful. There are also multiple additional free courses that you can take.

You can use SAS through the SSCC – Social Sciences Computing Cluster (no additional charge), through the web based SAS Studio (no additional charge), or you can purchase a license.  I exclusively used SAS Studio and had no problems.

Coursework

The Learning Goals of the course are:

  • Develop statistical modeling as a three step process consisting of: (1) exploratory data analysis, (2) model identification, and (3) model validation.
  • Understand how to use automated variable selection as a tool for model identification and as a tool for exploratory data analysis in the presence of a large number of predictor variables or a set of unlabeled predictors.
  • Develop a working understanding of the conceptual (theoretical) foundations of linear regression, principal components analysis, factor analysis, and cluster analysis with the objective of being capable of applying these techniques appropriately and validating their results.
  • Develop a conceptual and practical understanding of the difference between statistical inference and predictive modeling and how it affects our choices and actions in the statistical modeling process.
  • Learn the basics of the SAS Data Step, data manipulation with SAS, and SAS procedures (PROCS) for fitting statistical models.
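The first learning goal describes statistical modeling as a three step process. The course does this in SAS, but the idea can be sketched in a few lines of Python using only the standard library. Everything below is an assumption made for illustration: the data are synthetic (generated from y = 3 + 2x plus noise), the hold-out rule (every fifth observation) is arbitrary, and the helper names `fit_ols` and `mse` are hypothetical.

```python
import random
import statistics

# Synthetic data (an assumption for illustration): y = 3 + 2x + noise.
rng = random.Random(42)
x = [i / 10 for i in range(100)]
y = [3.0 + 2.0 * xi + rng.gauss(0, 0.5) for xi in x]

# Step 1: exploratory data analysis - summarize the variables.
print("mean x:", statistics.mean(x), "mean y:", statistics.mean(y))

# Hold out every fifth observation for validation.
pairs = list(zip(x, y))
train = [p for i, p in enumerate(pairs) if i % 5 != 0]
test = [p for i, p in enumerate(pairs) if i % 5 == 0]


def fit_ols(data):
    """Closed-form least-squares intercept and slope for a single predictor."""
    mx = statistics.mean(px for px, _ in data)
    my = statistics.mean(py for _, py in data)
    sxy = sum((px - mx) * (py - my) for px, py in data)
    sxx = sum((px - mx) ** 2 for px, _ in data)
    slope = sxy / sxx
    return my - slope * mx, slope


# Step 2: model identification - fit the model on the training data.
intercept, slope = fit_ols(train)


# Step 3: model validation - measure error on the held-out data.
def mse(data):
    return statistics.mean((py - (intercept + slope * px)) ** 2 for px, py in data)


print("intercept:", intercept, "slope:", slope)
print("train MSE:", mse(train), "test MSE:", mse(test))
```

Comparing the training error with the held-out error is the simplest form of the validation step; a large gap between the two would suggest the model does not generalize.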

Weekly Reading and Video assignments

Each week there are required textbook readings, optional textbook readings, course reserve readings, and lecture videos.  The weekly videos are PowerPoint presentations with audio, and go over the textbook readings for that week.  In addition, there are other lectures on special topics.

The special topics include:

Statistical Preliminaries and Notation

Statistical Assumptions for OLS Regression

Estimation and Inference for OLS Regression

Analysis of Variance and Related topics in OLS Regression

Hat Matrix Lecture

Statistical Inference vs Predictive Modeling in OLS Regression

Special Topic: Dummy Variables Hypothesis Testing

Special Topic Lecture (Degrees of Freedom)

Special Topic Lecture (Likelihood Function)

Special Topic Lecture (Mallows’ Cp)

Hypothesis Testing Multiple Linear Regression

Factor Analysis Example Lecture

 

Sync Sessions

There are a total of 4 Sync sessions.  These are invaluable as Dr. Srinivasan reviews the recent material, but then puts it all into a larger context.

Assignments

There are a total of 8 assignments.  These are a combination of using SAS to do analysis and visualization, as well as having to provide an analysis of the produced outcomes.  The code is pretty much already written.  You will have to make a few modifications, but the focus is on using SAS, and the assignments are designed to test your ability to perform regression and multivariate analysis, not struggle producing code from scratch.  This was a very nice feature.  Each week there is a SAS demo video lecture where the Professor runs through the code and the assignment – extremely helpful.

Here are the titles of the assignments:

Assignment 1: Getting to know your data.

Assignment 2: Regression model building.

Assignment 3: Data analysis and regression.

Assignment 4: Statistical inference in linear regression.

Assignment 5: Automated variable selection, multicollinearity, and predictive modeling.

Assignment 6: Principal components in predictive modeling.

Assignment 7: Factor analysis

Assignment 8: Cluster analysis.

 

Discussion Boards

These were extremely robust.  You have to answer 3 questions posed by Professor Srinivasan, and then actively engage in discussions around what other people posted.  The questions were relevant and helped the learning process.  The discussions were robust and enhanced the learning.

Follow up by Dr. Srinivasan

Each week she would send out several e-mails – on how the assignments went, to clarify issues presented in the discussion boards, and to follow up on quizzes.  These were very helpful.

Quizzes and Tests

There were a total of 5 open book quizzes.  These were very doable,  but somewhat demanding.

There were 2 final examinations – a take home exam (1 hour) and a proctored 2 hour exam.  These were challenging but doable.

Final Thoughts

This has been the highlight of the MSPA’s courses so far, as this course is the foundation for building predictive models.  The other courses I have taken were a lead up to this course.  Dr. Srinivasan has gone above and beyond and delivers a high quality product.  My favorite course and Professor so far.

 

 

Data Science, Northwestern University MSPA, Uncategorized

DataCamp’s Importing Data in Python Part 1 and Part 2.

I recently finished these DataCamp  courses and really liked them.  I highly recommend them to students in general and especially to the students in Northwestern University’s Master of Science in Predictive Analytics (MSPA) program.

Importing Data in Python Part 1 is described as:

As a Data Scientist, on a daily basis you will need to clean data, wrangle and munge it, visualize it, build predictive models and interpret these models. Before doing any of these, however, you will need to know how to get data into Python. In this course, you’ll learn the many ways to import data into Python: (i) from flat files such as .txts and .csvs; (ii) from files native to other software such as Excel spreadsheets, Stata, SAS and MATLAB files; (iii) from relational databases such as SQLite & PostgreSQL.

Importing Data in Python Part 2 is described as:

As a Data Scientist, on a daily basis you will need to clean data, wrangle and munge it, visualize it, build predictive models and interpret these models. Before doing any of these, however, you will need to know how to get data into Python. In the prequel to this course, you have already learnt many ways to import data into Python: (i) from flat files such as .txts and .csvs; (ii) from files native to other software such as Excel spreadsheets, Stata, SAS and MATLAB files; (iii) from relational databases such as SQLite & PostgreSQL. In this course, you’ll extend this knowledge base by learning to import data (i) from the web and (ii) a special and essential case of this: pulling data from Application Programming Interfaces, also known as APIs, such as the Twitter streaming API, which allows us to stream real-time tweets.
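Most web APIs return JSON text, so the core of "importing from an API" is turning that text into Python objects.  The sketch below is my own illustration, not course material: the payload, its field names, and the parse_posts helper are all hypothetical, and no network call is made so that the example runs anywhere.

```python
import json

# Hypothetical payload, shaped like a typical JSON response from a web API.
response_text = """
{"results": [
  {"id": 1, "user": {"name": "ana"}, "text": "Learning Python"},
  {"id": 2, "user": {"name": "raj"}, "text": "Importing data"}
]}
"""


def parse_posts(raw):
    """Turn raw JSON text into a flat list of (id, user name, text) tuples."""
    payload = json.loads(raw)
    return [(p["id"], p["user"]["name"], p["text"]) for p in payload["results"]]


posts = parse_posts(response_text)
for post in posts:
    print(post)
```

In a live setting the raw text would come from an HTTP request (for example with urllib.request.urlopen from the standard library) rather than from a string literal.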

 

Data Science, Northwestern University MSPA

Learning to Use Python’s SQLAlchemy in DataCamp’s “Introduction to Databases in Python” – useful for students taking Northwestern’s MSPA Predict 420.

I just completed DataCamp’s course titled “Introduction to Databases in Python”.  This is a very informative course, and is actually one of the few tutorials on SQLAlchemy that I have run across.

I just finished Northwestern University’s MSPA (Master of Science in Predictive Analytics) Predict 420 class – Database Systems and Data Preparation Review, and I wish I had taken DataCamp’s course first.  It would have helped tremendously.  You have the opportunity to use SQLAlchemy to interact with SQL databases in Predict 420, but I looked and could not find a really good tutorial on it until I ran across DataCamp’s course, after I had finished Predict 420.  I highly recommend this DataCamp course to other MSPA students.

Introduction to Databases in Python is divided up into 5 sections, with the course’s description of each section attached.

  1. Basics of Relational Databases.  In this chapter, you will become acquainted with the fundamentals of Relational Databases and the Relational Model. You will learn how to connect to a database and then interact with it by writing basic SQL queries, both in raw SQL as well as with SQLAlchemy, which provides a Pythonic way of interacting with databases.
  2. Applying Filtering, Ordering, and Grouping to Queries.  In this chapter, you will build on the database knowledge you began acquiring in the previous chapter by writing more nuanced queries that allow you to filter, order, and count your data, all within the Pythonic framework provided by SQLAlchemy!
  3. Advanced SQLAlchemy Queries.  Herein, you will learn to perform advanced – and incredibly useful – queries that will enable you to interact with your data in powerful ways.
  4. Creating and Manipulating your own Databases.  In the previous chapters, you interacted with existing databases and queried them in various different ways. Now, you will learn how to build your own databases and keep them updated!
  5. Putting it all together.  Here, you will bring together all of the skills you acquired in the previous chapters to work on a real life project! From connecting to a database, to populating it, to reading and querying it, you will have a chance to apply all the key concepts you learned in this course.
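For readers who have not seen SQLAlchemy, here is a minimal sketch of the kind of workflow those chapters describe: defining a table, inserting rows, then filtering, ordering, and grouping.  This is my own illustration rather than course code.  It assumes SQLAlchemy 1.4 or later (older versions spell select() differently), and the table, columns, and numbers are made up.

```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table,
    create_engine, func, insert, select,
)

# In-memory SQLite database; nothing is written to disk.
engine = create_engine("sqlite:///:memory:")
metadata = MetaData()

# Hypothetical table, defined in Python rather than in raw SQL.
census = Table(
    "census", metadata,
    Column("state", String(30)),
    Column("sex", String(1)),
    Column("pop2008", Integer),
)
metadata.create_all(engine)

rows = [
    {"state": "Illinois", "sex": "F", "pop2008": 120},
    {"state": "Illinois", "sex": "M", "pop2008": 110},
    {"state": "Texas", "sex": "F", "pop2008": 200},
]

# engine.begin() opens a transaction and commits it when the block exits.
with engine.begin() as conn:
    conn.execute(insert(census), rows)

    # Filter and order.
    stmt = select(census).where(census.c.sex == "F").order_by(census.c.state)
    females = [tuple(r) for r in conn.execute(stmt)]

    # Group and aggregate.
    stmt = (
        select(census.c.state, func.sum(census.c.pop2008).label("total"))
        .group_by(census.c.state)
        .order_by(census.c.state)
    )
    totals = [tuple(r) for r in conn.execute(stmt)]

print(females)
print(totals)
```

The queries are built from Python objects (census.c.sex == "F") instead of SQL strings, which is the “Pythonic” style the course description refers to.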

 

 

 

 

Data Science, Northwestern University MSPA

Northwestern University MSPA Program – learning R and python resources page creation

For new students coming into Northwestern University’s Master of Science in Predictive Analytics (MSPA) program, there is often considerable apprehension about learning the programming languages (mainly R and Python, and some SAS).  I have created a page on my blog site – Northwestern University MSPA Program – Learning R and Python resources – that lists some of the available resources, including my favorites.

I would encourage students to start taking programming courses ahead of each class, in whatever language that class requires.  There is enough time between the official courses to take some of these courses.  That way you don’t have to learn the course content and the programming language at the same time (if you don’t, it is still doable, it will just take more effort).

Data Science, Northwestern University MSPA

Northwestern University MSPA 420, Database Systems and Data Preparation Review

This was the fourth course I took in the MSPA program. I took this course because I wanted to understand relational and non-relational databases better, and become adept at storing, manipulating, and retrieving data from databases.  I thought this skill would be very beneficial when it came to getting data for the other analytics courses.

My overall assessment is that this was a good course conceptually, with a solid curriculum requiring a lot of reading and self-study.  However, I felt it could have been improved by having more sync sessions or prepared videos, by improving the discussion sections, and by providing code solutions to the projects.  I will review how the course is organized, and then expand on these comments.

I took the course from Dr. Lynd Bacon, a very knowledgeable instructor, and very helpful when engaged.

Course Goals

From the syllabus – The data “includes poorly structured and user-generated content data.”  “This course is primarily about manipulating data using Python tools.  SQL and noSQL technologies are used to some extent, and accessed using Python.”

The stated course goals are listed below:

  • “Articulate analytics as a core strategy using examples of successful predictive modeling/data mining applications in various industries.
  • Formulate and manage plans to address business issues with analytics.
  • Define key terms, concepts and issues in data management and database management systems with respect to predictive modeling
  • Evaluate the constraints, limitations and structure of data through data cleansing, preparation and exploratory analysis to create an analytical database.
  • Use object-oriented scripting software for data preparation.
  • Transform data into actionable insights through data exploration.”

In retrospect, the first three goals were not addressed explicitly in this course, although part of the third goal was met, in that key concepts and terms around database management systems and data management were dealt with in depth.  There was a lot of conceptual work around extracting data from both relational (PostgreSQL) and non-relational (MongoDB) databases.  (MongoDB will not be used again, as they are switching to ElasticSearch next semester.)  The fifth and sixth goals were met through the project work.

Python was used as the programming language and there was a lot of reading devoted to developing Python skills.  Some people had never used Python before, and were able to get through it.  I had used Python in previous courses, and felt I still learned a lot.  There was extensive use of pandas DataFrames.  We used the json and pymongo packages to interact with the MongoDB database, and learned how to save DataFrames and other objects by pickling them or putting them on a shelf with the shelve module.  I used Jupyter Notebooks to do my Python coding.  We also learned some very basic Linux in order to interact with the servers in the Social Sciences Computing Cluster (SSCC) to extract the data from the relational and non-relational databases.
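As an illustration of the pickling and shelving mentioned above, here is a minimal sketch.  It is my own example, not course code: it assumes pandas is installed, and the DataFrame contents and file names are made up.  Everything is written to a temporary directory that is removed at the end.

```python
import os
import pickle
import shelve
import tempfile

import pandas as pd

# Small made-up DataFrame to save and restore.
df = pd.DataFrame({"carrier": ["AA", "UA"], "delay_min": [12, 7]})

with tempfile.TemporaryDirectory() as tmp:
    # Pickling: serialize one object to one file, then read it back.
    pkl_path = os.path.join(tmp, "flights.pkl")
    with open(pkl_path, "wb") as fh:
        pickle.dump(df, fh)
    with open(pkl_path, "rb") as fh:
        df_from_pickle = pickle.load(fh)

    # Shelving: a dict-like file that stores several pickled objects by key.
    shelf_path = os.path.join(tmp, "project_shelf")
    with shelve.open(shelf_path) as shelf:
        shelf["flights"] = df
        shelf["notes"] = {"source": "example data"}
    with shelve.open(shelf_path) as shelf:
        df_from_shelf = shelf["flights"]
        notes = shelf["notes"]

print(df_from_pickle.equals(df), df_from_shelf.equals(df), notes)
```

A pickle file holds one object, while a shelf lets you keep several named objects together, which is handy for saving the intermediate results of a project.  Only unpickle files you created yourself or otherwise trust, since loading a pickle can run arbitrary code.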

Like the other MSPA courses, it was structured around required textbook readings, assigned articles, weekly discussions, and 4 projects.

Readings

The actual textbooks were mainly for Python.  There was a very valuable text on data cleaning.  All of the reading regarding the relational and non-relational databases came from the assigned articles, some of which were chapters from textbooks.

Textbooks

Lubanovic, B. (2015). Introducing Python: Modern Computing in Simple Packages. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-35936-2]

McKinney, W. (2013). Python for Data Analysis: Agile Tools for Real-World Data. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-31979-3]

Osborne, J. W. (2013). Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data. Thousand Oaks, Calif.: Sage. [ISBN-13: 978-1-4129-8801-8]

There were additional recommended reference books that I purchased, but did not really reference.

The first two texts were good reads with a lot of practical code to practice on to improve your Python skills.  The textbook on Best Practices in Data Cleaning is worth the read.  It makes you understand the importance of cleaning your data correctly, and then testing the underlying assumptions that most statistical analyses are based upon.  The author provides convincing evidence to debunk these myths – robustness, perfect measurement, categorization, distributional irrelevance, equality, and the motivated participant.

Weekly Discussions

To be honest, I was disappointed with this aspect of the course.  Some students were very active, and others participated very minimally and posted their one discussion the evening of the due date.  I did learn some things from the dedicated students, but I feel that if the professor had stressed participation more, there could have been more robust postings.  This was the weakest discussion section of all the courses I have taken so far.

Sync Sessions

Disappointingly, there were only 2 sync sessions.  I feel this could be markedly improved.  I would like to see more involvement by the professor in either holding live sync sessions or creating learning videos.  Ideally one would be created for each type of database system being studied, so you could see firsthand how to access, manipulate, and extract the data, and then apply the data cleaning techniques and perform exploratory data analysis.  This was a huge disappointment for me.

Projects

There were a total of 4 projects.

The first project was around airline flight data: pulling the data into DataFrames, and then manipulating and analyzing it.  The second project required extracting data from a relational database, creating a local SQLite database, manipulating and analyzing the data, and then saving the DataFrames by pickling or shelving them.  The third project required extracting hotel review information from JSON files.  The fourth and most challenging project involved extracting 501,513 Enron emails, and then doing analyses on these emails.
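To make the project workflow a little more concrete, here is a minimal sketch of the general pattern: load JSON, build a DataFrame, store it in a local SQLite database, and query it back.  This is my own illustration and not a project solution.  It assumes pandas is installed, and the hotel name, field names, and ratings are made up (the real project files had their own layout).

```python
import json
import sqlite3

import pandas as pd

# Hypothetical review file contents, held in a string for this example.
raw = """
{"hotel": "Example Inn",
 "reviews": [
   {"author": "kim", "rating": 4, "date": "2015-03-01"},
   {"author": "lee", "rating": 2, "date": "2015-03-09"},
   {"author": "sam", "rating": 5, "date": "2015-04-02"}
 ]}
"""

data = json.loads(raw)

# One row per review, with the hotel name repeated on every row.
reviews = pd.DataFrame(data["reviews"])
reviews["hotel"] = data["hotel"]

# Write the DataFrame to a SQLite database (in memory here).
conn = sqlite3.connect(":memory:")
reviews.to_sql("reviews", conn, index=False)

# Query it back with SQL.
summary = pd.read_sql_query(
    "SELECT hotel, COUNT(*) AS n_reviews, AVG(rating) AS avg_rating "
    "FROM reviews GROUP BY hotel",
    conn,
)
conn.close()

print(summary)
```

Passing a file path to sqlite3.connect instead of ":memory:" would keep the database on disk between sessions.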

I was disappointed with the more complex projects, and felt at times as if the course work did not adequately prepare me to succeed easily on them.  I was able to muck my way through.  An extremely disappointing aspect of these projects is that good examples of the code written by students were not referenced or shared by the professor.  I feel that I would have been able to close the loop on my knowledge deficiencies if I had been able to see other very successful code examples and learn from them.

Summary

Overall this was an okay course.  It could be improved by following my suggestions above.  I still learned a lot, and it gave me a good foundation upon which to add more knowledge in the future.

Northwestern University MSPA, Predictive Analytics

Northwestern University’s MSPA (Master of Science in Predictive Analytics) Program review by a recent student graduate.

I ran across this blog posting today by a student who finished the MSPA program.  My Thoughts on Northwestern University’s MSPA is written by Bhaskar Karambelkar, a student who graduated this summer.  He provides a comprehensive overview of the program, and rates each course on Course Content, Professor Engagement, Overall Value to the Program, and Overall Value to Me.  This is well written, and worth a read by anyone considering this program.

This prompted me to look for other bloggers who are in the program or who have finished it.  I ran across a few who posted once, but did not post any follow-up.  If anyone knows of any other active bloggers, please let me know.

The official Northwestern University MSPA site is:  http://sps.northwestern.edu/program-areas/graduate/predictive-analytics/

There are two LinkedIn groups that may interest you as well.

The “Northwestern University MS Predictive Analytics” group is “for current students and alumni of the Northwestern MSPA Program”.  There are useful articles posted, questions posed to the group about which professors to take for the courses, sharing of syllabi, etc.  It is very useful to browse when considering which class/professor to take.  There are 2,097 members currently.

The “Networking Group for Northwestern University’s MS in Predictive Analytics Program” is “an open group to allow student’s in Northwestern University’s MS in Predictive Analytics Program to network with each other. The group is open to others, including recruiters, who may be interested in networking with us.  The advantage of having a networking group are three fold. First, it will enable us to have a common communication point without have to be linked directly to each other. Second, it will enable us to have a lasting connection to current, future, and past students. And third, it will enable us to be easily found by recruiters.  Please note that this is not an “alumni” group and that this group has no official affiliation with Northwestern University.”  There are 3,455 members currently, and the content is pretty similar to that of the other LinkedIn group.