Data Science, Northwestern University MSPA

Python Tops KDNuggets 2017 Data Science Software Poll

The results of KDNuggets’ 18th annual Software Poll should be fascinating reading for anyone involved in data science and analytics.  Some highlights – Python (52.6%) finally overtook R (52.1%), SQL remained at about 35%, and Spark and TensorFlow both increased to above 20%.

[KDNuggets 2017 Software Poll results graph]

(Graph taken from http://www.kdnuggets.com/2017/05/poll-analytics-data-science-machine-learning-software-leaders.html/2)

I am about halfway through Northwestern University’s Master of Science in Predictive Analytics (MSPA) program.  I am very thankful that the program has made learning different languages a priority.  I have already learned Python (in Jupyter Notebooks), R, SQL, some NoSQL (MongoDB), and SAS.  In my current class on Generalized Linear Models, I have also started to learn Angoss, SAS Enterprise Miner, and Microsoft Azure machine learning.  However, it looks like you can’t ever stop learning new things – I am going to have to learn Spark and TensorFlow, to name a few more.

I highly recommend you read this article.

Becoming a Healthcare Data Scientist, Data Scientist, Healthcare Predictive Analytics, Northwestern University MSPA

Physician Data Scientist Part II. The Why.

I was recently reminded by a reader of my blog (thanks Al) that I had not followed up on a comment that I would post a second part to the entry published on 7.7.2015 – “Physician Data Scientists – Why and What Type? Part I”.  Now that I am in between classes, I have the time to work on this.  Looking back at that original post, I am somewhat amazed at all that has happened in the last 1 1/2 years.

I am currently the interim Chief Information Officer (CIO) and Chief Medical Information Officer (CMIO) for our integrated healthcare system.  I stepped into the interim CIO role (helped in part by my Northwestern University MSPA Master of Science in Predictive Analytics coursework) after the departure of our previous CIO last year.  Prior to that I had been one of our system’s CMIOs – facilitating and communicating to IT the needs for technology to help improve clinical outcomes, while communicating back to physicians and leadership the limitations of current technologies.  I never really aspired to become either the interim CIO or a CMIO; these opportunities simply arose because of my journey to become better educated about the use of data and analytics to improve clinical outcomes – i.e., to become a Physician Data Scientist.  I will explain how I ended up in my current role.

My interest in data and analytics is a fairly recent phenomenon, occurring because of a chance meeting with someone who has since become one of my closest friends – Curt Lindberg, who has a PhD in Complexity Science and is the Director of our Complexity in Healthcare Center.  I met him during a project to move patients into our healthcare system from outside facilities more efficiently.  At that time I was a practicing Emergency Physician and the Medical Director of our MedFlight Air Ambulance service.  Curt introduced me to complexity science, and my life has not been the same – it was a transformational career moment for me.  I ended up being part of a small group of researchers trying to develop smarter patient monitoring systems.  Their work has inspired me to try to contribute in my own way to this field – called predictive monitoring.

Predictive monitoring is an unofficial term for what this group is trying to accomplish.  While the technology inside patient monitors has changed drastically since the 1970s, what the monitors do has not.  They display certain physiologic markers of interest – blood pressure, pulse rate, temperature, oxygen level, EKG pattern, etc.  You can see what is happening to the patient right now, or (minimally) review what happened in the past, but there is no information predicting what will happen in the future – are they likely to get better, go into sudden cardiac arrest, stop breathing, or develop an overwhelming infection called sepsis?  The goal is to incorporate predictive algorithms into these monitoring systems.

I have been fortunate to meet some giants in this field.  Dr. J. Randall Moorman from the University of Virginia developed the first commercial predictive monitoring system – the HeRO monitor.  The largest-ever randomized clinical trial in neonatal patients (premature babies) was conducted using this monitor.  It showed that the monitor was able to identify certain physiological patterns and translate them into a risk of developing an overwhelming infection (late-onset neonatal sepsis).  This risk was detected an average of 18 hours before a clinical diagnosis was made, allowing for earlier treatments and interventions, which translated into a 22% reduction in mortality.  Dr. Andrew Seely is a Thoracic Surgeon at the University of Ottawa who has developed a model to predict whether a patient’s breathing tube can be removed successfully, without having to be replaced because the patient wasn’t ready.  We got to participate in that clinical trial.  We also participated in a trial conducted by Ryan Arnold, now at Christiana Care in Newark, Delaware, on trying to predict clinical outcomes using heart rate variability analyses.

In addition to collaborating with these researchers on their projects, I became especially fascinated with a research article written by one of the country’s leading trauma surgeons, Dr. Mitchell Cohen, and his colleagues at San Francisco General Hospital and the University of California San Francisco – “Identification of complex metabolic states in critically injured patients using bioinformatic cluster analysis.”  I will confess that I felt frustrated when I talked with the researchers about the underlying mathematical concepts and analytical techniques they were using, because I just did not understand them well.  This ignorance ignited what I will freely admit is now an obsession to understand these concepts and techniques.

I started off trying to educate myself using textbooks, taking MOOCs (Massive Open Online Courses), and enrolling in courses offered on the web.  I still felt very frustrated because these courses didn’t go into the depth I thought I needed.  When I look at the giants in this field of predictive analytics, these few researchers have both the clinical knowledge to understand why this research is so important and the grasp of the mathematical and analytical concepts and techniques necessary to do research in the field.  I wanted to be like them.

I became very interested in becoming a data scientist at that point, and eventually enrolled in Northwestern University’s Master of Science in Predictive Analytics (MSPA) program.  I have not regretted this decision.  I am currently halfway through the program, and am finally into the especially relevant coursework.  I just finished the major foundational course – Linear Regression and Multivariate Analysis; the courses up until then had been preparing me to take it.  I realized I had come full circle when I re-read Mitchell Cohen’s article and found that I finally understood the concepts and results.  That was an extremely satisfying moment for me.

This has been quite the educational journey for me.  I feel like I have a much better understanding of statistics.  I am getting somewhat competent in a few programming languages – R, Python, and SAS – and am using Jupyter Notebooks for my programming work.  I have dabbled with data science platforms like KNIME, and this quarter I will be learning to use virtual machines, IBM Watson Analytics, ANGOSS, and Microsoft Azure machine learning as part of my next class on Generalized Linear Models.

I finally feel as if I am able to start applying what I have been learning over the last 1 1/2 years – to start developing predictive models to improve clinical outcomes.  A few of my goals are to help our organization become more data driven, and to continue working on predictive algorithms that could be incorporated into bedside monitoring systems, further improving the outcomes of patients.

This is my journey to date – from a practicing Emergency Physician with no interest in data or analytics to where I am now, halfway through my Master’s program.  The real journey of applying what I have learned to real-world problems has just started, but it will get more robust as I learn more.

Northwestern University MSPA, Predictive Analytics

Northwestern University MSPA Predict 410, Regression and Multivariate Analysis Course Review

This was the most demanding of Northwestern University’s MSPA (Master of Science in Predictive Analytics) courses I have taken so far, and also the most rewarding.   This course is the backbone of the predictive analytics program and foundational to becoming a predictive modeler.  The course covers Linear Regression (Simple Linear Regression and Multiple Linear Regression) and Multivariate Analysis (Principal Component Analysis, Factor Analysis, and Cluster Analysis).
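The course does all of this in SAS, but for readers who think in code, here is a rough, hypothetical sketch of two of the multivariate techniques it covers (principal component analysis and cluster analysis) in Python with scikit-learn; the data and parameters are invented for illustration and have nothing to do with the course’s actual assignments.

```python
# Toy illustration of Principal Component Analysis and Cluster Analysis
# on synthetic data (the course itself uses SAS, not scikit-learn).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Two well-separated groups of observations in 5 dimensions
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 5))
group_b = rng.normal(loc=6.0, scale=1.0, size=(50, 5))
X = np.vstack([group_a, group_b])

# PCA: project the 5-dimensional data onto 2 principal components
pca = PCA(n_components=2)
scores = pca.fit_transform(X)
print("Variance explained by 2 components:",
      round(pca.explained_variance_ratio_.sum(), 3))

# Cluster analysis: k-means should recover the two groups
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(scores)
print("Cluster sizes:", np.bincount(labels))
```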

This course is an applied course, so you have to understand the mathematics, but don’t have to do in-depth calculations using matrix algebra (it could be much, much worse!).  This is in keeping with the philosophy that the MSPA program is an applied program, preparing students to go out and start working in this field.

I took this course from Professor Syamala Srinivasan, Ph.D.  She is at the top of the list of Northwestern professors, who are already very high quality, and I would highly recommend her if you are considering this class.  She has gone above and beyond with her textbook lectures, additional lectures on topics of interest, SAS tutorials, SAS demos for each week’s assignment, sync sessions, and responses to questions both on the discussion boards and by e-mail.  Her level of involvement in creating the coursework and in teaching the class is phenomenal.  I can’t say enough good things about her.

The course structure is as follows.  Every week there is required reading from a variety of textbooks and articles from the library.  There are PowerPoint lectures with audio for each textbook chapter, plus usually several other special-topic PowerPoint/audio lectures.  A recorded video session then goes over the assignment for the week and the SAS code used for it.  Participation in the week’s discussion board is mandatory and extremely helpful.  The assignments build upon each other and get more complex.  There are intermittent quizzes.  The final exam has two parts: a take-home exam and an online-proctored exam.

Now for the particulars.  This course isn’t for those who are already time-challenged.  I would not recommend taking a second course alongside this one unless you have a lot of spare time.  I spent a good 20-30 hours per week on this course, and wished I actually had more time to devote to it.  I read almost every mandatory and optional reading assignment; you could cut corners and devote less time, but I would worry about not learning this foundational material.

Textbooks

Regression Textbooks Required

  1. Montgomery, D.C., Peck, E.A., and Vining, G.G. (2012). Introduction to Linear Regression Analysis. (5th Edition). New York, NY: Wiley [ISBN-13: 978-0470542811]

This textbook is the main one used throughout the course.  It has sections that are difficult to get through, but the foundational material is there.

2.  Everitt, B. (2009). Multivariable Modeling and Multivariate Analysis for the Behavioral Sciences. Boca Raton, FL: CRC Press [ISBN-13: 978-1439807699]

This is a good supplement to the Montgomery textbook, with coding examples – both in the book and on-line – using R.

Regression Textbooks Optional

  1.  Pardoe, I. (2012). Applied Regression Modeling. (2nd Edition). New York, NY: Wiley [ISBN-13: 978-1118097281]

This was my favorite textbook, and is definitely more understandable and is written from an applied standpoint.

2.  Ryan, A. G., Montgomery, D.C., Peck, E.A., and Vining, G.G. (2013). Solutions Manual to Introduction to Linear Regression Analysis. New York, NY: Wiley [ISBN-13: 978-1118471463]  To be honest, I didn’t use this a lot.

3.  Sheather, S. (2009). A Modern Approach to Regression with R. Springer [ISBN-13: 978-1441918727]  I didn’t use this at all, but it will be handy when working through real-world problems using R later.

SAS Textbooks

  1. Cody, R. (2011). SAS Statistics By Example. Cary, NC: SAS Publishing. [ISBN-13: 978-1607648000]
  2. Delwiche, L., and Slaughter, S. (2012). The Little SAS Book: A Primer. (5th Edition). Cary, NC: SAS Publishing. [ISBN-13: 978-1612903439]

I used both of these books as references a fair amount.

In addition, there were quite a few reference articles in the library.  Some of these were very good; some were very detailed.

SAS

This course uses SAS for all analysis and visualization.  You could use R, but the course is built around SAS.  I will say I came into the course with a bias against SAS (mainly from ignorance, but also due to the cost of the license and the move away from closed systems toward open ones like Python and R – I am a huge Python proponent).  However, I have come to like SAS for how easy it was to learn, and how easy it makes data analysis and visualization.

It is imperative to start learning SAS before the course begins.  You will get an e-mail and syllabus from Dr. Srinivasan early, listing what you need to study.  There are SAS tutorials and readings, and I also used the learning resources within SAS itself.  I completed the on-line SAS Programming 1: Essentials e-course, which was very helpful.  There are also multiple additional free courses that you can take.

You can use SAS through the SSCC – Social Sciences Computing Cluster (no additional charge), through the web-based SAS Studio (no additional charge), or you can purchase a license.  I exclusively used SAS Studio and had no problems.

Coursework

The Learning Goals of the course are:

  • Develop statistical modeling as a three step process consisting of: (1) exploratory data analysis, (2) model identification, and (3) model validation.
  • Understand how to use automated variable selection as a tool for model identification and as a tool for exploratory data analysis in the presence of a large number of predictor variables or a set of unlabeled predictors.
  • Develop a working understanding of the conceptual (theoretical) foundations of linear regression, principal components analysis, factor analysis, and cluster analysis with the objective of being capable of applying these techniques appropriately and validating their results.
  • Develop a conceptual and practical understanding of the difference between statistical inference and predictive modeling and how it affects our choices and actions in the statistical modeling process.
  • Learn the basics of the SAS Data Step, data manipulation with SAS, and SAS procedures (PROCS) for fitting statistical models.
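As a quick illustration of the first learning goal – the three-step process of exploratory data analysis, model identification, and model validation – here is a minimal sketch in Python on made-up data (the course itself does all of this in SAS):

```python
# Minimal sketch of the three-step modeling process:
# (1) exploratory data analysis, (2) model identification,
# (3) model validation -- using synthetic data for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(410)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 3.0 * df["x1"] - 2.0 * df["x2"] + rng.normal(scale=0.5, size=n)

# (1) Exploratory data analysis: summary statistics and correlations
print(df.describe().loc[["mean", "std"]])
print(df.corr()["y"])

# (2) Model identification: fit a multiple linear regression
train, test = train_test_split(df, test_size=0.3, random_state=410)
model = LinearRegression().fit(train[["x1", "x2"]], train["y"])
print("Coefficients:", model.coef_.round(2))

# (3) Model validation: score the model on held-out data
r2 = model.score(test[["x1", "x2"]], test["y"])
print("Hold-out R^2:", round(r2, 3))
```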

Weekly Reading and Video assignments

Each week there are required textbook readings, optional textbook readings, course reserve readings, and lecture videos.  The weekly videos are PowerPoint presentations with audio, and go over the textbook readings for that week.  In addition, there are other lectures on special topics.

The special topics include:

  • Statistical Preliminaries and Notation
  • Statistical Assumptions for OLS Regression
  • Estimation and Inference for OLS Regression
  • Analysis of Variance and Related Topics in OLS Regression
  • Hat Matrix Lecture
  • Statistical Inference vs. Predictive Modeling in OLS Regression
  • Special Topic: Dummy Variables Hypothesis Testing
  • Special Topic Lecture (Degrees of Freedom)
  • Special Topic Lecture (Likelihood Function)
  • Special Topic Lecture (Mallows’ Cp)
  • Hypothesis Testing in Multiple Linear Regression
  • Factor Analysis Example Lecture

Sync Sessions

There are a total of 4 sync sessions.  These are invaluable, as Dr. Srinivasan reviews the recent material and then puts it all into a larger context.

Assignments

There are a total of 8 assignments.  These are a combination of using SAS to do analysis and visualization, and providing an analysis of the produced output.  The code is pretty much already written; you will have to make a few modifications, but the focus is on using SAS, and the assignments are designed to test your ability to perform regression and multivariate analysis, not to struggle producing code from scratch.  This was a very nice feature.  Each week there is a SAS demo video lecture where the Professor runs through the code and the assignment – extremely helpful.

Here are the titles of the assignments:

Assignment 1: Getting to know your data.

Assignment 2: Regression model building.

Assignment 3: Data analysis and regression.

Assignment 4: Statistical inference in linear regression.

Assignment 5: Automated variable selection, multicollinearity, and predictive modeling.

Assignment 6: Principal components in predictive modeling.

Assignment 7: Factor analysis.

Assignment 8: Cluster analysis.

Discussion Boards

These were extremely robust.  You have to answer 3 questions posed by Professor Srinivasan, and then actively engage in discussions around what other people posted.  The questions were relevant, and the discussions enhanced the learning process.

Follow up by Dr. Srinivasan

Each week she would send out several e-mails – on how the assignments went, to clarify issues presented in the discussion boards, and to follow up on quizzes.  These were very helpful.

Quizzes and Tests

There were a total of 5 open-book quizzes.  These were very doable, but somewhat demanding.

There were 2 final examinations – a one-hour take-home exam and a two-hour proctored exam.  These were challenging but doable.

Final Thoughts

This has been the highlight of the MSPA’s courses so far, as this course is the foundation for building predictive models.  The other courses I have taken were a lead-up to this one.  Dr. Srinivasan has gone above and beyond and delivers a high-quality product.  My favorite course and professor so far.

Data Science, Northwestern University MSPA, Uncategorized

DataCamp’s Importing Data in Python Part 1 and Part 2.

I recently finished these DataCamp courses and really liked them.  I highly recommend them to students in general, and especially to students in Northwestern University’s Master of Science in Predictive Analytics (MSPA) program.

Importing Data in Python Part 1 is described as:

As a Data Scientist, on a daily basis you will need to clean data, wrangle and munge it, visualize it, build predictive models and interpret these models. Before doing any of these, however, you will need to know how to get data into Python. In this course, you’ll learn the many ways to import data into Python: (i) from flat files such as .txts and .csvs; (ii) from files native to other software such as Excel spreadsheets, Stata, SAS and MATLAB files; (iii) from relational databases such as SQLite & PostgreSQL.
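As a small taste of what import paths (i) and (iii) look like in practice, here is a minimal pandas sketch using an in-memory CSV string and an in-memory SQLite database; the data and names are invented for illustration, and path (ii) works similarly via `pd.read_excel`.

```python
# Importing data into Python from a flat file and a relational database.
import io
import sqlite3

import pandas as pd

# (i) Flat files: read a small .csv (an in-memory string stands in
# for a file on disk; pd.read_csv("file.csv") works the same way)
csv_text = "id,heart_rate\n1,72\n2,110\n3,95\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (3, 2)

# (iii) Relational databases: load the same rows into SQLite and query
conn = sqlite3.connect(":memory:")
df.to_sql("patients", conn, index=False)
fast = pd.read_sql("SELECT * FROM patients WHERE heart_rate > 100", conn)
print(len(fast))  # 1
conn.close()
```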

Importing Data in Python Part 2 is described as:

As a Data Scientist, on a daily basis you will need to clean data, wrangle and munge it, visualize it, build predictive models and interpret these models. Before doing any of these, however, you will need to know how to get data into Python. In the prequel to this course, you have already learnt many ways to import data into Python: (i) from flat files such as .txts and .csvs; (ii) from files native to other software such as Excel spreadsheets, Stata, SAS and MATLAB files; (iii) from relational databases such as SQLite & PostgreSQL. In this course, you’ll extend this knowledge base by learning to import data (i) from the web and (ii) a special and essential case of this: pulling data from Application Programming Interfaces, also known as APIs, such as the Twitter streaming API, which allows us to stream real-time tweets.


Data Science, Northwestern University MSPA

Learning to Use Python’s SQLAlchemy in DataCamp’s “Introduction to Databases in Python” – useful for students taking Northwestern’s MSPA Predict 420.

I just completed DataCamp’s course titled “Introduction to Databases in Python“.  This is a very informative course, and is actually one of the few tutorials I have run across on SQLAlchemy.

I just finished Northwestern University’s MSPA (Master of Science in Predictive Analytics) Predict 420 class – Database Systems and Data Preparation – and I wish I had taken DataCamp’s course first.  It would have helped tremendously.  You have the opportunity to use SQLAlchemy to interact with SQL databases in Predict 420, but I could not find a really good tutorial on it until I ran across DataCamp’s course after finishing Predict 420.  I highly recommend this DataCamp course to other MSPA students.

Introduction to Databases in Python is divided up into 5 sections, with the course’s description of each section attached.

  1.  Basics of Relational Database.  In this chapter, you will become acquainted with the fundamentals of Relational Databases and the Relational Model. You will learn how to connect to a database and then interact with it by writing basic SQL queries, both in raw SQL as well as with SQLAlchemy, which provides a Pythonic way of interacting with databases.
  2. Applying Filtering, Ordering, and Grouping to Queries.  In this chapter, you will build on the database knowledge you began acquiring in the previous chapter by writing more nuanced queries that allow you to filter, order, and count your data, all within the Pythonic framework provided by SQLAlchemy!
  3. Advanced SQLAlchemy Queries.  Herein, you will learn to perform advanced – and incredibly useful – queries that will enable you to interact with your data in powerful ways.
  4. Creating and Manipulating your own Databases.  In the previous chapters, you interacted with existing databases and queried them in various different ways. Now, you will learn how to build your own databases and keep them updated!
  5. Putting it all together.  Here, you will bring together all of the skills you acquired in the previous chapters to work on a real life project! From connecting to a database, to populating it, to reading and querying it, you will have a chance to apply all the key concepts you learned in this course.
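To give a flavor of the “Pythonic way of interacting with databases” the course teaches, here is a small SQLAlchemy Core sketch against an in-memory SQLite database, written in the 1.4/2.0-style API; the table and data are invented for illustration.

```python
# Creating a table, inserting rows, and querying with a filter,
# the SQLAlchemy Core way (in-memory SQLite; made-up example data).
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, select)

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
patients = Table(
    "patients", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
    Column("heart_rate", Integer),
)
metadata.create_all(engine)  # chapter 4: creating your own database

# engine.begin() opens a transaction that commits automatically
with engine.begin() as conn:
    conn.execute(patients.insert(), [
        {"name": "A", "heart_rate": 72},
        {"name": "B", "heart_rate": 118},
    ])

# chapters 1-2: querying with filtering in Python rather than raw SQL
with engine.connect() as conn:
    stmt = select(patients).where(patients.c.heart_rate > 100)
    rows = conn.execute(stmt).fetchall()
print([r.name for r in rows])  # ['B']
```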


Data Science, Northwestern University MSPA

Northwestern University MSPA Program – learning R and python resources page creation

For new students coming into Northwestern University’s Master of Science in Predictive Analytics (MSPA) program, there is often considerable apprehension about learning the programming languages (mainly R and Python, and some SAS).  I have created a page on my blog site – Northwestern University MSPA Program – Learning R and Python resources – that lists some of the resources available, and my favorites.

I would encourage students to work through the programming resources for whatever language a class requires before taking that class.  There is enough time between the official courses to take some of these courses.  That way you don’t have to learn the course content and the programming language at the same time (if you don’t, it is still doable – it will just take more effort).

Data Science, Northwestern University MSPA

Northwestern University MSPA 420, Database Systems and Data Preparation Review

This was the fourth course I took in the MSPA program. I took this course because I wanted to understand relational and non-relational databases better, and become adept at storing, manipulating, and retrieving data from databases.  I thought this skill would be very beneficial when it came to getting data for the other analytics courses.

My overall assessment is that this was a good course conceptually, with a solid curriculum requiring a lot of reading and self-study.  However, I felt it could have been improved by having more sync video sessions or prepared videos, by improving the discussion sections, and by providing the code solutions to the projects.  I will review how the course is organized, and then expand on these comments.

I took the course from Dr. Lynd Bacon, a very knowledgeable instructor who was very helpful when engaged.

Course Goals

From the syllabus: the data “includes poorly structured and user-generated content data.”  “This course is primarily about manipulating data using Python tools.  SQL and noSQL technologies are used to some extent, and accessed using Python.”

The stated course goals are listed below:

  • “Articulate analytics as a core strategy using examples of successful predictive modeling/data mining applications in various industries.
  • Formulate and manage plans to address business issues with analytics.
  • Define key terms, concepts and issues in data management and database management systems with respect to predictive modeling
  • Evaluate the constraints, limitations and structure of data through data cleansing, preparation and exploratory analysis to create an analytical database.
  • Use object-oriented scripting software for data preparation.
  • Transform data into actionable insights through data exploration.”

In retrospect, the first three goals were not addressed explicitly in this course, although part of the third was met in that key concepts and terms around database management systems and data management were dealt with in depth.  There was a lot of conceptual work around extracting data from both relational (PostgreSQL) and non-relational (MongoDB) databases (MongoDB will not be used again, as the program is switching to Elasticsearch next semester).  The fifth and sixth goals were met through the project work.

Python was used as the programming language, and a lot of the reading was devoted to developing Python skills.  Some people had never used Python before and were still able to get through it; I had used Python in previous courses and felt I still learned a lot.  There was extensive use of pandas DataFrames.  We used the json and pymongo packages to interact with the MongoDB database, and learned how to save DataFrames and other objects by pickling them or putting them on a shelve.  I used Jupyter Notebooks for my Python coding.  We also learned some very basic Linux in order to interact with the servers in the Social Sciences Computing Cluster (SSCC) to extract data from the relational and non-relational databases.
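For anyone unfamiliar with the persistence techniques mentioned above, here is a minimal sketch of pickling and shelving a pandas DataFrame; the file names and data are arbitrary.

```python
# Saving a DataFrame two ways: pickling it, and putting it on a shelve.
import shelve

import pandas as pd

df = pd.DataFrame({"flight": ["AA10", "UA4"], "delay_min": [12, -3]})

# Pickling: pandas has built-in helpers
df.to_pickle("flights.pkl")
restored = pd.read_pickle("flights.pkl")
print(restored.equals(df))  # True

# Shelving: a persistent dict-like store that can hold several objects
with shelve.open("work_shelf") as shelf:
    shelf["flights"] = df
with shelve.open("work_shelf") as shelf:
    from_shelf = shelf["flights"]
print(from_shelf.equals(df))  # True
```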

Like the other MSPA courses it was structured around the required textbook readings, assigned articles, weekly discussions, and 4 projects.

Readings

The actual textbooks were mainly for Python.  There was a very valuable text on data cleaning.  All of the reading regarding the relational and non-relational databases came from the assigned articles, some of which were chapters from textbooks.

Textbooks

Lubanovic, B. (2015). Introducing Python: Modern Computing in Simple Packages.  Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-35936-2]

McKinney, W. (2013) Python for Data Analysis: Agile Tools for Real-World Data. Sebastopol, Calif O’Reilly. [ISBN-13: 978-1-449-31979-3]

Osborne, J. W. (2013). Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data. Thousand Oaks, Calif.: Sage. [ISBN-13: 978-1-4129-8801-8]

There were additional recommended reference books that I purchased, but did not really reference.

The first two texts were good reads with a lot of practical code to practice on to improve your Python skills.  The textbook on best practices in data cleaning is worth the read.  It makes you understand how important it is to clean your data correctly, and then to test the underlying assumptions that most statistical analyses are based upon.  The author provides convincing evidence to debunk several common myths – robustness, perfect measurement, categorization, distributional irrelevance, equality, and the motivated participant.

Weekly Discussions

To be honest, I was disappointed with this aspect of the course.  Some students were very active, while others participated minimally and posted their one discussion the evening of the due date.  I did learn some things from the dedicated students, but I feel that if participation were stressed more by the professor, the postings could have been more robust.  This was the weakest discussion section of all the courses I have taken so far.

Sync Sessions

Disappointingly, there were only 2 sync sessions.  I feel this could be markedly improved.  I would like to see more involvement by the professor in creating live sync sessions or learning videos.  Ideally, one would be created for each type of database system being studied, so you could see first-hand how to access, manipulate, and extract the data, and then apply data cleaning techniques and perform exploratory data analysis.  This was a huge disappointment for me.

Projects

There were a total of 4 projects.

The first project was built around airline flight data: pulling data into DataFrames, then manipulating and analyzing it.  The second project required extracting data from a relational database, creating a local SQLite database, manipulating and analyzing the data, then saving the DataFrames by pickling or shelving them.  The third project required extracting hotel review information from json files.  The fourth and most challenging project involved extracting 501,513 Enron emails and then analyzing them.
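The json-extraction step from the third project can be sketched roughly like this, with a made-up hotel-review record standing in for the real course data:

```python
# Flattening nested json records into a pandas DataFrame
# (the hotel names and reviews here are invented for illustration).
import json

import pandas as pd

raw = """
[{"hotel": "Sample Inn", "reviews": [
    {"rating": 5, "text": "great"},
    {"rating": 2, "text": "noisy"}]},
 {"hotel": "Example Suites", "reviews": [
    {"rating": 4, "text": "clean"}]}]
"""
records = json.loads(raw)

# One row per review, with the hotel name carried along as metadata
df = pd.json_normalize(records, record_path="reviews", meta=["hotel"])
print(df.shape)  # (3, 3)
print(df["rating"].mean())
```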

I was disappointed with the more complex projects, and felt at times as if the coursework did not adequately prepare me to succeed easily on them.  I was able to muck my way through.  An extremely disappointing aspect of these projects is that good examples of the code used by students were not referenced or shared by the professor.  I feel I would have been able to close the loop on my knowledge deficiencies if I had been able to see other very successful code examples and learn from them.

Summary

Overall this was an okay course.  It could be improved given my suggestions above.  I still learned a lot, and will be able to use this knowledge in the future.  It gave me a good foundation upon which to build.