Data Science, Northwestern University MSPA

Northwestern University MSPA 420, Database Systems and Data Preparation Review

This was the fourth course I took in the MSPA program. I took this course because I wanted to understand relational and non-relational databases better, and become adept at storing, manipulating, and retrieving data from databases.  I thought this skill would be very beneficial when it came to getting data for the other analytics courses.

My overall assessment is that this was a good course conceptually, with a solid curriculum requiring a lot of reading and self-study.  However, I felt it could have been improved by holding more sync video sessions or preparing videos, by improving the discussion sections, and by providing the code solutions to the projects.  I will review how the course is organized, and then expand on these comments.

I took the course from Dr. Lynd Bacon, a very knowledgeable instructor who was very helpful when engaged.

Course Goals

From the syllabus – The data “includes poorly structured and user-generated content data.”   “This course is primarily about manipulating data using Python tools.  SQL and noSQL technologies are used to some extent, and accessed using Python.”

The stated course goals are listed below:

  • “Articulate analytics as a core strategy using examples of successful predictive modeling/data mining applications in various industries.
  • Formulate and manage plans to address business issues with analytics.
  • Define key terms, concepts and issues in data management and database management systems with respect to predictive modeling
  • Evaluate the constraints, limitations and structure of data through data cleansing, preparation and exploratory analysis to create an analytical database.
  • Use object-oriented scripting software for data preparation.
  • Transform data into actionable insights through data exploration.”

In retrospect, the first three goals were not addressed explicitly in this course, although part of the third goal was met: key concepts and terms around database management systems and data management were dealt with in depth.  There was a lot of conceptual work around extracting data from both relational (PostgreSQL) and non-relational (MongoDB) databases.  (MongoDB will not be used again, as the course is switching to ElasticSearch next semester.)  The fifth and sixth goals were met through the project work.

Python was used as the programming language, and there was a lot of reading devoted to developing Python skills.  Some people had never used Python before and were still able to get through it.  I had used Python in previous courses, and felt I still learned a lot.  There was extensive use of pandas DataFrames.  We used the json package, used pymongo to interact with the MongoDB database, and learned how to save DataFrames and other objects by pickling them or storing them in a shelve.  I used Jupyter Notebooks to do my Python coding.  We also learned some very basic Linux in order to interact with the servers in the Social Sciences Computing Cluster (SSCC) to extract the data from the relational and non-relational databases.
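For readers who have not seen pickling or shelving before, here is a minimal sketch of the idea using only the standard library.  It is not code from the course; the flights list, the file names, and the temporary directory are all made up for illustration.  A pandas DataFrame can be pickled in the same way, since pickle handles most Python objects.

```python
import os
import pickle
import shelve
import tempfile

# Hypothetical stand-in for a cleaned dataset (not course data).
flights = [
    {"carrier": "AA", "origin": "ORD", "delay_min": 12},
    {"carrier": "UA", "origin": "ORD", "delay_min": 0},
]

# Temporary working directory so the example leaves nothing behind in your project.
workdir = tempfile.mkdtemp()

# Pickling: serialize one object to one file, then read it back.
pickle_path = os.path.join(workdir, "flights.pkl")
with open(pickle_path, "wb") as fh:
    pickle.dump(flights, fh)
with open(pickle_path, "rb") as fh:
    restored = pickle.load(fh)

# Shelving: a persistent, dict-like store that keeps several pickled
# objects under string keys.
shelf_path = os.path.join(workdir, "project_shelf")
with shelve.open(shelf_path) as shelf:
    shelf["flights"] = flights
    shelf["notes"] = "delays in minutes"

# Reopen the shelf later and pull objects back out by key.
with shelve.open(shelf_path) as shelf:
    from_shelf = shelf["flights"]
    shelf_keys = sorted(shelf.keys())
```

The practical difference is that a pickle file holds one object per file, while a shelf lets you keep several named objects together and load only the one you need.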

Like the other MSPA courses, it was structured around the required textbook readings, assigned articles, weekly discussions, and 4 projects.

Readings

The actual textbooks were mainly for Python.  There was a very valuable text on data cleaning.  All of the reading regarding the relational and non-relational databases came from the assigned articles, some of which were chapters from textbooks.

Textbooks

Lubanovic, B. (2015). Introducing Python: Modern Computing in Simple Packages.  Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-35936-2]

McKinney, W. (2013). Python for Data Analysis: Agile Tools for Real-World Data. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-31979-3]

Osborne, J. W. (2013). Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data. Thousand Oaks, Calif.: Sage. [ISBN-13: 978-1-4129-8801-8]

There were additional recommended reference books that I purchased, but did not really reference.

The first two texts were good reads with a lot of practical code to practice on to improve your Python skills.  The textbook on Best Practices in Data Cleaning is worth the read.  It makes you understand the significant importance of cleaning your data correctly, and then testing the underlying assumptions that most statistical analyses are based upon.  The author provides convincing evidence to debunk these myths – robustness, perfect measurement, categorization, distributional irrelevance, equality, and the motivated participant.

Weekly Discussions

To be honest, I was disappointed with this aspect of the course.  Some students were very active, while others participated minimally and posted their one discussion entry the evening of the due date.  I did learn some things from the dedicated students, but I feel that if the professor had stressed participation more, there could have been more robust postings.  This was the weakest discussion section of all the courses I have taken so far.

Sync Sessions

Disappointingly, there were only 2 sync sessions.  I feel this could be markedly improved.  I would like to see more involvement by the professor, either in holding live sync sessions or in creating learning videos.  Ideally one would be created for each type of database system being studied, so you could see firsthand how to access, manipulate, and extract the data, then apply the data cleaning techniques, and then perform exploratory data analysis.  This was a huge disappointment for me.

Projects

There were a total of 4 projects.

The first project centered on airline flight data: pulling the data into DataFrames, and then manipulating and analyzing it.  The second project required extracting data from a relational database, creating a local sqlite database, manipulating and analyzing the data, and then saving the DataFrames by pickling or shelving them.  The third project required extracting hotel review information from JSON files.  The fourth and most challenging project involved extracting 501,513 Enron emails, and then doing analyses on these emails.
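To give a flavor of the "local sqlite database" step in the second project, here is a rough sketch using Python's built-in sqlite3 module.  This is not the project solution; the table name, columns, and rows are hypothetical placeholders for whatever was pulled from the remote database.

```python
import sqlite3

# Hypothetical rows standing in for data extracted from a remote database.
rows = [
    ("AA", "ORD", 12),
    ("UA", "ORD", 0),
    ("AA", "DFW", 30),
]

# ":memory:" keeps the example self-contained; pass a file path instead
# to keep a persistent local copy.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (carrier TEXT, origin TEXT, delay_min INTEGER)"
)
# Parameterized inserts avoid building SQL strings by hand.
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", rows)
conn.commit()

# A simple aggregate query: average delay per carrier.
avg_delay = dict(
    conn.execute(
        "SELECT carrier, AVG(delay_min) FROM flights "
        "GROUP BY carrier ORDER BY carrier"
    ).fetchall()
)
conn.close()
```

From here, query results like these could be loaded into a DataFrame for further analysis and then pickled or shelved.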

I was disappointed with the more complex projects, and felt at times as if the course work did not adequately prepare me to succeed easily on them.  I was able to muck my way through these.  An extremely disappointing aspect of these projects is that good examples of student code were not referenced or shared by the professor.  I feel that I would have been able to close the loop on my knowledge deficiencies if I had been able to see other very successful code examples and learn from them.

Summary

Overall, this was an okay course.  It could be improved upon given my suggestions above.  I still learned a lot, and it gave me a good foundation upon which to add more knowledge in the future.
