Data Science, Deep Learning, Machine Learning, Neural Networks

Neural Networks, Deep Learning, Machine Learning resources

I have come across a few great resources that I wanted to share.  For students taking a machine learning class (like Northwestern University’s MSDS 422 Practical Machine Learning), these are great references, and a way to learn about the topics before, during, or after the class.  This is not a comprehensive list, just a starter.

Textbook

There is a free online textbook, Neural Networks and Deep Learning.

Videos

There is a great math visualization site called 3Blue1Brown and they have a YouTube channel.  There are 4 videos on neural networks/deep learning which are really informative and a good introduction.

  1.  But what *is* a Neural Network? Chapter 1, deep learning
  2.  Gradient Descent, how neural networks learn. Chapter 2, deep learning
  3.  What is backpropagation really doing? Chapter 3, deep learning
  4.  Backpropagation calculus. Appendix to deep learning chapter 3.

There is also a playlist, Essence of linear algebra, which is a great review and explanation of linear algebra and matrix operations.  I wish I had seen this when I was learning it.

Scikit-Learn Tutorials

There are tutorials on the Scikit-Learn site.

TensorFlow tutorials

The TensorFlow site provides a link to Google’s “Machine Learning Crash Course”, which is described as Google’s fast-paced, practical introduction to machine learning.

The TensorFlow site has a Tutorials page.  There are tutorials for Images, Sequences, Data Representation, and a few other things.

 

Google AI

Google has its own education site (which also has the Machine Learning Crash Course referenced above).

 

Blog sites

Adventures in Machine Learning, Andy Thomas’s blog.

This is a must-view site, and worth visiting several times over.  Andy does a great job explaining the topics and has some great visuals as well.  These are fantastic tutorials.  I have listed only a few below.

Neural Networks Tutorial – A Pathway to Deep Learning

Python TensorFlow Tutorial – Build a Neural Network

Convolutional Neural Networks Tutorial in TensorFlow

Word2Vec word embedding tutorial in Python and TensorFlow

Recurrent neural networks and LSTM tutorial in Python and TensorFlow

 

colah’s blog – Christopher Olah’s blog

Another great blog, with lots of good postings.  A few are listed below.

Deep Learning, NLP, and Representations

Neural Networks, Types and Functional Programming

 

Courses

DataCamp – one of my favorite learning sites.  It does require a subscription.

DataCamp currently has 9 Python machine learning courses, which are listed below.  They also have 9 R machine learning courses.

Machine Learning with the Experts: School Budgets

Deep Learning in Python

Building Chatbots in Python

Natural Language Processing Fundamentals in Python

Unsupervised Learning in Python

Linear Classifiers in Python

Extreme Gradient Boosting with XGBoost

HR Analytics in Python: Predicting Employee Churn

Supervised Learning with Scikit-Learn

 

Udemy courses

Udemy is also a favorite learning site.  You can generally get the course for about $10.

My favorite Udemy learning series is from Lazy Programmer Inc.  They have a variety of courses.  Their blog site explains what order to take the courses in.  There are many other courses from different instructors as well.

Deep Learning Prerequisites: The Numpy stack in Python

Deep Learning Prerequisites: Linear Regression in Python

Deep Learning Prerequisites: Logistic Regression in Python

Data Science: Deep Learning in Python

Modern Deep Learning in Python

Convolutional Neural Networks in Python

Recurrent Neural Networks in Python

Deep Learning with Natural Language Processing in Python

Advanced AI: Deep Reinforcement Learning in Python

Plus many other courses on Supervised and Unsupervised Learning, Bayesian ML, Ensemble ML, Cluster Analysis, and a few others.

 

If you have other favorite machine learning resources, please let me know.

Data Science, Northwestern University MSPA

Python Tops KDNuggets 2017 Data Science Software Poll

The results of KDnuggets’ 18th annual Software Poll should be fascinating reading for anyone involved in data science and analytics.  Some highlights: Python (52.6%) finally overtook R (52.1%), SQL remained at about 35%, and Spark and TensorFlow both increased to above 20%.

[Graph: KDnuggets 2017 software poll results]

(Graph taken from http://www.kdnuggets.com/2017/05/poll-analytics-data-science-machine-learning-software-leaders.html/2)

I am about halfway through Northwestern University’s Master of Science in Predictive Analytics (MSPA) program.  I am very thankful that the program has made learning different languages a priority.  I have already learned Python, Jupyter Notebooks, R, SQL, some NoSQL (MongoDB), and SAS.  In my current class in Generalized Linear Models, I have also started to learn Angoss, SAS Enterprise Miner, and Microsoft Azure machine learning.  However, it looks like you can’t ever stop learning new things, and I am going to have to learn Spark and TensorFlow, to name a few more.

I highly recommend you read this article.

 

Data Science, Northwestern University MSPA, Uncategorized

DataCamp’s Importing Data in Python Part 1 and Part 2.

I recently finished these DataCamp  courses and really liked them.  I highly recommend them to students in general and especially to the students in Northwestern University’s Master of Science in Predictive Analytics (MSPA) program.

Importing Data in Python Part 1 is described as:

As a Data Scientist, on a daily basis you will need to clean data, wrangle and munge it, visualize it, build predictive models and interpret these models. Before doing any of these, however, you will need to know how to get data into Python. In this course, you’ll learn the many ways to import data into Python: (i) from flat files such as .txts and .csvs; (ii) from files native to other software such as Excel spreadsheets, Stata, SAS and MATLAB files; (iii) from relational databases such as SQLite & PostgreSQL.
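As a rough illustration of two of the import routes named in that description (a flat file and a relational database), here is a minimal sketch.  It is not taken from the course: the CSV text, the table name readings, and the values are all made up, and an in-memory SQLite database stands in for a real one.

```python
import io
import sqlite3

import pandas as pd

# Made-up CSV text standing in for a flat file on disk (assumption).
csv_text = "city,temp_f\nChicago,71\nEvanston,69\n"
flat_df = pd.read_csv(io.StringIO(csv_text))

# An in-memory SQLite database standing in for a relational source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temp_f INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("Chicago", 71), ("Evanston", 69)],
)
conn.commit()

# pandas can read a query result straight into a DataFrame.
sql_df = pd.read_sql_query(
    "SELECT city, temp_f FROM readings ORDER BY city", conn
)
conn.close()

print(flat_df)
print(sql_df)
```

With a real file you would pass its path to pd.read_csv instead of the StringIO object.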

Importing Data in Python Part 2 is described as:

As a Data Scientist, on a daily basis you will need to clean data, wrangle and munge it, visualize it, build predictive models and interpret these models. Before doing any of these, however, you will need to know how to get data into Python. In the prequel to this course, you have already learnt many ways to import data into Python: (i) from flat files such as .txts and .csvs; (ii) from files native to other software such as Excel spreadsheets, Stata, SAS and MATLAB files; (iii) from relational databases such as SQLite & PostgreSQL. In this course, you’ll extend this knowledge base by learning to import data (i) from the web and (ii) a special and essential case of this: pulling data from Application Programming Interfaces, also known as APIs, such as the Twitter streaming API, which allows us to stream real-time tweets.

 

Data Science, Northwestern University MSPA

Learning to Use Python’s SQLAlchemy in DataCamp’s “Introduction to Databases in Python” – useful for students taking Northwestern’s MSPA Predict 420.

I just completed DataCamp’s course titled “Introduction to Databases in Python”.  This is a very informative course, and is actually one of the few tutorials I have run across on SQLAlchemy.

I just finished Northwestern University’s MSPA (Master of Science in Predictive Analytics) Predict 420 class, Database Systems and Data Preparation, and I wish I had taken DataCamp’s course first.  It would have helped tremendously.  You have the opportunity to use SQLAlchemy to interact with SQL databases in Predict 420, but I looked and could not find a really good tutorial on it until I ran across DataCamp’s course, after I had finished Predict 420.  I highly recommend this DataCamp course to other MSPA students.

Introduction to Databases in Python is divided into 5 sections, listed below with the course’s own description of each.

  1.  Basics of Relational Database.  In this chapter, you will become acquainted with the fundamentals of Relational Databases and the Relational Model. You will learn how to connect to a database and then interact with it by writing basic SQL queries, both in raw SQL as well as with SQLAlchemy, which provides a Pythonic way of interacting with databases.
  2. Applying Filtering, Ordering, and Grouping to Queries.  In this chapter, you will build on the database knowledge you began acquiring in the previous chapter by writing more nuanced queries that allow you to filter, order, and count your data, all within the Pythonic framework provided by SQLAlchemy!
  3. Advanced SQLAlchemy Queries.  Herein, you will learn to perform advanced – and incredibly useful – queries that will enable you to interact with your data in powerful ways.
  4. Creating and Manipulating your own Databases.  In the previous chapters, you interacted with existing databases and queried them in various different ways. Now, you will learn how to build your own databases and keep them updated!
  5. Putting it all together.  Here, you will bring together all of the skills you acquired in the previous chapters to work on a real life project! From connecting to a database, to populating it, to reading and querying it, you will have a chance to apply all the key concepts you learned in this course.
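To give a feel for what the chapters above cover, here is a minimal sketch of the SQLAlchemy workflow: connect, create a table, populate it, then filter, group, and order a query.  It is my own illustration rather than code from the course.  The table name census, its columns, and the numbers are made up, and it assumes SQLAlchemy 1.4 or later.

```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table,
    create_engine, func, insert, select,
)

# In-memory SQLite database, so nothing is written to disk.
engine = create_engine("sqlite:///:memory:")
metadata = MetaData()

# Hypothetical table for illustration only.
census = Table(
    "census", metadata,
    Column("state", String(30)),
    Column("sex", String(1)),
    Column("pop2008", Integer),
)
metadata.create_all(engine)

rows = [
    {"state": "Illinois", "sex": "F", "pop2008": 10},
    {"state": "Illinois", "sex": "M", "pop2008": 9},
    {"state": "Wisconsin", "sex": "F", "pop2008": 5},
]

# engine.begin() opens a transaction and commits it on exit.
with engine.begin() as conn:
    conn.execute(insert(census), rows)

# Filter, group, and order without writing raw SQL.
stmt = (
    select(census.c.state, func.sum(census.c.pop2008).label("population"))
    .where(census.c.pop2008 > 0)
    .group_by(census.c.state)
    .order_by(census.c.state)
)

with engine.connect() as conn:
    totals = [(row.state, row.population) for row in conn.execute(stmt)]

print(totals)
```

The same statement objects work against PostgreSQL or another backend by changing only the connection string passed to create_engine.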

Data Science, Northwestern University MSPA

Northwestern University MSPA Program – Learning R and Python resources page

For new students coming into Northwestern University’s Master of Science in Predictive Analytics (MSPA) program, there is often considerable apprehension about learning the programming languages (mainly R and Python, and some SAS).   I have created a page on my blog site – Northwestern University MSPA Program – Learning R and Python resources – that lists some of the resources that are available, and my favorites.

I would encourage students to take programming courses in whatever language a class requires before that class starts.  There is enough time between the official courses to do this.  That way you don’t have to learn the course content and the programming language at the same time (if you don’t, it is still doable, it will just take more effort).

Data Science, Northwestern University MSPA

Northwestern University MSPA 420, Database Systems and Data Preparation Review

This was the fourth course I took in the MSPA program. I took this course because I wanted to understand relational and non-relational databases better, and become adept at storing, manipulating, and retrieving data from databases.  I thought this skill would be very beneficial when it came to getting data for the other analytics courses.

My overall assessment is that this was a good course conceptually, with a solid curriculum requiring a lot of reading and self-study.  However, I felt it could have been improved by having more sync video sessions or prepared videos, by improving the discussion sections, and by providing the code solutions to the projects.  I will review how the course is organized, and then expand on these comments.

I took the course from Dr. Lynd Bacon, a very knowledgeable instructor, and very helpful when engaged.

Course Goals

From the syllabus – The data “includes poorly structured and user-generated content data.”   “This course is primarily about manipulating data using Python tools.  SQL and noSQL technologies are used to some extent, and accessed using Python.”

The stated course goals are listed below:

  • “Articulate analytics as a core strategy using examples of successful predictive modeling/data mining applications in various industries.
  • Formulate and manage plans to address business issues with analytics.
  • Define key terms, concepts and issues in data management and database management systems with respect to predictive modeling
  • Evaluate the constraints, limitations and structure of data through data cleansing, preparation and exploratory analysis to create an analytical database.
  • Use object-oriented scripting software for data preparation.
  • Transform data into actionable insights through data exploration.”

In retrospect, the first three goals were not addressed explicitly in this course.  Part of the third goal was met in that key concepts and terms around database management systems and data management were dealt with in-depth.  There was a lot of conceptual work around extracting data from both relational (PostgreSQL) and non-relational (MongoDB) databases (MongoDB will not be used again as they are switching to ElasticSearch next semester).  The fifth and sixth goals were met through the project work.

Python was used as the programming language and there was a lot of reading devoted to developing Python skills.  Some people had never used Python before, and were able to get through it.  I had used Python in previous courses, and felt I still learned a lot.  There was extensive use of pandas DataFrames.  We used the json and pymongo packages to interact with the MongoDB database, and learned how to save DataFrames and other objects by pickling them or putting them on a shelf (Python’s shelve module).  I used Jupyter Notebooks to do my Python coding.  We also learned some very basic Linux in order to interact with the servers in the Social Sciences Computing Cluster (SSCC) to extract the data from the relational and non-relational databases.
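For readers who have not saved a DataFrame this way before, here is a small sketch of both approaches.  It is my own example, not course code: the DataFrame contents, the file names, and the shelf keys are made up, and everything is written to a temporary directory.

```python
import os
import shelve
import tempfile

import pandas as pd

# Made-up data standing in for a project DataFrame (assumption).
df = pd.DataFrame({"carrier": ["AA", "UA"], "flights": [3, 5]})

workdir = tempfile.mkdtemp()

# Pickling: write the DataFrame to a single file and read it back.
pickle_path = os.path.join(workdir, "flights.pkl")
df.to_pickle(pickle_path)
restored = pd.read_pickle(pickle_path)

# Shelving: a shelf behaves like a persistent dict with string keys,
# so several objects can live in one place.
shelf_path = os.path.join(workdir, "project_shelf")
with shelve.open(shelf_path) as shelf:
    shelf["flights"] = df
    shelf["notes"] = {"source": "made-up example"}

with shelve.open(shelf_path) as shelf:
    from_shelf = shelf["flights"]
    notes = shelf["notes"]

print(restored.equals(df), from_shelf.equals(df))
```

Both approaches rely on pickle under the hood, so only load files you created yourself or otherwise trust.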

Like the other MSPA courses it was structured around the required textbook readings, assigned articles, weekly discussions, and 4 projects.

Readings

The actual textbooks were mainly for Python.  There was a very valuable text on data cleaning.  All of the reading regarding the relational and non-relational databases came from the assigned articles, some of which were chapters from textbooks.

Textbooks

Lubanovic, B. (2015). Introducing Python: Modern Computing in Simple Packages.  Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-35936-2]

McKinney, W. (2013). Python for Data Analysis: Agile Tools for Real-World Data. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-31979-3]

Osborne, J. W. (2013). Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data. Thousand Oaks, Calif.: Sage. [ISBN-13: 978-1-4129-8801-8]

There were additional recommended reference books that I purchased, but did not really reference.

The first two texts were good reads with a lot of practical code to practice on to improve your Python skills.  The textbook on Best Practices in Data Cleaning is worth the read.  It makes you understand the significant importance of cleaning your data correctly, and then testing the underlying assumptions that most statistical analyses are based upon.  The author provides convincing evidence to debunk these myths – robustness, perfect measurement, categorization, distributional irrelevance, equality, and the motivated participant.

Weekly Discussions

To be honest, I was disappointed with this aspect of the course.  Some students were very active, and others participated very minimally and posted their one discussion the evening of the due date.  I did learn some things from the dedicated students, but I feel that if the professor had stressed this more, there could have been more robust postings.  This was the weakest discussion section of all the courses I have taken so far.

Sync Sessions

Disappointingly, there were only 2 sync sessions.  I feel this could be markedly improved.  I would like to see more involvement by the professor, either in holding live sync sessions or in creating learning videos.  Ideally one would be created for each type of database system being studied, so you could see first-hand how to access, manipulate, and extract the data, and then apply the data cleaning techniques and perform exploratory data analysis.  This was a huge disappointment for me.

Projects

There were a total of 4 projects.

The first project was built around airline flight data: pulling the data into DataFrames, and then manipulating and analyzing it.  The second project required extracting data from a relational database, creating a local SQLite database, manipulating and analyzing the data, and then saving the DataFrames by pickling or shelving them.  The third project required extracting hotel review information from JSON files.  The fourth and most challenging project involved extracting 501,513 Enron emails, and then doing analyses on these emails.

I was disappointed with the more complex projects, and felt at times as if the course work did not adequately prepare me to succeed easily on them.  I was able to muck my way through.  An extremely disappointing aspect of these projects is that good examples of the code written by students were not referenced or shared by the professor.  I feel that I would have been able to close the loop on my knowledge deficiencies if I had been able to see other very successful code examples and learn from them.

Summary

Overall this was an okay course.  It could be improved upon given my suggestions above.  I still learned a lot and will be able to use this knowledge in the future.  It did give me a good foundation upon which to add more knowledge in the future.

Data Science, Machine Learning

The world of machine learning algorithms – a summary infographic.

This is a very nice infographic that shows the basic categories of machine learning algorithms.  It is also instructive to follow the path of how the infographic came to be posted on Twitter, where I saw it.  The chain of posts was somewhat misleading (although not intentionally, I believe) about who actually created the infographic.  To me this highlights the importance of crediting our information sources correctly.  This topic was also broached in the FiveThirtyEight article “Who Will Debunk The Debunkers” by Daniel Engber.  The article discusses many myths, one of them being the myth about how spinach came to be credited with too much iron content.  It mentions that an unscholarly and unsourced article became “the ultimate authority for all the citations that followed”.  I have run across this as well, when I was trying to find the source of a quotation defining a “Learning Health System”.  This definition was cited by at least twenty scholarly articles, but there was no reference for the original source, only circular references to the other articles that used the definition.  Correct citation lets other people who are interested in using the information analyze it critically.

I noticed this infographic after it had been tweeted by Evan Sinar (@EvanSinar).  The tweet cited an article in @DataScienceCentral.  That article, “12 Algorithms Every Data Scientist Should Know” by Emmanuelle Rieuf, mentions an article with the same title posted by Mark van Rijmenam, and then shows the infographic, giving the impression that this was the source of the infographic.  That article mentions that the “guys from Think Big Data developed the infographic” and provides a link.  That links to the article “Which are the best known machine learning algorithms? Infographic” by Anubhav Srivastava.  It “mentioned over a dozen algorithms, segregated by their application intent, that should be in the repertoire of every data scientist”.  The bottom line: be careful with your source citations, so that it is not hard for people to trace the source backwards in time.  I was able to do this in this case; it just took a little while.  But there are many times when it is impossible.

Now, for the infographic.

[Infographic: 12 Algorithms Every Data Scientist Should Know]

 

 

Data Science

Data Science Ecosystem graphic

I ran across this graphic in this article, The Data Science Ecosystem: Preamble, by Lukas Biewald, posted on the Open Data Science (ODSC) site.   This lays out SOME of the ecosystem out there, and I like the way Lukas divides the ecosystem up nicely into components.  I would comment that there is a lot left out about what Python and R can do in the Enrichment, ETL/Blending, Data Integration, Insights and Models sections.  But overall I like the graphic.

[Graphic: The Data Science Ecosystem]

Data Science, Jupyter Notebook, JupyterLab

JupyterLab – Exciting Improvement on Jupyter Notebooks

At SciPy 2016, Brian Granger and Jason Grout presented JupyterLab, now in a pre-alpha release.  This was the most exciting and monumental news of the conference for me.  A blog post about JupyterLab from Fernando Perez can be viewed here, and the YouTube video of the presentation is available here.

The blog post discusses some of today’s “Jupyter Notebook” functionality, most of which I have not used.  This includes the Notebooks, “a file manager, a text editor, a terminal emulator, a monitor for running Jupyter processes, an IPython cluster manager, and a pager to display help”.   The new functionality allows you to “arrange a notebook next to a graphical console, atop a terminal that is monitoring the system, while keeping the file manager on the left”.  Users of RStudio will be happy to see this.  (I am wondering if they are going to create a Package Manager like RStudio?).

Here are a few screenshots of what it looks like.

 

You can download this now, and help “test and refine the system”.  Instructions to do this are here.

Data Science, Data Visualization, Jupyter Notebook

Jupyter Notebook, matplotlib figure display options, and pandas.set_option() optimization tips.

I prefer to do my coding in a Jupyter Notebook, as my previous posts have mentioned.  However, I have not run across any good documentation on how to optimize the notebook, for either a Python or an R kernel.  I am going to mention a few helpful hints I have found.  Here is the link to the Project Jupyter site.

First, a basic comment on how to create a notebook where you want it.  You need to navigate to the directory where you want the notebook to be created.  I use the Windows PowerShell command-line shell.  When you open it up, you are at your home directory.  Use the “dir” command to see what is in that directory, and then use the “cd” (change directory) command to navigate to the directory you want to end up in.  If the path is long or contains spaces, you should enclose it in quotes.  If you need to create a new directory, use the “md” or “mkdir” command.  For example, my long path is “….\Jupyter Notebooks\Python Notebooks”, and while at SciPy 2016 I created a new folder, “….\Jupyter Notebooks\Python Notebooks\SciPy16”, to which I added a folder for each tutorial I attended.

Once you get into the final directory, type “jupyter notebook”, and Jupyter will open in your browser.  The first page that opens up is the “Home” page, and if your notebook exists, you can select it here.  If it doesn’t yet exist, then select “New” in the upper right, select your notebook type (for me R or Python 3), and it will launch the notebook.  (This notebook is from a pandas tutorial I attended at SciPy 2016, “Analyzing and Manipulating Data with Pandas” by Jonathan Rocher, an excellent presentation if you want to watch the video.)

[Screenshot: Jupyter Notebook “Home” page]

Once you click on the “pandas_tutorial”, this Jupyter notebook will open up.

[Screenshot: the “pandas_tutorial” notebook open in Jupyter]

A nice feature is that if you clone a GitHub repository into that folder, and start a new Jupyter Notebook, then all the files that go with that repository are immediately available for use.

Importing data in a Jupyter Notebook.

If you are tired of hunting down the path for a data set, there is an easy way to find a data set and get it into the directory of the Jupyter notebook.  Go to the “Home” page, select “Upload”, and you will be taken to the “file upload” application.  Navigate to where you stored the data set on your computer, select it, and it will be loaded onto the Home page.  You can then easily load it into the specific Jupyter notebook that is associated with that directory.

[Screenshot: Jupyter Notebook “Home” page]

Matplotlib figure display options.

If you don’t specify how to display your figures in the Jupyter notebook, when you create a figure using matplotlib, a separate window will open and display the graph.  This window is nice because it is interactive, and you can zoom in on the graph, save it, put labels in, etc.  There is a way to get the same interactivity inside the Jupyter notebook.

The first option I learned about was:

%matplotlib inline

This would display the graph in the notebook, but it was no longer interactive.

However, if you use:

%matplotlib notebook

The figures will now show up in the notebook, and still be interactive.  I learned this during the pandas tutorial at SciPy 2016.

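You can also set your figure size by defining a size tuple (width and height in inches) and passing it as the figsize argument when you create a figure: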

LARGE_FIGSIZE = (12,8) # for example
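On its own, that tuple is only a constant; it has no effect until it is handed to matplotlib.  Here is a minimal sketch of one way to use it, with made-up data.  The Agg backend line is an assumption that lets the sketch run as a plain script; in a notebook you would use one of the %matplotlib magics above instead.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; replace with a %matplotlib magic in a notebook
import matplotlib.pyplot as plt

LARGE_FIGSIZE = (12, 8)  # width and height in inches

# Pass the tuple as figsize when creating the figure.
fig, ax = plt.subplots(figsize=LARGE_FIGSIZE)
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])  # made-up data
ax.set_xlabel("x")
ax.set_ylabel("x squared")

# Confirm the size that matplotlib actually applied.
width, height = fig.get_size_inches()
print(width, height)
```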

 

Some pandas optimization hints

Use:

pandas.set_option()

to set a large number of options.  For example:

pandas.set_option("display.max_rows", 16)

and only 16 rows of data will be displayed.  There are many options, so just run “pandas.set_option?” in a notebook cell to see what is available.
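As a small sketch of how these options behave, the example below sets display.max_rows, reads it back, overrides it temporarily, and then restores the default.  The DataFrame is made up; get_option, option_context, and reset_option are companion pandas functions that the text above does not mention.

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})  # made-up data

# Set the option globally and read it back.
pd.set_option("display.max_rows", 16)
current = pd.get_option("display.max_rows")

# option_context applies a setting only inside the with-block.
with pd.option_context("display.max_rows", 6):
    inside = pd.get_option("display.max_rows")
    short_repr = repr(df)  # truncated display of the 100-row frame

# Outside the block, the earlier global setting is back in force.
after = pd.get_option("display.max_rows")

# Return the option to the pandas default when you are done.
pd.reset_option("display.max_rows")

print(current, inside, after)
print(short_repr)
```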

If you have other useful Jupyter notebook tips, I would love to hear about them.