Data Science, Machine Learning

The world of machine learning algorithms – a summary infographic.

This is a very nice infographic that shows the basic categories of machine learning algorithms.  It is informative to follow the path of how the infographic got posted on Twitter, where I saw it.  The trail was somewhat misleading (although not intentionally, I believe) about who actually created the infographic.  To me this highlights the importance of crediting our information sources correctly.  This topic was also broached in the FiveThirtyEight article “Who Will Debunk The Debunkers” by Daniel Engber.  The article discusses many myths, one of them being the myth about how spinach came to be credited with too much iron content.  It mentions that an unscholarly and unsourced article became “the ultimate authority for all the citations that followed”.  I have run across this as well, when I was trying to find the source of a quotation defining a “Learning Health System”.  The definition was cited by at least twenty scholarly articles, but there was no reference for the citation, only circular references to the other articles that used the definition.  Correctly citing the source of information lets other people who want to use the data analyze it critically.

I noticed this infographic after it had been tweeted by Evan Sinar (@EvanSinar).  The tweet cited an article in @DataScienceCentral.  That article, “12 Algorithms Every Data Scientist Should Know” by Emmanuelle Rieuf, mentions an article posted by Mark van Rijmenam with the same title, “12 Algorithms Every Data Scientist Should Know”, and then shows the infographic, giving the impression that this was the source of the infographic.  That article mentions that the “guys from Think Big Data developed the infographic” and provides a link.  The link leads to the article “Which are the best known machine learning algorithms? Infographic” by Anubhav Srivastava.  It “mentioned over a dozen algorithms, segregated by their application intent, that should be in the repertoire of every data scientist”.  The bottom line: try to be careful with your source citations so it is not hard for people to follow the source backwards in time.  I was able to do so in this case; it just took a little while.  But there are many times when it is impossible.

Now, for the infographic.




Data Science

Data Science Ecosystem graphic

I ran across this graphic in this article, The Data Science Ecosystem: Preamble, by Lukas Biewald, posted on the Open Data Science (ODSC) site.   This lays out SOME of the ecosystem out there, and I like the way Lukas divides the ecosystem up nicely into components.  I would comment that there is a lot left out about what Python and R can do in the Enrichment, ETL/Blending, Data Integration, Insights and Models sections.  But overall I like the graphic.


Northwestern University MSPA, Predictive Analytics

Northwestern University’s MSPA (Master of Science in Predictive Analytics) Program review by a recent student graduate.

I ran across this blog posting today by a student who finished the MSPA program.  My Thoughts on Northwestern University’s MSPA is written by Bhaskar Karambelkar, a student who graduated this summer.  He provides a comprehensive overview of the program, and rates each course on Course Content, Professor Engagement, Overall Value to the Program, and Overall Value to Me.  This is well written, and worth a read by anyone considering this program.

This prompted me to look for other bloggers who are in the course or who have finished the course.  I ran across a few who posted once, but did not post any follow up.  If anyone knows of any other active bloggers, please let me know.

The official Northwestern University MSPA site is:

There are two LinkedIn groups that may interest you as well.

The “Northwestern University MS Predictive Analytics” group is “for current students and alumni of the Northwestern MSPA Program”.  There are useful articles posted, questions posed to the group about which professors to take for the courses, shared syllabi, etc.  It is very useful to browse when considering which class or professor to take.  There are currently 2,097 members.

The “Networking Group for Northwestern University’s MS in Predictive Analytics Program” is “an open group to allow student’s in Northwestern University’s MS in Predictive Analytics Program to network with each other. The group is open to others, including recruiters, who may be interested in networking with us.  The advantage of having a networking group are three fold. First, it will enable us to have a common communication point without have to be linked directly to each other. Second, it will enable us to have a lasting connection to current, future, and past students. And third, it will enable us to be easily found by recruiters.  Please note that this is not an “alumni” group and that this group has no official affiliation with Northwestern University.”  There are currently 3,455 members, and the content is pretty similar to that of the other LinkedIn group.




Data Science, Jupyter Notebook, JupyterLab

JupyterLab – Exciting Improvement on Jupyter Notebooks

At SciPy 2016, Brian Granger and Jason Grout presented JupyterLab, now in a pre-alpha release.  This was the most exciting and monumental news of the conference for me.  A blog post about JupyterLab from Fernando Perez can be viewed here, and a link to the YouTube video of the presentation is available here; the video is also presented below.

The blog post discusses some of today’s “Jupyter Notebook” functionality, most of which I have not used.  This includes notebooks, “a file manager, a text editor, a terminal emulator, a monitor for running Jupyter processes, an IPython cluster manager, and a pager to display help”.  The new functionality allows you to “arrange a notebook next to a graphical console, atop a terminal that is monitoring the system, while keeping the file manager on the left”.  Users of RStudio will be happy to see this.  (I wonder whether they are going to create a package manager like the one in RStudio.)

Here are a few screenshots of what it looks like.


You can download this now, and help “test and refine the system”.  Instructions to do this are here.

Data Science, Data Visualization, Jupyter Notebook

Jupyter Notebook, matplotlib figure display options, and pandas.set_option() optimization tips.

I prefer to do my coding in a Jupyter Notebook, as my previous posts have mentioned.  However, I have not run across any good documentation on how to optimize the notebook for either a Python or an R kernel.  I am going to mention a few helpful hints I have found.  Here is the link to the Project Jupyter site.

First, a basic comment on how to create a notebook where you want it.  You need to navigate to the directory where you want the notebook to be created.  I use the Windows PowerShell command-line shell.  When you open it, you are in your home directory.  Use the “dir” command to see what is in that directory, and then use the “cd” (change directory) command to navigate to the directory you want to end up in.  If the path is long or contains spaces, enclose it in quotes.  If you need to create a new directory, use the “md” or “mkdir” command.  For example, my long path is “….\Jupyter Notebooks\Python Notebooks”, and while at SciPy 2016 I created a new folder, “….\Jupyter Notebooks\Python Notebooks\SciPy16”, to which I added a folder for each tutorial I attended.
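For readers who would rather script these steps than type them in PowerShell, here is a minimal Python sketch of the same idea.  It is only an illustration: the temporary base folder stands in for the elided part of my path, and the folder names are taken from the example above.

```python
from pathlib import Path
import tempfile

# Stand-in for the elided start of the path; a temporary folder is used
# here so the sketch does not touch any real directories.
base = Path(tempfile.mkdtemp())

# Equivalent of "mkdir": create the nested folders in one call.
# pathlib handles the spaces in the folder names, so no quoting is needed.
notebook_dir = base / "Jupyter Notebooks" / "Python Notebooks" / "SciPy16"
notebook_dir.mkdir(parents=True, exist_ok=True)

# One sub-folder per tutorial, as described above.
(notebook_dir / "pandas_tutorial").mkdir(exist_ok=True)

# Equivalent of "dir": list what is inside the folder.
contents = sorted(p.name for p in notebook_dir.iterdir())
print(contents)
```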

Once you get into the final directory, type “jupyter notebook”, and Jupyter will open in your browser.  The first page that opens is the “Home” page, and if your notebook already exists, you can select it here.  If it doesn’t yet exist, select “New” in the upper right, select your notebook type (for me R or Python 3), and it will launch the notebook.  (This notebook is from a pandas tutorial I attended at SciPy 2016, “Analyzing and Manipulating Data with Pandas” by Jonathan Rocher, an excellent presentation if you want to watch the video.)


Once you click on the “pandas_tutorial”, this Jupyter notebook will open up.


A nice feature is that if you clone a GitHub repository into that folder and start a new Jupyter Notebook there, all the files that go with that repository are immediately available for use.

Importing data in a Jupyter Notebook.

If you are tired of hunting down the path for a data set, there is an easy way to find a data set and get it into the directory of the Jupyter notebook.  Go to the “Home” page and select “Upload”, and you will be taken to the file upload dialog.  Navigate to where you stored the data set on your computer and select it, and it will be uploaded to the Home page.  You can then easily load it into any Jupyter notebook that lives in that directory.
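Once the file sits next to the notebook, it can be read by file name alone, with no long path.  Here is a small sketch, assuming the data set is a CSV file; the file name and its contents are made up for the example, and the sketch writes the file itself only as a stand-in for the upload step.

```python
from pathlib import Path

import pandas as pd

# Hypothetical file name. Writing the file here stands in for the
# "Upload" step; in practice the uploaded file would already be there.
csv_path = Path("example_data.csv")
csv_path.write_text("name,score\nalpha,1\nbeta,2\n")

# Because the file is in the notebook's own directory, a bare file name
# (a relative path) is enough.
df = pd.read_csv(csv_path)
print(df)
```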


Matplotlib figure display options.

If you don’t specify how to display your figures in the Jupyter notebook, then when you create a figure using matplotlib, a separate window will open and display the graph.  This window is nice because it is interactive: you can zoom in on the graph, save it, put labels in, etc.  There is a way to get the same interactivity inside the Jupyter notebook.

The first option I learned about was:

%matplotlib inline

This would display the graph in the notebook, but it was no longer interactive.

However, if you use:

%matplotlib notebook

The figures will now show up in the notebook, and still be interactive.  I learned this during the pandas tutorial at SciPy 2016.

You can also define a figure size, which you then pass to matplotlib through the figsize argument when you create a figure:

LARGE_FIGSIZE = (12,8) # for example
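To make this concrete, here is a short sketch of how these pieces can fit together in a notebook cell.  The plotted values and labels are made up for illustration; the magic commands are shown as comments because they are IPython commands rather than Python statements.

```python
import matplotlib.pyplot as plt

# In a notebook, run ONE of these magics in a cell before plotting:
#   %matplotlib inline     -> static figures inside the notebook
#   %matplotlib notebook   -> interactive figures inside the notebook

LARGE_FIGSIZE = (12, 8)  # (width, height) in inches

# Pass the size through the figsize argument when creating the figure.
fig, ax = plt.subplots(figsize=LARGE_FIGSIZE)

# Made-up data, just to have something to draw.
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("Example figure")
```

The same tuple can be reused for every figure in the notebook, which keeps the plots a consistent size.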


Some pandas optimization hints



You can use pandas.set_option() to set a large number of options.  For example:

pandas.set_option("display.max_rows", 16)

and at most 16 rows of data will be displayed.  There are many options, so just run “pandas.set_option?” in a notebook cell to see what is available.
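Here is a small sketch of a few related calls I find handy alongside set_option.  The DataFrame is a made-up example; get_option, option_context, and reset_option are all part of pandas.

```python
import pandas as pd

# Made-up DataFrame with more rows than we want to see at once.
df = pd.DataFrame({"n": range(100)})

# Limit the display to 16 rows, then read the setting back.
pd.set_option("display.max_rows", 16)
current = pd.get_option("display.max_rows")

# option_context changes an option only inside the "with" block,
# which is handy for a one-off display.
with pd.option_context("display.max_rows", 6):
    short_repr = repr(df)
print(short_repr)

# Put the option back to the pandas default when done.
pd.reset_option("display.max_rows")
```

Outside a notebook, where the “?” help syntax is not available, pandas.describe_option("display.max_rows") prints the description of an option.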

If you have other useful Jupyter notebook tips, I would love to hear about them.






Data Science, Data Visualization

Altair – A Declarative Statistical Visualization Library for Python – Unveiled at SciPy 2016 Keynote Speech by Brian Granger.

You should check out Altair, an API designed to make data visualization much easier in Python.  Altair was introduced today during a keynote speech by Brian Granger during the opening day of SciPy 2016 (Scientific Computing with Python). Brian is the leader of the IPython project and co-founder of Project Jupyter (Jupyter notebooks are my favorite way to code in Python or R).

Matplotlib has been the cornerstone of data visualization in Python, and as Brian Granger pointed out, you can do anything you want to in matplotlib, but there is a price to pay for that, and that is time and effort.

Altair is designed as “a declarative statistical visualization library for Python”.  Here is the link to Brian Granger’s GitHub site which houses the Altair files.  Altair is designed to be a very simple API, with minimal coding required to produce really nice visualizations.  A point Brian made in his talk was that Altair is a declarative API, which specifies what should be done, but not how it should be done.  The source of the data is a pandas DataFrame in a “tidy format”.  The end result is a JSON data structure that follows the Vega-Lite specification.

Here is my understanding of this relationship from a very high level: Altair to Vega-Lite to Vega to D3.  (For more information, follow this link)  D3 (Data-Driven Documents) is a web-based visualization tool, but it is a low-level system.  Vega is designed as a higher-level visualization specification language built on top of D3.  Vega-Lite is a high-level visualization grammar, and a higher-level language than Vega.  It provides a concise JSON syntax, which can be compiled to Vega specifications (link).  Altair sits at an even higher level, and emits JSON data structures following the Vega-Lite specification.  The idea is that as you get higher up, the complexity and difficulty of producing a graphic goes down.
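To give a feel for what “a concise JSON syntax” means, here is a hand-written sketch of a small Vega-Lite style specification, built as a plain Python dictionary and serialized with the standard json module.  The data values and field names are made up for illustration; the point is that the spec only declares the data, the mark, and the encodings, and says nothing about how to draw.

```python
import json

# A minimal Vega-Lite style specification written by hand.
# The rows and field names below are invented for this example.
vega_lite_spec = {
    "data": {
        "values": [
            {"x": 1, "y": 2, "group": "a"},
            {"x": 2, "y": 3, "group": "b"},
            {"x": 3, "y": 5, "group": "a"},
        ]
    },
    # What kind of mark to draw.
    "mark": "point",
    # Which data field drives which visual channel.
    "encoding": {
        "x": {"field": "x", "type": "quantitative"},
        "y": {"field": "y", "type": "quantitative"},
        "color": {"field": "group", "type": "nominal"},
    },
}

spec_json = json.dumps(vega_lite_spec, indent=2)
print(spec_json)
```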

On the GitHub site there are a number of Jupyter notebook tutorials.  There is a somewhat restricted library of data visualizations available, and they currently list scatter charts, bar charts, line charts, area charts, layered charts, and grouped regression charts.

The fundamental object in Altair is the “Chart”, which takes a pandas DataFrame as a single argument.  You then start specifying what you want: what kind of “mark” and which visual encodings (X, Y, Color, Opacity, Shape, Size, etc.) you want.  There are a variety of data transformations available, such as aggregation, values, count, valid, missing, distinct, sum, average, variance, stdev, median, min, max, etc.  It is also easy to export the charts and publish them on the web as Vega-Lite plots.

This looks like a very exciting and much easier to use data visualization API, and I look forward to exploring it more soon.

Data Science has great courses for learning Python, R, Data Science.

Just a quick blog post to highlight the numerous courses available on  I just completed Data Analysis in Python with Pandas, and found it very informative, especially with some of the advanced functions in DataFrames.

It is worthwhile keeping an eye on this site, because they have intermittent sales where these courses are deeply discounted.  I currently have 35 courses that cover Python, R, Data Science, MongoDB, SQL, MapReduce, Hadoop, teaching kids to code, Machine Learning, Data Vis, Time Series Analysis, Linear Modeling, Graphs, Rattle, Linear Regression, Statistics, Simulation, Monte Carlo Methods, Multivariate Analysis, Bayesian Computational Analyses, and more, most of which were purchased during these sales.

These are great courses for learning the underlying languages and concepts, and for brushing up when you have not used them for a while.

I highly recommend these courses; I just wish I had time to do more of them.