Opening comments

A while ago I embarked on a brief yet intense journey to learn what the commotion around machine learning was about. These notes are the result of that journey. Most are my own “practitioner’s notes” to remind me of what various algorithms do (so I can sound knowledgeable in meetings), and what the R commands to run them are.

Below is the no-nonsense no-hype version of what I found machine learning to be. The rest is details on how to use R for machine learning, and specifically some of the commonly used algorithms.

If you have any comments, write to me at mp@pareek.org. There is no email I do not read.

Thank you.



I find the phrase machine learning to be a bit of a misnomer. In all of it, there is no machine that is learning anything whatsoever. Most of it is just plain mathematics applied to data to discover relationships within a data set. In this introduction, I wanted to share the plain English facts of what I found machine learning to be.

What machine learning does

When we say machine learning is about discovering relationships between data, this ‘relationship’ is expressed in a way that a bunch of variables can then be used to determine the value of another variable (one that is unknown, and of interest to know). As an example, if I know the fuel consumption, engine size and year of manufacture of a car, can I say without looking what its horsepower might be? Or whether it is a sedan or an SUV? The first question is an example of a ‘prediction’ problem, and the second of a ‘classification’ problem. The difference between prediction and classification is that in the former we predict a number, while in the latter we identify a bucket to which an item belongs.
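As a small taste of things to come, here is how the prediction half of that question might look in R, using the built-in mtcars data set (which conveniently contains exactly these kinds of car attributes). This is only a sketch using ordinary linear regression; the algorithms and commands are covered properly later in these notes.

```r
# Fit a model: horsepower (hp) as a function of fuel
# consumption (mpg) and engine size (disp)
model <- lm(hp ~ mpg + disp, data = mtcars)

# Ask the model about a car it has never seen
new_car <- data.frame(mpg = 21, disp = 160)
predict(model, new_car)
```

The `predict` call returns a number, which is exactly what makes this a prediction problem rather than a classification one.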

There is yet another kind of problem that machine learning solves, called a ‘clustering’ problem. If I have a list of cars with their fuel consumption, engine size and other attributes, how do I bucket them together so that the cars most similar to one another end up in the same bucket? In this situation, machine learning math creates the buckets to which each car belongs without any labels provided by the human. Of course, for many common algorithms (such as k-means) the human still needs to spell out how many buckets to create. Clustering is also called ‘unsupervised learning’, again a rather misleading phrase that seems to suggest that the machines are teaching themselves autonomously and will soon take over the planet.
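A quick sketch of clustering in R, again on the built-in mtcars data. The choice of variables and of three buckets here is arbitrary, purely for illustration:

```r
# Put the variables on a comparable footing (mpg and disp
# have very different scales), then ask k-means for 3 buckets
car_features <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])
set.seed(42)   # k-means starts from random centers; fix the seed
clusters <- kmeans(car_features, centers = 3)
clusters$cluster   # which bucket each car landed in
```

Note that we never told the algorithm anything about the cars other than the numbers themselves; the only human input was the number of buckets.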

The rocket science

If you are still with me, you now know exactly what machine learning is all about. Yes, that really is all there is to it. Interestingly, I also found during the course of my journey that not much of this is new. Most of the algorithms were created decades ago, and some are over half a century old. This used to be called data mining in the boring old days, and then we started to use phrases such as statistical learning, machine learning, artificial intelligence, deep learning and what have you. Of course, there would be many who would point me to the fine differences between all of these topics, but the reality in my opinion is it doesn’t matter.
In essence the problem solved by all of these things, ie prediction, classification and clustering, is roughly the same. If you use plain vanilla linear regression, you might call it data mining, but if you use a neural network, you would call it deep learning. Looks good on corporate decks. And consulting decks particularly. Yet neural networks build directly on ideas from linear regression: a simplistic (if slightly loose) description of a neural network is a series of regressions stacked in layers, with a non-linear transformation applied between each layer.

Even between prediction, classification and clustering, there is really not all that much of a difference. If you can predict a score of some type based on data, you can use it to classify, or to create clusters. In essence, all these problems are really solving for the same thing.

In my view, all of machine learning can be boiled down to finding out the relationship between:

Some output ~ f(input_1, input_2, input_3, …, input_n)

The ~ means ‘is a function of’. Machine learning helps us find that function. Once we know the function, we can use it in situations where we only know the inputs.
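Pleasingly, R’s own modeling syntax mirrors this notation almost exactly: the tilde in an R formula reads ‘is modeled as a function of’. Using the built-in mtcars data as an example:

```r
# 'output ~ f(inputs)' is written in R as a formula object
f <- hp ~ mpg + disp + wt
class(f)   # "formula"

# Nearly every modeling function in R accepts such a formula, eg:
lm(hp ~ mpg + disp + wt, data = mtcars)
```

The same formula can be handed to many different modeling functions, which is one reason switching between algorithms in R is so painless.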

It is now relevant to talk about a few additional concepts. Once we know what these mean, we will be ready to hold our own in any discussion relating to machine learning, and to see through the hype in what we read in the popular media.

Model building

Machine learning starts with data. Think of data as being a large table in a spreadsheet. There are columns, and there are rows. Continuing with our car example, each row in the spreadsheet would be for a different car, and each column would be an attribute of that car, such as miles per gallon, number of doors, acceleration 0-60, engine size. So the rows are observations, and the columns are the variables. Now for all these cars, we already know all the variables. So we know whether it is an SUV or a sedan or a hatchback, and what its engine size is.

Now imagine we are provided additional data for a car where we know everything except its engine size. The problem we need to solve is: given all the other car data, can we determine the relationship between engine size and the other variables so that we can predict engine size? Machine learning allows us to do exactly that. The data in our spreadsheet is ‘labeled’ data, ie for each row we already know the answer we will seek in the future, so we can use it to try to determine the relationship between the known and unknown variables. The process where we establish the relationship is called the ‘learning’. Once the relationship is known, it is expressed as a ‘model’.

But just to be clear, most ‘data scientists’ are not creating new models. The models mostly already exist. All we are doing is calculating the coefficients or constants in the model from our data, and tweaking some other parameters. That is not to say that new inventions are not happening: Google, IBM and others are doing that, but by and large the models already exist. The problem is generally reduced to determining which model works the best, ie gives the most accurate results. And these models exist inside software libraries and packages, most available as open source software. As an example, the caret package in R (an open source statistical language) provides access to over 200 machine learning models to pick from. You can do the same thing in Python, another open source programming language.
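To make the point concrete, here is a minimal sketch of what using caret looks like, again on the mtcars data. It assumes the caret package is installed (install.packages("caret") if not); swapping in a different algorithm is mostly a matter of changing the method argument.

```r
# caret gives one uniform interface to hundreds of models;
# here the same car problem is handed to it with plain
# linear regression as the chosen method
library(caret)
fit <- train(hp ~ mpg + disp, data = mtcars, method = "lm")
fit$results   # resampled estimates of model accuracy
```

The uniform interface is exactly what makes the ‘try many models and pick the best’ workflow practical.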

Now the machine will not pick the model for you; the data scientist uses her knowledge of the data to decide which ones will work best in the given situation. Even this decision making is being automated – there are routines available that will run a data set through all possible models and tell you which one gives the most accurate results.

Which brings us to the question of model accuracy. Suppose for a given population we know the age, education, income, industry, family status and political affiliation. Our task is to build a model that can predict political affiliation from the rest of the variables. Once we have built the model (which could be any of the hundreds available), we need to run it against data it has not seen to check whether it is accurate at predicting political affiliation. Only then can we release it for use in real life. If it provides reliable results (we can define reliability as the percentage of accurate predictions against our test data), then it is a good model; otherwise it is not. In fact, we could build a large number of candidate models and pick the one that gives us the best results.
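The mechanics of measuring accuracy are simple: hold some rows back, fit on the rest, and count how often the predictions on the held-back rows are right. A sketch using mtcars (classifying transmission type rather than political affiliation, since mtcars is what R ships with):

```r
# Hold out 10 rows as a 'test set' and fit on the remainder
set.seed(1)
test_rows <- sample(nrow(mtcars), 10)
train_set <- mtcars[-test_rows, ]
test_set  <- mtcars[test_rows, ]

# Classify transmission (am: 0 = automatic, 1 = manual)
# from weight and fuel consumption, via logistic regression
model <- glm(am ~ wt + mpg, data = train_set, family = binomial)
predicted <- ifelse(predict(model, test_set, type = "response") > 0.5, 1, 0)

# Accuracy = share of correct predictions on unseen data
mean(predicted == test_set$am)
</code-unchanged>
```

The final number is the ‘reliability’ figure described above: the fraction of test rows the model got right.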

Ensemble methods combine multiple algorithms: instead of relying on just one, they make a prediction with many, and then use the value predicted by the majority of the algorithms in the ensemble (or, for numeric predictions, an average).
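A toy illustration of the majority vote, with three hypothetical classifiers (the models and their outputs here are made up purely to show the mechanics):

```r
# Each column is one model's predicted class for five items
votes <- data.frame(
  model_a = c("SUV",   "sedan", "SUV", "sedan", "SUV"),
  model_b = c("SUV",   "SUV",   "SUV", "sedan", "sedan"),
  model_c = c("sedan", "sedan", "SUV", "sedan", "SUV")
)

# For each item (row), take the label most models agreed on
majority <- apply(votes, 1, function(row) names(which.max(table(row))))
majority
```

Real ensemble methods such as random forests apply the same voting idea, but over hundreds of automatically generated models rather than three hand-picked ones.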

The commoditization

Model building, in short, is now at a stage that I will call commoditized. The means to run data through an algorithm and tune it to perfection are commodities. The skills to do all of this are fast becoming commodities as well. In the future, I will not be surprised if learning algorithms are packaged as point-and-click options in Excel.

The challenges

But what is difficult is getting good quality labeled data. To ‘learn’, you need good labeled data in the first place. When I say ‘labeled’, I mean data that is already classified, so the machine can make a good guess at the relationship it needs to predict. It is here that the Googles and Facebooks of the world have an insurmountable advantage over the rest of us. Each time I override Google’s autocorrect on my phone, Google gets more labeled data to figure out what is correct. No startup can compete with that.

The other problem is that data rarely comes neatly organized as a table in a spreadsheet. It comes as feeds from sensors, log files, emails, images, videos, and conversations. A lot of heavy lifting is needed to organize this unstructured data into a format that algorithms can be applied to. This exercise is often called feature extraction: the process of identifying the variables that can help us predict whatever it is we are trying to predict. Speed is another issue, because some of these algorithms are compute intensive and can take hours to run. But if a model has to make decisions in real time, say while driving a car, you don’t have that luxury.

It is in good input data, speed and feature extraction that innovation is needed, and is happening.

The rest of these notes are about R, and how to implement the most common algorithms.