Data science has been one of technology’s fastest-growing fields for over a decade, yet it remains poorly understood. Technical jargon and buzzwords frequently obscure the basic ideas and concepts from outsiders, leaving even technical marketeers in the dark.
In this article I hope to demystify some of the terms that are commonly thrown around in relation to data science and machine learning, and provide a gentle introduction to both.
What exactly is ‘data science’?
As the name might suggest, data science is the study of how we can leverage raw data to understand the world around us. The word ‘data’ might evoke thoughts of numbers in a spreadsheet or Matrix-style formulas falling down a screen, but in fact data encompasses everything from photos of your pet cat to the sheet music of Beethoven and logs of weather conditions.
Thanks to ongoing research (and a surge in funding), new data science tools are constantly being developed, allowing us to use data to create seemingly impossible applications. For example, check out this page. Notice anything strange about the person you’ve just loaded a picture of? Well, the photo is fake, and that person does not exist. Using millions of real photos of people, a computer has learned to create shockingly realistic fake images that mimic the real ones. That’s data science.
Before we go further, it’s important to explain that data science as a term is fairly broad, and encompasses a multitude of distinct disciplines. Here are the main ones to think about.
Artificial Intelligence is a general term to describe using computers to imitate human intelligence – such as deciding if a tweet is happy, sad or neutral in tone. Whilst we might think of AI in terms of fully conscious human-like creations (like HAL from 2001: A Space Odyssey), in a data science context we typically use AI as an umbrella term to describe anything that ‘learns’.
Machine Learning is a subset of AI that describes the specific techniques used for training models to learn tasks, using data. You’ll hear the word ‘model’ thrown around a lot in this context, but it simply refers to the output of a machine learning process, which ‘models’ some real world process. For example, a weather model trained on years of weather data models how weather patterns work in the real world. This model can then use current weather data to simulate the following week’s weather, and report back what it expects the weather to be like in the coming days.
Deep Learning is an emerging branch of machine learning that strays away from the typically statistical approach of machine learning models. Deep learning models are loosely based on how the human brain functions, with computerised versions of neurons and synapses. Whilst deep learning research has been growing over the last decade and a half, the fundamental concepts have been floating around since the 1940s. Only now, with ever-improving modern hardware, have we been able to begin to realise the full potential of these concepts. For example, deep learning was behind the fake people generator we looked at earlier, and it can even be used to compose music.
Business Intelligence describes a more traditional form of data science, the kind you might be more familiar with. Business intelligence involves quickly analysing large amounts of present and historic business data, to provide insights which can support and give context to strategic business decisions. For example, does your bike business need to know how many red bikes were sold in Greater Manchester from 2013 to 2017? Business intelligence tools can quickly provide the answer, allowing you to make your case for developing a new red bike.
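To make the red bike question concrete, here is a minimal Python sketch of the kind of query a business intelligence tool answers. The sales records and the `count_sales` helper are invented purely for illustration; a real tool would run a similar filter over millions of rows in a database.

```python
# Hypothetical sales records: (year, region, colour). Invented for illustration.
sales = [
    (2012, "Greater Manchester", "red"),
    (2013, "Greater Manchester", "red"),
    (2014, "Greater Manchester", "blue"),
    (2015, "Greater Manchester", "red"),
    (2016, "Merseyside", "red"),
    (2017, "Greater Manchester", "red"),
    (2018, "Greater Manchester", "red"),
]


def count_sales(records, colour, region, first_year, last_year):
    """Count records matching a colour and region within an inclusive year range."""
    return sum(
        1
        for year, rec_region, rec_colour in records
        if rec_colour == colour
        and rec_region == region
        and first_year <= year <= last_year
    )


red_bikes = count_sales(sales, "red", "Greater Manchester", 2013, 2017)
print(red_bikes)  # 3 of the seven made-up sales match
```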
Let’s create a machine learning model
You don’t need seventeen degrees in mathematics and statistics to understand the basics of how a machine learning model works. In the next few minutes, I’ll walk you through an example of a linear classification model that will determine whether an animal is a cat or a dog. I promise it’ll be easy.
For this task, we’re going to measure two data points that might help us distinguish cats from dogs: size and ear-floppiness (technical term).
The first thing we can do is gather a bunch of example dogs and cats, and plot them on a graph at numeric points based on their relative size and ear-floppiness.
The model we’re going to be building is known as a linear classification model. The task at hand for building the model is to find a single line on the graph that can separate cats from dogs. Some dogs are smaller than some cats, and some cats have floppier ears than some dogs, so we can’t just say ‘an animal is a cat if its size is less than five’.
Instead we need some sort of slanted line to separate the two groups. The real model would use a bit of fancy maths to find the exact line, but for this example we can just visually work out where it lies.
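If you’d like a peek at what the ‘fancy maths’ might look like, here is a minimal Python sketch using the perceptron, one of the simplest ways of finding a separating line. This is my choice for illustration rather than the only approach, and the example animals and their scores are made up. Each time an animal lands on the wrong side of the line, the line is nudged towards putting it on the right side, until every example is classified correctly.

```python
# Hypothetical examples: (size, ear_floppiness, label), each scored from 0 to 10.
examples = [
    (2.0, 2.0, "cat"), (3.0, 4.0, "cat"), (4.0, 2.5, "cat"),
    (5.0, 3.0, "cat"), (3.5, 5.0, "cat"),
    (4.5, 8.0, "dog"), (6.0, 6.0, "dog"), (7.0, 7.5, "dog"),
    (8.0, 5.0, "dog"), (9.0, 4.0, "dog"),
]


def train_perceptron(data, max_epochs=10_000):
    """Return (w_size, w_ears, bias) describing a line that separates the labels."""
    w_size = w_ears = bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for size, ears, label in data:
            target = 1 if label == "dog" else -1
            score = w_size * size + w_ears * ears + bias
            if target * score <= 0:  # on the wrong side of (or exactly on) the line
                # Nudge the line towards classifying this animal correctly.
                w_size += target * size
                w_ears += target * ears
                bias += target
                mistakes += 1
        if mistakes == 0:  # every example is on the correct side
            return w_size, w_ears, bias
    raise RuntimeError("no separating line found; the data may not be separable")


def predict(model, size, ears):
    """Classify an animal: a positive score means 'dog', otherwise 'cat'."""
    w_size, w_ears, bias = model
    return "dog" if w_size * size + w_ears * ears + bias > 0 else "cat"


model = train_perceptron(examples)
print(model)
```

The perceptron only settles on a line when the two groups really can be separated by one, which is why the sketch gives up with an error after `max_epochs` passes.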
Now that we’ve found our line, we’re done! We’ve trained a complete classification model which we can use to classify a new animal.
To do this, we plot the new animal’s size and ear-floppiness and see if it sits above or below the line. Let’s see an example of this using my cat, Rhea. Rhea is quite small, so we can put her at around three for size, and around three for ear-floppiness too. On our graph, that looks like this:
Rhea comes in below the classification line, so our model would predict ‘cat’! Genius stuff.
Although the model is correct in this case, the example also shows its limitations. For instance, you can see how the model might fail if we asked it to classify a husky puppy, which is small and has pointy ears.
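Both the classification step and its weak spot can be sketched in a few lines of Python. The line used here (size plus ear-floppiness equals ten) is an assumed stand-in for the one on the graph, and the husky puppy’s scores are my own guess.

```python
def classify(size, ear_floppiness, threshold=10.0):
    """Predict 'cat' for points below the assumed line size + ear_floppiness = threshold."""
    return "cat" if size + ear_floppiness < threshold else "dog"


rhea = classify(3, 3)         # Rhea's scores from the example above
husky_puppy = classify(3, 2)  # assumed scores: small, with upright ears

print(rhea)         # cat (correct)
print(husky_puppy)  # cat (wrong: the puppy also sits below the line)
```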
This might seem like an overly simplistic model but, at a large scale, linear models like this have been used in industry for decades, often to great effect.
If you’re curious about a more technical example (including Python code snippets), this article goes through how we can use data science to detect whether a cancer is benign or malignant. Ultimately though, the core concepts of all classification models stay the same: collect some examples, try to find a way of separating them, see how the model holds up to new examples it hasn’t seen before. That’s all there is to it!
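The last step, seeing how a model holds up against examples it hasn’t seen before, usually boils down to counting how often it is right. Here is a rough Python sketch, where the `classify` rule stands in for a trained model and the unseen animals are invented for illustration.

```python
def classify(size, ear_floppiness):
    # Hypothetical stand-in for a trained model: the line size + ear_floppiness = 10.
    return "cat" if size + ear_floppiness < 10 else "dog"


# Hypothetical animals the model has not seen: (size, ear_floppiness, true label).
unseen = [
    (3, 3, "cat"),
    (8, 7, "dog"),
    (4, 4, "cat"),
    (3, 2, "dog"),  # a small, pointy-eared puppy
]


def accuracy(classifier, labelled):
    """Return the fraction of labelled examples the classifier gets right."""
    correct = sum(
        1 for size, ears, label in labelled if classifier(size, ears) == label
    )
    return correct / len(labelled)


score = accuracy(classify, unseen)
print(f"{score:.0%} of unseen animals classified correctly")  # 75%
```

The puppy is the one miss here, which is exactly the kind of weakness that testing on fresh examples is meant to reveal.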
You could even try testing the model above yourself, by searching for an image of a cat or a dog and seeing whether the model gets its classification right. Equally, you could think about other measurements that might work better to separate the two animals. For example, could colour work? What about top speed? Start plotting your ideas on a graph and you’ll be a data scientist before you know it!