The Magic Behind Machine Learning? It's All in the Data!
You have probably gathered that machine learning depends entirely on data drawn from experience. In short, we design computational models, feed them data related to a specific task, and help them learn from that data so they can handle the task on their own. Let's see what we mean by data.
Suppose a fish-packing plant wants to automate sorting incoming fish on a conveyor belt according to species. As a pilot project, the plant decides to separate sea bass from salmon using optical sensing. Ashley, a senior engineer at the company, is put in charge of the project. Since salmon and sea bass need to be separated based on their appearance, Ashley sets up a camera and takes some sample images of the incoming fish, such as the following:
The first idea that comes to her mind for automatically separating the fish is to write a computer program that analyzes some properties of each incoming fish from its image and decides whether it is a salmon or a sea bass. In other words, she needs to write rules giving the program a pre-defined strategy for every anticipated situation that can differentiate between the two species. In effect, the program will consist of fixed rules that Ashley has to formulate by hand.
The problem is, how many situations can Ashley anticipate? How many nested if/else statements can she write to cover every property that distinguishes salmon from sea bass, taking all possible aspects into account? Moreover, she may have taken hundreds (or even thousands) of sample images. How can she examine them all and distill the insight she gains into explicit rules for distinguishing the species?
This is where machine learning comes into the picture. Machine learning looks at the problem differently: we do not want to find and program the rules ourselves; we want to create a machine that can find the rules itself. Instead of hard-coding our insight into the machine, we write a program that has the ability to learn.
Fortunately, Ashley has a good grasp of machine learning. So, she decides to take a machine learning-based approach. She begins to note some physical differences between the two types of fish, such as:
- Width
- Lightness
- Number of fins
- Position of the mouth
- etc.
These measurable properties are called features.
A feature is an individual measurable property or characteristic of a phenomenon/sample being observed.
By extracting features, Ashley intends to reduce the raw data (the image of the fish, in this case) to a few measured properties. She chooses the width and the lightness as the features representing each fish on the conveyor. Eventually, she will have something like the following table:
Each row of the above table is obtained by the following procedure, which is often referred to as data collection:
1. Sample collection: use the camera to take an image of the incoming fish.
2. Feature extraction: calculate the width and the lightness of the fish from the image.
3. Data annotation (labeling): write down the species of the fish as its label.
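The three steps above can be sketched as a small pipeline. This is a minimal illustration only: `capture_image`, `measure_width`, and `measure_lightness` are hypothetical stand-ins for the real camera and image-processing code, and the measurements are made up.

```python
def collect_sample(capture_image, measure_width, measure_lightness, species):
    """One pass of the data-collection procedure for a single fish."""
    image = capture_image()                    # 1. Sample collection
    features = (measure_width(image),          # 2. Feature extraction
                measure_lightness(image))
    label = species                            # 3. Data annotation (labeling)
    return features, label

# Hypothetical stand-ins, with made-up measurements:
features, label = collect_sample(
    capture_image=lambda: "raw-image-bytes",
    measure_width=lambda img: 17.1,
    measure_lightness=lambda img: 6.0,
    species="sea bass",
)
print(features, label)  # (17.1, 6.0) sea bass
```

Each call produces one row of the table: a pair of feature values plus a label.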
The table rows correspond to the samples (or observations) Ashley has collected. The columns correspond to the features (or variables) she has recorded for those samples. Each sample has some features and a label, which, in this case, is the fish species. After building the table, Ashley can refer to each fish by its width and lightness alone. This is a typical example of data collection for a machine learning application.
Ashley continues and collects data from several more fish to add more rows to the table. Then, she draws a point for each fish on an x-y graph whose horizontal and vertical axes are the lightness and width of the fish, respectively. She uses the black points for salmon and the red points for sea bass. Here is her resulting graph:
For example, "Fish 4" in the table is represented by the red point at position (x=6, y=17.1) on the graph (the one circled). This graph is called a scatter plot of the samples. A scatter plot uses dots to represent the values of two different numeric variables (lightness and width in Ashley's case).
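A scatter plot like Ashley's can be produced with a few lines of matplotlib. This is only a sketch with made-up (lightness, width) measurements, not Ashley's actual data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Made-up (lightness, width) pairs for illustration only
salmon   = [(2.1, 10.5), (3.0, 11.2), (2.6, 9.8)]
sea_bass = [(6.0, 17.1), (5.4, 16.3), (6.8, 18.0)]

fig, ax = plt.subplots()
ax.scatter(*zip(*salmon), color="black", label="salmon")
ax.scatter(*zip(*sea_bass), color="red", label="sea bass")
ax.set_xlabel("lightness")
ax.set_ylabel("width")
ax.legend()
fig.savefig("fish_scatter.png")
```

Each species gets its own color, so any pattern separating the two groups becomes visible at a glance.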
Remember that the project's purpose was to sort the incoming fish according to species. So, Ashley needs to use the collected data to distinguish between salmon and sea bass. She can do this by simply drawing a line on the scatter plot to separate the black dots from the red ones, something like the following blue dashed line:
After finding such a line, she can implement an automatic workflow to take the following three steps to recognize the species of a new fish:
- Take an image of the fish using the camera
- Extract the features (width and lightness) from the image
- Check on which side of the line the point defined by the measured features lies (left for salmon, right for sea bass)
Finding the line that can discriminate between the two fish species is what machine learning does. The process of feeding data to a machine learning model by which it learns how to perform the desired task (finding the line in Ashley's case) is called training. For a more valuable and in-depth learning experience, make sure to check out our Fundamentals of Machine Learning Online Training.
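As a toy illustration of training (not the method a production system would necessarily use), the classic perceptron rule can learn such a separating line from labeled points, assuming the two species are linearly separable. The data below is made up for the example:

```python
def train_perceptron(samples, labels, epochs=100, lr=0.1):
    """Learn weights (w1, w2) and bias b so that the sign of
    w1*lightness + w2*width + b separates the two classes.
    labels: +1 for sea bass, -1 for salmon."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x, y), t in zip(samples, labels):
            if t * (w1 * x + w2 * y + b) <= 0:  # misclassified: nudge the line
                w1 += lr * t * x
                w2 += lr * t * y
                b  += lr * t
    return w1, w2, b

# Toy (lightness, width) data: salmon (-1) vs sea bass (+1)
samples = [(2.1, 10.5), (3.0, 11.2), (6.0, 17.1), (5.4, 16.3)]
labels  = [-1, -1, +1, +1]
w1, w2, b = train_perceptron(samples, labels)
# Every training point should now fall on the correct side of the line
assert all(t * (w1 * x + w2 * y + b) > 0 for (x, y), t in zip(samples, labels))
```

Feeding the labeled samples to the model and letting it adjust the line until the points are separated is exactly the training process described above.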