How to explain Machine Learning to my grandma.

23 min read · Oct 24, 2019

Grandma, I’m going to explain machine learning to you, and I’ll make sure you understand everything. Machine learning is a valuable tool that helps software engineers take advantage of data.

If you were a software engineer, it would help you do three things better.

First, you would have a tool to reduce programming time. Imagine you want to write a program to correct spelling errors. You could code many examples and general rules by hand (such as M before B and N before V) and, after weeks of hard work, end up with a reasonable program. Or you could use an existing machine learning tool, feed it some examples and get a more reliable program in a fraction of the time.

Second, it allows you to customize products for specific groups of people. Imagine that I wrote my spell checker by hand and that, due to its success, I want versions in the 100 most popular languages. I would have to start from scratch for each language, which would take years of effort. But if I had built it with machine learning, recreating it for another language would mostly mean collecting data in that language and feeding it to the same machine learning model.

Third, machine learning allows you to solve problems that you would not know how to solve by hand. As a human being, I can recognize the faces of my friends and understand their speech, but I do it subconsciously. If I had to write a program to do it, I would be baffled. These are tasks that machine learning algorithms do very well. I don’t have to tell the algorithm what to do. I just have to show it many examples and, from these, it learns to complete the task.

In addition to these three practical reasons for mastering machine learning, there is a philosophical reason: machine learning changes the way you think about a problem. Software engineers are trained to think logically and mathematically. We use assertions to prove that the properties of our program are correct. With machine learning, the approach shifts from mathematics to natural science: we make observations of an uncertain world, run experiments and use statistics, not logic, to analyze the results. The ability to think like a scientist will expand your horizons and open up new areas that you had not explored. Enjoy the trip and the exploration. Did you understand, Grandma?

Phase 1. Prerequisites: before starting the trip into machine learning, you should know how to program in Python, focusing especially on the following tools:

Python programming. Libraries:

pandas for handling data sets, Matplotlib for data visualization, seaborn for plots such as heat maps, NumPy for introductory-level mathematical operations, and scikit-learn for evaluation metrics.

TensorFlow, at a beginner level. It is an open source platform for machine learning. It can be installed on Linux or on a virtual machine in Windows.

Mathematics. Granny, you should have an intermediate level in the following areas:

Algebra, linear algebra, trigonometry, statistics, calculus

Bash terminal and shell basics. A terminal is a console that interprets the commands we type and passes them to the operating system. For example, on Linux we can tell the computer to shut down in 30 minutes with the command “sudo shutdown -h +30” (on Windows, the equivalent is “shutdown -s -t 1800”, where 1800 is the delay in seconds). In machine learning, the shell is very useful for automating certain tasks, which saves significant time.

Once we have installed the previous tools and practiced a bit with exercises found at the end of this blog, we are ready to enter the magical world of machine learning.

Phase 2. Framework: how to frame a task as a machine learning problem, covering many of the basic vocabulary terms shared across a wide variety of machine learning methods. Let’s look at an example:

Before we begin, let’s review the basic framework. We are talking about supervised machine learning. With it, we learn to create models that combine inputs to make useful predictions, even on data we have not seen before. When we train that model, we provide labels. For example, with spam filters, the label can be “is spam” or “is not spam”. It is what we are trying to predict. Attributes (also called features) are the way we represent data. They can be extracted from an email, such as the words in the message, the sender and recipient addresses, or routing and header information; any data that we can extract to represent the email in our machine learning system. An example is one piece of data, such as a single email. It can be a labeled example, in which we have both the attribute information and the value of the label; perhaps a user has provided us with this information. Or it can be an unlabeled example, such as an email for which we have attribute information but do not know whether it is spam. In that case, we classify the email and route it to the inbox or the spam folder. Finally, we have a model, which makes the predictions. We create and test it through a learning process based on data.

ML systems learn how to combine inputs to produce useful predictions about data never seen before.

Let’s explore the basic terminology of machine learning.

Labels

A label is the value we are predicting, that is, the variable y. The label could be the future price of wheat, the type of animal shown in an image, the meaning of an audio clip or just about anything.

Attribute

An attribute is an input variable, that is, the variable x. A simple machine learning project could use a single attribute, while a more sophisticated one could use millions of attributes.

In the spam detector example, the attributes could include the following:

Words in the email text
Return address
Time of day it was sent
Presence of the phrase “an amazing trick” in the email

Examples

An example is a particular instance of data, x. (By convention, x is written in bold to indicate that it is a vector.) Examples fall into two categories:

Labeled examples
Unlabeled examples

A labeled example includes both the attributes and the label. That is:

labeled examples: {features, label}: (x, y)

Labeled examples are used to train the model. In our spam detector, the labeled examples would be the individual emails that users explicitly marked as “is spam” or “is not spam.”

An unlabeled example contains attributes, but not the label. That is:

unlabeled examples: {features, ?}: (x, ?)

Once the model is trained with labeled examples, it is used to predict the label of unlabeled examples. In the spam detector, the unlabeled examples are new emails that people have not yet labeled.

Models

A model defines the relationship between attributes and the label. For example, a spam detection model might associate certain attributes strongly with “is spam.” Let us highlight two phases in the life of a model:

Training means creating or learning the model. That is, you show the model labeled examples and allow it to gradually learn the relationships between the attributes and the label.
Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions.

Regression versus classification

A regression model predicts continuous values. For example, regression models make predictions that answer questions such as the following:

What is the value of a house in California?
What is the probability that a user clicks on this ad?

A classification model predicts discrete values. For example, classification models make predictions that answer questions such as the following:

Is a given email message spam or not spam?
Is this an image of a dog, a cat or a hamster?

Phase 3. Linear regression

It is known that crickets chirp more frequently on hotter days. For decades, professional and amateur entomologists have cataloged data on the number of chirps (songs) per minute and the temperature. For your birthday, Aunt Ruth gives you her beloved cricket database and invites you to learn a model that predicts this relationship.

Figure 1. Chirps per minute against temperature

Indeed, the plot shows that the number of chirps rises with temperature. Is the relationship between chirps and temperature linear? Yes, since it is possible to draw a straight line like the following to approximate this relationship:

Figure 2. A linear relationship

Although the line does not pass perfectly through every point, it clearly shows the relationship between temperature and chirps per minute for these points. With a little algebra, we can write this relationship as follows:

y = mx + b

where:

y is the temperature in degrees Celsius, the value we are trying to predict.
m is the slope of the line.
x is the number of chirps per minute, the value of our input attribute.
b is the y-intercept.

According to the conventions of machine learning, the equation for a model will be written in a slightly different way:

y′=b+w1x1

where:

y′ is the predicted label (the desired output).
b is the bias (the y-intercept). In some machine learning literature, it is referred to as w0.
w1 is the weight of attribute 1. A weight is the same concept as the “slope” m above.
x1 is an attribute (a known input).

To infer (predict) the temperature y′ for a new chirps-per-minute value x1, just substitute the value of x1 into this model.

The subscripts (e.g., w1 and x1) anticipate more sophisticated models that rely on several attributes. For example, a model that relies on three attributes would use the following equation:

y′=b+w1x1+w2x2+w3x3
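As a rough illustration of the equations above, here is a minimal sketch in plain Python. The chirp and temperature numbers are made up for the example (they are not Aunt Ruth's data), and the helper names fit_line and predict are hypothetical. The sketch fits b and w1 with ordinary least squares, which is one common way to learn a line, and then uses the model for inference.

```python
# Hypothetical data: chirps per minute (x1) and temperature in Celsius (y).
chirps = [40.0, 60.0, 80.0, 100.0, 120.0]
temps = [10.0, 15.0, 20.0, 25.0, 30.0]


def fit_line(xs, ys):
    """Return (b, w1) for y' = b + w1 * x1 using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w1 = covariance / variance
    b = mean_y - w1 * mean_x
    return b, w1


def predict(b, weights, features):
    """Compute y' = b + w1*x1 + w2*x2 + ... for any number of attributes."""
    return b + sum(w * x for w, x in zip(weights, features))


b, w1 = fit_line(chirps, temps)
predicted_temp = predict(b, [w1], [90.0])  # inference for 90 chirps per minute
print(f"b={b:.2f}, w1={w1:.2f}, predicted temperature={predicted_temp:.1f}")
```

The same predict function also covers the three-attribute equation: pass three weights and three attribute values.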

First steps with TensorFlow: Toolkit

TensorFlow is a computational framework to build machine learning models. TensorFlow provides a variety of different toolkits that allow you to build models at your preferred level of abstraction. You can use lower level APIs to build models by defining a series of mathematical operations. Alternatively, you can use higher-level APIs (such as tf.estimator) to specify predefined architectures, such as linear regressors or neural networks.

The following figure shows the current hierarchy of TensorFlow toolkits:

In very general terms, here is the pseudocode for a linear classification program implemented in tf.estimator:

import tensorflow as tf

# Set up a linear classifier.
classifier = tf.estimator.LinearClassifier(feature_columns)

# Train the model on some example data.
classifier.train(input_fn=train_input_fn, steps=2000)

# Use it to predict.
predictions = classifier.predict(input_fn=predict_input_fn)

Training and test sets: splitting data

The idea is to split the dataset into two subsets:

training set: a subset to train a model.
test set: a subset to test the trained model.

You could imagine slicing the single data set as follows:

Figure 1. Slicing a single data set into a training set and a test set.

Make sure your test set meets the following conditions:

It is large enough to produce statistically significant results.
It is representative of the dataset as a whole. In other words, do not choose a test set with different characteristics to the training set.

Assuming that your test set meets the above conditions, your goal is to create a model that generalizes well to new data. The test set serves as a proxy for new data. For example, consider the following figure. Note that the model learned from the training data is very simple. This model does not do a perfect job: a few predictions are wrong. However, it does about as well on the test data as it does on the training data. In other words, this simple model does not overfit the training data.

Figure 2. Validating the trained model against test data.

Never train on test data. If you are seeing surprisingly good results on your evaluation metrics, it may be a sign that you are accidentally training on the test set. For example, unusually high accuracy could indicate that test data has leaked into the training set.

For example, consider a model that predicts whether an email is spam, using the subject line, the email body and the sender’s email address as features. We split the data into training and test sets with an 80–20 split. After training, the model achieves 99% accuracy on both the training set and the test set. We would expect lower accuracy on the test set, so we take another look at the data and discover that many of the examples in the test set are duplicates of examples in the training set (we neglected to remove duplicate entries for the same spam email from our input database before splitting the data). We have inadvertently trained on some of our test data and, as a result, we are no longer accurately measuring how well our model generalizes to new data.
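To make the duplicate problem concrete, here is a small sketch in plain Python, assuming each email can be represented as a (subject, label) tuple. The emails and the helper name dedupe_then_split are invented for the illustration. It removes exact duplicates first and only then shuffles and splits, so the same email cannot land on both sides.

```python
import random


def dedupe_then_split(examples, test_fraction=0.2, seed=42):
    """Remove exact duplicates, shuffle, then split into (train, test)."""
    unique = list(dict.fromkeys(examples))  # keeps the first copy of each example
    rng = random.Random(seed)
    rng.shuffle(unique)
    test_size = int(len(unique) * test_fraction)
    return unique[test_size:], unique[:test_size]


# Hypothetical emails; the same spam message was stored three times.
emails = [
    ("Win a prize now", "spam"),
    ("Meeting at 10", "not spam"),
    ("Win a prize now", "spam"),
    ("Lunch on Friday?", "not spam"),
    ("Win a prize now", "spam"),
    ("Your invoice", "not spam"),
    ("Cheap watches", "spam"),
]

train_set, test_set = dedupe_then_split(emails)
overlap = set(train_set) & set(test_set)
print(len(train_set), len(test_set), len(overlap))
```

Real spam often differs by a few characters, so exact matching is only a first step; near-duplicate detection is a separate problem that this sketch does not attempt.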

Representation: Feature Engineering

In traditional programming, the focus is on the code. In machine learning projects, the focus shifts to representation. That is, one way developers refine a model is by adding and improving its features.

The left side of Figure 1 illustrates raw data from an input data source; the right side illustrates a feature vector, which is the set of floating-point values comprising the examples in the dataset. Feature engineering means transforming raw data into a feature vector. Expect to spend a lot of time on feature engineering.

Many machine learning models must represent features as real-numbered vectors, because the feature values must be multiplied by the model weights.

Figure 1. Feature engineering maps raw data to ML features.

Assigning numerical values

Integer and floating-point data need no special encoding because they can be multiplied by a numeric weight. As suggested in Figure 2, converting the raw integer value 6 to the feature value 6.0 is trivial:

Figure 2. Assignment of integer values to floating point values.

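Categorical features such as street names are strings, so they must be converted to numbers before a model can use them. One approach is to define a vocabulary that assigns an integer index to each known value and groups everything else into an out-of-vocabulary (OOV) bucket. With this approach, this is how we can map our street names to numbers:

map Charleston Road to 0
map North Shoreline Boulevard to 1
map Shorebird Way to 2
map Rengstorff Avenue to 3
map everything else (OOV) to 4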

However, if we incorporate these index numbers directly into our model, it imposes some constraints that might be problematic:

We learn a single weight that applies to every street. For example, if we learn a weight of 6 for street_name, then we multiply it by 0 for Charleston Road, by 1 for North Shoreline Boulevard, by 2 for Shorebird Way and so on. Consider a model that predicts housing prices using street_name as a feature. It is unlikely that price adjusts linearly with the street name, and this would also assume that the streets are ordered by their average housing price. Our model needs the flexibility to learn a different weight for each street, which is then added to the price estimated from the other features.
We are not accounting for cases where street_name can take multiple values. For example, many homes are located on the corner of two streets, and there is no way to encode that information in the street_name value if it contains a single index.

To remove both restrictions, we can create a binary vector for each categorical feature in our model representing the values as follows:

For values that apply to the example, set the corresponding vector elements to 1.
Set all other elements to 0.

The length of this vector equals the number of elements in the vocabulary. This representation is called one-hot encoding when a single value is 1 and multi-hot encoding when several values are 1.

Figure 3 illustrates a one-hot encoding of a particular street: Shorebird Way. The element in the binary vector for Shorebird Way has a value of 1, while the elements for all other streets have a value of 0.

Figure 3. Mapping a street address via one-hot encoding.

This approach effectively creates a boolean variable for every feature value (e.g., each street name). Here, if a house is on Shorebird Way, then the binary value is 1 only for Shorebird Way. Therefore, the model uses only the weight for Shorebird Way.

Similarly, if a house is on the corner of two streets, then two binary values are set to 1, and the model uses both of their respective weights.

One-hot encoding also extends to numeric data that you do not want to multiply directly by a weight, such as a postal code.
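The encoding described above can be sketched in a few lines of plain Python. The vocabulary is the street list from this section with one extra slot for OOV; the function name encode_streets is an assumption made for the example.

```python
# Vocabulary from the street example; unknown streets go to the OOV slot.
VOCABULARY = ["Charleston Road", "North Shoreline Boulevard",
              "Shorebird Way", "Rengstorff Avenue"]
OOV_INDEX = len(VOCABULARY)


def encode_streets(street_names):
    """Return a binary vector with a 1 for every street the house is on."""
    vector = [0] * (len(VOCABULARY) + 1)
    for name in street_names:
        index = VOCABULARY.index(name) if name in VOCABULARY else OOV_INDEX
        vector[index] = 1
    return vector


one_hot = encode_streets(["Shorebird Way"])
multi_hot = encode_streets(["Shorebird Way", "Rengstorff Avenue"])  # corner house
print(one_hot)
print(multi_hot)
```

With this layout the model can hold one weight per position, so each street gets its own weight instead of sharing a single multiplied index.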

Sparse representation

Suppose you have 1,000,000 different street names in your data set that you want to include as values for street_name. Explicitly creating a binary vector of 1,000,000 elements in which only 1 or 2 elements are true is a very inefficient representation, in terms of both storage and computation time when processing these vectors. In this situation, a common approach is to use a sparse representation in which only the nonzero values are stored. In sparse representations, an independent model weight is still learned for each feature value, as described above.
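Here is a minimal sketch of that idea in plain Python, assuming a binary vector and made-up indices and weights. Instead of a million zeros, we keep only the positions that are 1, and the weighted sum only touches those positions.

```python
def to_sparse(dense_vector):
    """Keep only the positions of nonzero elements, plus the total length."""
    indices = [i for i, value in enumerate(dense_vector) if value != 0]
    return indices, len(dense_vector)


def sparse_dot(indices, weights):
    """Sum the weights of the active elements of a binary sparse vector."""
    return sum(weights[i] for i in indices)


dense = [0] * 1_000_000
dense[123_456] = 1   # hypothetical index of one street
dense[654_321] = 1   # hypothetical index of the cross street
active, size = to_sparse(dense)

weights = {123_456: 2.5, 654_321: -1.0}  # made-up learned weights
print(active, size, sparse_dot(active, weights))
```

Libraries provide their own sparse formats; this sketch only shows why storing two indices is cheaper than storing a million elements.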

Representation: Data Cleaning

Apple trees produce a mixture of great fruit and wormy fruit. Yet the apples in high-end grocery stores display 100% perfect fruit. Between the orchard and the grocery store, someone spends a lot of time removing the bad apples or throwing a little wax on the salvageable ones. As an ML engineer, you’ll spend a lot of your time tossing out bad examples and cleaning up the salvageable ones. Even a few “bad apples” can spoil a large data set.

Handling outliers

The following plot represents a feature called roomsPerPerson from the California Housing data set. The value of roomsPerPerson is calculated by dividing the total number of rooms in an area by the population of that area. The plot shows that the vast majority of areas in California have one or two rooms per person. But take a look along the x-axis.

Figure 4. A long tail.

How can we minimize the influence of those extreme outliers? Well, one way would be to take the log of every value:

Figure 5. The logarithmic scale still leaves a tail.
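The log transform mentioned above can be tried on a handful of numbers. The values below are invented, and using log(x + 1) rather than log(x) is my own choice here so that a value of zero stays defined.

```python
import math

# Made-up roomsPerPerson values; the last two are extreme outliers.
rooms_per_person = [1.0, 1.5, 2.0, 2.2, 18.0, 50.0]

# log(x + 1) keeps zero-valued inputs defined and compresses large values.
log_scaled = [math.log(value + 1) for value in rooms_per_person]

raw_spread = max(rooms_per_person) / min(rooms_per_person)
log_spread = max(log_scaled) / min(log_scaled)
print(raw_spread, log_spread)
```

The largest value is still the largest after scaling, which is why a tail remains; the transform only shrinks how far it sits from the rest.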

Binning

The following graph shows the relative prevalence of houses in different latitudes in California. Note the grouping: Los Angeles is located at latitude 34 and San Francisco is approximately at latitude 38.

Figure 7. Houses per latitude.

In the data set, latitude is a floating-point value, but feeding it to the model as a single number assumes a linear relationship that does not exist: houses at latitude 35 are not 35/34 times as expensive (or as cheap) as houses at latitude 34. To make latitude a useful predictor, divide latitudes into “bins” as suggested by the following figure:

Figure 8. Binning values.

Instead of having one floating-point feature, we now have 11 distinct boolean features (LatitudeBin1, LatitudeBin2, …, LatitudeBin11). Having 11 separate features is somewhat inelegant, so let’s unite them into a single 11-element vector. Doing so lets us represent latitude 37.4 as follows:

[ 0 , 0 , 0 , 0 , 0 , 1 , 0 , 0 , 0 , 0 , 0 ]
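A vector like the one above can be produced with a short function. This is only a sketch: the bin edges (one-degree bins from latitude 32 to 43, giving 11 bins) are an assumption, since the figure with the real boundaries is not reproduced here, and latitude_to_bins is a hypothetical name.

```python
def latitude_to_bins(latitude, low=32.0, high=43.0):
    """One-hot encode a latitude into one-degree bins between low and high."""
    num_bins = int(high - low)
    vector = [0] * num_bins
    # Clamp so that values outside the range fall into the first or last bin.
    index = min(max(int(latitude - low), 0), num_bins - 1)
    vector[index] = 1
    return vector


print(latitude_to_bins(37.4))
```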

Thanks to binning, our model can now learn completely different weights for each latitude.

Scrubbing

So far, we have assumed that all the data used for training and testing was reliable. In real life, many examples in data sets are unreliable due to one or more of the following:

Omitted values. For example, a person forgot to enter a value for the age of a house.
Duplicate examples. For example, a server mistakenly uploaded the same records twice.
Bad labels. For example, a person mislabeled a picture of an oak tree as a maple.
Bad feature values. For example, someone typed an extra digit, or a thermometer was left out in the sun.

Once detected, you typically “fix” bad examples by removing them from the data set. To detect omitted values or duplicated examples, you can write a simple program. Detecting bad feature values or labels can be much trickier.
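A simple program of that kind might look like the sketch below, written in plain Python. The records, the field names and the function names are all made up for the illustration, and None stands in for an omitted value.

```python
# Hypothetical housing records; None marks an omitted value.
records = [
    {"id": 1, "house_age": 12, "price": 250_000},
    {"id": 2, "house_age": None, "price": 310_000},
    {"id": 3, "house_age": 40, "price": 180_000},
    {"id": 1, "house_age": 12, "price": 250_000},  # uploaded twice
]


def has_missing(row):
    """True if any field of the row was left empty."""
    return any(value is None for value in row.values())


def scrub(rows):
    """Drop rows with omitted values and keep only the first copy of duplicates."""
    seen = set()
    clean = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen or has_missing(row):
            continue
        seen.add(key)
        clean.append(row)
    return clean


missing = [row for row in records if has_missing(row)]
clean = scrub(records)
print(len(records), len(missing), len(clean))
```

Dropping rows is the simplest fix; whether to drop or repair a record depends on how much data you can afford to lose.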

In addition to detecting bad individual examples, you must also detect bad data in the aggregate. Histograms are a great mechanism for visualizing your data in the aggregate. In addition, statistics like the following can help:

Maximum and minimum
Mean and median
Standard deviation

Consider generating lists of the most common values for discrete features. For example, does the number of examples with country:uk match the number you expect? Should language:jp really be the most common language in your data set?
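These aggregate checks need nothing beyond the Python standard library. The sketch below uses invented values; the 250 in house_ages plays the role of an extra digit typed by mistake.

```python
import statistics
from collections import Counter

# Made-up feature values for illustration.
house_ages = [5, 12, 12, 30, 41, 58, 250]   # 250 looks like a typing error
countries = ["uk", "us", "uk", "jp", "uk", "us"]

summary = {
    "min": min(house_ages),
    "max": max(house_ages),
    "mean": statistics.mean(house_ages),
    "median": statistics.median(house_ages),
    "stdev": statistics.stdev(house_ages),
}
most_common_countries = Counter(countries).most_common(2)

print(summary)
print(most_common_countries)
```

A mean far above the median, as here, is a hint that a few extreme values deserve a closer look.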

Know Your Data

Follow these rules:

Keep in mind what you think your data should look like.
Verify that the data meets these expectations (or that you can explain why it does not).
Double-check that the training data agrees with other sources (for example, dashboards).
Treat your data with all the care that you would give any mission-critical code. Good ML relies on good data.

Classification: True vs. False and Positive vs. Negative

In this section, we will define the main building blocks of the metrics we use to evaluate classification models. But first, a fable:

Aesop’s fable: The boy who cried wolf (compressed)
A shepherd boy gets bored tending the town’s flock. For fun, he shouts “Wolf!” even though there is no wolf in sight. The villagers run to protect the flock, but then become very angry when they realize that the boy was playing a joke on them.
[Repeat the previous paragraph N times.]
One night, the shepherd boy sees a real wolf approaching the flock and shouts “Wolf!” The villagers refuse to be fooled again and stay in their homes. The hungry wolf turns the flock into lamb chops. The town goes hungry. Panic ensues.

Let’s make the following definitions:

“Wolf” is a positive class.
“No wolf” is a negative class.

We can summarize our “wolf-prediction” model using a 2x2 confusion matrix that depicts all four possible outcomes:

A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.

A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.
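The four outcomes defined above can be counted with a few lines of plain Python. The nights listed below are invented, and the function name confusion_counts is just a label chosen for this sketch.

```python
def confusion_counts(actual, predicted, positive="wolf"):
    """Count the four outcomes for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


# Hypothetical nights: what really happened versus what the shepherd shouted.
actual = ["no wolf", "no wolf", "wolf", "no wolf", "wolf"]
predicted = ["wolf", "no wolf", "wolf", "no wolf", "no wolf"]

counts = confusion_counts(actual, predicted)
print(counts)
```

In the fable, the false positive is the joke that angers the villagers and the false negative is the night the wolf eats the flock.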

In the following sections, we’ll look at how to evaluate classification models using metrics derived from these four outcomes.

Introduction to Neural Networks: Anatomy

Figure 1. A nonlinear classification problem.

Summary

Now, our model has all the standard components of what is generally known as a “neural network”:

A set of nodes, analogous to neurons, organized in layers.
A set of weights representing the connections between each neural network layer and the layer beneath it. The layer beneath may be another neural network layer or some other kind of layer.
A set of biases, one for each node.
An activation function that transforms the output of each node in a layer. Different layers may have different activation functions.

One caveat: neural networks are not necessarily always better than combinations of attributes (feature crosses), but they are a flexible alternative that works well in many cases.
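To see how nodes, weights, biases and an activation function fit together, here is a tiny forward pass in plain Python. All numbers are made up, the choice of ReLU as the activation is my assumption (the text above does not name one), and nothing is trained here.

```python
def relu(value):
    """Activation function: pass positives through, clamp negatives to 0."""
    return max(0.0, value)


def dense_layer(inputs, weights, biases, activation):
    """Each node computes activation(bias + weighted sum of the layer below)."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = bias + sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(activation(total))
    return outputs


# Made-up weights: 2 inputs -> hidden layer of 2 nodes -> 1 output node.
hidden = dense_layer([1.0, 2.0], [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu)
output = dense_layer(hidden, [[2.0, 0.5]], [0.1], lambda v: v)  # linear output
print(hidden, output)
```

Each layer gets its own activation argument, which mirrors the point that different layers may use different activation functions.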

Training of neural networks:

Multi-class neural networks: One vs. all

One vs. all provides a way to leverage binary classification. Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers, that is, one binary classifier for each possible outcome. During training, the model runs through a sequence of binary classifiers, training each to answer a separate classification question. For example, given a picture of a dog, five different recognizers might be trained, four seeing the image as a negative example (not a dog) and one seeing it as a positive example (a dog). That is:

Is this image an apple? No.
Is this image a bear? No.
Is this image candy? No.
Is this image a dog? Yes.
Is this image an egg? No.

This approach is relatively reasonable when the total number of classes is small, but becomes increasingly inefficient as the number of classes increases.

We can create a significantly more efficient one-vs.-all model with a deep neural network in which each output node represents a different class. The following figure suggests this approach:

Figure 1. A one-vs.-all neural network.
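As a sketch of the output layer only, the code below treats each output node as its own yes/no question by passing its raw score through a sigmoid. The raw scores are invented stand-ins for what the earlier layers of a trained network would produce, and the 0.5 threshold is an assumption.

```python
import math


def sigmoid(value):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


CLASSES = ["apple", "bear", "candy", "dog", "egg"]


def one_vs_all(raw_scores):
    """Treat each output node as its own yes/no classifier for one class."""
    probabilities = [sigmoid(score) for score in raw_scores]
    answers = {name: p >= 0.5 for name, p in zip(CLASSES, probabilities)}
    best = CLASSES[probabilities.index(max(probabilities))]
    return answers, best


# Made-up raw outputs of the final layer for a picture of a dog.
answers, best = one_vs_all([-2.0, -1.0, -3.0, 2.5, -4.0])
print(answers, best)
```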

Static vs. Dynamic Inference

You can choose either of the following inference strategies:

offline (static) inference, meaning that you make all possible predictions in a batch, using a MapReduce or something similar. You then write the predictions to an SSTable or Bigtable, and then feed these to a cache/lookup table.
online (dynamic) inference, meaning that you predict on demand, using a server.

Summary

Here are the pros and cons of offline inference:

Pro: Don’t need to worry much about cost of inference.
Pro: Can likely use batch quota or some giant MapReduce.
Pro: Can do post-verification of predictions before pushing.
Con: Can only predict things we know about — bad for long tail.
Con: Update latency is likely measured in hours or days.

Here are the pros and cons of online inference:

Pro: Can make a prediction on any new item as it comes in — great for long tail.
Con: Compute intensive, latency sensitive — may limit model complexity.
Con: Monitoring needs are more intensive.
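The trade-off in the lists above can be seen in a toy sketch. The function model_predict is a stand-in invented for this example, not a real model, and an ordinary dictionary plays the role of the cache/lookup table.

```python
def model_predict(item_id):
    """Stand-in for a trained model; a real model would use learned weights."""
    return (item_id * 37) % 100 / 100.0


# Offline (static) inference: predict for all known items in a batch, then cache.
known_items = [1, 2, 3]
lookup_table = {item: model_predict(item) for item in known_items}


def serve_offline(item_id):
    """Answer from the precomputed table; unknown items have no prediction."""
    return lookup_table.get(item_id)


def serve_online(item_id):
    """Run the model on demand, so new items work but each call costs compute."""
    return model_predict(item_id)


print(serve_offline(2), serve_offline(99), serve_online(99))
```

Item 99 was never in the batch, so the offline path has nothing to return for it, which is the long-tail weakness listed above.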

Fairness: Types of Bias

Machine learning models are not inherently objective. Engineers train models by feeding them a data set of training examples, and human involvement in the provision and curation of this data can make a model’s predictions susceptible to bias.

When building models, it’s important to be aware of common human biases that can manifest in your data, so you can take proactive steps to mitigate their effects.

WARNING: The following inventory of biases provides just a small selection of biases that are often uncovered in machine learning data sets; this list is not intended to be exhaustive. Wikipedia’s catalog of cognitive biases enumerates over 100 different types of human bias that can affect our judgment. When auditing your data, you should be on the lookout for any and all potential sources of bias that might skew your model’s predictions.

Reporting Bias

Reporting bias occurs when the frequency of events, properties, and/or outcomes captured in a data set does not accurately reflect their real-world frequency. This bias can arise because people tend to focus on documenting circumstances that are unusual or especially memorable, assuming that the ordinary can “go without saying.”

EXAMPLE: A sentiment-analysis model is trained to predict whether book reviews are positive or negative based on a corpus of user submissions to a popular website. The majority of reviews in the training data set reflect extreme opinions (reviewers who either loved or hated a book), because people were less likely to submit a review of a book if they did not respond to it strongly. As a result, the model is less able to correctly predict sentiment of reviews that use more subtle language to describe a book.

Automation Bias

Automation bias is a tendency to favor results generated by automated systems over those generated by non-automated systems, irrespective of the error rates of each.

EXAMPLE: Software engineers working for a sprocket manufacturer were eager to deploy the new “groundbreaking” model they trained to identify tooth defects, until the factory supervisor pointed out that the model’s precision and recall rates were both 15% lower than those of human inspectors.

Selection Bias

Selection bias occurs if a data set’s examples are chosen in a way that is not reflective of their real-world distribution. Selection bias can take many different forms:

Coverage bias: Data is not selected in a representative fashion.

EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product. Consumers who instead opted to buy a competing product were not surveyed, and as a result, this group of people was not represented in the training data.

Non-response bias (or participation bias): Data ends up being unrepresentative due to participation gaps in the data-collection process.

EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product and with a sample of consumers who bought a competing product. Consumers who bought the competing product were 80% more likely to refuse to complete the survey, and their data was underrepresented in the sample.

Sampling bias: Proper randomization is not used during data collection.

EXAMPLE: A model is trained to predict future sales of a new product based on phone surveys conducted with a sample of consumers who bought the product and with a sample of consumers who bought a competing product. Instead of randomly targeting consumers, the surveyor chose the first 200 consumers that responded to an email, who might have been more enthusiastic about the product than average purchasers.

Group Attribution Bias

Group attribution bias is a tendency to generalize what is true of individuals to an entire group to which they belong. Two key manifestations of this bias are:

In-group bias: A preference for members of a group to which you also belong, or for characteristics that you also share.

EXAMPLE: Two engineers training a résumé-screening model for software developers are predisposed to believe that applicants who attended the same computer-science academy as they both did are more qualified for the role.

Out-group homogeneity bias: A tendency to stereotype individual members of a group to which you do not belong, or to see their characteristics as more uniform.

EXAMPLE: Two engineers training a résumé-screening model for software developers are predisposed to believe that all applicants who did not attend a computer-science academy do not have sufficient expertise for the role.

Implicit Bias

Implicit bias occurs when assumptions are made based on one’s own mental models and personal experiences that do not necessarily apply more generally.

EXAMPLE: An engineer training a gesture-recognition model uses a head shake as a feature to indicate a person is communicating the word “no.” However, in some regions of the world, a head shake actually signifies “yes.”

A common form of implicit bias is confirmation bias, where model builders unconsciously process data in ways that affirm preexisting beliefs and hypotheses. In some cases, a model builder may actually keep training a model until it produces a result that aligns with their original hypothesis; this is called experimenter’s bias.

EXAMPLE: An engineer is building a model that predicts aggressiveness in dogs based on a variety of features (height, weight, breed, environment). The engineer had an unpleasant encounter with a hyperactive toy poodle as a child, and ever since has associated the breed with aggression. When the trained model predicted most toy poodles to be relatively docile, the engineer retrained the model several more times until it produced a result showing smaller poodles to be more violent.

Data Skew

Any sort of skew in your data, where certain groups or characteristics may be under- or over-represented relative to their real-world prevalence, can introduce bias into your model.

If you completed the Validation programming exercise, you may recall discovering how a failure to randomize the California housing data set prior to splitting it into training and validation sets resulted in a pronounced data skew. Figure 1 visualizes a subset of data drawn from the full data set that exclusively represents the northwest region of California.

Figure 1. California state map overlaid with data from the California Housing data set. Each dot represents a housing block, with colors ranging from blue to red corresponding to median house price ranging from low to high, respectively.

If this unrepresentative sample were used to train a model to predict California housing prices statewide, the lack of housing data from southern portions of California would be problematic. The geographical bias encoded in the model might adversely affect homebuyers in unrepresented communities.
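The effect of skipping randomization before a split can be reproduced with made-up numbers. The sketch below assumes 100 invented housing blocks stored from north to south; it does not use the real California Housing data.

```python
import random

# Made-up housing blocks sorted north to south by latitude, as in a sorted file.
blocks = [{"latitude": 42.0 - 0.1 * i} for i in range(100)]


def split(rows, train_fraction=0.8):
    """Take the first rows for training and the rest for validation."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def mean_latitude(rows):
    """Average latitude of a list of blocks."""
    return sum(row["latitude"] for row in rows) / len(rows)


# Without shuffling, the validation set only contains the southernmost blocks.
_, skewed_validation = split(blocks)

shuffled = blocks[:]
random.Random(0).shuffle(shuffled)
_, shuffled_validation = split(shuffled)

overall = mean_latitude(blocks)
print(overall, mean_latitude(skewed_validation), mean_latitude(shuffled_validation))
```

Shuffling fixes this particular ordering problem only; it cannot add regions that were never collected in the first place.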