Machine Learning is about Change of Space rather than Higher Dimension
Updated: Oct 16, 2021
This post is intended to build good intuition for the mathematical and stochastic processes underneath machine learning systems. We believe that intuition is critical to, first, being able to reach production level and, second, implementing solutions that are robust to real-world data. For the practitioner: we are not focusing on specific topics like "regularization"; we rather want to provide interesting reading for any curious person. The two functions shown in the main picture above characterize many modern machine learning models, yet they do so in very different ways depending on the specific structure. Discussing the linear or non-linear character of those two functions through practical examples is an interesting way to approach our discussion. We will work toward the final project which inspired this post, a project on Natural Language Processing (NLP), and we will also develop the argument in terms of image recognition applications. Our first example will be a discussion that the interested reader can find in common books, called the "XOR" example, which we will then use for comparison with the other two applications.
We can immediately say that, when it comes to machine learning and to effectively solving problems in that domain, declaring that one, both, or neither of the functions above is linear or non-linear does not provide much value. Someone could say that the first function is non-linear and that the second, or both, are piecewise linear (function b strictly so, function a only approximately). However, it may be more useful to say that they are both linear and non-linear at the same time, depending on how they are deployed within the model. Note that we could also include in the discussion the sigmoid function, similar to function a but bounded between 0 and 1 rather than between -1 and 1. Behind these statements lies the intuition this post wants to develop below.
“I see well in many dimensions as long as the dimensions are 2” - Martin Shubik
The quoted phrase above could summarize the view many have about machine learning; the same or similar phrases are indeed used in books on AI (artificial intelligence), NLP (natural language processing), ML (machine learning), and more. While the author of the quote was probably not referring specifically to machine learning, it does a good job of summarizing the human limitation that the technology under discussion allows us to overcome. However, when it comes to more complex machine learning models like deep neural networks, that phrase does not tell the whole story; the "deep" characteristic is not even that important, as we will see in a bit.
Trying to summarize Martin Shubik's phrase in mathematical terms, we could say that humans can intuitively see relationships among variables as long as those relationships are limited to linear combinations in 2 dimensions (we added the "linear" characteristic, which was not part of the original quote). That linear combination is something like aX + bY = 0, which is the equation of a straight line in a 2D chart and can be rewritten as Y = cX + d. When applied to neural networks, though, the quote misses a key concept. Neural networks are not only tools able to account for many more variables (i.e. dimensions) than humans and "common" systems can; they are also able to change the space created by those variables. While that fact is not itself a prerequisite to solving the problem, it is the way neural networks operate in general; therefore, if we want to design effective systems, we need the right intuition about it. Let us explain this by going through the simple math of a fully connected neural network with only 1 hidden layer (red layer in figure 1).
Figure 1: schematic NN, Xi = inputs, Yi = outputs, Zi = intermediate features built by the model (picture from the book Elements of Statistical Learning)
The math and the intermediate features that the model of figure 1 computes once it has received inputs Xi, while trying to output Yi, are summarized by the following:
Figure 2: picture from the book Elements of Statistical Learning
Each intermediate feature Zm (say "neuron") of the intermediate layer of the network is a linear combination of the inputs to that layer (figure 2). The result is then passed through the function sigma before the parenthesis, which acts as a sort of on/off switch for that linear combination toward the computation developed at the next layer (possibly the output one); there may also be a proportionality or saturating factor depending on the specific function chosen. That sigma is often the ReLU function, which is function b of the initial picture; function a can be used too, while a slightly different function, the sigmoid, is often used in the output layer for classification problems, i.e. telling what an image is about, what kind of text we are reading, etc. Going back to our initial question, saying whether functions (a) and (b) are linear or not does not mean much considering our objective; what matters is the non-linear behavior that the overall structure allows. Let us see why that is important. Please note, the non-linearities of the specific function still affect the performance of the overall model; here, however, we are trying to solve 80% of the problem and discuss the major concepts.
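The computation of figure 2 can be sketched in a few lines of numpy. This is a minimal illustration, not a trained model: the weights are random, and the names (`relu`, `forward`, `W1`, `b1`, etc.) are our own for this sketch.

```python
import numpy as np

def relu(x):
    # function (b): zero below 0, identity above -- the on/off switch
    return np.maximum(0.0, x)

def forward(X, W1, b1, W2, b2):
    # hidden features Z_m: a nonlinearity applied to a linear combination of the inputs
    Z = relu(X @ W1 + b1)
    # output: another linear combination, this time of the derived features Z
    return Z @ W2 + b2

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                    # 4 samples, 3 input variables X_i
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)  # 3 inputs -> 5 hidden neurons Z_m
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)  # 5 hidden features -> 1 output Y
print(forward(X, W1, b1, W2, b2).shape)        # (4, 1)
```

Note how the only non-linear step is the switch between the two linear combinations; everything else is plain matrix algebra.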
Let us now consider a simpler model, logistic regression, which we can think of as a model for classification applications and can associate with the neural network of figure 1 stripped of the intermediate layer (the red layer). Logistic regression is also similar to linear regression, with the only addition of a non-linear sigmoid function (similar to function a) at the output to return results between 0 and 1, allowing for classification rather than regression. Example: think of the result between 0 and 1 as the probability that an image is about a dog. Starting from the inputs of the problem, logistic regression would immediately compute the following:
Figure 3: in logistic regression, there are no intermediate layers and the output Y is the result of a straightforward linear combination of the input X
That lack of intermediate layers prevents simple logistic regression from leveraging a critical ability of more complex deep neural networks: the ability to "change the variable-space and linearly separate it". That is why, even though logistic regression presents a non-linear output function (the sigmoid), it can be considered a linear model: it only considers linear combinations of the input variables.
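A short sketch makes the point concrete. The weights below are invented for illustration; what matters is that the decision boundary sits where the linear combination equals zero, i.e. on a straight line, no matter how non-linear the sigmoid applied afterwards is.

```python
import numpy as np

def sigmoid(z):
    # function similar to (a), but bounded between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(X, w, b):
    # ONE linear combination of the raw inputs, then squashed to (0, 1)
    return sigmoid(X @ w + b)

w, b = np.array([1.0, -2.0]), 0.5
# The decision boundary is X @ w + b == 0: a straight line in the input plane.
point_on_boundary = np.array([[1.5, 1.0]])     # 1.5 - 2.0 + 0.5 = 0
print(logistic_predict(point_on_boundary, w, b))  # exactly 0.5
```

Moving the point to either side of that line pushes the output toward 0 or 1, but the separating shape itself stays a line.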
We will show what the difference between linear models (e.g. logistic regression) and non-linear models (e.g. neural networks) is about by going through a common example from the book Deep Learning by Ian Goodfellow; then, as anticipated at the beginning, we will stretch the concept through a customized example based on image recognition, to finally arrive at the application on Natural Language Processing. We sum up the "XOR" example below to provide context to the reader who has not gone through the mentioned text; the experienced reader can skip to "End of XOR".
The starting example is the "XOR" application: can we build a linear mathematical model (linear meaning like a simple logistic regression, without the ability to build the Zm of an intermediate layer) able to tell us whether a system is in the "AND", the "OR", or the "XOR" state? What that means is the following: picture a system trying to spot real-estate opportunities on the basis of whether a house has both an elevator and a pool ("AND"), whether it has at least one of the two ("OR"), or whether it has either one but not both at the same time ("XOR") – situations depicted in figure 4. This is equivalent to asking whether we can linearly separate a 2D space with "elevator" and "pool" on the two axes. It turns out the solution exists for the "AND" and "OR" conditions (figure 4 a/b), but we get into trouble in the "XOR" case (figure 4 c).
Intuitively, we can tell from the picture that the solution to the third, "XOR" problem is non-linear: while we can linearly find the solution to the first two cases, we cannot separate the blue and white dots with a straight line in the third one. We could solve the third case by drawing an oval around the two blue or white dots, which is simple in principle for a human, but our machine prefers to approach the problem differently. A neural network would rather "separate the variables" (or inputs), which is equivalent to changing the variable space, in such a way that the new, modified space can be separated linearly. The same Deep Learning book mentioned above explains that concept by showing what the following simple network would output (a network with an intermediate hidden layer producing the derived variables h):
Figure 5: simple “deep” NN
The derived features h that the model builds allow for the change of variable-space shown in figure 6, which can be separated linearly.
Figure 6: The reader with a bit of patience to go through a couple of linear combinations in h1, h2, and y1 will see that the new space works "linearly", according to figure 6/b.
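For the reader who prefers running the couple of linear combinations rather than doing them by hand, here is the XOR network with the hand-set weights given in the Deep Learning book (two ReLU hidden units, no training involved):

```python
import numpy as np

# Hand-set weights from the Deep Learning book's XOR solution:
# hidden layer h = relu(X W + c), output y = h . w + b
W, c = np.array([[1.0, 1.0], [1.0, 1.0]]), np.array([0.0, -1.0])
w, b = np.array([1.0, -2.0]), 0.0

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
h = np.maximum(0.0, X @ W + c)   # the changed variable space of figure 6
y = h @ w + b
print(h)  # (0,0) and (1,1) land on different h-points: now separable by a line
print(y)  # [0. 1. 1. 0.] -- exactly XOR
```

The interesting part is `h`: inputs (0,1) and (1,0) are mapped onto the same point of the new space, which is what makes the straight-line separation of figure 6/b possible.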
End of “XOR”
Let us now try to stretch the concept by applying the discussion to a customized example on image recognition. Let us input an image to a neural network and see what each layer does. Please note, a real application on image recognition would likely involve a convolutional NN; at the end of the post there is a technical parenthesis on the convolution operation applied to image recognition for the interested reader. Pictures of the intermediate results of an image recognition operation like the one shown in figure 7 (an example with three hidden layers) are pretty common on the web; let us see what those steps mean in terms of the mathematical and space analogy discussed above.
Figure 7: Simple image detection (across three layers of a neural network)
The important thing to note in figure 7 is that the very first features that the neural network catches are fairly linear (Low-Level Features), meaning that the network is initially separating the input space (picture) through lines (similarly to figure 4 a/b); that operation is often described by saying that the NN first spots edges. We can think of that step as the same ability that a simple logistic regression model would present. Then, as the computation proceeds across the network, the derived features obtained out of the hidden layers increase their non-linear character. Our neural network is combining those initially linear features into subsequent non-linear, derived features, which grow more non-linear as they are passed across layers. It is interesting to consider that the model is not really able to draw non-linear lines: it is composing them with linear pieces, one layer at a time – for the practitioner, not literally as, say, a decision tree would do. To triangulate the concepts we could say that, if the results of what figure 7 calls Mid-Level Features and High-Level Features were shown in the new space built by the model rather than in the 2D dimensions of the original image, we would still see lines, as per figure 6/b.
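The idea of "composing non-linear shapes out of linear pieces" can be seen in one dimension with a toy function built purely from ReLUs. The weights below are hand-picked for illustration; a trained network would find its own.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# A "bump" built only from straight pieces: three shifted ReLUs combined
# linearly. Each extra unit adds another kink; no true curves ever appear.
def hat(x):
    return relu(x) - 2.0 * relu(x - 1.0) + relu(x - 2.0)

xs = np.array([-1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0])
print(hat(xs))  # rises from 0 to 1 and back: [0. 0. 0.5 1. 0.5 0. 0.]
```

Stacking more units (or layers) refines this piecewise-linear approximation, which is the one-dimensional analogue of the increasingly "curvy" features across the layers of figure 7.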
Why is this discussion important, and what can we conclude from it? Basic answer: understanding these ideas is important to make more educated starting guesses on whether, for example, a simple application could be solved with easier-to-handle linear or logistic regression. Moreover, it grants us the ability to better evaluate intermediate solutions like decision trees, which are models halfway between linear and non-linear ones – a lot depends also on the specific implementation. However, there is much more to it, and we can briefly hint at that through the project focused on Natural Language Processing (NLP) which inspired this post.
Natural language processing is a complex and fascinating field, and a lot of its recent developments can be associated with what are called "embeddings" – not the only alternative. Text is usually preprocessed before being fed to a neural network, and that preprocessing often involves embeddings. Through embeddings, the text is somehow contextualized: in text, words do not usually occur randomly; they are often grouped. We can think that a text containing the word "king" likely also contains the words "queen" and "kingdom", while those same two words are unlikely to be found in a text presenting the term "Apple Inc" – which could be defined a "2-gram" rather than a "word", and which represents a real complexity in text processing, since "Apple" is completely different from "apple" and common lower-casing pre-processing would miss that. That context and co-occurrence probability can be mathematically represented through a vector within an n-dimensional space, where n is chosen by the designer. Those n dimensions, though not exactly probabilities, can be thought of as chances of co-occurrence among words; a higher n allows for a more complex contextualization leveraging also minor co-occurrences, compared to a lower n where only major co-occurrences would play a role. Say we want to represent all the words of a given text in a 2-dimensional space (i.e. n = 2); each word would have an associated 2D vector: say the word "king" is the vector 234i + 547t, where i and t are unit-vectors representing the two main directions. Once trained, such an embedding would represent our text through the 2D representation shown in figure 8.
Figure 8: 2D embedding example from the book Speech and Language Processing by Daniel Jurafsky & James H. Martin
It is important to note from figure 8 that even though we are leveraging a 2D representation, it does not mean that all the words in our text will be divided into 2 groups, as the same figure shows by presenting three groups. Those with some familiarity with vectors may have already perceived the possibility of computing operations like "king" – "man" + "woman" = "queen"; we can think about that calculation in the following way:
Similar words with high co-occurrence tend to be grouped together. In typical embeddings, where n is in the order of 100, words like "king", "queen", "man", and "woman" are likely to be found close to each other along some dimensions of the total n, while being distant along the remaining ones. All that complexity is a description of the context they are usually found in. Therefore, subtracting "man" from "king" and adding "woman" would keep the dimensions of "king" representing something that, in the text, might be the subject of phrases containing "… ruled upon …", but it would subtract the dimensions of phrases containing "… he …" – note, in real applications pronouns are not that informative and are often deleted during preprocessing. Once we add "woman", what is left are words still associated with "… ruled upon …" but now also associated with "… she …". The resulting subject of those phrases is likely to be "queen".
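The arithmetic above can be sketched with toy vectors. The 4-dimensional embeddings below are invented purely for illustration (real ones are learned and have on the order of 100-300 dimensions); the dimensions loosely read as "royalty", "rule", "he", "she".

```python
import numpy as np

# Invented toy embeddings -- NOT learned from any corpus.
vec = {
    "king":  np.array([0.9, 0.8, 0.7, 0.1]),
    "man":   np.array([0.1, 0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.1, 0.8]),
    "queen": np.array([0.9, 0.8, 0.1, 0.7]),
    "apple": np.array([0.0, 0.1, 0.2, 0.2]),
}

def cosine(a, b):
    # standard similarity measure between embedding vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vec["king"] - vec["man"] + vec["woman"]
best = max((w for w in vec if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vec[w]))
print(best)  # queen
```

Subtracting "man" zeroes out the "he" dimension, adding "woman" fills the "she" dimension, and the nearest remaining vector is "queen" – the same reasoning as in the paragraph above, just made numeric.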
Embedding is just the starting point. Usually, we would then try to classify a given text through sentiment analysis, to tell whether a specific email is spam or not, or we might try to edit some text, or classify another text as poem or news article. To do that, we would need a subsequent model, which could be logistic regression, or a neural network, or something else (decision trees, etc.). Depending on the model we choose, we would [or would not] have the space-alteration we briefly discussed above; therefore, our embedded space (e.g. figure 8) would [or would not] go through a transformation like the ones we saw in figures 6 and 7. While trying models out is not necessarily wrong – at least in personal experience, trial and error is a big part of practicing with such models, given their complexity – the type of trials we structure can be effectively narrowed down if we apply correct intuition about the underlying mathematics and statistics. Sticking to the NLP example we just saw, thinking about spaces (meaning their mathematics and statistics) could allow us, for example, to understand whether we are training our model with the right text given our target application space (i.e. the probability distribution of the text we will apply our predictive model on). It could also allow us to understand what a neural network would do to our input or embedded space compared to a simple logistic regression model.
Thanks to coding libraries (mathematical models for specific purposes, already packaged and deployable with just a few lines of code), it is becoming increasingly easy to play with complex models. However, while randomly trying things is not necessarily wrong, if we want to get something out of it, we should strive to understand the major mathematical dynamics.
Now, for the interested reader, a final technical parenthesis on convolution and image recognition.
Please, feel free to connect with suggestion, comments, questions ... riccardo[at]m-odi.com
Convolutional NNs are often encountered in image recognition. While they may seem higher-level NNs, they have in theory lower capacity than common dense (i.e. fully connected) NNs. For the interested reader, the reason they are deployed so widely is that their limitation allows for much lighter computation in cases where common features are likely to characterize the input signal (as in image recognition), as explained below. Briefly, convolutional NNs make use of a common operation in engineering: convolution. Convolution is often used, for example, in signal filtering – say a mixing operator wants to eliminate components of specific frequencies from an input sound; s/he would basically multiply (more precisely "convolve", which is slightly different) the function representing a filter with the input signal and obtain a filtered output. The key feature is that the same filter is applied to the entire signal. While in signal processing the filter is designed by us and applied to the signal to obtain an intended output, in image recognition the filter (which we can think of as specific to each layer of the network) is unknown, and it is what the network must come up with during training by optimizing its parameters (i.e. weights). For example, if a picture can be roughly approximated by first identifying its vertical edges, the model would likely optimize the filter of its first layer to identify all vertical edges. Therefore, convolutional NNs are more limited, with that limitation consisting in the fact that they must find common features to be "shared" across each individual layer (say "step of the convolution").
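The vertical-edge example can be sketched directly. The 1x2 kernel below is hand-set for illustration; in a convolutional NN these weights would be learned, but the sliding of the same filter over every position is exactly the weight-sharing discussed above.

```python
import numpy as np

# A tiny grayscale "image": dark left half, bright right half -> one vertical edge.
img = np.zeros((5, 6))
img[:, 3:] = 1.0

# Hand-set 1x2 filter responding to left-to-right intensity jumps.
kernel = np.array([-1.0, 1.0])

# Slide the SAME filter over every position (valid cross-correlation).
out = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        out[i, j] = np.sum(img[i, j:j + 2] * kernel)
print(out[0])  # nonzero only at the edge column: [0. 0. 1. 0. 0.]
```

One small set of weights (here just two numbers) scans the whole image; that is why the computation is so much lighter than a dense layer, which would need a separate weight for every pixel-to-neuron connection.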
The mathematical characterization of convolutional NNs, once understood – including how it alters the underlying space – could then be applied to domains other than image recognition which cannot be tackled through common dense NNs because of their high [and sparse] feature space.