In Chapter 1, we explained that machine learning and deep learning are technologies that learn on their own by looking at data. This raises a curious question: how can a computer be said to 'learn'? It doesn't store memories in a brain as a person does, and no teacher walks it through each lesson. So why do we say that computers keep getting smarter?

This chapter sets out to answer exactly that question. The goal is to understand how deep learning models become a little smarter through learning. The one concept we absolutely need for this is 'optimization.' Simply put, optimization is the process of searching for the best answer to a problem.

Let's look at an example. Imagine you took a math exam. The first time through, you might get many problems wrong. You would then gather those problems into an error notebook, right? You go through the notebook, check one by one where you made mistakes, and study again. As a result, you get fewer problems wrong on the next exam and earn a better score. Deep learning models are similar. At first they often give wrong answers. But every time they are wrong, they work out why and keep correcting themselves so they don't repeat the mistake. That's how they learn.

In this way, deep learning models gradually find more accurate answers through optimization, and eventually they can make very accurate predictions. Now let's explore, step by step, how deep learning models carry out optimization and what methods and principles they use.

How Does a Prediction Know It Was Wrong? What Is a Loss Function

For a computer to fix what it got wrong, it first needs to know how wrong it was. After we solve math problems, we always grade them, right? Grading lets us compare our answers with the correct ones and see where we went wrong. Deep learning models need a similar way to check how accurate their predictions are. The key concept here is the 'Loss Function.'

Simply put, a loss function is a tool that expresses the difference between the model's prediction and the actual correct answer as a single number. The larger this number, the more wrong the model was; the smaller it is, the more accurate the prediction. The model's goal is therefore to make this loss value as small as possible.

There are various loss functions, but for classification tasks the most commonly used is 'Cross-Entropy.' The term may sound difficult, but its meaning is simple. The higher the probability the model assigns to the correct answer, the smaller the loss; conversely, the more confidently it gives a wrong answer, the larger the loss.

Let's look at an easier example. Imagine a model whose task is to tell dogs from cats. Suppose the model looks at a photo and predicts "90% probability this photo is a dog, 10% probability it's a cat," but the photo is actually a cat. The model's prediction is confidently wrong, so the loss value is large. On the other hand, if the model had predicted "49% probability dog, 51% probability cat" for the same photo, it would be unsure but closer to the correct answer, so the loss value would be relatively small.
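The two predictions in this example can be turned into actual loss values. The sketch below assumes the usual form of cross-entropy for a single example, which is minus the natural logarithm of the probability given to the correct class. The function name `cross_entropy` and the dictionary layout are made up for this illustration.

```python
import math

def cross_entropy(predicted_probs, true_label):
    """Loss for one example: minus the log of the probability
    the model gave to the correct class."""
    return -math.log(predicted_probs[true_label])

# In both cases the photo is actually a cat.
confident_wrong = {"dog": 0.90, "cat": 0.10}
unsure_but_closer = {"dog": 0.49, "cat": 0.51}

loss_confident_wrong = cross_entropy(confident_wrong, "cat")
loss_unsure = cross_entropy(unsure_but_closer, "cat")

print(f"90% dog, 10% cat -> loss {loss_confident_wrong:.3f}")  # about 2.303
print(f"49% dog, 51% cat -> loss {loss_unsure:.3f}")           # about 0.673
```

The confident wrong answer is penalized more than three times as heavily as the unsure one, which matches the intuition described above.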

In this way, the loss value tells a computer exactly how wrong it was. But once it knows it was wrong, it also needs to know where and how to fix things, right? The method that solves this problem is 'Gradient Descent,' explained next.

Going Down a Little at a Time Until Eventually Reaching the Lowest Point

'Gradient Descent' may sound difficult at first, but a simple example makes it easy to understand. Imagine walking down a mountain in fog so thick that the path is barely visible. Even if we can't see the whole mountain, we can tell which direction slopes downward from where we stand. So at each moment we find the downhill direction and take a small step that way. If we keep going down one step at a time like this, we eventually arrive at the lowest point nearby, right? Gradient descent is precisely this method of finding the lowest point.

Computers solve problems in a similar way. Here the mountain is shaped by the 'Loss Function': the height at each spot is the loss value. The lowest point on this mountain is the place where the model's predictions are closest to the actual correct answers. Then what is the 'correct answer' here? It is the 'label' assigned in advance by humans.

Labels are data that humans have directly marked as correct answers. For example, when teaching a computer to distinguish dog and cat photos, humans pre-mark each photo as 'this is a dog' or 'this is a cat.' This marking is the 'label.' Computers compare their predictions with these labels to calculate how wrong they were, and based on this gradually make their prediction methods more accurate.

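In this process, the 'Gradient' plays an important role. The gradient is like a hint that tells us, from the current position, which direction to move in to lower the loss value. Strictly speaking, the gradient points in the direction in which the loss rises fastest, so the model moves the opposite way. Following this hint, the computer gradually changes the many values (parameters) inside the model to find more accurate answers.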

One such change of values is called 'one step.' But it's difficult to get good results with just a few steps. So computers repeat steps thousands or tens of thousands of times, getting a little closer to the correct answer each time. Just as a person's skills improve with daily practice, computers learn and improve by repeating this process.
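The idea of repeated small steps can be sketched with a single parameter. The loss below, `(w - 3)^2`, is invented for this illustration so that the lowest point is known to be at `w = 3`. The step size `learning_rate` is also an assumed value chosen for the example.

```python
def loss(w):
    # A made-up loss whose lowest point is at w = 3.
    return (w - 3.0) ** 2

def gradient(w):
    # Slope of the loss at w (the derivative of (w - 3)^2).
    return 2.0 * (w - 3.0)

def gradient_descent(w_start, learning_rate=0.1, steps=100):
    w = w_start
    for _ in range(steps):
        # One step: move a little in the direction opposite to the slope.
        w = w - learning_rate * gradient(w)
    return w

w_final = gradient_descent(w_start=10.0)
print(w_final, loss(w_final))
```

Starting far away at `w = 10`, each step shrinks the distance to 3 by the same fraction, so after many steps the parameter ends up very close to the lowest point.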

Tracing Complex Calculations Backwards: Backpropagation

Deep learning models have an enormous number of parameters, ranging from millions to hundreds of billions in the largest models. Parameters are the small 'standards' a model uses when it makes decisions about input data. Just as a person uses various cues such as height, face shape, and voice to recognize friends, a model uses its parameters to judge data. But with so many parameters, checking one by one which of them work well and which of them caused a wrong prediction is a very difficult task.

Fortunately, there is a good method for solving this problem: 'Backpropagation.' Simply put, backpropagation starts from the result and traces the calculation in reverse. For example, when we get a math problem wrong at school, we sometimes start from the final answer and work back through each line of the calculation to find where it went wrong. Similarly, when a dish tastes strange, we think back from the finished dish through each cooking step to find the mistake. In the same way, backpropagation starts from the loss on the model's prediction and follows the calculations backward, working out how much each parameter contributed to the error.

Mathematically, this is computed with a calculus concept called the 'Chain Rule,' but you don't need to memorize or fully understand the formulas. What matters is that backpropagation lets the model find out exactly which parts need to be adjusted and by how much.
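For readers who want to see the chain rule at work, here is a minimal sketch on a toy network with just two parameters. The network (two multiplications followed by a squared error), the function names, and the numbers are all assumptions made for this illustration. The backward pass is checked against a simple "nudge the parameter and watch the loss" estimate.

```python
def forward(w1, w2, x, target):
    h = w1 * x                 # first layer
    y = w2 * h                 # second layer (the prediction)
    loss = (y - target) ** 2   # squared error
    return h, y, loss

def backward(w1, w2, x, target):
    h, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2.0 * (y - target)  # start at the loss
    dloss_dw2 = dloss_dy * h       # step back through layer 2
    dloss_dh = dloss_dy * w2
    dloss_dw1 = dloss_dh * x       # step back through layer 1
    return dloss_dw1, dloss_dw2

def numerical_gradient(w1, w2, x, target, eps=1e-6):
    # Nudge each parameter slightly in both directions and compare losses.
    d1 = (forward(w1 + eps, w2, x, target)[2]
          - forward(w1 - eps, w2, x, target)[2]) / (2 * eps)
    d2 = (forward(w1, w2 + eps, x, target)[2]
          - forward(w1, w2 - eps, x, target)[2]) / (2 * eps)
    return d1, d2

grads = backward(0.5, -1.5, 2.0, 1.0)
check = numerical_gradient(0.5, -1.5, 2.0, 1.0)
print(grads)   # (15.0, -5.0)
print(check)   # nearly the same values
```

Each line of `backward` multiplies the gradient that arrived from the later layer by one local slope. That repeated multiplication is the chain rule, and it is why one backward pass can tell every parameter how much it contributed to the error.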

In this way, deep learning models learn by repeating four stages. First, the model looks at data and makes a prediction. Second, it calculates the loss by comparing that prediction with the correct answer (label) assigned in advance by humans. Third, it uses backpropagation to find out how much each parameter contributed to the error. Fourth, it modifies the parameters based on that information. Going through these four stages once is the 'one step' described earlier. When the model has worked through all of the given data once, step by step, that is called one 'epoch.' In actual training, the model is usually made more accurate by repeating dozens to hundreds of epochs.
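The four stages, and the difference between a step and an epoch, can be sketched with a model that has a single parameter `w` and predicts `w * x`. The data, the learning rate, and the function name `train` are assumptions made for this illustration. The labels follow `y = 2x`, so a well-trained `w` should end up near 2.

```python
# (input, label) pairs; the labels follow y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def train(data, epochs=20, learning_rate=0.05):
    w = 0.0      # the model's single parameter
    steps = 0
    for _ in range(epochs):        # one epoch = one pass over all the data
        for x, label in data:      # here, one step per example
            prediction = w * x                     # 1. predict
            loss = (prediction - label) ** 2       # 2. measure the loss
            grad = 2.0 * (prediction - label) * x  # 3. backpropagate (chain rule)
            w -= learning_rate * grad              # 4. update the parameter
            steps += 1
    return w, steps

w_trained, total_steps = train(data)
print(w_trained, total_steps)
```

With 3 examples and 20 epochs the loop performs 60 steps in total, which shows why the number of steps is usually much larger than the number of epochs.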

Various Methods of Gradient Descent

Basic gradient descent uses all of the data at once to calculate the loss and modify the parameters. It's like grading every student's exam paper before changing the teaching method even once. But if there are too many students, grading them all takes a long time. In the same way, processing all of the data at once costs a computer a lot of time and resources.

So in actual deep learning, the 'mini-batch' method is widely used: each step uses only a small part of the data rather than all of it. A mini-batch is like grading only a few students' papers first and then adjusting the teaching method a little based on those results. Problems are spotted and corrected quickly, which makes learning more efficient.
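A mini-batch version of gradient descent can be sketched as follows. The one-parameter model `w * x`, the data set whose labels follow `y = 3x`, the batch size, and the function name are all assumptions made for this example. Shuffling the data at the start of every epoch is a common practice rather than something this chapter requires.

```python
import random

def minibatch_gradient_descent(data, batch_size=4, epochs=30,
                               learning_rate=0.1, seed=0):
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)       # visit the examples in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the gradient over this small batch only.
            grad = sum(2.0 * (w * x - label) * x for x, label in batch) / len(batch)
            w -= learning_rate * grad
    return w

# 20 examples whose labels follow y = 3x.
data = [(i / 10, 3.0 * i / 10) for i in range(1, 21)]
w_fit = minibatch_gradient_descent(data)
print(w_fit)
```

Each update looks at only 4 of the 20 examples, so the parameter is adjusted five times per epoch instead of once.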

Beyond this, various methods have been developed to make gradient descent more effective. One of them is 'momentum.' Momentum remembers the direction and speed of the previous movement and uses them to reach the goal faster.

Let's understand this with a simple example. Picture a cart rolling down a hill. At first it moves slowly, but it gradually gains speed and rolls down quickly. Momentum works the same way: it keeps pushing the model in the direction it was already moving, so the model arrives at its destination sooner.
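The cart picture can be sketched by comparing plain gradient descent with a momentum version on the same made-up loss, `(w - 3)^2`. The update used here, where a `velocity` accumulates past gradients with a decay factor `beta`, is one common way to write momentum. The specific values of `beta`, the learning rate, and the step count are assumptions for this illustration.

```python
def gradient(w):
    return 2.0 * (w - 3.0)   # slope of the made-up loss (w - 3)^2

def plain_descent(w, learning_rate=0.01, steps=200):
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

def momentum_descent(w, learning_rate=0.01, beta=0.9, steps=200):
    velocity = 0.0
    for _ in range(steps):
        # Keep most of the previous movement and add the new slope to it.
        velocity = beta * velocity + gradient(w)
        w -= learning_rate * velocity
    return w

w_plain = plain_descent(10.0)
w_momentum = momentum_descent(10.0)
print(abs(w_plain - 3.0), abs(w_momentum - 3.0))
```

With the same small learning rate and the same number of steps, the momentum version ends up much closer to the lowest point at 3, because the accumulated velocity lets it cover ground faster.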

Another widely used method is 'Adam.' Adam automatically adjusts the 'learning rate' while a model is learning. The learning rate is a value that determines how much the model is modified in one step: if it is too large, learning becomes unstable, and if it is too small, learning is very slow. Adam keeps track of recent gradients and uses them to scale the step for each parameter individually, so it is convenient to use and works well in many cases. That's why many deep learning models currently use this method.
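For the curious, here is a minimal sketch of the commonly published Adam update rule, applied to a single parameter on the made-up loss `(w - 3)^2`. The hyperparameter values (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the defaults usually quoted for Adam, while the learning rate and step count are assumptions for this example. Real libraries apply the same idea to every parameter at once.

```python
import math

def gradient(w):
    return 2.0 * (w - 3.0)   # slope of the made-up loss (w - 3)^2

def adam(w, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = 0.0   # running average of gradients (direction, like momentum)
    v = 0.0   # running average of squared gradients (typical size)
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # correct the early bias toward zero
        v_hat = v / (1.0 - beta2 ** t)
        # Dividing by the typical gradient size rescales the step automatically.
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_adam = adam(w=10.0)
print(w_adam)
```

The division by `sqrt(v_hat)` is what makes the effective step size adapt: parameters that have seen large gradients take proportionally smaller steps, and parameters with small gradients take larger ones.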

All these various methods commonly help computers learn faster and more accurately. Thanks to this, even complex problems can be solved efficiently.

Problems That Arise in Deep Neural Networks

It seems that stacking a neural network deeper would let it solve more complex problems. To some degree that's true. The more layers there are, the more varied and detailed the features the network can capture when it analyzes data. For example, it can distinguish faces or objects in complex photos more accurately, or understand the meaning of longer sentences more accurately.

But making a neural network too deep causes unexpected new problems. The most representative is the 'Vanishing Gradient Problem.' As explained earlier, the gradient is an important hint that tells the computer which direction to modify its parameters in, and by how much.

As the number of layers grows, gradient values are multiplied together again and again during backpropagation. If the factors are smaller than 1, the result shrinks until it is nearly zero. It's like repeatedly multiplying by a small number such as 0.1 on a calculator: the value gets smaller and smaller and is soon close to zero. When this happens, almost no gradient reaches the front layers, and they stop learning properly.

The opposite problem, where gradients become too large, can also occur. This is called the 'Exploding Gradient' problem. It's like repeatedly multiplying by a large number such as 10 on a calculator: the value grows rapidly. In this case, parameter values suddenly jump by huge amounts, which makes normal learning difficult.
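The calculator analogy can be tried directly. The sketch below assumes, purely for illustration, that every layer multiplies the gradient by the same factor. Real networks have a different factor at every layer, but the shrinking and growing behave in the same way.

```python
def gradient_after_layers(per_layer_factor, num_layers):
    """Multiply a starting gradient of 1.0 by the same factor once per layer."""
    gradient = 1.0
    for _ in range(num_layers):
        gradient *= per_layer_factor
    return gradient

for depth in (5, 20, 50):
    vanishing = gradient_after_layers(0.1, depth)
    exploding = gradient_after_layers(10.0, depth)
    print(f"{depth:2d} layers: x0.1 -> {vanishing:.1e}   x10 -> {exploding:.1e}")
```

After 50 layers the first value is around 1e-50 and the second around 1e+50, so neither is a useful hint for adjusting the front layers.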

Besides these two problems, another difficulty arises from how computers store numbers. Computers store numbers as 'floating-point numbers,' which can represent values only to a limited precision. For example, if we enter a very small number like 0.0000000001 into a calculator, the calculator might display it as 0, because it has only a limited number of decimal places. Similarly, a computer cannot store extremely small numbers exactly and may round them to zero. When numbers are not stored properly like this, it becomes difficult to obtain accurate gradient values during learning, and the model can no longer learn exactly as intended. Therefore, when making neural networks deep, it is important to use special techniques or strategies to overcome these problems.
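The limits of floating-point numbers are easy to observe. The sketch below uses Python's built-in `float`, which is a 64-bit floating-point number. Deep learning often uses 32-bit or 16-bit numbers, which run into the same limits much sooner. The variable names are made up for this example.

```python
import sys

tiny = 1e-200
product = tiny * tiny             # the true value, 1e-400, is too small to store
underflowed = (product == 0.0)
print(product, underflowed)       # 0.0 True

# A very small number added to a much larger one can vanish completely.
absorbed = (1.0 + 1e-17 == 1.0)
print(absorbed)                   # True

# Smallest positive normal 64-bit float, for reference.
print(sys.float_info.min)
```

The first effect is the one that matters for vanishing gradients: once a gradient has been rounded to exactly zero, no amount of further multiplication can bring it back.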

Stacking Deeply Without Information Loss: Skip Connections

A representative method created to solve these problems is the 'skip connection.' Simply put, a skip connection is a shortcut placed partway along the path through many layers. It's like entering a classroom through its front door instead of walking all the way to the end of the corridor to use the back door. The shortcut prevents the information loss that could occur over a long route and gets you to the destination faster.

In a neural network, information gradually fades as it passes through more and more layers. In particular, the gradient needed for learning is not properly transmitted back to the front layers. When skip connections exist, gradients can travel directly to the front layers instead of taking the long way around.

For example, suppose data flows forward in the order Layer A → Layer B → Layer C, so that gradients must flow backward from C through B to A. A skip connection adds a path that connects Layer A directly to Layer C. Gradients can then travel straight back from C to A without being weakened along the way.

A representative model that uses skip connections is 'ResNet (Residual Network).' This model was designed to train without problems even at depths of 50, 100, or over 150 layers. The core reason for ResNet's success is that it places a skip connection around every small group of layers, which makes learning stable. A group of layers doesn't have to learn something new every time: when needed, the input can simply be passed on unchanged, which makes learning very stable and efficient.
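Why adding the input back helps the gradient can be sketched with a deliberately weak toy layer. Everything here is an assumption made for illustration: the layer simply multiplies by 0.01, and the slope of the whole stack is estimated by nudging the input. This is not ResNet itself, only the core idea of computing `x + layer(x)` instead of `layer(x)`.

```python
def layer(x, weight=0.01):
    return weight * x        # a toy layer with a very small slope

def plain_stack(x, depth):
    for _ in range(depth):
        x = layer(x)
    return x

def residual_stack(x, depth):
    for _ in range(depth):
        x = x + layer(x)     # skip connection: add the input back
    return x

def slope(stack, x, depth, eps=1e-6):
    # Estimate how much the output changes when the input changes slightly.
    return (stack(x + eps, depth) - stack(x - eps, depth)) / (2 * eps)

plain_slope = slope(plain_stack, 1.0, 20)
residual_slope = slope(residual_stack, 1.0, 20)
print(plain_slope)      # around 1e-40: the gradient has vanished
print(residual_slope)   # around 1.22: the gradient survives
```

Without the shortcut, the slope through 20 layers is the product of twenty factors of 0.01. With the shortcut, each factor becomes 1 + 0.01, so the gradient keeps a healthy size all the way back to the first layer.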

The Importance of 'Shortcuts' Needed When Understanding Language

The idea of skip connections is widely used not only in the computer vision field analyzing photos but also in the natural language processing (NLP) field dealing with language. A representative example is a model called the 'Transformer.' Transformers demonstrate outstanding performance in understanding the meaning of sentences or predicting the next word.

Skip connections also play a very important role in transformers. A transformer consists of multiple layers that process information, and after each layer computes new information, the layer's original input is added back to it. Simply put, new information and original information are combined, which helps the meaning of a sentence be preserved accurately.

In particular, sentences vary in length, and their meaning can change depending on context, which makes them very complex. Thanks to skip connections, transformers can pass information through deep layers while keeping the sentence context intact and preventing important information from being distorted or lost.

Skip connections mean far more than simply adding one connection that jumps over layers. They help information travel without being lost, and they let learning proceed stably and efficiently. Many of the deep learning models that appeared after ResNet use the idea of skip connections in some form, and future models will very likely continue to use it.

How Does a Deep Learning Model Actually Learn?

The actual learning process of a deep learning model consists of several important stages. First, the structure of the model must be decided. Just as a blueprint is drawn before a building is constructed, the designer decides in advance which kinds of layers to stack, how many of them, and what calculation each layer performs.

Next, a loss function is chosen to express numerically how wrong the model's predictions are. Simply put, it acts like a scoring sheet that turns the wrong answers on an exam into a score. Once the loss function is chosen, the model's initial standards (parameters) are set, usually to random values. These standards can be thought of as the small rules the model uses when judging data.

Now actual data is fed into the model. For example, the model is shown a dog photo together with the correct answer (label), 'this is a dog.' The model makes a prediction from the input and uses the loss function to check how far that prediction is from the correct answer. If the model wrongly predicts 'cat,' the loss value is large.

The next stage is backpropagation. Backpropagation traces back from the wrong prediction to find out which of the model's standards contributed to it. It's similar to re-solving the exam questions we got wrong and checking one by one where the mistakes were made.

Once backpropagation has found the standards responsible, they must be modified. The method used for this is gradient descent. It is similar to practicing a problem we got wrong so that we don't make the same mistake again. In this way, the model adjusts its standards to make slightly more accurate predictions.

This process doesn't happen just once. The model works through all of the data, and then repeats that full pass dozens to hundreds of times. It's the same principle by which a person's skills improve through persistent practice or study. The model, too, keeps repeating and learning so that its predictions become more accurate and precise.

However, learning doesn't always go smoothly. Sometimes a model becomes too familiar with the data used for practice and makes wrong predictions when it meets new data. This is called 'Overfitting.' It's like memorizing specific problems perfectly at school and then failing when slightly different problems appear.

To prevent overfitting, we can intentionally transform the data slightly (for example, by adding noise) or use 'Dropout,' which randomly switches off some of the model's units during learning. Just as repeating only one movement during exercise overworks a single muscle, a model that relies on only a few standards becomes fragile. Dropout helps the model avoid depending too much on specific standards and use many of them evenly. The resulting model also works accurately when new data comes in.
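Dropout itself takes only a few lines. The sketch below shows a common variant, often called inverted dropout, in which the values that survive are scaled up so that their expected total stays the same. The function name, the list of activations, and the drop probability of 0.5 are assumptions for this illustration.

```python
import random

def dropout(values, drop_probability, rng):
    """Zero each value with the given probability and scale the survivors
    so that the expected total is unchanged (inverted dropout)."""
    keep_probability = 1.0 - drop_probability
    return [v / keep_probability if rng.random() < keep_probability else 0.0
            for v in values]

rng = random.Random(0)        # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, drop_probability=0.5, rng=rng)
print(dropped)                # a mix of 0.0 and 2.0
```

Dropout is applied only while the model is learning. When the finished model makes real predictions, every unit is switched on.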

What Does the World That Artificial Intelligence Learns Look Like

Finally, let's look at an interesting analogy that helps us understand how deep learning models learn. A model that gradually reduces its loss value is like an explorer in a complex maze. This maze is called the 'Loss Landscape.'

The loss landscape is no simple maze; it is far more complex than we would normally imagine. It contains many kinds of terrain, such as high hills, deep valleys, and wide flat plateaus. There are even in-between points called 'saddle points,' where the ground rises in one direction and falls in another, which makes the landscape even more complex.
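A saddle point is easiest to see in a tiny example. The two-parameter loss below, `x^2 - y^2`, is a textbook saddle chosen for this illustration and is not taken from any real model. At the point (0, 0) the slope is zero in every direction, yet the point is neither a hilltop nor a valley floor.

```python
def loss(x, y):
    return x ** 2 - y ** 2    # rises along x, falls along y

def gradient(x, y):
    return (2.0 * x, -2.0 * y)

flat_at_origin = gradient(0.0, 0.0) == (0.0, 0.0)
rises_along_x = loss(0.1, 0.0) > loss(0.0, 0.0)
falls_along_y = loss(0.0, 0.1) < loss(0.0, 0.0)
print(flat_at_origin, rises_along_x, falls_along_y)   # True True True
```

Because the gradient is zero there, plain gradient descent gets no hint about which way to move if it lands exactly on such a point.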

The model explores this complex maze in search of the lowest point, the place where the loss value is smallest. The smaller the loss value, the more accurate the model's predictions. Fortunately, it isn't necessary to find the one perfect lowest point. Research suggests that the loss landscape contains many good points with adequately low loss, so a sufficiently good model can be made without fixating on a single place.

This is like exams: even without a perfect score, consistently high scores are enough to achieve good results. In fact, trying too hard for a perfect score leads to studying tuned only to that one exam, and real ability doesn't improve. Deep learning models must be careful in the same way: chasing an extremely low loss value too hard can cause 'overfitting,' where the model matches its training data well but gives wrong results for new data.

Ultimately models explore this complex terrain finding the best path, gradually learning more accurately and precisely.

Summarizing Deep Learning's Learning Process

In Chapter 2, we looked in detail at how deep learning actually learns from data. Deep learning models use loss functions to confirm numerically how wrong their predictions were. And to fix these wrong points, they use gradient descent and backpropagation to gradually adjust standards (parameters). Through repeating this process, models gradually become capable of more accurate predictions.

We also learned about the problems that can arise as neural networks get deeper: the vanishing gradient problem, where gradients become too small for learning to proceed, and the exploding gradient problem, where gradients instead become too large. Finally, we learned about skip connections, a structure that creates shortcuts across layers to solve these problems.

In the next chapter, we will look at the building blocks that make up the big building called deep learning. We'll go through them slowly and in order: linear regression, the simplest building block; activation functions, which help draw complex patterns; methods of stacking multiple layers to build taller buildings; and the neural network structures that have evolved to handle various materials such as images and language.