Mathematical Background of Artificial Intelligence: Information Theory and Gradient Descent

So far we have examined three major ways in which artificial intelligence learns: supervised learning, which learns from examples labeled with correct answers in advance; unsupervised learning, which finds patterns in data without correct answers; and reinforcement learning, which finds the best strategy through trial and error.

For all three methods to work in practice, one process is essential. The AI model first predicts an answer, then measures how far that prediction is from the actual correct answer, and finally adjusts its internal calculations slightly so that the next prediction is more accurate. This process rests on three core concepts examined earlier: the loss function, gradient descent, and backpropagation.

Let's review once more how these concepts are connected.

First, the loss function expresses the difference between the AI's prediction and the actual correct answer as a single number. Put simply, it is a penalty score for how far off the prediction is: the higher the score, the further the prediction is from the correct answer.

Next, gradient descent indicates in which direction to adjust the numbers inside the AI (its weights) in order to reduce this loss score. Like someone descending a hill as quickly as possible by always taking the steepest path downward, the AI moves step by step in the direction that lowers the loss most.
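The loss function and gradient descent can be sketched together in a few lines. This is a minimal illustration, not a real training setup: the one-weight model, the three data points, and the learning rate of 0.05 are all assumptions chosen to keep the arithmetic easy to follow.

```python
# Toy model: prediction = w * x, with a single adjustable number w.
# The data points are made up so that the ideal value is w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, correct answer)

def loss(w):
    """Mean squared error: the penalty score for how far predictions are off."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def gradient(w):
    """Slope of the loss with respect to w (which way is uphill)."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

w = 0.0                # arbitrary starting point
learning_rate = 0.05   # step size (assumed value)
initial_loss = loss(w)

for step in range(100):
    w -= learning_rate * gradient(w)  # take one small step downhill

print("learned w:", round(w, 4))
print("loss before:", round(initial_loss, 4), "loss after:", round(loss(w), 6))
```

Each pass through the loop moves w a little in the direction that lowers the loss, which is the "steepest path down the hill" idea applied to a single number.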

Finally, backpropagation traces backward from a wrong prediction to work out how much each connection or calculation inside the AI contributed to the error. It is similar to reviewing a conceded goal in a soccer game by going backward in sequence from the goalkeeper through the defenders and midfielders to find where the problem started.
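To make the "tracing backward" idea concrete, here is a small sketch of backpropagation on a two-step calculation. The tiny network (two weights, no activation function) and the specific numbers are assumptions made for illustration; real networks apply the same chain rule across many more connections. The result is checked against a numerical estimate of the slope.

```python
def forward(w1, w2, x):
    h = w1 * x        # first layer
    y_hat = w2 * h    # second layer: the prediction
    return h, y_hat

def loss(w1, w2, x, y):
    _, y_hat = forward(w1, w2, x)
    return (y_hat - y) ** 2

def backward(w1, w2, x, y):
    """Walk backward from the loss to each weight using the chain rule."""
    h, y_hat = forward(w1, w2, x)
    d_y_hat = 2 * (y_hat - y)  # how the loss changes with the prediction
    d_w2 = d_y_hat * h         # blame assigned to the second weight
    d_h = d_y_hat * w2         # error passed back to the first layer
    d_w1 = d_h * x             # blame assigned to the first weight
    return d_w1, d_w2

w1, w2, x, y = 0.5, -1.5, 2.0, 3.0
d_w1, d_w2 = backward(w1, w2, x, y)

# Sanity check: estimate the same slopes by nudging each weight slightly.
eps = 1e-6
num_d_w1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
num_d_w2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)

print("backprop:", d_w1, d_w2)
print("numerical:", round(num_d_w1, 4), round(num_d_w2, 4))
```

The gradients produced this way are exactly what gradient descent needs in order to decide how to adjust each weight.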

Here, let's look in a little more detail at cross-entropy, a loss function that is used very frequently. Cross-entropy originated in a field called 'information theory,' the academic discipline that studies how to convey information efficiently and accurately. When you arrange a meeting place and time with a friend by text message, a complex or ambiguous message is likely to be misunderstood, while a simple and clear one is not. Information theory studies how to convey information as effectively as possible in this sense.

This concept from information theory is also used in deep learning. For example, when an AI looks at a photo and predicts 'cat' but the photo actually shows a dog, cross-entropy expresses the size of that mistake as a number. The larger the cross-entropy value, the more wrong the prediction; the smaller the value, the more accurate the prediction. The AI therefore learns in the direction that minimizes cross-entropy.
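The cat-and-dog example can be worked through numerically. In the sketch below, the probabilities are made-up values, and cross-entropy is computed in its common single-label form: the negative logarithm of the probability the model assigned to the correct answer.

```python
import math

def cross_entropy(predicted_probs, true_label):
    """Negative log of the probability given to the correct label."""
    return -math.log(predicted_probs[true_label])

# The photo is actually a dog. Three hypothetical predictions:
confident_wrong = {"cat": 0.9, "dog": 0.1}
unsure = {"cat": 0.5, "dog": 0.5}
confident_right = {"cat": 0.1, "dog": 0.9}

for name, probs in [("confident but wrong", confident_wrong),
                    ("unsure", unsure),
                    ("confident and right", confident_right)]:
    print(name, round(cross_entropy(probs, "dog"), 4))
```

The confidently wrong prediction receives the largest penalty and the confidently right one the smallest, which is why minimizing cross-entropy pushes the model toward assigning high probability to correct answers.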

From Perceptron to Transformer: Evolution of Structure

Another important thread in AI development has been the changing structure of neural networks. A neural network is a structure loosely modeled on the way neurons (nerve cells) in the human brain connect and exchange information: many small computing units are connected to one another and together process complex information.

The earliest simple neural network, the Perceptron, appeared in the late 1950s. A perceptron has a very simple structure: it multiplies each input by a weight representing its importance, adds all these values together, and outputs '1' if the sum exceeds a certain threshold and '0' if it does not. It is similar to deciding whether to recommend a movie to a friend by giving 2 points for action elements, 3 points if a famous actor appears, and 4 points if the cinematography is excellent, and then recommending the movie if the total score exceeds a certain threshold.
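The movie recommendation example translates almost directly into code. The weights 2, 3, and 4 come from the example above; the threshold of 5 is an assumed value, since the text does not specify one.

```python
def perceptron(inputs, weights, threshold):
    """Weighted sum of the inputs, then a yes/no decision at the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

# Inputs are 1 if the movie has the feature and 0 otherwise:
# [action elements, famous actor, excellent cinematography]
weights = [2, 3, 4]
threshold = 5  # assumed value for illustration

action_with_star = [1, 1, 0]   # score 2 + 3 = 5, not above the threshold
star_and_visuals = [0, 1, 1]   # score 3 + 4 = 7, above the threshold

print(perceptron(action_with_star, weights, threshold))
print(perceptron(star_and_visuals, weights, threshold))
```

In a trained perceptron the weights and threshold are learned from examples rather than chosen by hand as they are here.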

However, the perceptron had a fatal limitation: a single perceptron cannot solve even slightly complex problems such as the XOR problem (a logic problem whose answer is true only when the two inputs differ). Just as in the movie recommendation example above, simply adding up scores cannot accurately reflect viewers' complex tastes. For this reason, interest in neural network research declined for a time.

Then in 1986, the backpropagation algorithm came to wide attention and interest returned. Backpropagation lets the AI trace its internal calculations backward from a wrong result and correct the errors. This made it possible to train even neural networks with multiple layers, called Multi-Layer Perceptrons (MLP). With more layers, more complex problems could be solved, and neural network research became active again.
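The XOR limitation and the benefit of an extra layer can both be checked with a short experiment. The sketch below first searches a coarse grid of weights and thresholds for a single perceptron that computes XOR, and then builds a two-layer network that does. The grid range and the hand-picked weights of the two-layer network are assumptions for illustration; in practice such weights would be learned with backpropagation rather than set by hand.

```python
from itertools import product

def step(z):
    return 1 if z > 0 else 0

def perceptron(x1, x2, w1, w2, threshold):
    return step(w1 * x1 + w2 * x2 - threshold)

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# 1) Try every single perceptron on a grid from -4.0 to 4.0 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]
single_solutions = [
    (w1, w2, t)
    for w1, w2, t in product(grid, repeat=3)
    if all(perceptron(a, b, w1, w2, t) == out for (a, b), out in XOR.items())
]
print("single perceptrons that solve XOR:", len(single_solutions))

# 2) Two layers: one hidden unit acts as OR, another as AND,
#    and the output fires when OR is true but AND is not.
def two_layer(x1, x2):
    h_or = perceptron(x1, x2, 1, 1, 0.5)
    h_and = perceptron(x1, x2, 1, 1, 1.5)
    return step(h_or - h_and - 0.5)

for (a, b), expected in XOR.items():
    print((a, b), "->", two_layer(a, b), "expected", expected)
```

The grid search finds no single perceptron that works, which is consistent with the known result that XOR is not linearly separable, while the two-layer network reproduces the XOR table.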

From around the 1990s, neural network structures became more diverse. Convolutional Neural Networks (CNN), which recognize images effectively; Recurrent Neural Networks (RNN), which handle sequential data such as sentences or music; and Long Short-Term Memory (LSTM) networks, which retain information over longer sentences or time spans, were developed.

For example, a CNN that has learned from many cat photos can recognize cats in new photos, and RNNs and LSTMs can read long sentences while retaining their meaning, which makes translation and conversation possible. These achievements pushed AI technology toward practical use.

In 2017, a new structure called the Transformer appeared and brought another major advance. Unlike RNNs or LSTMs, transformers use a method called Self-Attention, which looks at all the words in a sentence simultaneously and works out the relationship between each pair of words.

Self-attention is like scanning a whole page at a glance to find the important content, rather than reading from beginning to end, when looking for information in a book. Thanks to this method, transformers perform very well at understanding context and at processing long sentences or complex data. Well-known language models in wide use today, such as GPT, BERT, and T5, are all based on the transformer structure. As a result, transformers have become the new standard of recent AI technology.

The Deep Learning Leap of the 2010s: The Shock of ImageNet and AlexNet

It has not been long since AI became something we encounter every day. In particular, it was in 2012 that many people first became familiar with the term 'deep learning.' That year a very important event in AI history took place: a new AI model called AlexNet far surpassed existing methods at an image classification competition called ImageNet.

ImageNet was a very large annual image classification contest. Models were trained on over a million photos and had to classify the objects or animals in photos into 1,000 categories. For example, an AI model had to identify correctly whether a photo showed a cat, a dog, a car, a flower, and so on.

Until then, outstanding research teams from around the world had competed using traditional computer vision techniques. In these traditional methods, humans designed by hand the characteristics to look for in a photo, such as eyes, noses, and mouths, and the computer used those characteristics to classify photos. This worked to some degree, but it hit limits when more than a million varied photos had to be classified accurately.

Then in 2012 came AlexNet, created by Professor Geoffrey Hinton of the University of Toronto and his students Alex Krizhevsky and Ilya Sutskever. AlexNet dominated all other teams. How surprising were the results? Where existing methods had an error rate of about 25% in photo classification, AlexNet lowered it at a stroke to about 16% (measured as top-5 error, where a prediction counts as correct if the right answer is among the model's five best guesses). This was an achievement well beyond what people at the time expected.

The secret behind AlexNet's performance was a special structure called the Convolutional Neural Network (CNN). A CNN slides small filters across an image so that it examines the image one small patch at a time and learns by itself which characteristics of those patches are important. Put simply, where humans previously had to find characteristics such as eyes and noses and tell the computer about them, a CNN discovers and learns such characteristics on its own.
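The core operation of a CNN, sliding a small filter over an image patch by patch, can be sketched in plain Python. The 4x4 image and the 2x2 filter below are made-up values; the filter is set by hand to respond to a dark-to-bright vertical edge, whereas a real CNN learns its filter values during training.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# A tiny grayscale "image": dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-set filter that responds where brightness increases left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The output is largest in the middle column, exactly where the image changes from dark to bright, so the filter has located the edge without anyone pointing it out.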

Moreover, AlexNet was trained on GPUs, devices that can carry out huge numbers of calculations simultaneously. GPUs were originally developed to render screens quickly for games and graphics work, but because they excel at processing large amounts of data in parallel, they turned out to be highly effective for deep learning as well.

This event proved that deep learning could deliver powerful performance in the real world, not only in theory, and it drew great attention from researchers and companies worldwide. Many research teams began building deeper and more complex neural networks to surpass AlexNet.

Models that soon followed included VGG, which raised image recognition performance by stacking more layers, and GoogLeNet, which improved performance by combining several kinds of layers in parallel. In particular, ResNet, which appeared in 2015, introduced a new idea called skip connections to solve the problem that training stalls when a neural network becomes too deep. A skip connection passes information directly past one or more layers, much like taking an elevator straight to the desired floor of a high-rise building without stopping at every floor. With this, ResNet could stably train even very deep networks of over 150 layers.
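The effect of a skip connection can be seen in a deliberately extreme sketch. Here each "layer" is a stand-in (multiply by a weight, then clip negatives to zero), and the weight is set to 0 to mimic layers that have not yet learned anything useful. Both the stand-in layer and the depth of 50 are assumptions for illustration, not ResNet's actual design.

```python
def layer(x, weight):
    """Stand-in for a learned layer: scale each value, then apply ReLU."""
    return [max(0.0, weight * v) for v in x]

def residual_block(x, weight):
    """Skip connection: output = input + layer(input)."""
    fx = layer(x, weight)
    return [a + b for a, b in zip(x, fx)]

x = [1.0, -2.0, 3.0]
depth = 50
untrained_weight = 0.0  # layers that currently contribute nothing

plain = x
skipped = x
for _ in range(depth):
    plain = layer(plain, untrained_weight)               # no skip connection
    skipped = residual_block(skipped, untrained_weight)  # with skip connection

print("plain stack:   ", plain)
print("with skips:    ", skipped)
```

Without skip connections the signal is wiped out by the unhelpful layers, while with them the input still reaches the end of the stack intact. This is the elevator in the analogy: information has a direct route past each layer.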

These innovations did not stay inside research labs but were quickly applied in the real world, in fields as varied as autonomous vehicles recognizing their surroundings, AI in hospitals helping to diagnose diseases faster and more accurately, and smartphones automatically recognizing faces in photos.

In this way, the breakthrough of ImageNet and AlexNet in the 2010s elevated deep learning from a theoretical technology to a practical one and opened the door to the AI era we live in today.

Evolution of AI Understanding Language: The Rise of Transformers and Language Models

Just as deep learning began producing major achievements in images, major changes have also occurred in recent years in the field that deals with human language. In natural language processing (NLP), the field in which computers understand and process the language we use, a very important event occurred in 2017: the emergence of the Transformer.

Before transformers appeared, recurrent neural networks (RNN) were mainly used. Because the words of a language follow one another in order, an RNN receives the words one at a time in sequence and processes them in that order, just as we read a sentence from the first word to the last. With this approach, however, the influence of earlier words on later words fades as sentences grow longer, which makes it difficult to grasp the full meaning of a long sentence.
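The fading influence of early words can be illustrated with a stripped-down recurrent update. This sketch assumes a single-number memory, no activation function, and a recurrent weight of 0.5; real RNNs use vectors and nonlinearities, but the step-by-step carrying of memory is the same idea.

```python
def run_rnn(inputs, recurrent_weight=0.5):
    """Process inputs one at a time, carrying a single-number memory h."""
    h = 0.0
    for x in inputs:
        h = recurrent_weight * h + x  # old memory is scaled down, new input added
    return h

def influence_of_first_input(sequence_length):
    """How much the final memory changes if only the first input changes."""
    baseline = run_rnn([0.0] * sequence_length)
    changed = run_rnn([1.0] + [0.0] * (sequence_length - 1))
    return changed - baseline

for length in (2, 5, 10, 20):
    print(length, influence_of_first_input(length))
```

Each extra step halves what remains of the first input, so by the end of a long sequence it has almost no effect on the memory. LSTMs were designed to reduce this kind of fading.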

The Transformer solved this problem. Transformers were designed not to go through a sentence word by word from beginning to end, but to look at the entire sentence at once and quickly find which words are strongly related to each other. The Self-Attention method makes this possible.

Self-attention directly calculates how strongly the words within a sentence relate to one another. For example, in the sentence "I played soccer at school with a friend today," when interpreting the word 'soccer,' it discovers on its own that words like 'school' and 'friend' are the most relevant. Transformers automatically calculate the importance of each word in this way and so grasp the meaning of sentences more accurately.
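A simplified version of the self-attention calculation fits in a short sketch. The two-number word vectors below are made up so that 'soccer', 'school', and 'friend' point in similar directions and 'today' does not. As a further simplification, the same vectors are used as queries, keys, and values; an actual transformer learns separate projections for each.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention of every word over every word."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

words = ["soccer", "school", "friend", "today"]
vectors = [[1.0, 0.9], [0.8, 1.0], [0.9, 0.7], [-1.0, 0.2]]  # made-up values

outputs, weights = self_attention(vectors)
for word, w in zip(words, weights[0]):
    print("soccer attends to", word, "with weight", round(w, 3))
```

Every word is compared with every other word in one pass, so no information has to be carried step by step through the sentence as in an RNN.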

The best-known language model built on the transformer structure is GPT (Generative Pre-trained Transformer), created by the company OpenAI and first released in 2018. Put simply, GPT is an AI trained by reading an enormous amount of text and repeatedly predicting what the next word will be. Because it learned from such a large amount of text, it showed excellent ability in many areas, including writing, translation, and answering questions.

BERT (Bidirectional Encoder Representations from Transformers), announced by Google in the same period, also attracted great attention. BERT learns in a slightly different way from GPT: some words in a sentence are covered up, as in a fill-in-the-blank puzzle, and the model predicts each hidden word from the words surrounding it on both sides. Learning methods of this kind, including GPT's next-word prediction, are called self-supervised learning. Self-supervised learning lets an AI obtain the training signal it needs from the data itself, without humans providing correct answers one by one.

These language models first undergo pre-training through self-supervised learning on very large data (many books, internet articles, and so on). Pre-training is like a person first studying basic knowledge. Once this foundation is built, a small amount of additional training (fine-tuning) is enough to solve a variety of problems well, such as translation, question answering, and sentiment analysis.

Previously, a model usually had to be designed and trained from scratch for each new task. After transformer-based language models appeared, a single well-trained model could handle many different tasks with only a little additional training, which saves a great deal of time and cost.

Thanks to this change, transformers became a groundbreaking technology in natural language processing, and models like GPT and BERT enabled AI to understand and use human language better. The translation services and AI chatbots we use today feel natural and convenient because of this innovation.

The Era of Self-supervised Learning and General-purpose Models

With the arrival of self-supervised learning, AI entered a more advanced era. Models like GPT and BERT could learn general knowledge in advance from enormous amounts of data, and on that basis could readily perform a much wider range of tasks. This approach attracts particular attention because it greatly eased the 'labeling cost' problem, one of the biggest difficulties in earlier AI development.

Labeling means that humans attach the correct answer to each piece of data. To find cat photos, for example, people had to attach the label 'this photo is a cat' to photos one by one, which was difficult, time-consuming, and costly. Self-supervised learning does not need this human help, because it creates both the problems and the correct answers from the data itself: it covers some words in a sentence and predicts what fills the blanks, or covers part of an image and guesses the original. This made it possible to use the enormous amount of unlabeled data on the internet almost freely.
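How problems and answers can be created from raw text alone is easy to show in code. The sketch below builds two kinds of training pairs from a sentence: fill-in-the-blank pairs in the style of BERT, and next-word pairs in the style of GPT. The function names and the "[MASK]" placeholder string are illustrative choices, and real systems work on subword tokens and mask only a fraction of positions.

```python
def masked_examples(sentence, mask_token="[MASK]"):
    """Hide each word in turn; the hidden word becomes the correct answer."""
    words = sentence.split()
    examples = []
    for i, word in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), word))
    return examples

def next_word_examples(sentence):
    """Use each prefix of the sentence to predict the word that follows it."""
    words = sentence.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

sentence = "the cat sat"
for problem, answer in masked_examples(sentence):
    print("fill in:", problem, "| answer:", answer)
for problem, answer in next_word_examples(sentence):
    print("continue:", problem, "| answer:", answer)
```

No human wrote any of the answers: every label was taken from the sentence itself, which is why unlabeled text can serve directly as training data.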

As a result, AI learned from far more data than before, and an era of 'general-purpose AI models' began: models that can do many kinds of tasks rather than only one thing well. A single large language model can, for example, handle translation, document summarization, question answering, and even natural writing. The largest models can often take on a new task after being shown only a few simple examples, without much additional training.

That a single AI model can perform many different tasks is of great significance both technologically and industrially. Going forward, such general-purpose models will be used in even more fields and become more closely connected with our lives.

Change from Data-Centered to Feedback-Centered

This change is also bringing an important turning point in the direction of AI development. Until now, AI has developed around the ability to find accurate answers from vast amounts of data; going forward, the emphasis is shifting from data to feedback from humans. For AI to communicate naturally with people, it is not enough to give accurate answers. It must converse in a way that people find comfortable and satisfying. Just as the same content can be delivered in a way that is pleasant for the listener to receive, thoughtful consideration is becoming important.

A method that has recently appeared for this purpose is 'Reinforcement Learning from Human Feedback (RLHF).' In this method, the AI produces two answers to a question, and humans choose which answer they prefer. For example, in response to the question "What should I do on weekends?", imagine that the AI gives answer A: "Well, I don't know" and answer B: "Since the weather is good, how about going for a walk in the park?". If people judge answer B to be better, that preference is used as a reward signal, typically through a separate reward model trained on many such comparisons, and the AI is then trained by reinforcement learning to give kinder and more specific answers to similar questions.
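The step in which human choices become a reward signal can be sketched with a commonly used pairwise loss: the negative log of the sigmoid of the reward difference between the chosen and rejected answers. Everything below is a toy assumption. Each answer simply has one adjustable reward number, whereas a real reward model is a neural network that scores arbitrary text, and the later reinforcement learning stage is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(reward_chosen, reward_rejected):
    """Small when the chosen answer is scored well above the rejected one."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))

# One reward number per answer, both starting at zero (toy setup).
rewards = {"A": 0.0, "B": 0.0}
chosen, rejected = "B", "A"  # the human preferred answer B
learning_rate = 0.5          # assumed value

initial_loss = preference_loss(rewards[chosen], rewards[rejected])
for _ in range(20):
    p_chosen = sigmoid(rewards[chosen] - rewards[rejected])
    # Gradient descent on the loss: raise the chosen reward, lower the rejected.
    rewards[chosen] += learning_rate * (1.0 - p_chosen)
    rewards[rejected] -= learning_rate * (1.0 - p_chosen)
final_loss = preference_loss(rewards[chosen], rewards[rejected])

print("rewards:", {k: round(v, 3) for k, v in rewards.items()})
print("loss:", round(initial_loss, 4), "->", round(final_loss, 4))
```

After training, answer B carries the higher reward, and that learned score is what the reinforcement learning stage would then try to maximize.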

'Direct Preference Optimization (DPO)' simplifies this method further. Whereas RLHF runs a complex reinforcement learning process on top of the human evaluations, DPO trains the AI directly on the pairs of preferred and rejected answers, raising the probability of the preferred answer relative to the rejected one without a separate reward model or reinforcement learning loop. To use an analogy, if RLHF is like a teacher improving a student's writing through many rounds of correction, DPO is like the teacher simply showing pairs of example sentences and saying "writing like this is better than that." In this way the AI can learn the style of answer people want more simply and quickly.
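The DPO objective for a single preference pair can be written as one small function. The sketch below follows the usual form of the loss, which compares how much the model being trained has shifted toward the preferred answer relative to a frozen reference model. The log-probability values passed in are made-up numbers, and beta=0.1 is an assumed setting.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (preferred, rejected) answer pair.

    logp_*     : log-probabilities under the model being trained
    ref_logp_* : log-probabilities under the frozen reference model
    """
    shift_chosen = logp_chosen - ref_logp_chosen
    shift_rejected = logp_rejected - ref_logp_rejected
    return -math.log(sigmoid(beta * (shift_chosen - shift_rejected)))

# Model identical to the reference: no preference has been learned yet.
unchanged = dpo_loss(-2.0, -2.0, -2.0, -2.0)

# Model now favors the preferred answer more than the reference does.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)

print(round(unchanged, 4), round(improved, 4))
```

The loss falls as the model moves probability toward the preferred answer, so ordinary gradient descent on this function is enough and no reward model or reinforcement learning loop is needed.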

It is also becoming important for AI not only to state an answer but to show its thought process, as a person would. This is called the 'Chain-of-Thought (CoT)' method. For example, given the question "It rained yesterday and the sun came out today. What would the state of the park ground be?", rather than immediately answering "The ground is wet," the AI shows its reasoning step by step: "Since it rained yesterday, the park ground would have gotten wet, but since the sun came out today, it may have dried a bit. Still, it probably hasn't dried completely, so the ground is likely to be slightly damp." This lets people understand and trust the AI's answer more.

Another important change is the 'Self-Instruct' method, in which the AI creates its own study materials. In Self-Instruct, humans first provide only a few questions and answers, and the AI then generates new questions and answers in a similar format. For example, humans give an instruction such as "explain the names and characteristics of the planets of the solar system" along with a few examples; the AI then automatically creates questions and answers like "What are the characteristics of Jupiter?" and "Why does Saturn have rings?" and is trained again on them. It is similar to a student looking at a few example problems in a textbook and then writing similar practice problems to study with.

Finally, actively using external knowledge has become very important in recent AI. A representative example is 'Retrieval-Augmented Generation (RAG).' With RAG, the AI does not rely only on its own internal memory when generating an answer. Like a person who searches Google or Naver when curious about something, it retrieves the information it needs from the internet or an external database, refers to it, and then answers. For example, given a question like "What is the population of Seoul, South Korea in 2025?", the AI can look up the latest material and generate an answer based on it. This method is very useful when up-to-date, accurate information matters.
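The retrieve-then-answer flow of RAG can be sketched without any real search engine or language model. In the toy version below, the documents are invented sentences, retrieval is a simple count of shared words (real systems typically use vector similarity search), and the final step only assembles the prompt that would be handed to a language model.

```python
def tokenize(text):
    """Lowercase the text, drop basic punctuation, and split into a word set."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return set(cleaned.split())

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents,
                    key=lambda d: len(q_words & tokenize(d)),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Put the retrieved text in front of the question for the model to use."""
    context = "\n".join(retrieve(question, documents))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Invented documents standing in for an external knowledge source.
documents = [
    "The city library opens at nine on weekdays.",
    "The park closes at sunset every day.",
    "Bus 12 runs between the station and the library.",
]

question = "When does the park close?"
print(build_prompt(question, documents))
```

Because the answer is grounded in the retrieved text rather than in what the model memorized during training, updating the documents is enough to keep answers current.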

In this way, AI development that began with self-supervised learning has now advanced beyond simply handling data well: AI can communicate better with humans, create problems for itself and learn from them, and actively use external knowledge. These changes are becoming an important technological foundation for making the AI of the future smarter, kinder, and more trustworthy.

Connecting AI's Present and Future

So far we have examined step by step how AI has developed. The history of AI is not just a story of accumulating technology but a story of how the method of learning itself has continuously evolved. Starting from the very simple artificial neural network called the perceptron, AI has grown into extremely complex, enormous models with tens of billions of connections. Today's AI can understand text and images together, create new content, and in some complex situations make judgments at a level approaching that of humans.

A major turning point in this process was the emergence of self-supervised learning. Thanks to it, enormous amounts of data could be used without human labeling, and the vast repository of information called the internet became the best textbook for AI. AI then began advancing toward 'general-purpose AI' that performs well in many fields at once.

On top of this, methods such as reinforcement learning from human feedback have recently appeared that help AI communicate in a more human-like way. Methods in which AI explains step by step how it arrived at an answer, rather than presenting only the answer, and methods in which AI retrieves the information it needs from outside are also being actively researched. In this way AI is developing into a system that can reason and speak more like a person, rather than one that only guesses correct answers.

Today's AI is no longer a simple tool. It has become a complex system that understands human speech, grasps what we want, looks up what it does not know, and explains its reasoning on the way to an answer. And at the center of these capabilities lies the same steady effort as always: learning from data and trying to improve.

In the next chapter, we will look at how these AI models can learn faster and more efficiently in practice. Simply making models ever larger is no longer the answer. Much research is under way on how to achieve the same performance with less data, less cost, and less time. In other words, we are now moving one step beyond the stage of 'scaling' AI to the stage of 'optimizing' it so that it is more efficient and practical.