The AI technologies we frequently hear about today all started from one special structure: the Artificial Neural Network (ANN). Just as countless neurons in the human brain connect and exchange information, artificial neural networks are designed so data passes through multiple layers and solves increasingly complex problems.
In this chapter, we examine what artificial neural networks are and how they work, why so much of AI is developing around them, and why this structure became the core of deep learning. We start from linear regression, the simplest starting point, then examine the Multi-Layer Perceptron (MLP), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and finally the Transformer.
Starting from Linear Regression
Linear regression fits a straight line to data in order to find a simple rule, such as predicting that more study hours produce better test scores. However, real-world data contains complex non-linear relationships that cannot be expressed with a single straight line. This led to stacking multiple layers with activation functions between them to add non-linearity, which is the starting point of artificial neural networks.
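As a minimal sketch of the study-hours example, the snippet below fits a line y = w * x + b with ordinary least squares. The data points and the function name fit_line are made up for illustration and are not taken from any real dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + b with a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    # The fitted line always passes through the point (mean_x, mean_y)
    b = mean_y - w * mean_x
    return w, b


# Hypothetical data: hours studied and the resulting test score
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 61.0, 68.0, 79.0, 85.0]

w, b = fit_line(hours, scores)
predicted = w * 6.0 + b  # predicted score after 6 hours of study
print(f"score = {w:.1f} * hours + {b:.1f}")
```

The model has only two numbers to learn, a slope and an intercept, which is exactly why it cannot follow a curved relationship.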
Activation Functions: Breathing Life into Neural Networks
Activation functions sit between layers and apply a non-linear transformation to each layer's output. Sigmoid and Tanh were used initially, but in deep networks they suffer from the vanishing gradient problem: their gradients shrink toward zero for large positive or negative inputs, so the learning signal fades as it propagates back through many layers. ReLU (Rectified Linear Unit) largely alleviates this by outputting 0 for negative inputs and the input value itself for positive inputs, which keeps the gradient at a constant 1 for positive inputs and enables faster learning. Variants such as Leaky ReLU, Swish, and ELU followed.
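The following sketch compares the local gradients of Sigmoid and ReLU. To keep the arithmetic visible it assumes a chain of 10 layers and ignores the weight matrices, so it illustrates only the activation's contribution to the gradient, not a full backpropagation pass. The Leaky ReLU slope of 0.01 is a common default chosen here as an assumption.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # Derivative of sigmoid; its maximum value is 0.25, reached at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    # Keeps a small non-zero slope for negative inputs
    return x if x > 0 else slope * x


# Backpropagation multiplies one local gradient per layer.
depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth  # best case for sigmoid: 0.25 ** 10
relu_chain = relu_grad(1.0) ** depth        # stays at 1.0 for positive inputs

print(f"sigmoid: {sigmoid_chain:.2e}, relu: {relu_chain}")
```

Even in sigmoid's best case the product falls below one millionth after 10 layers, while the ReLU product stays at 1 as long as the inputs are positive.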
Problems That Arise from Connecting Everything
Multi-Layer Perceptrons (MLP) with fully connected layers suffer from parameter explosion. A 224×224 color image has 224 × 224 × 3 = 150,528 input values, so connecting it to 1,000 neurons in the first layer requires over 150 million connections, which slows training and encourages overfitting. More specialized structures were needed.
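The count above can be checked with a few lines of arithmetic. The sketch assumes three color channels (RGB) and one bias per neuron; the variable names are illustrative.

```python
# Input size of a 224x224 RGB image, flattened for a fully connected layer
height, width, channels = 224, 224, 3
hidden_units = 1000

inputs = height * width * channels   # 150,528 input values
weights = inputs * hidden_units      # one weight per input-neuron pair
biases = hidden_units                # one bias per neuron (assumed)

print(f"inputs:  {inputs:,}")
print(f"weights: {weights:,}")
print(f"total:   {weights + biases:,}")
```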
The Problem-Solver for Image Analysis: Convolutional Neural Networks (CNN)
A CNN slides a small fixed-size window (a filter) across the image and analyzes one local region at a time. Its key concept is weight sharing: the same filter is applied at every position in the image, which drastically reduces the number of parameters. Because the same filter responds to a pattern wherever it appears, a CNN can recognize objects largely regardless of their position in the image. As layers deepen, the Receptive Field expands, so features progress from small local details to overall shapes. CNNs are widely used in Image Classification, Object Detection, and Face Detection.
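Weight sharing can be sketched with a plain sliding-window loop. The function below is a simplified single-channel version with no padding and a stride of 1. The 2×2 edge filter and the tiny images are hand-picked for illustration, whereas a real CNN learns its filter values during training.

```python
def conv2d(image, kernel):
    """Slide one kernel over a 2D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            # The same kernel weights are reused at every (i, j) position
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hand-picked filter that responds where values increase from left to right
kernel = [[-1.0, 1.0],
          [-1.0, 1.0]]

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

# The same edge, shifted one pixel to the right
shifted = [[0, 0, 0, 1],
           [0, 0, 0, 1],
           [0, 0, 0, 1]]

print(conv2d(image, kernel))    # [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
print(conv2d(shifted, kernel))  # [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
```

The filter has only 4 weights no matter how large the image is, and its response moves together with the edge, which is the intuition behind position-independent recognition.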
Models That Reflect the Flow of Time: Recurrent Neural Networks (RNN)
An RNN processes data sequentially, feeding the result of each step into the next calculation. This suits language, music, stock prices, and weather data, where order matters. However, it suffers from the vanishing gradient problem on long sequences, and because each step depends on the previous one, the steps of a sequence cannot be computed in parallel. LSTM and GRU improved on this by learning which information to remember or forget, but the structural limitations remained.
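A minimal sketch of the recurrence, assuming a scalar hidden state and hand-picked weights (w_x, w_h, and b are illustrative values, not learned ones). Real RNNs use vectors and weight matrices, but the step-by-step dependency is the same.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Scalar vanilla RNN: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    h = h0
    states = []
    for x in inputs:
        # Each step needs the previous hidden state, so the loop is inherently sequential
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# The same values in a different order produce a different final state
early = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
late = rnn_forward([0.0, 0.0, 1.0], w_x=1.0, w_h=0.5, b=0.0)

print([round(h, 3) for h in early])
print([round(h, 3) for h in late])
```

In the first sequence the influence of the initial input shrinks at every step, which hints at why plain RNNs struggle to carry information across long sequences.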
Completely Changing the Structure: Transformers and Self-Attention
Transformers process all positions of a sequence simultaneously rather than one at a time. The core technique is Self-Attention, which measures how closely each word is related to every other word. For example, in "I drank coffee this morning," self-attention can recognize that "drank" is most closely related to "coffee." Because all word relationships are computed in parallel, processing is dramatically faster even for long sentences, and far-apart words can exchange meaning directly.
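The mechanism can be sketched as scaled dot-product attention in plain Python. The 2-dimensional query and key vectors below are hand-picked toy values so that "drank" lines up with "coffee". In a real transformer they come from learned projections of the word embeddings, and the values would be a separate projection rather than a reuse of the keys.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:  # rows are independent, so they could be computed in parallel
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        # Each output is a weighted average of all value vectors
        outputs.append([sum(a * v[d] for a, v in zip(attn, values))
                        for d in range(len(values[0]))])
    return outputs, weights


tokens = ["I", "drank", "coffee", "this", "morning"]
# Toy vectors chosen by hand for illustration only
queries = [[1.0, 0.0], [0.0, 2.0], [0.5, 1.0], [1.0, 1.0], [0.2, 0.5]]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0], [1.0, 0.2], [0.5, 1.0]]
values = keys  # reused as values to keep the sketch short

outputs, weights = self_attention(queries, keys, values)
drank_row = weights[tokens.index("drank")]
most_attended = tokens[drank_row.index(max(drank_row))]
print(most_attended)  # "coffee" for these toy vectors
```

Each row of weights sums to 1, so every word's output is a blend of all words in the sentence, weighted by relevance.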
The Power of Transformers: Solving Various Problems with One Structure
Originally developed for machine translation, transformers proved outstanding at question answering, document summarization, sentiment analysis, creative writing, code generation, and image captioning, all with one model structure. Models like ChatGPT, Gemini, and Claude share the common transformer structure and are pre-trained on tens of billions of sentences.
Self-Supervised Learning: No Need for Correct Answers
Collecting labeled training data is enormously costly. Self-supervised learning solves this by having the model create its own problems from raw text: hiding words in a sentence and predicting the blanks (as BERT does) or predicting the next word (as GPT does). Combined with transformers, this enables learning without human-written labels, makes use of the vast amount of unlabeled data on the internet, and produces models that apply broadly to diverse tasks.
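The sketch below shows how training pairs can be generated from raw text without any human labeling. It is a simplification: the function names are illustrative, words are split on whitespace instead of using a real tokenizer, and the masked version hides each word in turn, whereas BERT masks a random subset of tokens.

```python
def next_word_pairs(tokens):
    """GPT-style: every prefix is an input, the following word is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_pairs(tokens, mask_token="[MASK]"):
    """BERT-style (simplified): hide one word at a time, the target is the hidden word."""
    pairs = []
    for i, word in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        pairs.append((masked, word))
    return pairs


sentence = "I drank coffee this morning".split()

for context, target in next_word_pairs(sentence):
    print(" ".join(context), "->", target)

for masked, target in masked_pairs(sentence):
    print(" ".join(masked), "->", target)
```

The "correct answers" come from the sentence itself, so any unlabeled text can serve as training data.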
Looking Back on the Development of AI Structures
From linear regression through activation functions, MLP, CNN, and RNN to transformers, each structure was designed for its purpose and the characteristics of the data it handles. All share the common goal of finding meaningful rules and patterns in data and applying them to new situations. Based on these various neural network structures, we can create even more sophisticated and creative AI systems.

