Distributed Training and Computation Optimization: Scaling in Real Environments
AI-Optimized Computation Environment: Hardware and Inference Strategies
Scaling in deep learning means adjusting the size and complexity of AI models to improve performance. Put simply, it means making a model larger or more complex so that it can solve problems more capably. But scaling does not mean that bigger is always better. A building is not good just because it is tall: its height, sturdiness, and internal structure have to work together. In the same way, an AI model becomes efficient and practical only when we decide carefully in which direction, and by how much, it should grow.
In the early days of AI, major advances usually came from new algorithms or ideas. AlexNet, mentioned earlier, is one example: a new architecture appeared and performance jumped. In recent years the situation has shifted somewhat. Inventing an entirely new algorithm now often matters less for performance than how far a model can be scaled up and how much data it can be trained on.
Why did this change occur? Researchers found that once enough data is gathered and enough computing power is available to process it, performance keeps improving as the model is made steadily larger. In other words, without adding anything special, a model can become more capable simply by becoming larger and more complex.
Researchers call this idea the 'scaling hypothesis.' It is the claim that "if models keep being made larger and are given more data, AI can keep improving." It is like test scores rising when a student studies longer and solves more problems, even though the basic method stays the same. Of course, just as focused study matters more than hours alone, the direction and method of growing a model still have to be designed well.
Very large models such as GPT-3 and the Vision Transformer (ViT) have in fact appeared in recent years. GPT-3 is a language model trained on hundreds of billions of tokens (word pieces) of text, and the Vision Transformer is an image recognition model trained on many millions of images. These models reached a level of performance that earlier models could not, largely by being scaled far beyond the size of their predecessors. GPT-3, for example, could write and converse with a fluency that often reads as human-like, and the Vision Transformer could recognize and classify objects in photos more accurately.
In short, scaling a model is not just a matter of making it bigger. It is a process of careful design so that AI develops in a practical and efficient way. The concept of scaling is now indispensable when predicting how far AI can develop.
Five Directions for Growing Models
Growing an AI model may sound like simply increasing its size, but there are in fact several distinct ways to do it. Let us look at five representative methods one at a time.
The first method is to increase the depth of the model, which means stacking more neural network layers. Students first learn simple concepts and then move on to harder ones, and in the same way a model with more layers can learn more complex patterns step by step. For example, the first layers learn the simplest features of a photo (lines, colors, and so on), while later layers come to recognize more complex features (the shape of a face or of a specific object). The advantage is that deeper models are better at finding complex structures and relationships.
The second method is to increase the width of the model, which means increasing the number of neurons (small computing units that play a role loosely similar to brain cells) in each layer. Just as more cars can pass along a wide road than a narrow one, a wider layer can process more information at the same time. This method pays off especially on hardware that can run many computations in parallel.
The third method is to raise the input resolution, which means having the model process larger images or longer sentences. Just as small text or objects are easier to see in a larger, higher-quality photo, a model can pick up finer details when its input carries more information.
The fourth method is to increase the total parameter count. Parameters are the values a model finds for itself during training, and together they represent how much the model can know. They are comparable to the amount of knowledge or memory in a person's head. Parameter count is usually raised by combining the methods above, mainly greater depth and width, to expand the model's overall capacity. Large models that are often discussed today, such as GPT-3, are representative examples of this approach.
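As a rough illustration of how depth and width drive parameter count, the sketch below counts the weights and biases of a plain fully connected network. The function name mlp_param_count and the layer sizes are assumptions chosen for illustration and are not taken from any specific model.

```python
def mlp_param_count(input_dim, hidden_width, depth, output_dim):
    """Count weights and biases of a fully connected network.

    depth is the number of hidden layers, each with hidden_width neurons.
    All sizes here are illustrative assumptions.
    """
    sizes = [input_dim] + [hidden_width] * depth + [output_dim]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out  # weight matrix between two layers
        total += fan_out           # one bias per output neuron
    return total


base = mlp_param_count(input_dim=784, hidden_width=256, depth=2, output_dim=10)
deeper = mlp_param_count(input_dim=784, hidden_width=256, depth=4, output_dim=10)
wider = mlp_param_count(input_dim=784, hidden_width=512, depth=2, output_dim=10)
print(base, deeper, wider)
```

In this toy setting, doubling the width more than doubles the total, because the weights between hidden layers grow with the square of the width, while each extra hidden layer of the same width adds a fixed amount.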
The fifth and last method is to increase computation at inference time (test-time compute). This means computing more carefully when a trained model is solving an actual problem. It is like working through an exam question several times to arrive at the most reliable answer. The model itself is not changed. Instead, it repeats its calculations or compares several approaches before answering, which can yield more precise and accurate answers.
However, growing a model in these ways does not always lead to good results. If a model becomes too deep or too complex, training may fail to progress, or the model may overfit, meaning it handles the problems it was trained on but cannot solve new, unfamiliar ones. A model that is too large and complex also takes too long to compute and costs more to run.
Researchers therefore think not only about how to increase size but also about how to design model structures more precisely and efficiently, so as to get the most performance out of a given size or set of resources.
From VGG to DenseNet: Deeper and Deeper
Attempts to make models deeper were pursued most actively in Convolutional Neural Networks (CNNs). A CNN is loosely modeled on how our eyes recognize objects: it first picks out small parts and then builds up the overall shape step by step.
One of the famous early models is VGGNet. VGG built a very deep network by repeatedly stacking very small 3×3 filters. A filter is a tool that analyzes an image one small region at a time, like using a magnifying glass to inspect small parts of a picture. Stacking many small filters allowed images to be analyzed more precisely, and performance improved greatly. VGG's problem was that as the model grew deeper and more complex, an enormous number of values to store (parameters) accumulated, which sharply increased memory use and computation time.
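One way to see why stacking small filters is attractive is to compare a stack of 3×3 layers with a single larger filter that covers the same region of the image. The sketch below is a back-of-the-envelope calculation under simplifying assumptions (stride 1, equal input and output channels, biases ignored); the helper names and the channel count of 64 are illustrative.

```python
def conv_weights(kernel_size, channels, num_layers=1):
    """Weights in a stack of square convolutions with equal in/out channels.

    Biases are ignored to keep the comparison simple.
    """
    return num_layers * kernel_size * kernel_size * channels * channels


def receptive_field(kernel_size, num_layers):
    """Side length of the input region seen by stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


channels = 64  # illustrative channel count
stacked = conv_weights(3, channels, num_layers=3)  # three 3x3 layers
single = conv_weights(7, channels, num_layers=1)   # one 7x7 layer

print(receptive_field(3, 3), receptive_field(7, 1))  # both cover a 7x7 region
print(stacked, single)
```

Under these assumptions the three stacked 3×3 layers see the same 7×7 region as one 7×7 layer while using fewer weights, and they also apply a nonlinearity after each layer.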
A network called ResNet (Residual Network) appeared to tackle the difficulties of going deeper. ResNet used a special idea called 'skip connections.' A skip connection is like a bypass road built to get past a congested stretch of highway. In a neural network, information can gradually weaken or disappear as it passes from layer to layer. Skip connections let information jump over layers and reach later layers directly, so it is not lost along the way. Thanks to this, ResNet could be trained stably even with more than 100 layers, and performance improved greatly. The idea was adopted by most of the models that followed.
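The effect of a skip connection can be sketched with a toy example. The "layer" below is a deliberately simple stand-in (a scalar weight followed by ReLU), not a real convolutional layer, and the weight value 0.1 and depth of 20 are assumptions picked to make the fading visible.

```python
def layer(x, weight):
    """Toy layer: scale every element, then apply ReLU."""
    return [max(0.0, weight * v) for v in x]


def plain_stack(x, weights):
    """Pass the signal through the layers one after another."""
    for w in weights:
        x = layer(x, w)
    return x


def residual_stack(x, weights):
    """Same layers, but each output is added to its input: x + F(x)."""
    for w in weights:
        fx = layer(x, w)
        x = [a + b for a, b in zip(x, fx)]  # skip connection
    return x


signal = [1.0, 2.0]
weights = [0.1] * 20  # 20 layers that each shrink the signal (assumed values)

faded = plain_stack(signal, weights)
kept = residual_stack(signal, weights)
print(faded)
print(kept)
```

In the plain stack the signal shrinks toward zero after 20 layers, whereas in the residual stack the input is carried forward at every step and remains clearly present at the output.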
Another famous model is Inception, also known as GoogLeNet. Rather than simply stacking layers deeper, Inception applies several filters of different sizes in parallel within a single layer. It is like using a magnifying glass, a telescope, and a microscope at the same time, so that objects of various sizes, from small to large, can be seen at once. In this way more information is obtained in a single pass, and computation stays efficient without sacrificing performance.
Finally, DenseNet is also noteworthy. DenseNet connects the output of each layer not only to the next layer but directly to all subsequent layers. Instead of passing a message from friend to friend in sequence, every friend informs all the others directly. As a result, important information from earlier layers is passed on without being lost. Because earlier features are reused in this way, the model can carry more information while actually needing fewer parameters, which improves efficiency.
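The dense connection pattern can be sketched as a running concatenation: every layer receives everything produced so far and appends a small number of new features. The toy layer below (which just emits copies of the mean) and the term growth_rate for the number of features each layer adds are assumptions for illustration, not the real DenseNet layers.

```python
def dense_block(x, num_layers, growth_rate):
    """Toy dense block: each layer sees all earlier features and adds a few new ones."""
    features = list(x)   # running concatenation of every feature so far
    input_sizes = []     # how many features each layer receives
    for _ in range(num_layers):
        input_sizes.append(len(features))
        mean = sum(features) / len(features)
        new_features = [mean] * growth_rate   # stand-in for a real layer's output
        features = features + new_features    # concatenate, never overwrite
    return features, input_sizes


out, sizes = dense_block([1.0, 2.0, 3.0, 4.0], num_layers=3, growth_rate=2)
print(sizes)     # input size grows by growth_rate at each layer
print(out[:4])   # the original input is still present, unchanged
```

Because each layer only has to add a few new features on top of what already exists, individual layers can stay narrow, which is one intuition for how such a design can remain parameter-efficient.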
As these examples show, making a model deeper is not just a matter of stacking more layers. It is a precise design process that has to consider at once how effectively information flows as layers are added, how rich a range of information the model can express, and whether computation slows down.
Balanced Expansion Strategy: EfficientNet and RegNet
The most common question when growing an AI model is which part to grow, and by how much, to get the best performance. If only depth is increased, information may fail to propagate and fade away partway through the network. If only width is increased, too much information arrives at once, making computation heavy and slow. If only input resolution is raised, memory use can exceed what the hardware can handle.
To solve these problems, a method that grows depth, width, and resolution together in balance attracted attention. It is called 'compound scaling.' By analogy, rather than making a building extremely tall, it harmonizes height, floor area, and interior facilities to produce the sturdiest and most useful building.
A representative example of this balanced approach is EfficientNet. EfficientNet first found an efficient small baseline model, EfficientNet-B0, using a technique called NAS (Neural Architecture Search). NAS is a technique in which an automated search procedure evaluates a very large number of candidate architectures to find one that performs best. Starting from this baseline, the models EfficientNet-B1 through B7 were created by scaling up depth, width, and input resolution together, a little at a time, with each step performing better than the last.
EfficientNet decided how much to increase each element according to preset mathematical ratios, keeping the growth in overall computation under control. As a result, it achieved higher accuracy with far less computation and far fewer parameters than other well-known models of the time. Put simply, a smaller, lighter model delivered better performance.
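The preset ratios can be sketched as follows. The base values 1.2, 1.1, and 1.15 for depth, width, and resolution are the ones reported in the EfficientNet paper rather than stated in this text, and the function name compound_scale is an assumption. The sketch shows only the scaling rule; it does not attempt to reproduce the exact settings of the released B1 to B7 models.

```python
# Base multipliers for depth, width, and resolution
# (values as reported in the EfficientNet paper; not given in this text).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution, approx. compute) multipliers for a scaling step phi."""
    depth_mult = ALPHA ** phi
    width_mult = BETA ** phi
    res_mult = GAMMA ** phi
    # Compute grows linearly with depth and roughly quadratically
    # with width and with resolution.
    flops_mult = depth_mult * width_mult ** 2 * res_mult ** 2
    return depth_mult, width_mult, res_mult, flops_mult


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, compute x{f:.2f}")
```

The three base values are chosen so that each increase of phi by one roughly doubles the total computation, which is what lets a single number control the overall budget while the three dimensions grow together.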
Another example of a balanced model is RegNet. RegNet took a somewhat different route from EfficientNet: instead of searching for individual architectures automatically as NAS does, it relies on simple design rules formulated by researchers. Its core idea is that well-performing models have widths (that is, the number of neurons or channels in each layer) that increase progressively according to a regular rule as the layers get deeper. It is like adding rooms or facilities at a fixed rate on each floor of a building: setting rules for expansion and following them is good for performance.
With such rules, the balance between computation and performance can be tuned finely without a complex automated search. RegNet is therefore regarded as a comparatively simple yet efficient and practical model family.
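A width rule of this kind can be sketched as a straight line over the layer index, rounded to hardware-friendly sizes. This is a simplified illustration: the actual RegNet procedure uses a more elaborate quantization step, and the function name linear_widths, the starting width 32, the slope 26.5, and the rounding to multiples of 8 are all assumptions made for the example.

```python
def linear_widths(initial_width, slope, depth, multiple=8):
    """Widths that grow linearly with block index, rounded to a multiple.

    Simplified sketch of a RegNet-style rule; not the full published procedure.
    """
    widths = []
    for j in range(depth):
        target = initial_width + slope * j              # linear growth with depth
        rounded = int(round(target / multiple)) * multiple
        widths.append(rounded)
    return widths


widths = linear_widths(initial_width=32, slope=26.5, depth=6)
print(widths)
```

Two numbers, a starting width and a slope, describe the whole network here, which is what makes it possible to trade computation against performance without running a search over architectures.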
What EfficientNet and RegNet have in common is that neither simply makes the model bigger. Both think carefully about how to get the best performance from limited resources.
Scaling Transformers and Super-large Language Models
The most dramatic shift in model growth took place in the field of AI that deals with language, natural language processing (NLP). Once the architecture known as the Transformer appeared, language models began to grow rapidly.
Let us briefly look at why the Transformer is well suited to growing models. Its structure is relatively simple, layers can be stacked freely, and many calculations can be processed in parallel. Instead of reading a book slowly page by page, it is like taking in many pages at once. Thanks to these properties, Transformers proved well suited to being expanded into enormous models with tens of billions of parameters or more.
Early models such as BERT and GPT had parameter counts in the hundreds of millions. Model sizes then kept growing: GPT-2 had about 1.5 billion parameters, and its successor GPT-3 had an enormous 175 billion. Models as large as GPT-3 far outperformed earlier models on a variety of language tasks, including translation, question answering, and writing, in large part because of their size. Model size itself had become an important factor in determining what AI can do.
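These parameter counts can be roughly recovered from just two architectural numbers. The sketch below uses a common rule of thumb that a Transformer has about 12 × layers × (model width)² parameters, ignoring embeddings. The layer counts and widths used for GPT-2 and GPT-3 are the commonly reported configurations of their largest versions, not figures from this text, so treat them as assumptions.

```python
def approx_transformer_params(num_layers, d_model):
    """Rough non-embedding parameter count of a Transformer.

    Each layer has about 4 * d_model^2 attention weights and
    8 * d_model^2 feed-forward weights (assuming a 4x hidden expansion).
    """
    return 12 * num_layers * d_model ** 2


# Commonly reported configurations (assumed here for illustration).
gpt2_large_config = approx_transformer_params(num_layers=48, d_model=1600)
gpt3_config = approx_transformer_params(num_layers=96, d_model=12288)

print(f"GPT-2 estimate: {gpt2_large_config / 1e9:.2f} billion")
print(f"GPT-3 estimate: {gpt3_config / 1e9:.0f} billion")
```

The estimates land close to the 1.5 billion and 175 billion figures quoted above, and the formula makes clear that widening a model grows the count quadratically while deepening it grows the count linearly.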
However, endlessly increasing the parameter count is not always a good approach. In 2022, the AI research company DeepMind reported an important finding in a study known as 'Chinchilla.' The study emphasized that the amount of training data matters just as much as model size. GPT-3, for example, was enormous as a model, but the amount of data it was trained on was small relative to that size, so it could not reach its full potential. It is like a very bright student who has not read many books and therefore cannot fully show what they are capable of.
The Chinchilla model, by contrast, was smaller than GPT-3 but was trained on far more data. Just as a student who reads more and studies more carefully gets better results, Chinchilla achieved better performance despite its smaller size.
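The imbalance can be made concrete by comparing how many training tokens each model saw per parameter. The sizes below (175 billion parameters and about 300 billion tokens for GPT-3, 70 billion parameters and about 1.4 trillion tokens for Chinchilla) are the commonly cited figures from the respective papers rather than numbers given in this text, and the 6 × parameters × tokens estimate of training compute is a widely used approximation, not an exact count.

```python
def tokens_per_param(params, tokens):
    """How many training tokens the model saw for each parameter."""
    return tokens / params


def approx_training_flops(params, tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


# Commonly cited figures (assumptions for this sketch).
gpt3 = {"params": 175e9, "tokens": 300e9}
chinchilla = {"params": 70e9, "tokens": 1.4e12}

gpt3_ratio = tokens_per_param(**gpt3)
chinchilla_ratio = tokens_per_param(**chinchilla)

print(f"GPT-3: {gpt3_ratio:.1f} tokens per parameter")
print(f"Chinchilla: {chinchilla_ratio:.1f} tokens per parameter")
print(f"GPT-3 compute: {approx_training_flops(**gpt3):.2e} FLOPs")
print(f"Chinchilla compute: {approx_training_flops(**chinchilla):.2e} FLOPs")
```

On these figures Chinchilla saw roughly ten times more data per parameter than GPT-3, which is the sense in which the smaller model was the better-read student.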
Later research moved toward the view that good performance requires balancing model size, amount of data, number of training steps, and computation, rather than simply increasing model size. Recent large language models such as LLaMA, PaLM, and Claude were developed in this climate, where that balance is an explicit design consideration. In other words, the field is moving from a 'size competition,' in which models are simply made larger, to an 'efficiency competition,' in which models are made to work more efficiently and effectively at a given size.
The AI models we encounter from now on will not merely be large. They will be capable and efficient models that combine size, data, and efficiency well.
Thinking Longer and Deeper: Test-Time Compute
So far we have looked at how to make AI models larger and more efficient. But even after a model has finished training and is solving actual problems, better results can be obtained by changing how it computes. This approach is called 'test-time compute' (expanding computation at the inference stage).
Traditionally, when a question was posed to an AI model, it would immediately produce a single answer. Recently, various methods have been tried to make answers more accurate and trustworthy instead of returning one answer right away.
One such method is to generate several answers to a single question and then select the one that is most consistent across them. It is like asking several friends about a problem and following the advice that most of them agree on.
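The selection step can be sketched as a simple majority vote over sampled answers. In the sketch below the list of answers is hard-coded as a stand-in for several independent outputs from a model, and the function name majority_answer is an assumption for illustration.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most frequent answer and the share of samples that agree with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Stand-in for five answers sampled from a model for the same question.
samples = ["30", "30", "25", "30", "32"]
best, agreement = majority_answer(samples)
print(best, agreement)
```

The agreement share is a useful by-product: when the samples disagree widely, the low value signals that the final answer deserves less trust.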
Another method is to lay out the reasoning used to solve a problem before presenting the final answer. The 'Chain-of-Thought' method examined earlier belongs here. When solving a math problem, for example, instead of just saying "the answer is 30," the model explains the steps, such as "first multiply 10 by 2 to get 20, then add 10 to get 30," and then gives the final answer. This makes it easier to spot mistakes along the way and to reach more accurate answers.
For complex problems, there is also a method of going through several rounds to make an answer gradually more accurate instead of giving a single answer at once. This is called a 'self-refinement' structure: the AI is designed to think again, as a person would, and recheck its own answer. A related idea, spending extra computation at decision time, has long been used in systems built on reinforcement learning. The famous Go program AlphaGo, for example, chose its next move by simulating thousands of possible continuations of the game and picking the move that looked most likely to lead to a win. Just as Go masters consider many possible moves before deciding on the best one, repeating calculations and thinking more carefully can make results much better.
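The idea of choosing a move by repeated simulation can be sketched with a toy Monte Carlo estimate. This is a deliberately simplified illustration and not AlphaGo's actual algorithm, which combines tree search with neural networks. The move names and hidden win probabilities are made up for the example.

```python
import random


def estimate_win_rate(true_win_prob, num_rollouts, rng):
    """Play num_rollouts simulated games and return the fraction won."""
    wins = sum(1 for _ in range(num_rollouts) if rng.random() < true_win_prob)
    return wins / num_rollouts


def choose_move(move_win_probs, num_rollouts, seed=0):
    """Estimate each move's win rate by simulation and pick the best estimate."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    estimates = {
        move: estimate_win_rate(prob, num_rollouts, rng)
        for move, prob in move_win_probs.items()
    }
    best_move = max(estimates, key=estimates.get)
    return best_move, estimates


# Hypothetical moves whose true odds are hidden from the player.
hidden_odds = {"A": 0.40, "B": 0.60, "C": 0.30}
move, estimates = choose_move(hidden_odds, num_rollouts=2000)
print(move, estimates)
```

With only a handful of simulations the estimates are noisy and the wrong move can win out; spending more computation per decision makes the choice more reliable, which is the trade that test-time compute offers.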
In other words, without retraining the model, designing the computation at inference time more carefully can open a path to better performance. Especially where it is hard to make the model larger or where resources for training are limited, this 'test-time compute' strategy can be a very useful alternative.
The Role of Normalization Techniques for Stable Learning
As deep learning models became deeper and more complex, a new problem appeared. As layers were added, the values produced at each layer no longer stayed in a steady range but gradually grew larger or smaller, which made training difficult. It is like a class that starts at an appropriate level but drifts until the material is far too hard or far too easy, so that students can no longer follow it.
When values are unstable in this way, training slows down or, in some cases, fails altogether. Normalization was introduced to solve this problem.
Normalization keeps the values coming out of each layer at an appropriate level, neither too large nor too small. It plays the same role as a teacher who adjusts the difficulty of a class to a level the students can follow.
One of the most widely used normalization methods is Batch Normalization. It operates on mini-batches, the small groups of data samples that are processed together. Within each mini-batch, the output values of each neuron are adjusted to have a mean of 0 and a variance of 1. To use the school analogy again, it is like a teacher rescaling scores around the class average when students' scores are so spread out that the class is hard to run.
However, forcing all values into the same range can erase distinctive characteristics of the original data. Batch normalization therefore adds two learnable parameters, a scale and a shift, which let the model readjust the normalized values itself. It is like evening out students' scores and then adding back a little of each student's individual effort or character so that individuality is preserved.
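The two steps, normalizing and then rescaling, can be sketched for a single neuron as follows. This minimal version covers only the training-time computation on one mini-batch; the names gamma and beta for the scale and shift follow common usage, and the small constant eps (added to avoid division by zero) is an assumption about implementation detail rather than something stated in this text.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one neuron's outputs across a mini-batch, then rescale and shift.

    gamma (scale) and beta (shift) would be learned during training;
    here they are plain arguments for illustration.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    normalized = [(v - mean) / math.sqrt(var + eps) for v in values]
    return [gamma * n + beta for n in normalized]


batch = [2.0, 4.0, 6.0, 8.0]  # one neuron's outputs for four samples
plain = batch_norm(batch)                          # mean about 0, variance about 1
adjusted = batch_norm(batch, gamma=2.0, beta=3.0)  # model-chosen scale and shift
print(plain)
print(adjusted)
```

With gamma at 1 and beta at 0 the output is the plain normalized value; other settings let the model move away from that default when doing so helps.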
This keeps the output of each layer within a steady range, so the model can train stably. Training proceeds faster as a result, and overfitting, where a model fits its training data too closely and handles new data poorly, can also be reduced to some extent.
Batch normalization also has drawbacks. If the mini-batch is too small, its effect can weaken or training can become unstable. Other normalization methods have been proposed to make up for this.
A representative alternative is Layer Normalization. Instead of normalizing across a mini-batch, it normalizes all the neuron values of a layer together, separately for each data sample, so it does not depend on batch size. There are also other methods, such as Group Normalization, which splits a layer's channels into several groups and normalizes each group, and Instance Normalization, which normalizes each channel of each sample on its own.
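The difference between the two main approaches comes down to the direction in which the statistics are computed. The sketch below treats a mini-batch as a small table with one row per sample and one column per neuron. The function names are assumptions, and the learnable scale and shift are omitted to keep the contrast visible.

```python
import math


def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to mean 0 and variance close to 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def per_sample_norm(batch):
    """Layer-normalization style: normalize each row (sample) across its neurons."""
    return [normalize(row) for row in batch]


def per_feature_norm(batch):
    """Batch-normalization style: normalize each column (neuron) across samples."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


batch = [[1.0, 2.0, 3.0],
         [10.0, 20.0, 30.0]]

print(per_sample_norm(batch))    # each row normalized on its own
print(per_feature_norm(batch))   # each column normalized across the two samples

single = [[1.0, 2.0, 3.0]]       # a mini-batch containing one sample
print(per_sample_norm(single))   # still informative
print(per_feature_norm(single))  # collapses to zeros
```

The last two lines show why batch size matters: with a single sample, per-feature statistics carry no information and the output collapses, whereas per-sample normalization is unaffected.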


