How Are LLMs Trained?

Harini Narasimhan
5 min read · Jan 3, 2024

The initial training process of an LLM is called pre-training. It follows the same steps used to train any deep learning model, adapted for LLMs:

  • Data collection and scraping
  • Model architecture decision
  • Training

Depending on our requirements, at the last step we can either pre-train a model from scratch or fine-tune an existing LLM.

Data collection and scraping

In this phase, the model builds a deep statistical representation of language by learning from vast amounts of unstructured text. This can be gigabytes, terabytes, or even petabytes of text. The data is pulled from many sources, including scrapes of the Internet and corpora of texts that have been assembled specifically for training language models. When we scrape training data from public sources such as the Internet, we often need to process the data to increase quality, address bias, and remove harmful content. As a result of this data quality curation, often only 1–3% of tokens are used for pre-training. We should take this into account when we estimate how much data we need to collect if we decide to pre-train our own model.
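To make the 1–3% figure concrete, here is a minimal sketch of the estimate. The function name `raw_tokens_needed` and the 20-billion-token target are hypothetical values chosen for illustration, not figures from any specific model.

```python
import math


def raw_tokens_needed(target_tokens: int, retention_percent: float) -> int:
    """Raw tokens to collect so that `target_tokens` remain after curation."""
    if not 0 < retention_percent <= 100:
        raise ValueError("retention_percent must be in (0, 100]")
    return math.ceil(target_tokens * 100 / retention_percent)


# Hypothetical goal: 20 billion curated tokens for pre-training.
target = 20_000_000_000
for percent in (1, 3):
    needed = raw_tokens_needed(target, percent)
    print(f"{percent}% retained: collect about {needed:,} raw tokens")
```

Even at the optimistic end of the range, the raw collection has to be dozens of times larger than the curated dataset we actually train on.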

In this self-supervised learning step, the model learns the patterns and structures present in the language. These patterns then enable the model to complete its training objective, which depends on the architecture of the model.

Model architecture decision

There are several model architectures, and all of them are derived from the transformer.

[Figure: a basic transformer model]

During pre-training, the model weights are updated to minimize the loss of the training objective. The encoder generates an embedding, or vector representation, for each token. Pre-training also requires a large amount of compute and the use of GPUs. There are three variants of the transformer model:

  • Encoder-only
  • Encoder-decoder
  • Decoder-only
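Each variant is commonly paired with a different self-supervised training objective. The sketch below builds toy training examples for the objectives usually associated with them: masked language modeling for encoder-only models, span corruption for encoder-decoder models, and next-token prediction for decoder-only models. This pairing is an assumption drawn from common practice rather than from this article, and the function names, the `[MASK]` token, and the `<X>` sentinel are illustrative.

```python
def masked_lm_example(tokens, mask_positions, mask_token="[MASK]"):
    """Encoder-only style: hide chosen tokens and predict them from both sides."""
    masked = [mask_token if i in mask_positions else t
              for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return masked, targets


def span_corruption_example(tokens, start, end, sentinel="<X>"):
    """Encoder-decoder style: replace a span with a sentinel and reconstruct it."""
    corrupted = tokens[:start] + [sentinel] + tokens[end:]
    target = [sentinel] + tokens[start:end]
    return corrupted, target


def causal_lm_pairs(tokens):
    """Decoder-only style: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["the", "teacher", "teaches", "the", "student"]
print(masked_lm_example(tokens, {2}))
print(span_corruption_example(tokens, 1, 3))
for context, next_token in causal_lm_pairs(tokens):
    print(context, "->", next_token)
```

In all three cases the labels come from the text itself, which is why no manual annotation is needed for pre-training.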

Once we understand how these different model architectures are trained and the specific tasks they are well suited to, we can select the type of model that best fits our use case. One additional thing to keep in mind is that larger models of any architecture are typically more capable of carrying out their tasks well. Researchers have found that the larger a model is, the more likely it is to work as we need it to without additional in-context learning or fine-tuning.

While this may sound great, it turns out that training these enormous models is difficult and very expensive, so much so that it may be infeasible to continuously train larger models. Let’s take a closer look at some of the challenges associated with training large models.

Training

Several challenges arise when we begin to train large language models. Some of them are listed below, each with a possible solution.

Running out of memory

Most LLMs are huge and require a large amount of memory to store and train all of their parameters. Training actually requires approximately 6 times the amount of GPU RAM that the model weights alone take up. A one-billion-parameter model at 32-bit full precision takes about 4 gigabytes for the weights alone, so training it requires approximately 24 gigabytes of GPU RAM. To reduce the memory required to store the model weights, we can reduce their precision from 32-bit floating point numbers to 16-bit floating point numbers or 8-bit integers. This technique is known as quantization.
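The arithmetic behind these numbers can be sketched as follows. The helper names are hypothetical, a gigabyte is taken as 10^9 bytes, and applying the same 6x multiplier to lower precisions is a rough simplification rather than a measured result.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Rule of thumb from the text: training needs about 6x the memory of the weights.
TRAINING_OVERHEAD = 6


def weights_gb(num_params: int, precision: str = "fp32") -> float:
    """Memory for the model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def training_gb(num_params: int, precision: str = "fp32") -> float:
    """Rough GPU RAM needed for training, using the 6x rule of thumb."""
    return TRAINING_OVERHEAD * weights_gb(num_params, precision)


one_billion = 1_000_000_000
for precision in BYTES_PER_PARAM:
    print(precision,
          f"weights: {weights_gb(one_billion, precision)} GB,",
          f"training: {training_gb(one_billion, precision)} GB")
```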

Quantization statistically projects the original 32-bit floating point numbers into a lower precision space, using scaling factors calculated based on the range of the original 32-bit floating point numbers.
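As an illustration, here is a minimal pure-Python sketch of one simple scheme, symmetric "absmax" quantization to 8-bit integers. The choice of this particular scheme and the function names are assumptions; real libraries use more elaborate variants.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one scale factor.

    The scale is derived from the range of the original values:
    the largest absolute value is mapped to 127.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale factor."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print("quantized:", quantized)
print("restored: ", restored)
```

The restored values are close to the originals but not identical. That small loss of precision is the price paid for storing each weight in one byte instead of four.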

Impact of Quantization

Improving model performance

Model performance can be improved in two ways:

  1. Increasing the size of the dataset
  2. Increasing the number of parameters in the model

In theory, we could scale either or both of these quantities to improve performance. However, we also have to consider our compute budget, which includes factors like the number of GPUs we have access to and the time we have available for training. To work within these constraints, we can follow two guidelines:

> Chinchilla law of training: the compute-optimal training dataset size for a given model, measured in tokens, is about 20 times the number of parameters in the model.

> Parameter-efficient training: instead of pre-training, tune an existing model using methods such as LoRA and soft prompting.
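The Chinchilla rule of thumb is easy to turn into a quick estimate. The sketch below uses the 20-tokens-per-parameter ratio quoted above; the function names are hypothetical, and the ratio is an approximation rather than an exact law.

```python
# Approximate ratio from the Chinchilla rule of thumb.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(num_params: int) -> int:
    """Training tokens suggested for a model with `num_params` parameters."""
    return TOKENS_PER_PARAM * num_params


def compute_optimal_params(num_tokens: int) -> int:
    """Model size suggested for a dataset of `num_tokens` tokens."""
    return num_tokens // TOKENS_PER_PARAM


for params in (1_000_000_000, 7_000_000_000, 70_000_000_000):
    print(f"{params:,} parameters -> {compute_optimal_tokens(params):,} tokens")
```

Reading the rule in the other direction is just as useful: if we can only collect a fixed number of curated tokens, `compute_optimal_params` gives the largest model that dataset can reasonably support.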

Domain adaptation

It may be necessary to pre-train our own model from scratch if our target domain uses vocabulary and language structures that are not commonly used in day-to-day language. In that case, we may need to perform domain adaptation to achieve good model performance.

For example, imagine you’re a developer building an app to help lawyers summarize legal briefs. Legal writing makes use of very specific terms like mens rea and res judicata. These words are rarely used outside of the legal world, which means that they are unlikely to have appeared widely in the training text of existing LLMs. As a result, the models may have difficulty understanding these terms or using them correctly.

Pre-training a model from scratch can result in better models for highly specialised domains like law, medicine, finance or science.

Example: BloombergGPT is a model trained for the financial domain.

With this understanding, we can train an LLM from scratch or use an existing LLM for our application, with a clear view of how to collect data, which LLM architecture to choose, and how to solve the training challenges.

Written by Harini Narasimhan

Project Engineer at IITK | Freelance Data Scientist | Computer Vision | Image processing | AI for social good