March 28, 2025


State of Foundation Model Training Report 2025


Executive summary

  • Lorem ipsum
  • lorem ipsum
  • lorem ipsum

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Introduction

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Research methodology

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.



Spotlight

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries

Lorem ipsum – Lorem ipsum

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries

Lorem ipsum – Lorem ipsum

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.



Key Term

Lorem Ipsum

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.



Key Term

  • Is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
  • Is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
  • Is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
  • Is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

How to navigate the report

Lorem Is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Lorem Is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Lorem Ipsum

Lorem Is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.


Current state of foundation model training

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s.



Spotlight

In 2022, we made the decision to invest in LLMs. This was pretty speculative at the time. We didn’t necessarily expect products to come out of this, but we wanted to have the capability.
Later that year, with the release of ChatGPT, the question was no longer whether it was going to work but how quickly we could get there. So we doubled down on our LLM efforts, and since late 2022, most of our products have run on top of LLMs rather than the much more complicated architectures we used prior.

Stefan Mesken, VP of Research at DeepL

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s.

Why are companies training foundation models?

Regardless of their size, industry, or location, the companies we interviewed for this report have one thing in common. They work in a specific domain, cater to a niche in their market, or provide solutions for a particular task.

While this is hardly unusual for companies, it sets them apart from the companies at the center of the public discourse around foundation model training. Industry high-flyers like Anthropic, Mistral, and OpenAI and established tech giants like Google and Meta develop foundation models for extremely generic tasks. It is likely that the mass market addressed by these models will remain reserved for a few key players—just like the consumer and business computer chip market, which is dominated by a handful of global champions.

While generic foundation models are expected to become a commodity, they are not, and will not be, suitable for specialist applications. This is either because they outright lack the necessary capabilities, fail to reach the desired performance level, or do not meet business or regulatory constraints.

Against this backdrop, a company’s decision to train its own foundation models follows naturally. If a generic foundation model could solve the problem at the core of their business, they would not (or no longer) have a business case in the first place. However, if foundation models appear to hold promise, training them becomes an integral part of maintaining an edge.

In the following, we will discuss the driving factors in the decision-making process and also shed light on the reasons that ultimately make companies decide against foundation model training.

Data privacy, regulatory requirements, license restrictions

Many companies process data that is subject to strict privacy and governance requirements. Whether this is due to regulations, customer demands, or because the data constitutes a trade secret, it’s paramount for these companies to own the entire data pipeline.

When a company controls everything from the raw data to the downstream application, it can provide provenance information and accurately explain the data processing steps to customers, end users, and regulatory authorities. This becomes even more important if faced with different demands in different jurisdictions across the world or from different customers.

Using foundation models available under open licenses seems attractive because it offloads the personnel-intensive and resource-hungry pre-training to a third party. Further, companies can try a range of fully-trained models with different architectures and sizes at little cost. However, not knowing in any appreciable detail what data and processes were used to train a model quickly becomes an untenable liability.

Finally, despite being labeled “open,” foundation models’ licenses might prohibit commercial use at scale or for specific purposes, as is prominently the case for Meta’s Llama model family. This is similar to the well-known problems with freely available software distributed under licenses containing non-compete clauses or subject to export restrictions.

Building and maintaining competency

For companies that find themselves with foundation models at the core of their business, building and maintaining competency becomes a key strategic concern. Even if relying on third-party API products or fine-tuning open foundation models could solve their needs in the short term, companies decide to invest in foundation model development to gain expertise in this key technology, believing that it will prove impossible to catch up in the future.

Only limited information about foundation model development is shared publicly. While the basic architectural principles and training techniques are widely known, little about specific setups and processes leaves a company’s boundaries.

If it does, this information is often outdated already, with teams only sharing what they no longer consider a trade secret promising a competitive advantage. Further, academic publications and company blogs accompanying model releases are not looking to provide guidance to developers but are predominantly created to signal innovation and novelty. Often, their focus is on highlighting newly introduced features and comparing benchmark performances rather than describing the essential groundwork.

The companies interviewed for this report share the belief that competency in foundation models can only be acquired through first-hand experience and experiments. Training foundation models is a rare capability today, and this will likely remain the case going forward. Already, there are big skill and knowledge gaps between teams, which many of the interviewed companies expect only to grow as foundation model technology evolves.

This argument also applies at a macroeconomic and international scale: Capabilities in security- or business-critical technologies like foundation models should not exclusively rest in the hands of just a few (potentially adversarial) actors.



Spotlight

It might be that in three years, we’ll have to conclude that we’re not going to beat the performance of GPT-8 or whichever models will be state-of-the-art by then. It’s really important for the team and all stakeholders to understand that this is a possible outcome, even though the probabilities are hopefully low.
But we’ll have learned a lot, and I strongly believe that the expertise we’ll have acquired will be worth it, even if the model or the service itself does not pay off. Otherwise, it’s really hard to justify this effort internally.

Keunwoo Choi, Senior Principal ML Scientist at Genentech

The wide availability of state-of-the-art foundation models through APIs can make investing in foundation model training seem less attractive. Their implementation and performance are the current benchmarks in the eyes of users, setting expectations that can be challenging for internal teams to match out of the gate.

Further, API pricing often substantially undercuts the actual cost of providing the service. In light of stalling progress in performance through scaling model size (see “Trends—Scaling”), concerns about exhausting the readily available training data, and the lack of a clear path to profitable products, it is unlikely that foundation model providers will be able to continue underselling access to their APIs. Thus, training and deploying foundation models will become more attractive from a cost perspective.

Why are companies deciding against training foundation models?

While, as documented in this report, training foundation models is an attractive strategy for a significant number of companies, many more decide against it.

Notably, this decision appears to be independent of the teams’ or companies’ size. Several of the companies featured in this report are fairly small but are nevertheless fully committed to a foundation model strategy. Then again, we interviewed AI specialists at international enterprises who had decided against it.

Perhaps surprisingly, even among companies that are currently investing in foundation model capabilities, there is a sentiment that training custom foundation models is usually not necessary. When asked to make recommendations to others considering going down this path, they instead suggest adapting openly available foundation models.

However, the majority of companies interviewed for this report do train their own foundation models, and—as we have already seen in this section—for good reason. What might explain this apparent contradiction is that the companies have, through their effort, closed the specific gap in the foundation model landscape they targeted initially.

As outlined above, the decision to train foundation models hinges on a strong business case. If—based on the information available today—the expected benefits justify the investment, training foundation models is a viable option.

What kinds of foundation models are companies training?

Foundation models play very different roles in companies’ products and services. This, in turn, informs what kinds of models they train—the business objectives directly drive modeling decisions.

“Foundation model” is a broad term without a clear definition. While, per its original definition (see “Introduction—A note on terminology”), it describes large, pre-trained, generic models from which task-specific models are derived, it is also used as a synonym for Large Language Models (LLMs) or to describe any large-scale transformer-based model. The boundaries between the strict definition and other uses of the term blur in the case of models that can be used directly as well as serve as the foundation of derived models. In this section, we’ll use the term “foundation model” in its broadest possible meaning unless explicitly stated otherwise.

When it comes to the foundation model approach, the companies interviewed for this report fall into one of four broad categories:

Training foundation models for direct application

Companies in this category solve broad and complex tasks like language translation or climate modeling. The models they train must be applicable across many circumstances and/or process vast amounts of diverse data.

Companies following this strategy typically train a single flagship foundation model (or model line). They either use this model in their products or operate it on behalf of customers but do not sell the model itself.

Foundation models in this category are not used to derive models through fine-tuning but are adapted through in-context learning. They are also “foundational” to the company’s business (the core product) and are typically among the largest models.

Training a foundation model to derive fine-tuned models for downstream applications

A typical company in this category uses AI technology to solve specific tasks in a domain, either for internal applications or for their customers. A dedicated team trains a foundation model that is domain-specific in multiple dimensions: the information it contains, the data modalities and structures it can process, and its operational requirements.

Based on this foundation model, the team derives task-specific models through fine-tuning to be used by internal or external customers. The foundation model is not shared, nor is access to it sold.

These models are characterized by being “foundational” for downstream applications, not so much their size or broad applicability. They are typically built for a relatively narrow domain or set of tasks, allowing them to be significantly smaller than the foundation models developed by companies in the first category.

Training smaller, task-specific models

Companies in this category are structurally similar to those in the previous category: They work in a specific domain and have to solve particular tasks. Usually, a dedicated foundation model team caters to internal customers.

Instead of creating one generic foundation model from which task-specific models are derived, this team creates a dedicated model for each task to be solved. Although these models are not intended to be used for deriving downstream models, they are nevertheless referred to as foundation models because they share architecture and training techniques.

Training small task-specific models limits the scope and scale of the development effort in multiple ways. First, the models can be trained rather quickly on modest infrastructure. Second, they require only data for the particular task to be solved. This allows training to commence immediately once this data becomes available, minimizing the time between data collection and model deployment. Third, model evaluation can focus on a single well-defined application scenario. There is no need to balance performance across multiple tasks, and over-optimizing on one task at the expense of others is no concern.



Spotlight

If I have to solve a particular problem, I don’t need an 80-billion-parameter model. I need a small model that can understand the specific input related to that task and solve it. We are not building an end-to-end solution for everything—we are creating a suite of small-scale models that help us achieve our goals.

Ram Singh, Associate Director (ML) at Cleareye.ai

Overall, this strategy is a path to quickly reaching good performance on a particular task. However, this fast return of business value comes at the expense of giving up the benefits of transfer learning (see “Current State—Training data”). Thus, once teams have mastered the development of task-specific models, they often start to combine multiple tasks into one model. This reduces the overall training costs, unlocks performance improvements through transfer learning, and avoids the need to maintain an increasingly large number of models in parallel. Thus, while not foundation models according to the strict definition, small task-specific models are “foundational” to a company’s AI strategy.

Fine-tune open models

In contrast to training foundation models from scratch, fine-tuning them requires far fewer resources. Further, the process is more standardized, and comparatively mature tooling is available. Thus, teams fine-tuning third-party foundation models can focus on curating high-quality datasets and optimizing task performance.
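To illustrate how mature this tooling has become, below is a minimal sketch of parameter-efficient fine-tuning with Hugging Face's `transformers` and `peft` libraries. The model name and hyperparameters are illustrative placeholders, not a setup used by any company featured in this report.

```python
# Minimal LoRA fine-tuning setup (sketch): only small low-rank adapter
# matrices are trained while the base model's weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder model
lora = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```

With only the adapter weights receiving gradients, fine-tuning fits on a fraction of the hardware that pre-training the same model would require.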

This motivates even companies with the budget and skills for training foundation models to restrict their efforts to adapting third-party models. A dedicated team either derives task-specific models similar to the teams in the second category or creates company-internal versions of standard foundation model applications.

While techniques like distillation and LoRA are essential for foundation model teams to master, fine-tuning is a very different process from pre-training foundation models. Thus, a strategy exclusively focused on fine-tuning third-party models does not build foundation model capabilities to the extent that a strategy focused on small, task-specific models, as discussed in the previous section, does. This is even more pronounced when adaptation is exclusively achieved through in-context learning.

The more a team relies on high-level abstractions and tools, the more it risks struggling with adapting to new developments in upstream models. Further, this strategy’s success and long-term viability hinge on the availability of suitable upstream models.

In any case, the fine-tuned models derive from a foundation model whose training data is unknown, which can lead to surprising behavior and generally constitutes a liability.

Addendum: Hybrid approach

Not all companies’ strategies can be sorted neatly into just one of the four categories. A particularly common hybrid approach is using open models where they are available and combining them with custom foundation models to create multi-modal models (see “Trends—Multi-modal LLMs”). This approach strikes a balance between resource demands, skill-building, and delivery timelines in service of the overall business objectives.

What does the hardware and data infrastructure look like?

The scale of the data centers and the vast amounts of energy that frontier labs like OpenAI and Anthropic spend on training their flagship foundation models are widely discussed. Thus, training foundation models might seem out of reach for companies not operating on billion-dollar budgets. However, it can be accomplished on a much smaller scale.

The hardware and energy required to train a foundation model depend on its size and architecture. In turn, the specific hardware constrains size and architecture, with the GPU memory as a key restriction. Further, larger models generally need more training data, leading to longer training times.

Foundation model teams typically solve this chicken-and-egg problem by defining a compute budget beforehand. As a general rule of thumb, about a fifth of this budget can be spent on the final training run, with the remainder needed for experimentation and test runs.
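As a back-of-the-envelope illustration of this rule of thumb (the budget figure below is hypothetical, not taken from any interviewed company):

```python
# Splitting a fixed compute budget per the rough "one fifth for the final run"
# rule of thumb described above. All numbers are hypothetical.
total_budget_gpu_hours = 100_000                  # assumed overall compute budget
final_run = 0.2 * total_budget_gpu_hours          # reserve ~a fifth for the final run
experiments = total_budget_gpu_hours - final_run  # the rest goes to experiments/tests

print(f"final training run: {final_run:,.0f} GPU-hours")  # 20,000
print(f"experimentation:    {experiments:,.0f} GPU-hours")  # 80,000
```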

Compute platform: GPU vs CPU

GPUs are the default choice for foundation model training. They are the core building blocks of today’s high-performance computing (HPC) clusters, as they provide unmatched performance on parallelizable computations.

Nvidia continues to dominate the GPU market and has established a de facto industry standard with its CUDA framework. However, AMD tripled its revenue from equipping data centers between Q2/2023 and Q4/2024 (Statista), with a majority of the world’s top HPC clusters as of November 2024 running on its Instinct GPUs (Top 500). Intel plays a significant role in the data center GPU market as well.

CPUs are cheaper per time slice, have better market availability, and are easier to deploy on-premises. Further, they can be equipped with significantly more memory than GPUs. For smaller foundation models, these benefits can outweigh the drastic loss in computational performance compared to GPUs.

When deciding on a computing platform, companies that are training foundation models aim for an efficiency sweet spot. While GPUs are significantly more expensive in direct and indirect costs, the speedup they bring is nevertheless worth it.

Maintaining and operating foundation model training infrastructure

Except for the smallest models, training foundation models requires multiple GPUs, usually distributed across a cluster. While gaining access to GPUs was difficult and costly over the last couple of years, the companies interviewed for this report found that the availability of GPUs in the cloud and for purchase has significantly improved.

Whether the GPU cluster is set up at a cloud provider or on-premises, teams find that a lot of engineering is required before foundation model training can commence. Compared to CPUs, GPUs are a less mature platform. Foundation model teams frequently report issues with drivers and encounter opaque hardware failures.

The prerequisite expertise in machine-level debugging, networking, and distributed systems is traditionally not found in data science and machine learning teams. Likewise, this knowledge is not common among existing IT and infrastructure teams. Many teams that previously relied on their services find that, when it comes to foundation model training, they need to take on more of the infrastructure work themselves. Some companies even go as far as assembling dedicated teams to handle hardware and infrastructure.

Optimizing hardware utilization

A perpetual challenge in foundation model training is maintaining high hardware utilization and using the available resources efficiently. Keeping the expensive GPUs under constant load often requires engineering a specialized training loop in accordance with the model architecture and the specific cluster setup.

The main bottlenecks are typically the limited size of the GPUs’ memory and the transfer speed between cluster nodes. Teams can reduce the required memory and data transfer by re-computing intermediate results locally instead of storing them, a technique commonly known as activation (or gradient) checkpointing.
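A minimal PyTorch sketch of this recomputation trade-off, assuming a model that is a plain stack of blocks:

```python
# Activation checkpointing (sketch): instead of storing every intermediate
# activation for the backward pass, recompute them block by block, trading
# extra compute for a much smaller memory footprint.
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are discarded after the forward pass
            # and recomputed on demand during backpropagation.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```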

Another challenge is loading the training data. While a single LLM training batch is around 60 million tokens (DeepSeek), or about 250 MB, a single high-resolution pathology scan can be as large as the entire ImageNet dataset (2 to 4 GB), even when compressed. Handling data at this scale and feeding it to the GPUs during training is a serious data engineering effort.
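One common pattern is to stream pre-tokenized shards from disk or object storage and overlap loading with GPU compute. Below is a minimal sketch with a PyTorch `IterableDataset`; the file layout, sequence length, and loader settings are illustrative assumptions:

```python
# Streaming pre-tokenized training shards so the GPUs never wait on I/O.
# File naming, sequence length, and loader settings are illustrative.
import glob
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset

class ShardStream(IterableDataset):
    def __init__(self, pattern="shards/*.npy", seq_len=4096):
        self.files, self.seq_len = sorted(glob.glob(pattern)), seq_len

    def __iter__(self):
        # Note: with multiple workers, shards should be partitioned across
        # workers via torch.utils.data.get_worker_info(); omitted for brevity.
        for path in self.files:
            tokens = np.load(path, mmap_mode="r")  # memory-map instead of loading the whole shard
            for i in range(0, len(tokens) - self.seq_len, self.seq_len):
                yield torch.from_numpy(tokens[i : i + self.seq_len].astype(np.int64))

loader = DataLoader(ShardStream(), batch_size=8, num_workers=4,
                    pin_memory=True, prefetch_factor=4)  # overlap loading with compute
```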



Spotlight

We’re building foundation models for large-scale Earth simulations. For us, one training data point is several gigabytes. At this scale, it’s very difficult to load the data fast enough to keep the GPUs utilized. A lot of our engineering efforts go into just that.

Cristian Bodnar, Co-Founder of Silurian AI

However, cost and utilization optimization is not the be-all and end-all. Experienced teams interviewed for this report told us that once they had grown accustomed to their training infrastructure and had resource utilization and cost management in place, they quickly found it prudent not to focus too much on easily quantifiable infrastructure costs and utilization.



Spotlight

It’s tempting to focus on infrastructure optimization because it is easy to put numbers on it. However, there are often more important goals the team should focus on, even if they are harder to measure and quantify. We’ll sometimes have to leave some GPU utilization on the table to make progress.

Keunwoo Choi, Senior Principal ML Scientist at Genentech

Where does the training data come from, and what role does it play?

From a 10,000-foot view, companies and the public are increasingly becoming concerned about data sovereignty. In addition to owning models and the training infrastructure, as well as building expertise, controlling where data comes from and how it is used is a key component of many foundation model strategies.

The shifting role of data in machine learning

Prior to the advent of deep learning, machine-learning models were typically trained on meticulously curated and labeled datasets. A few hundred to a few thousand highly expressive samples were often sufficient. Feature engineering, the practice of identifying high-signal features and compressing the data into lower-dimensional representations, played a major role.

With the advent of deep learning and increased compute and memory capacity, datasets became significantly larger. ImageNet, a widely used, internet-curated dataset, contains more than 14 million labeled images in total; its commonly used ILSVRC subset alone comprises about 1.3 million images across 1,000 classes.

Foundation models have brought yet another shift. The datasets are orders of magnitude bigger, the individual samples are larger, and the data is less clean. The effort that was previously spent on selecting and compressing samples is now devoted to curating vast datasets. In addition, data engineering has become crucial not only in data curation but also throughout the training process.

Beyond being able to process larger and more diverse datasets, foundation models exhibit strong transfer learning abilities, i.e., learning to solve a task by first training on data that contains no examples of it and then training on only a few high-quality task-specific samples. In-context learning, where task-specific examples or instructions are only provided at inference time, is the pinnacle of this capability.
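As a concrete illustration of in-context learning: the task specification and examples live entirely in the prompt, and no model weights are updated. The toy task below is generic, not one from the interviewed companies.

```python
# Few-shot in-context learning (sketch): task examples are provided in the
# prompt at inference time; the model's weights are never changed.
examples = [
    ("The hotel was spotless and the staff friendly.", "positive"),
    ("Check-in took two hours and the room was cold.", "negative"),
]
query = "Breakfast was great, but the WiFi never worked."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"  # the model continues with a label

print(prompt)  # this string would be sent to any instruction-following LLM
```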



Key Term

Zero-shot and few-shot learning

A model performs zero-shot learning when it solves a task without having seen any task-specific examples, and few-shot learning when it is given only a handful, often directly in the prompt.

Transfer learning capabilities influence data acquisition for foundation model training. For example, training on monolingual data can improve the ability of foundation models in language translation. An LLM can learn to process a new language through fine-tuning on a monolingual dataset, retaining its ability to solve tasks like summarization or question answering originally acquired in a different language.

Making new data sources accessible to machine learning

The inherent ability of foundation models to process large amounts of data without preprocessing, to tolerate missing or contradictory information, and to handle different modalities opens up ways to utilize previously untapped data sources.



Spotlight

Most of today’s climate and weather models rely on nicely structured input data coming from weather agencies. They put a three-dimensional grid over the planet, and for each cell, you get a data point comprising well-defined quantities.
But there’s a ton of data out there that’s a lot uglier and far less processed. There’s data from weather stations randomly spread across the globe, from radar, from satellites. It’s all very heterogeneous—some data is dense, some data is sparse, and the shapes and formats are different. It’s challenging to work with, but it’s literally lying around for people to use.
We’re now at a point where we can build models that are able to absorb this kind of information. I believe that’s the next frontier: utilizing any information you have, no matter if it’s a dataset meticulously prepared by a government agency or signals you scrape directly off a sensor.

Cristian Bodnar, Co-Founder of Silurian AI

Examples of publicly accessible sources include large web crawl datasets, source code repositories on GitHub, or digitized libraries. The practice of scraping data from the internet and utilizing it with coarse-grained filtering, as practiced by the leading LLM providers, raises concerns about low-quality and harmful information fed into LLMs, as well as about data privacy, ownership, and copyright.

Many organizations have equivalent data repositories (e.g., data lakes). Data suitable for foundation model training can also be purchased.

Data qualities in foundation model training

Curating high-signal data remains a top priority for foundation model teams. Training on low-signal data, at best, makes training slow (and, in turn, costly). Usually, it is detrimental to downstream performance. Data quality is particularly important toward the end of a training run. An established practice is to train with carefully vetted high-quality data in the final iterations.

In light of the vast amounts of data required for foundation model training, it is unrealistic to inspect every data sample or even a significant fraction of them. Instead, teams rely on automated curation: heuristic filters, deduplication, and model-based quality classifiers, as popularized by open dataset efforts such as FineWeb.
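A minimal sketch of the model-based filtering step: score every document with a cheap quality model and keep only high-scoring ones. The `quality_score` function below is a hypothetical stand-in for whatever classifier a team trains for this purpose.

```python
# Model-based quality filtering (sketch): score each document with a cheap
# classifier and keep only those above a threshold. `quality_score` is a
# hypothetical placeholder for a trained quality model.
def quality_score(document: str) -> float:
    # Toy heuristic standing in for a real classifier: reward longer documents.
    return min(len(document.split()) / 500, 1.0)

def filter_corpus(documents, threshold=0.5):
    return [doc for doc in documents if quality_score(doc) >= threshold]

corpus = ["short spam fragment", "a long, informative article " * 100]
print(len(filter_corpus(corpus)))  # -> 1: only the long document survives
```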

The role of human annotators in foundation model training

With the change in data sources used, the role of domain experts in the model training process evolved as well. Traditionally, they were involved in curating and annotating data ahead of training. In foundation model training, their core responsibility is to evaluate the models’ performance on downstream tasks.

Still, many foundation model efforts rely on human labelers to create and prepare training datasets. The leading labs employ human annotation at a massive scale (see, e.g., TIME’s reporting on OpenAI’s annotation workforce: https://time.com/6247678/openai-chatgpt-kenya-workers/), and data annotation has grown into a billion-dollar market, with 2023/2024 estimates and projections ranging from just under 1 billion USD to more than 10 billion USD.

Domain expertise remains essential as well. Scanned documents, for example, can be useless if there are no experts to analyze and help preprocess them; it takes domain knowledge to turn raw data into information a model can learn from. This is especially true where the data does not directly contain the answer the model is supposed to give, as in language translation and climate modeling.

Reinforcement Learning from Human Feedback (RLHF)—credited for the breakthrough success of ChatGPT—relies not only on data created by labelers explicitly hired for this purpose but also on feedback collected from users of foundation model applications.
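At the core of RLHF is a reward model trained on human preference pairs. Below is a minimal sketch of the standard pairwise preference loss, assuming a `reward_model` that maps an encoded response to a scalar score (the model itself is not specified here):

```python
# Pairwise preference loss (sketch) used to train RLHF reward models: push the
# score of the human-preferred response above the rejected one.
# `reward_model` is an assumed scalar-output network, not a library API.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    r_chosen = reward_model(chosen)      # scalar score for the preferred response
    r_rejected = reward_model(rejected)  # scalar score for the rejected response
    # -log sigmoid(r_chosen - r_rejected) is minimized when the preferred
    # response consistently outscores the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```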



Key Term

Reinforcement Learning from Human Feedback (RLHF)

A training stage in which a model is optimized against a reward signal learned from human preference judgments: a reward model is trained on pairs of ranked responses, and the base model is then fine-tuned to maximize that learned reward.

Synthetic data

A machine-learning model requires a certain number of data samples to learn a concept or relationship. Thus, as discussed above, the relevant quantity is not the number or size of the data samples per se but the number of pertinent samples a dataset contains.

This becomes a problem for signals that rarely occur and thus are rare in collected data. In order to include a sufficient number of data samples that contain the signal, the dataset has to become very large, even though the majority of additionally collected data samples are redundant.

Oversampling rare signals risks overfitting on the individual samples rather than learning robust representations of the signal. A more useful approach is to artificially create data samples that contain the rare signal. Synthetic data has a long history in machine learning; it is also used for privacy protection and for training on alternative scenarios.

The companies interviewed for this report that utilize synthetic data treat its generation as an inherent part of their foundation model efforts. They develop their own approaches, building on established methods and recent progress in the field.

Having a human in the loop is often crucial to ensuring that synthetic samples resemble the desired real-world signal. Otherwise, models optimize for irrelevant artificial information, which might even negatively influence their performance on the downstream task.



Spotlight

In our medical image data, a lot of crucial features are extremely rare. Even if we train a model on a very large dataset, the number of samples a feature appears in is too small for the model to learn it sufficiently. Thus, we synthesize images that contain such crucial features, but in a way that we, as developers, are not the ones prescribing what exactly those features are. Then, we validate these images with experts.

Robert Berke, CTO and Co-Founder at Kaiko AI

How are foundation model teams organized?

While the team setup varies, what all foundation model teams surveyed for this report have in common is that they are multi-disciplinary. Traditionally, machine-learning teams consisted of data scientists and ML engineers who handled deployment and operations.

This is insufficient for successful foundation model projects. Implementing the model architecture, preprocessing a dataset, and maintaining training pipelines are not enough. As one interviewee for this report put it, “You need someone to do the maths,” but also people who can handle distributed data processing and training on large-scale infrastructure.

Software engineering is crucially important at the scale of foundation model training and can no longer be neglected; where poor engineering could previously be compensated for with additional resources, it now runs through many of the topics we discuss in this report. Some companies training foundation models have dedicated infrastructure teams, but many embed infrastructure and SRE engineers within their ML teams as the boundaries between these roles blur.

Hiring for foundation model teams

Because of the depth and range of skills that have to come together to train a foundation model, assembling and paying for a team can be a bigger hurdle than infrastructure costs or data availability. A handful of people can be enough if they bring the right skills; according to our survey, the number of people involved in creating a foundation model typically ranges from five to 500.

Since the required skills are both broad and specific, hiring for foundation model training teams is a challenge. Few people have substantial experience yet: many have used LLMs or experimented with fine-tuning Llama and the like, but very few have pre-trained a foundation model. Teams therefore typically have to rely on top performers who can get up to speed quickly. For smaller, less-resourced companies, the high salaries paid by frontier labs are another barrier to attracting top talent.



Spotlight

You need to make sure to take hiring very seriously. There’s a seemingly endless influx of applications for research positions at DeepL. So it’s a lot more about finding the right people out of all the ones that apply and then making sure that you help them grow and develop. It’s a hard-fought battle, but we’ve done really, really well.

Stefan Mesken, VP of Research at DeepL

Lorem ipsum

Scaling

Lorem ipsum

Multi-modal LLMs

Lorem ipsum

Foundation models beyond LLMs/language

Lorem ipsum

Improve training efficiency through data and software engineering

Lorem ipsum

Best practices for foundation model training

Lorem ipsum

Summary and outlook

Lorem ipsum

Lorem ipsum

Appendices


