The Coming Plateau of AI Model Scaling: An Opportunity for AI Innovation


Jordan Zeiger, Tensility Intern and Undergraduate Student at Cornell University College of Engineering

Armando Pauker and Wayne Boulais, Managing Directors at Tensility Venture Partners

Introduction

The amount of human-generated public text available for training large language models (LLMs) could reach its limit as early as 2026, according to "Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data" [1], a paper published in June 2024 by Epoch AI, a research institute investigating key trends and questions that will shape the trajectory and governance of AI. This prompted us to consider the effects of reaching an asymptotic limit on new training data for LLMs. As this limit approaches, will AI enter a new era? What will happen when model companies can no longer rely on ever-larger training datasets to achieve better performance? How will the industry and the broader AI landscape evolve? In this blog post, we explore the implications of a future where the growth of LLM training datasets plateaus, viewing that limit not as the end of AI's progress but as an opening for new business opportunities.

The Epoch AI Observation

Epoch AI highlights the rapid growth in the size of the datasets used to train AI models in recent years, and training dataset size correlates strongly with model performance. For instance, the dataset used to train GPT-4 is roughly four orders of magnitude larger than the one used for GPT-2, while the parameter count grew from 1.5 billion to a reported 1.8 trillion. However, the stock of human-generated data is finite (though large) and growing far more slowly than training datasets have, so this exponential growth in training data cannot continue indefinitely.

Figure 1: “Projection of effective stock of human-generated public text and dataset sizes used to train notable LLMs. Individual dots represent dataset sizes of specific notable models. The dataset size projection is a mixture of an extrapolation of historical trends and a compute-based projection that assumes models are trained compute-optimally.” (Epoch AI)

The Epoch AI projection is illustrated in Figure 1 above, where the blue line represents the projected growth of dataset sizes used to train notable LLMs, while the green line shows the estimated stock of human-generated public text. Note that the y-axis is logarithmic: while the stock of human data grows exponentially (which appears linear on a log scale), dataset sizes used to train notable LLMs have grown even more rapidly. The graph shows error bands rather than one specific date at which training data will be limited to the pace of ordinary human data generation, because the timing depends on several assumptions. One such assumption is how the models are trained. Models can be trained "compute-optimally," which would delay hitting the asymptotic limit. However, it has become increasingly common for models to be overtrained, that is, trained on more data than is compute-optimal, because overtrained models are more efficient during inference. As shown in the graph's purple section, with 5x overtraining the projected limiting date is 2027, and even that may be optimistic, as recent models like Llama 3 have been overtrained by roughly 10x. While the timeframe spans a range of years, this analysis makes it clear to us that the event will happen; it is just a matter of when.

The graph shows that between 2026 and 2032, dataset sizes for training LLMs will asymptotically approach the upper limit of available human-generated data and thereafter grow only as fast as new human data is produced. Once this limit is reached, the lack of new data could create a bottleneck, fundamentally altering the pace of future LLM scaling, in line with established scaling laws for neural language models [2]. The limit may arrive even sooner if models continue to be overtrained, as discussed above.
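
To make the "compute-optimal" versus "overtrained" distinction concrete, the sketch below applies the commonly cited Chinchilla-style rule of thumb of roughly 20 training tokens per parameter. The 20-tokens-per-parameter constant and the example parameter and token counts are simplifying assumptions for illustration, not figures taken from the Epoch AI paper.

```python
# Rough illustration of "compute-optimal" training versus overtraining,
# using the widely cited Chinchilla-style heuristic of ~20 tokens per parameter.
# The constant and the example figures below are illustrative assumptions.

TOKENS_PER_PARAM_OPTIMAL = 20  # rule-of-thumb ratio, not an exact law

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training set size, in tokens."""
    return TOKENS_PER_PARAM_OPTIMAL * n_params

def overtraining_ratio(tokens_used: float, n_params: float) -> float:
    """How many times more data a model saw than the compute-optimal amount."""
    return tokens_used / compute_optimal_tokens(n_params)

# Example: a hypothetical 70B-parameter model trained on 15 trillion tokens
# (in the ballpark reported for Llama-3-70B) comes out roughly 10x overtrained,
# consistent with the figure mentioned above.
n_params = 70e9
tokens_used = 15e12
print(f"Compute-optimal tokens: {compute_optimal_tokens(n_params):.2e}")          # ~1.4e12
print(f"Overtraining ratio:     {overtraining_ratio(tokens_used, n_params):.0f}x")  # ~11x
```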

Deals like the recent partnership between Condé Nast and OpenAI, where AI companies gain access to proprietary content, illustrate how industry leaders are scrambling to secure more training data as they recognize the impending scarcity [3]. However, these measures are merely stopgaps. The fundamental issue remains: the frenetic pace of ever-expanding models is likely nearing its end.

Optimization and Specialization

The successful introduction of ChatGPT set off a furious race to build ever-larger models trained on ever-larger datasets, with impressive results such as the highly capable GPT-4. However, the lack of significantly more training data will cap model performance, causing a significant paradigm shift in the industry. The days of simply building bigger models will be over, potentially democratizing the field. Smaller companies and startups, which previously devoted resources to competing with the immense scale and workforce of heavily funded AI model companies, may find themselves on a more level playing field. Companies that once struggled to keep up with the pace of AI model upgrades can now focus on using current models to build their AI systems and leverage AI for their specific needs, without worrying that a future GPT release, for example, will drive them out of business. The focus will likely shift to refining existing models and introducing niche applications, an ideal opportunity for startups to innovate. Rather than creating ever-larger general-purpose models, companies are beginning to explore how to make existing models more cost-efficient and effective, applying AI in targeted, meaningful ways.
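
One concrete way to refine an existing model is parameter-efficient fine-tuning, where a company adapts an open model to its own domain by training only a small number of added weights. The sketch below assumes the Hugging Face transformers and peft libraries; the base model name and LoRA hyperparameters are illustrative choices, not recommendations from the sources cited here.

```python
# Minimal sketch of parameter-efficient fine-tuning (LoRA) on an existing
# open model, assuming the Hugging Face `transformers` and `peft` libraries.
# The model name and hyperparameters are illustrative, not prescriptive.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "meta-llama/Meta-Llama-3-8B"  # hypothetical choice of base model

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA adds small trainable low-rank matrices to selected attention projections,
# leaving the billions of base weights frozen.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# From here, train with a standard training loop or Trainer on domain-specific data.
```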

This trend is already underway. Meta recently announced that its open-source Llama models have been downloaded almost 350 million times, and that monthly usage of Llama grew roughly tenfold between January and July 2024 [4]. This is a clear indication that companies feel Llama 3 is “good enough” and that developers are comfortable integrating the current generation of LLMs into their businesses, a feat that would have been risky just a few years ago, when AI integrations could quickly become outdated with the release of bigger and better models.

As the rate of improvement in general-purpose LLMs slows, the focus is likely to shift toward specialized language models tailored to specific tasks or industries. These models could be open source, like Llama 3 and others, or cheaper versions of proprietary models. Even Microsoft has moved in this direction with its announcement of the Phi-3 family of Small Language Models (SLMs) [5]. SLMs can offer superior or more cost-effective performance within their niche areas compared to their larger, more generalized counterparts [6].
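
As an illustration of how lightweight such models are to adopt, the sketch below runs a small instruction-tuned model for a narrow task using the Hugging Face transformers library. The checkpoint name corresponds to Microsoft's publicly released Phi-3 mini model; the prompt and generation settings are illustrative assumptions.

```python
# Minimal sketch: running a small language model (SLM) locally for a niche task,
# assuming the Hugging Face `transformers` library and the publicly released
# Phi-3 mini checkpoint. Prompt and generation settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

prompt = "Summarize the key clauses of this supplier contract:\n..."
inputs = tokenizer(prompt, return_tensors="pt")

# A ~3.8B-parameter model of this size can run on a single modest GPU or even a CPU,
# which is the point: "good enough" performance at a fraction of the cost.
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```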

Techniques such as knowledge distillation may become increasingly important in this new era. Distillation, as we have discussed in a previous post [7], allows smaller, more efficient models to learn from larger ones, compressing the knowledge of a massive model into a smaller, more specialized version. This approach could enable the deployment of AI in environments with limited computational resources, such as on edge devices.
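
As a rough illustration of the idea (not the specific method from our earlier post), the PyTorch sketch below shows the classic distillation loss: a small student model is trained against the teacher's softened output distribution as well as the ground-truth labels. The temperature and loss weighting are illustrative placeholders.

```python
# Minimal sketch of knowledge distillation: a small "student" model learns to
# match the softened output distribution of a large "teacher" model.
# Plain PyTorch; the temperature and loss weight are illustrative placeholders.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend the usual cross-entropy loss with a KL term toward the teacher."""
    # Soft targets: a higher temperature spreads probability mass, exposing the
    # teacher's knowledge about relative similarities between classes/tokens.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * (temperature ** 2)

    # Hard-label loss keeps the student anchored to the ground truth.
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1 - alpha) * ce_loss

# Usage inside a training loop (teacher frozen, student trainable):
#   with torch.no_grad():
#       teacher_logits = teacher(batch)
#   student_logits = student(batch)
#   loss = distillation_loss(student_logits, teacher_logits, batch_labels)
#   loss.backward(); optimizer.step()
```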

Hardware

As software evolves, so will hardware. In past computing cycles, hardware was developed first and the software application layer followed; the LLM explosion inverted that order. Companies like Nvidia were the beneficiaries: their existing general-purpose products saw massive success thanks to the insatiable demand for chips to train ever-larger models. Once the rapid growth of LLM scaling slows down, even the hardware companies will have to adjust their innovation strategies.

It is possible we will see a move from general-purpose GPUs to specialized processors and data center hardware optimized solely for running a specific model like Llama 3. Google's TPUs took a step in this direction with silicon optimized for machine learning workloads [8]; the next step could go even further, toward optimization for a single model. Will Nvidia or AMD release these new processors, or will they come from startups? Such specialized chips could challenge existing vendors and become the backbone of future AI applications, enabling faster, more efficient processing and reducing the size and cost of data centers.

Looking at historical trends, a shift from general-purpose computing to specialized applications would make sense. A classic example is the Pentium family of processors in the early 2000s. Intel was the dominant supplier of CPUs when the Pentium family drove performance in early desktops. Once the pace of change slowed and the industry reached the level of “good enough,” the focus shifted toward more specialized versions of these general-purpose chips: as manufacturers moved from single-core to multi-core designs, attention turned from general CPU advancement to power optimization for laptops and performance for data center applications.

Beyond Language Models

As scaling traditional language models reaches diminishing returns, the next big breakthrough in AI may come from new areas, such as models trained on images or video rather than text. Even this only scratches the surface of AI's potential: the next step could involve models trained on spectral data and other unconventional data types that go beyond traditional media.

Conclusion

As the growth of LLM training datasets reaches its natural limit, the AI industry will be forced into a new phase of innovation. Rather than focusing solely on scaling training dataset size, the emphasis will likely shift toward optimization and specialization. Startups and established companies alike will find opportunities in refining current models, developing smaller, more efficient alternatives, and applying AI to niche sectors. This shift will democratize the AI landscape, allowing smaller players to compete by leveraging models that are "good enough" for specific applications, fostering creativity and more targeted innovation.

At the same time, hardware advancements will play a crucial role in this new era. The plateau of model scaling opens the door for the development of AI-optimized chips and infrastructure.

Companies that anticipate these trends, whether through AI chip innovations or niche model applications, will thrive as the industry pivots from growth-driven, generalized competition to opportunities focused on efficiency and creativity. This coming plateau is not a limitation but a springboard for the next phase of AI evolution.

References

1. https://epochai.org/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data

2. https://arxiv.org/pdf/2001.08361

3. https://www.wired.com/story/conde-nast-openai-deal/

4. https://ai.meta.com/blog/llama-usage-doubled-may-through-july-2024/

5. https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/

6. https://www.tensilityvc.com/insights/small-models-big-impact-part-i

7. https://www.tensilityvc.com/insights/slimming-down-ai-models-the-power-of-distillation

8. https://cloud.google.com/tpu
