“What is not measured, cannot be improved.” This quote has become a guiding principle for teams training foundation models. When you’re dealing with complex, large-scale AI systems, things can spiral quickly without the right oversight. Operating at hyperscale poses significant challenges for teams, from the sheer volume of data generated to unpredictable hardware failures and the need for efficient resource management. These issues require strategic solutions, which is why monitoring isn’t just a nice-to-have: it’s the backbone of transparency, reproducibility, and efficiency. During my talk at NeurIPS, I broke down five key lessons learned from teams training and monitoring models at this scale. Let’s get into it.
Real-time monitoring prevents costly failures
Imagine this: you’re training a large language model on thousands of GPUs at a cost of hundreds of thousands of dollars per day. Now imagine discovering, hours into training, that your model is diverging or that hardware issues are degrading performance. The financial and operational implications are staggering. This is why live monitoring, the ability to see what is happening and act on it immediately, is so critical.
Live monitoring allows teams to see experiment progress as it happens, rather than waiting for checkpoints or the end of a run. This real-time visibility is a game-changer for identifying and fixing problems on the fly. In addition, automated processes allow you to set up monitoring workflows once and reuse them for similar experiments. This streamlines comparing and analyzing results and debugging issues, saving time and effort.
However, achieving true live monitoring is far from simple. Hyperscale training generates an overwhelming volume of data, often reaching up to a million data points per second. Traditional monitoring tools struggle under such loads, creating bottlenecks that can delay corrective action. Some teams try to cope by batching or sampling metrics, but these approaches sacrifice real-time visibility and add complexity to the code.
The solution lies in systems that can handle high-throughput data ingestion while providing accurate, real-time insights. Tools like neptune.ai make this possible by providing dashboards that visualize metrics without delaying training. For example, live tracking of GPU utilization or memory usage can reveal early signs of bottlenecks or out-of-memory errors, allowing engineers to proactively adjust course (see the sketch after the testimonials below). Here is what some of our users shared:
One thing we’re always keeping track of is what the utilization is and how to improve it. Sometimes, we’ll get, for example, out-of-memory errors, and then seeing how the memory increases over time in the experiment is really helpful for debugging as well.
James Tu
Research Scientist, Waabi
For some of the pipelines, Neptune was helpful for us to see the utilization of the GPUs. The utilization graphs in the dashboard are a perfect proxy for finding some bottlenecks in the performance, especially if we are running many pipelines.
Wojtek Rosiński
CTO, ReSpo.Vision
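To make the GPU-utilization tracking mentioned above concrete, here is a minimal sketch that samples device metrics with NVIDIA’s pynvml bindings and streams them to the experiment tracker. The neptune calls and the monitoring/ namespace are illustrative assumptions; adapt them to your own logging setup and sampling budget.

```python
import time

import pynvml
import neptune  # assumption: the neptune client is installed and configured via environment variables

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; in practice, loop over all devices

run = neptune.init_run()

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

        # Stream both metrics so rising memory or dropping utilization is visible live
        run["monitoring/gpu_utilization_pct"].append(util.gpu)
        run["monitoring/gpu_memory_used_gb"].append(mem.used / 1e9)

        time.sleep(5)  # sampling interval; tune to your ingestion budget
finally:
    run.stop()
    pynvml.nvmlShutdown()
```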

Troubleshooting hardware failures is challenging: simplify it with automated error categorization
Distributed systems are prone to failure, and hardware failures are notoriously difficult to troubleshoot. A single hardware failure can cascade into widespread outages, often with cryptic error messages. Teams often waste time sifting through stack traces, trying to distinguish between infrastructure problems and code bugs.
At Cruise, engineers used frameworks like Ray and Lightning to improve error reporting. By automatically labeling errors as either “infra” or “user” issues and correlating stack traces across nodes, debugging became much faster.
Igor Tsvetkov
Former Senior Staff Software Engineer, Cruise
By automating error categorization and correlation, AI teams can significantly reduce debugging time in hyperscale environments, just as Cruise has done. How? By using classification strategies to identify whether failures originate from hardware constraints (e.g., GPU memory leaks, network latency) or software bugs (e.g., faulty model architectures, misconfigured hyperparameters), as in the sketch below.
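As an illustration of such a classification strategy, the sketch below sorts failures into “infra” and “user” buckets based on simple message patterns and groups affected nodes. The categories and patterns are hypothetical; a production setup would be driven by the error taxonomy of your own stack (Ray, Lightning, the cluster scheduler, and so on).

```python
import re

# Hypothetical failure signatures; extend with patterns seen in your own clusters
INFRA_PATTERNS = [
    r"CUDA out of memory",
    r"NCCL (error|timeout)",
    r"ECC error",
    r"connection (reset|refused)",
]


def categorize_error(stack_trace: str) -> str:
    """Label a failure as an 'infra' issue (hardware/cluster) or a 'user' issue (code/config)."""
    for pattern in INFRA_PATTERNS:
        if re.search(pattern, stack_trace, flags=re.IGNORECASE):
            return "infra"
    return "user"


def correlate_failures(stack_traces_by_node: dict[str, str]) -> dict[str, list[str]]:
    """Group nodes by error category so a cluster-wide infra outage is visible at a glance."""
    grouped: dict[str, list[str]] = {"infra": [], "user": []}
    for node, trace in stack_traces_by_node.items():
        grouped[categorize_error(trace)].append(node)
    return grouped
```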
Intuitive experiment tracking optimizes resource utilization
Another relevant aspect of hyperscale monitoring is optimizing resource utilization, particularly in scenarios where hardware failures and training interruptions can set teams back significantly. Picture a training job that suddenly deviates: loss metrics spike, and you’re left deciding whether to let it run or terminate it. Advanced experiment trackers allow for remote experiment termination, eliminating the need for teams to manually access cloud logs or servers.
Use checkpoints at frequent intervals so you do not have to restart from scratch but can warm-start from the previous checkpoint. Most mature training frameworks already offer automated checkpointing and warm-starts from previous checkpoints. But most of these, by default, save the checkpoints on the same machine. This doesn’t help if your hardware crashes or, for example, you are using spot instances and they are reassigned.
For maximum resilience and to prevent losing data if hardware crashes, checkpoints should be linked to your experiment tracker. This does not mean uploading gigabytes’ worth of checkpoints to the tracker (although you can, and some of our customers, especially self-hosted ones, do this for security reasons), but rather storing pointers to the remote location, like S3, where the checkpoints have been saved. This enables you to link each checkpoint with the corresponding experiment step and efficiently retrieve the relevant checkpoint at any given step.
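A minimal sketch of this pattern, assuming checkpoints are uploaded to S3 with boto3 and that only the resulting URI is logged to the tracker (the neptune calls and the checkpoints/ namespace are illustrative):

```python
import boto3
import torch
import neptune  # assumption: the neptune client is configured via environment variables

s3 = boto3.client("s3")
run = neptune.init_run()


def save_checkpoint(model, optimizer, step: int, bucket: str, prefix: str) -> None:
    """Write the checkpoint to S3 and log only a pointer to it in the experiment tracker."""
    local_path = f"/tmp/checkpoint_{step}.pt"
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        local_path,
    )

    key = f"{prefix}/checkpoint_{step}.pt"
    s3.upload_file(local_path, bucket, key)

    # Pointer only: the tracker stores the S3 URI, not the multi-GB checkpoint itself
    run[f"checkpoints/step_{step}/s3_uri"] = f"s3://{bucket}/{key}"
```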

However, there are two caveats to successfully restarting an experiment from a checkpoint: the experimentation environment must be constant, or at least reproducible, and deterministic issues like out-of-memory errors (OOMs) or bottlenecks may require parameter changes to avoid repeating the failure. This is where forking can play a significant role in improving recovery and progress.
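Before getting to forking, here is a minimal sketch of the second caveat: warm-starting from the last checkpoint while adjusting the parameter that caused a deterministic failure, in this hypothetical case a reduced batch size after an OOM.

```python
import torch
from torch.utils.data import DataLoader


def warm_start(model, optimizer, dataset, checkpoint_path: str, new_batch_size: int):
    """Resume from the latest checkpoint, but with a smaller batch size so the OOM does not recur."""
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

    # The failure was deterministic: restarting with the same config would just OOM again,
    # so the dataloader is rebuilt with the adjusted batch size before resuming.
    loader = DataLoader(dataset, batch_size=new_batch_size, shuffle=True)
    return model, optimizer, loader, start_step
```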
Track months-long model training with more confidence. Use neptune.ai’s forking feature to iterate faster and optimize the usage of GPU resources.
With Neptune, users can visualize forked training out of the box. This means you can:
- Test multiple configs at the same time, stop the runs that don’t improve accuracy, and continue from the most accurate last step.
- Restart failed training sessions from any previous step. The training history is inherited, and the entire experiment is visible on a single chart.
In addition, checkpointing strategies are critical for optimizing recovery. Frequent checkpointing minimizes lost progress, allowing you to warm-start from the most recent state instead of starting from scratch. However, checkpointing can be resource-intensive in terms of storage and time, so you need to strike a balance between frequency and overhead.
For large-scale models, the overhead of writing and reading weights to persistent storage can significantly reduce training efficiency. Innovations like redundant in-memory copies, as demonstrated by Google’s Gemini models, enable rapid recovery and improved training goodput (defined by Google as the time spent computing useful new steps over the elapsed time of the training job), increasing resilience and efficiency.
Features like PyTorch Distributed’s asynchronous checkpointing can significantly reduce checkpointing times, making frequent checkpointing more viable without compromising training performance.
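A minimal sketch of what this looks like, assuming a recent PyTorch release where torch.distributed.checkpoint.async_save is available (the API has been evolving, so verify the exact call against your PyTorch version):

```python
import torch
import torch.distributed.checkpoint as dcp

# Assumes the process group has already been initialized (e.g., launched via torchrun)
model = torch.nn.Linear(1024, 1024)  # stand-in for your (sharded) model
state_dict = {"model": model.state_dict()}

# async_save returns a future and lets the training loop continue while the write completes
checkpoint_future = dcp.async_save(state_dict, checkpoint_id="/shared/ckpts/step_1000")

# ... run the next training steps here ...

# Before triggering the next checkpoint, make sure the previous write has finished
checkpoint_future.result()
```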
Beyond models, checkpointing the state of dataloaders remains a challenge due to distributed states across nodes. While some organizations like Meta have developed in-house solutions, general frameworks have yet to fully address this issue. Incorporating dataloader checkpointing can further enhance resilience by preserving the exact training state during recovery.
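As an illustration of what dataloader checkpointing can look like, the sketch below uses torchdata’s StatefulDataLoader, which exposes state_dict()/load_state_dict() hooks. Whether it covers your pipeline, and how it behaves across many nodes, needs to be verified for your setup.

```python
import torch
from torch.utils.data import TensorDataset
from torchdata.stateful_dataloader import StatefulDataLoader

train_dataset = TensorDataset(torch.randn(1000, 16))  # stand-in for the real dataset
loader = StatefulDataLoader(train_dataset, batch_size=32, num_workers=4)

# Save the dataloader position alongside the model checkpoint
dataloader_state = loader.state_dict()

# On recovery, rebuild the loader and restore its state so training resumes
# from the same position in the data stream
restored_loader = StatefulDataLoader(train_dataset, batch_size=32, num_workers=4)
restored_loader.load_state_dict(dataloader_state)
```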
Reproducibility and transparency are non-negotiable
Reproducibility is the bedrock of reliable research, but it’s notoriously difficult at scale. Ensuring reproducibility requires consistent tracking of environment details, datasets, configurations, and results. This is where Neptune’s approach excels, linking every experiment’s lineage—from parent runs to dataset versions—in an accessible dashboard.
This transparency not only aids validation but also accelerates troubleshooting. Consider ReSpo.Vision’s challenges in managing and comparing results across pipelines. By implementing organized tracking systems, they gained visibility into pipeline dependencies and experiment parameters, streamlining their workflow.
A single source of truth simplifies data visualization and management at scale
Managing and visualizing data at scale is a common challenge, amplified in the context of large-scale experimentation. While tools like MLflow or TensorBoard are sufficient for smaller projects with 10–20 experiments, they quickly fall short when handling hundreds or even thousands of experiments. At this scale, organizing and comparing results becomes a logistical hurdle, and relying on tools that cannot effectively visualize or manage this volume leads to inefficiencies and missed insights.
A solution lies in adopting a single source of truth for all experiment metadata, encompassing everything from input data and training metrics to checkpoints and outputs. Neptune’s dashboards address this challenge by providing a highly customizable and centralized platform for experiment tracking. These dashboards enable real-time visualization of key metrics, which can be tailored to include “custom metrics”: metrics that are not explicitly logged at the code level but are calculated retrospectively within the tool. For instance, if a business requirement shifts from using precision and recall to the F1 score as a performance indicator, custom metrics allow you to calculate and visualize the new metric across existing and future experiments without rerunning them, ensuring flexibility and minimizing duplicated effort.
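For the precision/recall example, the derived metric is just the harmonic mean, so it can be computed from values that were already logged without touching the training code. A minimal sketch:

```python
def f1_from_logged_metrics(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, computed retrospectively from logged values."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: an existing experiment logged precision=0.82 and recall=0.74
print(f1_from_logged_metrics(0.82, 0.74))  # ≈ 0.778
```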
Consider the challenges faced by Waabi and ReSpo.Vision. Waabi’s teams, running large-scale ML experiments, needed a way to organize and share their experiment data efficiently. Similarly, ReSpo.Vision required an intuitive system to visualize multiple metrics in a standardized format that any team member—technical or non-technical—could easily access and interpret. Neptune’s dashboards provided the solution, allowing these teams to streamline their workflows by offering visibility into all relevant experiment data, reducing overhead, and enabling collaboration across stakeholders.
I like those dashboards because we need several metrics, so you code the dashboard once, have those styles, and easily see it on one screen. Then, any other person can view the same thing, so that’s pretty nice.
Łukasz Grad
Chief Data Scientist, ReSpo.Vision
The benefits of such an approach extend beyond visualization. Logging only essential data and calculating derived metrics within the tool reduces latency and streamlines the experimental process. This capability empowers teams to focus on actionable insights, enabling scalable and efficient experiment tracking, even for projects involving tens of thousands of models and subproblems.
Visualizing large datasets
We generally do not think of dataset visualization as part of experiment monitoring. However, preparing the dataset for model training is an experiment in itself, and while it may be an upstream experiment that is not part of the same pipeline as the actual model training, data management and visualization are critical to LLMOps.
Large-scale experiments often involve processing billions of data points or embeddings. Visualizing such data to uncover relationships and debug issues is a common hurdle. Tools like Deepscatter and Jupyter Scatter have made progress in scaling visualizations for massive datasets, offering researchers valuable insights into their data distribution and embedding structures.
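As a small example of the kind of tooling involved, here is a hedged sketch that plots a 2D projection of embeddings with jupyter-scatter; the jscatter.plot call reflects my understanding of the library’s interface, so check its documentation for the current API.

```python
import numpy as np
import jscatter  # jupyter-scatter; intended to run inside a Jupyter notebook

# Stand-in for a 2D projection (e.g., UMAP or PCA) of millions of embeddings
points = np.random.rand(1_000_000, 2)

# jupyter-scatter renders large point clouds interactively via WebGL
jscatter.plot(x=points[:, 0], y=points[:, 1])
```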
Moving forward
The path to efficient hyperscale training lies in combining robust monitoring, advanced debugging tools, and comprehensive experiment tracking. Solutions like Neptune Scale are designed to address these challenges, offering the scalability, precision, and transparency researchers need.
How about being one of the first to access Neptune Scale?
Neptune Scale is our upcoming product release built for teams that train foundation models. It offers enhanced scalability and exciting new features. You can join our beta program to get early access to Neptune Scale.
If you’re interested in learning more, visit our blog or join the MLOps community to explore case studies and actionable strategies for large-scale AI experimentation.
Acknowledgments
I would like to express my gratitude to Prince Canuma, Dr. Shantipriya Parida, and Igor Tsvetkov for their valuable time and insightful discussions on this topic. Their contributions and perspectives were instrumental in shaping this talk.