March 26, 2025


The GPU Cloud ClusterMAX™ Rating System


The ClusterMAX™ Rating System and content within this article were prepared independently by SemiAnalysis. No part of SemiAnalysis’s compensation by our clients was, is, or will be directly or indirectly related to the specific tiering, ratings or comments expressed.

Introduction

The exuberance in the GPU rental market has cooled off. We predicted this in our December 2023 GPU Cloud Economics Report and reiterated this view in our AI Neocloud Anatomy and Playbook Report released in October 2024. Technological improvements mean the cost of compute goes down over time, and we now believe it’s a buyers’ market for GPU rentals, especially for Hopper-class and MI300-class GPUs. There is widespread availability from over 100 AI Neoclouds and Hyperscalers.

Part of this is due to new entrants, which give buyers more options to rent from. Yet until today, there has been no “how-to guide” for renting GPUs and no independent evaluation of GPU clouds.

We have spent the past 12 months creating the GPU Cloud ClusterMAX™ Rating System, or ClusterMAX™ for short. We have independently tested and/or collected customer feedback from as many GPU clouds as possible. We believe this first GPU Cloud Rating covers roughly 90% of the GPU rental market by GPU volume. We hope to include more providers in our next rating exercise so that we can evaluate their quality.

Source: SemiAnalysis

This isn’t an exhaustive list of GPU providers. We have a much more extensive list of players that we are aware of, and the entire market map is shown in the following image. This list appears to be expanding daily, but many of the Neoclouds are not yet ready for customers. That is the point of ClusterMAX™: it is a simple tool to help you navigate this complexity. Your dollars are probably best spent on a ClusterMAX™-rated provider.

Source: SemiAnalysis

Our rating classifications are Platinum, Gold, Silver, Bronze, and UnderPerform. We will explain each rating in more detail later in this report.

In addition, we will discuss the market for H100 rentals and where it is headed, Hyperscaler vs. Neocloud pricing, cluster-level TCO, cluster returns and scenario analysis, various debates around demand, and how this framework applies to CoreWeave and its IPO.

Source: SemiAnalysis

Executive Summary

  1. SemiAnalysis has developed the world’s first GPU Cloud Rating System – we have named this system ClusterMAX™. We rate GPU clouds from the perspective of an average, reasonable customer.
  2. SemiAnalysis has independently tested dozens of GPU clouds, and ClusterMAX™ currently covers approximately 90% of the GPU rental market by GPU volume.
  3. The bar across the GPU cloud industry is currently very low. ClusterMAX™ aims to provide a set of guidelines to help raise the bar across the whole GPU cloud industry. ClusterMAX™ guidelines evaluate features that most GPU renters care about.
  4. ClusterMAX™ has five tiers: Platinum, Gold, Silver, Bronze, and UnderPerform.
  5. We will be conducting regular ClusterMAX™ rating and evaluation exercises every 3-6 months so that GPU clouds can have their improvements reflected and customers can have the latest information.
  6. ClusterMAX™ Platinum represents GPU clouds that are raising the industry bar and there is only one GPU cloud, CoreWeave, that provides services at this tier.
  7. CoreWeave is the only non-hyperscaler currently that is experienced at operating large-scale 10k+ H100 clusters reliably.
  8. Some providers in the ClusterMAX™ Bronze category, such as Google Cloud, are already making a considerable effort to catch up. We believe Google Cloud is on a rocket-ship path toward ClusterMAX™ Gold or ClusterMAX™ Platinum by the next time we re-evaluate them.
  9. Enterprises mainly rent GPUs from Hyperscalers + CoreWeave. Enterprises rarely rent from Emerging Neoclouds.
  10. Hyperscalers’ GPU rental prices are higher than that of Neocloud Giants and Emerging Neoclouds as Hyperscalers mainly serve the enterprise market.
  11. Oracle comes in at one of the lowest GPU Rental price points amongst the Hyperscalers.
  12. Amongst GPU clouds that are highly competent on the technical front, Nebius offers the lowest absolute price and the best terms for short- to medium-term rentals. Crusoe also offers reasonable pricing and contract terms in addition to strong technical competency.
  13. As we first discussed in our article on GPU Cloud Economics published back in December 2023, technological improvements mean the cost of compute goes down over time, and we now believe it’s a buyers’ market for GPU rentals. There are over 100 GPU clouds all competing for mostly the same customers.
  14. The launch of DeepSeek provided a momentary short-term stabilization and even an increase in H200 rental prices, but in the medium to long term, prices are still declining.
  15. Jensen Huang, Chief Revenue Destroyer, said last week, “When Blackwells start shipping in volume, you couldn’t even give Hoppers away.” From the perspective of the GPU operator, this should be a call for GPU rental providers to engage in contracts that protect them from rapid compute price declines – i.e. sign long-term contracts where possible. From the customer’s perspective, they may prefer flexibility in their commitments and opt for shorter-term contracts.
  16. We will talk more about GPU Rental pricing and the IRR of GPUs as well as recent GPU Rental Market Rates for different contract lengths at the end of the article. Scroll down to the end if you are the reader who cares mainly about the finance side of GPUs or how to think about unit economics for the GPU rental business.

The GPU Cloud ClusterMAX™ Rating System

The goal of the ClusterMAX™ rating is to evaluate and benchmark the more than 100 GPU providers. This provides the broader ML community with an understanding of the capabilities, features, advantages, and disadvantages of each GPU provider, better informing customers which GPU cloud(s) can best meet their needs. Our second objective is to provide a set of guidelines to help raise the bar across the whole GPU cloud industry. Currently, the bar is lower than you can imagine.

Source: South Park

We hope ClusterMAX™ will raise that bar. We have stepped into the deep end and done extensive benchmarking and profiling over the last year to see the best and worst practices at GPU providers. We have scaled from single nodes to 1,024-GPU clusters, and today we share our findings with the broader community.

We evaluate features that GPU renters care about, such as:

  • Security
  • LifeCycle and Technical Expertise
  • Slurm and Kubernetes
  • Storage
  • NCCL/RCCL Networking Performance
  • Reliability and Service Level Agreements (SLAs)
  • Automated Active and Passive Health Checks and Monitoring
  • Consumption Models, Price Per Value, and Availability
  • Technical Partnerships

We will dive deeper into the specifics of how we evaluate GPU clouds, and into our guidelines, later in the article.

We will re-evaluate and update our GPU Cloud ClusterMAX™ Tier list every 3-6 months to incorporate new information from the GPU rental market. We want to give GPU providers feedback and the opportunity to improve, and we are always open to and appreciate our dialogues with GPU providers.

We break our ClusterMAX™ rating system into five tiers:

  • ClusterMAX™ Platinum
  • ClusterMAX™ Gold
  • ClusterMAX™ Silver
  • ClusterMAX™ Bronze
  • UnderPerform

Each tier has attributes that justify the rating it receives. Let’s discuss from best to worst.

The ClusterMAX™ Platinum tier represents the best GPU cloud providers in the industry. Providers in this category consistently excel across evaluation criteria, including security, price for value, technical expertise, reliability backed by clearly defined SLAs, seamless managed Slurm/Kubernetes offerings, and best-in-class NCCL/RCCL networking performance. Platinum-tier providers are proactive, innovative, and maintain an active feedback loop with the community to continually raise the bar. That’s why they are the best.

ClusterMAX™ Gold tier providers deliver strong performance across most key evaluation categories, with some opportunities for improvement. They offer solid security, reliable infrastructure, competitive pricing, and competent technical support. Although Gold-tier GPU clouds may have gaps or inconsistencies in features like active health checks, they generally demonstrate responsiveness to feedback and a commitment to improvement. They are positioned as great choices for GPU renters looking to maximize throughput.

Providers rated ClusterMAX™ Silver demonstrate adequate GPU cloud offerings with a satisfactory balance of performance, security, and value, but they typically have noticeable gaps compared to Gold- or Platinum-tier services. These providers meet basic standards for reliability, security, support, and networking performance, but they lack advanced orchestration or have confusing pricing structures. Silver-tier GPU clouds have room for improvement and typically benefit significantly from adopting industry best practices.

The ClusterMAX™ Bronze tier includes GPU cloud providers that fulfill the minimum criteria but consistently exhibit shortcomings in our evaluation areas. Common issues include inconsistent support, subpar networking performance, unclear SLAs, limited integration with popular tools like Kubernetes or Slurm, or less competitive pricing. Providers in this category need considerable improvements to enhance reliability and customer experience. Some providers in this category are already making a considerable effort to catch up – Google Cloud, for example – and we are excited to see their next ClusterMAX™ result in 3-6 months.

GPU providers placed in the UnderPerform category fail to meet basic industry and security requirements across multiple evaluation metrics. Providers in this tier generally exhibit substantial shortcomings such as insecure practices, poor reliability or uptime, unclear or misleading marketing, limited technical knowledge or customer support, and inadequate orchestration capabilities. Most commonly, providers in the UnderPerform tier do not have SOC2 compliance or have security risks that expose traffic between your workload and the internet, which could be logged by networking equipment. The GPU providers that land in the UnderPerform category are often the same companies that spam AI-generated advertisements.

Jensen Huang, Chief Revenue Destroyer

Jensen Huang, Chief Revenue Destroyer, said last week, “When Blackwells start shipping in volume, you couldn’t even give Hoppers away.”

Back in April 2024, our pricing model in the AI Cloud Total Cost of Ownership Model suggested such an outcome. GPU prices declined throughout 2024 due to a ramp up in H100 production, with this decline continuing into late 2024 as buyers pivoted to focus on their Blackwell strategy. One year later, our forecasts have been dead on, with a 2-3% margin of error for H100 SXM.

Source: SemiAnalysis

The forecast model operates with three main inputs:

1. Install base of AI Accelerators globally: We utilize our Accelerator Industry Model to determine the installed base of every GPU SKU shipped thus far and our estimates based on supply chain analysis for future GPU shipments.

2. Total Cost of Ownership for AI Clusters: We calculate the total cost of ownership for an AI Cluster, including capital costs such as the AI server, networking, storage, installation, and service, as well as operating costs such as colocation rental, power costs, remote hands and support engineers and internet connectivity.

3. Compute Throughput for AI Accelerators: Estimated and measured effective training FLOPS and inference throughput (in tokens/second/GPU). For some systems, our AI Engineering team has run training and inference profiling and benchmarks, while for others, we have estimated output based on the chip specifications and architecture.

We use the total cost of compute together with the compute throughput to calculate the cost of compute in terms of $/hr per effective PFLOP for training and $/M tokens for inference.

The market cost of compute is then determined by a weighted average of the cost of compute for each accelerator, based on its install base. With this market cost of compute, we can then multiply this by the compute capabilities of a given accelerator to calculate the “mark to market” rental cost for that accelerator.
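
To make the mechanics concrete, below is a minimal Python sketch of the weighted-average and mark-to-market calculation described above. The accelerator names mirror the discussion, but every number in it is a hypothetical placeholder rather than an input from our actual model.

    from dataclasses import dataclass

    @dataclass
    class Accelerator:
        name: str
        install_base: float      # share of the global installed base (weight)
        tco_per_hour: float      # all-in cost of ownership, $/hr per GPU
        effective_pflops: float  # effective training PFLOPS per GPU

    def market_cost_of_compute(fleet):
        """Install-base-weighted cost of compute, in $/hr per effective PFLOP."""
        total_weight = sum(a.install_base for a in fleet)
        return sum(a.install_base * a.tco_per_hour / a.effective_pflops for a in fleet) / total_weight

    def mark_to_market_rental(accel, market_rate):
        """Rental price ($/hr per GPU) implied by the market cost of compute."""
        return market_rate * accel.effective_pflops

    # Hypothetical fleet, for illustration only.
    fleet = [
        Accelerator("H100 SXM", install_base=0.7, tco_per_hour=1.40, effective_pflops=0.9),
        Accelerator("GB200 NVL72 (per GPU)", install_base=0.3, tco_per_hour=3.10, effective_pflops=4.5),
    ]

    rate = market_cost_of_compute(fleet)  # $/hr per effective PFLOP
    print(f"H100 mark-to-market: ${mark_to_market_rental(fleet[0], rate):.2f}/hr/GPU")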

The below table provides a simple example of the workings behind this forecast. Here, we see that the GB200 NVL72 offers a 75% lower inference unit cost in terms of $/M tokens vs. an H100 and a 56% lower training cost in terms of $/hr per effective PFLOP. This means that if the GB200 NVL72 sets the market cost of compute, then the H100 would have to be priced 56% lower per hour than the GB200 NVL72 in order for buyers to be indifferent between renting the two. The H100 would have to be set at a rental of $0.98 per hour per GPU to compete with a GB200 NVL72 priced at $2.20 per hour per GPU.

Source: SemiAnalysis

This dynamic of increasing availability of systems with far lower cost of compute drives the overall cost of compute in terms of $/M tokens and $/hr per effective PFLOP lower, in turn dragging rental prices for older cards down as well.

Behind the paywall (at the end) we have a deep dive into future pricing of the GB200, IRR estimates for H100 and GB200 Neoclouds, and the market prices for different contract lengths. We will also discuss how the above framework can be applied to analyzing CoreWeave’s unit economics and potential return on investment. That was pretty much a midroll advertisement for the AI TCO model. Now, back to the ClusterMAX™ rating.

GPU Cloud ClusterMAX™ Rating System Guidelines

We have designed our GPU Cloud ClusterMAX™ Rating System based on what most GPU renters want from their GPU cloud, so let’s discuss the needs of GPU users before we jump into the guidelines themselves.

Most GPU renters want GPU nodes with out-of-the-box managed Slurm/Kubernetes, allowing them to focus on coding, training, and deploying their PyTorch/JAX models. GPU renters wish to focus on small to large-scale experiments without much consideration of managing and maintaining the infrastructure underneath. Most GPU renters recognize that GPUs can sometimes fail, and when this happens, they want full visibility into the root cause and how the provider is addressing the issue.

To develop our ClusterMAX™ Rating, we have talked to many GPU renters to understand their needs and wants from their GPU cloud. Based on our discussions, we have considered the following attributes during our evaluation:

  • Security
  • Lifecycle and Technical Expertise
  • Reliability/SLA
  • Slurm and Kubernetes offerings
  • NCCL/RCCL Networking Performance
  • Storage
  • Active/Passive Health Checks and Monitoring
  • Pricing and Consumption Model
  • Technical Partnerships

Security

We start with Security, as this is a critical make-or-break factor for many GPU renters. They store proprietary model weights – which cost tens of thousands to tens of millions of dollars to train and are the core intellectual property of most GenAI companies – on these clouds. Furthermore, training and/or inferencing these ML models can involve the use of proprietary or personally identifiable information or other user data. Customers of these companies that rent GPUs do not want their data to be leaked due to the use of an insecure GPU cloud. In EU countries, the stakes are even higher, as there are heavy fines for leaking user data under the GDPR.

We also notice that across the industry, there is a long tail of Emerging Neoclouds that do not have even basic SOC2 or ISO 27001 security certifications. We even see some clouds on the “AMD Alliance Instinct Cloud Partners” list that lack basic security certifications such as SOC2 or ISO 27001. We have spoken with AMD, and they have confirmed that they are investigating the issue and are committed to helping raise the industry standard on this topic.

Enterprises mostly rent from Hyperscalers, as GPU users trust Hyperscalers to properly implement security measures. We are starting to see some enterprises look into renting from Neoclouds, and most are gravitating toward CoreWeave. These enterprises conduct more stringent due diligence and are risk averse when renting from a non-Hyperscaler.

Many enterprises don’t even have a security checklist to validate their cloud, as they expect security akin to what the Hyperscalers provide when renting CPUs. Neoclouds entering the enterprise market, such as CoreWeave, must demonstrate to their potential enterprise customers that they are secure. CoreWeave, for example, has crossed that hurdle, with financial companies such as Jane Street among its customers. High-frequency trading companies, such as Jane Street, have the strictest security requirements, as they deal with proprietary data and algorithms, which are the secret sauce of how they generate profits.

Ensuring tenant network isolation is essential to prevent unauthorized access to data. On Ethernet, this is achieved by setting VLANs on the networking switches such that nodes from Tenant A can only communicate with other Tenant A nodes and can never communicate with Tenant B nodes, and vice versa. On Ethernet, tenant isolation can also be carried out using DPUs, such as BlueField-3, to manage this isolation instead of having the networking switches handle it. We see only the most advanced GPU cloud operators implementing tenant isolation using DPUs – examples include CoreWeave, OCI, AWS, GCP and Azure.

Other GPU clouds lack the technical capabilities to fully utilize the DPU feature set and must resort to implementing tenant isolation on the networking switches instead. On InfiniBand, tenant isolation is done through Partition Keys (PKeys). We recommend that GPU renters explicitly request in their Cloud Master Service Agreement (MSA) that there is tenant networking isolation on both the Ethernet and InfiniBand networks through VLANs or IB PKeys, to further ensure protection via proper tenant isolation.

In addition to PKeys, on InfiniBand there are additional keys that a GPU Operator must set to ensure proper security and to ensure that the InfiniBand network cannot be easily hijacked:

The Subnet Manager Key (SM Key) must be set to prevent unauthorized Subnet Managers/UFMs from being deployed. The Management Key (MKey) must also be set to protect the fabric against unauthorized configuration changes. To prevent the congestion control functions of the fabric from being hijacked, the Congestion Control Key (CC Key) must be set; this needs to be set on non-blocking IB fabrics too. For InfiniBand fabrics that have SHARP in-network reduction enabled, the Aggregation Management Key (AM Key) must be enabled to prevent InfiniBand’s SHARP Aggregation Manager from being hijacked. As one can see, there are numerous InfiniBand security keys that must be set to ensure proper security; however, there is a notable lack of public documentation from Nvidia and a general lack of awareness and education in the industry regarding these critical keys.

We recommend that Nvidia provide publicly accessible documentation and education on InfiniBand security, helping GPU clouds properly set it up. We’ve pointed many GPU clouds directly to Nvidia for best practices.

We recommend that, as a GPU renter, you explicitly request in your Cloud Master Service Agreement (MSA) that PKeys, AM Keys, SM Keys, MKeys, CC Keys, and VS Keys be set, to further verify that the GPU cloud you rent from has enabled these security protections.

CoreWeave used to provide their multi-tenant GPU cluster offering by having a single Kubernetes cluster housing multiple tenants and then using Kubernetes namespace isolation between each tenant on the same Kubernetes cluster, providing each tenant with a vCluster. This offering, now commonly referred to as “CoreWeave Classic,” was not a secure offering, and CoreWeave had already migrated away from the approach of implementing clusters with Kubernetes namespace isolation between tenants a couple of years ago.  

CoreWeave has switched to having only one tenant per Kubernetes cluster, as this is a properly secure offering. In this implementation, a physical cluster will comprise multiple Kubernetes clusters, and each tenant will have its own Kubernetes cluster, rather than each tenant just having a Kubernetes namespace.

The reason why Kubernetes namespace isolation between tenants is not secure is that there are many container escape vulnerabilities, mainly in the GPU Driver or in the container toolkit. These container escape vulnerabilities would allow an attacker to escape a container and move laterally to another user on the same host, potentially escalating to another tenant’s host within the Kubernetes cluster.

New container escape vulnerabilities are currently discovered every month, and there could potentially be dozens of unknown container escape vulnerabilities on top of those. In September 2024, Wiz discovered a critical GPU container and Kubernetes vulnerability that affected over 35% of environments. Thus, relying on Kubernetes namespace isolation alone is not safe. The isolation boundaries should be VLANs, with each tenant getting their own Kubernetes cluster.

Many GPU offerings utilize a multi-tenant cluster, but each tenant is on a different set of specific physical servers such that no two tenants share the same physical server. For some GPU cloud offerings, particularly on-demand offerings, there may be more than one tenant per physical host. If there is more than one tenant per physical host, there needs to be VM isolation, as container isolation between multiple tenants on the same host is not stringent enough due to the Nvidia/AMD container security issues described above, which have arisen nearly every month. To reiterate – if only container isolation is used, then hackers can use these known container escapes to gain physical-host privileges and potentially peek into other tenants’ containers and examine model weights. They might even be able to access other servers and retrieve other tenants’ models from those servers. We strongly recommend against multiple tenants per physical host with just container-based isolation.

Lifecycle and Technical Expertise

Evaluating technical expertise is crucial when selecting a GPU cloud provider, as this directly impacts your team’s overall experience. The impact of technical expertise is felt even before onboarding, particularly in areas such as marketing clarity, sales process, transparent pricing, reasonable drafts of master service agreements (MSAs), pre-onboarding support, and data migration capabilities.

Moreover, assessing the Sales Phase—whether the communication was transparent, technical questions were promptly addressed, and commitments were clearly defined—also reflects the provider’s overall customer-centric approach. We see that experienced GPU clouds often have a technical engineer who consults with the customer to help with a smooth sales and onboarding experience.

A straightforward question to ask during the sales process is whether they are familiar with Sylvain. All the top GPU clouds that have optimized their NCCL settings and InfiniBand switch configurations and are skilled at debugging NCCL and Networking fabrics have interacted with the renowned Sylvain.

One essential item in your MSA is the exact delivery dates. Late deliveries are widespread in the GPU cloud industry. As a customer, you should ensure that you are happy with the exact delivery date in your MSA and should make sure there is an escape clause if there are any delays.

During the Preparatory Phase, leading GPU clouds typically enable users to migrate data into clusters in advance, significantly reducing the “time to value” by ensuring workloads can begin immediately after onboarding. Hyperscalers such as GCP, Azure, OCI, and AWS all allow for this. Furthermore, most Neoclouds such as CoreWeave and Nebius give customers the opportunity to upload giant datasets in advance so as not to waste GPU time. The GPU cloud should gather enough relevant information and ask key questions to ensure that there are no unexpected speed bumps or roadblocks.

The Onboarding Process itself is pivotal; the cluster should be provided on time, and there should be a high degree of automation for cluster provisioning, ensuring that there are no human errors or mistakes. The cluster that is provided should have undergone burn-in, and the burn-in process and acceptance tests should be publicly available on their website or in YouTube conference talk recordings. Instances should have no issues with reboot. After a reboot, all systems should be working, and there should be no need to manually set anything, for example, re-mounting networked filesystems. When we apply the ClusterMAX™ Rating System during the onboarding phase, we assess the “time to value” or “time to successfully launch useful work.” For example, if managed Slurm or Kubernetes is available out of the box, the end user does not need to spend a few days figuring out how to install it themselves, thus shortening the time to value.

Ongoing support during the Main Working Phase is critical, as GPUs have a higher failure rate than traditional CPU servers. H100s/H200s tend to experience soft or hard failures; therefore, it is critical to have excellent support services. The MI300X has more than twice the failure rate of the H100/H200 due to higher temperatures and less mature Samsung HBM. For maintenance incidents and outages, we observe that top GPU clouds communicate what is happening, the debug steps being carried out, and an ETA for implementing the fix. For failures of 1-2 GPU nodes, we observe top GPU clouds quickly spinning up new nodes within 90 seconds to provide to their tenants, ensuring that their customers don’t need to wait for troubleshooting. Top GPU clouds will also compensate customers fairly for outages. The loss of just a single GPU server for training means the unavailability of a significant portion of a cluster, as most training codebases require a specific number of GPUs and cannot operate with even a single GPU missing.

Lastly, for the Offboarding Phase, we evaluate if there is vendor lock-in. Hyperscalers famously have high egress fees to prevent their customers from switching to another cloud. Most Neocloud Giants and Emerging Neoclouds do not charge egress fees for moving data to another GPU cloud.

Slurm and Kubernetes

90% of customers prefer Kubernetes for inference workloads, and about 50% of customers use Slurm for training. Top-ranking GPU providers are increasingly differentiating themselves by offering fully tested, out-of-the-box managed Kubernetes and Slurm environments, in addition to providing raw GPU node offerings without pre-configured schedulers. Customers universally prefer these managed scheduling solutions, as self-setup Slurm and Kubernetes consume valuable GPU resources and may take days to stand up, directly increasing costs and delaying productive work.

Source: Futurama

We expect that Slurm will remain popular well into the distant future. A common misconception is that the choice of scheduler is independent of whether virtualization is used. You can have bare metal Slurm, or you can have Slurm with VMs. Slurm and bare metal are not mutually exclusive. The same applies to Kubernetes; you can have bare-metal Kubernetes, as is the case at CoreWeave, or have Kubernetes with virtual machines, such as in GKE or EKS.

Source: SemiAnalysis

When providers offer managed Kubernetes and Slurm platforms, this enables customers to maximize their GPU utilization, significantly reducing their time-to-useful-work. Notably, even technically sophisticated organizations such as Meta and Jane Street choose to use CoreWeave’s managed Slurm and Kubernetes offerings due to their effectiveness and reliability. CoreWeave’s managed Slurm and Kubernetes go a long way toward increasing goodput and shortening time to value. A notable exception is OpenAI, which opts out of managed schedulers due to heightened security and operational paranoia surrounding artificial general intelligence (AGI).

Source: SemiAnalysis

We also see that a lot of providers do not have topology.conf set up for their out-of-the-box Slurm solution. Not having topology.conf set up leads to workload slowdowns and decreased NCCL performance. Some providers’ Slurm solutions are also missing the pyxis plugin, which enables reproducible containerized environments on Slurm and is widely used by customers on CoreWeave, GCP, AWS, OCI, and other major providers. Other providers do not adequately set up NVIDIA HPC-X modules or Slurm MPI integrations.
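
For illustration, here is a minimal sketch of what these two pieces look like in practice. The switch names, node ranges, and container image are placeholders, and the topology file only takes effect if TopologyPlugin=topology/tree is enabled in slurm.conf.

    # Illustrative Slurm topology.conf for a two-tier leaf/spine fabric
    SwitchName=leaf01 Nodes=gpu[001-032]
    SwitchName=leaf02 Nodes=gpu[033-064]
    SwitchName=spine01 Switches=leaf[01-02]

    # With the pyxis plugin installed, containerized jobs launch directly through srun:
    srun --nodes=2 --ntasks-per-node=8 --gpus-per-node=8 \
         --container-image=nvcr.io#nvidia/pytorch:24.05-py3 \
         python train.py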

Storage

Efficient and performant storage solutions are essential for machine learning workloads, both for training and inference. We see that most customers want a managed high-performance parallel filesystem such as Weka, Lustre, VAST Data, or DDN, and/or a managed S3-compatible object storage.

During training, vast quantities of data must be quickly and reliably accessed to feed GPUs without bottlenecks, which means that high-performance storage is needed for model checkpoint loads and saves such that GPUs can maximize MFU and thereby significantly accelerate training time.

Managed object storage options are equally crucial for flexible, cost-effective, and scalable data storage, enabling teams to efficiently store, version, and retrieve training datasets, checkpoints, and model artifacts.

For ML inference workloads, performance-oriented storage ensures that models are loaded rapidly from storage in production scenarios. Slow or inefficient storage can cause noticeable delays, degrading the end-user experience or reducing the real-time responsiveness of AI-driven applications. It is, therefore, vital to assess whether GPU cloud providers offer robust managed parallel filesystem and object storage solutions, ensuring that these options are optimized and validated for excellent performance across varied workloads.

Source: Nvidia

The two main sources of user frustration with storage are when file volumes randomly unmount and when users encounter the Lots of Small Files (LOSF) problem. The solution to the random unmounting issue is to use a program called “autofs” that will automatically keep your shared filesystem mounted.

Next, the LOSF problem can easily be avoided, as it is only an issue if you decide to roll your own storage solution, like an NFS server, instead of paying for a storage software vendor like Weka or VAST. An end user will very quickly notice an LOSF problem on a cluster, as even the time to import PyTorch into Python will lag out completely if an LOSF problem exists on the cluster.
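
A minimal sketch of the kind of smoke test we use to spot this, assuming the script is launched across many ranks at once (for example via srun): simply time the import itself on every rank. On a healthy parallel filesystem the measured time stays roughly flat as the job scales; on a home-rolled NFS server with an LOSF problem it explodes.

    import os
    import time

    start = time.perf_counter()
    import torch  # the import itself is the workload being measured
    elapsed = time.perf_counter() - start

    rank = os.environ.get("SLURM_PROCID", "0")
    print(f"rank {rank}: import torch took {elapsed:.2f}s")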

The diagram below, produced during our testing on Crusoe’s cluster, demonstrates how a cluster storage solution optimized and free of the LOSF problem should behave. As you can see, the time to complete importing PyTorch into the Python process stays relatively flat even when scaling up GPU count.

Source: SemiAnalysis

This is a world of difference from a cluster running on unoptimized shared storage, where the time required to import PyTorch in a multi-node Python training run explodes, often rendering the cluster completely unusable. Notice the difference between high-performing storage and how a cluster with LOSF issues behaves.

Source: SemiAnalysis

NCCL/RCCL Networking Performance

When selecting GPU cloud services, thorough validation of NCCL/RCCL networking performance is vital for maximizing training and inference performance. Providers should offer validated, out-of-the-box NCCL/RCCL-tests scripts enabling customers to independently confirm network performance, particularly within the critical real-world message-size range of 16MiB to 512MiB.
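
As a hedged illustration of what such a validation script might look like, the sweep below uses the standard nccl-tests all_reduce_perf binary over exactly that message-size range; the node count, launcher flags, and binary path are placeholders that will differ per cluster.

    # Sweep all-reduce from 16 MiB to 512 MiB, doubling the message size each step
    srun --nodes=16 --ntasks-per-node=8 --gpus-per-node=8 --mpi=pmix \
         ./build/all_reduce_perf -b 16M -e 512M -f 2 -g 1 -n 100 -w 20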

A network that runs all-reduce at half the speed leads to a 10% drop in MFU on an O(70B) training run and a 15-20% drop in MFU for an O(8x7B) mixture-of-experts model. It is a common misconception that inference does not need high-speed networking. In reality, inference providers apply network-bandwidth-intensive techniques like disaggregated serving to achieve cost-effective, high-performance inferencing. Disaggregated serving has been an industry standard for years, and last week NVIDIA open-sourced Dynamo, a distributed inference framework that further democratizes disaggregated serving and many other inference optimization techniques.

Source: Nvidia

It’s essential to recognize that not all on-paper 400G networks offer identical performance – actual realized network performance depends heavily on whether the network is non-blocking and rail-optimized, whether it uses InfiniBand or Ethernet, which NICs and switches are used, and, finally, whether the GPU operator has properly configured and validated their network fabric with nccl/rccl-tests.

We observe that network fabrics utilizing ConnectX-7 NICs perform the best. We see that a well-tuned network with premium switches, as is the case at OCI, can be very competitive with Spectrum-X Ethernet networking. InfiniBand still tends to perform the best, especially when SHARP in-network reductions are enabled. We have run NCCL/RCCL networking benchmarks from 128 GPUs to 1024 GPUs and will release all the performance data in our upcoming NCCL/RCCL networking Deep Dive article within the next couple of months. We have tested and are releasing our analysis of the following networks in the upcoming article:

  • 8x400GbE Spectrum-X RoCEv2 Ethernet H100
  • 8x400G InfiniBand NDR H100
  • 8x400G InfiniBand NDR with SHARP H100
  • 8x400GbE Oracle Cloud RoCEv2 Ethernet H100
  • 8x200GbE Google Cloud Fastrak Ethernet a3-mega H100
  • 8x400GbE Google Cloud RoCEv2 Ethernet a3-ultra H200
  • 16x200GbE AWS EFAv3 Ethernet p5en H200
  • 8x400GbE RoCEv2 Ethernet MI300X

Even if a network is non-blocking, we see that networks with larger 1-hop rail-optimized pods can achieve higher NCCL performance because less traffic needs to transit between rail pods, leading to less congestion. On GCP’s 8x400GbE a3-ultra, the 1-hop rail pod consists of just 4 nodes, whereas for OCI Ethernet and the InfiniBand reference architecture, the 1-hop rail pod is 32 servers in size. This is one of the reasons why OCI’s 4x400GbE offering has better NCCL performance than GCP’s latest 8x400GbE a3-ultra offering.

Source: SemiAnalysis

GPU clouds that implement topology-aware tenant allocation deliver better performance, as tenant nodes can minimize the number of hops when communicating with each other by binpacking tenant nodes into the minimum number of rail pods needed. Topology-aware tenant allocation is required even for non-blocking networks.

Within a tenant’s environment, out-of-box topology-aware scheduling configurations, such as Kubernetes topology or Slurm’s topology.conf, are crucial, even in optimized and non-blocking network setups. We have seen that such configurations can yield performance gains of 20-30% for certain message sizes, regardless of cluster size. We observe that this has a significant impact on performance for workloads that utilize only a portion of a cluster as well as for workloads that occupy the entire cluster. We have seen on a 1024-GPU non-blocking InfiniBand reference architecture deployment that nccl-tests run 20-30% slower for certain message sizes without topology-aware scheduling. We will expand on this topic in our NCCL/RCCL deep dive article.

For Ethernet-based setups, understanding whether providers employ high-quality Arista switches with a high-quality, battle-tested NOS, such as Arista EOS, that has been properly tuned for optimal performance, or whether they are using less expensive alternatives, can significantly impact NCCL performance. Note that there are considerable performance differences between a great Tomahawk5 switch from a top switch company and a mediocre Tomahawk5 switch from a middle-of-the-pack switch company. This is especially true when it comes to WhiteBox Tomahawk5 switches that are poorly tuned and configured by the GPU cloud. It is possible to tune them really well; many clouds simply have not, because it is challenging to do so.

Another consideration is if SHARP in-network reduction is enabled on the InfiniBand fabric. There are only three GPU providers in the world that have correctly set up SHARP – namely CoreWeave, Azure, and Firmus/Sustainable Metal Cloud. InfiniBand SHARP provides a boost to network performance by doing reductions inside the InfiniBand switch instead of in the SMs of the GPUs. Even when SHARP is enabled by the GPU provider, it is very difficult for the end user to correctly tune and make sure their training and inference codebase have the proper versions of PyTorch, Nvidia drivers, and the proper NCCL library version to gain a performance speedup.

Due to this difficulty, there are only five customers in the world across the GPU cloud industry that are using SHARP for production training and inference workloads, and out of these five customers, one is Nvidia itself. As such, we recommend that Nvidia make SHARP easier to set up for the GPU provider and enable SHARP by default, thereby improving the overall user experience when deploying and managing SHARP.

Source: SemiAnalysis

Lastly, we see that the top 10% of GPU clouds are focused on developing NCCL profiler plugins, enabling them to gain deep performance observability and let their customers gain more insights into the NCCL performance for debugging and optimization. GPU clouds that choose to deploy NCCL monitoring plugins into production environments will enable their customers to achieve faster performance and higher goodput, leading to a better overall customer experience and greater performance per dollar spent.

Source: SemiAnalysis

Reliability and Service Level Agreements (SLAs)

Reliability and clear Service Level Agreements (SLAs) form the foundational agreement on what you can expect from your GPU provider’s uptime. Precision about how providers define SLA events – whether they’re related to node failures, network disruptions like link flapping, or hardware-software-level issues such as NCCL timeouts – is essential. For example, a GPU cloud operating under a vaguely defined SLA could try to claim it is meeting its 99% SLA requirement even if a cluster is unusable because a NIC flaps for one microsecond every minute.

When NICs flap even for one microsecond, it causes NCCL to stall, and the entire training workload hangs. In such a case, it may take anywhere from a couple of minutes to 30 minutes to restart the workload. Top GPU clouds typically have a low rate of NCCL timeouts and NIC flaps, as they have already undergone the process of burning in their network and transceivers, and have taken essential steps such as cleaning dust off fiber cables, which is one of the leading causes of degraded goodput.
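
A rough back-of-envelope calculation shows why flap rates matter so much for goodput; the flap rate and recovery time below are hypothetical, chosen only to illustrate the arithmetic.

    def goodput_loss(flaps_per_day: float, restart_minutes: float) -> float:
        """Fraction of wall-clock time lost to flap-induced restarts."""
        return (flaps_per_day * restart_minutes) / (24 * 60)

    # e.g. two flaps a day with a 20-minute recovery path already burns ~2.8%
    # of the cluster's time before any other failure mode is counted.
    print(f"{goodput_loss(flaps_per_day=2, restart_minutes=20):.1%}")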

Top GPU Providers typically outline conditions under which service credits are issued, including the mechanisms for credit reimbursement, transparency around these processes, and the turnaround time for resolving incidents. Equally important is whether the provider maintains readily available “hot spares,” which enable immediate failover in the event of hardware failure, thereby significantly reducing downtime.

We see top GPU clouds leverage specialized deployment teams for cluster burn-in and deployment. These teams integrate and test at the individual server level and at the cluster-wide level, with network testing carried out at the OEM’s integration factory. We recommend that the cluster-wide high-temperature burn-in last at least 3-4 weeks so as to catch all the infant-mortality-related failures among the nodes’ components.

It is common for integration teams to pitch using LINPACK as their burn-in and acceptance process. Still, we don’t believe this is a very good test, as LINPACK does not utilize the network extensively, nor does it heavily utilize the GPU’s HBM memory; instead, it only utilizes and tests the GPU’s FP64 cores. ML training, by contrast, is very network-, HBM-, and BF16/FP16/FP8 tensor-core-intensive, and as such, we believe an ML-training-based workload is the burn-in and acceptance test that properly exercises the critical components.

Source: SemiAnalysis

Testing and burn-in procedures are essential practices designed to identify potential hardware or networking issues before deployment. Providers that openly share burn-in reports and document their testing processes publicly demonstrate higher accountability and confidence in their infrastructure’s stability. This level of transparency is required to be a ClusterMAX™ Platinum GPU cloud. CoreWeave has demonstrated its in-depth acceptance and burn-in procedures through its KubeCon talk and its many public blog posts.

For failures of 1-2 GPU nodes, we see top GPU clouds typically can quickly spin up new nodes within 90 seconds to give to their tenants such that their customers don’t need to wait for troubleshooting. Top GPU clouds also fairly compensate for inaccessibility. The unavailability of just a single GPU server means the unavailability of a large chunk of a cluster, since most training codebases require a set number of GPUs and cannot afford to be missing a single GPU.

Automated Active and Passive Health Checks and Monitoring

Top GPU providers offer customers the option to implement robust, out-of-the-box passive health checks designed to promptly identify GPU anomalies such as critical XID errors. These passive checks monitor GPUs for fallback events, mainly when GPUs degrade and revert to slower PCIe bus communication modes. Providers ensure continuous awareness of GPU health, detecting conditions that might otherwise lead to reduced performance or failures. These passive health checks also verify node integrity by confirming that storage mounts remain stable and accessible, ensuring data availability and system reliability.
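
One simple form such a passive check can take is scanning the kernel log for NVIDIA Xid events. The sketch below is illustrative only: the set of “critical” codes is an example policy, and reading dmesg may require elevated privileges depending on the host configuration.

    import re
    import subprocess

    CRITICAL_XIDS = {48, 74, 79, 94, 95}  # example set; tune to your fleet's policy

    def scan_xids():
        """Return the Xid codes found in the kernel log (e.g. 79 = GPU fell off the bus)."""
        log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        return [int(m.group(1)) for m in re.finditer(r"NVRM: Xid \([^)]*\): (\d+)", log)]

    xids = scan_xids()
    if any(x in CRITICAL_XIDS for x in xids):
        print(f"UNHEALTHY: critical Xid events observed: {sorted(set(xids))}")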

Beyond passive monitoring, leading GPU providers equip their solutions with automated, preemptible active health checks scheduled weekly that proactively validate hardware and software health. These active diagnostics commonly include NCCL-tests, which evaluate GPU-to-GPU communication integrity and performance, ensuring clusters can efficiently perform collective operations and match the expected reference numbers. Providers also integrate the NVIDIA Data Center GPU Manager (DCGM) diagnostic suite (sudo dcgmi diag -r 4), which deeply evaluates GPU hardware health, identifying issues before they escalate.

To further strengthen detection capabilities, top GPU providers incorporate sophisticated, automatically scheduled weekly Silent Data Corruption (SDC) detection checks, such as TinyMeg2, enabling early detection of subtle corruption issues within the tensor cores or SIMT units. They enhance these diagnostics with rapid ML training Megatron convergence tests – typically two-minute benchmarks – that quickly reveal deviations in computational correctness, thereby minimizing downtime and preserving data accuracy.
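
A stripped-down sketch of how such a weekly active-check runner might be wired together is shown below; the commands mirror the checks described above, but the binary paths, sizes, and pass criteria are simplified placeholders, and a real implementation would also compare measured bandwidths against reference numbers and drain the node on failure.

    import subprocess

    CHECKS = [
        # DCGM diagnostics, as referenced above
        ["dcgmi", "diag", "-r", "4"],
        # Single-node NCCL all-reduce to exercise NVLink/NVSwitch before multi-node tests
        ["./build/all_reduce_perf", "-b", "1G", "-e", "8G", "-f", "2", "-g", "8"],
    ]

    def run_active_checks():
        healthy = True
        for cmd in CHECKS:
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                healthy = False
                print(f"FAILED: {' '.join(cmd)}\n{result.stdout[-2000:]}")
        return healthy

    if __name__ == "__main__":
        print("node healthy" if run_active_checks() else "node requires investigation")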

Finally, premium GPU providers offer extensive out-of-the-box Grafana monitoring capabilities, providing operators with deep visibility into their clusters’ operational states. This monitoring typically includes live tracking of TFLOP/s estimates, enabling an accurate and real-time assessment of GPU performance. Providers deliver comprehensive monitoring for critical interconnect links, promptly identifying link flaps or intermittent connectivity issues. Additionally, integrations with visualization platforms such as Grafana enable intuitive monitoring dashboards that include features like real-time Slurm job queue integration, performance trend visualization, and clear indicators showing when active health checks were last executed, empowering administrators with actionable insights to maintain optimal GPU performance and reliability.

Consumption Models, Price Per Value, and Availability

Pricing, consumption models, and immediate availability are among the most important factors when selecting a GPU provider. Customers want the most comprehensive feature set at the lowest price with the best terms.

GPU Compute prices are expressed in USD per hour per GPU – based on a typical on-demand price of USD 2.99/hr/GPU for an H100 SXM with 8x400G InfiniBand, renting one server of 8 GPUs would cost USD 23.92 per hour and $574.08 per day.

For most Neoclouds, this price is all-inclusive and bundles in the on-board CPUs, networking, power usage, and local NVMe storage, as well as having Slurm and drivers properly set up. Customers typically also take dedicated network storage, used for training, checkpointing, or managing training and inference data, as well as object storage. Storage is generally charged separately, with prices of around 6-9 cents per GB per month for high-performance network storage and 2-3 cents per GB per month for object storage. Internet connectivity and data ingress and egress are typically not chargeable.
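
To put these rates together, here is a back-of-envelope illustration of the all-in monthly bill for a hypothetical 16-node (128-GPU) cluster; the GPU rate and storage rates are the example figures quoted above, and the storage volumes are made up for illustration.

    gpus            = 128      # 16 nodes x 8 GPUs
    gpu_hour_rate   = 2.99     # $/hr/GPU, the on-demand H100 SXM example above
    hours_per_month = 730
    parallel_fs_gb  = 100_000  # high-performance network storage at ~$0.075/GB/month
    object_store_gb = 500_000  # object storage at ~$0.025/GB/month

    compute = gpus * gpu_hour_rate * hours_per_month            # about $279k
    storage = parallel_fs_gb * 0.075 + object_store_gb * 0.025  # = $20,000
    print(f"compute ~ ${compute:,.0f}/month, storage ~ ${storage:,.0f}/month")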

There are a few different options when subscribing for GPU Compute:

  • On-demand: The GPU compute buyer pays based on the actual time the GPU instance/server is used, with the price of compute subject to adjustment. This affords the most flexibility and is most often used for development, burst inferencing, or hobbyist work. However, it typically has the highest price out of the three main options. The current best on-demand pricing is $2.99 USD per hour per GPU.
  • Spot: Also known as pre-emptible. Like on-demand, usage is charged by actual time on the instance/server, but usage can be interrupted at any time to make way for other workloads or users. This is best suited for jobs that do not require real-time processing, although the smooth resumption of jobs remains a developing capability. Spot instances are most suitable for inference workloads or batch jobs that can be interrupted with one minute’s notice. Nobody uses spot instances for training, as being randomly kicked off a multi-node instance is highly disruptive. Spot pricing gives Neoclouds flexibility to quickly free up capacity for more important customers or more lucrative workloads. Spot pricing can be lower than on-demand – we have seen quotes at $2.00 and even lower, in the $1.00 to $2.00 range.
  • Contract/Reserved: The compute price is locked in for a given term, and usage cannot be interrupted. Ordinary contract tenors include one month, 6 months, one year, 18 months, 2 years, and 3 years. Due to the wide availability of H100s/H200s across over 100 GPU clouds, most customers no longer sign 1-3 year deals.

Pre-emptible or on-demand cluster options offer more flexibility, making them ideal for intermittent or elastic workloads. A noteworthy strategy emerging among some Neocloud providers involves selling idle compute capacity at discounted rates but with specific clauses permitting providers to reclaim resources with short notice (typically seven days) should higher-paying customers emerge.

Providers like Google Cloud Platform (GCP) and Amazon Web Services (AWS) offer scheduled capacity through flex modes or capacity blocks, allowing customers predictable access at defined intervals.

From a GPU Provider’s perspective, it’s typically advantageous to secure long-term contracts, mainly because, as highlighted by NVIDIA’s “Chief Revenue Destroyer” Jensen Huang, GPU performance per dollar improves extremely rapidly each year, making long-term contracts a way for the GPU cloud to lock in margins early on.

From the Customer’s perspective, it is typically advantageous to secure short-term contracts for the same reason Mr. “Chief Revenue Destroyer” mentioned at his GTC keynote – namely, the exponentially increasing GPU performance per dollar delivered each year as he launches a new GPU generation.

Most customers are engaged in an AI arms race with their competitors, meaning that availability remains a critical differentiator. Customers frequently require immediate resource provisioning and want their clusters available not just today, but they practically want them available yesterday. Both Nebius and Crusoe excel in this domain, with extensive availability and the capability to provision substantial GPU clusters (such as 128 GPUs) from initial contact to contract signing and provisioning within remarkably short timelines—often less than two days.

Nebius currently stands out by offering the best absolute pricing while still maintaining robust technical capabilities. Their approach involves aggressive cost optimization strategies, including the adoption of Original Design Manufacturer (ODM) hardware for GPU servers, significantly lowering their total cost of ownership. For example, by bypassing traditional OEM providers like Dell or Supermicro, which can price servers with as high as 10-15% gross margins, Nebius achieves cost reductions through a custom-designed ODM chassis, minimizing gross margins from a typical 10-15% down to around 2%.

This strategy not only reduces initial hardware expenditures but also lowers ongoing power consumption, allowing Nebius to pass substantial savings onto customers. All Hyperscalers use this ODM strategy, too, but Nebius stands out as the only non-Hyperscaler that deploys an ODM-built chassis.

Oracle presents a distinct pricing advantage for customers that prioritize Hyperscaler capabilities and stringent security measures. Their pricing model reflects enterprise-grade security and comprehensive integration with other cloud services, aligning well with organizations that require deep ecosystem integration or compliance-focused workloads. Consequently, while not necessarily the cheapest absolute option, Oracle delivers exceptional price-to-value for enterprise customers.

Technical Partnerships

Nvidia has a program called “Nvidia Cloud Partner (NCP),” where GPU clouds that can meet certain requirements can achieve NCP status, and Nvidia helps them get sales and ensure they have lines of communication with technical folks to help advance their GPU cloud offering. We tend to see that GPU clouds with NCP status have better performance than those without it.

Jensen has invested in the following GPU clouds:

  • TogetherAI
  • CoreWeave
  • Nebius
  • Crusoe
  • Lambda Labs

These five GPU clouds tend to have good overall user experience and performance based on our testing. Four out of the five have offerings that meet the ClusterMAX™ Platinum or ClusterMAX™ Gold standard.

In contrast, we tend to see that GPU clouds that AMD has invested in do not do well on user experience, and none of the clouds that AMD has invested in rank even at ClusterMAX™ Platinum, ClusterMAX™ Gold, or ClusterMAX™ Silver.

Some of the “AMD Alliance Instinct Cloud Partners” don’t even have basic security, such as SOC2. As such, we view being on the “AMD Alliance Instinct Cloud Partners” list as not a good predictor of tiering well in ClusterMAX™.

We have spoken with AMD, and they have confirmed that they are investigating the issue and committed to helping raise the industry standard on this topic.

​GPU clouds that pay for support from SchedMD (the makers of Slurm) can significantly enhance the customer experience and services by providing robust and efficient management of GPU resources. By leveraging SchedMD’s expertise, GPU cloud providers can offer users a seamless and optimized experience, ensuring that computational tasks are handled efficiently and effectively.​

Recommendations to AMD and Nvidia

We recommend that AMD ensure all of its “Alliance Instinct Cloud Partners” achieve SOC2 and that any new “Cloud Partner” have SOC2 security before joining. We recommend that Nvidia and AMD set the industry-wide bar and help even their non-partner clouds get SOC2 security. SOC2 security should not be optional for a GPU cloud; it is a must-have. It is just like FAA certification for airlines: you might be willing to fly on an airline that doesn’t have FAA certification, but most people won’t. As most model weights and codebases are proprietary intellectual property (IP), not open source, and are worth tens of thousands to millions of dollars, most customers want basic security, such as SOC2.

Furthermore, we recommend that AMD provide Pyxis container support for Slurm so that running containers on Slurm has a good UX. Currently, it is challenging to run containers on Slurm, and even AMD’s own internal scripts are a mess due to this missing Pyxis capability. By contrast, on a correctly set up Nvidia GPU cloud, running containers with Pyxis on Slurm is an effortless experience.

For NVIDIA, we recommend that they provide good, publicly accessible documentation and education about the different keys (SM Keys, MKeys, PKeys, VS Keys, CC Keys, AM Keys, etc.) needed to properly secure an InfiniBand network. We recommend that Nvidia help their GPU clouds properly secure their InfiniBand networks and complete an audit of all GPU clouds that use InfiniBand.

Furthermore, we recommend that Nvidia fix the ease of use of SHARP for the GPU provider and enable it by default, instead of making it challenging for the GPU provider to set up and for the end user to see the real-world benefits of SHARP on their training and inference workloads.

We recommend that Nvidia run regression tests on GCP, AWS, and Oracle networking to prevent regressions in their Hyperscaler partners’ NCCL performance every time a new NCCL version is released. For example, on Oracle, there have been performance regressions since NCCL 2.21.5 when certain clouds attempted to deploy any of the subsequent NCCL versions up to 2.26.

ClusterMAX™ Platinum GPU Providers

The ClusterMAX™ Platinum tier represents the highest standard of GPU cloud services available in the industry. Providers in this category consistently excel across all critical evaluation criteria, including adopting robust security measures, competitive pricing for the value they offer, extensive technical expertise, outstanding reliability with clearly defined SLAs, seamless managed Slurm/Kubernetes offerings, and superior NCCL/RCCL networking performance. Platinum-tier providers are proactive, innovative, and maintain an active feedback loop with the community to continually raise the industry bar, setting benchmarks for excellence. Currently, there is only one GPU cloud that is raising the industry bar and qualifies for ClusterMAX™ Platinum, and that is CoreWeave.

CoreWeave

CoreWeave is clearly leading in providing the best GPU cloud experience, achieves very high goodput, and is entrusted to manage large-scale GPU infrastructure for AGI labs like OpenAI and Meta AI, high-frequency trading firms like Jane Street, and even NVIDIA’s internal clusters. CoreWeave is the expert at running large-scale GPU clusters reliably.

CoreWeave provides four offerings:

  • CoreWeave Bare Metal without any managed scheduler
  • CoreWeave Managed SUNK (Slurm in Kubernetes)
  • CoreWeave Managed Slurm
  • CoreWeave Managed Kubernetes

For their bare metal offering, which doesn’t include any CoreWeave-managed software or scheduler, there are essentially only two customers: OpenAI/Azure and Nvidia EOS. OpenAI/Azure wants bare metal due to their security and operational paranoia surrounding artificial general intelligence (AGI), as well as the ability to control the cluster more tightly for reliability and performance reasons. The CoreWeave 11,000 H100 EOS cluster, which Nvidia rents and uses for internal development, is also bare metal, as it is useful for Nvidia to manage their own Slurm and Kubernetes while they develop their Slurm and Kubernetes operators and plugins.

Although the rest of their customers have the option not to use the CoreWeave-managed offering, all of them opt for it because of how well it is managed.

Source: CoreWeave

First, we will talk about CoreWeave’s automated node lifecycle controller, which ensures that during cluster bring-up, nodes receive a full burn-in test and a full-cluster InfiniBand network high-temperature burn-in with NCCL-tests and ib_write_bw. Not only does this bring-up burn-in test for hard failures, it also compares results against reference numbers to catch nodes that do not meet performance expectations or are running into silent data corruption (SDC) issues. Nodes that do not pass this comprehensive test are automatically drained for investigation and do not proceed to the customer cluster until the issue is fully resolved.

Source: CoreWeave
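
To make the reference-number comparison concrete, here is a minimal sketch (not CoreWeave's actual implementation; the node names, reference bandwidth, and tolerance are hypothetical) of the kind of gate a burn-in pipeline can apply before handing nodes to a customer:

    # Hypothetical burn-in gate: flag nodes whose NCCL all-reduce bus bandwidth
    # falls short of a fleet reference number by more than a small tolerance.
    REFERENCE_BUSBW_GBPS = 480.0   # assumed reference for this fabric and node type
    TOLERANCE = 0.97               # assumed: accept nodes within 3% of reference

    def gate_burn_in(results: dict[str, float]) -> list[str]:
        """results maps node name -> measured all-reduce bus bandwidth in GB/s."""
        return [node for node, busbw in results.items()
                if busbw < REFERENCE_BUSBW_GBPS * TOLERANCE]

    measured = {"node-001": 479.1, "node-002": 455.3, "node-003": 481.0}
    print("Drain for investigation:", gate_burn_in(measured))   # ['node-002']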

Once deployed in a customer environment, they continuously run passive health checks every couple of seconds to ensure the GPUs are functioning properly, which leads to high goodput and automatically drains and fixes unhealthy nodes. For passive health checks, they monitor for the following (a minimal sketch of such a monitor follows the list):

  • GPUs falling off the bus
  • PCIe Errors
  • Ethernet and InfiniBand events such as Link Flaps
  • Thermals such as GPU temperature
  • GPU and CPU Memory stats such as ECC error rate
  • Nvidia XID and Nvidia SXID error codes
  • Etc.
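
As a minimal sketch of what a passive check loop looks like (our illustration, not CoreWeave's code; the set of fatal XID codes is an assumption), consider a daemon that greps the kernel log for Nvidia XID errors every few seconds and drains a node when a fatal one appears:

    # Illustrative passive health check loop: scan the kernel log for Nvidia XID
    # errors (e.g. XID 79 = GPU has fallen off the bus) every few seconds.
    import re
    import subprocess
    import time

    XID_RE = re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+)")
    FATAL_XIDS = {48, 74, 79, 94, 95}   # assumed subset: ECC, NVLink, fell-off-the-bus

    def scan_kernel_log() -> set[int]:
        log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        return {int(m.group(1)) for m in XID_RE.finditer(log)}

    while True:
        fatal = scan_kernel_log() & FATAL_XIDS
        if fatal:
            print(f"unhealthy node: fatal XIDs {sorted(fatal)} -> drain for repair")
        time.sleep(5)   # "every couple of seconds", as described above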

In addition to passive health checks, they automatically schedule weekly active health checks to run on idle GPUs, performing a full set of active tests to verify that nodes are healthy. These tests include the following (a minimal scheduling sketch follows the list):

  • NVIDIA DCGM diag level 3 with Extensive Testing (EUD)
  • DtoH and HtoD bandwidth tests for validating PCIe performance from CPU to GPU
  • Local NCCL all reduce tests for validating NVLink/NVSwitch/NVLS performance
  • Local InfiniBand all reduce test for validating InfiniBand performance and links (by force disabling NVLink/p2p/SHM)
  • Pairwise GPU ib_write_bw and ib_write_latency bidirectional tests to verify that the network is within specs with reference numbers.
  • Pairwise CPU ib_write_bw and ib_write_latency bidirectional tests to verify that the network is within specs with reference numbers.
  • GPUBurn for validating that GPUs won't fail under load
  • Nvidia TinyMeg2 for validating hardware correctness and that GPUs are free from SDC
  • Megatron Tests to test if TFLOP/s/GPU performance match reference numbers and that the loss convergence matches reference loss curve

By automatically scheduling these tests during cluster bring-up and continuously throughout the whole lifecycle of the customer cluster, they are able to ensure that the customer has high goodput, proactively removing unhealthy nodes to prevent customers from submitting jobs to unhealthy nodes. Customers can see in the dashboard when their nodes last had an active health check, called "verification," run on them.

Source: SemiAnalysis

On their main dashboard, they also show events such as the classic common errors of "GPUFallingoffthebus" and "LinkFlaps". At the infra layer, without any changes to application or end-user code, they use "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE * 1979" to roughly track the current fp8 TFLOP/s (times 989 for a bf16 TFLOP/s estimate) and have a system to correlate which alerts cause a drop in cluster- or job-wide TFLOP/s. For example, you can clearly see when a drop in job-wide TFLOP/s is caused by a PCIeFault or an IBLink flapping fault. While using DCGM_FI_PROF_PIPE_TENSOR_ACTIVE isn't the most accurate estimate of MFU, it does allow the customer and CoreWeave to see which events are relevant to a drop in MFU. In addition to the CoreWeave infra-layer MFU estimates, customers can also calculate their MFU and TFLOP/s/GPU at the application layer for a more accurate absolute TFLOP/s/GPU.

Source: CoreWeave
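
The arithmetic behind that infra-layer estimate is simple; a minimal sketch (our illustration, using the H100 peak dense TFLOP/s figures quoted above) looks like this:

    # Infra-layer throughput estimate: DCGM's tensor-pipe-active ratio (0.0-1.0)
    # scaled by peak dense TFLOP/s, plus a more accurate application-layer MFU.
    H100_PEAK_FP8_TFLOPS = 1979.0
    H100_PEAK_BF16_TFLOPS = 989.0

    def estimated_tflops(tensor_active: float, peak: float = H100_PEAK_FP8_TFLOPS) -> float:
        return tensor_active * peak

    def application_layer_mfu(achieved_tflops_per_gpu: float,
                              peak: float = H100_PEAK_BF16_TFLOPS) -> float:
        return achieved_tflops_per_gpu / peak

    print(estimated_tflops(0.40))        # tensor pipes 40% active -> ~792 FP8 TFLOP/s
    print(application_layer_mfu(400.0))  # 400 achieved BF16 TFLOP/s/GPU -> ~0.40 MFU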

CoreWeave has amazing out-of-the-box dashboards to track InfiniBand and NVLink bandwidth and a whole host of other stats such as temperature, and makes them all visible to the end user to help debug. As some are aware, cold-aisle temperatures change between day and night, and these changes may cause a 2-3% difference in performance; CoreWeave provides the end user full visibility into the temperature sensors of each node they are on.

Source: CoreWeave, SemiAnalysis

All of these active and passive metrics are collected to detect outliers, which is something like drawing a "curve of best fit" and finding the nodes and data points that fall outside a certain range of the median.
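
A minimal sketch of this kind of outlier detection (our illustration; the threshold of four median absolute deviations is an arbitrary assumption):

    # Flag nodes whose metric sits too many median absolute deviations (MAD)
    # away from the fleet median, i.e. nodes that don't fit the curve.
    import statistics

    def outlier_nodes(metrics: dict[str, float], max_deviations: float = 4.0) -> list[str]:
        values = list(metrics.values())
        med = statistics.median(values)
        mad = statistics.median([abs(v - med) for v in values]) or 1e-9
        return [node for node, v in metrics.items()
                if abs(v - med) / mad > max_deviations]

    # Example: per-node all-reduce bus bandwidth in GB/s; node-007 clearly lags the fleet.
    busbw = {"node-001": 478.0, "node-002": 481.0, "node-003": 479.0, "node-007": 410.0}
    print(outlier_nodes(busbw))   # ['node-007']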

Furthermore, CoreWeave has improved their customer’s visibility into reliability and the state of each node by having a node controller overview that shows the state of each node from a healthy state to being triaged to being debugged or RMA’ed back to the OEM.

Source: CoreWeave

What we find amazing about CoreWeave is that they provide an automated, managed solution that abstracts away all the tasks that an ML engineer or scientist ideally doesn’t want to do.

The monitoring, the passive and automatically scheduled active health checks, and the out-of-the-box managed scheduler all loop back into what an ML engineer/scientist wants: to focus on non-infra tasks and have a healthy set of verified nodes that is constantly scanned by passive health checks and automatically scanned weekly by active health checks. But an ML engineer/scientist also recognizes that sometimes things break and broken nodes are not caught by the health checks, and in those cases, they want FULL visibility into everything that is going on.

Their out-of-the-box Slurm solution comes with the Pyxis plugin for running reproducible containers as first-class citizens inside of Slurm, and with automatically generated Slurm topology to ensure optimized NCCL collectives. CoreWeave has also open sourced their nccl-tests scripts for reproducibility. Besides Azure, CoreWeave is one of the only providers with InfiniBand SHARP in-network reductions enabled, and it has worked with Nvidia and some of its customers to enable SHARP, optimizing their customers' workloads and NCCL performance.

These are all reasons why companies like Meta or Jane Street choose to use CoreWeave-managed Slurm/Kubernetes.

CoreWeave also offers a Slurm in Kubernetes (SUNK) solution where they run Slurm inside of Kubernetes. This allows customers to schedule their Slurm training with their Kubernetes inference services dynamically. This solution has essentially zero downsides, except for one.

The one downside is that changing GPU vboost settings requires the Slurm container to run as privileged, but this is a very simple fix to enable in the yaml. Some people may think that SUNK means vendor lock-in, but this is not true. If customers want to move away from CoreWeave, they can take their batch scripts and Kubernetes yaml files to other providers, since Kubernetes yaml and sbatch scripts are open standards and work on any properly set up Slurm solution. The reason customers continue to stick with CoreWeave and renew their contracts is CoreWeave's technical expertise and their amazing node lifecycle controller and health checks.

From our independent testing of CoreWeave H100 clusters, we noticed that their team of engineers and solution architects are the subject matter experts in GPU infrastructure and NCCL. The onboarding experience was smooth, and they provided an onboarding doc that shows all the IP addresses and common FAQs. It became very clear to us that CoreWeave is the technical subject matter expert when we asked in-depth questions about certain PCIe AER health checks and about specific InfiniBand security keys, such as the difference between SMKeys and MKeys.

One thing that is missing is self-serve deployment: CoreWeave's deployment involves a bunch of complex yaml files and Kubernetes concepts, which are typically not in the vocabulary of an ML scientist who wants to train on Slurm. Luckily, CoreWeave assigns an engineer to any customer that wants CoreWeave to handle the deployment for them. We recommend that CoreWeave work on a UI console flow for deploying their managed Slurm solution, ideally with fewer than four button clicks.

CoreWeave clusters commonly submit very competitive MLPerf training results. Furthermore, all the MLPerf Training results that NVIDIA submits are obtained on the CoreWeave 11,000 H100 EOS cluster, which NVIDIA rents from CoreWeave. CoreWeave runs many clusters, many of which have over 10,000 GPUs.  Due to their close partnership with NVIDIA, they get access to early tranches of the next GPU allocation, and as we mentioned in the Neocloud Anatomy, the players that deploy first for each GPU cycle can lock in long-term low-risk contracts from favorable customers.

CoreWeave is the only Neocloud capable of operating clusters with 10,000+ GPUs consistently and reliably. Besides CoreWeave, the only other GPU clouds that are able to operate them reliably are the four Hyperscalers: Azure, OCI, AWS and GCP.

From the customer perspective, a downside of CoreWeave is that they rarely accept short-term rentals, and most of their rentals are giant clusters for long-term tenants. This is different from Nebius and Crusoe, which offer competitive terms for short-term rentals of GPUs.

ClusterMAX™ Gold Tier GPU Providers

ClusterMAX™ Gold tier providers deliver strong performance across most key evaluation categories, with some opportunities for improvement. They offer solid security practices, reliable infrastructure, competitive pricing models, and competent technical support. Although Gold-tier GPU clouds may have occasional gaps or inconsistencies in specific features like advanced active health checks, they generally demonstrate responsiveness to feedback and a clear commitment to continuous improvement, positioning themselves as excellent choices for GPU renters seeking to maximize goodput. To move from Gold to Platinum, they must have a demonstrated history of raising the industry bar, as CoreWeave has.

Crusoe

We’ve been using Crusoe Cloud for the past seven months and have been consistently impressed by their offering. Their console UI is straightforward to navigate and user-friendly, which has significantly simplified resource management and deployment. The intuitive experience they offer in the dashboard sets a high standard in the GPU cloud market, particularly in terms of ease of use and accessibility.

We had a very positive experience dealing with GPU bus errors on Crusoe servers. When we ran into a GPU bus error, Crusoe sent us an email to resolve the issue. In the email, Crusoe explained that it had automatically detected a GPU-fell-off-the-bus error, reserved a healthy spare node, and requested that we restart the node in the console to complete the migration. Crusoe automatically identified the issue, proactively fixed it, and guided us through the migration. This robust fault management improves the user experience.

Source: Crusoe, SemiAnalysis

Initially, Crusoe lacked a fully managed Slurm solution, requiring customers to set up Slurm clusters manually through Terraform scripts. However, they offset this complexity with an exceptional white-glove service experience, where Crusoe engineers personally handled Slurm setup for most customers, ensuring smooth deployment and minimal friction. Crusoe recently addressed this gap by announcing their fully managed Slurm offering, called "Auto Clusters," unveiled at GTC last week. This new service promises to simplify customer workflows further and remove previous manual deployment complexities. The new "Auto Clusters" product will also come with automatically generated Slurm topology to ensure optimized NCCL collectives and automatic node replacement when unhealthy nodes are detected.

Source: Crusoe

Crusoe already provides a robust, fully managed Kubernetes offering, making it straightforward for users to deploy and scale containerized workloads. In terms of monitoring and reliability, Crusoe currently implements basic passive health checks; however, these are less detailed and comprehensive compared to those of industry leaders such as CoreWeave. They have not yet implemented automated weekly scheduled active health checks, such as dcgm diag, nccl-tests, Nvidia TinyMeg2, etc. Still, they've indicated that this critical feature is actively in development, that it will soon be integrated into both their managed Slurm and managed Kubernetes offerings, and that they will try to advance their health checks to the level of CoreWeave's.

Although they don’t do weekly scheduled active health checks, during cluster bring-up, they do burn-in and active health checks at the initial launch of the cluster; they do some level of testing and qualification. We recommend they investigate how CoreWeave does their cluster burn-in and advance their cluster burn-in to the same level as CoreWeave. CoreWeave has raised the industry bar for the most advanced cluster wide burn in.

Source: Crusoe

In terms of pricing and contract terms, Crusoe offers competitive short- to medium-term contracts that are attractive to startups and some enterprises. Their prices and terms are not as competitive as Nebius, but for fast-moving startups that want a simplified UI and user experience, Crusoe is competitive.

Similarly, Crusoe presently does not offer managed Grafana dashboards for GPU monitoring, which is another area they have identified for improvement. They have communicated clear plans to introduce advanced, managed Grafana monitoring dashboards in their upcoming managed Slurm and Kubernetes solutions, further enhancing observability and usability. Overall, Crusoe demonstrates significant responsiveness to user feedback and shows a strong commitment to rapidly evolving their product to meet customer needs and compete effectively in the GPU cloud marketplace.

Nebius

Nebius is notable for providing the lowest pricing in the GPU cloud market, enabled by their financial position. With billions of dollars on their balance sheet and no existing debt, they benefit from abundant financial resources and significant maneuvering room. Their financial strength translates directly into more risk-taking and much stronger investment in business development. Examples of this include innovative offerings like bridging H100 contracts into B200 deployments, as well as ubiquitous billboards in Santa Clara designed to capture mindshare. The result is unparalleled cost savings for their customers, as Nebius offers market-leading terms and highly competitive rates.

One of Nebius’s key strategies for maintaining such low prices is its commitment to using custom Original Design Manufacturer (ODM) chassis. By designing hardware internally and partnering directly with ODMs, Nebius bypasses traditional OEM providers like Dell and Supermicro, which typically apply gross margins of around 10-15%. Nebius’s ODM strategy reduces those gross margins to about 2%, dramatically lowering both initial hardware investment and ongoing operating expenses, such as power consumption. This cost efficiency places Nebius uniquely among non-Hyperscaler providers, as they adopt an optimization typically only seen within hyperscale cloud providers.

Due to its roots as an ex-Russian cloud provider, Nebius boasts an exceptionally talented team of cracked ex-Russian engineers. Nebius still lags behind competitors on user experience, though. Despite offering on-demand NVIDIA H100 GPUs at roughly $1.50 per hour (at least for the first thousand hours per month), half the cost charged by competitors like Lambda Labs, Nebius struggles with customer adoption. Many users still prefer Lambda Labs primarily because Nebius’s UI and UX remain overly complex and unintuitive, creating friction that deters less technically inclined customers. Nebius is committed to fixing its UI/UX issues.

Finally, Nebius currently offers a fully managed Kubernetes solution but does not yet provide fully automated managed Slurm clusters, a significant gap in their product portfolio. They are actively developing their “Soperator” Slurm solution, which includes foundational passive and active health checks. However, these checks still fall short of the industry-leading standards set by providers like CoreWeave. To match competitors’ reliability and observability, Nebius will need to invest more heavily in comprehensive, weekly scheduled active health checks and implement advanced out-of-the-box Grafana dashboards. Strengthening these operational aspects would further enhance their already compelling value proposition by increasing reliability to the level of CoreWeave and adding automated node lifecycles.

Source: Nebius

Oracle Cloud Infrastructure

From our testing, OCI’s GPU experience is strong, and OCI is consistently recognized as the most cost-effective among the four major Hyperscalers. Their GPU offerings come with a one-click UI deployment from the OCI marketplace called “OCI HPC stack” for both Slurm and monitoring. However, despite this impressive setup, OCI’s Slurm solution isn’t fully managed; it currently operates as a co-managed offering supported by one or two of OCI’s solution architects. To remain competitive, especially against AWS’s Hyperpod and CoreWeave’s comprehensive managed Slurm solution, we strongly recommend that OCI invest in a dedicated engineering team to develop a fully managed offering, call it “Oracle Managed Slurm (OMS).” We believe that if they built a fully managed Slurm offering, most of their customers would want to use it and benefit from it. The only exceptions are customers like OpenAI and Google, which want complete control of everything and have their own schedulers. On CoreWeave, most customers, except for OpenAI, opt into the CoreWeave fully managed Slurm and Kubernetes offering due to CoreWeave’s exceptional node lifecycle controller and automated active health checks.

Source: Oracle

In terms of monitoring, reliability, and passive health checks, OCI offers a decent solution through their Slurm HPC stack marketplace, complete with DCGM, Grafana monitoring, and passive health checks. Nevertheless, OCI currently lacks the advanced active health checks and automated node lifecycle management found in CoreWeave’s offerings, such as automated weekly scheduled active health checks (e.g., NCCL-tests, ib_write_bw, dcgm diag) and automatically marking unhealthy nodes. Oracle has confirmed that this functionality is already on their roadmap and is scheduled for completion in Q2. Another thing missing from the OCI HPC stack offering for Slurm is integration with high-speed parallel filesystems such as OCI’s Managed Lustre offering.

OCI also provides automated topology.conf configurations out of the box, enabling topology-aware scheduling that enhances network performance—an important feature still overlooked by many emerging GPU cloud providers.
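
For readers unfamiliar with what an automated topology.conf buys you: Slurm reads a switch-to-node map and tries to place each job on nodes that share leaf switches, minimizing network hops. A minimal sketch of generating such a file from a hypothetical leaf-switch inventory (the switch and node names are assumptions):

    # Emit a Slurm topology.conf from a hypothetical leaf-switch -> node-range map
    # so the scheduler can do topology-aware placement (fewer hops per job).
    leaf_to_nodes = {
        "leaf-0": "gpu-[001-032]",
        "leaf-1": "gpu-[033-064]",
    }

    lines = [f"SwitchName={leaf} Nodes={nodes}" for leaf, nodes in leaf_to_nodes.items()]
    lines.append("SwitchName=spine-0 Switches=" + ",".join(leaf_to_nodes))

    with open("topology.conf", "w") as f:
        f.write("\n".join(lines) + "\n")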

Unlike GCP, OCI has been operating RoCE networks for quite a while now, even before the age of GenAI GPUs. They have a stacked bench of high-performance networking experts such as Jag and his team, who can reason about networking from first principles, such as industry-wide link flapping being caused by DSP recalibration, while drawing on their decades of experience. From our testing, we noticed that OCI’s RoCEv2 networking is very competitive with Spectrum-X Ethernet, assuming you are on the right NCCL version and have the custom OCI tuner plugin. Even though OCI uses Nvidia CX-7 NICs as part of their networking stack, we have noticed several small NCCL tuner regressions on OCI clusters above NCCL 2.21.5, even with the same number of communication SMs. We will show the comprehensive benchmark results we ran on up to 512 GPUs on OCI with real-world message sizes in our upcoming NCCL/RCCL networking deep dive article.

We recommend that OCI work with the Nvidia NCCL team to ensure proper regression testing on OCI clusters before each new NCCL version release, so that the out-of-the-box NCCL tuner delivers optimized performance on OCI.

Ultimately, OCI’s support and service teams stand out due to their technical expertise and customer-centric approach. Throughout our interactions, the OCI team consistently demonstrated deep technical expertise and a genuine commitment to customer success. Beyond GPUs, OCI is a Hyperscaler, which means it offers services such as databases, object storage, and CPU-based VMs for tasks like data processing or web scraping.

This comprehensive infrastructure eliminates the need for customers to transfer or stream data from other hyperscale providers to specialized GPU Neoclouds, resulting in significant operational efficiency and reduced complexity. Another notable advantage of going with a Hyperscaler is their approach to long-term customer engagements. Renting long-term compute resources often comes bundled with partnership opportunities for go-to-market (GTM) collaboration. These GTM partnerships have the potential to benefit customers by expanding their market reach within OCI’s extensive customer base. However, the actual effectiveness of these GTM partnerships can vary significantly depending on individual circumstances.

Regarding security, OCI excels by providing top-notch, enterprise-grade standards, including robust tenant networking isolation, VLAN isolation on RoCEv2 fabrics, and PKEY isolation for InfiniBand fabrics. In contrast, many GPU-focused cloud providers lack even basic certifications like SOC2 or ISO27001 compliance or necessary network isolation protocols, making OCI a preferred choice for enterprises with stringent security requirements.

Azure

Azure offers a robust GPU cloud infrastructure, recognized for its exceptional networking performance. From our internal testing on clusters with up to 128 H100s, Azure demonstrated impressive capabilities, notably utilizing InfiniBand SHARP for efficient in-network reductions. This advanced networking setup makes Azure a top-tier option for high-performance, large-scale AI workloads, particularly suited to intensive multi-node training scenarios. In our upcoming NCCL/RCCL deep dive article, we will show real-world NCCL benchmarks with and without SHARP on Azure InfiniBand networking.

Security is another standout strength for Azure. It holds an exceptional reputation for robust security and compliance practices, which has made it a trusted partner for government agencies, defense contractors, and leading AGI research labs such as OpenAI. Azure’s proven reliability at scale is evident in its successful management of massive GPU clusters, including support for OpenAI’s well-known deployments involving clusters of over 100,000 NVIDIA H100 GPUs, which highlights Azure’s ability to manage demanding and sensitive workloads securely. Note that OpenAI does complain a lot about Azure’s reliability for giant clusters, but OpenAI has extremely high standards for reliability since they have such giant clusters.

For workload management, Azure offers CycleCloud, a user-friendly, web-based UI for deploying and managing Slurm clusters. CycleCloud includes basic health checks to enhance reliability and operational awareness. We look forward to doing a complete analysis of CycleCloud. However, compared to more advanced offerings like CoreWeave’s fully automated active and passive health-checking systems, Azure’s solution has room for improvement. We specifically recommend that Azure consider adopting practices similar to CoreWeave’s comprehensive approach, such as regular automated checks (including NCCL tests, ib_write_bw, and DCGM diagnostics), as well as automated node draining and replacement to improve overall reliability.

Additionally, Azure offers a managed Lustre parallel file system, which provides high-performance storage tailored specifically for large-scale HPC and AI workloads. This integrated, optimized storage solution ensures that data-intensive workloads can scale efficiently and reliably. To further enhance their offerings, Azure would benefit from adopting more extensive passive and active monitoring solutions like those implemented by industry-leading GPU cloud providers like CoreWeave, ensuring even higher reliability and improved performance monitoring for their users.

Since Azure is a Hyperscaler, it comes with the same pros and cons. Azure has all the other services of a cloud, such as databases, object storage, and a parallel filesystem offering, which are needed for data processing and web scraping. Being a complete cloud means you don’t need to copy (or stream) data from a “main Hyperscaler cloud,” where the data processing is done, to a new cloud cluster; all your data is just there. Renting long-term compute from a Hyperscaler often comes with the added benefit of a “partnership” and go-to-market collaboration, where Azure helps you sell to enterprises and other Azure customers.

TogetherAI

From our testing, TogetherAI stands out prominently in the GPU cloud provider market. While their cluster offering alone would typically qualify them as a ClusterMAX™ Silver-level provider, what truly elevates them to ClusterMAX™ Gold is their exceptional support and technical expertise. TogetherAI’s team, led by Tri Dao (the inventor of Flash Attention), and their Together Kernel Collection (TKC) significantly boost customer performance. We estimate that roughly 30-40% of their GPU cloud customers leverage TKC. We don’t believe the value created by Together can be replicated elsewhere without cloning Tri Dao. From our testing, we have verified that TKC boosts real-world performance for training and inference and that the Tri Dao kernels are genuinely a performance boost.

Source: TogetherAI

Even those not utilizing TKC greatly benefit from Tri Dao’s team’s consulting expertise in debugging, optimizing, and troubleshooting training workloads. This combination creates a genuinely full-service, supportive experience, going far beyond merely renting Kubernetes or Slurm-managed clusters. Their competitive pricing further enhances their appeal.

Additionally, TogetherAI provides intuitive, user-friendly, managed Slurm and Kubernetes solutions directly accessible via their dashboard, enabling deployment with just a few clicks. As an NVIDIA portfolio company, TogetherAI also benefits from early access to new NVIDIA hardware, such as Blackwell GPUs, and collaborates closely with NVIDIA to develop optimized kernels tailored for next-generation GPUs.

Source: TogetherAI

However, there are areas for improvement. TogetherAI currently lacks the Pyxis plugin for container management in Slurm environments and does not offer robust passive health checks or weekly scheduled active health checks, leaving room for concerns regarding reliability. Their default Grafana dashboards are also basic compared to competitors.

We recommend that TogetherAI implement comprehensive health-check systems, similar to those offered by CoreWeave, along with richer, more detailed Grafana dashboards. Furthermore, since TogetherAI currently relies on infrastructure provided by other GPU cloud providers, such as Applied Digital or Crusoe, support resolution can be delayed by a game of broken telephone. Fortunately, this is set to improve significantly, as TogetherAI plans to deploy their own hardware infrastructure within the year, eliminating the current reliance on external providers and streamlining issue resolution.

LeptonAI

LeptonAI is the GPU cloud founded by the co-creators of PyTorch. LeptonAI does not own any GPUs but instead provides the ML platform software layer for managing GPUs and health checks. They claim to be your supercomputing team. You can either rent GPUs through them, in which case they rent GPUs from other providers and LeptonAI adds their software plus a couple of cents per GPU hour, or you can rent your GPUs from a provider like Nebius, which offers great pricing, and then buy the LeptonAI platform and support for a couple of cents per GPU hour. LeptonAI brings big tech (Google, Meta, etc.) ML platform experience to the broader world, making it accessible to everyday users. The engineers at LeptonAI are clearly good at what they are doing and have a strong product sense for what their customers want.

For training, they offer a Slurm-like method of submitting jobs. In our testing, it took a couple of minutes to patch our sbatch scripts to work on the LeptonAI platform, and it was decently intuitive to switch to the LeptonAI ML platform for training. LeptonAI should launch an API that is a full superset of sbatch instead of one that is merely similar to Slurm sbatch.

In the LeptonAI platform, you can view the node lifecycle in their console dashboard and see what jobs and state each node is in. They have superior node lifecycle visualization, and the only company that has a better node lifecycle dashboard is CoreWeave.

Source: LeptonAI

For passive health checks, LeptonAI runs gpud, their open-sourced solution for passive GPU health checks. It provides comprehensive coverage of most passive health checks. gpud is still improving, but it is already a strong solution.

Source: LeptonAI

LeptonAI also has active health checks such as DCGM diag and nccl-tests, but these are run manually through the UI dashboard rather than automatically on a weekly schedule as at CoreWeave, and LeptonAI does not provide reference numbers for what the NCCL test results should be. We recommend that they implement an option for customers to opt in to automatically scheduled active health checks. LeptonAI also does not have Megatron loss-convergence active health checks or Nvidia TinyMeg2 SDC-detector active health checks.

Source: LeptonAI

LeptonAI also has some beta features, such as a zero-impact NCCL profiler: a customer can click a checkbox and gain the full advantage of LeptonAI’s custom in-house NCCL profiler to visualize collective bottlenecks and optimize network bottlenecks.

Source: LeptonAI

ClusterMAX™ Silver Tier GPU Providers

Providers rated at ClusterMAX™ Silver demonstrate adequate GPU cloud offerings with a satisfactory balance of performance, security, and value; however, they typically have more noticeable gaps compared to Gold- or Platinum-tier services. While these providers meet basic industry standards for reliability, security, and support, they may lack advanced orchestration integration, exhibit moderate networking performance, or have a higher total cost of ownership (TCO), leading to higher prices for their customers. ClusterMAX™ Silver providers are receptive to customer and SemiAnalysis feedback and are actively seeking to improve their offerings to be competitive with ClusterMAX™ Platinum in the future.

AWS

The chief complaint that we hear about AWS is that their networking is worse than InfiniBand and Spectrum-X Ethernet. This is true, and AWS has been working on it with their new p5en EFAv3 16x200GbE H200 instance, which they released in December 2024.

Their EFAv3 instances are much closer to InfiniBand/Spectrum-X performance on nccl-tests than were EFAv2 instances.

AWS’s P5 EFAv2 instance is 32x100GbE, which is worse than InfiniBand/Spectrum-X/RoCEv2 but, in our nccl-tests testing, better than the GCP a3-mega instance (8x200GbE) released in April 2024. Our NCCL tests also show that their H100 p5 EFAv2 offering has better networking than GCP a3-mega, and their new H200 p5en EFAv3 (16x200GbE) offering has better networking than GCP a3-mega. GCP’s new H200 a3-ultra offering, released to the public in January 2025 with 8x400GbE RoCEv2 Ethernet, has better networking performance than AWS’s new p5en EFAv3 offering. We will show the results and benchmarks we ran on real-world message sizes in our upcoming NCCL/RCCL networking deep dive article.

AWS is not just a pure GPU cloud but also has all the other services of a cloud, such as databases, object storage, and a parallel filesystem offering, which are needed for data processing and web scraping. Being a complete cloud means you don’t need to copy (or stream) data from a “main Hyperscaler cloud,” where data processing is done, to a new cloud cluster; all your data is already there. Renting long-term compute from a Hyperscaler often comes with the added benefit of a “partnership” and go-to-market collaboration, where AWS helps you sell to enterprises and other AWS customers. Whether the GTM partnership is effective really depends on the circumstances.

AWS also offers a managed Lustre parallel filesystem, FSx for Lustre, for POSIX cluster-wide networked storage. For object storage, they have their famous S3 managed object storage service too.

Source: AWS

AWS provides a managed Slurm and Kubernetes offering named Hyperpod, which significantly simplifies cluster setup. They offer a UI dashboard for setup and easy-to-follow instructions for deploying their managed offering.

Source: AWS

Hyperpod includes basic passive and basic active health checks, and it also integrates straightforward Grafana dashboards for monitoring system health. Unfortunately, out of the box it is missing automated active health checks such as nccl-tests, Nvidia’s TinyMeg2 SDC detector, and weekly Megatron convergence tests.

Source: AWS
Source: AWS

To enhance their service further, AWS should continue investing in networking improvements and consider making advanced passive and active automated weekly scheduled health checks, akin to those utilized by CoreWeave, either the out-of-the-box default or a simple checkbox that end customers can opt into.

Lambda Labs

Lambda Labs is highly regarded as a go-to provider for on-demand GPU instances, primarily due to their exceptional user interface and intuitive console experience, especially their seamless JupyterLab integration. Despite the availability of other providers, such as Nebius offering H100 SXM GPUs at half the price, Lambda remains popular for on-demand GPU instances due to other offerings having poor UX or security. Lambda Labs offers H100 SXM at $2.99/hr/GPU, and they typically set the market rate for on-demand offerings due to their high on-demand volume. When Lambda Labs lowers or raises its on-demand offering price, the rest typically do the same.

Users have also expressed interest in broader base image choices beyond the standard Lambda Stack base image for the on-demand offering. Many users, and SemiAnalysis’s own testing, have shown that on-demand instance boot times are excessively long, typically around 30 minutes. For comparison, Crusoe’s H100 SXM instance boots up in less than 90 seconds; this should be the bar that Lambda Labs aims for. Additionally, Lambda’s default on-demand instances incorrectly set the CUDA toolkit and CLI tool paths to /usr/bin/nvcc instead of the industry-standard /usr/local/cuda/bin/nvcc, causing compatibility issues with many open-source repositories. We have spoken with the team at Lambda Labs, and they are committed to reducing the boot time of their on-demand instances.

Source: Lambda Labs, SemiAnalysis
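
As a quick illustration of why the nvcc path issue above matters, many build systems derive CUDA_HOME from wherever nvcc resolves; a small, hypothetical sanity check a user could run on a fresh instance looks like this:

    # Hypothetical sanity check: confirm nvcc resolves to the standard CUDA toolkit
    # location, since build systems often derive CUDA_HOME from the nvcc path.
    import os
    import shutil

    expected = "/usr/local/cuda/bin/nvcc"
    found = shutil.which("nvcc")

    if found != expected:
        print(f"nvcc resolves to {found!r}, expected {expected!r}; "
              "prepending /usr/local/cuda/bin to PATH for this process")
        os.environ["PATH"] = "/usr/local/cuda/bin:" + os.environ.get("PATH", "")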

Lambda Labs also provides managed Kubernetes services, greatly simplifying container orchestration for users. Their managed Kubernetes offering features an out-of-the-box console UI and Grafana monitoring dashboards for viewing node and GPU metrics. Furthermore, they provide out-of-the-box nccl-tests scripts for end customers to verify their networking performance. They also offer a Vast Data-based high-speed parallel filesystem for networked storage.

Source: Lambda Labs
Source: Lambda Labs

However, their current Slurm offering is unmanaged and, from our testing, is not good. They lack the Pyxis Slurm plugin and a lot of other Slurm features. Fortunately, Lambda is actively developing a managed Slurm solution, which is expected to improve the user experience significantly.

However, Lambda still lacks several essential features, such as automated passive and active health checks, and its metrics dashboard currently omits crucial GPU metrics that are found in competitor solutions like CoreWeave’s. We recommend that Lambda Labs look into what CoreWeave has for passive and active health checks and implement a metrics dashboard comparable to CoreWeave’s fantastic out-of-the-box Grafana dashboard.

Firmus / Sustainable Metal Cloud

Firmus/Sustainable Metal Cloud, or SMC for short, is a Singapore and Australia-based AI cloud and GPU service provider dedicated to enabling sustainable AI innovation. They offer Slurm and Kubernetes scheduling solutions, including Pyxis for containerized workloads, and utilize WEKA’s high-performance storage platform to support large-scale AI applications. From our testing, it’s a fairly decent offering.

In the MLPerf Training v4.0 benchmarks, SMC demonstrated impressive performance training the GPT-3 175B model. Additionally, SMC has submitted verified MLPerf power consumption results, confirming that H100 immersion cooling consumes less power than comparable air-cooled GPU solutions. They claim that this power saving translates to lower TCO, which in turn translates to lower prices for their customers.

Source: Sustainable Metal Cloud

They are one of the few clouds besides CoreWeave and Azure to have InfiniBand SHARP in-network reductions enabled, and they have shown us through nccl-tests that their networking performance is superior. But none of their customers are technical experts, so their customers haven’t used SHARP, as it requires tuning at the customer’s application level too. They have also shared nccl-tests results without SHARP enabled and shown competitive results there as well.

A concern noted is that SMC’s GPUs operate approximately 10 degrees warmer under load than comparable properly deployed air-cooled offerings due to the use of air-cooled heatsinks in their immersion environment, leading to a 1-2% performance reduction. Despite this, SMC asserts that their pricing more than compensates for this minor performance loss, offering better performance per TCO. They are exploring the adoption of immersion-specific heatsinks to address this issue.

Currently, SMC lacks self-service deployment options for Slurm and Kubernetes, relying on SMC’s own engineers to assist with setup. It is recommended that they develop a user interface or command-line interface for streamlined deployment. Additionally, implementing automated passive health checks and automated weekly scheduled active health checks, similar to those used by CoreWeave, would enhance system reliability. The absence of basic Grafana dashboards for monitoring GPU temperatures and activity is another area for improvement, and adopting CoreWeave-style out-of-the-box monitoring could be beneficial.

SMC has shown receptiveness to customer and SemiAnalysis feedback, actively considering these recommendations to enhance their offerings and remain competitive in the AI cloud services market.

Scaleway

From our testing, Scaleway offers robust Slurm and Kubernetes solutions, complemented by a high-performance, managed file system powered by VAST Data. This integration ensures scalable and efficient data management for AI and HPC workloads. During our testing, we observed that Scaleway supports NVIDIA’s Pyxis plugin, which enables seamless container integration within Slurm. Their technical team demonstrates a strong understanding of these technologies.​

As a GDPR-compliant provider and an NVIDIA NCP partner, Scaleway emphasizes data privacy and leverages cutting-edge GPU technology. However, their use of gold-plated DGX Hopper chassis results in a higher total cost of ownership (TCO), and this increased cost is often passed on to customers. We recommend that Scaleway explore OEM alternatives, such as Dell or Supermicro HGX SKUs, or consider ODM chassis options, which can deliver the same performance at a reduced cost. Note that we do not recommend buying gold-plated chassis just to be an NVIDIA NCP partner.

Source: Scaleway

Currently, Scaleway lacks self-service deployment options for Slurm and Kubernetes. Although their engineers assist with deployments, implementing a UI or CLI tool for self-service would enhance user experience. Additionally, the absence of automated passive health checks and automated weekly scheduled active health checks is unfortunate. We suggest that Scaleway examine CoreWeave’s approach to health checks and consider adopting similar practices.

Furthermore, Scaleway does not provide basic Grafana dashboards for monitoring GPU metrics such as temperature and SM activity. Implementing these dashboards would offer valuable insights into system performance. Encouragingly, Scaleway has been receptive to customer and SemiAnalysis feedback, actively seeking to address these gaps and enhance their offerings.​

ClusterMAX™ Bronze Tier GPU Providers

The ClusterMAX™ Bronze tier includes GPU cloud providers that fulfill minimum essential criteria but consistently exhibit significant shortcomings in key evaluation areas. Common issues may include inconsistent technical expertise or support, subpar networking performance, unclear SLAs, limited integration with popular tools like Kubernetes or Slurm, or less competitive pricing. Providers in this category typically need considerable improvements to enhance reliability and customer experience. GPU providers can also land in this tier if they have provided subpar solutions for the past couple of years.

Some of the providers in this category are already making considerable effort to catch up. Google Cloud is one such example – and we believe GCP and some other providers are already on a fast path towards ClusterMAX™ Platinum/Gold by our next ClusterMAX™ exercise in 3-6 months.

Google Cloud

For the longest time, GCP has provided inferior GPU offerings with worse networking and worse out of the box features. It has been in “catchup” mode since April 2024. Many customers have complained about their GPU offerings, but Google Cloud Platform (GCP) is taking in feedback, rapidly improving, and trying to catch up to its competition.

To provide historical context, their first H100 offering, called “a3-high,” was released in August 2023 and featured 800Gbit/s “Fastrak TCP” networking bandwidth per node. At that time, Oracle, Microsoft, all Neocloud giants, and most Emerging Neoclouds were offering 3200Gbit/s of on-paper networking speeds. This means GCP had 25% of the networking bandwidth compared to their competitors. Most of the GCP customers that used a3-high were not very happy. We will call this phase of GCP’s GPU journey the “not good at all phase.”

Google recognized this feedback from customers and in April 2024 released their second, improved H100 offering called “a3-mega”, which doubled networking bandwidth per node from 800Gbit/s to 1600Gbit/s of “Fastrak TCP”. While this was a significant improvement, it still offered only half the bandwidth of competitors such as Oracle, Microsoft, CoreWeave, and AWS.

According to our NCCL tests, they are twice as slow as their competitors on real-world message sizes. On end-to-end training performance, being twice as slow on NCCL networking translates to roughly 10% worse MFU on O(Llama 70B)-sized training runs and 15-20% worse MFU on O(8x7B) mixture-of-experts sparse models. For the longest time, this offering also did not have the LL128 NCCL protocol, leading to even worse training and NCCL networking performance, and it required the end user to set complex env vars to get the NCCL net/tuner plugin to work. Furthermore, their Slurm recipe was buggy and hard to set up. We will refer to this as the “catch-up phase,” where GCP is clearly trying to improve but is still not yet on par with its competitors.

GCP continues to gather customer feedback, and in January 2025, they launched their a3-ultra instance, which finally offers 3200Gbit/s of RDMA Ethernet networking with ConnectX-7 NICs per node, effectively increasing the networking bandwidth per node. This update brings GCP closer to matching the capabilities of its competitors, including Oracle, Microsoft, and CoreWeave.

In practice, they are still not quite on par on real-world NCCL collective networking, which we will explain more about below. With this new a3-ultra SKU, they have moved from TCP to RDMA over Ethernet. As most people are aware, RDMA is often chosen as the collective network protocol over TCP due to its lower latency and higher AI collective performance. We are glad that GCP finally moved towards a more industry-standard networking setup for GPUs, but this comes 18 months after their competitors launched their 3200Gbit/s RDMA networking offerings. We will refer to this as the “almost caught up” phase. Note that the majority of GCP’s customers and GPU fleet are currently on A3-Mega, which means most of their customers are still experiencing subpar networking, with anywhere from 10-20% worse performance.

By mid-2025, GCP will be making their latest A4 B200 and A4X GB200 instances generally available, which will be competitive on paper with AWS, Azure, OCI, and the other Neoclouds that will offer 400Gbit/s per GPU. GCP will also continue to improve and launch new software features, some of which will set industry standards. We will call this the “setting the bar” phase.

Due to the subpar experience and performance of A3-High and A3-Mega, GCP has lost a significant amount of customer confidence in its product, which will take time to regain. We believe that by mid-2025, GCP will finish “catching up” and will soon be raising the bar across the industry and regaining customer confidence. We believe Google’s GPU offering could reach a ClusterMAX™ Gold or ClusterMAX™ Platinum tier rating.

In January 2025, we reached out to Google, showing them our NCCL performance tests and a list of all the customer complaints and feedback GCP customers were telling us. The GCP team was quite receptive to the feedback and is working quickly to address it.

The first piece of feedback they acknowledge is the subpar networking performance of A3-High and A3-Mega, which comprise the majority of their GPU fleet. They are addressing this with the launch of a3-ultra, which comes with the industry-standard 3200Gbit/s of RDMA bandwidth per node. Their upcoming A4 B200 and A4X GB200 offerings will be competitive on paper, in terms of speeds, with other B200 and GB200 offerings in the industry.

A3-mega instances were also lacking LL128 protocol, which meant that their real-world NCCL performance for real-world message sizes was degraded. In January 2025, they released a fix to all of their customers enabling LL128 protocol on a3-mega. A3-ultra comes out of the box with LL128 NCCL protocol, so it was great to see the improvements in their newer SKUs. A3-ultra still has slightly worse performance than OCI Ethernet and Azure InfiniBand, but on end-to-end training performance, GCP is only 1-2% MFU less than a comparable InfiniBand reference offering. Note that each rail group size for GCP a3-ultra is still only 4 nodes, versus on OCI, Azure and most Neoclouds, it is 32 nodes. This means it will take more hops to do collectives leading to more congestion and slower performance. We will explain this more in our NCCL deep dive article. For a3-mega, they are currently still missing NVLSTree NCCL algorithm. NVLSTree NCCL algo allows for faster networking collective performance by utilizing the NVLS functions in the NVSwitch for multi-node performance. They are currently working on implementing it. For a3-ultra, they have NVLSTree & NVLS & RING & TREE & PAT algorithm support out of the box so it was good to see that GCP is shipping fully functioning products in their latest SKU. In our upcoming NCCL/RCCL deep dive article, we will show the performance benchmarks we conducted across GCP instances and how that compares to other offerings.

From talking to GCP customers, all of them complained about the activation energy needed to properly set NCCL environment variables, correctly link the GCP network/tuner plugin, and debug it to ensure it works. This wastes expensive GPU time while customers debug their NCCL env vars, whereas on Azure and OCI, NCCL works out of the box. GCP acknowledged this feedback and is looking into how to make this experience smoother. The next thing customers have complained about is that GCP does not automatically generate a Slurm topology.conf for topology-aware scheduling, instead making the user do the topology ordering in their sbatch script. GCP has addressed this feedback and implemented a fix this year.

Source: GCP

The third piece of feedback from GCP customers is that GCP currently doesn’t have a fully managed Slurm offering. GCP acknowledged this feedback and is actively investigating it. GCP currently has Cluster Toolkit, which many customers use for their clusters, but it does not have a GUI-based setup option, is not managed, and does not have options for setting up out-of-the-box scheduled automated weekly active health checks. While Cluster Toolkit is a massive improvement over the unmanaged Slurm recipe they had six months ago, it is still missing many features, such as being managed.

The fourth piece of feedback from GCP customers, which GCP acknowledges, concerns customer technical support: GCP is improving it by assigning an engineer in charge of the entire lifecycle of the customer and their tickets, from creation to resolution. Currently, GCP just sends a bunch of people to hop on a call, but what customers want is one of their subject matter engineers to “own” the issue from triage to hotfix to long-term resolution. The issue of having dozens of product managers and engineers hopping on a call with customers is not exclusive to GCP’s GPU offering but is a Google-wide issue that they need to address.

Note that most of Google’s internal teams do GenAI training and inference on TPUs; as such, the GCP GPU experience is not the same as the internal Google ML infra experience. One of the few internal Google teams that utilizes cloud GPUs is DeepMind’s Isomorphic Labs. Although there is a tight feedback loop between GCP’s customers and GCP’s solution architect team that does dogfooding, it is nowhere near the level of dogfooding that happens in a company such as AWS, which famously dogfoods everything.

Unlike on OCI or CoreWeave, monitoring is not set up out of the box. Although there is a relatively easy-to-set-up monitoring dashboard with Ops Agent, it is nowhere near as advanced as CoreWeave’s Grafana monitoring dashboards and metrics. Every single customer wants GPU monitoring; as such, we recommend this be set up out of the box. In terms of health checks, GCP does run passive health checks on the VMs, but there is no out-of-the-box solution to run weekly scheduled active health checks on idle nodes, unlike at CoreWeave and Nebius. GCP does have cluster-health-scanner, but it is not automatically scheduled weekly and is not an out-of-the-box solution. We recommend that GCP spend some time and money trying out CoreWeave’s SUNK offering for themselves and seeing how CoreWeave performs health checks and monitoring.

GCP is not just a GPU cloud but also has all the other services of a cloud, such as Bigtable, databases, object storage, and a parallel filesystem offering, which are needed for data processing and web scraping. Being a complete cloud means you don’t need to copy (or stream) data from a “main Hyperscaler cloud,” where the data processing is done, to a Neocloud cluster; all your data is just there.

In terms of security, GCP’s security is top-notch and world-class, including proper tenant network isolation and encryption in transit. Any enterprise with strict security requirements should probably go with a Hyperscaler.

Other Bronze Providers

Other providers land in the ClusterMAX™ Bronze tier by not having a non-beta, out-of-the-box Slurm and/or Kubernetes offering, or by having buggy Slurm and/or Kubernetes offerings that are not properly set up. We have given them feedback, and most of them are receptive to it and are currently building and launching out-of-the-box Slurm and/or Kubernetes offerings. Some of these providers have been running GPU cloud services for ages now but only obtained SOC2 compliance within the last month. While we are grateful that they now have SOC2 compliance, we cannot place them any higher for now since they only recently got it.

For what it’s worth, some of these providers’ offerings, such as DataCrunch’s on-demand single-node offering, are quite suitable for development work. We evaluated the DataCrunch on-demand single-node offering and quite enjoyed it. But unfortunately, their production cluster is not suitable for inferencing or training.

Source: Datacrunch

TensorWave also has a beta managed Slurm and managed Kubernetes offering and is developing passive and active health checks. We believe that TensorWave’s offering has the potential to be ClusterMAX™ Silver by the next time we evaluate them.

ClusterMAX™ UnderPerform GPU Providers

GPU providers placed in the UnderPerform category fail to meet critical basic industry and security requirements across multiple important evaluation metrics. Providers in this tier generally exhibit substantial issues, including inadequate security practices, poor reliability or uptime, unclear or misleading marketing, limited technical knowledge or customer support, and insufficient orchestration capabilities.

Most providers land themselves in this category by not having even basic security certifications, such as SOC2 or ISO 27001. Some of these providers also fall into this category by hosting underlying GPU providers that are not SOC 2 compliant either.

Security is a critical make-or-break factor for many GPU renters, as they store their proprietary model weights on GPU clouds; these weights cost tens of thousands to tens of millions of dollars to train and are the core intellectual property of most GenAI companies. Furthermore, training and/or inferencing these ML models can involve proprietary or personally identifiable information or other user data. The customers of companies that rent from GPU clouds do not want their data leaked because of an insecure GPU cloud. In EU countries, the stakes are higher, as there are heavy fines for leaking user data under GDPR law. This is similar to an airline having FAA certification: some people might want to fly on airlines that don’t have FAA certification, but most won’t.

Some of these GPU providers even told us that they have lost potential sales due to not having SOC2 and are in the process of gaining SOC2 compliance. We welcome providers in this category to obtain SOC2 compliance.

Some GPU providers in this category even admit on their public websites that there may be security and privacy concerns and that traffic between the GPU servers and the internet may be heavily logged by a 3rd party’s networking equipment.

Source: SaladCloud

Some GPU providers, such as Massed Compute, land themselves in the UnderPerform tier by being unhelpful to the community, inundating the internet with AI-generated SEO junk articles full of incorrect information. This is harmful to the ML community, as it adds noise to an already noisy internet and actively leads people astray.

For example, when searching for “H100 vs A100 L2 Cache” on Google, the Massed Compute AI-generated junk article with incorrect information shows up first. They are actively spreading misleading information, which is a horrible starting point for a GPU provider.

Source: Google Search

If you click the link, it states that the H100’s L2 cache size is 256MB, which is completely wrong (the H100’s L2 cache is 50MB). We recommend that Massed Compute stop spamming the internet with AI-generated junk.

Source: Massed Compute

Some providers in this category also lack the correct networking drivers and GPU drivers, leading to worse NCCL performance. Additionally, some providers in this category have potential known security issues, such as failing to implement proper tenant isolation using VLANs and pKeys.

This is the conclusion of our first ever ClusterMAX™ Rating System update. Please stay tuned for further articles and additional ClusterMAX™ updates.  

Although many observers of GPU pricing trends may characterize AI Neocloud H100 rental pricing as “collapsing,” we have not been surprised at all and see this as a reasonable and logical decline in the cost of computing as the supply of H100s has improved. Equally important is the increasing availability of the B200 and GB200, for which we are already seeing term rental contracts being signed, which is starting to push down the market cost of computing and, therefore, the rental price for H100s.

In the following section, we will recap the GPU rental pricing trends seen in 2024, the outlook for 2025, and how to analyze the total cost of ownership and returns for AI Neoclouds. We will also discuss the upcoming CoreWeave IPO and how we apply our SemiAnalysis AI Total Cost of Ownership framework towards analyzing return on investment and unit economics for CoreWeave.



