Highlights:

  • Run:ai, also known as Runai Labs Ltd., provides software to optimize the performance of server clusters equipped with graphics processing units.
  • The company’s software also prevents what are known as memory collisions, which occur when two AI workloads attempt to utilize the same section of a GPU’s memory simultaneously.

Recently, Nvidia Corp. announced that it has acquired Run:ai, a startup specializing in software designed to optimize the performance of graphics card clusters.

The terms of the Nvidia Run:ai acquisition deal were not disclosed. However, according to a news report citing two sources familiar with the matter, the transaction values Run:ai at USD 700 million. Before the acquisition, the Tel Aviv-based startup had raised roughly one-sixth of that amount in funding.

Run:ai, also known as Runai Labs Ltd., provides software to optimize the performance of server clusters equipped with graphics processing units.
According to the company, its technology enables a GPU environment to handle up to 10 times more AI workloads than would otherwise be achievable. It does so by rectifying several common processing inefficiencies that frequently affect GPU-powered servers.

The first issue Run:ai tackles arises from the fact that AI models are frequently trained on multiple graphics cards. To distribute a neural network across a cluster of GPUs, developers divide it into multiple software fragments and train each one on a different chip. During training, these fragments must frequently exchange data with one another, which can lead to performance issues.
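For illustration, the following is a minimal sketch of the kind of model splitting described above, written in PyTorch rather than against Run:ai's own tooling (which the source does not detail). It assumes a machine with two GPUs, and the layer sizes are invented; each fragment lives on its own device, and the activation hand-off between them is the data exchange that can stall training.

```python
import torch
import torch.nn as nn

class TwoFragmentNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Fragment A lives on the first GPU, fragment B on the second.
        self.fragment_a = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to("cuda:0")
        self.fragment_b = nn.Sequential(nn.Linear(512, 10)).to("cuda:1")

    def forward(self, x):
        x = self.fragment_a(x.to("cuda:0"))
        # Copy the intermediate activations to the second fragment's GPU.
        # This device-to-device hand-off is the data exchange that can stall
        # training when the receiving fragment is not ready.
        return self.fragment_b(x.to("cuda:1"))

model = TwoFragmentNet()
output = model(torch.randn(32, 1024))  # batch of 32 examples
```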

If an AI fragment needs to exchange data with a different part of the neural network that is not currently active, it will need to pause processing until the latter module becomes available. The resulting delays impede the efficiency of the AI training workflow. Run:ai ensures that all the necessary AI fragments required for data exchange are online simultaneously, eliminating unnecessary processing delays.
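The behavior described here resembles what cluster schedulers call gang scheduling: a job's fragments are launched all together or not at all. Below is a simplified, hypothetical sketch of that admission rule; the function and job list are invented for illustration and are not Run:ai's implementation.

```python
# Simplified, hypothetical gang-scheduling admission check. The idea: admit
# a job only when every one of its fragments can start at once, so no
# fragment stalls waiting for a peer. Not Run:ai's implementation.
def try_admit(fragments_needed: int, free_gpus: int) -> bool:
    """Admit a job only if all of its fragments can run simultaneously."""
    return free_gpus >= fragments_needed

free_gpus = 8
pending_jobs = [4, 6, 2]  # GPUs needed by each job's set of fragments

for needed in pending_jobs:
    if try_admit(needed, free_gpus):
        free_gpus -= needed
        print(f"job needing {needed} GPUs admitted; {free_gpus} GPUs free")
    else:
        # Starting a partial job would leave its fragments stalled
        # mid-exchange, so the whole job is held back instead.
        print(f"job needing {needed} GPUs held; only {free_gpus} free")
```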

The company’s software also prevents what are known as memory collisions, which occur when two AI workloads attempt to utilize the same section of a GPU’s memory simultaneously. GPUs can automatically resolve such errors, but the troubleshooting process consumes time. Throughout an AI training session, the time spent resolving memory collisions can accumulate significantly and impede processing speed.
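One common way to keep co-located workloads out of each other's memory is to cap each process's share of a GPU's memory up front. The sketch below illustrates that general idea with PyTorch's built-in per-process memory cap; it assumes a CUDA-capable machine and is not a description of Run:ai's actual mechanism.

```python
import torch

# Illustration of partitioning GPU memory between co-located workloads:
# cap this process at half of GPU 0's memory so a neighboring process can
# safely use the other half. Requires a CUDA-capable machine.
torch.cuda.set_per_process_memory_fraction(0.5, device=0)

# Allocations past the cap now fail fast with an out-of-memory error
# instead of silently contending with the neighboring workload.
x = torch.empty(4096, 4096, device="cuda:0")
```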

Running multiple AI workloads on the same GPU cluster can also result in various other types of bottlenecks. If one of the workloads demands more hardware resources than expected, it could utilize infrastructure designated for different applications, thus slowing them down. Run:ai incorporates features to guarantee that each AI model receives sufficient hardware resources to fulfill its assigned task without experiencing delays.
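As a rough illustration of the guaranteed-resources idea, the hypothetical allocator below lets a workload borrow spare GPUs but never capacity reserved for other applications. All names and numbers are invented for this sketch.

```python
from dataclasses import dataclass

# Hypothetical quota allocator: a workload always receives its reservation
# and may borrow spare GPUs, but never capacity reserved by other workloads.
@dataclass
class Workload:
    name: str
    reserved_gpus: int   # guaranteed share
    requested_gpus: int  # current demand

def grant(w: Workload, total_gpus: int, others_reserved: int) -> int:
    spare = total_gpus - w.reserved_gpus - others_reserved
    return min(w.requested_gpus, w.reserved_gpus + max(spare, 0))

job = Workload("training-run", reserved_gpus=2, requested_gpus=6)
# 8 GPUs total, 4 reserved elsewhere: the job gets its 2 reserved GPUs
# plus the 2 unreserved spares, i.e. 4 GPUs.
print(grant(job, total_gpus=8, others_reserved=4))
```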

In a blog post, Alexis Bjorlin, vice president and general manager of Nvidia’s DGX Cloud division, stated, “The company has built an open platform on Kubernetes, the orchestration layer for modern AI and cloud infrastructure. It supports all popular Kubernetes variants and integrates with third-party AI tools and frameworks.”
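For context on what running on Kubernetes involves, the snippet below shows how a GPU workload is conventionally described to Kubernetes using the official Python client. The "nvidia.com/gpu" resource name is Kubernetes' standard way of requesting Nvidia GPUs; nothing here is specific to Run:ai, and the container image name is hypothetical.

```python
from kubernetes import client

# How a GPU workload is conventionally described to Kubernetes using the
# official Python client. "nvidia.com/gpu" is the standard Kubernetes
# resource name for Nvidia GPUs; nothing here is Run:ai-specific.
container = client.V1Container(
    name="trainer",
    image="registry.example.com/trainer:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "2"}  # ask the scheduler for two GPUs
    ),
)
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="training-pod"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)
```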

Run:ai offers its core infrastructure optimization platform alongside two other software tools. The first, Run:ai Scheduler, furnishes an interface for assigning hardware resources to development teams and AI projects. The company also provides Run:ai Dev, which aids engineers in swiftly configuring the coding tools required for training neural networks.

Nvidia includes Run:ai’s software with several of its products, including Nvidia AI Enterprise, a suite of development tools for its data center GPUs, and its DGX series of AI-optimized appliances. Run:ai is also accessible on DGX Cloud, which allows companies to utilize Nvidia’s AI appliances within major public clouds.

Bjorlin stated that the chipmaker will maintain the current pricing model for Run:ai’s tools “for the foreseeable future.” Simultaneously, Nvidia will introduce product enhancements for the software, prioritizing features aimed at optimizing DGX Cloud environments. “Customers can expect to benefit from better GPU utilization, improved management of GPU infrastructure, and greater flexibility from the open architecture,” Bjorlin elaborated.