March 19, 2025

ikayaniaamirshahzad@gmail.com

Building Scalable Synthetic Data Generation Pipelines for Perception AI with Databricks and NVIDIA Omniverse


Training AI models for real-world applications requires vast amounts of labeled data, which can be costly, time-consuming, and difficult to obtain at scale. Synthetic data generation in simulated environments offers a powerful alternative by enabling AI models to learn from physically accurate, controlled, and scalable virtual datasets before deployment.

Combining Omniverse Replicator, a core extension of Isaac Sim (a reference robotic simulation application), with the Databricks Data Intelligence Platform provides an end-to-end workflow for developing domain-specific AI models in industries like manufacturing, logistics, healthcare diagnostics, and robotics. By combining synthetic data generation, automated AI workflows, and scalable cloud infrastructure, organizations can accelerate AI development while reducing data acquisition challenges and improving model accuracy.

This blog explores the technical foundations of this integration and its real-world applications, and demonstrates how the collaboration between Databricks and NVIDIA is accelerating machine vision applications. By pairing the Databricks Data Intelligence Platform with NVIDIA high-performance computing, enterprises can build, train, and deploy vision models faster than was previously practical.

Architecture Patterns

The technical foundations of the integration start with a reference architecture that defines interfaces, data models, and communication protocols. Below is a generalized workflow that demonstrates the integration of applications developed with NVIDIA Omniverse and the Databricks Data Intelligence Platform to provide an end-to-end AI model training pipeline.

The steps within the workflow are as follows:

  1. Provide initial input data and parameters to define synthetic data generation.
    • Example: 3D artifacts of an object and scene descriptions of specific lighting, with randomization and variability parameters to capture expected variation.
  2. Generate synthetic data with Omniverse Replicator for Isaac Sim.
    • Example: Generate images of different variations of a specific CAD object captured from different angles.
  3. Process the data within a Lakehouse format, such as Delta Lake, to prepare for Mosaic AI Model Training.
    • Example: Configure Databricks Lakeflow Pipelines to transform and harmonize the dataset and associate metadata for additional context.
  4. Train or fine-tune models for domain-specific use cases on Databricks.
    • Example: Track experiments across model training runs for the You Only Look Once (YOLO) machine vision model. Store models in Databricks Unity Catalog for model governance throughout the MLOps lifecycle.
  5. Serve the domain-specific models for inference in pipelines, applications, and workflows.
    • Example: Register models in Databricks Unity Catalog and serve them through easy-to-deploy Databricks Model Serving endpoints.

Within this architecture, Delta Lake is used as the integration layer between NVIDIA Omniverse and Databricks. We bridge the two platforms with a prototype custom writer, which allows an application developed with Omniverse to write synthetic data directly into the Lakehouse. With this approach, instead of writing the data to disk as PNG and NumPy files, Omniverse-powered applications write the generated synthetic images and corresponding metadata in Delta Lake format. The files land directly in cloud storage and are registered to Unity Catalog, where they are further processed using Databricks so they are available for downstream model training.
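To make the idea of writing images and metadata as table rows more concrete, here is a minimal sketch of what one row per generated frame could look like. It is not the prototype writer itself: the `FrameRecord` schema, the `make_record` and `to_rows` helpers, and all field names are hypothetical, and the actual Delta Lake write is left out.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass
class FrameRecord:
    """One synthetic frame as a single table row (hypothetical schema)."""
    frame_id: str
    image_png: bytes        # encoded image bytes stored inline instead of a PNG file
    width: int
    height: int
    annotations_json: str   # labels and bounding boxes, serialized as JSON
    scene_params_json: str  # randomization settings used to render this frame


def make_record(image_png, width, height, annotations, scene_params):
    # A content hash gives a stable ID, so the same frame always maps to the same key.
    frame_id = hashlib.sha256(image_png).hexdigest()[:16]
    return FrameRecord(
        frame_id=frame_id,
        image_png=image_png,
        width=width,
        height=height,
        annotations_json=json.dumps(annotations, sort_keys=True),
        scene_params_json=json.dumps(scene_params, sort_keys=True),
    )


def to_rows(records):
    # Plain dicts are a convenient hand-off format for a table writer.
    # Appending these rows to a Delta table is intentionally omitted here.
    return [asdict(record) for record in records]
```

Keeping the image, its annotations, and the scene parameters in the same row is one way to preserve the link between a frame and the settings that produced it, which is what makes downstream filtering and traceability straightforward.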

A New Pattern for Machine Vision MLOps

The NVIDIA Omniverse and Databricks integration establishes a new paradigm for machine vision development encompassing synthetic data generation and easy-to-use, industrial-grade AI. Within manufacturing environments, defect detection models often encounter three primary challenges: identifying new defects, adapting to new products, and performing in diverse real-world environments.

To tackle these challenges, the NVIDIA Omniverse platform enables customers to build custom synthetic data generation pipelines. Developers can create entirely new camera angles, lighting conditions, and physical scenarios in their applications, improving model robustness and adaptability beyond what traditional augmentation methods, such as rotating or brightening images, can achieve.
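As a rough illustration of randomizing camera and lighting conditions, the sketch below samples scene configurations in plain Python. It does not use the Omniverse Replicator API; the parameter names, value ranges, and the `sample_scene` and `sample_scenes` helpers are assumptions chosen only to show the shape of such a configuration.

```python
import random


def sample_scene(rng):
    """Draw one randomized scene configuration (all ranges are illustrative)."""
    return {
        "camera_azimuth_deg": rng.uniform(0.0, 360.0),
        "camera_elevation_deg": rng.uniform(10.0, 80.0),
        "camera_distance_m": rng.uniform(0.5, 2.0),
        "light_intensity": rng.uniform(200.0, 2000.0),
        "light_color_temp_k": rng.uniform(3000.0, 6500.0),
        "surface": rng.choice(["clean", "corroded", "scratched"]),
    }


def sample_scenes(n, seed=0):
    # A seeded generator makes the dataset reproducible from (n, seed) alone.
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]
```

Each sampled dictionary would then drive one rendered frame, so the variation comes from the scene itself rather than from post-processing an existing image.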

By automating image generation, the synthetic data generation process becomes a tunable parameter within Databricks’ Managed MLflow. These adjustments can be made alongside traditional hyperparameters like learning rate and batch size. As you identify which variations impact model accuracy, you can refine your training approach to focus on the most effective combinations of synthetic data and hyperparameters while minimizing time spent on less productive configurations.
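One way to picture synthetic data as a tunable parameter is a single search space that crosses training hyperparameters with data generation settings. The sketch below is an assumption-laden illustration: the grid keys, the `build_search_space` and `best_config` helpers, and the example values are hypothetical, and logging each configuration as a tracked run is left out.

```python
import itertools


def build_search_space(train_grid, synth_grid):
    """Cross training hyperparameters with synthetic-data settings."""
    merged = {**train_grid, **synth_grid}
    keys = sorted(merged)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*(merged[k] for k in keys))
    ]


def best_config(results):
    # results: list of (config, validation_score) pairs; higher is better.
    return max(results, key=lambda pair: pair[1])[0]


# Illustrative grids: two traditional hyperparameters, two data settings.
train_grid = {"learning_rate": [1e-3, 1e-4], "batch_size": [16, 32]}
synth_grid = {
    "num_synthetic_images": [1000, 5000],
    "lighting_variation": ["low", "high"],
}
space = build_search_space(train_grid, synth_grid)  # 2 * 2 * 2 * 2 = 16 configs
```

Each entry in `space` would correspond to one training run, and comparing validation scores across entries shows which data variations are worth keeping in the next round.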

Unlocking New Use Cases

By having synthetic data as a tunable parameter, new use cases are unlocked for manufacturers without disrupting actual operations:

  1. Defect Detection within Manufacturing Quality Control – Out-of-the-box machine vision models can only recognize objects represented in the real-world data they were trained on. With this workflow, manufacturers can generate synthetic images of various defects, such as corrosion, texture defects, and hairline fractures, or of variations in physical traits such as color and size, using the 3D CAD models of their products. Companies can then fine-tune models and serve them on Databricks to catch defects before the products ship.
  2. Generative Product Design – Before products transition from concept to production, design teams first create detailed 3D renderings of the intended product in CAD software tools. Using these same designs alongside Omniverse Replicator, we can now generate the synthetic data required to fine-tune generative design models in Databricks, enabling design space exploration long before physical manufacturing begins. This integrated approach will help manufacturers generate viable and optimized design solutions (represented as 2D/3D models) from a given set of requirements and predict their performance faster than traditional simulation studies. Thanks to the DevOps and scheduling capabilities of Databricks, such processes can be triggered and executed together as one end-to-end pipeline (for example, when a new version of the CAD representation is available).
  3. Proprioception of Robotics and Automation – Developers can integrate Omniverse Replicator into their workflow to generate synthetic datasets that encompass countless environment configurations, camera angles, and lighting scenarios. Robotics manufacturers can use Databricks to store various point-of-view images from OpenUSD scenes and run parallel, distributed model tuning experiments to rapidly develop better AI comprehension of particular robotic arm movements in specific manufacturing environments.

These approaches enable manufacturers to train a broader variety of machine vision models to solve business problems proactively. Rare defects, for which real data was previously too sparse to train on, can now be supplemented with numerous realistic synthetic examples, allowing businesses to catch defects before they escape while preparing enterprises for the new age of Data Intelligence.

Solving a Healthcare Company’s Data Gaps

Siemens Healthineers, a joint healthcare customer of Databricks and NVIDIA, inspired this integration architecture after experiencing delays caused by a fragmented workflow: one engineer generated synthetic data on-premises through an application developed with NVIDIA Omniverse, and another moved that data to the cloud for ML training and deployment in Databricks.

By implementing Databricks Unity Catalog to centralize all data, functions, and models under a single governance framework and directly integrating the Omniverse platform’s synthetic data generation capabilities, the organization dramatically reduced model iteration cycles “from weeks to days,” improved data integration and traceability, and accelerated time to market.

 

If you are attending NVIDIA GTC 2025, visit us at our Databricks Booth #1733 or request a Meeting with Databricks at GTC.

For more about NVIDIA Omniverse and the Databricks Data Intelligence Platform, please see the additional resources below:

  • Omniverse Replicator is built as an Omniverse Kit extension and distributed through Omniverse Code.
    • To use Replicator, you need to download Omniverse, which is found here.
    • For more details on the Omniverse Launcher, check out this video.
  • If you’ve never used the Databricks Data Intelligence Platform hands-on, sign up for a free trial account. You can also find a full list of Databricks Academy offerings, training, and certifications.

 

  • NVIDIA Omniverse Website
  • Databricks Data Intelligence Platform Website
  • Databricks <> NVDA Partnership Announcement
  • Databricks – ML Ops Documentation

 


