Generating 3D Worlds with CommonSim-1

A longstanding dream in generative AI has been to digitally synthesize dynamic 3D environments – interactive simulations of objects, spaces, and agents. Simulations enable us to create synthetic worlds and data for robotics and perception, new content for games and metaverses, and adaptive digital twins. Simulators are also the key to imaginative machines endowed with a world model: they allow such machines to visualize, predict, and consider possibilities before acting. But today’s simulators are too hard to build and too limited for humanity’s needs. Generative AI will change this.


Generative AI requires compressing diverse human experience and culture into a large-scale model. Indeed, recent large generative models (such as DALL-E 2, Stable Diffusion, Imagen, Make-A-Scene, and many others) have shown remarkable results in creating entirely new imagery from scratch. A natural next step is for AI systems to create a new class of simulators that learn to generate 3D models, long-horizon videos, and efficient actions. These feats are becoming attainable thanks to recent advances in neural rendering, diffusion models, and attention architectures.


Common Sense Machines has built a neural simulation engine we call CommonSim-1. Instead of laboriously creating content and spending weeks of effort navigating complex tools, users and robots interact with CommonSim-1 via images, actions, and text. These interfaces enable us to create digital replicas of real-world environments or imagine new worlds as simply as taking photos or describing a scene in text. Today, we are previewing these model capabilities. We are excited to see what developers, creators, researchers, and others are able to create!


Sign up below for early access to our application, APIs, checkpoints, and code.


CommonSim-1: A simulator that grows and adapts from experience


A new simulation engine requires new interfaces. Instead of specialized tools, CommonSim-1 is operated with images, language, and action. A user (machine or human) shows or describes what they want to simulate and then controls the kinds of outputs they want to measure and observe. We have built mobile and web interfaces that let anyone access the model, as well as REST APIs that let developers and embodied systems connect directly to CommonSim-1.


At the heart of CommonSim-1 is a foundation model of the 3D world that is trained on a large-scale, growing dataset of diverse human (and non-human) experience across a wide range of tasks. We combine publicly available data, our own internal datasets, and task-specific data provided by our partners. 


Controllable video generation


With as little as one frame and a set of actions (camera or body movements), CommonSim-1 generates high-resolution videos (512×512). With the right data, the same architecture can be trained to solve tasks across many embodiments (camera, robot, car, and more). Since this model imagines the future, one can use its imagination (1) as training data for 3D generation and perception and (2) as part of another system’s predictive model. 
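The "one frame plus a set of actions" interface can be pictured as an action-conditioned rollout: each predicted frame is fed back in as the input for the next action. The sketch below is only an illustration of that loop under loud assumptions. `CameraAction`, `toy_predictor`, and `rollout` are hypothetical names, and the predictor is a trivial pixel shift standing in for a learned model. It is not CommonSim-1's architecture or API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[List[int]]  # tiny grayscale image: rows of pixel intensities


@dataclass(frozen=True)
class CameraAction:
    """Hypothetical action: pan the camera by a whole number of pixels."""
    dx: int = 0
    dy: int = 0


def toy_predictor(frame: Frame, action: CameraAction) -> Frame:
    """Stand-in for a learned model (assumption, not the real system).

    Panning the camera by (dx, dy) makes each output pixel show what was
    previously at (x + dx, y + dy); content wraps around at the edges.
    """
    h, w = len(frame), len(frame[0])
    return [
        [frame[(y + action.dy) % h][(x + action.dx) % w] for x in range(w)]
        for y in range(h)
    ]


def rollout(
    first_frame: Frame,
    actions: Sequence[CameraAction],
    predict: Callable[[Frame, CameraAction], Frame],
) -> List[Frame]:
    """Generate a video by repeatedly predicting the next frame from the last one."""
    frames = [first_frame]
    for action in actions:
        frames.append(predict(frames[-1], action))
    return frames


# Usage: one 3x4 starting frame and three camera actions yield a 4-frame clip.
first = [[x + 10 * y for x in range(4)] for y in range(3)]
video = rollout(
    first,
    [CameraAction(dx=1), CameraAction(dx=1), CameraAction(dy=1)],
    toy_predictor,
)
print(len(video), "frames")
```

Swapping `toy_predictor` for a different embodiment's dynamics (robot joints, steering commands) leaves `rollout` unchanged, which mirrors the point above that the same structure can serve many embodiments.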



3D content generation


To create 3D assets, users upload videos of objects via our mobile or web app. All processing happens in the cloud, and results can be exported to widely supported 3D file formats (OBJ, USDZ, blend, glTF, etc.) as well as Neural Radiance Fields (NeRFs).
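Once an asset has been exported, it can be inspected with very little code. As a minimal sketch, the snippet below reads the vertex (`v`) and face (`f`) records of a Wavefront OBJ file and computes the mesh's bounding box. The `parse_obj` helper and the sample mesh are illustrative assumptions, not part of any CommonSim-1 tooling, and the parser deliberately ignores normals, texture coordinates, and materials.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]
Face = Tuple[int, ...]


def parse_obj(text: str) -> Tuple[List[Vertex], List[Face]]:
    """Parse only the 'v' and 'f' records of an OBJ file into 0-based arrays."""
    vertices: List[Vertex] = []
    faces: List[Face] = []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if parts[0] == "v":
            x, y, z = (float(p) for p in parts[1:4])
            vertices.append((x, y, z))
        elif parts[0] == "f":
            indices = []
            for corner in parts[1:]:
                # A corner is "v", "v/vt", "v//vn" or "v/vt/vn"; keep the vertex index.
                i = int(corner.split("/")[0])
                # OBJ indices are 1-based; negative ones count back from the latest vertex.
                indices.append(i - 1 if i > 0 else len(vertices) + i)
            faces.append(tuple(indices))
    return vertices, faces


# Hypothetical sample: a unit square built from two triangles.
SAMPLE = """\
# unit square
v 0 0 0
v 1 0 0
v 1 1 0
v 0 1 0
f 1 2 3
f 1/1 3/3 4/4
"""

vertices, faces = parse_obj(SAMPLE)
bbox_min = tuple(min(v[i] for v in vertices) for i in range(3))
bbox_max = tuple(max(v[i] for v in vertices) for i in range(3))
print(len(vertices), "vertices,", len(faces), "faces")
```

For production use, an established mesh library is a better choice than a hand-written parser; the point here is only that OBJ is a plain-text format that is simple to consume.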



These 3D assets can be imported into existing 3D engines for scene compositing and rendering. Our Hybrid Renderer (currently supporting Blender, with more engines on the way) allows NeRFs to be composed with traditional 3D assets. It combines the best of both worlds, allowing for highly accurate yet fast photorealistic rendering of scenes.



Our Hybrid Renderer also supports physics simulations with NeRF assets.



Natural language is a powerful means to describe imaginary situations. For instance, if you had a 3D model of a chair and wanted to render it in a completely new environment, it would normally take days or weeks to achieve satisfactory results. Text-to-image models can automate this process. With a mesh or NeRF generated by CommonSim-1, one can type natural-language descriptions into a text prompt and generate unlimited new hybrid scenes.



These content generation mechanisms are sufficiently precise and flexible to create auto-labeled synthetic data and train perception systems.
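The reason synthetic data can be "auto-labeled" is that the renderer already knows where every object is. As a rough sketch of that idea, the code below takes an object's known 6-DOF pose (a rotation and a translation) and a pinhole camera, projects the corners of the object's 3D bounding box into the image, and reads off a 2D detection box without any human annotation. All names here (`auto_label`, `yaw_matrix`, the camera parameters) are assumptions chosen for illustration; this is not the labeling pipeline used with CommonSim-1.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Matrix3 = List[List[float]]


def yaw_matrix(angle_rad: float) -> Matrix3:
    """Rotation about the camera's vertical (Y) axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def transform(R: Matrix3, t: Sequence[float], p: Sequence[float]) -> Vec3:
    """Apply the pose (R, t) to a point given in object coordinates."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def project(p: Vec3, focal: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-space point to pixel coordinates."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal * x / z + cx, focal * y / z + cy)


def box_corners(half_extents: Vec3) -> List[Vec3]:
    """The 8 corners of an axis-aligned box centred on the object origin."""
    hx, hy, hz = half_extents
    return [(sx * hx, sy * hy, sz * hz)
            for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]


def auto_label(half_extents: Vec3, R: Matrix3, t: Sequence[float],
               focal: float, cx: float, cy: float) -> Tuple[float, float, float, float]:
    """Return the 2D box (u_min, v_min, u_max, v_max) implied by a known pose."""
    pixels = [project(transform(R, t, c), focal, cx, cy)
              for c in box_corners(half_extents)]
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return (min(us), min(vs), max(us), max(vs))


# Usage: a 1 m cube placed 4 m in front of a camera rendering 512x512 images.
bbox = auto_label((0.5, 0.5, 0.5), yaw_matrix(0.0), (0.0, 0.0, 4.0),
                  focal=512.0, cx=256.0, cy=256.0)
print("2D box:", bbox)
```

The same known pose can be stored alongside the image as the 6-DOF ground truth, and a rendered object mask would give the segmentation label in the same way.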



Here is an example of an end-to-end vision system for detection, segmentation, and 6-DOF pose tracking that was trained without any human-annotated data or human-created models. We’ve tested these systems on objects from grocery stores, warehouses, factories, labs, medical theaters, and more.



This is just a preview of what we’ve been developing and scaling up. Click below if you’d like to join the waitlist for early access. We are also looking for exceptional talent across the board – contact us at