Oct 17, 2022
A longstanding dream in generative AI has been to digitally synthesize dynamic 3D environments – interactive simulations consisting of objects, spaces and agents. Simulations enable us to create synthetic worlds and data for robotics and perception, new content for games and metaverses, and adaptive digital twins of everyday and industrial objects and spaces. Simulators are also the key to imaginative machines endowed with a world model; they allow them to visualize, predict and consider possibilities before taking actions. But today’s simulators are too hard to build and too limited for our needs. Generative AI will change this.
Generative AI requires compressing diverse human experience and culture into a large-scale model. Indeed, recent large generative models (such as DALL-E 2, Stable Diffusion, Imagen, Make-A-Scene and many others) have shown that this dream may be possible today. A natural evolution in this progression is for AI systems to create a new class of simulator that learns to generate 3D models, long-horizon videos, and efficient actions. These feats are now becoming attainable due to recent advances in neural rendering, diffusion models, and attention architectures.
Common Sense Machines has built a neural simulation engine we call CommonSim-1. Instead of laboriously creating content and spending weeks of effort navigating complex tools, users and robots interact with CommonSim-1 via natural images, actions and text. These interfaces will enable us to rapidly create digital replicas of real-world scenarios or make imagining entirely new ones as natural as taking a photo or describing a scene in text. Today, we are previewing these model capabilities. We are excited to see what developers, creators, researchers, and more can create! Sign up below for early access to our application, APIs, checkpoints, and code.
CommonSim-1: A simulator that grows and adapts from experience
A new simulation engine requires new interfaces. Instead of needing a specialized toolchain, CommonSim-1 is operated with images, language, and action. A user (machine or human) shows or describes what they want to simulate and then controls the kinds of outputs they want to measure and observe. We have built mobile and web interfaces that let anyone access the model, as well as REST APIs that connect developers and embodied systems directly to CommonSim-1.
At the heart of CommonSim-1 is a foundation model of the 3D world that is trained on a large-scale, growing dataset of diverse human (and non-human) experiences across a wide range of tasks. We combine publicly available data, our internal datasets, and task-specific data provided by our partners.
CommonSim-1 unlocks:
Controllable video generation
With as little as a single frame and a set of actions (camera or body movements), our foundation world model can generate high-resolution (512×512) videos. The same architecture can be trained to solve tasks across a variety of embodiments (camera, robot, car). Since this model imagines the future, its predicted rollouts can be used 1) as training data for downstream tasks (3D generation, perception), and 2) as part of a system’s predictive model.
3D content generation
To create 3D assets, users upload videos of objects via our mobile or web app. All processing is done in the cloud, and results can be exported to standard 3D formats (OBJ, USDZ, .blend, glTF, etc.) and as Neural Radiance Fields (NeRFs).
Standard 3D assets can be imported into existing 3D engines for scene compositing and rendering. Our Hybrid Renderer (current support for Blender with more on the way) allows for NeRFs to be composed with traditional 3D assets. The Hybrid Renderer combines the best of both worlds, allowing for highly accurate but fast photorealistic rendering of scenes with physics.
In addition to rendering pixels, CommonSim-1 also infers annotations such as bounding boxes, masks, and 6-DOF poses. These data are useful for training perception systems without the cost of manual data annotation.
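To make these data types concrete, here is a minimal Python sketch of how a binary mask relates to a bounding box and how a 6-DOF pose (three rotations plus three translations) can be written as a 4×4 rigid transform. This is an illustration only, not CommonSim-1's output format or API: the function names `mask_to_bbox` and `pose_to_matrix`, the nested-list mask layout, and the roll/pitch/yaw (Z-Y-X) rotation convention are all assumptions made for the example.

```python
import math


def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) enclosing all nonzero pixels.

    `mask` is assumed to be a list of rows, each a list of 0/1 values.
    Returns None when the mask contains no foreground pixels.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


def pose_to_matrix(roll, pitch, yaw, tx, ty, tz):
    """Build a 4x4 rigid transform from a 6-DOF pose.

    Assumed convention: angles in radians, R = Rz(yaw) * Ry(pitch) * Rx(roll),
    followed by translation (tx, ty, tz).
    """
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr, tx],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr, ty],
        [-sp, cp * sr, cp * cr, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


if __name__ == "__main__":
    # A tiny 3x4 mask with an L-shaped object.
    example_mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
    ]
    print(mask_to_bbox(example_mask))
    # A pose rotated 90 degrees about the vertical (z) axis and shifted.
    for row in pose_to_matrix(0.0, 0.0, math.pi / 2, 1.0, 2.0, 3.0):
        print([round(v, 3) for v in row])
```

A perception training pipeline would typically store one such box, mask, and pose per object per rendered frame; the exact schema depends on the training framework being used.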
Natural language is a powerful means to describe imaginary situations. For instance, if you had a 3D model of a chair and wanted to render it in a completely new environment, it would normally take days or weeks to achieve satisfactory results. Text-to-image models can automate this process. With a mesh or NeRF generated by CommonSim-1, one can type natural-language descriptions into a text prompt and generate unbounded new hybrid scenes.
The above content generation mechanisms are sufficiently precise and flexible to generate large-scale synthetic data and train perception systems. Below are examples where a vision system for detection, segmentation, and 6-DOF pose tracking was trained without any human-annotated data or human-created models. We’ve tested these systems on objects from grocery stores, warehouses, factories, labs, medical theaters, and more.
This is just a preview of what we’ve been developing and scaling up. We are also hiring exceptional talent across the board – contact us at hello@csm.ai.