Controllable 3D World Generation from Any Input

Feb 22, 2024

A long-standing challenge in 3D generative AI is the creation of dynamic, editable 3D environments: interactive simulations made up of objects, spaces, and agents. To date, there have generally been two pathways to building them: traditional explicit game engines and implicit game engines learned through multi-modal foundation models. Explicit game engines require generating 3D assets complete with rigging, animation, and program logic. This approach is flexible but challenging to learn with gradient descent. Implicit game engines, on the other hand, are large neural networks trained end-to-end on raw sensory inputs. While these models offer impressive generalization, they have limited controllability and are not easily programmable. Our approach introduces a third option: an implicitly learned game engine grounded in an explicitly learned 3D framework, offering both flexibility and controllability.

This release is designed to empower creators and developers at all skill levels to construct world simulators with richly detailed objects and characters, leveraging both off-the-shelf and cutting-edge neural game engines. In the future, 3D artists will transform concept art, text descriptions, and styles into working prototypes in minutes instead of weeks. Product designers will rapidly materialize their imaginations into 3D prints. Designers will sketch in 2D or 3D to craft AAA-quality renderings with precision. Gamers will create new avatars and assets for their favourite games. Pioneers in mixed reality will bring to life their version of the Holodeck. Our mission is to democratize the creation of world simulators, enabling anyone to build environments filled with objects and characters that exhibit common sense.

Take me to Cube app | Visit CSM Discord

In 2022, we embarked on our journey by creating predictive videos and 3D models from images and actions [CommonSim-1: Generating 3D Worlds]. In late 2023, we introduced Cube [cube.csm.ai], our first self-serve product, through a public beta. Cube set a new standard for converting single images into 3D assets, helping catalyze the 3D generative AI revolution. Driven by incredible user feedback, we have significantly enhanced the quality and speed of Cube since its inception. Additionally, we've overhauled the user experience and interface, making it more intuitive for all creators to engage with our AI models.

We are introducing a 3D canvas that ingests different 3D formats (meshes, animations, Gaussian splats) and a new rendering engine that synthesizes them into realistic simulations. With this release, we close the gap between AI-generated 3D assets and the creation of controllable worlds.

Fast 3D Generation from Any Input

We have developed a new 3D foundation model, built from the ground up, capable of converting single images into textured 3D assets within seconds. This model blends state-of-the-art AI techniques, including diffusion models, transformers, and neural radiance fields (NeRFs), and is the first in a series of more advanced AI models trained on our H100 cluster. This advance paves the way for interactive workflows and in-game generation of 3D content. We expose manual segmentation, text, and sketch interfaces to the user, enabling precise and creative control over generation from any data source.
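
To make the interface concrete, here is a minimal sketch of what an image-to-3D request with optional conditioning might look like. The endpoint, function, and parameter names are hypothetical illustrations, not Cube's actual API.

```python
from pathlib import Path
import requests

API_URL = "https://api.example.com/v1/image-to-3d"  # hypothetical endpoint

def generate_asset(image_path, text_prompt=None, mask_path=None):
    """Submit a single image, with optional text or segmentation-mask
    conditioning, and return metadata for the generated textured asset."""
    files = {"image": Path(image_path).read_bytes()}
    if mask_path:  # manual segmentation: restrict generation to the masked object
        files["mask"] = Path(mask_path).read_bytes()
    data = {"prompt": text_prompt} if text_prompt else {}
    resp = requests.post(API_URL, files=files, data=data, timeout=120)
    resp.raise_for_status()
    return resp.json()  # e.g. {"mesh_url": "...", "texture_url": "..."}
```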

Additionally, advanced users can customize style-consistent image generators on their own datasets and use text prompts to create 3D assets in their own style.
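
Conceptually, the style-customization flow has three steps: fine-tune an image generator on a user dataset, render a style-consistent image from text, then lift it to 3D. The client object and method names below are hypothetical, shown only to illustrate the sequence.

```python
from pathlib import Path

def stylized_asset(client, dataset_dir, prompt):
    """Fine-tune a style generator on user images, then create a 3D asset."""
    images = sorted(Path(dataset_dir).glob("*.png"))   # the user's style dataset
    style = client.train_style(images)                 # customize the image generator
    image = client.text_to_image(prompt, style=style)  # style-consistent 2D image
    return client.image_to_3d(image)                   # lift it to a textured 3D asset
```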

High-Resolution 3D Generation

Fast 3D assets can be refined into high-resolution assets in tens of minutes. Users can customize their outputs by selecting preferences for pixel alignment, model resolution, and mesh topology (quads/triangles).

With pixel alignment, users control how closely the 3D output matches the input image. Lower pixel alignment gives the AI more creative freedom, which can produce sharper textures and geometry at the cost of deviating from the input. Higher alignment enforces faithfulness to the input, which yields smoother textures and geometry.

Model resolution lets users balance runtime against level of detail. Low resolutions produce outputs faster with lower-frequency details, while high resolutions are slower but bake high-frequency details into both texture and geometry.

We also provide enhanced control over the final refined mesh. Users can download refined meshes with quadrilateral or triangular topology and select mesh and texture resolutions. These controls unify workflows across 3D modeling, animation, sculpting, rendering, and more.
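
Taken together, the refinement options can be thought of as a small configuration object. The field names and value ranges below are illustrative assumptions, not Cube's actual settings schema.

```python
from dataclasses import dataclass

@dataclass
class RefineSettings:
    pixel_alignment: float = 0.5    # 0.0 = creative freedom, 1.0 = faithful to input
    model_resolution: str = "high"  # "low" = faster, less detail; "high" = slower, more
    topology: str = "quads"         # export topology: "quads" or "triangles"
    texture_size: int = 2048        # texture resolution (pixels per side)

# e.g. prioritize faithfulness and a triangle mesh for game-engine import
settings = RefineSettings(pixel_alignment=0.9, topology="triangles")
```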

Animating 3D Assets

Rigging and animation are complicated and time-consuming endeavours for even the most talented 3D designers.

With this release, we are also introducing a new suite of animation tools. This includes automatic rigging of humanoid 3D assets, an ever-growing animation library to apply pre-constructed motions to your rigged assets, and even the ability to generate customizable and complex animations from simple text prompts. This allows users to bring their assets to life directly within our Cube app. Cube is the first platform that seamlessly integrates the entire process from mesh creation to animation.
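
As a sketch of that workflow, the steps compose naturally: auto-rig a humanoid mesh, then drive it with either a library motion or a text prompt. The client object and method names are hypothetical illustrations, not Cube's actual SDK.

```python
def bring_to_life(client, mesh_id):
    """Rig a humanoid mesh, then animate it two different ways."""
    rig = client.auto_rig(mesh_id)                     # automatic humanoid rigging
    walk = client.apply_animation(rig, preset="walk")  # motion from the animation library
    dance = client.animate_from_text(rig, "a joyful spinning dance")  # text-driven motion
    return walk, dance
```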

Neural Rendering Engine

Currently, AI-generated assets are predominantly used as imports into standard game engines. However, the gap between importable assets and full-scale simulated environments remains large, requiring complex programming of physics, lighting, mesh interactions, and more. To this end, we have built a new real-time diffusion-based rendering engine that synthesizes video from the complex conditioning signals produced by our 3D system. This approach bridges the visual-fidelity gap of virtual environments while grounding them in 3D representations from first principles. We believe these hybrid neural networks will let users iterate, create, and navigate breathtaking worlds with fine-grained control, and will be a central component of building AI that can experiment, plan, and act.

This is the first demonstration of a world simulator in which animation-ready 3D assets can be created from text or single images, rigged, and then animated with text prompts. These assets can be combined with Gaussian-splat environment maps within a 3D scene. This game-engine-like state is then passed to a diffusion model, which performs the final video rendering. This paves the path toward fully controllable video generation. In the future, as the AI rendering engine gets faster, we believe this approach will power game engines.
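
Putting the pieces together, the end-to-end pipeline might look like the sketch below: generate and rig an asset, animate it from text, compose it with a splat environment, and hand the resulting scene state to the diffusion renderer. All names are hypothetical illustrations of the described flow, not a real SDK.

```python
def render_world(client):
    """End-to-end: generate, rig, animate, compose, then neurally render."""
    asset = client.image_to_3d("concept_art.png")         # seconds-fast 3D asset
    rig = client.auto_rig(asset)                          # automatic humanoid rigging
    clip = client.animate_from_text(rig, "runs forward")  # text-driven animation
    scene = client.compose_scene(
        actors=[clip],
        environment="forest.splat",  # Gaussian-splat environment map
    )
    # The scene is game-engine-like state (geometry, motion, camera) that the
    # diffusion model consumes as conditioning for the final video render.
    return client.diffusion_render(scene, num_frames=120, fps=24)
```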

We are excited to see what you create with Cube!

Take me to Cube app | Visit CSM Discord