InsActor: Instruction-driven Physics-based Characters


Generating life-like natural motions in a simulated environment has been the focus of physics-based character animation. To enable user interaction with the generated motion, various conditions such as waypoints (🚩) have been introduced to control the generation process. In particular, language instructions (🗣️), which have been widely adopted in text generation and image generation, have recently drawn attention in physics-simulated character animation. The accessibility and versatility of human instructions open up new possibilities for downstream physics-based character applications.

Therefore, we investigate a novel task in this work: generating physically simulated character animation from human instruction. The task is challenging for existing approaches:

  • Motion tracking is a common approach for character animation, but it presents challenges when tracking novel motions generated from free-form human language.
  • Language-conditioned controllers have demonstrated the feasibility of managing characters using instructions, but they struggle with complex human commands.


To tackle this challenging task, we present InsActor, a framework that employs a hierarchical design for creating instruction-driven, physics-based characters.

  • At the high level, InsActor generates motion plans conditioned on human instructions. To accomplish this, InsActor utilizes a diffusion policy to generate actions in the joint space conditioned on human inputs. It allows flexible test-time conditioning, which can be leveraged to complete novel tasks like waypoint heading without task-specific training. However, the high-level diffusion policy alone does not guarantee valid states or feasible state transitions, making it insufficient for direct execution of the plans using inverse dynamics.
  • Therefore, at the low level, InsActor incorporates unsupervised skill discovery to handle state transitions between pairs of states. Given the state sequence in joint space from the high-level diffusion policy, the low-level policy first encodes it into a compact latent space to address any infeasible joint actions from the high-level diffusion policy. Each state transition pair is mapped to a skill embedding within this latent space. Subsequently, the decoder translates the embedding into the corresponding action.
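The two-level pipeline above can be sketched as follows. This is a minimal illustrative sketch, not the released implementation: `diffusion_plan`, `skill_policy`, and the linear encoder/decoder stand-ins are all hypothetical names, and the denoiser is a placeholder that only shows the control flow (a real model would condition a learned denoiser on the instruction embedding).

```python
import numpy as np

def diffusion_plan(instruction_embedding, horizon, state_dim, n_steps=50):
    """Denoise a random trajectory into a joint-space state plan.

    The denoising update here is a placeholder that simply shrinks the
    noise; a real model would apply eps_theta(plan, t, instruction).
    """
    plan = np.random.randn(horizon, state_dim)  # start from pure noise
    for t in range(n_steps):
        noise_pred = 0.1 * plan   # stand-in for the learned denoiser
        plan = plan - noise_pred  # one denoising update
    return plan

def skill_policy(state, target_state, encoder, decoder):
    """Map a desired state transition to a low-level action.

    The transition (state -> target_state) is encoded into a compact
    skill embedding, then decoded into an executable action.
    """
    z = encoder(state, target_state)  # skill embedding
    return decoder(state, z)          # low-level action

# --- toy usage with linear stand-ins for encoder/decoder ---
state_dim, act_dim, horizon = 6, 3, 8
rng = np.random.default_rng(0)
W_enc = rng.standard_normal((2 * state_dim, 4))
W_dec = rng.standard_normal((state_dim + 4, act_dim))

encoder = lambda s, s_next: np.concatenate([s, s_next]) @ W_enc
decoder = lambda s, z: np.concatenate([s, z]) @ W_dec

plan = diffusion_plan(np.zeros(16), horizon, state_dim)
actions = [skill_policy(plan[i], plan[i + 1], encoder, decoder)
           for i in range(horizon - 1)]
print(len(actions), actions[0].shape)  # one action per planned transition
```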

This hierarchical architecture effectively breaks down the complex task into two manageable tasks at different levels, offering enhanced flexibility, scalability, and adaptability compared to existing solutions. Furthermore, thanks to the flexibility of the diffusion model, animations can be further customized by incorporating additional conditions, such as waypoints.
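One common way such test-time conditioning can work with a state-space diffusion planner is inpainting-style clamping: at every denoising step, the planned states at chosen timesteps are overwritten with the desired waypoint. The sketch below assumes this mechanism and uses the same placeholder denoiser as before; the function name and interface are illustrative, not the paper's API.

```python
import numpy as np

def conditioned_plan(horizon, state_dim, waypoints, n_steps=50):
    """waypoints: dict mapping trajectory index -> desired state."""
    plan = np.random.randn(horizon, state_dim)
    for t in range(n_steps):
        plan = plan - 0.1 * plan          # stand-in denoising update
        for idx, target in waypoints.items():
            plan[idx] = target            # clamp waypoint states

    return plan

goal = np.array([2.0, 0.0])              # e.g. a 2D root position (🚩)
plan = conditioned_plan(horizon=16, state_dim=2, waypoints={15: goal})
print(np.allclose(plan[15], goal))       # final state matches the waypoint
```

Because the constraint is applied only at sampling time, no task-specific retraining is needed to add a new condition.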

Here, we describe the unified hierarchical approach in more detail.

  • For the high-level state diffusion policy, we treat the joint state of the character as its action. Following the state-of-the-art approach, we use diffusion models to carry out conditional motion generation. To leverage large-scale motion datasets in a physical simulator, we retarget the motion database to the simulated character, yielding a collection of reference trajectories. We then train a motion diffusion model to learn the data distribution over this trajectory collection.
  • The low-level skill discovery is designed to safeguard against unexpected states in poorly planned trajectories. Specifically, we train a Conditional Variational Autoencoder (CVAE) to map state transitions to a compact latent space in an unsupervised manner. The resulting repertoire of learned skill embeddings in this compact latent space enables superior interpolation and extrapolation, so motions derived from the diffusion model can be executed with natural motion primitives.
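A minimal sketch of such a transition-level CVAE is shown below. All sizes, layer widths, and names (`SkillCVAE`, `enc`, `dec`) are illustrative assumptions, not the paper's architecture: the encoder maps a transition pair (s_t, s_{t+1}) to a Gaussian over skill embeddings z, and the decoder maps (s_t, z) to an action, with a KL term pulling z toward the prior so the latent space stays compact.

```python
import torch
import torch.nn as nn

class SkillCVAE(nn.Module):
    def __init__(self, state_dim=6, act_dim=3, latent_dim=4, hidden=32):
        super().__init__()
        # encoder: transition pair -> Gaussian over skill embeddings
        self.enc = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))  # mean and log-variance
        # decoder: current state + skill embedding -> action
        self.dec = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))

    def forward(self, s, s_next):
        stats = self.enc(torch.cat([s, s_next], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        action = self.dec(torch.cat([s, z], dim=-1))
        # KL divergence to the standard normal prior N(0, I)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return action, kl

model = SkillCVAE()
s, s_next = torch.randn(8, 6), torch.randn(8, 6)
action, kl = model(s, s_next)
print(action.shape, kl.item())  # batch of actions and a non-negative KL
```

At execution time, only the decoder is needed: each planned transition is encoded to a skill embedding, which the decoder turns into an action.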


As a result, InsActor can successfully respond to language instructions and waypoint targets.

The policy also runs in real time:

And can track multiple waypoints:

Thanks to the low-level skill embedding, the policy is robust to external perturbations like boxes.


In conclusion, we have introduced InsActor, a principled framework for generating physics-based character animation from human instructions. By utilizing a diffusion model to interpret language instructions into motion plans and mapping them to latent skill vectors, InsActor can generate flexible physics-based animations under diverse and mixed conditions, including waypoints. We hope InsActor will serve as an important baseline for the future development of instruction-driven physics-based animation.


Project page: