PAN: Towards General World Model with Natural Language Actions and Video States

Abstract

A step towards a General World Model (GWM) that can simulate complex video scenarios with natural language actions.

PAN: Towards General World Model with Natural Language Actions and Video States
  • Diffusion Game Engine: Built an auto-regressive Image-to-Video (I2V) model capable of simulating 2D platformer games (e.g., Mario), allowing control of both characters and environmental elements using text inputs on the fly. Proposed and implemented window-slide conditioning to support the generation of game videos lasting longer than one minute.

  • Video Diffusion Model Acceleration: Spearheaded a sub-project focusing on optimizing video diffusion for real-time game generation, achieving generation speeds of under 1 second per round.

  • Complex Video Captioning: Led a sub-project aimed at enhancing video captioning for complex scenarios (e.g. game videos) where even state-of-the-art visual language models tend to falter, ensuring more accurate descriptions.

  • Large-Scale Training Data Pipeline: Designed and implemented a high-efficiency processing pipeline for video training data, processing over 10 million videos simultaneously, significantly improving the overall data quality and processing speed.

mcts