Introduction

Planning with partial observation is a central challenge in embodied AI. Most prior work has tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. Humans, in contrast, can imagine unseen parts of the world through mental exploration and revise their beliefs with imagined observations. Such updated beliefs allow them to make more informed decisions at the current step, without having to physically explore the world first. To achieve this human-like ability, we introduce the Generative World Explorer (GenEx), a video generation model that allows an agent to mentally explore a large-scale 3D world (e.g., urban scenes) and acquire imagined observations to update its belief about the world.

GenEx Architecture

To support exploration, we design a variant of a video generation model. It is defined as $f_\theta: \mathbb{R}^{H \times W} \to \mathbb{R}^{T \times H \times W}$, which takes an input image of height $H$ and width $W$ and outputs a sequence of $T$ images as the predicted video. We aim to generate realistic 360-degree panoramic video sequences that simulate forward movement through any environment by leveraging an image-to-video diffusion model $f_\theta$. To this end, we introduce a diffuser architecture and spherical-consistent learning designed for panoramic navigation.
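The input/output contract of the model can be illustrated with a minimal shape-level sketch. The function `placeholder_video_model`, the frame count, and the panorama size below are hypothetical and chosen only for illustration; the stand-in simply repeats the input frame and is not the GenEx diffusion model.

```python
import numpy as np


def placeholder_video_model(image, num_frames):
    """Stand-in for f_theta: maps an (H, W) image to a (T, H, W) video.

    Assumption: this is NOT the GenEx model. It only repeats the input
    frame T times so the input and output shapes can be checked.
    """
    if image.ndim != 2:
        raise ValueError("expected a single image of shape (H, W)")
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    # Add a leading time axis, then repeat the frame along it.
    return np.repeat(image[np.newaxis, :, :], num_frames, axis=0)


# Assumed sizes: a 64 x 128 panorama and T = 8 predicted frames.
panorama = np.zeros((64, 128), dtype=np.float32)
video = placeholder_video_model(panorama, num_frames=8)
print(video.shape)  # (8, 64, 128)
```

A real image-to-video model would replace the body of the function while keeping the same shape contract: one image in, a stack of T images out.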

Exploration Consistency

To ensure generation quality, we introduce navigational cycle consistency. GenEx navigates a randomly sampled closed path that returns to the origin. Ideally, the start and end views are identical, which indicates consistent world modeling.
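The idea of a closed-path check can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by GenEx: the path is a regular polygon with a random initial heading, and the discrepancy between the start and end views is measured with a plain mean squared pixel difference. The names `sample_closed_path` and `cycle_error` are hypothetical.

```python
import math
import random

import numpy as np


def sample_closed_path(num_sides, step_length, rng):
    """Return waypoints of a regular polygon that starts and ends at the origin.

    Assumption: the closed path is a regular polygon with a random initial
    heading. Any path that returns to its start would serve the same purpose.
    """
    if num_sides < 3:
        raise ValueError("a polygon needs at least 3 sides")
    heading = rng.uniform(0.0, 2.0 * math.pi)
    turn = 2.0 * math.pi / num_sides  # exterior angle of a regular polygon
    x, y = 0.0, 0.0
    waypoints = [(x, y)]
    for _ in range(num_sides):
        # Move forward along the current heading, then turn.
        x += step_length * math.cos(heading)
        y += step_length * math.sin(heading)
        waypoints.append((x, y))
        heading += turn
    return waypoints


def cycle_error(start_view, end_view):
    """Mean squared difference between two views (assumed metric); 0 means identical."""
    diff = start_view.astype(np.float64) - end_view.astype(np.float64)
    return float(np.mean(diff ** 2))


rng = random.Random(0)
path = sample_closed_path(num_sides=5, step_length=2.0, rng=rng)
end_x, end_y = path[-1]
print(math.isclose(end_x, 0.0, abs_tol=1e-9), math.isclose(end_y, 0.0, abs_tol=1e-9))
```

In use, the view generated at the last waypoint would be compared with the initial view through `cycle_error`; a lower value suggests that the generated world stays consistent along the loop.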

Cycle Consistency

GenEx for Embodied AI

GenEx lets embodied AI agents reason about parts of the scene they cannot observe directly. For a single-agent scenario involving decision-making, picture yourself driving down a street when a siren sounds, but you can’t see the source. With GenEx, you project what’s ahead, and the imagined view reveals an ambulance just around the corner. You stop instead of crossing, making space for it to pass.

For a multi-agent scenario involving interaction, suppose you are waiting at a red light. You see a pedestrian and an approaching car, and everything seems fine. But GenEx helps you recognize that the pedestrian can’t see the car and, likewise, the car can’t see the pedestrian, because your vehicle is blocking their views. Realizing this, you act immediately to warn them, preventing a collision.


BibTeX

@misc{lu2024generativeworldexplorer,
  title={Generative World Explorer}, 
  author={Taiming Lu and Tianmin Shu and Alan Yuille and Daniel Khashabi and Jieneng Chen},
  year={2024},
  eprint={2411.11844},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.11844}, 
}