

Oral

Controlling the World by Sleight of Hand

Sruthi Sudhakar · Ruoshi Liu · Basile Van Hoorick · Carl Vondrick · Richard Zemel

Award Candidate
Oral 6A: Generative Models II
Thu 3 Oct 4:30 a.m. — 4:40 a.m. PDT

Abstract:

Humans naturally build mental models of object interactions and dynamics, allowing them to imagine how their surroundings will change if they take a certain action. While generative models today have shown impressive results on generating and editing images unconditionally or conditioned on text, current methods cannot perform fine-grained object manipulation conditioned on actions, an important capability for world modeling and action planning. Therefore, we propose learning to model interactions through a novel form of visual conditioning: hands. Hands are a natural way to specify control through actions such as grasping, pulling, and pushing. Given an input image and a representation of a hand interacting with the scene, our approach, CoSHAND, synthesizes a depiction of what the scene would look like after the interaction has occurred. We show that CoSHAND is able to recover the dynamics of manipulation by learning from large amounts of unlabeled videos of human hands interacting with objects and by leveraging internet-scale latent diffusion model priors. The model demonstrates strong capabilities on a variety of actions and object types beyond the dataset, and it can generate multiple possible futures depending on the actions performed. CoSHAND is also able to generalize zero-shot to tasks where the agent is a robot arm rather than a human hand. Our hand-conditioned model has several exciting applications in robotic planning and in augmented or virtual reality.
