Oral

Watch Your Steps: Local Image and Scene Editing by Text Instructions

Ashkan Mirzaei · Tristan T Aumentado-Armstrong · Marcus A Brubaker · Jonathan Kelly · Alex Levinshtein · Konstantinos Derpanis · Igor Gilitschenski

Oral 4A: Neural 3D Rendering
Wed 2 Oct 5:20 a.m. — 5:30 a.m. PDT

Abstract:

The success of denoising diffusion models in generating and editing images has sparked interest in using diffusion models for editing 3D scenes represented via neural radiance fields (NeRFs). However, current 3D editing methods lack a way to both pinpoint the edit location and limit changes to the desired volumetric region. Consequently, these methods often over-edit, altering irrelevant parts of the scene. We introduce a new task, 3D edit localization, to automatically identify the relevant region for an editing task and restrict the edit accordingly. To achieve this goal, we first tackle 2D edit localization, and then lift it to multiple views to address the 3D localization challenge. For 2D localization, we leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. We refer to this discrepancy as the relevance map. The relevance map conveys the importance of changing each pixel to achieve an edit, and guides downstream modifications, ensuring that pixels irrelevant to the edit remain unchanged. With the relevance maps of multiview posed images, we can define the relevance field, which specifies the 3D region within which modifications should be made. This enables us to improve the quality of text-guided 3D NeRF scene editing by performing iterative updates on the training views, guided by renders from the relevance field. Our method achieves state-of-the-art performance on both NeRF and image editing tasks. We will make the code available.
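The 2D relevance map described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the two IP2P predictions (with and without the instruction) are already available as arrays of shape (C, H, W), and the channel averaging, min-max normalization, hard threshold, and the function names `relevance_map` and `blend_edit` are all illustrative assumptions rather than details given in the abstract.

```python
import numpy as np


def relevance_map(pred_with, pred_without, eps=1e-8):
    """Per-pixel discrepancy between two predictions of shape (C, H, W).

    Assumption: the discrepancy is the absolute difference averaged over
    channels, rescaled to roughly [0, 1]. The abstract does not specify
    the exact form.
    """
    diff = np.abs(pred_with - pred_without).mean(axis=0)  # shape (H, W)
    lo, hi = diff.min(), diff.max()
    return (diff - lo) / (hi - lo + eps)


def blend_edit(original, edited, relevance, threshold=0.5):
    """Keep edited pixels only where relevance reaches the threshold.

    Assumption: a hard binary mask with a hypothetical threshold of 0.5.
    `original` and `edited` have shape (C, H, W); `relevance` has (H, W).
    """
    mask = (relevance >= threshold).astype(original.dtype)  # shape (H, W)
    # The (H, W) mask broadcasts across the channel axis.
    return mask * edited + (1.0 - mask) * original


# Stand-in arrays in place of real IP2P outputs: the two predictions
# differ only inside a small square region.
pred_without = np.zeros((4, 8, 8))
pred_with = np.zeros((4, 8, 8))
pred_with[:, 2:4, 2:4] = 1.0

rel = relevance_map(pred_with, pred_without)
result = blend_edit(np.zeros((3, 8, 8)), np.ones((3, 8, 8)), rel)
```

In this toy setup, only the pixels where the two predictions disagree receive the edit, and every other pixel keeps its original value, which mirrors the stated goal of leaving irrelevant pixels unchanged.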
