Effectively navigating a dynamic 3D world requires a comprehensive understanding of the 3D geometry and motion of surrounding objects and layouts. However, existing methods for perception and planning in autonomous driving primarily rely on a 2D spatial representation based on a bird's-eye view of the scene, which is insufficient for modeling motion characteristics and making decisions in real-world 3D settings with occlusion, partial observability, subtle motions, and varying terrain. Motivated by this insight, we present a novel framework for learning end-to-end autonomous driving based on volumetric representations. Our proposed neural volumetric world modeling approach, NeMo, can be trained in a self-supervised manner on image reconstruction and occupancy prediction tasks, which benefits scalable training and deployment paradigms such as imitation learning. Specifically, we demonstrate how higher-fidelity modeling with a 3D volumetric representation benefits vision-based motion planning. We further propose a motion flow module to model complex dynamic scenes, enabling additional, robust spatio-temporal consistency supervision. Moreover, we introduce a temporal attention module to effectively integrate predicted future volumetric features for the planning task. Our proposed sensorimotor agent achieves state-of-the-art motion planning performance in open-loop evaluation on nuScenes, outperforming prior baseline methods by over 18% in L2 error.