3D reconstruction from a sketch offers an efficient means of boosting the productivity of 3D modeling. However, such a task remains largely under-explored due to the difficulties caused by the inherent abstractive representation and diversity of sketches. In this paper, we introduce a novel deep neural network model, Sketch2Vox, for 3D reconstruction from a single monocular sketch. Taking a sketch as input, the proposed model first converts it into two different representations, i.e., a binary image and a 2D point cloud. Second, we extract semantic features from them using two newly-developed processing modules, including the SktConv module designed for hierarchical abstract features learning from the binary image and the SktMPFM designed for local and global context feature extraction from the 2D point cloud. Prior to feeding features into the 3D-decoder-refiner module for fine-grained reconstruction, the resultant image-based and point-based feature maps are fused together according to their internal correlation using the proposed cross-modal fusion attention module. Finally, we use an optimization module to refine the details of the generated 3D model. To evaluate the efficiency of our method, we collect a large dataset consisting of more than 12,000 Sketch-Voxel pairs and compare the proposed Sketch2Vox against several state-of-the-art methods. The experimental results demonstrate the proposed method is superior to peer ones with regard to reconstruction quality. The dataset is publicly available on https://drive.google.com/file/d/1aXug8PcLnWaDZiWZrcmhvVNFC4n_eAih/view?usp=sharing.