WiFi-based human pose estimation (HPE) has emerged as a promising alternative to conventional vision-based techniques, yet its high computational cost hinders widespread adoption. This paper introduces HPE-Li, a novel approach that harnesses multi-modal sensors (e.g., camera and WiFi) to generate accurate 3D skeletons for HPE. We then develop an efficient deep neural network to process raw WiFi signals. Our model incorporates a multi-branch convolutional neural network (CNN) equipped with a selective kernel attention (SKA) mechanism. Unlike standard CNNs with fixed receptive fields, the SKA mechanism dynamically adjusts kernel sizes according to the characteristics of the input data, enhancing adaptability without increasing complexity. Extensive experiments on two datasets, MM-Fi and WiPose, show that our method outperforms state-of-the-art approaches while incurring minimal computational overhead, making it well suited to large-scale scenarios.
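To make the selective kernel idea concrete, the following is a minimal sketch of the generic split, fuse, and select pattern on a 1D signal, written in plain Python. It is not the HPE-Li network: the function names, the scalar pooled descriptor, and the per-branch gate parameters (`gate_weights`, `gate_biases`) are assumptions introduced only for illustration, and a real model would use learned multi-channel convolutions.

```python
import math


def conv1d_same(signal, kernel):
    """Zero-padded 1D cross-correlation that keeps the input length.

    Assumes an odd-length kernel so the padding is symmetric.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def selective_kernel_1d(signal, kernels, gate_weights, gate_biases):
    """Illustrative selective kernel attention over several kernel sizes.

    kernels: one odd-length kernel per branch (hypothetical, not learned here).
    gate_weights, gate_biases: hypothetical per-branch gate parameters.
    """
    # Split: run one convolution branch per kernel size.
    branches = [conv1d_same(signal, k) for k in kernels]

    # Fuse: sum the branches, then pool globally into a single descriptor.
    fused = [sum(values) for values in zip(*branches)]
    descriptor = sum(fused) / len(fused)

    # Select: input-dependent softmax weights across the branches.
    logits = [w * descriptor + b for w, b in zip(gate_weights, gate_biases)]
    attention = softmax(logits)

    # Output: attention-weighted sum of the branch responses.
    output = [
        sum(a * branch[i] for a, branch in zip(attention, branches))
        for i in range(len(signal))
    ]
    return output, attention


# Example: two averaging kernels of width 3 and 5 on a short signal.
signal = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0]
kernels = [[1 / 3] * 3, [1 / 5] * 5]
output, attention = selective_kernel_1d(signal, kernels, [1.0, -1.0], [0.0, 0.0])
```

Because the attention weights are computed from the pooled response of the input itself, the effective receptive field shifts between the narrow and wide kernels from one input to the next, while the output keeps the same shape as a single branch.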