Referring image segmentation (RIS) aims to segment the object of interest described by a given natural language expression. Because fully-supervised methods require expensive pixel-wise labeling, mask-free solutions supervised by low-cost labels are highly desirable. However, existing mask-free methods suffer from complicated architectures or unsatisfactory performance. In this paper, we propose a gradient-driven, tree-guided mask-free referring image segmentation method, GTMS, which exploits both low-level structural information and high-level semantic information while using only a bounding box as the supervision signal. Specifically, we first mine structural information from the low-level features using a tree filter. Meanwhile, we extract semantic attention from the high-level features using GradCAM. Finally, the tree structure and attention information are used to refine the output of the segmentation model into pseudo labels, which in turn are used to optimize the model. To verify the effectiveness of our method, we conduct experiments on three benchmarks, \textit{i.e.}, RefCOCO/+/g. Notably, GTMS achieves 66.54\%, 69.98\%, and 63.41\% IoU on RefCOCO Val-Test, TestA, and TestB, respectively, outperforming most fully-supervised models.