A major challenge in autonomous vehicle research is modeling agent behavior, which has critical applications including constructing realistic and reliable simulations for off-board evaluation and forecasting the motion of traffic agents for onboard planning. Motion prediction models trained via supervised learning have recently proven effective at modeling agents across many domains. However, these models are subject to distribution shift when deployed at test time. In this work, we improve the reliability of agent behaviors by fine-tuning behavior models in closed loop with reinforcement learning. We demonstrate that this improves overall performance, as well as targeted metrics such as collision rate, on the Waymo Open Sim Agents Challenge benchmark. We also introduce a novel policy evaluation ranking benchmark that directly evaluates the ability of sim agents to measure planning quality, and we demonstrate the effectiveness of our approach on this new benchmark.