Medicine & Health | 1 February 2026
Deep reinforcement learning VMAT: Speed and autonomy in radiotherapy planning
Source Publication: Medical Physics
Primary Authors: Shaffer, Mudireddy, St‐Aubin

A new algorithmic framework claims to generate clinically viable prostate cancer treatment plans in seconds, bypassing the need for commercial optimizers entirely. Historically, the optimization of Volumetric Modulated Arc Therapy (VMAT) has been a slog. It is a high-dimensional mathematical problem, typically forcing medical physicists to rely on inverse planning solutions that are computationally heavy and time-consuming.

The mechanics of deep reinforcement learning VMAT
To understand the shift, one must examine the mechanics. Conventional inverse planning operates as a rigorous mathematical calculation, iteratively adjusting parameters to match dose constraints. Previous attempts to automate this via supervised machine learning have typically been limited by the quality and diversity of the training data they seek to mimic. In contrast, the deep reinforcement learning VMAT approach utilises Proximal Policy Optimization (PPO). Instead of merely copying existing plans, the RL agent learns through trial and error, much like a chess engine playing against itself, maximising a reward function based on Dose-Volume Histograms (DVHs). This allows the system to potentially discover novel optimization strategies that a standard algorithm, or a supervised model tethered to historical data, might miss.

Technically, the distinction lies in the control variables. While standard systems often rely on abstract intermediate steps such as fluence maps, the proposed RL framework directly controls the Multi-Leaf Collimator (MLC) positions and Monitor Units (MUs). By employing two tandem convolutional neural networks, the system predicts the precise aperture shapes and beam intensities required. This removes the abstraction layer found in traditional methods, allowing the algorithm to manipulate the machine parameters directly based on the current dose state and contoured structure masks. It is a shift from asking the computer to solve a complex equation to teaching the computer how to drive the machine to achieve a result.
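To make that control loop concrete, the sketch below shows what a tandem pair of networks could look like: one network proposes MLC leaf positions from the current dose state and structure masks, and a second predicts the monitor-unit weight for that aperture. This is a minimal illustration assuming PyTorch; the layer sizes, channel layout, and names (ApertureNet, WeightNet, N_LEAF_PAIRS) are hypothetical and do not reproduce the published architecture.

```python
# Minimal sketch of a tandem-network VMAT policy step (hypothetical shapes and names).
import torch
import torch.nn as nn

N_LEAF_PAIRS = 60  # MLC leaf pairs per aperture (assumed)

class ApertureNet(nn.Module):
    """Predicts MLC leaf positions for the next control point."""
    def __init__(self, in_channels: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
        )
        # Two banks of leaves (left/right), positions scaled to [-1, 1] field units.
        self.head = nn.Linear(32 * 8 * 8, 2 * N_LEAF_PAIRS)

    def forward(self, state):
        return torch.tanh(self.head(self.backbone(state)))

class WeightNet(nn.Module):
    """Predicts the monitor-unit weight for the proposed aperture."""
    def __init__(self, in_channels: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.head = nn.Linear(16 * 4 * 4 + 2 * N_LEAF_PAIRS, 1)

    def forward(self, state, leaf_positions):
        feats = self.backbone(state)
        # Softplus keeps the predicted MU weight non-negative.
        return nn.functional.softplus(self.head(torch.cat([feats, leaf_positions], dim=1)))

# One policy step: the state stacks the running dose distribution with
# contoured structure masks (e.g. PTV, bladder, rectum) for the current view.
state = torch.randn(1, 4, 64, 64)        # [dose, PTV, bladder, rectum] channels
aperture_net, weight_net = ApertureNet(), WeightNet()
leaves = aperture_net(state)             # MLC shape for this control point
mu = weight_net(state, leaves)           # beam-on weight for that shape
print(leaves.shape, mu.shape)            # torch.Size([1, 120]) torch.Size([1, 1])
```

In a PPO setting these outputs would parameterize a stochastic policy, with the reward computed from the DVH of the accumulated dose after each control point; that training loop is omitted here.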
Performance against commercial standards

The results from the test set of 20 patients are notable. The RL framework generated plans in an average of 6.3 ± 4.7 seconds. While impressive, the large standard deviation suggests consistency varies, though every case remains significantly faster than human-driven optimization. Dosimetrically, the RL plans met all clinical objectives defined by the PACE-B SBRT protocol, and they achieved superior sparing of the bladder and rectum compared to the commercial Treatment Planning System (TPS).

However, the data indicates a trade-off. While the organs at risk received less radiation, the RL plans showed a statistically significant increase in the PTV D2%, the near-maximum dose delivered to the hottest 2% of the target volume, essentially a 'hotspot'. And while the speed is seductive, independence from a commercial TPS raises questions about validation and safety in a live clinical environment. The study demonstrates feasibility, yet the increased hotspots suggest the reward function may need fine-tuning before this method replaces the 'human-in-the-loop' standard. It points to a future where planning is instantaneous, but rigorous quality assurance remains the necessary bottleneck.
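To pin down the dosimetric endpoints mentioned above, the short sketch below computes DVH-style metrics from a dose grid and a binary structure mask: D2%, the minimum dose received by the hottest 2% of a structure (the 'hotspot' metric), and V95%, the fraction of the target covered by at least 95% of the prescription. The dose values, grid size, and the 36.25 Gy prescription level (the PACE-B SBRT dose) are illustrative stand-ins, not data from the study.

```python
import numpy as np

def dvh_metric_D(dose, mask, percent):
    """D{percent}%: minimum dose received by the hottest `percent` of the structure volume."""
    voxels = np.sort(dose[mask.astype(bool)])[::-1]           # hottest voxels first
    k = max(1, int(np.ceil(percent / 100.0 * voxels.size)))   # number of voxels in the hottest slice
    return voxels[:k].min()

def dvh_metric_V(dose, mask, threshold):
    """V: fraction of the structure receiving at least `threshold` Gy."""
    voxels = dose[mask.astype(bool)]
    return float((voxels >= threshold).mean())

# Illustrative 3D dose grid and a toy PTV mask (not study data).
rng = np.random.default_rng(0)
dose = rng.normal(loc=36.25, scale=1.0, size=(40, 40, 40))    # around a 36.25 Gy prescription
ptv = np.zeros_like(dose, dtype=bool)
ptv[10:30, 10:30, 10:30] = True

d2 = dvh_metric_D(dose, ptv, 2)               # the 'hotspot' metric flagged in the study
v95 = dvh_metric_V(dose, ptv, 0.95 * 36.25)   # coverage at 95% of the prescription
print(f"PTV D2% = {d2:.2f} Gy, V95% = {100 * v95:.1f}%")
```

A DVH-based reward of the kind described earlier would typically combine metrics like these, rewarding target coverage while penalizing D2% overshoot and organ-at-risk doses, which is where the reported hotspot trade-off would be tuned.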
Cite this Article (Harvard Style)

Shaffer, Mudireddy and St‐Aubin (2026) 'A tandem reinforcement learning framework for localized prostate cancer treatment planning and machine parameter optimization', Medical Physics. Available at: https://doi.org/10.1002/mp.70306