Medicine & Health | 1 February 2026

Deep reinforcement learning VMAT: Speed and autonomy in radiotherapy planning

Source Publication: Medical Physics

Primary Authors: Shaffer, Mudireddy, St‐Aubin


These results were observed under controlled laboratory conditions, so real-world performance may differ.

A new algorithmic framework claims to generate clinically viable prostate cancer treatment plans in seconds, bypassing the need for commercial optimizers entirely. Historically, the optimization of Volumetric Modulated Arc Therapy (VMAT) has been a slog. It is a high-dimensional mathematical problem, typically forcing medical physicists to rely on inverse planning solutions that are computationally heavy and time-consuming.

The mechanics of deep reinforcement learning VMAT

To understand the shift, one must examine the mechanics. Conventional inverse planning operates as a rigorous mathematical calculation, iteratively adjusting parameters until the dose constraints are met. Previous attempts to automate this via supervised machine learning have typically been limited by the quality and diversity of the training data they seek to mimic. In contrast, the deep reinforcement learning VMAT approach utilises Proximal Policy Optimization (PPO). Instead of merely copying existing plans, the RL agent learns through trial and error, much like a chess engine playing against itself, maximising a reward function derived from Dose-Volume Histogram (DVH) metrics. This allows the system to potentially discover optimization strategies that a standard algorithm, or a supervised model tethered to historical data, might miss.

Technically, the distinction lies in the control variables. While standard systems often rely on abstract intermediate steps such as fluence maps, the proposed RL framework directly controls the Multi-Leaf Collimator (MLC) positions and Monitor Units (MUs). By employing two tandem convolutional neural networks, the system predicts the aperture shapes and beam intensities required at each control point. This removes the abstraction layer found in traditional methods, allowing the algorithm to manipulate the machine parameters directly based on the current dose state and contoured structure masks. It is a shift from asking the computer to solve a complex equation to teaching it how to drive the machine to a result.
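The authors' code is not reproduced in this brief, so the following PyTorch sketch is illustrative only: it shows one plausible shape for the tandem networks and the standard PPO clipped-surrogate loss the method builds on. The input channel layout (current dose plus structure masks), layer sizes, leaf-pair count, and action ranges are all assumptions chosen for readability, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ApertureNet(nn.Module):
    """Predicts normalised MLC leaf positions for one control point.

    Input: stacked 2D maps (hypothetical layout: current dose state
    plus one binary mask per contoured structure).
    """
    def __init__(self, in_channels: int = 4, n_leaf_pairs: int = 60):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Two outputs per leaf pair: left and right leaf position in [0, 1].
        self.head = nn.Linear(64, 2 * n_leaf_pairs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(self.features(x).flatten(1)))

class MonitorUnitNet(nn.Module):
    """Predicts a non-negative monitor-unit weight for the same control point."""
    def __init__(self, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.softplus(self.head(self.features(x).flatten(1)))

def ppo_clipped_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Standard PPO clipped-surrogate objective (to be minimised)."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()

# Toy forward pass: one state of four 64x64 channels.
state = torch.randn(1, 4, 64, 64)
leaves = ApertureNet()(state)   # shape (1, 120): left/right per leaf pair
mu = MonitorUnitNet()(state)    # shape (1, 1): non-negative MU weight
```

In the actual framework, the two networks act in tandem at every control point along the arc, with the delivered dose recomputed between steps; this sketch omits the dose engine and the environment loop entirely.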

Performance against commercial standards

The results from the 20-patient test set are notable. The RL framework generated plans in an average of 6.3 ± 4.7 seconds. While impressive, the large standard deviation suggests run-to-run consistency varies, though even so the approach remains far faster than human-driven optimization. Dosimetrically, the RL plans met all clinical objectives defined by the PACE-B SBRT protocol, and they achieved superior sparing of the bladder and rectum compared to the commercial Treatment Planning System (TPS).

However, the data indicates a trade-off. While the organs at risk received less radiation, the RL plans showed a statistically significant increase in the PTV D2%, essentially a 'hotspot' within the target volume. And while the speed is seductive, the independence from a commercial TPS raises questions about validation and safety in a live clinical environment. The study demonstrates feasibility, yet the increased hotspots suggest the reward function may need fine-tuning before this method replaces the 'human-in-the-loop' standard. It points to a future where planning is instantaneous but rigorous quality assurance remains the necessary bottleneck.
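To make the D2% finding concrete: Dx% is the minimum dose delivered to the hottest x% of a structure's volume, read directly off the cumulative DVH. Below is a short NumPy sketch of that calculation; the toy dose grid and mask are invented for illustration, and no values from the study are reproduced.

```python
import numpy as np

def dose_at_volume(dose: np.ndarray, mask: np.ndarray, volume_pct: float) -> float:
    """Dx%: minimum dose received by the hottest `volume_pct` percent of a structure.

    dose_at_volume(dose, ptv_mask, 2.0) gives the PTV D2% 'hotspot' metric.
    """
    voxels = np.sort(dose[mask])[::-1]  # structure doses, hottest first
    k = max(1, int(round(voxels.size * volume_pct / 100.0)))
    return float(voxels[k - 1])

# Toy example: a 50^3 dose grid in Gy with a cubic PTV mask.
rng = np.random.default_rng(0)
dose = rng.normal(40.0, 2.0, size=(50, 50, 50))
ptv = np.zeros(dose.shape, dtype=bool)
ptv[20:30, 20:30, 20:30] = True
print(f"PTV D2% = {dose_at_volume(dose, ptv, 2.0):.2f} Gy")
```

A reward function of the kind the paper describes would penalise deviations of metrics like this from protocol thresholds; pushing D2% back down without sacrificing the bladder and rectum sparing is exactly the fine-tuning question the results raise.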

Cite this Article (Harvard Style)

Shaffer, Mudireddy and St‐Aubin (2026) 'A tandem reinforcement learning framework for localized prostate cancer treatment planning and machine parameter optimization', Medical Physics. Available at: https://doi.org/10.1002/mp.70306

Source Transparency

This intelligence brief was synthesised by The Synaptic Report's autonomous pipeline. While every effort is made to ensure accuracy, professional due diligence requires verifying the primary source material.

Tags: Machine parameter optimization in radiotherapy · Medical Physics · Radiotherapy · Deep Reinforcement Learning