Case Study · 07/9/2019

Reinforcement learning for personalized medication dosing

Noah Salas
Data Scientist
Brock Ferguson
Co-Founder, Principal Data Scientist
Jacob Zweig
Co-Founder, Principal Data Scientist

Diabetes is a chronic disease that affects millions of Americans. In the past decade, researchers have begun to set their sights on developing tools and methods that leverage artificial intelligence for the treatment and prevention of complications associated with diabetes [3]. One promising line of research has capitalized on recent advances in reinforcement learning (RL) to build algorithms that can optimize insulin dosing to improve health outcomes for patients.

Yet applying reinforcement learning to “real world” problems such as medication dosing is challenging [8]. Rather than building algorithms that can learn online from patients through exploring different dosing strategies (which could jeopardize patient health and safety), these kinds of problems require a considerable amount of offline training and analysis to ensure that they will be effective and safe.

Reinforcement learning algorithms are also notoriously difficult to train, calibrate, and deploy at a scale that will make such an advance useful to a community of patients as large as those afflicted with diabetes.

Here we demonstrate how one can meet these challenges to apply reinforcement learning to drug dosing. As our foundation we use Strong RL, our platform for building and deploying reinforcement learning algorithms at scale, as well as two new simulation environments in which one can train and validate RL approaches to dosing problems.

From traditional approaches to reinforcement learning

Type-1 diabetes is a chronic condition in which the body lacks the ability to produce sufficient insulin, limiting the ability of glucose (sugar) to enter cells and produce energy. As a consequence, glucose buildup in the bloodstream can cause potentially fatal health complications [5]. The goal of Type-1 diabetes treatment (dosing) is to provide the body with exogenous insulin that in turn creates safe blood glucose levels, thereby avoiding hyperglycemia (levels that are too high) and hypoglycemia (levels that are too low).

Traditional approaches to automated dosing systems leverage optimal control theory to design algorithmic solutions. While models of glucose kinetics do exist (one of which is advanced enough that the FDA accepted its use as a substitute response in certain preclinical trials [4]), the process of building, evaluating, and calibrating optimal control solutions is often expensive and slow [9]. In the domain of diabetes, caloric intake (quality, amount, and timing of meals), exercise (amount and timing), traits such as weight, height, family history, and age, and individual pharmacokinetic and pharmacodynamic factors all play important roles in determining blood glucose levels. Reinforcement learning, by contrast, provides a highly effective and efficient framework for implementing automated dosing solutions:

  • Unlike control theory methods, most RL algorithms only depend on interactions with the system (in Markov process terms, at least partially observing environment states). They do not require any model of the environment [2].
  • The glucose kinetics process is complex and only partially known (existing models are not perfect — if they were, this wouldn’t be an active area of research).
  • Glucose kinetics are nonlinear, making them candidates to be approximated by the deep neural networks that underlie most modern RL algorithms.

Simulating diabetes treatment

Before testing RL approaches to solve the problem, we first created a novel glucose kinetics simulation environment (accessible at https://github.com/strongio/dosing-rl-gym). 

This diabetic simulator is based on an expanded version of the Bergman minimal model of glucose kinetics, which includes meal disturbances (spikes in blood glucose attributed to additional glucose from consumed food) [1]. The underlying mathematical representation of this model was adapted from code originally written by John D. Hedengren [6].

In this environment, meals are stochastically sampled around common eating times (in the U.S.), thus creating episodes (or patient treatment days) with different meal amounts across different times of the day (see Figure 1). 

Figure 1. Example patient episodes represent individual daily experiences with diabetes including meals.

Each individual episode represents a patient's daily experience with diabetes treatment. In this simulated environment, measurements occur every 10 minutes and a recommended dose (or no dose) can be administered at each time step. The duration of each daily episode therefore includes 144 distinct measurements.
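The episode structure described above — a day of 144 ten-minute steps with meals sampled stochastically around common U.S. eating times — can be sketched as follows. The distributions and constants here are illustrative placeholders; the exact values live in the dosing-rl-gym source.

```python
import numpy as np

# Illustrative sketch: sample meal times (minutes from midnight) and
# carbohydrate amounts around common U.S. eating times. The actual
# distributions used by dosing-rl-gym may differ.
MEAL_TIMES_MIN = [7 * 60, 12 * 60, 18 * 60]   # breakfast, lunch, dinner
TIME_SD_MIN = 30                               # spread around each meal time
CARB_MEAN, CARB_SD = 60.0, 15.0                # grams of carbohydrate

def sample_daily_meals(rng):
    """Return a list of (time_step, carbs) pairs for one simulated day
    of 144 ten-minute measurement steps."""
    meals = []
    for t_mean in MEAL_TIMES_MIN:
        t = rng.normal(t_mean, TIME_SD_MIN)
        carbs = max(0.0, rng.normal(CARB_MEAN, CARB_SD))
        # Snap to the nearest 10-minute measurement step (0..143).
        step = int(np.clip(round(float(t) / 10), 0, 143))
        meals.append((step, carbs))
    return meals
```

Sampling a fresh set of meal times and amounts per episode is what gives each simulated "patient day" its own disturbance profile.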

The success of an interventional treatment is quantified with a reward function that specifies how well controlled an individual’s blood glucose is. To quantify this, a smooth function bounded between -1 and 1 is created based on the distance from the target blood glucose level (80 mg/dL) and the safe boundary levels (65 mg/dL and 105 mg/dL). Blood glucose values outside of this range are considered unsafe, and any treatment that results in such levels is therefore discouraged with a negative reward. This reward is visually described in Figure 2:

Figure 2. The reward function is based on how close an individual's blood glucose levels are to target values. Divergence from this target is penalized.
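One plausible construction of such a reward — a smooth bump that equals +1 at the 80 mg/dL target, crosses exactly 0 at the 65 and 105 mg/dL safety boundaries, and tends toward -1 far outside them — is the following sketch (the environment's exact functional form may differ):

```python
import math

TARGET = 80.0            # mg/dL target blood glucose
LOW, HIGH = 65.0, 105.0  # safe boundaries: reward crosses zero here

def glucose_reward(g):
    """Smooth reward in [-1, 1]: +1 at the target, 0 at the safe
    boundaries, negative for unsafe levels. Illustrative only."""
    # Normalize distance from target by the boundary width on each side,
    # so the reward is exactly 0 at 65 and 105 mg/dL despite the
    # asymmetric range.
    width = TARGET - LOW if g < TARGET else HIGH - TARGET
    d = (g - TARGET) / width
    # 2*exp(-ln(2)*d^2) - 1 equals 1 at d=0, 0 at |d|=1, and -> -1 as d grows.
    return 2.0 * math.exp(-math.log(2.0) * d * d) - 1.0
```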

Actions in this environment represent possible insulin doses administered to the patient, and are discrete quantities ranging from 0 mU to 10 mU in 0.5 mU steps.

The information available to the algorithm at each timepoint (state) includes the current blood glucose level (mg/dL), gut blood glucose (mg/dL), and meal disturbance amount (mmol/L-min), as well as the values of each of these in the previous ten states (i.e., the previous two hours of measurements). Additional individual health traits including weight, height, and age would likely provide useful information in a live implementation, but were not included in the present analysis.
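The action grid and the stacked observation can be sketched as follows; the class and method names are illustrative, not the environment's actual API.

```python
import numpy as np
from collections import deque

# Discrete action space: doses from 0 mU to 10 mU in 0.5 mU steps (21 actions).
ACTIONS_MU = np.arange(0.0, 10.5, 0.5)

class StateBuffer:
    """Sketch of the observation construction: the current measurement
    plus the previous ten (two hours at 10-minute intervals) of blood
    glucose, gut glucose, and meal disturbance."""
    def __init__(self, history=11):
        self.buf = deque(maxlen=history)

    def push(self, blood_glucose, gut_glucose, meal_disturbance):
        self.buf.append((blood_glucose, gut_glucose, meal_disturbance))

    def state(self):
        # Pad with the oldest measurement until the buffer fills up.
        rows = list(self.buf)
        rows = [rows[0]] * (self.buf.maxlen - len(rows)) + rows
        return np.asarray(rows).ravel()   # 11 steps x 3 features = 33 values
```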

Conceptualizing the problem in Strong-RL

As mentioned above, rather than writing a custom framework for building and testing our models, we have opted to use Strong-RL — our platform for building reinforcement learning based solutions.

When we designed Strong RL, we wanted to address some of the core problems researchers and end users face in building and deploying reinforcement learning solutions:

  • Maintaining flexibility in training, evaluating, and deploying different kinds of RL algorithms
  • Being able to test in different ‘environments’ (simulators, historical datasets, and real-world deployments) without having to port your RL agents (which is prone to error)
  • Validating agents pre- and post-deployment using reproducible evaluation
  • Scaling RL algorithms from a local research environment to massive clusters capable of making as many decisions as you need quickly

By addressing these challenges in a standard platform, we can accelerate the transition from design and research to deployment.

Each Strong RL application comprises several standard components, such as a datalog, data modeler, actor, and environment(s), each assembled and configured into a pipeline that can ingest new data and export recommendations.

Strong RL Platform Architecture

In this environment, the datalog holds event-level patient data. These events are the data collected for each patient at 10 minute intervals using the above specified simulation.

The data modeler’s job is to take this event level data and create higher-level models including, most importantly, a target model representing the patients from which we want to learn and for which we want to make recommendations. The data modeler uses Spark for data processing.

After target models are built by the data modeler, they are paired with historical action data and sent to the actor. The actor manages the data exchange between the Strong RL pipeline (built on Spark) and fully customizable (or, built-in default) reinforcement learning agents. The actor observes historical actions and historical targets during learning, and acts on new targets when it comes time to make live recommendations.

Actions are selected from an action space dynamically generated based on the target’s current state. In this way, although the agent ultimately gets to select the next best action, we can tailor the possible actions it can choose from based on what we know a priori to be safe/reasonable for a given patient. Actions may be either discrete (selected from a predefined set) or continuous (selected from a range of infinite options). With our pipeline components assembled, all that’s left is the research to tailor our reinforcement learning agents to this problem.
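Dynamic action-space generation can be illustrated as a simple mask over the full dose grid. The safety rule here is a hypothetical placeholder, not part of Strong-RL's actual API:

```python
import numpy as np

FULL_ACTION_SPACE = np.arange(0.0, 10.5, 0.5)  # all candidate doses, in mU

def safe_action_space(target_state, max_safe_dose_fn):
    """Illustrative per-target action masking: the agent still selects the
    next best action, but only from doses deemed a priori safe/reasonable
    for this patient's current state. `max_safe_dose_fn` stands in for a
    clinical safety rule and is purely hypothetical."""
    cap = max_safe_dose_fn(target_state)
    return FULL_ACTION_SPACE[FULL_ACTION_SPACE <= cap]
```

Because the mask is applied before the agent chooses, prior clinical knowledge constrains exploration without requiring any change to the agent itself.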

To learn more about these concepts, see the Strong-RL Documentation.

Building & evaluating reinforcement learning agents

In many real-world applications of reinforcement learning, environment simulators are unavailable or of limited utility, requiring agents to be trained from non-optimal and potentially confounded historical data. We therefore emulate this scenario by generating a frozen dataset from a historical ‘clinician’ policy containing non-optimal sequences of actions (generated using a trained algorithm with 30% random actions). The experimental agent is then trained on this frozen dataset and evaluated.

In the present experiment, we leveraged a discretized version of the Soft Actor-Critic (SAC) algorithm [7]. This agent was trained on the frozen historical ‘clinician’ dataset via fitted Q iteration until it reached convergence (stable Q-value estimates).
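For illustration, the batch training loop of fitted Q iteration can be sketched with a simple per-action linear regressor standing in for the SAC networks; this shows the shape of the algorithm, not the agent actually used.

```python
import numpy as np

def fitted_q_iteration(S, A, R, S2, n_actions, gamma=0.9, iters=20):
    """Minimal fitted Q iteration on a frozen batch of transitions
    (S, A, R, S2): repeatedly regress Q(s, a) onto the bootstrapped
    target r + gamma * max_a' Q(s', a'). A per-action linear model via
    least squares stands in for the neural networks used in practice."""
    n, d = S.shape
    W = np.zeros((n_actions, d + 1))        # per-action linear Q weights
    X = np.hstack([S, np.ones((n, 1))])     # features with a bias term
    X2 = np.hstack([S2, np.ones((n, 1))])
    for _ in range(iters):
        q_next = X2 @ W.T                   # Q(s', a) for every action
        y = R + gamma * q_next.max(axis=1)  # bootstrapped regression target
        for a in range(n_actions):
            mask = (A == a)
            if mask.any():
                W[a], *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    return W
```

Because the batch is frozen, each sweep only re-fits the regressors against updated targets; no new interaction with patients is ever required during training.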

A patient policy agent utilizing the 500 rule for carbohydrate coverage and the 1800 rule for high blood glucose correction [10] was included as a control condition.
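A sketch of such a rule-based policy, assuming the standard forms of the 500 and 1800 rules and doses expressed in insulin units (the simulator itself doses in mU):

```python
def patient_policy_dose(carbs_g, glucose_mgdl, total_daily_dose,
                        target_mgdl=80.0):
    """Rule-based control policy in the style of ADA pump guidance [10]:
    the '500 rule' gives the insulin-to-carb ratio and the '1800 rule'
    gives the correction factor. Constants and units are illustrative."""
    carb_ratio = 500.0 / total_daily_dose          # grams covered per unit
    correction_factor = 1800.0 / total_daily_dose  # mg/dL drop per unit
    meal_bolus = carbs_g / carb_ratio
    # Only correct downward when glucose is above target.
    correction = max(0.0, (glucose_mgdl - target_mgdl) / correction_factor)
    return meal_bolus + correction
```

For example, a patient on 50 total daily units eating 60 g of carbohydrate at 116 mg/dL would receive a 6-unit meal bolus plus a 1-unit correction under these rules.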

In scenarios where only historical data exist, agents can be evaluated with off-policy estimation. This process requires asking a counterfactual question: “What would have happened if this agent were to act?” Answering this question is challenging because an agent’s actions may have influenced the environment (e.g., it could have caused a decrease in blood glucose levels), but those responses will not be seen in the historical data. In the present experiment, we evaluate agents using both off-policy estimation on historical ‘clinician’ data and online simulation using the previously described diabetic simulator.

We leverage weighted per-decision importance sampling (WPDIS) [11] to generate an estimate of reward attributable to our agent’s policy (πpolicy) given data generated from the historical ‘clinician’ policy (πhist). Off-policy evaluation techniques based on importance sampling assign weights to individual samples to approximate a distribution drawn from the evaluation policy using data from the historical policy. The weight (w) corresponds to how likely a given action is under the agent’s policy as compared to the historical policy.

Figure 3. Off-policy evaluation uses weighted per-decision importance sampling (WPDIS) estimates

For example, if an agent (policy) would never have selected the same action as the clinician, the outcome of that action will be largely ignored. In contrast, if the agent would always have selected the same action as the clinician, the outcome of that action will be highly relevant and thus weighted highly.

Importance sampling techniques can suffer from high-variance under data constraints. We therefore leverage the WPDIS estimator, which increases bias for reduced variance.
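A minimal implementation of the WPDIS estimator, assuming per-step action probabilities under both policies are available as arrays (this is a sketch of the estimator from [11], not Strong-RL's implementation):

```python
import numpy as np

def wpdis(pi_eval, pi_hist, rewards, gamma=1.0):
    """Weighted per-decision importance sampling [11].
    pi_eval, pi_hist: (n_episodes, horizon) probabilities of the logged
    actions under the evaluation and historical policies; rewards: the
    matching (n_episodes, horizon) array of observed rewards."""
    # Cumulative importance ratio up to each decision step.
    ratios = np.cumprod(pi_eval / pi_hist, axis=1)
    discounts = gamma ** np.arange(rewards.shape[1])
    # Normalize the weights at each step across episodes (this is the
    # "weighted" part: it trades a little bias for much lower variance).
    norm = ratios.sum(axis=0)
    per_step = (ratios * rewards).sum(axis=0) / np.where(norm > 0, norm, 1.0)
    return float((discounts * per_step).sum())
```

As a sanity check, when the evaluation policy equals the historical policy all ratios are 1 and the estimator reduces to the mean discounted return of the logged episodes.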

Table 1 shows the estimates of discounted reward attributed to the SAC agent and Patient Protocol, as compared to the historical clinician policy.

+-------------------------------+-------------+
|            Policy             | WPDIS Score |
+-------------------------------+-------------+
| Historical ‘clinician’ policy |       77.64 |
| SAC policy                    |      113.52 |
| Patient policy                |       41.13 |
+-------------------------------+-------------+
Table 1. Estimated reward for each policy with WPDIS estimates

As can be seen in the table above, the SAC policy is estimated to perform 2.7 times as well as the Patient policy on the historical data. To further validate the SAC policy as well as these off-policy estimates, we can leverage the diabetic simulator to generate novel sequences for online evaluation. As demonstrated below in Figure 4, the SAC policy maintains blood glucose levels closer to specified safe targets, with a smaller number of unsafe events.

Figure 4. Distribution of blood glucose levels for Patient and SAC policies. SAC policy maintains blood glucose levels closer to safe target levels.

Using the online simulation, each of the three agents was evaluated over 10,000 episodes. Table 2 shows the mean reward for each agent over these sequences. The high degree of correlation (ρ = .995) between online rewards and WPDIS estimates validates the technique’s usefulness in estimating algorithm performance in scenarios where an online simulation is unavailable.

+-------------------------------+-------------+
|            Policy             | Mean Reward |
+-------------------------------+-------------+
| Historical 'clinician' policy |       67.66 |
| SAC policy                    |      113.04 |
| Patient policy                |       34.86 |
+-------------------------------+-------------+

Table 2. Observed mean reward over 10,000 sequences for each policy

Inspecting trajectories (Figures 5 & 6 below) reveals the effectiveness of the SAC policy at keeping blood glucose levels near optimal levels, despite the high variability of its individual doses in comparison to the Patient policy. The SAC policy avoids the drastic swings of the Patient policy by dynamically suggesting insulin doses that take into account individual responsiveness and historical pharmacodynamics.

Figure 5. Individual doses suggested by the SAC policy are highly variable in contrast to the Patient policy.
Figure 6. Despite the high variability of individual doses, blood glucose levels under the SAC policy are less variable and closer to target values as compared to the Patient policy.

Reinforcement learning for personalized medication dosing

In this study we demonstrate the remarkable effectiveness of reinforcement learning as an approach to implement personalized medication dosing systems. By leveraging the sequential decision making abilities of reinforcement learning and incorporating individual pharmacokinetic and pharmacodynamic factors, we demonstrate improved safety and performance as compared to traditional rules-based policies for insulin dosing. Importantly, while the outputs of such systems can be used for full automation, they can also be used to augment human decision-making.

The process of implementing reinforcement learning-based systems has traditionally remained a substantial technical challenge. The Strong-RL platform facilitates critical steps of building, evaluating, and scaling such systems on real-world problems.


References

[1] Gillis, R., Palerm, C. C., Zisser, H., Jovanovic, L., Seborg, D. E., & Doyle, F. J., III. (2007). Glucose estimation and prediction through meal responses using ambulatory subject data for advisory mode model predictive control. Journal of Diabetes Science and Technology, 1(6), 825–833. https://doi.org/10.1177/193229680700100605

[2] Ngo, P. D., Wei, S., Holubová, A., Muzik, J., & Godtliebsen, F. (2018). Control of Blood Glucose for Type-1 Diabetes by Using Reinforcement Learning with Feedforward Algorithm. Computational and Mathematical Methods in Medicine, 2018, 1–8. https://doi.org/10.1155/2018/4...

[3] Contreras, I., & Vehi, J. (2018). Artificial Intelligence for Diabetes Management and Decision Support: Literature Review. Journal of Medical Internet Research, 20(5), e10775. https://doi.org/10.2196/10775

[4] Man, C. D., Micheletto, F., Lv, D., Breton, M., Kovatchev, B., & Cobelli, C. (2014). The UVA/PADOVA Type 1 Diabetes Simulator: New Features. Journal of Diabetes Science and Technology, 8(1), 26–34. https://doi.org/10.1177/193229...

[5] Type 1 diabetes - Symptoms and causes - Mayo Clinic. (n.d.). Retrieved May 7, 2019, from https://www.mayoclinic.org/diseases-conditions/type-1-diabetes/symptoms-causes/syc-20353011

[6] Hedengren, J. (n.d.). Maintain Glucose in Type-I Diabetic. Retrieved May 7, 2019, from http://apmonitor.com/pdc/index.php/Main/DiabeticBloodGlucose

[7] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., … Levine, S. (2018). Soft Actor-Critic Algorithms and Applications. Retrieved from https://arxiv.org/pdf/1812.059...

[8] Dulac-Arnold, G., Mankowitz, D., & Hester, T. (2019). Challenges of Real-World Reinforcement Learning. Retrieved from http://arxiv.org/abs/1904.1290...

[9] Kapinski, J., Deshmukh, J. V., Jin, X., Ito, H., & Butts, K. (2016). Simulation-Based Approaches for Verification of Embedded Control Systems: An Overview of Traditional and Advanced Modeling, Testing, and Verification Techniques. IEEE Control Systems Magazine. https://doi.org/10.1109/MCS.2016.2602089

[10] American Diabetes Association. (n.d.). Getting Started with an Insulin Pump: American Diabetes Association®. Retrieved May 21, 2019, from http://www.diabetes.org/living-with-diabetes/treatment-and-care/medication/insulin/getting-started.html

[11] Precup, D., Sutton, R. S., & Singh, S. P. (2000). Eligibility Traces for Off-Policy Policy Evaluation. In Proceedings of the 17th International Conference on Machine Learning (ICML), 759–766.
