I Introduction
Ensuring stability of neural control policies is critical for the practical use of learning-based methods for control design in robotics. There has been exciting progress towards introducing control-theoretic approaches for enhancing stability in reinforcement learning and imitation learning, such as using Lyapunov methods [4, 20, 9, 10, 14, 16, 6, 11], control barrier functions [7, 27, 2], or control-theoretic regularization [8, 15, 17]. However, three major questions are still open. First, while giving direct control-theoretic priors or guidance can improve performance, to what extent can a learning agent self-supervise to achieve certifiable stability? Second, stability requirements should be inherent to a system and not affected by the choice of reward functions, which often do not focus on stability. Can we achieve stability without restricting the flexibility of reward engineering in reinforcement learning? Third, the typical formulation of stability in learning is in the form of asymptotically reducing the expected violation of certain constraints, which does not easily translate to certifiable behaviors when the learned policies are practically deployed. Is it possible to obtain stronger stability claims in the standard control-theoretic sense, such as estimates of the region of attraction and forward invariance properties? Our goal is to answer these questions positively, to improve the reliability and usability of learning-based methods for practical control problems in robotics.
We propose new methods for incorporating Lyapunov methods in deep reinforcement learning. We design a model-free policy optimization process in which the agent attempts to formulate a Lyapunov-like function through self-supervision, in the form of a special critic function that is then used for improving stability of the learned policy, without using or being affected by the environment reward. Our work follows the framework of actor-critic Lyapunov methods recently proposed in [14] and extends it in the following ways. We allow the agent to self-learn the Lyapunov critic function by minimizing the Lyapunov risk [6] over its experience buffer without accessing the rewards. This Lyapunov critic function is represented as a generic feed-forward neural network, randomly initialized. This design differs from the typical choice of using a positive definite neural network to construct the Lyapunov candidate, to capture values learned from appropriately designed cost functions that only reflect stability objectives, as in [14, 21]. The lack of restriction and guidance turns out to be beneficial. In our approach, the learning agent can formulate Lyapunov landscapes that better enforce stability, compared to existing methods. In fact, this form of self-learned Lyapunov candidate can often be shown to satisfy the Almost Lyapunov conditions [18], which allows us to use sample-based analysis to estimate its region of attraction and forward invariance properties. We show that the new design is important for learning certifiably stable control policies for practical control problems such as automobile path-tracking and quadrotor control.
Our work builds on the recent progress of model-free and sample-based Lyapunov methods [18, 5, 14, 21]. Such methods allow us to use sampled Lie derivatives of candidate Lyapunov functions to estimate stability properties of neural control policies without using analytic forms of the system dynamics. We exploit the expressiveness of neural networks for capturing Lyapunov landscapes that are too complex for conventional choices such as sum-of-squares polynomials. We believe the proposed methods further advance the promising direction of developing rigorous neural control and certification methods in reinforcement learning.
In all, we make the following contributions. We propose new methods for training neural Lyapunov critic functions (Section IV.A) and use them to improve stability properties of neural control policies (Section IV.B) in a model-free policy optimization setting. We show that the learned Lyapunov critic function can be analyzed through sample-based methods based on the Almost Lyapunov conditions, for estimating its region of attraction and invariance properties (Section IV.C). We demonstrate the benefits of the proposed methods in comparison with standard policy optimization and existing Lyapunov-based actor-critic methods (Section V).
II Related Work
Besides the most closely related work [14] discussed above, our work is connected to recent progress in safe reinforcement learning, Lyapunov methods in reinforcement learning, and data-driven methods for the analysis of control and dynamical systems. Safe reinforcement learning is now a large and active field [12, 1] focusing on learning with various forms of soft or hard safety constraints, often formulated as Constrained Markov Decision Processes (CMDP) [3]. Lyapunov methods have seen various applications in this context. They were first introduced in RL by the work of [20], which uses predefined controllers and Lyapunov-based methods for learning a switching policy with safety and performance guarantees. [4] proposed a model-based RL framework that uses Lyapunov functions to guide safe exploration and uses Gaussian Process models of the dynamics to obtain high-performance control policies with provable stability certificates. [9] developed methods for constructing Lyapunov functions in the tabular setting that can guarantee global safety of a behavior policy during training. The approach is extended to policy optimization for the continuous control setting in [10], showing benefits in various high-dimensional control tasks. The work in [16] formulates the state-action value function for safety costs as candidate Lyapunov functions and models its derivative with Gaussian Processes, with statistical guarantees on the control performance. Similarly, in [14], candidate Lyapunov functions are constructed from value functions, with benefits in stabilizing the control performance. In the work of [21, 6], neural networks are used to learn Lyapunov functions for establishing certificates for stability and safety properties.
Many other control-theoretic methods have also been proposed to improve reinforcement learning [7, 27, 2, 8, 15, 17, 11]. These methods typically involve introducing strong control-theoretic priors, or using the neural policies as an oracle to extract low-variance policies in simpler hypothesis classes. For instance, the work in [7] uses control barrier functions to ensure safety of learned control policies. The work in [8] proves stability properties throughout learning by taking advantage of the robustness of control-theoretic priors. Our goal is to turn control-theoretic methods into general self-supervision methods for enhancing stability.
III Preliminaries
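Definition 1 (Dynamical Systems).
An $n$-dimensional controlled dynamical system is defined by
$$\frac{dx(t)}{dt} = f(x(t), u(t)), \quad x(0) = x_0, \tag{1}$$
where $f: D \times U \to \mathbb{R}^n$ is a Lipschitz-continuous vector field, $u$ is a control function with values in $U \subseteq \mathbb{R}^m$, and $D \subseteq \mathbb{R}^n$ with $0 \in D$ defines the state space of the system. Each $x(t) \in D$ is called a state vector and $u(t) \in U$ is a control vector.
Definition 2 (Stability).
We say that the system of (1) is stable at the origin if for any $\epsilon > 0$, there exists $\delta > 0$ such that if $\|x(0)\| < \delta$ then $\|x(t)\| < \epsilon$ for all $t \geq 0$. The system is asymptotically stable at the origin if it is stable and also $\lim_{t \to \infty} \|x(t)\| = 0$ for all $\|x(0)\| < \delta$.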
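Definition 3 (Lie Derivatives).
Consider the system in (1) and let $V: D \to \mathbb{R}$ be a continuously differentiable function. The Lie derivative of $V$ over $f$ is defined as
$$L_f V(x) = \sum_{i=1}^{n} \frac{\partial V}{\partial x_i} \frac{dx_i}{dt} = \sum_{i=1}^{n} \frac{\partial V}{\partial x_i} f_i(x, u). \tag{2}$$
It measures the rate of change of $V$ over time along the direction of the system dynamics of $f$.
Definition 4 (Lyapunov Conditions for Asymptotic Stability).
Consider a controlled system (1) with an equilibrium at the origin, i.e., there exists a control $u_0$ s.t. $f(0, u_0) = 0$. Suppose there exists a continuously differentiable function $V: D \to \mathbb{R}$ satisfying $V(0) = 0$, and $V(x) > 0$ for all $x \in D \setminus \{0\}$, and $L_f V(x) < 0$ for all $x \in D \setminus \{0\}$. Then $V$ is a Lyapunov function. The system is asymptotically stable at the origin if such a Lyapunov function can be found.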
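As a small illustration of the Lie derivative and the Lyapunov conditions, the following sketch checks the conditions at sampled states for a hand-picked closed-loop system. The system `f`, the feedback `policy`, and the quadratic candidate `V` are hypothetical choices made only for this example and do not come from the paper; for this particular choice the Lie derivative works out to $-(x_1^2 + x_2^2)$.

```python
# Hypothetical example system, feedback law, and Lyapunov candidate (assumptions).

def f(x, u):
    """Dynamics dx1/dt = x2, dx2/dt = -x1 + u."""
    x1, x2 = x
    return (x2, -x1 + u)

def policy(x):
    """Simple damping feedback u = -x2."""
    return -x[1]

def V(x):
    """Quadratic candidate V(x) = 1.5*x1^2 + x1*x2 + x2^2."""
    x1, x2 = x
    return 1.5 * x1 * x1 + x1 * x2 + x2 * x2

def grad_V(x):
    """Analytic gradient of V."""
    x1, x2 = x
    return (3.0 * x1 + x2, x1 + 2.0 * x2)

def lie_derivative(x):
    """L_f V(x) = sum_i dV/dx_i * f_i(x, u) under the feedback policy."""
    dx = f(x, policy(x))
    return sum(g * d for g, d in zip(grad_V(x), dx))

def satisfies_lyapunov_conditions(samples):
    """Check V(0) = 0, and V(x) > 0 and L_f V(x) < 0 at every nonzero sample."""
    if V((0.0, 0.0)) != 0.0:
        return False
    return all(V(x) > 0.0 and lie_derivative(x) < 0.0 for x in samples)
```

Passing such a check on finitely many samples does not by itself prove that `V` is a Lyapunov function; it only reports that no violation was found at the sampled states.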
The recent work [18] proposed an approximate notion of Lyapunov methods named Almost Lyapunov functions, which allows the Lyapunov conditions to be violated in restricted subsets of the space while still ensuring stability properties. It is the basis of our sample-based approach for validating learned Lyapunov candidates. We will discuss this notion and our approach in detail in Section IV.C.
We will use the standard notations for reinforcement learning in Markov Decision Processes (MDP) with state space $\mathcal{S}$, action space $\mathcal{A}$ and transition model $P(s' \mid s, a)$. A reward function $r(s, a, s')$ defines the reward for taking action $a$ in state $s$ and transitioning into $s'$. Let $\pi_\theta(a \mid s)$ denote a stochastic policy parameterized by $\theta$. The goal of the learning agent is to maximize the expected discounted cumulative return $J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where $\gamma$ is the discount factor. Policy optimization methods [28, 19, 23, 29] estimate the policy gradient and use stochastic gradient ascent to directly improve policy performance. A standard gradient estimator is
$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right], \tag{3}$$
where $\pi_\theta$ is a stochastic policy and $\hat{A}_t$ estimates the advantage, i.e., the difference between the Q value of an action and the expected value of a state, which indicates whether an action should be taken more frequently in the future. The gradient steps move the distribution over actions in the corresponding direction. The expectation is estimated by the empirical average over a finite batch of samples. The proximal policy optimization algorithm (PPO) [25] applies clipping to the objective function to remove incentives for the policy to change dramatically, using:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right], \tag{4}$$
where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ and $\epsilon$ is a hyperparameter. The clipping ensures that the gradient steps do not overshoot in the policy parameter space in each policy update.
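The clipped objective of PPO [25] can be sketched in a few lines. This is a minimal illustration over a batch of precomputed log-probabilities and advantage estimates, not the implementation used in the experiments; the function name `ppo_clip_objective` and the default `eps=0.2` are assumptions made for the example.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Batch average of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the previous policy; advantages: advantage estimates.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)            # probability ratio r_t
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        total += min(ratio * adv, clipped * adv)     # pessimistic bound
    return total / len(advantages)
```

With a ratio of 1.5 and a positive advantage, the clipped term caps the contribution at `1 + eps`, so the objective gives no further incentive to increase the probability of that action within a single update.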
IV Policy Optimization with Lyapunov Critics
We now describe how to use self-learned candidate Lyapunov functions to improve stability of neural control policies. We will name these candidate Lyapunov functions Lyapunov critics, following [14]. We will first describe how to learn Lyapunov critics through sampled trajectories, and then how to integrate these critic values in advantage estimation for policy optimization. We then describe sampling-based certification of stability using Lyapunov critics, following the framework of Almost Lyapunov functions. The full procedure is shown in Algorithm 1. The overall loop is close to standard PPO [25, 24], with additional steps only at Lines 11-12, for learning the Lyapunov critic, and Line 17, for policy optimization guided by the new critic. We will explain these steps in the following sections.
IV-A Self-Learning Lyapunov Critics
We represent candidate Lyapunov functions using neural networks, draw samples of the system states, and use gradient descent to learn network parameters that maximize the satisfaction of the Lyapunov conditions in Definition 4. Following [6], such learning can be achieved by minimizing the following loss function, named the Lyapunov risk:
Definition 5 (Empirical Lyapunov Risk [6]).
Consider the dynamical system (1) with domain $D$. Let $V_\phi$ be a continuously differentiable function parameterized by $\phi$. The empirical Lyapunov risk of $V_\phi$ over $D$ with a sampling distribution $\rho$, written as $L_{N,\rho}(\phi)$, is defined as:
$$L_{N,\rho}(\phi) = \frac{1}{N}\sum_{i=1}^{N}\Big(\max\big(0, -V_\phi(x_i)\big) + \max\big(0, L_f V_\phi(x_i)\big)\Big) + V_\phi^2(0), \tag{5}$$
where $x_1, \ldots, x_N$ are states sampled according to the sampling distribution $\rho$. The empirical Lyapunov risk is nonnegative, and when $V_\phi$ is a true Lyapunov function for $f$, this empirical risk attains its global minimum $0$, regardless of the choice of the sample size $N$ and distribution $\rho$.
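The structure of the empirical Lyapunov risk can be made concrete with a short sketch. It assumes the risk takes the form used in [6]: the sample average of max(0, -V(x)) + max(0, L_f V(x)), plus V(0) squared. The function name, the callable arguments, and the two one-dimensional candidates below are hypothetical and serve only as an illustration.

```python
def empirical_lyapunov_risk(V, lie_derivative, samples, origin):
    """Average violation of V(x) > 0 and L_f V(x) < 0, plus V(origin)^2.

    V and lie_derivative are callables on states; samples is a list of states.
    The form of the risk is assumed to follow [6].
    """
    penalty = sum(
        max(0.0, -V(x)) + max(0.0, lie_derivative(x)) for x in samples
    )
    return penalty / len(samples) + V(origin) ** 2

# Hypothetical 1-D system dx/dt = -x with two candidates.
good_V = lambda x: x * x             # L_f V = 2x * (-x) = -2x^2
good_lie = lambda x: -2.0 * x * x
bad_V = lambda x: x                  # not positive definite
bad_lie = lambda x: -x               # L_f V = 1 * (-x)

risk_good = empirical_lyapunov_risk(good_V, good_lie, [-1.0, 1.0], 0.0)
risk_bad = empirical_lyapunov_risk(bad_V, bad_lie, [-1.0, 1.0], 0.0)
```

Here `risk_good` is zero because the quadratic candidate satisfies both conditions at every sample, while `risk_bad` is positive because the linear candidate is negative, with a positive Lie derivative, at the sample `-1.0`.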
In this original formulation, evaluating the Lyapunov risk requires full knowledge of the system dynamics for computing $L_f V_\phi$. Our current setting is different. The learning agent does not know the system dynamics and can only approximate the Lie derivative along sampled trajectories of the system, through finite differences:
$$L_{f,\Delta t} V_\phi(x_i) = \frac{1}{\Delta t}\big(V_\phi(x_{i+1}) - V_\phi(x_i)\big), \tag{6}$$
where $x_i$ and $x_{i+1}$ are two consecutive states and $\Delta t$ is the time difference between them. We define the corresponding discretized Lyapunov risk objective $L_{N,\rho,\Delta t}(\phi)$, where the only change from (5) above is that the Lie derivative $L_f V_\phi$ is replaced by the sampled estimate $L_{f,\Delta t} V_\phi$ for all states.
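The finite-difference estimate along a sampled trajectory, and the discretized risk built from it, can be sketched as follows. The function names are hypothetical, and the risk is again assumed to have the hinge-style form of [6], with the Lie derivative replaced by its sampled estimate.

```python
def sampled_lie_derivatives(V, trajectory, dt):
    """Estimate the Lie derivative at each state by (V(x_next) - V(x)) / dt."""
    values = [V(x) for x in trajectory]
    return [(v_next - v) / dt for v, v_next in zip(values, values[1:])]

def discretized_lyapunov_risk(V, trajectory, dt, origin):
    """Lyapunov risk using sampled Lie derivatives; the last state has no successor."""
    lie = sampled_lie_derivatives(V, trajectory, dt)
    states = trajectory[:-1]
    penalty = sum(max(0.0, -V(x)) + max(0.0, d) for x, d in zip(states, lie))
    return penalty / len(states) + V(origin) ** 2

# Hypothetical rollout of dx/dt = -x with Euler steps of size dt = 0.1.
dt = 0.1
trajectory = [1.0]
for _ in range(5):
    trajectory.append(trajectory[-1] * (1.0 - dt))

estimates = sampled_lie_derivatives(lambda x: x * x, trajectory, dt)
```

Only consecutive states and the step size are needed, so the estimate can be computed from logged rollouts without any model of the dynamics.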
At each iteration of the policy update (the main loop from Line 4 to 23 in Algorithm 1), we collect trajectories of samples from the system controlled by the current policy $\pi_\theta$, and perform stochastic gradient descent to minimize the discretized Lyapunov risk $L_{N,\rho,\Delta t}(\phi)$ to learn the Lyapunov critic function (Lines 11-12). Note that the dependency of the dynamics on the behavior policy is important. When the policy is updated, the controlled system changes its dynamics, and the candidate Lyapunov function should be learned correspondingly. Thus, we need to be able to sample from the buffer of previous trajectories, and estimate their Lie derivative values using the new policies (Lines 10-11). This dependency makes our approach inherently on-policy in the current formulation, in the sense that the critic is always learned from the behavior policy, and it is thus not very sample-efficient. Off-policy learning of the Lyapunov critic is possible through importance sampling and keeping track of the policy and Lyapunov critic updates, which is a promising direction that we leave open.
IV-B Stabilization via Policy Optimization
Once the Lyapunov critic is formulated, we use the learned candidate Lyapunov functions as an additional critic value for policy optimization. For a temporarily fixed candidate Lyapunov function $V_\phi$, the only term in the Lyapunov risk that is affected by policy change is the Lie derivative term $L_f V_\phi$. Thus, in policy optimization we now impose the additional goal of ensuring $L_f V_\phi < 0$ to encourage the policy update to improve the Lyapunov landscape for stabilization. We combine the standard advantage estimate $\hat{A}_t$ [25] and a clipped Lie derivative term in each policy optimization step:
$$\hat{A}^{Ly}_t = \beta \hat{A}_t + (1-\beta)\min\big(0, -L_{f,\Delta t} V_\phi(s_t)\big), \tag{7}$$
where $\beta$ balances the weights. With policy gradient methods on this advantage estimate, the second term penalizes actions that produce a positive Lie derivative, which makes $-L_{f,\Delta t} V_\phi(s_t) < 0$. When the Lie derivative is negative, it does not bias the advantage estimate. We emphasize that although the Lie derivative term is dependent on $\pi_\theta$, it has no functional form and is only accessed through sampled values based on Equation (6). The true dynamics of the system remain unknown to the learner.
With the definition of the advantage with Lyapunov critic in (7), we can easily replace the standard advantage estimators in various on-policy algorithms. For instance, the standard policy gradient estimator becomes $\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}^{Ly}_t\right]$. In Algorithm 1 we use the PPO version [25] of the policy update (Lines 19-20), which is what we use in the experiments. Optionally, we can update the parameter $\beta$ as a Lagrange multiplier, by also taking gradient steps on $\beta$ at some learning rate and clipping it to a bounded range. We have not observed much performance difference in doing so, and have excluded this step from Algorithm 1 for simplicity.
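The way the Lyapunov critic enters the advantage estimate can be sketched as below. The exact weighting is an assumption of this example: the advantage is taken to be a convex combination, with weight `beta`, of the standard advantage and the clipped term min(0, -L), where L is the sampled Lie derivative. The function name and the default `beta=0.5` are hypothetical.

```python
def lyapunov_advantage(advantages, lie_derivatives, beta=0.5):
    """Combine standard advantages with a clipped Lie-derivative penalty.

    Assumed form: beta * A_t + (1 - beta) * min(0, -L_t). The penalty is zero
    when the sampled Lie derivative L_t is negative and negative otherwise.
    """
    return [
        beta * a + (1.0 - beta) * min(0.0, -lie)
        for a, lie in zip(advantages, lie_derivatives)
    ]

# Two transitions with equal standard advantage but opposite Lie-derivative sign.
shaped = lyapunov_advantage([1.0, 1.0], [-2.0, 2.0], beta=0.5)
```

In this example the first transition keeps a shaped advantage of 0.5, while the second, which increases the critic value along the trajectory, drops to -0.5 and is therefore discouraged by the policy gradient.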
IV-C Validating Almost Lyapunov Conditions
A major benefit of the proposed approach is that the self-learned Lyapunov critic allows us to estimate the region of attraction of the controlled system when learning is successful. Since we do not have access to the dynamics of the system and can only estimate the Lyapunov risk at sampled states, we cannot expect to certify that the learned Lyapunov critic functions are true Lyapunov functions in the standard sense. However, recent progress in relaxed conditions for Lyapunov methods has enabled the use of sample-based analysis to find regions of stability. In particular, the Lyapunov critic functions can be analyzed through sampling using the Almost Lyapunov conditions [18], defined as follows:
Proposition 1 (Almost Lyapunov Conditions [18]).
Consider a dynamical system in (1) defined by with domain and a continuously differentiable positive definite function . Let be two constants and define as the region between two sublevel sets for . Let be a measurable set. Suppose for some , , and and . Then there exists such that for any , if the volume of each connected component of satisfies , then there exists such that for any with , stays within for all and moreover, it converges to the sublevel set for any . Here for some constants and .
The full proof is in [18], where all constants are explicitly constructed. Conceptually, the theorem relaxes the standard Lyapunov conditions to allow a set that contains the violation states, where the Lie derivative can be positive. As long as each connected component of this set is small enough in volume, the violations do not affect stability. Under mild conditions, the appropriate region between sublevel sets of the system defines a forward invariant set for all trajectories, together with an approximate form of contraction in which the trajectory converges to a neighborhood of the lower level set of the region.
With the Almost Lyapunov conditions, we can use an $\epsilon$-net over the space, chosen based on the Lipschitz constant of the Lie derivative, such that using the sampled value at the center of each cell we can identify the set of states where the Lie derivatives violate the standard conditions. In Figure 1, we show the results of such a computation to compare four different types of candidate Lyapunov functions for the inverted pendulum controlled by neural network policies. Each plot visualizes the sign of the Lie derivative of the proposed Lyapunov candidate at uniformly sampled points over the space. The grey dots represent cells where the candidate Lyapunov function has negative Lie derivative values (as required in Proposition 1), and the red dots indicate cells that violate this condition. The value of the Lyapunov candidate itself can be shown to be always nonnegative in the domain in all four cases, and the black contours represent the level sets of increasingly positive values. The patterns of the violations differ across the four cases.
Figure 1(a) shows the self-learned Lyapunov critic obtained after learning the neural control policy using Algorithm 1. We see that we can find a sublevel set of the Lyapunov critic (inside the blue circle) where the Lie derivative is positive only in a very small number of sparse regions (a few red dots near the center). This landscape satisfies the Almost Lyapunov conditions, and the sublevel set defines a forward invariant set with attraction (Proposition 1).
Figure 1(b) shows the landscape generated by the Lyapunov actor-critic method (LAC) [14]. We see that the Lyapunov conditions are satisfied more globally, although with more violations closer to the origin. The function can also be established as an Almost Lyapunov function in a region where the violations are sparse, which does not include the innermost sublevel set. The level sets are also much larger than those in (a), making it harder to find sublevel sets that are forward invariant. On the other hand, as shown in the next section, the learned controller does stabilize the system in all sampled trajectories. This indicates that there may exist better Lyapunov candidates for certifying the stability.
Figure 1(c) shows the Lyapunov candidate fit for the control policy learned by standard PPO. We see that many more violation states are observed, and the landscape does not allow us to find a region where the Almost Lyapunov conditions can be validated. This indicates that the lack of a Lyapunov critic makes it hard to enforce stability properties.
Figure 1(d) uses the quadratic Lyapunov function for the linearized inverted pendulum, obtained through LQR, directly as a Lyapunov candidate for the neural controller trained by Algorithm 1, which is the same controller as in (a). We see that this simple Lyapunov function does not satisfy the Almost Lyapunov conditions and fails to capture the stability properties of the learned controller.
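The grid-based computation behind plots of this kind can be sketched as follows: evaluate the Lie derivative at the center of each cell, mark the cells that violate the negativity condition, and group the marked cells into connected components so that the size of each component can be inspected. This is a simplified two-dimensional illustration. The function names, the fixed cell count, and the example Lie derivative are hypothetical, and choosing the cell width from a Lipschitz constant is not shown.

```python
def violation_cells(lie_derivative, lo, hi, n):
    """Return the (i, j) cells of an n x n grid over [lo, hi]^2 whose center
    has a nonnegative Lie derivative."""
    step = (hi - lo) / n
    bad = set()
    for i in range(n):
        for j in range(n):
            center = (lo + (i + 0.5) * step, lo + (j + 0.5) * step)
            if lie_derivative(center) >= 0.0:
                bad.add((i, j))
    return bad

def connected_components(cells):
    """Group cells into 4-connected components using a flood fill."""
    remaining = set(cells)
    components = []
    while remaining:
        start = remaining.pop()
        component = {start}
        stack = [start]
        while stack:
            i, j = stack.pop()
            for nb in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    component.add(nb)
                    stack.append(nb)
        components.append(component)
    return components

# Hypothetical Lie derivative that is positive only in a small disk at the origin.
example_lie = lambda x: 0.05 - (x[0] ** 2 + x[1] ** 2)
bad = violation_cells(example_lie, -1.0, 1.0, 10)
components = connected_components(bad)
```

The area of a component is its cell count times the squared cell width, which is the quantity to compare against the volume bound required of each connected component of the violation set.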
V Experiments
We now show experimental results on the proposed methods for various nonlinear control problems. Our implementation follows Algorithm 1 (referred to as LY) and optionally uses an additional entropy term in the advantage estimates. We compare the performance with the closely related work of Lyapunov-based Actor-Critic (LAC) [14], and also with standard implementations of Soft Actor-Critic (SAC) [13] and PPO [25]. We use the following control environments: the inverted pendulum, quadrotor control, automobile path-tracking, and MuJoCo Hopper and Walker.
Inverted Pendulum. For the standard inverted pendulum (Pendulum-v0 in OpenAI Gym), Figure 2 shows the system trajectories under learned policies plotted in the $(\theta, \dot{\theta})$ space. The blue dots indicate initial positions and the red dots indicate positions after stabilization. Both LY and LAC learn stable controllers that reach the upright position without falling down, i.e., showing monotonically decreasing values. In contrast, the policies trained with SAC and PPO show horizontal lines that indicate falling down, by first swinging downwards and then back to the upright position. As discussed in the previous section with Figure 1, both LY and LAC learn Almost Lyapunov functions that are consistent with the stable control behaviors.
Quadrotor control. We learn a neural controller for the 6-DOF quadrotor model, which has four control inputs and twelve state variables. The control inputs are the angular velocities of the four rotors. The state variables are the inertial-frame positions, velocities, rotation angles, and angular velocities. The equations of motion of the quadrotor are given in [22], and Figure 3(a) shows the schematic. The control problem is known to be hard for policy gradient methods, typically requiring imitation learning steps. In Figure 4, we see that the LY method can learn a stable tracking controller for the system. Moreover, the learned Lyapunov critic is shown in Figure 4(a) and its Lie derivatives are shown in Figure 4(b). The Almost Lyapunov conditions can be validated within the blue level set, which certifies the stable behavior shown in the trajectory. In comparison, LAC fails to learn a working controller.
Automobile path-tracking control. The control problem of following a target path and speed command is at the core of autonomous driving [26]. In the training environment, the state has four dimensions: the target velocity, the angular error, the distance to the path, and the vehicle speed. The action space contains an acceleration and a steering control. The car dynamics follow the standard bicycle model [26], and Figure 3(b) shows a schematic. In this experiment, we observe that all methods can obtain working control policies in the training environment, as shown in Figure 5(a). However, the Lyapunov landscape matters when applying the policy on a different path, as illustrated in the bottom row of Figure 5. The LY method generalizes well to unseen paths because of its region of attraction, validated via the Almost Lyapunov conditions within the blue level set of the learned Lyapunov critic. In Figure 5(c), we see that the Lyapunov critic learned in LAC does not create a landscape that enforces a region of attraction in the sense of the Almost Lyapunov conditions.
MuJoCo Walker and Hopper. Walker and Hopper are standard high-dimensional locomotion environments. For both, the goal is to move forward as fast as possible, not stabilization. The LAC method requires cost functions that express stabilization objectives only, and thus does not work in these environments. We can learn Lyapunov critics that are independent of the reward, focusing only on stabilizing the joint angles to hold the upright position. As shown in Figure 6, the LY controller maintains better pose and gaits compared to SAC and PPO. The learning curves show that LY does not slow down learning, and can be used in generic high-dimensional control tasks to improve performance.
VI Conclusion
We proposed new methods for training stable neural control policies using Lyapunov critic functions. We showed that the learned Lyapunov critics can be used to estimate regions of attraction for the controllers based on Almost Lyapunov conditions. We demonstrated the benefits of the proposed methods in various nonlinear control problems. Future work includes further improving sample complexity of the Lyapunov critic learning as well as the validation process for ensuring the Almost Lyapunov conditions.
References

[1] (2017) Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 22–31.
[2] (2018) Safe reinforcement learning via shielding. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 2669–2678.
[3] (1999) Constrained Markov decision processes. CRC Press.
[4] (2017) Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 908–918.
[5] (2018) Automated-sampling-based stability verification and DOA estimation for nonlinear systems. IEEE Transactions on Automatic Control 63 (11), pp. 3659–3674.
[6] (2019) Neural Lyapunov control. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 3240–3249.
[7] (2019) End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, pp. 3387–3395.
[8] (2019) Control regularization for reduced variance reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 1141–1150.
[9] (2018) A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 8103–8112.
[10] (2019) Lyapunov-based safe policy optimization for continuous control. CoRR abs/1901.10031.
[11] (2019) Safe interactive model-based learning. arXiv:1911.06556.
[12] (2015) A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16, pp. 1437–1480.
[13] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018, pp. 1861–1870.
[14] (2020) Actor-critic reinforcement learning for control with stability guarantee. IEEE Robotics and Automation Letters and IROS. arXiv:2004.14288.
[15] (2017) Control of a quadrotor with reinforcement learning. IEEE Robotics and Automation Letters 2 (4), pp. 2096–2103.
[16] (2018) Control-theoretic analysis of smoothness for stability-certified reinforcement learning. In 57th IEEE Conference on Decision and Control, CDC 2018, Miami, FL, USA, December 17-19, 2018, pp. 6840–6847.
[17] (2020) Robust regression for safe exploration in control. A. M. Bayen, A. Jadbabaie, G. Pappas, P. A. Parrilo, B. Recht, C. Tomlin, and M. Zeilinger (Eds.), Proceedings of Machine Learning Research, Vol. 120, The Cloud, 10-11 Jun 2020, pp. 608–619.
[18] (2020) Almost Lyapunov functions for nonlinear systems. Automatica 113, pp. 108758. ISSN 0005-1098.
[19] (2016) Asynchronous methods for deep reinforcement learning. In ICML 2016, pp. 1928–1937.
[20] (2002) Lyapunov design for safe reinforcement learning. J. Mach. Learn. Res. 3, pp. 803–832.
[21] (2018) The Lyapunov neural network: adaptive stability certification for safe learning of dynamical systems. In 2nd Annual Conference on Robot Learning, CoRL 2018, Zürich, Switzerland, 29-31 October 2018, Proceedings, pp. 466–476.
[22] (May 2020) A survey of path following control strategies for UAVs focused on quadrotors. Journal of Intelligent and Robotic Systems.
[23] (2015) Trust region policy optimization. In ICML 2015, pp. 1889–1897.
[24] (2018) High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438.
[25] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[26] (April 2011) Automatic steering methods for autonomous automobile path tracking.
[27] (2019) Learning for safety-critical control with control barrier functions. arXiv:1912.10099.
[28] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256.
[29] (2017) Second-order optimization for deep reinforcement learning using Kronecker-factored approximation. In NIPS 2017, pp. 5285–5294.