Book · Wiki · Canonical · High evidence score
Reinforcement Learning: An Introduction
Richard S. Sutton, Andrew G. Barto · MIT Press · 2018
The standard textbook introduction to reinforcement learning, covering MDPs, value functions, temporal-difference learning, policy gradients, and core algorithms.
Study · Wiki · Canonical · High confidence
A Survey of Constraint Formulations in Safe Reinforcement Learning
Akifumi Wachi, Xun Shen, Yanan Sui · IJCAI · 2024
A survey of safe RL constraint formulations, representative algorithms, and the relationships among common constrained decision-making criteria.
Study · Wiki · Canonical · High confidence
A Comprehensive Survey on Safe Reinforcement Learning
Javier Garcia, Fernando Fernandez · Journal of Machine Learning Research · 2015
A classic survey of safe reinforcement learning, including risk-sensitive criteria, constrained exploration, safety during learning, and external guidance.
Study · Wiki · Canonical · High confidence
Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes
Junzhe Zhang, Elias Bareinboim · NeurIPS · 2019
Connects causal reinforcement learning with dynamic treatment regimes, focusing on near-optimal sequential treatment policies.
Study · Wiki · Canonical · High confidence
An Introduction to Causal Reinforcement Learning
Elias Bareinboim, Junzhe Zhang, Sanghack Lee · CausalAI Lab Technical Report R-65 · 2024
A tutorial survey that organizes causal reinforcement learning around offline-to-online learning, intervention choice, counterfactual decision-making, transportability, causal discovery, imitation, curriculum learning, reward shaping, and causal game theory.
Study · Wiki · Canonical · High confidence
Markov Decision Processes with Unobserved Confounders: A Causal Approach
Junzhe Zhang, Elias Bareinboim · CausalAI Lab Technical Report R-23 · 2016
Extends causal reasoning to MDPs where hidden variables may affect both actions and outcomes, motivating CRL methods that reason about confounding in sequential settings.
Study · Wiki · Canonical · High confidence
Causal Imitation Learning with Unobserved Confounders
Junzhe Zhang, Daniel Kumor, Elias Bareinboim · NeurIPS · 2020
Formulates imitation learning when expert demonstrations are confounded and rewards may not be directly observed.
Study · Wiki · Canonical · High confidence
Bandits with Unobserved Confounders: A Causal Approach
Elias Bareinboim, Andrew Forney, Judea Pearl · NeurIPS · 2015
Introduces a causal treatment of bandit problems where observational feedback may be confounded, showing when causal structure can improve intervention selection.
Study · Wiki · Canonical · High confidence
Off-Policy Policy Evaluation for Sequential Decisions under Unobserved Confounding
Hongseok Namkoong, Ramtin Keramati, Steve Yadlowsky +1 more · arXiv · 2020
Studies off-policy evaluation for sequential decisions when hidden confounders may bias logged trajectories.
Study · Preprint · Wiki · Canonical · Moderate
Conservative Q-Learning for Offline Reinforcement Learning
Aviral Kumar, Aurick Zhou, George Tucker +1 more · 2020
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
Study · Preprint · Wiki · Canonical · Moderate
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
Sergey Levine, Aviral Kumar, George Tucker +1 more · 2020
In this tutorial article, we aim to provide the reader with the conceptual tools needed to get started on research on offline reinforcement learning algorithms: reinforcement learning algorithms that utilize previously collected data, without additional online data collection. Offline reinforcement learning algorithms hold tremendous promise for making it possible to turn large datasets into powerful decision making engines. Effective offline reinforcement learning methods would be able to extract policies with the maximum possible utility out of the available data, thereby allowing automation of a wide range of decision-making domains, from healthcare and education to robotics. However, the limitations of current algorithms make this difficult. We will aim to provide the reader with an understanding of these challenges, particularly in the context of modern deep reinforcement learning methods, and describe some potential solutions that have been explored in recent work to mitigate these challenges, along with recent applications, and a discussion of perspectives on open problems in the field.
Study · Preprint · Wiki · Canonical · Moderate
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal +2 more · 2017
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
Study · Preprint · Wiki · Canonical · Moderate
Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver +4 more · 2013
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
Study · Preprint · Wiki · Canonical · Moderate
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel +1 more · 2018
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
Study · Wiki · High confidence
Counterfactual Data-Fusion for Online Reinforcement Learners
Andrew Forney, Judea Pearl, Elias Bareinboim · ICML · 2017
Studies how online learners can combine heterogeneous observational and experimental data sources using counterfactual data-fusion principles.
Study · Wiki · High confidence
Sequential Causal Imitation Learning with Unobserved Confounders
Daniel Kumor, Junzhe Zhang, Elias Bareinboim · NeurIPS · 2021
Extends causal imitation learning to sequential settings where confounding can persist across time.
Study · Wiki · High confidence
Characterizing Optimal Mixed Policies: Where to Intervene, What to Observe
Sanghack Lee, Elias Bareinboim · NeurIPS · 2020
Characterizes policies that mix interventions and observations in causal decision problems.
Meta-analysis · High evidence score
Mapping anhedonia onto reinforcement learning: a behavioural meta-analysis
Quentin J. M. Huys, Diego A. Pizzagalli, Ryan Bogdan +1 more · Biology of Mood & Anxiety Disorders · 2013 · 492 citations
Study · Moderate
Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors
Auke Jan Ijspeert, Jun Nakanishi, H. Hoffmann +2 more · Neural Computation · 2012 · 1,568 citations
Nonlinear dynamical systems have been used in many disciplines to model complex behaviors, including biological motor control, robotics, perception, economics, traffic prediction, and neuroscience. While often the unexpected emergent behavior of nonlinear systems is the focus of investigations, it is of equal importance to create goal-directed behavior (e.g., stable locomotion from a system of coupled oscillators under perceptual guidance). Modeling goal-directed behavior with nonlinear systems is, however, rather difficult due to the parameter sensitivity of these systems, their complex phase transitions in response to subtle parameter changes, and the difficulty of analyzing and predicting their long-term behavior; intuition and time-consuming parameter tuning play a major role. This letter presents and reviews dynamical movement primitives, a line of research for modeling attractor behaviors of autonomous nonlinear dynamical systems with the help of statistical learning techniques. The essence of our approach is to start with a simple dynamical system, such as a set of linear differential equations, and transform those into a weakly nonlinear system with prescribed attractor dynamics by means of a learnable autonomous forcing term. Both point attractors and limit cycle attractors of almost arbitrary complexity can be generated. We explain the design principle of our approach and evaluate its properties in several example applications in motor control and robotics.
Study · Moderate
SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient
Lantao Yu, Weinan Zhang, Jun Wang +1 more · Proceedings of the AAAI Conference on Artificial Intelligence · 2017 · 2,290 citations
As a new way of training generative models, Generative Adversarial Net (GAN) that uses a discriminative model to guide the training of the generative model has enjoyed considerable success in generating real-valued data. However, it has limitations when the goal is for generating sequences of discrete tokens. A major reason lies in that the discrete outputs from the generative model make it difficult to pass the gradient update from the discriminative model to the generative model. Also, the discriminative model can only assess a complete sequence, while for a partially generated sequence, it is non-trivial to balance its current score and the future one once the entire sequence has been generated. In this paper, we propose a sequence generation framework, called SeqGAN, to solve the problems. Modeling the data generator as a stochastic policy in reinforcement learning (RL), SeqGAN bypasses the generator differentiation problem by directly performing gradient policy update. The RL reward signal comes from the GAN discriminator judged on a complete sequence, and is passed back to the intermediate state-action steps using Monte Carlo search. Extensive experiments on synthetic data and real-world tasks demonstrate significant improvements over strong baselines.
Study · Moderate
Toward Causal Representation Learning
Bernhard Schölkopf, Francesco Locatello, Stefan Bauer +4 more · Proceedings of the IEEE · 2021 · 998 citations
The two fields of machine learning and graphical causality arose and are developed separately. However, there is, now, cross-pollination and increasing interest in both fields to benefit from the advances of the other. In this article, we review fundamental concepts of causal inference and relate them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research. This also applies in the opposite direction: we note that most work in causality starts from the premise that the causal variables are given. A central problem for AI and causality is, thus, causal representation learning, that is, the discovery of high-level causal variables from low-level observations. Finally, we delineate some implications of causality for machine learning and propose key research areas at the intersection of both communities.
Study · Moderate
Reinforcement Learning: A Survey
Leslie Pack Kaelbling, Michael L. Littman, Andrew Moore · Journal of Artificial Intelligence Research · 1996 · 8,787 citations
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word “reinforcement.” The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
Study · Moderate
Interpretable machine learning: Fundamental principles and 10 grand challenges
Cynthia Rudin, Chaofan Chen, Zhi Chen +3 more · Statistics Surveys · 2022 · 828 citations
Interpretability in machine learning (ML) is crucial for high stakes decisions and troubleshooting. In this work, we provide fundamental principles for interpretable ML, and dispel common misunderstandings that dilute the importance of this crucial topic. We also identify 10 technical challenge areas in interpretable machine learning and provide history and background on each problem. Some of these problems are classically important, and some are recent problems that have arisen in the last few years. These problems are: (1) Optimizing sparse logical models such as decision trees; (2) Optimization of scoring systems; (3) Placing constraints into generalized additive models to encourage sparsity and better interpretability; (4) Modern case-based reasoning, including neural networks and matching for causal inference; (5) Complete supervised disentanglement of neural networks; (6) Complete or even partial unsupervised disentanglement of neural networks; (7) Dimensionality reduction for data visualization; (8) Machine learning models that can incorporate physics and other generative or causal constraints; (9) Characterization of the “Rashomon set” of good models; and (10) Interpretable reinforcement learning. This survey is suitable as a starting point for statisticians and computer scientists interested in working in interpretable machine learning.
Study · Moderate
Deep Reinforcement Learning That Matters
Peter Henderson, Riashat Islam, Philip Bachman +3 more · Proceedings of the AAAI Conference on Artificial Intelligence · 2018 · 1,487 citations
In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.
Study · Moderate
Transfer Learning in Deep Reinforcement Learning: A Survey
Zhuangdi Zhu, Kaixiang Lin, Anil K. Jain +1 more · IEEE Transactions on Pattern Analysis and Machine Intelligence · 2023 · 669 citations
Reinforcement learning is a learning paradigm for solving sequential decision-making problems. Recent years have witnessed remarkable progress in reinforcement learning upon the fast development of deep neural networks. Along with the promising prospects of reinforcement learning in numerous domains such as robotics and game-playing, transfer learning has arisen to tackle various challenges faced by reinforcement learning, by transferring knowledge from external expertise to facilitate the efficiency and effectiveness of the learning process. In this survey, we systematically investigate the recent progress of transfer learning approaches in the context of deep reinforcement learning. Specifically, we provide a framework for categorizing the state-of-the-art transfer learning approaches, under which we analyze their goals, methodologies, compatible reinforcement learning backbones, and practical applications. We also draw connections between transfer learning and other relevant topics from the reinforcement learning perspective and explore their potential challenges that await future research progress.
Study · Moderate
Safe Learning in Robotics: From Learning-Based Control to Safe Reinforcement Learning
Lukas Brunke, Melissa Greeff, Adam W. Hall +4 more · Annual Review of Control, Robotics, and Autonomous Systems · 2022 · 648 citations
The last half decade has seen a steep rise in the number of contributions on safe learning methods for real-world robotic deployments from both the control and reinforcement learning communities. This article provides a concise but holistic review of the recent advances made in using machine learning to achieve safe decision-making under uncertainties, with a focus on unifying the language and frameworks used in control theory and reinforcement learning research. It includes learning-based control approaches that safely improve performance by learning the uncertain dynamics, reinforcement learning approaches that encourage safety or robustness, and methods that can formally certify the safety of a learned control policy. As data- and learning-based robot control methods continue to gain traction, researchers must understand when and how to best leverage them in real-world scenarios where safety is imperative, such as when operating in close proximity to humans. We highlight some of the open challenges that will drive the field of robot learning in the coming years, and emphasize the need for realistic physics-based benchmarks to facilitate fair comparisons between control and reinforcement learning approaches.
Study · Moderate
An Introduction to Deep Reinforcement Learning
Vincent François-Lavet, Peter Henderson, Riashat Islam +2 more · Foundations and Trends® in Machine Learning · 2018 · 1,241 citations
Deep reinforcement learning is the combination of reinforcement learning (RL) and deep learning. This field of research has been able to solve a wide range of complex decision making tasks that were previously out of reach for a machine. Thus, deep RL opens up many new applications in domains such as healthcare, robotics, smart grids, finance, and many more. This manuscript provides an introduction to deep reinforcement learning models, algorithms and techniques. Particular focus is on the aspects related to generalization and how deep RL can be used for practical applications. We assume the reader is familiar with basic machine learning concepts.
Study · Moderate
Model-based Reinforcement Learning: A Survey
Thomas M. Moerland, Joost Broekens, Aske Plaat +1 more · Foundations and Trends® in Machine Learning · 2023 · 481 citations
Sequential decision making, commonly formalized as Markov Decision Process (MDP) optimization, is an important challenge in artificial intelligence. Two key approaches to this problem are reinforcement learning (RL) and planning. This survey is an integration of both fields, better known as model-based reinforcement learning. Model-based RL has two main steps. First, we systematically cover approaches to dynamics model learning, including challenges like dealing with stochasticity, uncertainty, partial observability, and temporal abstraction. Second, we present a systematic categorization of planning-learning integration, including aspects like: where to start planning, what budgets to allocate to planning and real data collection, how to plan, and how to integrate planning in the learning and acting loop. After these two sections, we also discuss implicit model-based RL as an end-to-end alternative for model learning and planning, and we cover the potential benefits of model-based RL. Along the way, the survey also draws connections to several related RL fields, like hierarchical RL and transfer learning. Altogether, the survey presents a broad conceptual overview of the combination of planning and learning for MDP optimization.
Study · Moderate
Challenges of real-world reinforcement learning: definitions, benchmarks and analysis
Gabriel Dulac-Arnold, Nir Levine, Daniel J. Mankowitz +4 more · Machine Learning · 2021 · 557 citations
Study · Moderate
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Z. Zhao, Vikash Kumar, Sergey Levine +1 more · 2023 · 434 citations
Figure caption: ALOHA: A Low-cost Open-source Hardware System for Bimanual Teleoperation. The whole system costs <$20k with off-the-shelf robots and 3D printed components. Left: The user teleoperates by backdriving the leader robots, with the follower robots mirroring the motion. Right: ALOHA is capable of precise, contact-rich, and dynamic tasks. We show examples of both teleoperated and learned skills. Abstract: Fine manipulation tasks, such as threading cable ties or slotting a battery, are notoriously difficult for robots because they require precision, careful coordination of contact forces, and closed-loop visual feedback. Performing these tasks typically requires high-end robots, accurate sensors, or careful calibration, which can be expensive and difficult to set up. Can learning enable low-cost and imprecise hardware to perform these fine manipulation tasks? We present a low-cost system that performs end-to-end imitation learning directly from real demonstrations, collected with a custom teleoperation interface. Imitation learning, however, presents its own challenges, particularly in high-precision domains: errors in the policy can compound over time, and human demonstrations can be non-stationary. To address these challenges, we develop a simple yet novel algorithm, Action Chunking with Transformers (ACT), which learns a generative model over action sequences. ACT allows the robot to learn 6 difficult tasks in the real world, such as opening a translucent condiment cup and slotting a battery with 80-90% success, with only 10 minutes worth of demonstrations. Project website: tonyzhaozh.github.io/aloha
Study · Moderate
Deep Q-learning From Demonstrations
Todd Hester, Matej Vecerík, Olivier Pietquin +11 more · Proceedings of the AAAI Conference on Artificial Intelligence · 2018 · 805 citations
Deep reinforcement learning (RL) has achieved several high profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator’s actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN) as it starts with better scores on the first million steps on 41 of 42 games and on average it takes PDD DQN 83 million steps to catch up to DQfD’s performance. DQfD learns to out-perform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.
Study · Moderate
Multi-agent deep reinforcement learning: a survey
Sven Gronauer, Klaus Diepold · Artificial Intelligence Review · 2021 · 804 citations
The advances in reinforcement learning have recorded sublime success in various domains. Although the multi-agent domain has been overshadowed by its single-agent counterpart during this progress, multi-agent reinforcement learning gains rapid traction, and the latest accomplishments address problems with real-world complexity. This article provides an overview of the current developments in the field of multi-agent deep reinforcement learning. We focus primarily on literature from recent years that combines deep reinforcement learning methods with a multi-agent scenario. To survey the works that constitute the contemporary landscape, the main contents are divided into three parts. First, we analyze the structure of training schemes that are applied to train multiple agents. Second, we consider the emergent patterns of agent behavior in cooperative, competitive and mixed scenarios. Third, we systematically enumerate challenges that exclusively arise in the multi-agent domain and review methods that are leveraged to cope with these challenges. To conclude this survey, we discuss advances, identify trends, and outline possible directions for future work in this research area.
Study · Moderate
Multi-Agent Deep Reinforcement Learning for Task Offloading in UAV-Assisted Mobile Edge Computing
Nan Zhao, Zhiyang Ye, Yiyang Pei +2 more · IEEE Transactions on Wireless Communications · 2022 · 394 citations
Mobile edge computing can effectively reduce service latency and improve service quality by offloading computation-intensive tasks to the edges of wireless networks. Due to the characteristic of flexible deployment, wide coverage and reliable wireless communication, unmanned aerial vehicles (UAVs) have been employed as assisted edge clouds (ECs) for large-scale sparsely-distributed user equipment. Considering the limited computation and energy capacities of UAVs, a collaborative mobile edge computing system with multiple UAVs and multiple ECs is investigated in this paper. The task offloading issue is addressed to minimize the sum of execution delays and energy consumptions by jointly designing the trajectories, computation task allocation, and communication resource management of UAVs. Moreover, to solve the above non-convex optimization problem, a Markov decision process is formulated for the multi-UAV assisted mobile edge computing system. To obtain the joint strategy of trajectory design, task allocation, and power management, a cooperative multi-agent deep reinforcement learning framework is investigated. Considering the high-dimensional continuous action space, the twin delayed deep deterministic policy gradient algorithm is exploited. The evaluation results demonstrate that our multi-UAV multi-EC task offloading method can achieve better performance compared with the other optimization approaches.
Study · Moderate
Multi-Agent Deep Reinforcement Learning-Based Trajectory Planning for Multi-UAV Assisted Mobile Edge Computing
Liang Wang, Kezhi Wang, Cunhua Pan +3 more · IEEE Transactions on Cognitive Communications and Networking · 2020 · 461 citations
An unmanned aerial vehicle (UAV)-aided mobile edge computing (MEC) framework is proposed, where several UAVs having different trajectories fly over the target area and support the user equipments (UEs) on the ground. We aim to jointly optimize the geographical fairness among all the UEs, the fairness of each UAV's UE-load and the overall energy consumption of UEs. The above optimization problem includes both integer and continuous variables and it is challenging to solve. To address the above problem, a multi-agent deep reinforcement learning based trajectory control algorithm is proposed for managing the trajectory of each UAV independently, where the popular Multi-Agent Deep Deterministic Policy Gradient (MADDPG) method is applied. Given the UAVs' trajectories, a low-complexity approach is introduced for optimizing the offloading decisions of UEs. We show that our proposed solution has considerable performance over other traditional algorithms, both in terms of the fairness for serving UEs, fairness of UE-load at each UAV and energy consumption for all the UEs.
Study · Moderate
A practical guide to multi-objective reinforcement learning and planning
Conor F. Hayes, Roxana Rădulescu, Eugenio Bargiacchi +15 more · VUBIR (Vrije Universiteit Brussel) · 2022 · 320 citations
Real-world sequential decision-making tasks are generally complex, requiring trade-offs between multiple, often conflicting, objectives. Despite this, the majority of research in reinforcement learning and decision-theoretic planning either assumes only a single objective, or that multiple objectives can be adequately handled via a simple linear combination. Such approaches may oversimplify the underlying problem and hence produce suboptimal results. This paper serves as a guide to the application of multi-objective methods to difficult problems, and is aimed at researchers who are already familiar with single-objective reinforcement learning and planning methods who wish to adopt a multi-objective perspective on their research, as well as practitioners who encounter multi-objective decision problems in practice. It identifies the factors that may influence the nature of the desired solution, and illustrates by example how these influence the design of multi-objective decision-making systems for complex problems.
StudyModerate
On-Line Building Energy Optimization Using Deep Reinforcement Learning
Elena Mocanu, Decebal Constantin Mocanu, Phuong H. Nguyen +4 more · IEEE Transactions on Smart Grid · 2018 · 617 citations
Unprecedentedly high volumes of data are becoming available with the growth of the advanced metering infrastructure. These are expected to benefit planning and operation of the future power systems and to help customers transition from a passive to an active role. In this paper, we explore for the first time in the smart grid context the benefits of using deep reinforcement learning, a hybrid type of methods that combines reinforcement learning with deep learning, to perform on-line optimization of schedules for building energy management systems. The learning procedure was explored using two methods, Deep Q-learning and deep policy gradient, both of which have been extended to perform multiple actions simultaneously. The proposed approach was validated on the large-scale Pecan Street Inc. database. This high-dimensional database includes information about photovoltaic power generation, electric vehicles and building appliances. Moreover, these on-line energy scheduling strategies could be used to provide real-time feedback to consumers to encourage more efficient use of electricity.
StudyModerate
A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems
Rafael Figueiredo Prudencio, Marcos R. O. A. Máximo, Esther Luna Colombini · IEEE Transactions on Neural Networks and Learning Systems · 2023 · 284 citations
With the widespread adoption of deep learning, reinforcement learning (RL) has experienced a dramatic increase in popularity, scaling to previously intractable problems, such as playing complex games from pixel observations, sustaining conversations with humans, and controlling robotic agents. However, there is still a wide range of domains inaccessible to RL due to the high cost and danger of interacting with the environment. Offline RL is a paradigm that learns exclusively from static datasets of previously collected interactions, making it feasible to extract policies from large and diverse training datasets. Effective offline RL algorithms have a much wider range of applications than online RL, being particularly appealing for real-world applications, such as education, healthcare, and robotics. In this work, we contribute with a unifying taxonomy to classify offline RL methods. Furthermore, we provide a comprehensive review of the latest algorithmic breakthroughs in the field using a unified notation as well as a review of existing benchmarks' properties and shortcomings. Additionally, we provide a figure that summarizes the performance of each method and class of methods on different dataset properties, equipping researchers with the tools to decide which type of algorithm is best suited for the problem at hand and identify which classes of algorithms look the most promising. Finally, we provide our perspective on open problems and propose future research directions for this rapidly growing field.
StudyModerate
Rule-interposing deep reinforcement learning based energy management strategy for power-split hybrid electric vehicle
Renzong Lian, Jiankun Peng, Yuankai Wu +2 more · Energy · 2020 · 342 citations
StudyModerate
Applications of reinforcement learning in energy systems
A.T.D. Perera, Parameswaran Kamalaruban · Renewable and Sustainable Energy Reviews · 2020 · 384 citations
Energy systems undergo major transitions to facilitate the large-scale penetration of renewable energy technologies and improve efficiencies, leading to the integration of many sectors into the energy system domain. As the complexities in this domain increase, it becomes challenging to control energy flows using existing techniques based on physical models. Moreover, although data-driven models, such as reinforcement learning (RL), have gained considerable attention in many fields, a direct shift into RL is not feasible in the energy domain irrespective of the ongoing complexities. To this end, a top-down approach is used to understand this behavior by reviewing the current state of the art. We classified RL papers in the literature into seven categories based on their area of application. Subsequently, publications under each category were further examined relative to problem diversity, RL technique employed, performance improvement (compared with other white and gray box models), verification, and reproducibility; many of the articles reported a 10–20% performance improvement with the use of RL. In most studies, however, deep learning techniques and state-of-the-art actor-critic methods (e.g., twin delayed deep deterministic policy gradient and soft actor-critic) were not applied. This has remarkably hindered performance improvements and problems related to complex energy flows have not been considered. Approximately half of the publications reported the use of Q-learning. Furthermore, despite the availability of historical data in the energy system domain, batch RL algorithms have not been exploited. Emerging multi-agent RL applications may be considered as a positive development that can enable the management of complex interactions among multiple parties. Most studies lack proper benchmarking compared to model-based approaches or gray-box models, and a majority cover energy dispatch problems and building energy management. 
Although RL can adequately solve problems that are considerably integrated in several sectors, only a limited number of publications have discussed its broad application. The present study clearly demonstrates that even without the full utilization of RL capacity, this technique has a considerable potential in resolving the continuously increasing complexity within the energy system domain.
StudyModerate
A DRL Agent for Jointly Optimizing Computation Offloading and Resource Allocation in MEC
Juan Chen, Huanlai Xing, Zhiwen Xiao +2 more · IEEE Internet of Things Journal · 2021 · 266 citations
This article studies the joint optimization problem of computation offloading and resource allocation (JCORA) in mobile-edge computing (MEC). Deep reinforcement learning (DRL) is one of the ideal techniques for addressing the dynamic JCORA problem. However, it is still challenging to adapt traditional DRL methods for the problem since they usually lead to slow and unstable convergence in model training. To this end, we propose a temporal attentional deterministic policy gradient (TADPG) to tackle JCORA. Based on the deep deterministic policy gradient (DDPG), TADPG has two significant features. First, a temporal feature extraction network consisting of a 1-D convolution (Conv1D) residual block and an attentional long short-term memory (LSTM) network is designed, which is beneficial to high-quality state representation and function approximation. Second, a rank-based prioritized experience replay (rPER) method is devised to accelerate and stabilize the convergence of model training. Experimental results demonstrate that the decentralized TADPG-based mechanism can achieve more efficient JCORA performance than the centralized one, and the proposed TADPG outperforms a number of state-of-the-art DRL agents in terms of the task completion time and energy consumption.
StudyModerate
Joint Optimization of Multi-UAV Target Assignment and Path Planning Based on Multi-Agent Reinforcement Learning
Han Qie, Dianxi Shi, Tianlong Shen +3 more · IEEE Access · 2019 · 308 citations
One of the major research topics in unmanned aerial vehicle (UAV) collaborative control systems is the problem of multi-UAV target assignment and path planning (MUTAPP). It is a complicated optimization problem in which target assignment and path planning are solved separately. However, recalculation of the optimal results is too slow for real-time operations in dynamic environments because of the large number of calculations required. In this paper, we propose an artificial intelligence method named simultaneous target assignment and path planning (STAPP) based on a multi-agent deep deterministic policy gradient (MADDPG) algorithm, which is a type of multi-agent reinforcement learning algorithm. In STAPP, the MUTAPP problem is first constructed as a multi-agent system. Then, the MADDPG framework is used to train the system to solve target assignment and path planning simultaneously according to a corresponding reward structure. The proposed system can deal with dynamic environments effectively as its execution only requires the locations of the UAVs, targets, and threat areas. Real-time performance can be guaranteed as the neural network used in the system is simple. In addition, we develop a technique to improve the training effect and use experiments to demonstrate the effectiveness of our method.
StudyModerate
Reinforcement learning for control: Performance, stability, and deep approximators
Lucian Buşoniu, Tim de Bruin, Domagoj Tolić +2 more · Annual Reviews in Control · 2018 · 447 citations
StudyModerate
A Comprehensive Survey of Multiagent Reinforcement Learning
Lucian Buşoniu, Robert Babuška, Bart De Schutter · IEEE Transactions on Systems Man and Cybernetics Part C (Applications and Reviews) · 2008 · 2,169 citations
Multiagent systems are rapidly finding applications in a variety of domains, including robotics, distributed control, telecommunications, and economics. The complexity of many tasks arising in these domains makes them difficult to solve with preprogrammed agent behaviors. The agents must, instead, discover a solution on their own, using learning. A significant part of the research on multiagent learning concerns reinforcement learning techniques. This paper provides a comprehensive survey of multiagent reinforcement learning (MARL). A central issue in the field is the formal statement of the multiagent learning goal. Different viewpoints on this issue have led to the proposal of many different goals, among which two focal points can be distinguished: stability of the agents' learning dynamics, and adaptation to the changing behavior of the other agents. The MARL algorithms described in the literature aim, either explicitly or implicitly, at one of these two goals or at a combination of both, in a fully cooperative, fully competitive, or more general setting. A representative selection of these algorithms is discussed in detail in this paper, together with the specific issues that arise in each category. Additionally, the benefits and challenges of MARL are described along with some of the problem domains where the MARL techniques have been applied. Finally, an outlook for the field is provided.
StudyModerate
Toward an Integration of Deep Learning and Neuroscience
Adam Marblestone, Greg Wayne, Konrad P. Körding · arXiv (Cornell University) · 2016 · 688 citations
Neuroscience has focused on the detailed implementation of computation, studying neural codes, dynamics and circuits. In machine learning, however, artificial neural networks tend to eschew precisely designed codes, dynamics or circuits in favor of brute force optimization of a cost function, often using simple and relatively uniform initial architectures. Two recent developments have emerged within machine learning that create an opportunity to connect these seemingly divergent perspectives. First, structured architectures are used, including dedicated systems for attention, recursion and various forms of short- and long-term memory storage. Second, cost functions and training procedures have become more complex and are varied across layers and over time. Here we think about the brain in terms of these ideas. We hypothesize that (1) the brain optimizes cost functions, (2) the cost functions are diverse and differ across brain locations and over development, and (3) optimization operates within a pre-structured architecture matched to the computational problems posed by behavior. In support of these hypotheses, we argue that a range of implementations of credit assignment through multiple layers of neurons are compatible with our current knowledge of neural circuitry, and that the brain's specialized systems can be interpreted as enabling efficient optimization for specific problem classes. Such a heterogeneously optimized system, enabled by a series of interacting cost functions, serves to make learning data-efficient and precisely targeted to the needs of the organism. We suggest directions by which neuroscience could seek to refine and test these hypotheses.
StudyModerate
Experience-driven Networking: A Deep Reinforcement Learning based Approach
Zhiyuan Xu, Jian Tang, Jingsong Meng +4 more · 2018 · 415 citations
Modern communication networks have become very complicated and highly dynamic, which makes them hard to model, predict and control. In this paper, we develop a novel experience-driven approach that can learn to control a communication network well from its own experience rather than an accurate mathematical model, just as a human learns a new skill (such as driving, swimming, etc.). Specifically, we, for the first time, propose to leverage emerging Deep Reinforcement Learning (DRL) for enabling model-free control in communication networks; and present a novel and highly effective DRL-based control framework, DRL-TE, for a fundamental networking problem: Traffic Engineering (TE). The proposed framework maximizes a widely-used utility function by jointly learning the network environment and its dynamics, and making decisions under the guidance of powerful Deep Neural Networks (DNNs). We propose two new techniques, TE-aware exploration and actor-critic-based prioritized experience replay, to optimize the general DRL framework particularly for TE. To validate and evaluate the proposed framework, we implemented it in ns-3, and tested it comprehensively with both representative and randomly generated network topologies. Extensive packet-level simulation results show that 1) compared to several widely-used baseline methods, DRL-TE significantly reduces end-to-end delay and consistently improves the network utility, while offering better or comparable throughput; 2) DRL-TE is robust to network changes; and 3) DRL-TE consistently outperforms a state-of-the-art DRL method (for continuous control), Deep Deterministic Policy Gradient (DDPG), which, however, does not offer satisfying performance.
StudyModerate
Deep Reinforcement Learning for Strategic Bidding in Electricity Markets
Yujian Ye, Dawei Qiu, Mingyang Sun +2 more · IEEE Transactions on Smart Grid · 2019 · 278 citations
Bi-level optimization and reinforcement learning (RL) constitute the state-of-the-art frameworks for modeling strategic bidding decisions in deregulated electricity markets. However, the former neglects the market participants' physical non-convex operating characteristics, while conventional RL methods require discretization of state and/or action spaces and thus suffer from the curse of dimensionality. This paper proposes a novel deep reinforcement learning (DRL) based methodology, combining a deep deterministic policy gradient (DDPG) method with a prioritized experience replay (PER) strategy. This approach sets up the problem in multi-dimensional continuous state and action spaces, enabling market participants to receive accurate feedback regarding the impact of their bidding decisions on the market clearing outcome, and devise more profitable bidding decisions by exploiting the entire action domain, also accounting for the effect of non-convex operating characteristics. Case studies demonstrate that the proposed methodology achieves a significantly higher profit than the alternative state-of-the-art methods, and exhibits a more favourable computational performance than benchmark RL methods due to the employment of the PER strategy.
RCTHigh evidence score
Personalized decision making for coronary artery disease treatment using offline reinforcement learning
Peyman Ghasemi, Matthew Greenberg, Danielle A. Southern +3 more · npj Digital Medicine · 2025 · 9 citations
Choosing optimal revascularization strategies for patients with obstructive coronary artery disease (CAD) remains a clinical challenge. While randomized controlled trials offer population-level insights, gaps remain regarding personalized decision-making for individual patients. We applied off-policy reinforcement learning (RL) to a composite data model from 41,328 unique patients with angiography-confirmed obstructive CAD. In an offline setting, we estimated optimal treatment policies and evaluated these policies using weighted importance sampling. Our findings indicate that RL-guided therapy decisions outperformed physician-based decision making, with RL policies achieving up to 32% improvement in expected rewards based on composite major cardiovascular events outcomes. Additionally, we introduced methods to ensure that RL CAD treatment policies remain compatible with locally achievable clinical practice models, presenting an interpretable RL policy with a limited number of states. Overall, this novel RL-based clinical decision support tool, RL4CAD, demonstrates potential to optimize care in patients with obstructive CAD referred for invasive coronary angiography.
StudyModerate
Human-in-the-Loop Reinforcement Learning: A Survey and Position on Requirements, Challenges, and Opportunities
Carl Orge Retzlaff, Srijita Das, Christabel Wayllace +7 more · Journal of Artificial Intelligence Research · 2024 · 129 citations
Artificial intelligence (AI) and especially reinforcement learning (RL) have the potential to enable agents to learn and perform tasks autonomously with superhuman performance. However, we consider RL as fundamentally a Human-in-the-Loop (HITL) paradigm, even when an agent eventually performs its task autonomously. In cases where the reward function is challenging or impossible to define, HITL approaches are considered particularly advantageous. The application of Reinforcement Learning from Human Feedback (RLHF) in systems such as ChatGPT demonstrates the effectiveness of optimizing for user experience and integrating their feedback into the training loop. In HITL RL, human input is integrated during the agent’s learning process, allowing iterative updates and fine-tuning based on human feedback, thus enhancing the agent’s performance. Since the human is an essential part of this process, we argue that human-centric approaches are the key to successful RL, a fact that has not been adequately considered in the existing literature. This paper aims to inform readers about current explainability methods in HITL RL. It also shows how the application of explainable AI (xAI) and specific improvements to existing explainability approaches can enable a better human-agent interaction in HITL RL for all types of users, whether for lay people, domain experts, or machine learning specialists. Accounting for the workflow in HITL RL and based on software and machine learning methodologies, this article identifies four phases for human involvement for creating HITL RL systems: (1) Agent Development, (2) Agent Learning, (3) Agent Evaluation, and (4) Agent Deployment. We highlight human involvement, explanation requirements, new challenges, and goals for each phase. We furthermore identify low-risk, high-return opportunities for explainability research in HITL RL and present long-term research goals to advance the field. 
Finally, we propose a vision of human-robot collaboration that allows both parties to reach their full potential and cooperate effectively.
StudyModerate
Computational mechanisms of curiosity and goal-directed exploration
Philipp Schwartenbeck, Johannes Passecker, Tobias U. Hauser +3 more · eLife · 2019 · 260 citations
Successful behaviour depends on the right balance between maximising reward and soliciting information about the world. Here, we show how different types of information-gain emerge when casting behaviour as surprise minimisation. We present two distinct mechanisms for goal-directed exploration that express separable profiles of active sampling to reduce uncertainty. 'Hidden state' exploration motivates agents to sample unambiguous observations to accurately infer the (hidden) state of the world. Conversely, 'model parameter' exploration, compels agents to sample outcomes associated with high uncertainty, if they are informative for their representation of the task structure. We illustrate the emergence of these types of information-gain, termed active inference and active learning, and show how these forms of exploration induce distinct patterns of 'Bayes-optimal' behaviour. Our findings provide a computational framework for understanding how distinct levels of uncertainty systematically affect the exploration-exploitation trade-off in decision-making.
StudyModerate
Recent advances in reinforcement learning in finance
Ben Hambly, Renyuan Xu, Huining Yang · Mathematical Finance · 2023 · 173 citations
The rapid changes in the finance industry due to the increasing amount of data have revolutionized the techniques on data processing and data analysis and brought new theoretical and computational challenges. In contrast to classical stochastic control theory and other analytical approaches for solving financial decision‐making problems that heavily rely on model assumptions, new developments from reinforcement learning (RL) are able to make full use of the large amount of financial data with fewer model assumptions and to improve decisions in complex financial environments. This survey paper aims to review the recent developments and use of RL approaches in finance. We give an introduction to Markov decision processes, which is the setting for many of the commonly used RL approaches. Various algorithms are then introduced with a focus on value‐ and policy‐based methods that do not require any model assumptions. Connections are made with neural networks to extend the framework to encompass deep RL algorithms. We then discuss in detail the application of these RL algorithms in a variety of decision‐making problems in finance, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo‐advising. Our survey concludes by pointing out a few possible future directions for research.