
Reinforcement Learning - A Technical Introduction

Elmar Diederichs

Abstract


Reinforcement learning offers a cognitive-science perspective on behavior and sequential decision making, since RL algorithms introduce a computational notion of agency into the learning problem. It thereby addresses an abstract class of problems that can be characterized as follows: an algorithm confronted with information from an unknown environment must find, step by step, an optimal way to behave based only on sparse, delayed, or noisy feedback from an environment that changes in response to the algorithm's behavior. Reinforcement learning thus offers an abstraction of the problem of goal-directed learning from interaction. The paper gives an opinionated introduction to the advantages and drawbacks of several algorithmic approaches, so that the reader can understand recent developments and open problems in reinforcement learning.
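
To make the interaction loop described above concrete, the following is a minimal sketch of goal-directed learning from interaction: tabular Q-learning on a small chain environment with a single reward at one end, so that feedback is sparse and delayed. The environment (ChainEnv), its size, and the hyperparameters (alpha, gamma, epsilon) are illustrative assumptions and not taken from the paper; the sketch only shows the generic agent-environment loop in which behavior improves stepwise from interaction.

    import random

    class ChainEnv:
        # Hypothetical toy chain of n states; a reward of 1 arrives only at the
        # right end, so feedback is both sparse and delayed.
        def __init__(self, n=6):
            self.n = n
            self.state = 0
        def reset(self):
            self.state = 0
            return self.state
        def step(self, action):  # action 0 = move left, 1 = move right
            if action == 0:
                self.state = max(0, self.state - 1)
            else:
                self.state = min(self.n - 1, self.state + 1)
            done = self.state == self.n - 1
            reward = 1.0 if done else 0.0
            return self.state, reward, done

    env = ChainEnv()
    Q = [[0.0, 0.0] for _ in range(env.n)]   # tabular action-value estimates
    alpha, gamma, epsilon = 0.1, 0.95, 0.1   # step size, discount, exploration rate

    for episode in range(500):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior: mostly exploit current estimates, sometimes explore
            a = random.randrange(2) if random.random() < epsilon else (0 if Q[s][0] >= Q[s][1] else 1)
            s_next, r, done = env.step(a)
            # one-step temporal-difference update from a single interaction
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next

    print([round(max(q), 2) for q in Q])     # value estimates rise toward the rewarding end

After a few hundred episodes the value estimates increase toward the rewarding end of the chain, which illustrates the stepwise improvement from delayed feedback that the abstract describes.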

Keywords


classical reinforcement learning, Markov decision processes, prediction and adaptive control in unknown environments



DOI: https://doi.org/10.32629/jai.v2i2.45



Copyright (c) 2019 Elmar Diederichs

License URL: https://creativecommons.org/licenses/by-nc/4.0