Part 1. Tabular Solution Methods
Part 2. Approximate Solution Methods

Chapter 1. Introduction
1.1 Reinforcement Learning
1.2 Examples
1.3 Elements of RL
1.4 Applications of RL
1.5 Summary
1.6 Questions

Chapter 2. Multi-armed Bandits
2.1 An n-armed bandit problem
2.2 Action-value methods
2.3 Incremental implementation
2.4 Tracking a nonstationary problem
2.5 Optimistic Initial Values
2.6 Upper-Confidence-Bound Action Selection
2.7 Gradient Bandit Algorithms
2.8 Associative Search (Contextual Bandits)
2.9 Summary
2.10 Questions

Chapter 3. Solving Problems with Dynamic Programming
3.1 MDP
3.2 Categorizing RL algorithms
3.3 Dynamic Programming
3.4 Summary
3.5 Questions

Chapter 4. Monte Carlo Methods
4.1 Monte Carlo prediction
4.2 Monte Carlo estimation of action values
4.3 Monte Carlo Control
4.4 Monte Carlo Control without Exploring Starts
4.5 Off-policy Prediction via Importance Sampling
4.6 Incremental Implementation
4.7 Off-policy MC Control
4.8 Discounting-aware Importance Sampling
4.9 Per-decision Importance Sampling
4.10 Summary
4.11 Questions

Chapter 5. Temporal-Difference Learning
5.1 TD Prediction
5.2 Advantages of TD Prediction methods
5.3 Optimality of TD(0)
5.4 SARSA: On-policy TD Control
5.5 Q-learning: Off-policy TD Control
5.6 Expected SARSA
5.7 Maximization Bias and Double Learning
5.8 Summary
5.9 Questions

Chapter 6. n-step Bootstrapping
6.1 n-step TD prediction
6.2 n-step SARSA
6.3 n-step off-policy learning
6.4 Per-decision methods
6.5 n-step Tree Backup algorithm
6.6 Unifying algorithm
6.7 Summary
6.8 Questions

Chapter 7. On-policy Prediction with Approximation
7.1 Value Approximation & Function Approximation
7.2 Prediction Objective (VE)
7.3 Stochastic-gradient and semi-gradient methods
7.4 Linear methods
7.5 Selecting step-size parameters manually
7.6 Approximation of non-linear functions: Artificial Neural Networks
7.7 Least-Squares TD
7.8 Memory-based Function Approximation
7.9 Kernel-based Function Approximation
7.10 Summary
7.11 Questions

Chapter 8. On-policy Control with Approximation
8.1 Episodic semi-gradient control
8.2 Semi-gradient n-step SARSA
8.3 Average reward
8.4 Deprecating the discounted setting
8.5 Differential semi-gradient n-step SARSA
8.6 Summary
8.7 Questions

Chapter 9. Off-policy Methods with Approximation
9.1 Semi-gradient methods
9.2 Examples of off-policy divergence
9.3 The Deadly Triad
9.4 Linear value function geometry
9.5 Gradient descent in the Bellman error
9.6 The Bellman error is not learnable
9.7 Gradient TD methods
9.8 Emphatic TD methods
9.9 Reducing Variance
9.10 Summary
9.11 Questions

Chapter 10. Eligibility Traces
10.1 The λ-return
10.2 TD(λ)
10.3 n-step Truncated λ-return methods
10.4 Redoing Updates: the Online λ-return algorithm
10.5 True Online TD(λ)
10.6 Dutch Traces in MC learning
10.7 SARSA(λ)
10.8 Variable λ and γ
10.9 Off-policy Traces with Control Variates
10.10 Watkins's Q(λ) to Tree-Backup(λ)
10.11 Stable off-policy methods with traces
10.12 Implementation Issues
10.13 Summary
10.14 Questions

Chapter 11. Policy Gradient Methods
11.1 Policy Approximation and its advantages
11.2 The Policy Gradient Theorem
11.3 REINFORCE: MC Policy Gradient
11.4 REINFORCE with Baseline
11.5 Actor-Critic methods
11.6 Policy Gradient for continuing problems
11.7 Policy Parameterization for Continuous Actions
11.8 Summary
11.9 Questions

Chapter 12. Planning and Learning with Tabular Methods
12.1 Models and Planning
12.2 Dyna: Integrated Planning, Acting, and Learning
12.3 When the Model Is Wrong
12.4 Prioritized Sweeping
12.5 Expected vs. Sample Updates
12.6 Trajectory Sampling
12.7 Real-time Dynamic Programming
12.8 Planning at Decision Time
12.9 Heuristic Search
12.10 Rollout Algorithms
12.11 Monte Carlo Tree Search (MCTS)
12.12 Summary
12.13 Questions

Appendix A. Summary of Notation
Glossary