Study of Sutton and Barto's Reinforcement Learning: An Introduction.
Performance of bandit algorithms:
Gradient bandits and greedy bandits with optimistic initial value estimates generally perform best. Gradient bandits take longer to converge but can reach higher performance.
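
For reference, here is a minimal sketch of the two methods compared above, using only the Python standard library. The function names, the Gaussian reward model (unit variance), and the parameter values (`q_init=5.0`, `step_size=0.1`, `alpha=0.1`) are assumptions chosen for illustration, not the settings behind the observation above. Each function returns the fraction of steps on which the best arm was picked.

```python
import math
import random


def optimistic_greedy(true_means, steps, rng, q_init=5.0, step_size=0.1):
    """Greedy selection with optimistic initial value estimates (illustrative)."""
    k = len(true_means)
    q = [q_init] * k  # estimates start high, so every arm gets tried early
    best = max(range(k), key=lambda a: true_means[a])
    optimal_picks = 0
    for _ in range(steps):
        action = max(range(k), key=lambda a: q[a])  # purely greedy choice
        reward = rng.gauss(true_means[action], 1.0)
        q[action] += step_size * (reward - q[action])  # constant step size
        optimal_picks += action == best
    return optimal_picks / steps


def gradient_bandit(true_means, steps, rng, alpha=0.1):
    """Gradient bandit: softmax over preferences with a reward baseline."""
    k = len(true_means)
    h = [0.0] * k  # action preferences
    baseline = 0.0
    best = max(range(k), key=lambda a: true_means[a])
    optimal_picks = 0
    for t in range(1, steps + 1):
        # Softmax policy, shifted by max(h) for numerical stability.
        m = max(h)
        exp_h = [math.exp(x - m) for x in h]
        total = sum(exp_h)
        pi = [e / total for e in exp_h]
        action = rng.choices(range(k), weights=pi)[0]
        reward = rng.gauss(true_means[action], 1.0)
        # Running average of rewards; in this sketch it includes the current one.
        baseline += (reward - baseline) / t
        for a in range(k):
            indicator = 1.0 if a == action else 0.0
            h[a] += alpha * (reward - baseline) * (indicator - pi[a])
        optimal_picks += action == best
    return optimal_picks / steps


if __name__ == "__main__":
    rng = random.Random(0)
    means = [rng.gauss(0.0, 1.0) for _ in range(10)]  # a 10-armed problem
    print("optimistic greedy:", optimistic_greedy(means, 1000, rng))
    print("gradient bandit:  ", gradient_bandit(means, 1000, rng))
```

A single run like this is noisy; a fair comparison would average over many independently sampled problems and sweep the parameters of each method.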