We study the K-armed dueling bandit problem, a variation of the classical Multi-Armed Bandit (MAB) problem in which the learner receives only relative feedback about the selected pairs of arms. We propose an efficient algorithm called Relative Exponential-weight algorithm for Exploration and Exploitation (REX3) to handle the adversarial utility-based formulation of this problem. We prove a finite-time expected regret upper bound of order O(√(K ln(K) T)) for this algorithm and a general lower bound of order Ω(√(KT)). Finally, we provide experimental results using real data from information retrieval applications.
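To make the setting concrete, the following is a minimal illustrative sketch of an Exp3-style exponential-weight learner adapted to relative (dueling) feedback. It is an assumption-laden toy, not the paper's exact REX3 update: the `duel` callback, the mixing parameter `gamma`, and the importance-weighted update are all illustrative choices.

```python
import math
import random

def rex3_sketch(K, T, duel, gamma=0.1):
    """Exp3-style sketch for utility-based dueling bandits.

    `duel(a, b)` is assumed to return a relative reward in [-1, 1],
    positive when arm `a` beats arm `b`. This is an illustrative
    sketch of the exponential-weight idea, NOT the paper's exact
    REX3 algorithm.
    """
    w = [1.0] * K
    for _ in range(T):
        total = sum(w)
        # Mix the weight distribution with uniform exploration.
        p = [(1 - gamma) * wi / total + gamma / K for wi in w]
        # Draw both arms of the duel independently from p.
        a = random.choices(range(K), weights=p)[0]
        b = random.choices(range(K), weights=p)[0]
        r = duel(a, b)  # relative feedback in [-1, 1]
        # Importance-weighted exponential updates: the winner's
        # weight grows, the loser's shrinks.
        w[a] *= math.exp(gamma * (r / 2) / (K * p[a]))
        w[b] *= math.exp(gamma * (-r / 2) / (K * p[b]))
    total = sum(w)
    return [wi / total for wi in w]
```

Under persistent relative feedback favoring one arm, the returned weight distribution concentrates on that arm, which is the qualitative behavior the O(√(K ln(K) T)) regret bound formalizes.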