You might want to look at Boltzmann/softmax exploration if you want to weight the probability of selecting each action as a function of its current estimated value. One tricky bit is figuring out a good setting for the temperature parameter. Another poster alluded to softmax. In my experience it doesn't really perform better than a simple ε-greedy approach, but maybe it has worked well for others?
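In case it helps, here's a rough sketch of the two selection rules side by side. It assumes you keep a plain list of estimated values, one per action; the function names and the `values`, `temperature`, and `epsilon` parameters are just mine for illustration, not from any particular library.

```python
import math
import random


def softmax_probs(values, temperature):
    """Boltzmann/softmax selection probabilities for a list of value estimates."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the max before exponentiating to avoid overflow;
    # this does not change the resulting probabilities.
    best = max(values)
    weights = [math.exp((v - best) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_select(values, temperature, rng=random):
    """Sample an action index with probability given by softmax_probs."""
    probs = softmax_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


def epsilon_greedy_select(values, epsilon, rng=random):
    """With probability epsilon pick uniformly at random, else pick the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])
```

The temperature is where the tuning pain comes from: a large value pushes the probabilities toward uniform (lots of exploration), a small value makes the choice nearly greedy, and what counts as "large" or "small" depends on the scale of your value estimates.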