Imagine I have a rich feature vector $x$ associated with each user, and I have to choose between a set of prices $A$ to offer a unit quantity of a single good to the user. Imagine prices are sticky (users cannot get a new identity to get a new price) and non-transferable (users cannot give someone else a price offered to them). When I offer a price to a user, I see their response to that price, but not to other prices that I could have offered them. I experience a reward which is either a known function of the price offered (e.g., price minus cost) if they choose to buy or zero if they do not. (How long do I wait to decide they will never buy? Good question.) That setup looks like classic learning through exploration, which could be attacked using the offset tree to warm start along with an online exploration strategy. Problem solved!

Except $\ldots$ it feels like there is more information here. Specifically, if I offer a user a particular price $a$ and they purchase, I could assume they would have purchased at any price $a^\prime \leq a$, and since the reward is a known function of the price given purchase that implies additional rewards are being revealed. Similarly if I offer a user a particular price $a$ and they do not purchase, I could assume they would not have purchased at any price $a^\prime \geq a$, so again additional rewards are being revealed. These assumptions seem reasonable for a non-luxury good.
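To make the extra information concrete, here is a minimal Python sketch of which rewards one observation would reveal under the monotonicity assumption above. The function name `revealed_rewards`, the price menu, and the "price minus cost" reward are illustrative choices of mine (the reward form is only the example mentioned earlier), not a prescribed implementation.

```python
from typing import Dict, Sequence


def revealed_rewards(prices: Sequence[float], offered: float,
                     purchased: bool, cost: float = 0.0) -> Dict[float, float]:
    """Rewards implied by a single observation, assuming monotone demand.

    Assumption (hypothetical reward model): reward is price - cost on a
    purchase and zero otherwise.
    """
    if purchased:
        # A buyer at `offered` is assumed to buy at any lower price as well.
        return {p: p - cost for p in prices if p <= offered}
    # A non-buyer at `offered` is assumed to refuse any higher price as well.
    return {p: 0.0 for p in prices if p >= offered}


prices = [1.0, 2.0, 3.0, 4.0]  # hypothetical price menu A
bought = revealed_rewards(prices, offered=3.0, purchased=True, cost=0.5)
declined = revealed_rewards(prices, offered=3.0, purchased=False, cost=0.5)
print(bought)
print(declined)
```

Note that the two cases reveal different subsets of $A$, and which subset I get depends on whether the user bought.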

No problem, right? The filter-offset tree can handle multiple rewards being revealed per historical instance, so I should just use that. Unfortunately, in the filter-offset tree case the set of actions whose rewards are revealed is chosen independently of the rewards. Here, the set of actions whose rewards are revealed depends upon the rewards, which is a recipe for bias. The situation is analogous to asking a friend about a recent trip to Vegas: if they won money, they will talk about it for hours, whereas if they lost money all you get is "it was ok."
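A small simulation can illustrate the selection effect. This is a toy sketch under assumptions of my own: each user has a hidden valuation drawn uniformly from the price menu, buys exactly when the offered price is at most that valuation (so the monotonicity assumption holds exactly), cost is zero, and prices are offered uniformly at random. It compares the true mean reward of each price with two estimates: the average of $r (a)$ over rounds where that price was actually offered, and the naive average over every round in which that price's reward was revealed.

```python
import random
from collections import defaultdict

random.seed(0)
prices = [1, 2, 3, 4]   # hypothetical price menu A
n_rounds = 200_000

naive_sum, naive_cnt = defaultdict(float), defaultdict(int)
chosen_sum, chosen_cnt = defaultdict(float), defaultdict(int)

for _ in range(n_rounds):
    valuation = random.choice(prices)  # hidden willingness to pay (assumed uniform)
    offered = random.choice(prices)    # uniform exploration over prices
    purchased = offered <= valuation

    # Estimate 1: use only the reward of the price actually offered.
    chosen_sum[offered] += float(offered) if purchased else 0.0
    chosen_cnt[offered] += 1

    # Estimate 2: naively average every reward the outcome reveals.
    if purchased:
        revealed = {p: float(p) for p in prices if p <= offered}
    else:
        revealed = {p: 0.0 for p in prices if p >= offered}
    for p, r in revealed.items():
        naive_sum[p] += r
        naive_cnt[p] += 1

# True mean reward of price p is p * P(valuation >= p).
true_mean = {p: p * sum(v >= p for v in prices) / len(prices) for p in prices}
chosen_mean = {p: chosen_sum[p] / chosen_cnt[p] for p in prices}
naive_mean = {p: naive_sum[p] / naive_cnt[p] for p in prices}

for p in prices:
    print(f"price {p}: true {true_mean[p]:.3f}  "
          f"chosen-only {chosen_mean[p]:.3f}  all-revealed {naive_mean[p]:.3f}")
```

In this toy model the revealed rewards are all correct, yet the naive average is still off: the highest price's reward is mostly revealed through refusals, so its estimate lands near $4/7$ instead of the true value of 1, while the chosen-only average stays close to the truth.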

The setup can be formalized as follows:

- World chooses $(x, \omega, r)$ from $D$ and reveals $(x, \omega)$.
- Player chooses $a \in A$ via $p (a | x, \omega)$.
- World chooses $\mathcal{A} \in \mathcal{P} (A)$ via $q (\mathcal{A} | x, \omega, r, a)$.
  - Requiring $a \in \mathcal{A}$ seems reasonable. In other words, I always get to observe at least the action I picked. Maybe this isn't strictly required, but it seems to fit what happens in practice.
- World reveals $\{ r (a) | a \in \mathcal{A} \}$.
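The protocol above can be sketched as a single round of interaction. This is only an illustrative skeleton: `play_round` and its three callables (`sample_world`, `policy`, `reveal`) are hypothetical names standing in for $D$, $p$, and $q$, and I have chosen to enforce the $a \in \mathcal{A}$ requirement with an explicit check.

```python
import random
from typing import Callable, Dict, Hashable, Set, Tuple

Action = Hashable
Rewards = Dict[Action, float]


def play_round(
    sample_world: Callable[[], Tuple[object, object, Rewards]],
    policy: Callable[[object, object], Dict[Action, float]],
    reveal: Callable[[object, object, Rewards, Action], Set[Action]],
    rng: random.Random,
) -> Tuple[Action, float, Rewards]:
    """One round: returns (chosen action, its probability, revealed rewards)."""
    x, omega, r = sample_world()           # world draws (x, omega, r) from D
    probs = policy(x, omega)               # p(a | x, omega) as a dict over A
    actions = list(probs)
    a = rng.choices(actions, weights=[probs[b] for b in actions])[0]
    revealed_set = reveal(x, omega, r, a)  # a draw from q(A | x, omega, r, a)
    if a not in revealed_set:
        # Assumed requirement: the chosen action is always observed.
        raise ValueError("chosen action must be in the revealed set")
    return a, probs[a], {b: r[b] for b in revealed_set}


# Toy usage: two actions, and only the chosen action's reward is revealed.
full_rewards = {"low": 1.0, "high": 0.0}
a, prob, observed = play_round(
    sample_world=lambda: (None, None, full_rewards),
    policy=lambda x, omega: {"low": 0.5, "high": 0.5},
    reveal=lambda x, omega, r, a: {a},
    rng=random.Random(0),
)
print(a, prob, observed)
```

The important detail is the signature of `reveal`: it receives the full reward vector `r`, which is exactly what allows the revealed set to depend on the rewards.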

For the price differentiation example, the world's choice is deterministic: \[
q (\mathcal{A} | x, \omega, r, a) =
\begin{cases}
\{ a^\prime | a^\prime \leq a \} & \mbox{if } r (a) > 0; \\
\{ a^\prime | a^\prime \geq a \} & \mbox{if } r (a) = 0.
\end{cases}
\] Now I'm wondering under what circumstances I can use the extra information. Clearly I can always throw the extra information away and inspect only $r (a)$, which would be the vanilla offset tree. Can I do better?
