Skip to content Skip to navigation


Papers under Review and Research in Progress

Smoothness-Adaptive Contextual Bandits.

With Ahmadreza Momeni and Stefan Wager.

We study a non-parametric multi-armed bandit problem with stochastic covariates, where a key complexity driver is the smoothness of payoff functions with respect to covariates. Previous studies have focused on deriving minimax-optimal algorithms in cases where it is a priori known how smooth the payoff functions are. In practice, however, the smoothness of payoff functions is typically not known in advance, and misspecification of smoothness may severely deteriorate the performance of existing methods. In this work, we consider a framework where the smoothness of payoff functions is not known, and study when and how algorithms may adapt to unknown smoothness. First, we establish that designing algorithms that adapt to unknown smoothness of payoff functions is, in general, impossible. However, under a self-similarity condition, we establish that adapting to unknown smoothness is possible, and further devise a general policy for achieving smoothness-adaptive performance. Our policy infers the smoothness of payoffs throughout the decision-making process, while leveraging the structure of non-adaptive off-the-shelf policies. We establish that for problem settings with differentiable or non-differentiable payoffs this policy guarantees the regret rate that is achievable when the smoothness of payoffs is known a priori.

On the Disclosure of Promotion Value in Platforms with Learning Sellers.

With Gregory Macnamara and Daniela Saban.

In many marketplaces that facilitate trade with the objective of maximizing consumer surplus, prices are set by revenue-maximizing sellers but platforms can influence prices through (i) price-dependent promotion policies that can increase demand for a product by featuring it in a prominent position in the webpage and (ii) the information revealed to sellers about the value of being promoted. Identifying effective joint information design and promotion policies is a challenging dynamic problem as sellers can sequentially learn the promotion value from sales observations and update prices accordingly. We introduce the notion of confounding promotion policies, which are designed to prevent a Bayesian seller from learning the promotion value (at the expense of the short-run loss of diverting consumers from the best product offering). Leveraging these policies, we characterize the maximum long-run average consumer surplus that is achievable through joint information design and promotion policies when the seller sets prices myopically. We then establish a Bayesian Nash equilibrium by showing that the seller's best response to the platform's optimal policy is to price myopically at every history. The equilibrium we identify is platform-optimal within the class of horizon-maximin equilibria, in which strategies are not predicated on precise knowledge of the horizon length, and are designed to maximize payoff over the worst-case horizon. Our analysis allows one to identify effective platform policies in a broad range of demand models.

Sequential Procurement with Contractual and Experimental Learning.

With Gregory Macnamara and Daniela Saban.

We study the design of sequential procurement strategies that integrate stochastic and strategic information. We consider a buyer who repeatedly demands a certain good and is unable to commit to long-term contracts. In each time period, the buyer makes a price offer to a seller who has private, persistent information regarding his cost and quality of provision. If the offer is accepted, the seller provides the good with a stochastic quality that is not contractible. The buyer can therefore learn from the (strategic) acceptance decisions taken by the seller, and from the (stochastic) quality delivered whenever a purchase occurs. Hence, the buyer not only faces a tradeoff between exploration and exploitation, but also needs to decide how to explore: by facilitating quality observations, or by strategically separating seller types. We characterize the Perfect Bayesian Equilibria of this sequential interaction and show that the buyer’s equilibrium strategy consists of a dynamic sequence of thresholds on her belief about the seller’s type. In equilibrium, the buyer offers high prices that incentivize trade and quality experimentation at early stages of the interaction and, after gathering enough information, she may advance to offering low prices that partially separate seller types. Furtheremore, we identify the effect that strategic sellers may have on the buyer’s optimal strategy relative to more traditional dynamics, and establish that, paradoxically, when sellers are strategic, the ability to observe delivered quality is not always beneficial for the buyer.
  • finalist of Informs Revenue Management & Pricing Section 2019 student paper award (G. Macnamara)

Adaptive Sequential Experiments with Unknown Information Arrival Processes.

With Ahmadreza Momeni.

Sequential experiments are often designed to strike  a balance between maximizing immediate payoffs based on available information, and acquiring new information that is essential for maximizing future payoffs. This trade-off is captured by the multi-armed bandit (MAB) framework that has been studied and applied, typically when at each time epoch feedback is received only on the action that was selected at that epoch. However, in many practical settings, including product recommendations, dynamic pricing, retail management, and health care, additional information may become available between decision epochs. We introduce a generalized MAB formulation in which auxiliary information may appear arbitrarily over time. By obtaining matching lower and upper bounds, we characterize the minimax complexity of this family of problems as a function of the information arrival process, and study how salient characteristics of this process impact policy design and achievable performance. We establish that while Thompson sampling and UCB policies leverage additional information naturally, policies with exogenous exploration rate may not exhibit such robustness. We introduce a virtual time indexes method for dynamically controlling the exploration rate of such policies, and apply it for designing epsilon-greedy-type policies that, without any prior knowledge on the information arrival process, attain the best performance that is achievable when the information arrival process is a priori known. We use data from a large media site to analyze the value that may be captured in practice by leveraging auxiliary information for designing content recommendations.
  • finalist of Informs Applied Probability Society 2018 student paper award (A. Momeni)
  • early version appeared in Advances in Neural Information Processing Systems 31 (NeurIPS 2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (editors), 7473-7482. Poster available here and 3-minutes video available here

Value Loss in Allocation Systems with Provider Guarantees.

With Dan A. Iancu and Xavier Warnes.

Many operational settings share the following three features: (i) a centralized planning system allocates tasks to workers or service providers, (ii) the providers generate value by completing the tasks, and (iii) the completion of tasks influences the providers' welfare. In such cases, the planning system's allocations often entail trade-offs between the service providers' welfare and the total value that is generated, and concern arises that allocations that are good under one metric may perform poorly under the other. We propose a broad framework for quantifying the magnitude of value losses when allocations are restricted to satisfy certain desirable guarantees to the service providers. We consider a general class of guarantees that includes many considerations of practical interest arising, e.g., in the design of sustainable two-sided markets, in workforce welfare and compensation, or in sourcing and payments in supply chains, among other application domains. We derive tight bounds on the relative value loss, and show that this loss is limited for any restriction included in our general class. Our analysis shows that when many providers are present, the largest losses are driven by fairness considerations, whereas when few providers are present, they are driven by the heterogeneity in the providers' effectiveness to generate value; when providers are perfectly homogenous, losses never exceed 50%. Lastly, we demonstrate numerically using both real-world and synthetic data that the loss can be small in several cases of practical interest.

Published Research Papers

Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-Stationary Rewards.

With Omar Besbes and Assaf Zeevi. Stochastic Systems, 2019, 9 (4), 319-337.

Multi-armed bandit problems have been used for modelling a broad variety of practical settings where uncertainty on rewards associated with different actions requires the design of sequential experimentation. This class of problems have been studied extensively when the reward distributions do not change over time, and uncertainty is purely spatial and essentially amounts to identifying the optimal action. We complement this literature by developing a flexible nonparametric model for temporal uncertainty in the rewards. The extent of temporal uncertainty is measured via the cumulative change of mean rewards over the horizon, a metric we refer to as temporal variation, and regret is measured relative to a dynamic oracle that plays the pointwise optimal action at each instant in time. Assuming that nature can choose any sequence of mean rewards such that their temporal variation does not exceed some temporal uncertainty budget, we characterize the complexity of this problem via the minimax regret, which depends on this budget (the hardness of the problem), the horizon length, and the number of arms. When the uncertainty budget is not known a priori, we develop a family of policies that adapt to the realized variation in rewards and provide numerical evidence suggesting that these policies are nearly minimax optimal.
  • early version appeared in Advances in Neural Information Processing Systems 27 (NIPS 2014), Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence and K. Q. Weinberger (editors), 199-207

Learning in Repeated Auctions with Budgets: Regret Minimization and Equilibrium.

With Santiago R. Balseiro. Management Science, 2019, 65 (9), 3952-3968. 

In online advertising markets, advertisers often purchase ad placements through bidding in repeated auctions based on realized viewer information. We study how budget-constrained advertisers may compete in such sequential auctions in the presence of uncertainty about future bidding opportunities and competition. We formulate this problem as a sequential game of incomplete information, where bidders know neither their own valuation distribution, nor the budgets and valuation distributions of their competitors. We introduce a family of practical bidding strategies we refer to as adaptive pacing strategies, in which advertisers adjust their bids according to the sample path of expenditures they exhibit, and analyze the performance of these strategies in different competitive settings. We establish the asymptotic optimality of these strategies when competitors’ bids are independent and identically distributed over auctions, but also when competing bids are arbitrary. When all the bidders adopt these strategies, we establish the convergence of the induced dynamics and characterize a regime (well motivated in the context of online advertising markets) under which these strategies constitute an approximate Nash equilibrium in dynamic strategies: the benefit from unilaterally deviating to other strategies, including ones with access to complete information, becomes negligible as the number of auctions and competitors grows large. This establishes a connection between regret minimization and market stability, by which advertisers can essentially follow approximate equilibrium bidding strategies that also ensure the best performance that can be guaranteed off equilibrium.

The Competitive Facility Location Problem for a Duopoly: Advances Beyond Trees.

With Daniela Saban and Nicolas E. Stier-Moses. Operations Research, 2018, 66 (4), 1058-1067. 

We consider a competitive facility location problem on a network where consumers located on vertices wish to connect to the nearest facility. Knowing this, each competitor locates a facility on a vertex, trying to maximize market share. We focus on the two-player case and study conditions that guarantee the existence of a pure-strategy Nash equilibrium for progressively more complicated classes of networks. For general graphs, we show that attention can be restricted to a subset of vertices referred to as the central block. By constructing trees of maximal bi-connected components, we obtain sufficient conditions for equilibrium existence. Moreover, when the central block is a vertex or a cycle (for example, in cactus graphs), this provides a complete and efficient characterization of equilibria. In that case, we show that both competitors locate their facilities in a solution to the 1-median problem, generalizing a well-known insight arising from Hotelling’s model. We further show that an equilibrium must solve the 1-median problem in other classes of graphs, including grids, which essentially capture the topology of urban networks. In addition, when both players select a 1-median, the solution must be at equilibrium for strongly-chordal graphs, generalizing a previously known result for trees.
  • online companion available here

Framework Agreements in Procurement: An Auction Model and Design Recommendations.

With Lijian Lu and Gabriel Y. Weintraub. M&SOM, 2017, 19 (4), 586-603. 

Framework agreements (FAs) are procurement mechanisms commonly used by buying agencies around the world to satisfy demand that arises over a certain time horizon. This paper is one of the first in the literature that provides a formal understanding of FAs, with a particular focus on the cost uncertainty bidders face over the FA time horizon. We generalize standard auction models to include this salient feature of FAs and analyze this model theoretically and numerically. First, we show that FAs are subject to a sort of winner’s curse that in equilibrium induces higher expected buying prices relative to running first-price auctions as needs arise. Then, our results provide concrete design recommendations that alleviate this issue and decrease buying prices in FAs, highlighting the importance of (i) monitoring the price charged at the open market by the FA winner to bound the buying price; (ii) implementing price indexes for the random part of suppliers’ costs; and (iii) allowing suppliers the flexibility to reduce their prices to compete with the open market throughout the selling horizon. These prescriptions are already being used by the Chilean government procurement agency that buys US$2 billion worth of contracts yearly using FAs.
  • early version appeared in Charting a Course in Public Procurement, Innovation & Knowledge Sharing, International Public Procurement Conference Publications (practitioners' outlet), 2012
  • online companion available here and additional supporting material available here

Optimization in Online Content Recommendation Services: Beyond Click-Through Rates.

With Omar Besbes and Assaf Zeevi. M&SOM, 2016, 18 (1), 15-33. 

A new class of online services allows Internet media sites to direct users from articles they are currently reading to other content they may be interested in. This process creates a “browsing path” along which there is potential for repeated interaction between the user and the provider, giving rise to a dynamic optimization problem. A key metric that often underlies this recommendation process is the click-through rate (CTR) of candidate articles. Whereas CTR is a measure of instantaneous click likelihood, we analyze the performance improvement that one may achieve by some lookahead that accounts for the potential future path of users. To that end, by using some data of user path history at major media sites, we introduce and derive a representation of content along two key dimensions: clickability, the likelihood to click to an article when it is recommended; and engageability, the likelihood to click from an article when it hosts a recommendation. We then propose a class of heuristics that leverage both clickability and engageability, and provide theoretical support for favoring such path-focused heuristics over myopic heuristics that focus only on clickability (no lookahead). We conduct a live pilot experiment that measures the performance of a practical proxy of our proposed class, when integrated into the operating system of a worldwide leading provider of content recommendations, allowing us to estimate the aggregate improvement in clicks per visit relative to the CTR-driven current practice. The documented improvement highlights the importance and the practicality of efficiently incorporating the future path of users in real time.

Non-Stationary Stochastic Optimization.

With Omar Besbes and Assaf Zeevi. Operations Research, 2015, 63 (5), 1227-1244. 

We consider a non-stationary variant of a sequential stochastic optimization problem, in which the underlying cost functions may change along the horizon. We propose a measure, termed variation budget, that controls the extent of said change, and study how restrictions on this budget impact achievable performance. We identify sharp conditions under which it is possible to achieve long-run average optimality and more refined performance measures such as rate optimality that fully characterize the complexity of such problems. In doing so, we also establish a strong connection between two rather disparate strands of literature: (1) adversarial online convex optimization and (2) the more traditional stochastic approximation paradigm (couched in a non-stationary setting). This connection is the key to deriving well-performing policies in the latter, by leveraging structure of optimal policies in the former. Finally, tight bounds on the minimax regret allow us to quantify the “price of non-stationarity,” which mathematically captures the added complexity embedded in a temporally changing environment versus a stationary one.
  • honorable mention at the Informs 2013 G. Nicholson student paper award
  • online companion available here