1 Introduction
Human decision-makers are remarkably good at making decisions under rather imprecise specifications of the decision-making problem, both in terms of constraints and objective. One might argue that a human decision-maker can quite reliably learn from observed previous decisions – a traditional learning-by-example setup. At the same time, when we try to turn these decision-making problems into actual optimization problems, we often run into all kinds of issues in specifying the model. In an ideal world, we would be able to infer or learn the optimization problem from previously observed decisions taken by an expert.
This problem naturally occurs in many settings where we do not have direct access to the decision-maker’s preferences or objective function but can observe his behaviour, and where the learner as well as the decision-maker have access to the same information. Natural examples are as diverse as making recommendations based on user history and strategic planning problems, where the agent’s preferences are unknown but the system is observable. Other examples include knowledge transfer from a human planner into a decision support system: often human operators have arrived at finely-tuned “objective functions” through many years of experience, and in many cases it is desirable to replicate the decision-making process, both for scaling up and for potentially including it in large-scale scenario analysis and simulation to explore responses under varying conditions.
Here we consider the learning of preferences or objectives from an expert by means of observing his actions. More precisely, we observe a set of input parameters and corresponding decisions of the form $(p_t, x_t)$ for $t = 1, \dots, T$. They are such that $p_t \in P$ is a certain realization of problem parameters from a given set $P \subseteq \mathbb{R}^k$ and $x_t$ is an optimal solution to the optimization problem
(1)  $\min\; c_{\text{true}}^\top x$
(2)  $\text{s.t.}\;\; x \in X(p_t),$
where $c_{\text{true}} \in \mathbb{R}^n$ is the expert’s true but unknown objective and $X(p_t) \subseteq \mathbb{R}^n$ for some (fixed) $n \in \mathbb{N}$. We assume that we have full information on the feasible set $X(p)$ and that we can compute $\min_{x \in X(p)} c^\top x$ for any candidate objective $c$ and any $p \in P$. We present two online-learning algorithms, based on the multiplicative weights update (MWU) algorithm and online gradient descent (OGD) respectively, that allow us to learn a strategy $c_1, \dots, c_T$ of subsequent objective function choices with the following guarantee: if we optimize according to the surrogate objective function $c_t$ instead of the actual unknown objective function $c_{\text{true}}$ in response to parameter realization $p_t$, we obtain a sequence of optimal decisions (w.r.t. each $c_t$) given by $\bar{x}_t \in \arg\min_{x \in X(p_t)} c_t^\top x$
that are essentially as good as the decisions taken by the expert on average. To this end, we interpret the observations of parameters and expert solutions as revealed over multiple rounds such that in each round $t$ we are shown the parameters $p_t$ first, then take our optimal decision $\bar{x}_t$ according to our objective function $c_t$, then we are shown the solution $x_t$ chosen by the expert, and finally we are allowed to update our objective function for the next round. For this setup, we will be able to show that our MWU-based algorithm attains an error bound of
(4)  $\frac{1}{T} \sum_{t=1}^{T} (c_{\text{true}} - c_t)^\top (\bar{x}_t - x_t) \;\le\; 2K \sqrt{\frac{\ln n}{T}},$
where $K$ is an upper bound on the diameter of the feasible regions, i.e. $\operatorname{diam}_\infty(X(p_t)) \le K$ for all $t$. This implies that both the deviations in true cost as well as the deviations in surrogate cost can be made arbitrarily small on average. In other words, the average regret for having decided optimally according to the surrogate objectives vs. having decided optimally for the true objective vanishes at a rate of $O(1/\sqrt{T})$. While this algorithm is only applicable if $c_{\text{true}} \ge 0$ holds, our algorithm based on OGD works without this restriction. If $K$ is an upper bound on the diameter of the feasible regions $X(p_t)$, and $C$ is an upper bound on the diameter of the set $\mathcal{C}$ from which both $c_{\text{true}}$ and the $c_t$’s originate, then the OGD-based algorithm achieves an error bound of
(5)  $\frac{1}{T} \sum_{t=1}^{T} (c_{\text{true}} - c_t)^\top (\bar{x}_t - x_t) \;\le\; O\!\left(\frac{CK}{\sqrt{T}}\right).$
These results show that linear objective functions over general feasible sets can be learned from relatively few observations of historical parameter-solution pairs. We will derive various extensions of our scheme, such as approximately learning non-linear objective functions and learning from suboptimal decisions. We will also briefly discuss the case where the objective is known, but some linear constraints are unknown.
Literature Overview
The idea of learning or inferring parts of an optimization model from data is a reasonably well-studied problem under many different assumptions and applications, and it has gained significant attention in the optimization community over the last few years, as discussed for example in den Hertog and Postek (2016), Lodi (2016) or Simchi-Levi (2014). These papers argue that there would be significant benefits in combining traditional optimization models with data-derived components. Most approaches in the literature focus on deriving the objective function of an expert decision-maker in a static fashion, based on past observations of input data and the decisions he took in each instance. In almost all cases, the objective functions are learned by considering the KKT conditions or the dual of the (parameterized) optimization problem, and as such convexity of both the feasible region and the objective function is inherently assumed. Examples of this approach include Keshavarz et al. (2011), Li (2016) as well as Thai and Bayen (2018), where the latter also considers the derivation of variational inequalities from data. Sometimes, distributional assumptions regarding the observations are made as well. Applications of such approaches have been studied heavily in the context of energy systems (Ratliff et al. (2014); Konstantakopoulos et al. (2016)), robot motion (Papadopoulos et al. (2016); Yang et al. (2014)), medicine (Sayre and Ruan (2014)) and revenue management (Kallus and Udell (2015); Qiang and Bayati (2016); Chen et al. (2015); Kallus and Udell (2016); Bertsimas and Kallus (2016)); also in the situation where the observed decisions were not necessarily optimal (Nielsen and Jensen (2004)).
Very closely related to our learning approach in terms of the problem formulation is Esfahani et al. (2018). This work studies different loss functions for evaluating a learned objective function on a data sample, which leads the authors to the minimization of the same regret function that we consider in the present paper. However, as their solution approach is based on duality, it does not extend to the integer case like the ideas presented here. Also closely related is the research reported in Troutt et al. (2005), which was later extended in Troutt et al. (2006), where an optimization model is defined that searches for a linear optimization problem minimizing the total difference between the observed solutions and the solutions found by optimizing according to that optimization problem. In the latter case, the models are solved using LP duality and cutting planes. In the follow-up work Troutt et al. (2008), a genetic algorithm is used to solve the problem heuristically under rather general assumptions, but inherently without any quality guarantees, and in Troutt et al. (2011) the authors study experimental setups for learning objectives under various stochastic assumptions, focussing on maximum likelihood estimation, which is generally the case for their line of work; we make no such assumptions.
Closely related to learning optimization models from observed data is the subject of inverse optimization. Here the goal is to find an objective function that renders the observed solutions optimal with respect to the concurrently observed parameter realizations. Approaches in this field mostly stem from convex optimization, and they are used for inverse optimal control (Iyengar and Kang (2005); Panchea and Ramdani (2015); Molloy et al. (2016)), inverse combinatorial optimization (D. Burton (1997); Burton and Toint (1994, 1992); Sokkalingam et al. (1999); Ahuja and Orlin (2000)), integer inverse optimization (Schaefer (2009)) and inverse optimization in the presence of noisy data, such as observed decisions that were suboptimal (Aswani et al. (2018); Chan et al. (2018)). All these approaches rely heavily on duality and thus require convexity assumptions both for the feasible region and for the objectives. As such, they cannot deal with more complex, possibly non-convex decision domains. This in particular includes the important case of integer-valued decisions (such as yes/no decisions or, more generally, mixed-integer programming) and also many other non-convex setups (several of which admit efficient linear optimization algorithms). Previously, these cases could only be handled when the structure of the feasible set could be beneficially exploited. In contrast, our approach does not make any such assumptions and only requires access to a linear optimization oracle (in short: LP oracle) for the feasible region. Such an oracle is defined as a method which, given a vector $c \in \mathbb{R}^n$ and parameters $p$, returns $\arg\min_{x \in X(p)} c^\top x$.
Also related to our work are inverse reinforcement learning and apprenticeship learning, where the reward function is the target to be learned. However, in this case the underlying problem is modelled as a Markov decision process (MDP); see, for example, the results in Syed and Schapire (2007) and Ratia et al. (2012). Typically, the obtained guarantees are of a different form, though. Similarly, our work is not to be confused with the methods developed in Taskar et al. (2005) and Daumé et al. (2005), where online algorithms are used for learning aggregation vectors for edge features in graphs, with inverse optimization as a subroutine to define the update rule. In contrast, we do inverse optimization by means of online-learning algorithms, which is basically the reverse setup.
Our approach is based on online learning, and we mainly use the simple EXP algorithm here to attain the stated asymptotic regret bound. The EXP algorithm is commonly also called the Multiplicative Weights Update (MWU) algorithm and was developed in Littlestone and Warmuth (1994), Vovk (1990) as well as Freund and Schapire (1997) (see Arora et al. (2012); Hazan (2016) for a comprehensive introduction; see also Audibert et al. (2013)). A similar algorithm was used in Plotkin et al. (1995) for solving fractional packing and covering problems. To generalize the applicability of our approach, we also derive a second algorithm based on Online Gradient Descent (OGD) due to Zinkevich (see Zinkevich (2003)). We finally point out that the feedback we require is stronger than bandit feedback. This requirement is not unexpected, as the costs chosen by the “adversary” depend on our decision; as such, the bandit model (see, for example, Dani et al. (2008), Abbasi-Yadkori et al. (2011)) does not readily apply.
Contribution
To the best of the authors’ knowledge, this paper makes the first attempt to learn the objective function of an optimization model from data using an online-learning approach.
Online Learning of Optimization Problems
Based on samples of the input-output relationship of an optimization problem solved by a decision-maker, our aim is to learn an objective function which is consistent with the observed input-output relationship. This is indeed the best one can hope for: an adversary could play the same environment for a large number of rounds and then switch. This is less of an issue if the environments form samples that are independent and identically distributed (i.i.d.) from some distribution.
In our setup, the expert solves the decision-making problem repeatedly for different input parameter realizations. From these observations, we are able to learn a strategy of objective functions that emulates the expert’s unknown objective function such that the difference in solution quality between our solutions and those of the expert converges to zero on average.
While previous methods based on dualization or on the KKT system can lead to similar or even stronger results in the continuous/convex case, online learning allows us to relax this convexity requirement and to work with arbitrary decision domains as long as we are able to optimize a linear function over them, in particular mixed-integer programs (MIPs). Thus, we do not explicitly analyze the KKT system or the dual program (in the case of linear programs (LPs); see Remark 3.1). In fact, one might consider our approach as an algorithmic analogue of the KKT system (or dual program) in the convex case.
To summarize, we stress that (a) we do not make any assumptions regarding the distribution of the observations, (b) the observations can be chosen by a fully-adaptive adversary, and (c) we do not require any convexity assumptions regarding the feasible regions and only rely on access to an LP oracle. We would also like to mention that our approach can be extended to work with slowly changing objectives using appropriate online-learning algorithms such as, for example, those found in Jadbabaie et al. (2015) or Zinkevich (2003); the regret bounds will then depend on the rate of change.
A Broad Computational Study
We conduct extensive experiments to demonstrate the effectiveness and wide applicability of our algorithmic approach. To this end, we investigate its use for learning the objective functions of several combinatorial optimization problems that frequently occur in practice (possibly as subproblems of larger problems) and explore, among other things, how well the learned objective generalizes to unseen data samples.
The present paper is the full version of an extended abstract submitted to the International Conference on Machine Learning (ICML) 2017, see Bärmann et al. (2017).

2 Problem Setting
We consider the following family of optimization problems $P(p)$, which depend on parameters $p \in P$ for some parameter set $P \subseteq \mathbb{R}^k$:
(6)  $\min\; c_{\text{true}}^\top x$
(7)  $\text{s.t.}\;\; x \in X(p),$
where $c_{\text{true}} \in \mathbb{R}^n$ is the objective function and $X(p) \subseteq \mathbb{R}^n$ is the feasible region, which depends on the parameters $p$. Of particular interest to us will be feasible regions that arise as polyhedra defined by linear constraints and their intersections with integer lattices, i.e. the cases of LPs and MIPs:
$X(p) = \{x \in \mathbb{R}^{n_1} \times \mathbb{Z}^{n_2} : A(p)x \le b(p)\}$ with $A(p) \in \mathbb{R}^{m \times n}$ and $b(p) \in \mathbb{R}^m$, where $n = n_1 + n_2$. However, our approach can also readily be applied in the case of more complex feasible regions, such as mixed-integer sets bounded by convex functions:
$X(p) = \{x \in \mathbb{R}^{n_1} \times \mathbb{Z}^{n_2} : g(p, x) \le 0\}$ with $g(p, \cdot)$ convex – or even more general settings. In fact, for any possible choice of model for the sets of feasible decisions, we only require the availability of a linear optimization oracle, i.e. an algorithm which is able to determine $\arg\min_{x \in X(p)} c^\top x$ for any $c \in \mathbb{R}^n$ and $p \in P$. We call a decision optimal for $p$ if it is an optimal solution to $P(p)$.
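To make the oracle requirement concrete, here is a minimal toy sketch in Python (our own illustration, not part of the paper's method): it realizes a linear optimization oracle for a feasible region given by an explicit list of candidate points, as would be the case for a small combinatorial set or the vertex set of a polytope.

```python
from itertools import product

def vertex_oracle(c, candidates):
    """Toy linear optimization oracle: returns an argmin of c^T x over an
    explicitly enumerated feasible set, e.g. vertices of a small 0/1-polytope."""
    return min(candidates, key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))

# X(p) = all 0/1-vectors with at most one entry set (a toy feasible region)
X = [x for x in product([0, 1], repeat=3) if sum(x) <= 1]
best = vertex_oracle([2.0, -1.0, 3.0], X)  # -> (0, 1, 0)
```

For real instances one would of course call an LP/MIP solver instead of enumerating points; only the argmin interface matters for the algorithms discussed below.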
We assume that Problem $P(p)$ models a parameterized optimization problem which has to be solved repeatedly for various input parameter realizations $p \in P$. Our task is to learn the fixed objective function $c_{\text{true}}$ from given observations of the parameters and a corresponding optimal solution. To this end, we further assume that we are given a series of observations $(p_1, x_1), \dots, (p_T, x_T)$ of parameter realizations $p_t$ together with an optimal solution $x_t$ of $P(p_t)$ computed by the expert for $t = 1, \dots, T$; these observations are revealed over time in an online fashion: in round $t$, we obtain a parameter setting $p_t$ and compute an optimal solution $\bar{x}_t$ with respect to an objective function $c_t$ based on what we have learned about $c_{\text{true}}$ so far. Then we are shown the solution $x_t$ the expert with knowledge of $c_{\text{true}}$ would have taken and can use this information to update our inferred objective function for the next round. In the end, we would like to be able to use our inferred objective function to take decisions that are essentially as good as those chosen by the expert in an appropriate aggregation measure such as, for example, “on average” or “with high probability”. The quality of the inferred objective is measured in terms of the cost deviation between our solutions $\bar{x}_t$ and the solutions $x_t$ obtained by the expert – details of which will be given in the next section.
To fix some useful notation, let $(x)_i$ denote the $i$-th component of a vector $x$ throughout, and let $[n] := \{1, \dots, n\}$ for any natural number $n$. Furthermore, let $\mathbb{1}$ denote the all-ones vector in $\mathbb{R}^n$. Finally, we need a suitable measure for the diameter of a given set.
Definition 2.1.
The diameter of a set $X \subseteq \mathbb{R}^n$, denoted by $\operatorname{diam}_q(X)$, is the largest distance between any two points $x, y \in X$, measured in the $\ell_q$-norm, $q \in [1, \infty]$, i.e.
(8)  $\operatorname{diam}_q(X) := \sup_{x, y \in X} \|x - y\|_q.$
As a technical assumption, we further demand that $c_{\text{true}} \in \mathcal{C}$ for some convex, compact and nonempty subset $\mathcal{C} \subseteq \mathbb{R}^n$, which is known beforehand. This is no actual restriction, as $\mathcal{C}$ could be chosen to be any sufficiently large ball according to some $\ell_q$-norm, $q \in [1, \infty]$, for example. In particular, this ensures that we do not have to deal with issues that arise when rescaling our objective.
3 Learning Objectives
Ideally, we would like to find the true objective function $c_{\text{true}}$ as a solution to the following optimization problem:
(9)  $\min_{c \in \mathcal{C}}\; \sum_{t=1}^{T} \left( c^\top x_t - \min_{x \in X(p_t)} c^\top x \right),$
where $x_t$ is the optimal decision taken by the expert in round $t$. The true objective function $c_{\text{true}}$ is an optimal solution to Problem (9) with objective value $0$. This is because any solution $c \in \mathcal{C}$ is feasible and produces nonnegative summands
(10)  $c^\top x_t - \min_{x \in X(p_t)} c^\top x \;\ge\; 0$
for $t \in [T]$, while for $c = c_{\text{true}}$ each summand vanishes, as we assume $x_t$ to be optimal for $P(p_t)$ with respect to $c_{\text{true}}$.
Problem (9) contains $T$ instances of the following subproblem:
(11a)  $\min\; c^\top x$
(11b)  $\text{s.t.}\;\; x \in X(p_t).$
For each $t \in [T]$, the corresponding Subproblem (11) asks for an optimal solution when optimizing over the feasible set $X(p_t)$ with a given $c \in \mathcal{C}$ as the objective function. When solving Problem (9), we are interested in an objective function vector $c$ that delivers a consistent explanation for why the expert chose $x_t$ as his response to the parameters $p_t$ in round $t$. We call an objective function from some prescribed set $\mathcal{C}$ of objective functions consistent with the observations $(p_t, x_t)$, $t \in [T]$, if it is optimal for the resulting Problem (9). The aim is to find an objective for which the optimal solution of Subproblem (11) attains a value as close as possible to that of the expert’s decision, averaged over all observations. The approaches we present here will provide even stronger guarantees in some cases, such as the one described in Section 3.2, showing that we can replicate the decision-making behaviour of the expert.
Remark 3.1.
Note that in the case of polyhedral feasible regions, i.e. $X(p_t) = \{x \in \mathbb{R}^n : A_t x \le b_t\}$ with $A_t \in \mathbb{R}^{m \times n}$ and $b_t \in \mathbb{R}^m$ for $t \in [T]$, as well as a polyhedral region $\mathcal{C}$, Problem (9) can be reformulated as a linear program by dualizing the $T$ instances of Subproblem (11). This yields
(13a)  $\min\; \sum_{t=1}^{T} \left( c^\top x_t - b_t^\top y_t \right)$
(13b)  $\text{s.t.}\;\; A_t^\top y_t = c \quad \text{for } t \in [T],$
(13c)  $y_t \le 0 \quad \text{for } t \in [T],$
(13d)  $c \in \mathcal{C},$
where the $y_t$ are the corresponding dual variables and the $x_t$ are the observed decisions from the expert (i.e. the latter are part of the input data). This problem asks for a primal objective function vector $c$ that minimizes the total duality gap summed over all primal-dual pairs while all $y_t$’s shall be dual feasible, which makes the $x_t$’s the respective primal optimal solutions. Thus, Problem (9) can be seen as a direct generalization of the linear primal-dual optimization problem. In fact, our approach also covers non-convex cases, e.g. mixed-integer linear programs.
Problem (9) can be interpreted as a game over $T$ rounds between a player who chooses an objective function $c_t$ in round $t$ and a player who knows the true objective function and chooses the observations $(p_t, x_t)$ in a potentially adversarial way. The payoff of the latter player in each round is equal to $c_t^\top x_t - c_t^\top \bar{x}_t$, i.e. the difference in cost between our solution and the expert’s solution as measured by our guessed objective function $c_t$.
As Problem (9) is hard to solve in general, we will design online-learning algorithms that, rather than finding a single optimal objective $c$, find a strategy $c_1, \dots, c_T$ of objective functions to play in each round whose error in solution quality as compared to the true objective function is as small as possible. Our aim will then be to give a quality guarantee for this strategy in terms of the number of observations $T$.
To allow for approximation guarantees, it will not only be necessary that the set of possible objective functions to choose from is bounded, but also that the observed feasible sets have a common upper bound on their diameter.
From a meta perspective, our approach works as outlined in Algorithm 1.
It chooses an arbitrary objective $c_1 \in \mathcal{C}$ in the first round, as there is no better indication of what to do at this point. Then, in each round $t$, it computes an optimal solution $\bar{x}_t$ over $X(p_t)$ with respect to the current guess $c_t$ of the objective function. Upon the subsequent observation of the expert’s solution $x_t$, it updates its guess to an objective function $c_{t+1}$ to use in the next round.
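As a minimal sketch, this meta-scheme can be rendered as a generic Python loop (the function names and the update-callback interface are our own, purely illustrative):

```python
def learn_objective(observations, lin_oracle, update, c_init):
    """Generic online scheme: optimize with the current objective guess,
    observe the expert's solution, then update the guess.

    observations -- list of (p_t, x_t) pairs, revealed one per round
    lin_oracle   -- lin_oracle(c, p): a minimizer of c^T x over X(p)
    update       -- update(c, x_bar, x_expert): the next objective guess
    """
    c, history = c_init, []
    for p_t, x_t in observations:
        x_bar = lin_oracle(c, p_t)   # our decision under the current guess
        history.append((c, x_bar))   # record the strategy c_1, ..., c_T
        c = update(c, x_bar, x_t)    # learn from the expert's choice
    return history
```

The two algorithms derived below are instances of this scheme, differing only in the update rule they plug in.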
Clearly, the accumulated objective value of a strategy $c_1, \dots, c_T$ over the $T$ rounds is given by $\sum_{t=1}^{T} c_{\text{true}}^\top \bar{x}_t$, while that of the expert would be $\sum_{t=1}^{T} c_{\text{true}}^\top x_t$. Via the proposed scheme, it would be overly ambitious to demand $c_t \to c_{\text{true}}$, or even $c_t^\top \bar{x}_t \to c_{\text{true}}^\top x_t$, as the following example shows.
Example 3.2.
Consider the case $X(p_t) = \{x \in \mathbb{R}^n_{\ge 0} : \mathbb{1}^\top x = 1\}$ and $c_{\text{true}} = \mathbb{1}$ for all $t \in [T]$, so that every feasible point is an optimal solution for the expert. If the first player chooses $c_t = e_j$ for some $j \in [n]$ as his objective function guess in each round $t$, he will obtain optimal solutions with respect to $c_{\text{true}}$. However, both the objective functions and the objective values will be far off. Indeed, when taking the $\ell_\infty$-norm, we have $\|c_{\text{true}} - c_t\|_\infty = 1$ for all $t \in [T]$. And if $x_t = e_i$ with $i \ne j$ for all $t$, we additionally have $c_{\text{true}}^\top \bar{x}_t = c_{\text{true}}^\top x_t = 1$, but $c_t^\top \bar{x}_t = 0 \ne 1 = c_{\text{true}}^\top x_t$ for all $t \in [T]$.
Altogether, we cannot expect to approximate the true objective function or the true optimal values in general. Neither can we expect to approximate the expert’s solutions $x_t$, because even if we have the correct objective function in each round, the optima do not necessarily have to be unique.
As a more appropriate measure of quality, we will show that our algorithms based on online learning produce strategies with
(15)  $\frac{1}{T} \sum_{t=1}^{T} (c_{\text{true}} - c_t)^\top (\bar{x}_t - x_t) \;\xrightarrow{T \to \infty}\; 0,$
which, as we will see, directly implies both
(16)  $\frac{1}{T} \sum_{t=1}^{T} \left( c_t^\top x_t - c_t^\top \bar{x}_t \right) \;\xrightarrow{T \to \infty}\; 0$
and
(17)  $\frac{1}{T} \sum_{t=1}^{T} \left( c_{\text{true}}^\top \bar{x}_t - c_{\text{true}}^\top x_t \right) \;\xrightarrow{T \to \infty}\; 0,$
with nonnegative summands for all rounds $t$ in all three expressions. The objective error $\sum_{t=1}^{T} (c_t^\top x_t - c_t^\top \bar{x}_t)$ is the objective function of Problem (9) when relaxing the requirement to play the same objective function in each round and instead passing to a strategy of objective functions. Equation (16) states that the average objective error over all observations converges to zero as the number of observations goes to infinity. The solution error $\sum_{t=1}^{T} (c_{\text{true}}^\top \bar{x}_t - c_{\text{true}}^\top x_t)$ is the cumulative suboptimality of our solutions compared to the optimal solutions with respect to the true objective function. According to Equation (17), it equally tends to zero on average with an increasing number of observations. This means it is possible to take decisions which are essentially as good as the decisions of the expert with respect to $c_{\text{true}}$ over the long run.
Our measure (15) of the quality of a strategy of objective functions is derived from the notion of regret, which is commonly used in online learning to characterize the quality of a learning algorithm: given an algorithm which plays solutions $d_t$ from some decision set $D$ in response to loss functions $f_t$ observed from an adversary over rounds $t = 1, \dots, T$, its regret is given by $\sum_{t=1}^{T} f_t(d_t) - \min_{d \in D} \sum_{t=1}^{T} f_t(d)$. Minimizing the regret of a sequence of decisions thus aims at finding a strategy that performs at least as well as the best fixed decision in hindsight, i.e. the best static solution that could be played with full advance knowledge of the loss functions the adversary will play. See Hazan (2016), for example, for a broad introduction to regret minimization in online learning.
In our approach, we interpret the set $\mathcal{C}$ of possible objective functions in Problem (9) as the set of feasible decisions from which our learning algorithms choose an objective $c_t$ in each round $t$. Furthermore, we use $f_t(c) := c^\top (x_t - \bar{x}_t)$ as the corresponding loss function in round $t$. We are then interested in the regret against $c_{\text{true}}$, which is given by $\sum_{t=1}^{T} f_t(c_t) - \sum_{t=1}^{T} f_t(c_{\text{true}}) = \sum_{t=1}^{T} (c_{\text{true}} - c_t)^\top (\bar{x}_t - x_t)$. Equation (15) states that the average of this total error tends to zero as the number of observations increases. Note that $c_{\text{true}}$ is not necessarily the best fixed objective in hindsight – the latter would be given by a standard unit vector $e_j$ for a suitable index $j \in [n]$, which is rather meaningless in our case.
In the following, we derive two online-learning algorithms for which Equation (15) provably holds, as well as an intuitive heuristic for LPs for which Equation (15) holds empirically in our experiments in Section 4.
3.1 An Algorithm based on Multiplicative Weights Updates
A classical algorithm in online learning is the multiplicative weights update (MWU) algorithm, which solves the following problem: given a set of $n$ decisions, a player is required to choose one of these decisions in each round $t = 1, \dots, T$. Each time, after the player has chosen his decision, an adversary reveals to him the costs $(\ell_t)_i \in [-1, 1]$, $i \in [n]$, of the decisions in the current round. The objective of the player is to minimize his overall cost over the time horizon $T$. The MWU algorithm solves this problem by maintaining weights $w_t \in \mathbb{R}^n_{> 0}$ which are updated from round to round, starting with the initial weights $w_1 := \mathbb{1}$. These weights are used to derive a probability distribution $d_t := w_t / \|w_t\|_1$. In round $t$, the player samples a decision according to $d_t$. Upon observation of the costs $\ell_t$, the player updates his weights according to
(18)  $w_{t+1} := w_t \odot (\mathbb{1} - \eta \ell_t),$
where $\eta \in (0, \tfrac{1}{2}]$ is a suitable step size, in online learning also called the learning rate, and $\odot$ denotes the componentwise multiplication of two vectors. The expected cost of the player in round $t$ is then given by $d_t^\top \ell_t$, and the total expected cost is given by $\sum_{t=1}^{T} d_t^\top \ell_t$. MWU attains the following regret bound against any fixed distribution:
Lemma 3.3 (Arora et al. (2012, Corollary 2.2)).
The MWU algorithm guarantees that after $T$ rounds, for any distribution $d$ on the $n$ decisions, we have
(19)  $\sum_{t=1}^{T} \ell_t^\top d_t \;\le\; \sum_{t=1}^{T} \ell_t^\top d + \eta \sum_{t=1}^{T} |\ell_t|^\top d + \frac{\ln n}{\eta},$
where the $|\cdot|$ is to be understood componentwise.
The above regret bound is valid for any distribution $d$, in particular for the best distribution in hindsight, i.e. the distribution that would have performed best given the observed cost vectors $\ell_1, \dots, \ell_T$. The latter is again given by some standard unit vector.
We will now reinterpret the distributions $d_t$, a suitable distribution $d$ to compare their regret to, as well as the cost vectors $\ell_t$ in MWU in a way that will allow us to learn an objective function from observed solutions. Namely, we will identify the distributions $d_t$ with the objective functions $c_t$ in the strategy of the player and the distribution $d$ with the actual objective function $c_{\text{true}}$. The difference between the optimal solution $\bar{x}_t$ computed by the player and the optimal solution $x_t$ of the expert will then act as the cost vector (after appropriate normalization).
Naturally, this limits us to $c_{\text{true}} \ge 0$, i.e. the objective functions have to lie in the positive orthant (while the normalization $\|c_{\text{true}}\|_1 = 1$ is without loss of generality). However, whenever this restriction applies, we obtain a very lightweight method for learning the objective function of an optimization problem. In Section 3.3, we will present an algorithm which works without this assumption on $c_{\text{true}}$.
Our application of MWU to learning the objective function of an optimization problem proceeds as outlined in Algorithm 2.
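The following Python sketch gives one possible rendering of the MWU-based scheme (our own illustrative code, not the paper's exact pseudocode): the guess $c_t$ is the normalized weight vector, the cost vector is the normalized difference between the expert's solution and ours, and the weights are updated multiplicatively as in (18).

```python
import math

def mwu_learn(observations, lin_oracle, K, n):
    """MWU-based objective learning, assuming c_true >= 0.

    observations -- list of (p_t, x_t) expert pairs
    lin_oracle   -- lin_oracle(c, p): a minimizer of c^T x over X(p)
    K            -- upper bound on diam_inf(X(p_t)), used for normalization
    """
    T = len(observations)
    eta = min(0.5, math.sqrt(math.log(n) / T))   # learning rate
    w = [1.0] * n                                # initial weights w_1 = 1
    guesses = []
    for p_t, x_t in observations:
        total = sum(w)
        c_t = [wi / total for wi in w]           # guess lies in the simplex
        guesses.append(c_t)
        x_bar = lin_oracle(c_t, p_t)             # our decision
        loss = [(xt - xb) / K for xt, xb in zip(x_t, x_bar)]    # in [-1, 1]
        w = [wi * (1.0 - eta * li) for wi, li in zip(w, loss)]  # update (18)
    return guesses
```

On a toy instance where the expert repeatedly picks the same vertex, the weights quickly shift towards the coordinates that make the expert's choice the cheapest one.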
For the series of objective functions $c_1, \dots, c_T$ that our algorithm returns, we can establish the following guarantee:
Theorem 3.4.
Let $K \in \mathbb{R}_{\ge 0}$ with $\operatorname{diam}_\infty(X(p_t)) \le K$ for all $t \in [T]$. Then we have
$\frac{1}{T} \sum_{t=1}^{T} (c_{\text{true}} - c_t)^\top (\bar{x}_t - x_t) \;\le\; 2K \sqrt{\frac{\ln n}{T}},$
and in particular it also holds:

$\frac{1}{T} \sum_{t=1}^{T} \left( c_{\text{true}}^\top \bar{x}_t - c_{\text{true}}^\top x_t \right) \;\le\; 2K \sqrt{\frac{\ln n}{T}}$,
and
$\frac{1}{T} \sum_{t=1}^{T} \left( c_t^\top x_t - c_t^\top \bar{x}_t \right) \;\le\; 2K \sqrt{\frac{\ln n}{T}}$.
Proof.
According to the standard performance guarantee of MWU from Lemma 3.3, Algorithm 2 attains the following bound on the total error of the sequence $c_1, \dots, c_T$ compared to $c_{\text{true}}$ with respect to the cost vectors $\ell_t := \frac{1}{K}(x_t - \bar{x}_t)$:
$\sum_{t=1}^{T} \ell_t^\top c_t \;\le\; \sum_{t=1}^{T} \ell_t^\top c_{\text{true}} + \eta \sum_{t=1}^{T} |\ell_t|^\top c_{\text{true}} + \frac{\ln n}{\eta},$
where the $|\cdot|$ is to be understood componentwise. Using that each entry of $\ell_t$ is at most $1$ in absolute value and dividing by $T$, we can conclude
$\frac{1}{T} \sum_{t=1}^{T} \ell_t^\top (c_t - c_{\text{true}}) \;\le\; \frac{\eta}{T} \sum_{t=1}^{T} |\ell_t|^\top c_{\text{true}} + \frac{\ln n}{\eta T},$
and further, as $\|c_{\text{true}}\|_1 = 1$,
$\frac{1}{T} \sum_{t=1}^{T} \ell_t^\top (c_t - c_{\text{true}}) \;\le\; \eta + \frac{\ln n}{\eta T}.$
The right-hand side attains its minimum for $\eta = \sqrt{\frac{\ln n}{T}}$, which yields the bound
$\frac{1}{T} \sum_{t=1}^{T} \ell_t^\top (c_t - c_{\text{true}}) \;\le\; 2 \sqrt{\frac{\ln n}{T}}.$
Substituting back $\ell_t = \frac{1}{K}(x_t - \bar{x}_t)$ and using
$(c_t - c_{\text{true}})^\top (x_t - \bar{x}_t) = (c_{\text{true}} - c_t)^\top (\bar{x}_t - x_t),$
we obtain
(20)  $\frac{1}{T} \sum_{t=1}^{T} (c_{\text{true}} - c_t)^\top (\bar{x}_t - x_t) \;\le\; 2K \sqrt{\frac{\ln n}{T}}.$
Observe that for each summand we have $c_t^\top x_t - c_t^\top \bar{x}_t \ge 0$, as $x_t \in X(p_t)$ and $\bar{x}_t$ is a minimizer over this set with respect to $c_t$. With a similar argument, we see that $c_{\text{true}}^\top \bar{x}_t - c_{\text{true}}^\top x_t \ge 0$ for all $t \in [T]$. Thus, we have
(21)  $\frac{1}{T} \sum_{t=1}^{T} \left( c_{\text{true}}^\top \bar{x}_t - c_{\text{true}}^\top x_t \right) \;\le\; \frac{1}{T} \sum_{t=1}^{T} (c_{\text{true}} - c_t)^\top (\bar{x}_t - x_t) \;\le\; 2K \sqrt{\frac{\ln n}{T}},$
and similarly for the term $\frac{1}{T} \sum_{t=1}^{T} (c_t^\top x_t - c_t^\top \bar{x}_t)$ with an analogous argumentation. This establishes the claim. ∎
Note that by using exponential updates of the form $w_{t+1} := w_t \odot e^{-\eta \ell_t}$ in Line 13 of the algorithm, we could attain essentially the same bound, cf. (Arora et al., 2012, Theorem 2.3). Secondly, we remark that our choice of the learning rate $\eta$ requires the number of rounds $T$ to be known beforehand; if this is not the case, we can use the standard doubling trick (see Cesa-Bianchi and Lugosi (2006)) or use an anytime variant of MWU.
From the above theorem, we can conclude that the average error over all observations when choosing objective function $c_t$ in iteration $t$ of Algorithm 2 instead of $c_{\text{true}}$ converges to $0$ with an increasing number of observations at a rate of roughly $O(1/\sqrt{T})$:
Corollary 3.5.
Let $K \in \mathbb{R}_{\ge 0}$ with $\operatorname{diam}_\infty(X(p_t)) \le K$ for all $t \in [T]$. Then we have

$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \left( c_t^\top x_t - c_t^\top \bar{x}_t \right) = 0$
and
$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \left( c_{\text{true}}^\top \bar{x}_t - c_{\text{true}}^\top x_t \right) = 0$.
In other words, both the average error incurred from replacing the actual objective function $c_{\text{true}}$ by the estimates $c_t$ as well as the average error in solution quality with respect to $c_{\text{true}}$ tend to $0$ as $T$ grows.
Moreover, using Markov’s inequality, we also obtain the following quantitative bound on the fraction of observations that deviate by more than a given $\varepsilon > 0$ from the average cost:
Corollary 3.6.
Let $\varepsilon > 0$. Then the fraction of observations $t \in [T]$ with
(22)  $(c_{\text{true}} - c_t)^\top (\bar{x}_t - x_t) \;\ge\; \varepsilon$
is at most
(23)  $\frac{2K}{\varepsilon} \sqrt{\frac{\ln n}{T}}.$
In particular, for any $\alpha \in (0, 1)$ we have that after
(24)  $T \;\ge\; \frac{4K^2 \ln n}{\alpha^2 \varepsilon^2}$
observations, the fraction of observations with cost
(25)  $(c_{\text{true}} - c_t)^\top (\bar{x}_t - x_t) \;\ge\; \varepsilon$
is at most $\alpha$.
Proof.
Markov’s inequality states
(26)  $\left| \{ a \in A : f(a) \ge \varepsilon \} \right| \;\le\; \frac{1}{\varepsilon} \sum_{a \in A} f(a)$
for a finite set $A$, a nonnegative function $f : A \to \mathbb{R}_{\ge 0}$ and $\varepsilon > 0$. With $A = [T]$, $f(t) = (c_{\text{true}} - c_t)^\top (\bar{x}_t - x_t)$ for $t \in [T]$ as well as the bound from Theorem 3.4, we obtain the desired upper bound on the fraction of high deviations. The second part follows from solving
(27)  $\frac{2K}{\varepsilon} \sqrt{\frac{\ln n}{T}} \;\le\; \alpha$
for $T$ and plugging in values. ∎
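To illustrate the quantitative statement numerically, the following snippet evaluates the sample-size bound of Corollary 3.6 (a hypothetical helper of our own; it simply assumes the average-regret bound $2K\sqrt{\ln n / T}$ from Theorem 3.4):

```python
import math

def rounds_needed(K, n, alpha, eps):
    """Number of observations T after which, by Corollary 3.6, at most a
    fraction alpha of the rounds can incur a cost deviation of eps or more."""
    return math.ceil(4 * K**2 * math.log(n) / (alpha**2 * eps**2))

T = rounds_needed(K=1.0, n=100, alpha=0.1, eps=0.05)
# sanity check: at this T, the fraction bound (2K/eps) * sqrt(ln n / T)
# is indeed at most alpha
fraction_bound = (2 * 1.0 / 0.05) * math.sqrt(math.log(100) / T)
```

As expected from the square-root rate, halving $\varepsilon$ or $\alpha$ quadruples the required number of observations.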
Remark 3.7.
It is straightforward to extend the result from Theorem 3.4 to a more general setup, namely the learning of an objective function which is linearly composed from a set of basis functions. To this end, we consider the problem
(28)  $\min\; \sum_{i=1}^{n} c_i f_i(x)$
(29)  $\text{s.t.}\;\; x \in X(p),$
where $f := (f_1, \dots, f_n)$ is a vector of basis functions $f_i : \mathbb{R}^m \to \mathbb{R}$ on the compact feasible regions $X(p) \subseteq \mathbb{R}^m$, parameterized in $p$ as above. In order to apply Theorem 3.4 to this case, the diameter of the image of $f$ additionally needs to be finite, which is naturally the case, for example, if $f$ is Lipschitz continuous with respect to the maximum norm with Lipschitz constant $\Lambda$. Then we can change the cost function in Line 11 of Algorithm 2 to
(30)  $\ell_t := \frac{1}{K'} \left( f(x_t) - f(\bar{x}_t) \right),$
which yields a guarantee of
(31)  $\frac{1}{T} \sum_{t=1}^{T} (c_{\text{true}} - c_t)^\top \left( f(\bar{x}_t) - f(x_t) \right) \;\le\; 2K' \sqrt{\frac{\ln n}{T}},$
with $K' := \Lambda K$.
We would like to point out that the requirement to observe optimal solutions in order to learn the objective function which produced them can be relaxed in all of the above considerations. Assume that we observe $\gamma$-approximately optimal solutions instead, i.e. they satisfy $c_{\text{true}}^\top x_t \le \gamma z_t$ with $z_t := \min_{x \in X(p_t)} c_{\text{true}}^\top x$ for all $t \in [T]$ and some $\gamma \ge 1$. In this case, the upper bound
$\frac{1}{T} \sum_{t=1}^{T} (c_{\text{true}} - c_t)^\top (\bar{x}_t - x_t) \;\le\; 2K \sqrt{\frac{\ln n}{T}},$
which is analogous to what we derived in Theorem 3.4, still holds, as it does not depend on the optimality of the observed solutions. On the other hand, we have
$c_{\text{true}}^\top x_t - z_t \;\le\; (\gamma - 1)\, z_t$
due to the $\gamma$-optimality of the $x_t$’s, while $c_t^\top x_t - c_t^\top \bar{x}_t \ge 0$ holds due to the optimality of the $\bar{x}_t$’s with respect to the $c_t$’s. Altogether, this yields
$\frac{1}{T} \sum_{t=1}^{T} \left( c_{\text{true}}^\top \bar{x}_t - z_t \right) \;\le\; 2K \sqrt{\frac{\ln n}{T}} + (\gamma - 1) \cdot \frac{1}{T} \sum_{t=1}^{T} z_t,$
and consequently
$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \left( c_{\text{true}}^\top \bar{x}_t - z_t \right) \;\le\; (\gamma - 1) \cdot \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} z_t,$
such that in the limit, our solutions become essentially $\gamma$-optimal on average. Note that a similar result can be obtained if we assume an additive error in the observed solutions instead of a multiplicative one.
3.2 The Stable Case
While in most applications it is sufficient to be able to produce solutions via the surrogate objectives that are essentially equivalent to those for the true objective, we will now show that under slightly strengthened assumptions we can obtain significantly stronger guarantees for the convergence of the solutions: in the long run, we learn to emulate the true optimal solutions, provided that the problems have unique solutions in the sense we make precise now.
We say that the sequence of feasible regions $X(p_1), \dots, X(p_T)$ is stable for $c \in \mathcal{C}$ with some $\Delta > 0$ if for any $t \in [T]$, any minimizer $x^* \in \arg\min_{x \in X(p_t)} c^\top x$ and any $y \in X(p_t)$ we have
$y = x^* \quad \text{or} \quad c^\top y - c^\top x^* \;\ge\; \Delta,$
i.e. either the two solutions coincide or they differ by at least $\Delta$ with respect to $c$. In particular, optimizing over $X(p_t)$ leads to a unique optimal solution for all $t \in [T]$. While this condition – which is well known as the sharpness of a minimizer in convex optimization – sounds unnatural at first, it is, for example, trivially satisfied for the important case where $X(p_t)$ is a polytope with vertices in $\{0, 1\}^n$ and $c$ is a rational vector. In this case, write $c = a / b$ with $a \in \mathbb{Z}^n$ and $b \in \mathbb{Z}_{> 0}$, and observe that the minimum nonzero change in objective value between any two vertices of the 0/1-polytope with respect to $c$ is bounded from below by $1/b$, so that stability with $\Delta = 1/b$ holds in this case. The same argument works for more general polytopes via bounding the minimum nonzero change in objective function value via the encoding length.
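For a rational objective with a common denominator, the minimum nonzero objective gap between 0/1-points is easy to verify computationally; the following brute-force check is our own illustration:

```python
from fractions import Fraction
from itertools import product

# For a rational objective c = a/b with integer numerators a_i, any two
# 0/1-vectors either have equal objective value or differ by at least 1/b.
b = 7
c = [Fraction(3, b), Fraction(-5, b), Fraction(2, b)]

values = {sum(ci * xi for ci, xi in zip(c, x)) for x in product([0, 1], repeat=3)}
nonzero_gaps = [abs(v - w) for v in values for w in values if v != w]
min_gap = min(nonzero_gaps)  # smallest nonzero objective difference
```

Exact rational arithmetic via `fractions.Fraction` avoids any floating-point blur when comparing the objective values.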
We obtain the following simple corollary of Theorem 3.4.
Corollary 3.8.
Let $K \in \mathbb{R}_{\ge 0}$ with $\operatorname{diam}_\infty(X(p_t)) \le K$ for all $t \in [T]$, let $X(p_1), \dots, X(p_T)$ be stable for $c_{\text{true}}$ with some $\Delta > 0$, and let $N_T := |\{ t \in [T] : \bar{x}_t \ne x_t \}|$. Then
$\frac{N_T}{T} \;\le\; \frac{2K}{\Delta} \sqrt{\frac{\ln n}{T}}.$
Proof.
We start with the guarantee from the proof of Theorem 3.4:
(32)  $\frac{1}{T} \sum_{t=1}^{T} \left( c_{\text{true}}^\top \bar{x}_t - c_{\text{true}}^\top x_t \right) \;\le\; 2K \sqrt{\frac{\ln n}{T}}.$
Now let $N_T$ be as above, so that
(33)  $\frac{1}{T} \sum_{t=1}^{T} \left( c_{\text{true}}^\top \bar{x}_t - c_{\text{true}}^\top x_t \right) \;\ge\; \frac{N_T \Delta}{T}.$
Observe that this holds as $c_{\text{true}}^\top \bar{x}_t - c_{\text{true}}^\top x_t \ge \Delta$ whenever $\bar{x}_t \ne x_t$, since $x_t$ was optimal for $c_{\text{true}}$, together with stability. We thus obtain
(34)  $\frac{N_T \Delta}{T} \;\le\; 2K \sqrt{\frac{\ln n}{T}},$
which is equivalent to
(35)  $\frac{N_T}{T} \;\le\; \frac{2K}{\Delta} \sqrt{\frac{\ln n}{T}}.$
∎
From the above corollary, we obtain in particular that in the stable case we have $N_T / T \to 0$, i.e. the fraction of rounds in which $\bar{x}_t$ deviates from $x_t$ tends to $0$ in the long run. We hasten to stress, however, that the convergence implied by this bound can potentially be slow, as $1/\Delta$ can be exponential in the encoding length of $c_{\text{true}}$; this is to be expected given the convergence rates of our algorithm and of online-learning algorithms in general.
3.3 An Algorithm based on Online Gradient Descent
The algorithm based on MWU introduced in Section 3.1 has the limitation that it is only applicable for learning nonnegative objectives. In addition, it cannot make use of any prior knowledge about the structure of $\mathcal{C}$ other than it lying in the positive orthant. To lift these limitations, we will extend our approach using online gradient descent (OGD), which is an online-learning algorithm applicable to the following game over $T$ rounds: in each round $t$, the player chooses a solution $d_t$ from a convex, compact and nonempty feasible set $D \subseteq \mathbb{R}^n$. Then the adversary reveals to him a convex objective function $f_t : D \to \mathbb{R}$, and the player incurs a cost of $f_t(d_t)$. OGD proceeds by choosing an arbitrary $d_1 \in D$ in the first round and updates this choice after observing $f_t$ via
(36)  $d_{t+1} := \Pi_D \left( d_t - \eta_t \nabla f_t(d_t) \right),$
where $\Pi_D$ denotes the Euclidean projection onto the set $D$ and $\eta_t$ is the learning rate in round $t$. With the abbreviations $G := \max_{t \in [T]} \|\nabla f_t(d_t)\|_2$ and $\operatorname{diam}(D) := \operatorname{diam}_2(D)$, the regret of the player can then be bounded as follows.
Lemma 3.9 (Zinkevich (2003, Theorem 1)).
For $\eta_t = 1/\sqrt{t}$, $t \in [T]$, we have
(37)  $\sum_{t=1}^{T} f_t(d_t) - \min_{d \in D} \sum_{t=1}^{T} f_t(d) \;\le\; \frac{\operatorname{diam}(D)^2 \sqrt{T}}{2} + \left( \sqrt{T} - \frac{1}{2} \right) G^2.$
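A Python sketch of the resulting learning scheme follows (again our own illustration; the $\ell_2$-ball used for $\mathcal{C}$ is an assumption of this sketch): the round-$t$ loss is $f_t(c) = c^\top(x_t - \bar{x}_t)$ with gradient $x_t - \bar{x}_t$, and after each gradient step the guess is projected back onto $\mathcal{C}$.

```python
import math

def ogd_learn(observations, lin_oracle, n, radius=1.0):
    """OGD-based objective learning; the objective may have arbitrary sign.

    The guess is updated by a gradient step on f_t(c) = c^T (x_t - x_bar_t)
    with learning rate eta_t = 1/sqrt(t), then projected onto the l2-ball
    of the given radius, which here stands in for the set C.
    """
    def project(c):  # Euclidean projection onto the l2-ball
        norm = math.sqrt(sum(ci * ci for ci in c))
        return c if norm <= radius else [ci * radius / norm for ci in c]

    c = [0.0] * n
    guesses = []
    for t, (p_t, x_t) in enumerate(observations, start=1):
        guesses.append(c)
        x_bar = lin_oracle(c, p_t)            # our decision
        eta = 1.0 / math.sqrt(t)              # learning rate eta_t = 1/sqrt(t)
        c = project([ci - eta * (xt - xb)
                     for ci, xt, xb in zip(c, x_t, x_bar)])
    return guesses
```

Note that prior knowledge about $\mathcal{C}$ enters solely through the projection step, which is where this variant gains its flexibility over the MWU-based one.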
Concerning the choice of the learning rate, there are a couple of things to note. Firstly, the learning rate $\eta_t = 1/\sqrt{t}$ in round $t$ does not depend on the total number of rounds $T$ of the game. This means that the resulting version of OGD works without prior knowledge of $T$. It is even possible to improve slightly on the above result: by choosing the learning rate in round