1 Introduction
Inspired by the human ability to transfer skills among multiple related tasks so that learning one task helps learn the others, multitask learning (Caruana, 1997; Zhang and Yang, 2017) aims to identify common structured knowledge shared by multiple related learning tasks and to share it across all the tasks, with the hope of improving the performance of every task. In the past decades, many multitask learning models have been proposed, including regularized models (Ando and Zhang, 2005; Argyriou et al., 2006; Obozinski et al., 2006; Jacob et al., 2008; Zhang and Yeung, 2010; Kang et al., 2011; Lozano and Swirszcz, 2012; Han and Zhang, 2016), Bayesian models (Bakker and Heskes, 2003; Bonilla et al., 2007; Xue et al., 2007; Zhang et al., 2010; Hernández-Lobato and Hernández-Lobato, 2013; Hernández-Lobato et al., 2015), and deep learning models (Caruana, 1997; Misra et al., 2016; Liu et al., 2017; Long et al., 2017; Yang and Hospedales, 2017a, b), and these models have achieved great success in many areas such as natural language processing and computer vision.
All the tasks under investigation usually have different difficulty levels; that is, some tasks are easy to learn while others are more difficult. Most multitask learning models assume that tasks have the same difficulty level and hence minimize the sum of the training losses of all the tasks. Recently, some studies have considered the issue of varying difficulty levels among tasks, which arises in many applications, and proposed several models to handle it. As discussed in the next section, we classify those studies into five categories: the direct sum approach, which includes most multitask learning models that assume tasks have the same difficulty level; the weighted sum approach (Kendall et al., 2018; Chen et al., 2018; Liu et al., 2019), which learns task weights based on human-designed rules; the maximum approach (Mehta et al., 2012), which minimizes the maximum of the training losses of all the tasks; the curriculum learning approach (Pentina et al., 2015; Li et al., 2017; Murugesan and Carbonell, 2017), which learns easy tasks first and then hard tasks; and the multi-objective optimization approach (Sener and Koltun, 2018; Lin et al., 2019), which formulates multitask learning as a multi-objective optimization problem solved by multi-objective gradient descent algorithms.
As discussed in the next section, all the existing studies suffer from some limitations. For example, the weighted sum approach relies on human-designed rules to learn task weights, which may lead to suboptimal performance. The maximum approach has a non-smooth objective function, which makes the optimization difficult. The manually designed task selection criteria of the curriculum learning approach are not optimal. In the multi-objective optimization approach, it is unclear how to incorporate additional functions such as a regularizer into the multiple objective functions.
In this paper, to alleviate the limitations of existing studies and to learn tasks with varying difficulty levels, we propose a Balanced Multi-Task Learning (BMTL) framework which can be combined with most multitask learning models. Different from most studies (e.g., the weighted sum approach, the curriculum learning approach and the multi-objective optimization approach), which minimize a weighted sum of the training losses of all the tasks and differ only in how the task weights are learned, the proposed BMTL framework applies a transformation function to the training loss of each task and then minimizes the sum of the transformed training losses. Based on the intuitive idea that a task should receive more attention during the optimization if its training loss at the current parameter estimate is large, we analyze the necessary conditions on the transformation function and identify some possible families of transformation functions. Moreover, we analyze the generalization bound of the BMTL framework. Extensive experiments show the effectiveness of the proposed BMTL framework.
2 Preliminaries
In multitask learning, suppose that there are $m$ learning tasks. The $i$th task is associated with a training dataset denoted by $\mathcal{D}_i=\{(\mathbf{x}^i_j,y^i_j)\}_{j=1}^{n_i}$, where $\mathbf{x}^i_j$ denotes the $j$th data point in the $i$th task, $y^i_j$ is the corresponding label, and $n_i$ denotes the number of data points in the $i$th task. Each data point $\mathbf{x}^i_j$ can be represented by a vector, matrix or tensor, depending on the application under investigation. For classification tasks, each $y^i_j$ is from a discrete space, i.e., $y^i_j\in\{1,\ldots,C_i\}$ where $C_i$ denotes the number of classes, and otherwise $y^i_j$ is continuous. The loss function is denoted by $\ell(f_i(\mathbf{x};\Theta),y)$, where $\Theta$ includes the model parameters of all the tasks and $f_i(\cdot;\Theta)$ denotes the learning function of the $i$th task parameterized by some parameters in $\Theta$. For classification tasks, the loss function can be the cross-entropy loss, and regression tasks can adopt the square loss. The learning model of each task can be any model, such as a linear model or a deep neural network, with the difference lying in $\Theta$. For example, for a linear model where $f_i$ denotes a linear function in terms of $\mathbf{x}$, $\Theta$ can be represented as a matrix whose $i$th column contains the vector of linear coefficients of the corresponding task. For a deep multitask neural network with the first several layers shared by all the tasks, $\Theta$ consists of a common part, corresponding to the weights of the shared layers, and task-specific parts, corresponding to the weights of the non-shared layers. With the aforementioned notations, the training loss of the $i$th task can be computed as $L_i(\Theta)=\frac{1}{n_i}\sum_{j=1}^{n_i}\ell(f_i(\mathbf{x}^i_j;\Theta),y^i_j)$.
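To make the notation concrete, the training loss of one task under a linear model might be computed as follows (a minimal sketch with hypothetical names; the squared-error loss and the linear model are stand-ins for whatever loss and model a given application uses):

```python
import numpy as np

def task_training_loss(W, X, y, i):
    """Average squared-error training loss of task i under a linear model.

    W : (d, m) parameter matrix; column i holds the coefficients of task i
    X : (n_i, d) data points of task i
    y : (n_i,) labels of task i
    """
    preds = X @ W[:, i]
    return np.mean((preds - y) ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))   # 3 tasks, 5 features
X = rng.normal(size=(10, 5))  # 10 data points for one task
y = rng.normal(size=10)
loss = task_training_loss(W, X, y, i=0)
```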
Difficulty levels of the tasks are usually different, and hence the training losses to be minimized have different magnitudes across tasks. Several works handle the problem of varying difficulty levels among tasks; in the following, we give an overview of those works.
2.1 The Direct Sum Approach
The direct sum approach is the simplest and most widely used approach in multitask learning. It directly minimizes the sum of the training losses of all the tasks as well as other terms such as a regularizer on the parameters, and a typical objective function in this approach can be formulated as

$$\min_{\Theta}\ \sum_{i=1}^{m}L_i(\Theta)+g(\Theta),\qquad(1)$$

where $g(\Theta)$ denotes an additional function on $\Theta$ (e.g., a regularization function). In some cases, the first term in problem (1) can be replaced by $\frac{1}{m}\sum_{i=1}^{m}L_i(\Theta)$, and it is easy to show that the two formulations are equivalent after scaling $g(\Theta)$.
2.2 The Weighted Sum Approach
It is intuitive that a more difficult task should attract more attention so as to minimize its training loss, leading to the weighted sum approach, whose objective function is formulated as

$$\min_{\Theta}\ \sum_{i=1}^{m}w_iL_i(\Theta)+g(\Theta),\qquad(2)$$

where $w_i$ is a positive task weight. Compared with problem (1), the only difference lies in the use of $\{w_i\}$. When each $w_i$ equals 1, problem (2) reduces to problem (1).
In this approach, the main issue is how to set $\{w_i\}$. In the early stage, users were required to set them manually, but without additional knowledge users simply set them to an identical value, which is equivalent to the direct sum approach. Later, some works (Kendall et al., 2018; Liu et al., 2019) proposed to learn or set them based on data. For example, when the square loss is used as the loss function as in (Kendall et al., 2018), from the probabilistic perspective such a loss function implies a Gaussian likelihood $y\sim\mathcal{N}(f_i(\mathbf{x};\Theta),\sigma_i^2)$, where $\sigma_i$ denotes the standard deviation of the Gaussian likelihood for the $i$th task. Then, by viewing $g(\Theta)$ as the negative logarithm of a prior on $\Theta$, problem (2) can be viewed as a maximum a posteriori solution of such a probabilistic model, where $w_i$ equals $\frac{1}{2\sigma_i^2}$ and $\sigma_i$ can be learned from data. However, this method is only applicable to specific loss functions (i.e., the square loss), which limits its application scope. Chen et al. (2018) aim to learn task weights that balance the gradient norms of different tasks: they minimize the absolute difference between the norm of the gradient of each task's weighted training loss with respect to the common parameters and the average of such gradient norms over all tasks, scaled by a power of the relative loss ratio of that task. At step $t$, Liu et al. (2019) propose a Dynamic Weight Average (DWA) strategy to define $w_i$ as $w_i(t)=\frac{m\exp\{r_i(t-1)/T\}}{Z}$, where $r_i(t-1)=\frac{L_i(\Theta_{t-1})}{L_i(\Theta_{t-2})}$ with $\Theta_t$ denoting the estimation of $\Theta$ at step $t$, $Z$ is a normalization factor to ensure $\sum_{i=1}^{m}w_i(t)=m$, and $T$ is a temperature parameter to control the softness of the task weights. Here $r_i(t-1)$ reflects the relative descending rate of the training loss. However, manually setting $T$ seems suboptimal.
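As an illustration, the DWA weights described above can be computed from the losses of the two previous steps (a sketch following the description here; the normalization makes the weights sum to the number of tasks, and details may differ slightly from the original implementation):

```python
import numpy as np

def dwa_weights(losses_prev, losses_prev2, T=2.0):
    """Dynamic Weight Average: softmax over relative descending rates.

    r_i = L_i(t-1) / L_i(t-2); a task whose loss decreases slowly gets a
    larger rate and hence a larger weight. Weights are scaled to sum to m.
    """
    r = np.asarray(losses_prev, dtype=float) / np.asarray(losses_prev2, dtype=float)
    e = np.exp(r / T)
    m = len(r)
    return m * e / e.sum()

# task 0 barely improved (ratio 0.9), task 1 improved a lot (ratio 0.3)
w = dwa_weights([0.9, 0.3], [1.0, 1.0])
```

The slowly improving task receives the larger weight, matching the intended behavior of the strategy.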
2.3 The Maximum Approach
Mehta et al. (2012) consider the worst case by minimizing the maximum of all the training losses, formulating the objective function as

$$\min_{\Theta}\ \max_{1\le i\le m}L_i(\Theta)+g(\Theta).\qquad(3)$$

To see the connection between this problem and problem (2) in the weighted sum approach, we can reformulate problem (3) as

$$\min_{\Theta}\ \max_{w_i\ge0,\,\sum_{i=1}^{m}w_i=1}\ \sum_{i=1}^{m}w_iL_i(\Theta)+g(\Theta).$$

According to this reformulation, the maximum approach shares a similar formulation with the weighted sum approach, but in the maximum approach $\{w_i\}$ can be determined automatically. However, the objective function of the maximum approach is non-smooth, which makes the optimization more difficult.
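The reformulation rests on the fact that a weighted sum of losses, maximized over the probability simplex, equals the largest loss: a linear function over the simplex attains its maximum at a vertex, i.e., at the one-hot weight vector selecting the hardest task. A quick numerical check (a sketch with made-up loss values):

```python
import numpy as np

losses = np.array([0.4, 1.1, 0.9])

# vertices of the probability simplex are the one-hot weight vectors;
# evaluating the weighted sum at each vertex just picks out one loss
vertices = np.eye(len(losses))
best_over_simplex = (vertices @ losses).max()

# any interior weight vector gives a smaller or equal weighted sum
uniform = np.full(len(losses), 1.0 / len(losses))
```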
2.4 The Curriculum Learning Approach
Curriculum learning (Bengio et al., 2009) and its variant self-paced learning (Kumar et al., 2010) aim to solve non-convex objective functions by first learning from easy data points and then from harder ones. This idea has been adopted in multitask learning (Pentina et al., 2015; Li et al., 2017; Murugesan and Carbonell, 2017) by first learning from easy tasks and then from harder ones.
In the spirit of curriculum learning, Pentina et al. (2015) take a greedy approach to learn an ordering of tasks in which two successive tasks share similar model parameters. However, the analysis in (Pentina et al., 2015) is only applicable to linear learners. Built on self-paced learning, Murugesan and Carbonell (2017) propose an objective function similar to problem (2) with $w_i$ defined as

$$w_i=\exp\{-\lambda L_i(\hat{\Theta})\},\qquad(4)$$

where $\hat{\Theta}$ denotes the estimation of $\Theta$ at the previous step and $\lambda$ is a positive hyperparameter. Based on this equation, we can see that a task with a lower training loss at the previous step will have a larger weight at the next step, which follows the philosophy of self-paced learning. Compared with (Murugesan and Carbonell, 2017), which only considers task-level difficulty, Li et al. (2017) apply self-paced learning at both the task and instance levels, but their method is only applicable to linear models.

2.5 The Multi-Objective Optimization Approach
Sener and Koltun (2018) and Lin et al. (2019) study multitask learning from the perspective of multi-objective optimization, where each objective corresponds to minimizing the training loss of one task. Specifically, Sener and Koltun (2018) formulate the multi-objective optimization problem as

$$\min_{\theta^{sh},\theta^{1},\ldots,\theta^{m}}\ \left(L_1(\theta^{sh},\theta^{1}),\ldots,L_m(\theta^{sh},\theta^{m})\right),\qquad(5)$$

where $\theta^{sh}$ consists of the common parameters shared by all the tasks and $\theta^{i}$ denotes the task-specific parameters of the $i$th task. One example of such a model is a multitask neural network where $\theta^{sh}$ corresponds to the parameters of the first several layers shared by all the tasks and $\theta^{i}$ includes the parameters of the later layers for the $i$th task. In problem (5) there are $m$ objectives to be minimized, which is different from the aforementioned approaches that have only one objective. The Multiple-Gradient Descent Algorithm (MGDA) (Désidéri, 2012) is used to solve problem (5) with respect to $\theta^{sh}$. In each step of MGDA, we need to solve the following quadratic programming problem:

$$\min_{\boldsymbol{\lambda}}\ \Big\|\sum_{i=1}^{m}\lambda_i\nabla_{\theta^{sh}}L_i\Big\|_2^2\quad\mathrm{s.t.}\ \sum_{i=1}^{m}\lambda_i=1,\ \lambda_i\ge0\ \forall i,\qquad(6)$$

where $\boldsymbol{\lambda}=(\lambda_1,\ldots,\lambda_m)$, $\nabla_{\theta^{sh}}L_i$ denotes the vectorized gradient of $L_i$ with respect to $\theta^{sh}$, and $\|\cdot\|_2$ denotes the $\ell_2$ norm of a vector. After solving problem (6), we obtain the optimal $\boldsymbol{\lambda}^*$ and the direction $\mathbf{d}=\sum_{i=1}^{m}\lambda_i^*\nabla_{\theta^{sh}}L_i$. If $\mathbf{d}$ equals a zero vector, there is no common descent direction for all the tasks and hence MGDA terminates; otherwise, $-\mathbf{d}$ is a common descent direction that reduces the training losses of all the tasks. In this sense, $\boldsymbol{\lambda}$ acts similarly to $\{w_i\}$ in the weighted sum approach. However, in this method the additional function $g(\Theta)$ seems difficult to incorporate into problem (5). Built on (Sener and Koltun, 2018)
and decomposition-based multi-objective evolutionary computing, Lin et al. (2019) decompose problem (5) into several subproblems with preference vectors in the parameter space and then solve all the subproblems. However, preference vectors designed by users seem suboptimal, and the multiple solutions produced make it difficult to choose which one to use for prediction in the testing phase.

3 Balanced Multi-Task Learning
In this section, we first analyze the limitations of existing works in dealing with different levels of task difficulty and then present the proposed BMTL framework.
3.1 Analysis on Existing Studies
We first take a look at the learning procedure of the direct sum approach, which is fundamental to the other approaches. Suppose the current estimation of $\Theta$ is denoted by $\hat{\Theta}$ and we wish to update it as $\Theta=\hat{\Theta}+\Delta$. Since $\Delta$ is usually small, based on the first-order Taylor expansion we can approximate the sum of the training losses of all the tasks as

$$\sum_{i=1}^{m}L_i(\hat{\Theta}+\Delta)\approx\sum_{i=1}^{m}L_i(\hat{\Theta})+\sum_{i=1}^{m}\langle\nabla L_i(\hat{\Theta}),\Delta\rangle,$$

where $\langle\cdot,\cdot\rangle$ denotes the inner product between two vectors, matrices or tensors of equal size, and $\nabla L_i(\hat{\Theta})$ denotes the gradient of $L_i(\Theta)$ with respect to $\Theta$ at $\hat{\Theta}$. As $\Theta$ consists of the model parameters of all the tasks, note that some entries of $\nabla L_i(\hat{\Theta})$ will be zero and hence it is sparse. Then, based on problem (1), the objective function for learning $\Delta$ can be formulated as

$$\min_{\Delta}\ \sum_{i=1}^{m}\langle\nabla L_i(\hat{\Theta}),\Delta\rangle+g(\hat{\Theta}+\Delta).\qquad(7)$$

In problem (7), we can see that only the gradients $\{\nabla L_i(\hat{\Theta})\}$ are involved in the learning of $\Delta$. Intuitively, if a task has a large training loss at the current step, it should attract more attention at the next step to minimize its training loss. So, mathematically, not only the gradient (i.e., $\nabla L_i(\hat{\Theta})$) but also the training loss (i.e., $L_i(\hat{\Theta})$) should be used to learn $\Delta$. However, the direct sum approach cannot satisfy this requirement, as revealed in problem (7). In the next section, we present a solution, the proposed BMTL framework, which satisfies this requirement.
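The first-order Taylor approximation used above can be sanity-checked numerically on a toy quadratic loss (a sketch; the update must be small for the approximation to be accurate):

```python
import numpy as np

def L(theta):
    """Toy training loss: L(theta) = 0.5 * ||theta||^2."""
    return 0.5 * float(theta @ theta)

def grad_L(theta):
    return theta

theta_hat = np.array([1.0, -2.0])
delta = np.array([1e-3, 2e-3])

exact = L(theta_hat + delta)
# first-order Taylor expansion around theta_hat
approx = L(theta_hat) + float(grad_L(theta_hat) @ delta)
```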
The weighted sum approach with fixed task weights has the same limitation as the direct sum approach. Hence the weighted sum approach and other approaches, which take similar formulations with minor differences, propose to use dynamic task weights that depend on the model parameters learned in previous step(s). This idea can handle tasks with different difficulty levels to some extent, but it brings other limitations. For example, the weighted sum approach and the curriculum learning approach usually rely on manually designed rules to update task weights, the maximum approach has a non-smooth objective function, and it is unclear how to handle additional functions in the multi-objective optimization approach, though the latter has a solid mathematical foundation.
3.2 The BMTL Framework
Based on the analysis in the previous section, we hope to use the training losses at the current step to learn the update $\Delta$. To achieve this, we propose the BMTL framework as

$$\min_{\Theta}\ \sum_{i=1}^{m}h(L_i(\Theta))+g(\Theta),\qquad(8)$$

where $h(\cdot)$ is a function mapping a nonnegative scalar to another nonnegative scalar. $h(\cdot)$ can be viewed as a transformation function on the training loss, and obviously it should be monotonically increasing, since minimizing $h(L_i(\Theta))$ should make $L_i(\Theta)$ small. For the gradient with respect to $\Theta$, we can compute it based on the chain rule as

$$\nabla_{\Theta}\sum_{i=1}^{m}h(L_i(\Theta))=\sum_{i=1}^{m}h'(L_i(\Theta))\nabla L_i(\Theta),$$

where $h'(\cdot)$ denotes the derivative of $h(\cdot)$ with respect to its input argument. Similar to problem (7), the objective function for $\Delta$ is formulated as

$$\min_{\Delta}\ \sum_{i=1}^{m}h'(L_i(\hat{\Theta}))\langle\nabla L_i(\hat{\Theta}),\Delta\rangle+g(\hat{\Theta}+\Delta).\qquad(9)$$

According to problem (9), $h'(L_i(\hat{\Theta}))$ can be viewed as a weight for the $i$th task. Here $h'(\cdot)$ is required to be monotonically increasing, as a larger loss should receive more attention, which corresponds to a larger weight $h'(L_i(\hat{\Theta}))$. In summary, both $h(\cdot)$ and $h'(\cdot)$ are required to be monotonically increasing, and they are nonnegative when the input argument is nonnegative. In the following theorem, we prove properties of $h(\cdot)$ based on those requirements.¹

¹All the proofs are put in the appendix.
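To make the role of $h'$ concrete, the following sketch contrasts the task weights induced by the identity transformation (the direct sum approach, where every task gets weight 1) with those induced by an exponential transformation of the form $h(x)=\exp\{x/\rho\}$, one choice satisfying the requirements above:

```python
import numpy as np

def bmtl_task_weights(losses, rho=1.0):
    """Weights h'(L_i) induced by h(x) = exp(x/rho), so h'(x) = exp(x/rho)/rho.

    Under the identity transformation h(x) = x, h'(L_i) = 1 for every task,
    so all tasks would receive the same attention regardless of their losses.
    """
    losses = np.asarray(losses, dtype=float)
    return np.exp(losses / rho) / rho

w = bmtl_task_weights([0.5, 2.0], rho=1.0)  # the harder task gets the larger weight
```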
Theorem 1
If $h(\cdot)$ satisfies the aforementioned requirements, then $h(\cdot)$ is strongly convex and monotonically increasing on $[0,+\infty)$, and it satisfies $h(0)\ge0$ and $h'(0)\ge0$.
According to Theorem 1, we can easily check whether a function can be used as $h(\cdot)$ in the BMTL framework. It is easy to show that the identity function, which corresponds to the direct sum approach, does not satisfy Theorem 1. Moreover, based on Theorem 1, we can see that compared with the direct sum approach, the introduction of $h(\cdot)$ into problem (8) keeps nice computational properties (e.g., convexity), and we have the following results.
Theorem 2
If the loss function is convex with respect to $\Theta$, then $h(L_i(\Theta))$ is convex with respect to $\Theta$. If further $g(\Theta)$ is convex with respect to $\Theta$, problem (8) is a convex optimization problem.
With Theorem 1, the question is how to find an example of $h(\cdot)$ that satisfies it. It is not difficult to check that $h(x)=\exp\{x/\rho\}$ satisfies Theorem 1, where $\rho$ is a positive hyperparameter. In this paper we use this example to illustrate the BMTL framework; other possible examples, such as polynomial functions with nonnegative coefficients that also satisfy Theorem 1, will be studied in our future work.
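With this choice of $h(\cdot)$, implementing BMTL on top of an existing model amounts to transforming the per-task losses before summing them. A framework-agnostic NumPy sketch of the aggregation step (in a deep learning package the same one-line substitution applies to the loss tensors):

```python
import numpy as np

def aggregate_direct_sum(task_losses):
    """Direct sum approach: total loss is the plain sum of per-task losses."""
    return float(np.sum(task_losses))

def aggregate_bmtl(task_losses, rho=1.0):
    """BMTL: the only change is transforming each loss by h(x) = exp(x/rho)."""
    return float(np.sum(np.exp(np.asarray(task_losses, dtype=float) / rho)))

losses = [0.2, 1.5, 0.7]
total = aggregate_bmtl(losses)  # this scalar is what gets minimized
```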
The BMTL framework is applicable to any multitask learning model, no matter whether it is a shallow or deep model and no matter what loss function is used, since $h(\cdot)$ is independent of the model architecture and the loss function. This characteristic makes the BMTL framework easy to implement: given the implementation of a multitask learning model, we only need to add one additional line of code in, for example, the TensorFlow package, to apply $h(\cdot)$ to the training losses of the different tasks. Hence the BMTL framework can be integrated with any multitask learning model in a plug-and-play manner.

3.3 Relation to Existing Studies
When $h(x)=\exp\{x/\rho\}$, problem (8) is lower-bounded by problem (1) in the direct sum approach after scaling, plus a constant. To see this, based on the famous inequality $e^x\ge x+1$, we have

$$\sum_{i=1}^{m}\exp\{L_i(\Theta)/\rho\}\ge\frac{1}{\rho}\sum_{i=1}^{m}L_i(\Theta)+m.$$

When $h(x)=\exp\{x/\rho\}$, problem (8) in the BMTL framework is also related to problem (3) in the maximum approach. Based on the well-known inequality $\max_i a_i\le\log\sum_{i=1}^{m}e^{a_i}\le\max_i a_i+\log m$ for a set of variables $\{a_i\}$, we can obtain lower and upper bounds for the soft maximum of the training losses as

$$\max_i L_i(\Theta)\le\rho\log\sum_{i=1}^{m}\exp\{L_i(\Theta)/\rho\}\le\max_i L_i(\Theta)+\rho\log m.$$

So $\rho\log\sum_{i=1}^{m}\exp\{L_i(\Theta)/\rho\}$ is closely related to the maximum function; it is usually called the soft maximum function and can replace the maximum function in some cases to make the objective function smooth. When the maximum function in problem (3) is replaced with the soft maximum function, the resulting problem is similar to problem (8) in the BMTL framework up to an additional logarithm. Though the soft maximum approach takes a similar formulation to problem (8), it does not satisfy Theorem 1, and its performance is inferior to that of the BMTL framework, as shown in the next section.
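The soft maximum bounds above are easy to verify numerically (a sketch; `soft_max` is a hypothetical helper computing the temperature-scaled log-sum-exp of the losses):

```python
import numpy as np

def soft_max(losses, rho):
    """Soft maximum rho * log(sum_i exp(L_i / rho)), a smooth surrogate for max."""
    losses = np.asarray(losses, dtype=float)
    return rho * np.log(np.sum(np.exp(losses / rho)))

losses = np.array([0.3, 1.2, 0.8])
rho, m = 0.5, len(losses)

s = soft_max(losses, rho)
lower = losses.max()
upper = losses.max() + rho * np.log(m)
```

As $\rho$ decreases, the soft maximum approaches the true maximum, trading smoothness for tightness.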
3.4 Generalization Bound
In this section, we analyze the generalization bound for the BMTL framework.
The expected loss of the $i$th task is defined as $R_i(\Theta)=\mathbb{E}_{(\mathbf{x},y)\sim\mu_i}[\ell(f_i(\mathbf{x};\Theta),y)]$, where $\mu_i$ denotes the underlying distribution that generates the data of the $i$th task and $\mathbb{E}$ denotes the expectation. The expected loss of the BMTL framework is defined as $R(\Theta)=\sum_{i=1}^{m}h(R_i(\Theta))$. For simplicity, different tasks are assumed to have the same number of data points, i.e., $n_i$ equals $n$ for $i=1,\ldots,m$; it is easy to extend our analysis to general settings. The empirical loss of the $i$th task is defined as $\hat{R}_i(\Theta)=\frac{1}{n}\sum_{j=1}^{n}\ell(f_i(\mathbf{x}^i_j;\Theta),y^i_j)$, and the empirical loss over all the tasks is defined as $\hat{R}(\Theta)=\sum_{i=1}^{m}h(\hat{R}_i(\Theta))$. We assume the loss function takes values in $[0,1]$ and is Lipschitz with respect to its first input argument.
Here we rewrite problem (8) into an equivalent formulation as

$$\min_{\Theta\in\mathcal{C}}\ \sum_{i=1}^{m}h(\hat{R}_i(\Theta)),\qquad(10)$$

where the constraint set on $\Theta$ is defined as $\mathcal{C}=\{\Theta:g(\Theta)\le\alpha\}$ for some constant $\alpha$ determined by the regularizer. For problem (10), we can derive a generalization bound in the following theorem.
Theorem 3
Remark 1
Theorem 3 provides a general bound for any learning function, which upper-bounds the expected loss by the empirical loss, the complexity of the learning model reflected in the second term of the right-hand side, and the confidence shown in the last term. Based on , the confidence term is . To see the complexity of the confidence term in terms of , according to Lemma 1 in the supplementary material, we have , implying that the confidence term is .
We also consider the case where $h(\cdot)$ is the identity function, i.e., $h(x)=x$, which is studied in the direct sum approach. For this case, we have the following result.
Theorem 4
When , for , with probability at least , we have
In Theorem 4, it is interesting that the expected loss is upper-bounded by the empirical loss with a different transformation function.
Based on Theorem 3, we can analyze the expected loss for specific models. Due to the page limit, a generalization bound for linear models can be found in the supplementary material.
4 Experiments
In this section, we conduct empirical studies to test the proposed BMTL framework.
4.1 Experimental Settings
4.1.1 Datasets
We conduct experiments on four benchmark datasets for classification and regression tasks.
Office-31 (Saenko et al., 2010): This dataset consists of 4,110 images in 31 categories shared by three distinct tasks: Amazon (A), which contains images downloaded from amazon.com, and Webcam (W) and DSLR (D), which contain images taken by a Web camera and a digital SLR camera, respectively, under different environmental settings.
Office-Home (Venkateswara et al., 2017): This dataset consists of 15,588 images from 4 different tasks: artistic images (A), clip art (C), product images (P), and real-world images (R). For each task, this dataset contains images of 65 object categories collected in office and home settings.
ImageCLEF (http://imageclef.org/2014/adaptation): This dataset contains about 2,400 images from 12 common categories shared by four tasks: Caltech-256 (C), ImageNet ILSVRC 2012 (I), Pascal VOC 2012 (P), and Bing (B). There are 50 images in each category and 600 images in each task.
SARCOS (http://www.gaussianprocess.org/gpml/data/): This dataset is a multi-output regression problem for studying the inverse dynamics of a SARCOS anthropomorphic robot arm with 7 degrees of freedom based on 21 features. Following (Zhang and Yeung, 2010), we treat each output as a task and randomly sample 2,000 data points to form a multitask dataset.
4.1.2 Baseline Models
Since most strategies to balance task difficulty are independent of the multitask learning models, which means that these strategies are applicable to almost all multitask learning models, the baselines consist of two parts: multitask learning methods and balancing strategies.
Multitask Learning Methods: The deep multitask learning methods we use include (1) Deep Multi-Task Learning (DMTL) (Caruana, 1997; Zhang et al., 2014), which shares the first hidden layer among all the tasks; (2) Deep Multi-Task Representation Learning (DMTRL) (Yang and Hospedales, 2017a), which has three variants, DMTRL_Tucker, DMTRL_TT, and DMTRL_LAF; (3) Trace Norm Regularised Deep Multi-Task Learning (TNRMTL) (Yang and Hospedales, 2017b), with three variants TNRMTL_Tucker, TNRMTL_TT, and TNRMTL_LAF; and (4) Multilinear Relationship Networks (MRN) (Long et al., 2017).
Balancing Strategies: As reviewed in Section 2, we choose one strategy from each approach to compare. The strategies include (1) the Direct Sum (DS) approach formulated in problem (1); (2) the Dynamic Weight Average (DWA) method (Liu et al., 2019) from the weighted sum approach; (3) the Maximum (Max) approach formulated in problem (3); (4) the Soft Maximum (sMAX) method discussed in Section 3.3, which minimizes the soft maximum of the training losses; (5) the Curriculum Learning (CL) method (Murugesan and Carbonell, 2017), which uses self-paced task selection in an easy-to-hard ordering as illustrated in Eq. (4); (6) the Multiple-Gradient Descent Algorithm (MGDA) method from the multi-objective optimization approach as formulated in problem (5); and (7) the proposed Balanced Multi-Task Learning (BMTL) framework. Note that among the above seven strategies, the MGDA method is only applicable to the DMTL method, while the other strategies are applicable to all the multitask learning methods in comparison.
For the image datasets, we use the VGG-19 network (Simonyan and Zisserman, 2014) pretrained on the ImageNet dataset (Russakovsky et al., 2015) as the feature extractor based on its fc7 layer. After that, all the multitask learning methods adopt a two-layer fully-connected network (4096 → 600 → number of classes) with the ReLU activation in the first layer. The first layer is shared by all tasks to learn a common representation, and the second layer produces task-specific outputs. The positive hyperparameter $\rho$ in the proposed BMTL framework is set to 50.

4.2 Experimental Results
To analyze the effect of the training proportion on the performance, we evaluate the classification accuracy of all the methods with training proportions of 50%, 60%, and 70%, and plot the average test accuracy of all the balancing strategies applied to all the multitask learning methods in Figures 1–3. Each experimental setting is repeated five times, and for clarity of presentation, Figures 1–3 only contain the average accuracies.
According to Figures 1–3, we can see that as the training proportion increases, the performance of all the balancing strategies on all the multitask learning models almost always improves, with some exceptions due to the sensitivity of multitask learning models to the initial values of the model parameters. Moreover, we observe that compared with all the other balancing strategies, the proposed BMTL framework improves every multitask baseline method at every training proportion, which demonstrates the effectiveness and robustness of the BMTL framework.
From the results shown in Figures 1(b), 3(b) and 3(h), we can see that the MGDA method outperforms the other balancing strategies based on the DMTRL_Tucker and MRN methods. One reason is that the MGDA method is specific to the DMTL method and inapplicable to the other multitask learning methods, and hence the comparison here is not entirely fair. Even in those settings, the proposed BMTL framework still significantly boosts the performance of the DMTRL_Tucker and MRN methods.
For the SARCOS dataset, we use the mean squared error as the performance measure, with results shown in Figure 5. The proposed BMTL framework outperforms the other balancing strategies, especially when based on the TNRMTL methods, which demonstrates the effectiveness of the BMTL framework on this dataset.
4.2.1 Analysis on Training Losses
For the proposed BMTL framework based on the DMTRL_TT method, we plot the training losses of different tasks on the Office-Home dataset in Figure 4. From the loss curves, we can observe that tasks with larger training losses receive more attention, and their losses decrease faster than those of the other tasks during the training process.
5 Conclusion
In this paper, we propose the Balanced Multi-Task Learning framework to handle tasks with unequal difficulty levels. The main idea is to minimize the sum of the transformed training losses of all the tasks, where the transformation function makes tasks with larger training losses receive larger weights during the optimization procedure. Some properties of the transformation function are analyzed, and empirical studies conducted on real-world datasets demonstrate the effectiveness of the BMTL framework. In our future work, we will investigate other examples of the transformation function such as polynomial functions.
References
 Tensorflow: a system for largescale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, pp. 265–283. Cited by: §4.1.2.
 A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6, pp. 1817–1853. Cited by: §1.
 Multitask feature learning. In Advances in Neural Information Processing Systems 19, pp. 41–48. Cited by: §1.
 Task clustering and gating for bayesian multitask learning. Journal of Machine Learning Research 4, pp. 83–99. Cited by: §1.
 Curriculum learning. In Proceedings of the 26th International Conference on Machine Learning, pp. 41–48. Cited by: §2.4.
 Multitask Gaussian process prediction. In Advances in Neural Information Processing Systems 20, Vancouver, British Columbia, Canada, pp. 153–160. Cited by: §1.
 Convex optimization. Cambridge University Press. Cited by: Proof for Theorem 2.
 Multitask learning. Machine Learning 28 (1), pp. 41–75. Cited by: §1, §4.1.2.
 GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. In Proceedings of the 35th International Conference on Machine Learning, pp. 793–802. Cited by: §1, §2.2.
 Multiplegradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique 350 (5), pp. 313–318. Cited by: §2.5.

Multistage multitask learning with reduced rank. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1638–1644. Cited by: §1.
 A probabilistic model for dirty multitask feature selection. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1073–1082. Cited by: §1.
 Learning feature selection dependencies in multitask learning. In Advances in Neural Information Processing Systems 26, pp. 746–754. Cited by: §1.
 Clustered multitask learning: a convex formulation. In Advances in Neural Information Processing Systems 21, pp. 745–752. Cited by: §1.
 Learning with whom to share in multitask feature learning. In Proceedings of the 28th International Conference on Machine Learning, pp. 521–528. Cited by: §1.

 Multitask learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 7482–7491. Cited by: §1, §2.2.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.2.
 Selfpaced learning for latent variable models. In Advances in Neural Information Processing Systems 23, pp. 1189–1197. Cited by: §2.4.
 Selfpaced multitask learning. In Proceedings of the ThirtyFirst AAAI Conference on Artificial Intelligence, pp. 2175–2181. Cited by: §1, §2.4, §2.4.
 Pareto multitask learning. In Advances in Neural Information Processing Systems 32, pp. 12037–12047. Cited by: §1, §2.5.
 Adversarial multitask learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1–10. Cited by: §1.
 Endtoend multitask learning with attention. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1871–1880. Cited by: §1, §2.2, §4.1.2.
 Learning multiple tasks with multilinear relationship networks. In Advances in Neural Information Processing Systems 30, pp. 1593–1602. Cited by: §1, §4.1.2.
 Multilevel lasso for sparse multitask regression. In Proceedings of the 29th International Conference on Machine Learning, Cited by: §1.
 On the method of bounded differences. Surveys in combinatorics 141 (1), pp. 148–188. Cited by: Proof for Theorem 3.
 Minimax multitask learning and a generalized losscompositional paradigm for MTL. In Advances in Neural Information Processing Systems 25, pp. 2159–2167. Cited by: §1, §2.3.
 Crossstitch networks for multitask learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3994–4003. Cited by: §1.
 Selfpaced multitask learning with shared knowledge. In Proceedings of the TwentySixth International Joint Conference on Artificial Intelligence, pp. 2522–2528. Cited by: §1, §2.4, §2.4, §4.1.2.
 Multitask feature selection. Technical report Department of Statistics, University of California, Berkeley. Cited by: §1.
 Curriculum learning of multiple tasks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 5492–5500. Cited by: §1, §2.4, §2.4.
 Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §4.1.2.
 Adapting visual category models to new domains. In European conference on computer vision, pp. 213–226. Cited by: §4.1.1.
 Multitask learning as multiobjective optimization. In Advances in Neural Information Processing Systems 31, pp. 525–536. Cited by: §1, §2.5.
 Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.2.
 Deep hashing network for unsupervised domain adaptation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.1.1.
 Multitask learning for classification with Dirichlet process priors. Journal of Machine Learning Research 8, pp. 35–63. Cited by: §1.
 Deep multitask representation learning: A tensor factorisation approach. In Proceedings of the 6th International Conference on Learning Representations, Cited by: §1, §4.1.2.
 Trace norm regularised deep multitask learning. In Proceedings of the 6th International Conference on Learning Representations, Workshop Track, Cited by: §1, §4.1.2.
 A convex formulation for learning task relationships in multitask learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pp. 733–742. Cited by: §1, §4.1.1.
 A survey on multitask learning. CoRR abs/1707.08114. Cited by: §1.
 Probabilistic multitask feature selection. In Advances in Neural Information Processing Systems 23, pp. 2559–2567. Cited by: §1.
 Facial landmark detection by deep multitask learning. In Proceedings of the 13th European Conference on Computer Vision, pp. 94–108. Cited by: §4.1.2.
Appendix
Lemma 1 and Its Proof
Lemma 1
For , we have
Proof. Based on the Taylor expansion of the exponential function, we have
which implies
Based on the Taylor expansion of the exponential function again, can be written as
Define and . It is easy to show that and . For , we have
Since , each term in is no smaller than and we reach the conclusion.
Proof for Theorem 1
Proof. According to the requirements, $h'(\cdot)$ is monotonically increasing on $[0,+\infty)$, implying that the second-order derivative of $h(\cdot)$ is positive, which is equivalent to the strong convexity of $h(\cdot)$ on $[0,+\infty)$. $h(\cdot)$ is already required to be monotonically increasing. As both $h(\cdot)$ and $h'(\cdot)$ are required to be nonnegative, we only require that $h(0)\ge0$ and $h'(0)\ge0$, due to their monotonically increasing property.
Proof for Theorem 2
Proof. According to the scalar composition rule in Eq. (3.10) of (Boyd and Vandenberghe, 2004), when $L_i(\Theta)$ is convex with respect to $\Theta$ and $h(\cdot)$ is convex and monotonically increasing, $h(L_i(\Theta))$ is convex with respect to $\Theta$, and so is $\sum_{i=1}^{m}h(L_i(\Theta))$, which proves the first part. If further $g(\Theta)$ is convex with respect to $\Theta$, both terms in the objective function of problem (8) are convex with respect to $\Theta$, making the whole problem convex.
Proof for Theorem 3
Proof. Since is a convex function, we can get
When each pair of the training data
changes, the random variable
can change by no more than due to the boundedness of the loss function . Then by McDiarmid's inequality (McDiarmid, 1989), we can get , where denotes the probability, and this inequality implies that with probability at least ,
If we have another training set with the same distribution as , then we can bound as
Multiplying the term by Rademacher variables, each of which is a uniformly distributed $\{\pm1\}$-valued random variable, will not change the expectation since . Furthermore, negating a Rademacher variable does not change its distribution. So we have
Note that is Lipschitz at where . Due to the monotonicity of loss functions such as the cross-entropy loss and the hinge loss with respect to the first input argument, is also a Lipschitz function. Then, based on properties of the Rademacher complexity, we can get
Then by combining the above inequalities, we can reach the conclusion.
Proof for Theorem 4
Generalization Bound for Linear Models
Based on Theorem 3, we can analyze specific learning models. Here we consider a linear model where $\Theta$ is a matrix with $m$ columns, each of which defines the learning function of one task as $f_i(\mathbf{x})=\mathbf{w}_i^{\top}\mathbf{x}$. Here $g(\Theta)$ is defined as $g(\Theta)=\|\Theta\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm. For problem (10) with such a linear model, we have the following result.
Theorem 5
When , with probability at least where , we have
Proof. According to Theorem 3, we only need to upper-bound . By defining , we have
where the first inequality is due to the Cauchy-Schwarz inequality, the second inequality holds because of the constraint on the parameters, the third inequality is due to Jensen's inequality applied to the square root function, and the fourth inequality holds since the norm of each data point is upper-bounded by 1 and the vector involved is the average of the data points in the $i$th task.