We call this "The Detour Advantage": under specific conditions, models trained on one objective and then transferred to another outperform models trained directly on the target. Our hypothesis is that in non-convex optimization landscapes, training temporarily on a related objective can guide the model past local minima that would trap direct gradient descent. To test this, we propose a greedy algorithm that explores objective sequences and demonstrate the phenomenon across function fitting, image reconstruction, sequence prediction, and attribute classification.
Contributions: (1) We show that indirect training paths can outperform direct optimization across multiple domains; (2) We characterize when detours help—underparametrized models, non-convex landscapes, structurally similar tasks—and provide preliminary analysis of why, including weight-space visualizations showing cross-training navigating around local minima.
2. Detour Discovery Algorithm
To determine when a detour helps, we formalize a procedure that explores sequences of training objectives and greedily selects the best-performing candidate after each round of training. We call this algorithm Cross-Training with Model Selection. We also define a baseline that trains a single model on each objective independently.
Each objective $f_i$ has an associated training set $D_i^{\mathrm{train}}$. For cross-training, we additionally use a validation set $D_i^{\mathrm{val}}$ for model selection.
2.1. Cross-Training with Model Selection
In the cross-training approach, each objective can adopt models trained for other objectives at each round. This is a greedy search: at each round, each objective selects whichever model currently performs best on its validation set. A model originally trained on objective $f_j$ may, after that training, perform better on objective $f_i$'s validation set than the model trained on $f_i$ from the beginning. This creates a dynamic transfer-learning setting in which models migrate across objectives over time.
- For each objective $f_i$, initialize a random model $\theta_i^{(0)}$.
- For rounds $r = 1$ to $R$:
  - For each current model $\theta_i^{(r-1)}$ and each objective $j$:
    - Copy $\theta_i^{(r-1)}$ to obtain a candidate $\theta_{ij}^{(r)}$.
    - Train $\theta_{ij}^{(r)}$ on the training set $D_j^{\mathrm{train}}$.
  - For each objective $j$:
    - Evaluate all candidates $\theta_{ij}^{(r)}$ (over all $i$, including $i = j$) on the validation set $D_j^{\mathrm{val}}$.
    - Select the candidate with the lowest validation loss as $\theta_j^{(r)}$, the new best model for objective $j$ in round $r$.
- Return the best model $\theta_j^{(R)}$ for each objective after $R$ rounds.
Note: this greedy algorithm is not guaranteed to outperform direct training, since a model selected for strong early-stage validation performance may still lead to poor results after further training. It trains $O(n^2)$ candidates per round and approximates an exhaustive search over all $O(n^R)$ possible objective sequences, which would guarantee finding the optimal trajectory but is computationally infeasible.
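To make the selection loop concrete, below is a minimal Python sketch of the procedure. The per-round training routine `train_fn(model, j)` and the validation loss `val_loss_fn(model, j)` are assumed placeholders for whatever architecture, optimizer, and data pipeline a given experiment uses; they are not part of the paper's implementation.

```python
import copy
from typing import Callable, List, Sequence

def cross_training(init_models: Sequence, n_rounds: int,
                   train_fn: Callable, val_loss_fn: Callable) -> List:
    """Greedy cross-training with model selection (sketch of Section 2.1)."""
    # Current best model held by each objective; start from the given initializations.
    best = [copy.deepcopy(m) for m in init_models]
    n = len(best)
    for _ in range(n_rounds):
        # Train a copy of every current model on every objective: n x n candidates.
        candidates = [[None] * n for _ in range(n)]
        for i in range(n):            # index of the model's current objective
            for j in range(n):        # objective whose training set D_j^train we use
                cand = copy.deepcopy(best[i])
                train_fn(cand, j)     # one round of training on objective j
                candidates[i][j] = cand
        # Each objective j adopts whichever candidate scores lowest on D_j^val,
        # regardless of which objective that candidate came from.
        best = [
            min((candidates[i][j] for i in range(n)),
                key=lambda model: val_loss_fn(model, j))
            for j in range(n)
        ]
    return best
```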
2.2. Baseline: Independent Training
The baseline trains each model independently on its own objective—no cross-objective transfer.
- Initialize a random model $\theta_i^{(0)}$ for each objective $f_i$
- for round $r = 1$ to $R$ do
- for each objective $f_i$ do
- Train on its own objective: $\theta_i^{(r)} \leftarrow \mathrm{Train}(\theta_i^{(r-1)}, D_i^{\mathrm{train}})$
- end for
- end for
- return $\{\theta_1^{(R)}, \ldots, \theta_n^{(R)}\}$
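For comparison, a sketch of the baseline using the same assumed `train_fn` placeholder as above:

```python
import copy
from typing import Callable, List, Sequence

def independent_training(init_models: Sequence, n_rounds: int,
                         train_fn: Callable) -> List:
    """Baseline: every model only ever sees its own objective's training data."""
    models = [copy.deepcopy(m) for m in init_models]
    for _ in range(n_rounds):
        for i, model in enumerate(models):
            train_fn(model, i)  # one round of training on D_i^train only
    return models
```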
2.3. Single-Objective Transfer
To test whether benefits come from evaluating multiple initializations rather than cross-objective transfer, we run Algorithm 1 with all objectives set to the target ($f_i = f_j$ for all $i, j$). Comparing these three approaches isolates when indirect paths outperform direct training.
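A hypothetical usage sketch of this control, reusing the `cross_training` function and the assumed `train_fn` and `val_loss_fn` placeholders from the sketches above (`make_model` and the round and copy counts are illustrative, not values from the paper):

```python
TARGET = 0                           # index of the single target objective
n_copies, n_rounds = 4, 10           # illustrative values only

def train_on_target(model, _objective_idx):
    """Ignore the requested objective; always train on the target's data."""
    return train_fn(model, TARGET)

def val_loss_on_target(model, _objective_idx):
    """Always validate against the target objective."""
    return val_loss_fn(model, TARGET)

# Every "objective" slot is the same target task, so any remaining gain comes
# from selecting among multiple runs rather than from cross-objective transfer.
models = [make_model() for _ in range(n_copies)]
single_transfer_models = cross_training(models, n_rounds,
                                        train_on_target, val_loss_on_target)
```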
3. Synthetic Experiments
To build intuition, we start with three synthetic experiments: function learning, pixel-wise face reconstruction, and sequence transformation. These simplified settings clearly demonstrate the detour advantage.
3.1. Function Learning
We train neural networks to approximate different mathematical functions. Each network learns to predict one of six related functions: $f(x)$, $f'(x)$, $f''(x)$, $f'''(x)$, $\mathrm{sign}(f''(x))$ (indicating pointwise convexity), and $f(f(x))$ (composition).
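As an illustration only, the six regression targets could be generated as below for one hypothetical base function; the paper evaluates several base functions, and $f(x) = \sin(x)$ is chosen here purely because its derivatives have closed forms.

```python
import numpy as np

def make_targets(x: np.ndarray) -> dict:
    """Regression targets for the six objectives, for the example base f(x) = sin(x)."""
    f   = np.sin(x)
    df  = np.cos(x)    # f'(x)
    d2f = -np.sin(x)   # f''(x)
    d3f = -np.cos(x)   # f'''(x)
    return {
        "f":         f,
        "f'":        df,
        "f''":       d2f,
        "f'''":      d3f,
        "sign(f'')": np.sign(d2f),  # pointwise convexity indicator
        "f(f(x))":   np.sin(f),     # composition f(f(x))
    }

x = np.linspace(-np.pi, np.pi, 512)  # illustrative input range
targets = make_targets(x)            # one regression target per objective
```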
Given a fixed base function $f(x)$, these six objectives share underlying structure. Figure 2 shows results across several choices of $f(x)$, each with different complexity characteristics. For each objective, we plot: ground truth (gray), the trajectory that ultimately produces the best model (green, "Final Path"), the current round's best model (blue, "Round Choice"), and baseline direct training (red dashed).