The Detour Advantage: Discovering Better Objectives Through Indirect Transfer

Andre Ye*, Ryan Bahlous-Boldi*
MIT CSAIL
*Equal contribution
Intuitively, a model performs best on a function f(x) when optimized for f(x). But what if optimizing first for a different but related function g(x), before adapting to f(x), leads to better performance on f(x)?

Consider a model learning to map pixel coordinates to color values, $[0, 1]^2 \mapsto [0, 1]^3$, to reconstruct a face X. Training temporarily on a different face Y ultimately yields a better reconstruction of X: in Figure 1, observe that with a detour, models learn more discernible face structure, facial features like eyes and noses, clothing, and accessories like glasses and hats.

Figure 1. 5 rounds of training 12 models to reconstruct 12 faces by predicting the color value of each pixel in the image. Top row: target image. Middle row: "detour" training (the target image the model is currently trained on is displayed in the bottom right corner). Bottom row: regular training on the target image.

We call this "The Detour Advantage": under specific conditions, models trained on one objective and then transferred to another outperform models trained directly on the target. Our hypothesis is that in non-convex optimization landscapes, training temporarily on a related objective can guide the model past local minima that would trap direct gradient descent. To test this, we propose a greedy algorithm that explores objective sequences and demonstrate the phenomenon across function fitting, image reconstruction, sequence prediction, and attribute classification.

Contributions: (1) We show that indirect training paths can outperform direct optimization across multiple domains; (2) We characterize when detours help—underparametrized models, non-convex landscapes, structurally similar tasks—and provide preliminary analysis of why, including weight-space visualizations showing cross-training navigating around local minima.

2. Detour Discovery Algorithm

To determine when a detour helps, we formalize a method that explores sequences of training objectives and greedily selects the best-performing candidates after each round of training. We call this algorithm Cross-Training with Model Selection. We also define a baseline that trains a single model on each objective independently.

Each objective $f_i$ has an associated training set $D_i^{\text{train}}$. For cross-training, we additionally use a validation set $D_i^{\text{val}}$ for model selection.

2.1. Cross-Training with Model Selection

In the cross-training approach, each objective can adopt models trained for other objectives at each round. This is a greedy search algorithm: at each round, each objective selects whichever model currently performs best on its validation set. A model originally trained on objective $f_j$ might perform better on objective $f_i$'s validation set than the model trained on $f_i$ from the beginning. This creates a dynamic transfer-learning environment in which models traverse different objectives over time.

Algorithm 1: Cross-Training with Model Selection
Input: Objectives $\{f_1, \dots, f_n\}$, training sets $\{D_1^{\text{train}}, \dots, D_n^{\text{train}}\}$, validation sets $\{D_1^{\text{val}}, \dots, D_n^{\text{val}}\}$, number of rounds $R$
Output: Best model for each objective
  1. For each objective $i$, initialize a random model $\theta_i^{(0)}$.
  2. For rounds $r = 1$ to $R$:
    1. For each model $\theta_i^{(r-1)}$ and each objective $j$:
      1. Copy $\theta_i^{(r-1)}$ to get a new candidate $\theta_{ij}^{(r)}$.
      2. Train $\theta_{ij}^{(r)}$ on training set $D_j^{\text{train}}$.
    2. For each objective $j$:
      1. Evaluate all candidates $\theta_{ij}^{(r)}$ (over all $i$, including $i = j$) on the validation set $D_j^{\text{val}}$.
      2. Select the candidate with the lowest validation loss as $\theta_j^{(r)}$, the new "best" model for objective $j$ at round $r$.
  3. Return the best model $\theta_j^{(R)}$ for each objective after $R$ rounds.

Note: This greedy algorithm is not guaranteed to outperform direct training: a model selected for strong early-stage validation performance may lead to poor results after further training. Training $O(n^2)$ models per round approximates the exhaustive $O(n^R)$ search over all possible training paths, which would guarantee finding the optimal trajectory.
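To make the selection loop concrete, below is a minimal Python sketch of Algorithm 1. The `train` and `val_loss` callables are placeholders for whichever training procedure and validation metric an experiment uses; the names here are illustrative rather than drawn from our actual codebase.

```python
import copy

def cross_train(init_models, train_sets, val_sets, train, val_loss, rounds):
    """Algorithm 1: after each round, every objective adopts whichever
    candidate model performs best on its validation set.

    train(model, data) -> trained model; val_loss(model, data) -> scalar.
    """
    n = len(init_models)
    best = list(init_models)  # theta_i^(0) for each objective i
    for _ in range(rounds):
        # Train a copy of every current model on every objective: O(n^2) candidates.
        candidates = [[train(copy.deepcopy(best[i]), train_sets[j])
                       for j in range(n)] for i in range(n)]
        # Greedy selection: objective j keeps the candidate with lowest val loss.
        best = [min((candidates[i][j] for i in range(n)),
                    key=lambda m: val_loss(m, val_sets[j]))
                for j in range(n)]
    return best
```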

2.2. Baseline: Independent Training

The baseline trains each model independently on its own objective—no cross-objective transfer.

Algorithm 2: Baseline (No Transfer)
Input: Objectives $\{f_1, \dots, f_n\}$ with training sets $\{D_1^{\text{train}}, \dots, D_n^{\text{train}}\}$, rounds $R$
Output: Final model for each objective
  1. Initialize a random model $\theta_i^{(0)}$ for each objective $f_i$.
  2. For rounds $r = 1$ to $R$:
    1. For each objective $f_i$: train on its own objective, $\theta_i^{(r)} \leftarrow \text{Train}(\theta_i^{(r-1)}, D_i^{\text{train}})$.
  3. Return $\{\theta_1^{(R)}, \dots, \theta_n^{(R)}\}$.

2.3. Single-Objective Transfer

To test whether benefits come from evaluating multiple initializations rather than from cross-objective transfer, we run Algorithm 1 with every objective set to the target ($f_i = f_j$ for all $i, j$). Comparing these three approaches isolates when indirect paths outperform direct training.
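In the sketch from Section 2.1, this ablation is just a change of inputs: every slot receives the target's data, so selection ranges over multiple initializations but no cross-objective transfer can occur. The names below (`make_model`, `target_train`, `target_val`, `R`) are placeholders.

```python
# Single-objective transfer: n random initializations, one shared objective.
models = cross_train(init_models=[make_model() for _ in range(n)],
                     train_sets=[target_train] * n,
                     val_sets=[target_val] * n,
                     train=train, val_loss=val_loss, rounds=R)
```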

3. Synthetic Experiments

To build intuition, we start with three synthetic experiments: function learning, pixel-wise face reconstruction, and sequence transform tasks. These simplified experiments clearly demonstrate the detour advantage.

3.1. Function Learning

We train neural networks to approximate different mathematical functions. Each network learns to predict one of six related functions: f(x), f'(x), f''(x), f'''(x), sign(f''(x)) (indicating pointwise convexity), and f(f(x)) (composition).
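For illustration, one way to construct the six targets on a sampled grid uses finite differences in place of analytic derivatives; the base function below is a hypothetical example, not necessarily one used in Figure 2.

```python
import numpy as np

def make_objectives(f, x):
    """Build the six related regression targets for base function f on grid x."""
    y = f(x)
    d1 = np.gradient(y, x)         # f'(x), finite-difference approximation
    d2 = np.gradient(d1, x)        # f''(x)
    d3 = np.gradient(d2, x)        # f'''(x)
    return {
        "f":         y,
        "f'":        d1,
        "f''":       d2,
        "f'''":      d3,
        "convexity": np.sign(d2),  # pointwise convexity indicator sign(f''(x))
        "f(f)":      f(y),         # composition f(f(x))
    }

x = np.linspace(-2.0, 2.0, 512)
targets = make_objectives(lambda t: np.sin(3 * t) * np.exp(-t ** 2), x)
```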

Given a fixed base function f(x), these six objectives share underlying structure. Figure 2 shows results across several choices of f(x), each with different complexity characteristics. For each objective, we plot: ground truth (gray), the trajectory that ultimately produces the best model (green, "Final Path"), the current round's best model (blue, "Round Choice"), and baseline direct training (red dashed).

[Interactive figure. Panel shown: Sinusoidal-Exponential Modulation (1 of 7). Legend: Ground Truth, Final Path, Round Choice, Baseline (No Transfer). Loss table columns: $f(x)$, $f'(x)$, $f''(x)$, $f'''(x)$, convexity, $f(f(x))$.]
Figure 2. Cross-training advantage on complex functions. Neural networks learn to approximate mathematical functions and their derivatives. Green lines (Final Path) show the actual training trajectory of the final selected model, tracking which objective's data was used at each round. Blue lines (Round Choice) show the best model selected at each round. Red dashed lines (Baseline) show direct training without cross-training. The table shows final test losses (MSE), with green highlighting the better method for each objective.

Figure 2 shows that cross-training achieves lower final loss than baseline on nearly all tested functions. The trajectories reveal qualitative differences: baseline training smooths over sharp transitions and fine details, while cross-training preserves the precise shape of the target function. Training temporarily on a related objective allows the model to "snap" to the ground truth structure more quickly—the network adapts faster and captures intricate features that direct training struggles to learn. Crucially, cross-training also outperforms single-objective transfer in every case—the gain is not merely from selecting the best of multiple runs, but from the diversity of objectives themselves.

Notably, derivative objectives (f''(x), f'''(x)) show the largest gains. Derivatives amplify high-frequency components of the base function, making them harder to learn directly. A model trained on f(x) already encodes the function's structure; fine-tuning for derivatives can then leverage this representation rather than learning the amplified signal from scratch.

This advantage is most pronounced on complex functions with rich structure. For simpler functions with smoother, more regular structure, baseline models can learn them directly without requiring cross-training's inductive biases—see Appendix A for these results.

3.2. Pixel-wise Face Reconstruction

To test a higher-complexity task, we train models to reconstruct faces pixel by pixel—a regression task mapping (x, y) coordinates to RGB color values. Using 20 different celebrity faces from CelebA as objectives gives us 20 related but distinct tasks, making this domain ideal for testing cross-training.
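Each face thus defines one regression dataset. A minimal sketch of its construction, assuming an H×W×3 image array with values in [0, 1]:

```python
import numpy as np

def image_to_dataset(img):
    """Turn an (H, W, 3) image in [0, 1] into ((x, y) -> RGB) training pairs."""
    h, w, _ = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel() / (w - 1),    # x coordinate in [0, 1]
                       ys.ravel() / (h - 1)],   # y coordinate in [0, 1]
                      axis=1)
    colors = img.reshape(-1, 3)                 # RGB targets in [0, 1]^3
    return coords, colors
```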

Figure 3 shows the complete training progression across 10 rounds with all 20 faces (Figure 1 shows 5 rounds with a subset of faces). At each round, each objective can either continue training its current model or adopt another objective's model that performs better on its validation set.

Training temporarily on another face allows the model to learn intricate facial features more precisely and converge faster. Compare the Detour row to the Baseline row—cross-training captures fine details like texture and edges that baseline training blurs or misses entirely.

Why does training on face Y help reconstruct face X? All faces share common structure: eyes, noses, mouths appear in predictable locations with similar local geometry. Training on any face teaches the network useful primitives—edge detectors, color gradients, symmetry patterns—that transfer across individuals. The detour through face Y provides a curriculum of general facial features before specializing to face X.

Figure 3. Full training progression for all 20 face reconstruction objectives across 1000 timesteps (10 rounds × 100 snapshots). Each column represents one face (S0-S19), showing target image, final trajectory path, current round training, and baseline trajectory.

3.3. Sequence Transform Tasks

We train small transformer models on 20 different sequence transformation tasks. Each task transforms the same input sequence (16 tokens from a vocabulary of 8) into a different output pattern. For example, one task might reverse the sequence, another might sort it, and another might compute cumulative sums.

The model architecture is intentionally constrained (a 1-layer, 1-head transformer with only 2,300 parameters). This capacity bottleneck makes the experiment particularly revealing: when models lack the capacity to independently learn 20 different tasks, does cross-training help by sharing learned representations?

The 20 tasks span three categories: position-based tasks (Copy, Reverse, various shifts and swaps), value-based tasks (Sort, cumulative operations, differences), and combination tasks that merge multiple operations. We compare baseline training (20 independent specialists) against cross-training where tasks can adopt each other's weights based on validation performance.
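To make the task families concrete, here is how a few of them could be defined over integer token sequences. Task names follow the figures, but these exact definitions are our illustrative reading; the mod-V wrapping is an assumption to keep outputs inside the vocabulary.

```python
import numpy as np

V = 8   # vocabulary size
L = 16  # sequence length

TASKS = {
    # Position-based: rearrange tokens without looking at values.
    "Copy":    lambda s: s.copy(),
    "Reverse": lambda s: s[::-1].copy(),
    "ShiftR1": lambda s: np.roll(s, 1),
    "Swap":    lambda s: s.reshape(-1, 2)[:, ::-1].ravel(),  # swap adjacent pairs
    # Value-based: depend on token values.
    "Sort":    lambda s: np.sort(s),
    "CumSum":  lambda s: np.cumsum(s) % V,
    "Diff":    lambda s: np.diff(s, prepend=s[:1]) % V,
}

rng = np.random.default_rng(0)
seq = rng.integers(0, V, size=L)
targets = {name: fn(seq) for name, fn in TASKS.items()}
```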

Figure 4. Tasks sorted by improvement. Green bars indicate positive transfer; red bars indicate negative transfer.

Figure 4 reveals that cross-training helps on the majority of tasks: 16 out of 20 tasks show positive transfer, with only 4 experiencing negative transfer. The improvements are not uniformly distributed; position-based tasks like ShiftR1 and Swap benefit dramatically, while tasks with unique computational structure (Dedupe, Diff) show modest degradation. This asymmetry suggests that the detour advantage depends on structural compatibility between source and target tasks.

Figure 5. Training curves for individual sequence transformation tasks, comparing cross-training (blue) to baseline (red dashed) across all 20 task types. Shaded regions indicate ±1 standard deviation across 5 random seeds.

To understand where beneficial transfers originate, we examine the transfer lineage matrix below (Figure 6). This visualization tracks which source tasks were selected for which target tasks throughout training.

Figure 6. Transfer lineage matrix showing which source tasks (rows) were selected for which target tasks (columns). Darker blue = more frequent selection. Notice the sorting cluster (Sort, SortDesc, SortRev, RevSort, ShiftSort) that constantly exchange weights, and the strong diagonal indicating specialists often remain best for their own task.

Cross-training improves average accuracy from 49.0% to 51.3% (a 4.6% relative gain). The transfer patterns reveal natural task families: sorting tasks (Sort, SortDesc, SortRev) constantly exchange weights; position tasks (Reverse, ShiftR1, Swap) share positional indexing; inverse operations (CumSum/Diff) transfer bidirectionally. Position-based tasks show dramatic gains (ShiftR1: +36%, Swap: +28%) by discovering shared reasoning circuits.

Critically, these relationships emerge automatically through validation-based selection—we don't manually specify which tasks should help each other. Not all transfer is positive: tasks with unique structure (Diff, Dedupe: -6.8%) perform worse, confirming that the detour advantage requires structural similarity.

4. High-Dimensional Experiment

To test whether the detour advantage scales to higher-dimensional settings, we apply cross-training to multi-attribute classification on the CelebA dataset.

4.1. CelebA Multi-Attribute Classification

The CelebA dataset contains over 200,000 celebrity face images, each annotated with 40 binary attributes (e.g., Smiling, Male, Eyeglasses, Blond_Hair). We use 30 of these attributes as training objectives for cross-training, holding out 10 attributes to test generalization to entirely new tasks. This generalization split measures how well cross-trained models transfer to a task drawn from the same task distribution but never seen during training. Strong held-out performance suggests that the cross-trained models have learned task-agnostic features useful across the distribution.
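A sketch of the attribute split using torchvision's CelebA loader. Smiling is genuinely held out in our setup (see the per-attribute results below); the other nine held-out choices here are arbitrary placeholders, not our actual split.

```python
import random
from torchvision import datasets

# Pass download=True on first use.
celeba = datasets.CelebA(root="data", split="train", target_type="attr")
attr_names = list(celeba.attr_names)  # 40 binary attribute names

# Hold out 10 attributes as unseen tasks. "Smiling" is held out in our
# experiments; the remaining 9 picked here are illustrative placeholders.
rng = random.Random(0)
held_out = {"Smiling", *rng.sample([a for a in attr_names if a != "Smiling"], 9)}
train_idx = [i for i, a in enumerate(attr_names) if a not in held_out]
held_idx = [i for i, a in enumerate(attr_names) if a in held_out]
```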

We compare three approaches: (1) Specialists, one model trained independently per attribute; (2) Generalist, a single multi-task model predicting all attributes simultaneously; and (3) Cross-Training, our approach that maintains a pool of models and selects the best performer for each attribute based on validation accuracy.

Figure 7. CelebA attribute classification accuracy. Cross-training matches or exceeds specialist performance on training attributes while showing stronger generalization to held-out attributes (+1.8% over specialists).

The results reveal a compelling pattern: while cross-training performs comparably to specialists on the 30 training attributes (87.4% vs 87.2%), it significantly outperforms on the 10 held-out attributes (86.1% vs 84.4%). This +1.8% improvement on unseen tasks suggests that cross-training builds more generalizable representations. The diverse training signal from multiple attributes creates features that transfer better to new tasks.

This generalization result is particularly significant. The held-out attributes were never used during training, so cross-training had no opportunity to optimize for them directly or indirectly. Yet cross-trained models consistently outperform both specialists and the generalist on these novel tasks.

We hypothesize this occurs because cross-training acts as implicit representation learning: maintaining a diverse pool of models trained on different objectives naturally preserves features that capture broadly useful patterns. Specialists, by contrast, may overfit to task-specific patterns that don't transfer. The selection pressure (keeping models that perform well across validation sets) implicitly favors representations with good inductive biases for the domain. In practical terms, a cross-trained model pool may serve as a better starting point for any related task, even ones not anticipated during training.

Importantly, the cross-training approach also outperforms the generalist model on the held-out attributes, which suggests that multi-task SGD on its own does not regularize models strongly enough to capture cross-task general features.

Figure 8. Per-attribute improvement of cross-training over specialists. Green bars indicate attributes where cross-training helps; red bars where specialists win. Held-out attributes (right, darker background) show consistently positive transfer.

The per-attribute breakdown reveals that cross-training particularly excels on the held-out attributes (shown with a darker background). Notably, Smiling shows a +5.8% improvement—the largest gain among held-out attributes. This likely reflects that smiling correlates with other trained attributes like High_Cheekbones and Mouth_Slightly_Open, so cross-training builds representations that capture expression-related features even without direct supervision on Smiling.

The generalist model, despite seeing all training attributes simultaneously, fails to match cross-training's performance. This suggests that explicit multi-task learning can actually harm transfer compared to selective weight sharing—perhaps because gradient interference between competing objectives prevents the network from fully exploiting shared structure.

5. Analysis and Discussion

Why do detours help? We analyze the learning dynamics through weight space visualizations and trajectory statistics.

5.1. Weight Space Visualization

To visualize what happens during training, we project the high-dimensional weight space into 2D. Two direction vectors define a plane, and interpolating weights within it produces a loss landscape we can visualize. Deriving these axes from the two training trajectories (cross-training and direct training) via PCA on their pooled weight snapshots provides an intuitive choice: it aligns the visualization with the actual optimization paths taken by both methods.
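A sketch of this trajectory-aligned projection, assuming `snapshots` is a (T, d) array of flattened weights pooled from both runs and `loss_fn` evaluates a flattened weight vector:

```python
import numpy as np

def trajectory_plane(snapshots):
    """Fit a 2D plane to weight snapshots (T, d) via PCA (SVD of centered data)."""
    origin = snapshots.mean(axis=0)
    _, _, vt = np.linalg.svd(snapshots - origin, full_matrices=False)
    return origin, vt[0], vt[1]  # plane origin and two orthonormal axes

def project(w, origin, u, v):
    """Coordinates of weight vector w in the fitted plane."""
    return np.array([(w - origin) @ u, (w - origin) @ v])

def loss_grid(loss_fn, origin, u, v, extent=1.0, res=50):
    """Evaluate loss_fn over a res x res grid spanning the projected plane."""
    a = np.linspace(-extent, extent, res)
    return np.array([[loss_fn(origin + x * u + y * v) for x in a] for y in a])
```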

The visualizations below reveal qualitative differences in how the two approaches navigate the loss landscape. Direct training (red dashed lines) often follows a convex-like path toward nearby local minima. Cross-training (green solid lines) may temporarily increase error as it trains on a different objective, but eventually navigates toward better solutions. Because these are 2D projections of a high-dimensional space, the detour paths don't always perfectly reach the global minimum shown in the visualization—the true optimum may lie outside our projected plane. See Appendix D for full interactive exploration across all functions and objectives.

Cross-training trajectories curve around regions where direct training gets stuck. The intermediate objective effectively "pushes" the model past local minima that would trap gradient descent. This is especially visible for f'''(x), where direct training converges to a suboptimal plateau while cross-training escapes to a lower-loss region.

[Weight-space landscape panels: $f'(x)$, $f'''(x)$, convexity.]

Figure 9. Weight space landscape visualizations for Sinusoidal-Exponential Modulation showing optimization trajectories for three representative objectives. Contours represent the training loss landscape (darker = lower loss). Green solid lines show cross-training paths, red dashed lines show direct training paths. Circles mark starting points, stars mark final positions. Cross-training paths navigate around local minima, while direct training paths can get trapped. All views use trajectory-aligned projection (PCA).

5.2. Trajectory Statistics

We analyze optimization trajectories by computing metrics from weight snapshots during training. The table below (Figure 10) shows quantitative comparisons across representative functions and objectives, while the detailed metrics reveal fundamental differences in how cross-training and direct training navigate the loss landscape.

Cross-training exhibits fundamentally different optimization dynamics than direct training. While achieving similar efficiency (loss improvement per unit distance), cross-training takes much longer, more tortuous paths (2.3× length, 3.3× tortuosity). The high exploration metrics—frequent backtracking (25× more) and momentum misalignment (85× more)—show that cross-training actively explores the loss landscape rather than descending directly. The dramatic increases in acceleration (9.5×) and jerk (11.2×) indicate that cross-training frequently changes direction during optimization. These metrics quantify the intuition from Figure 9: cross-training "wanders" through weight space, temporarily moving away from the goal to discover better paths, while direct training follows the most direct route to nearby minima. This is expected—optimizing toward different (even if related) objectives naturally produces more varied trajectories through weight space.

Metric (Cross-Training / Direct ratio)
Efficiency (loss improvement per unit distance traveled): 1.01
Path Length (total L2 distance, $\sum_t \lVert w_{t+1} - w_t \rVert$): 2.30
Tortuosity (path length / straight-line distance to final position): 3.26
Backtracking to Start (% of steps moving closer to the initial weights): 25.00
Momentum Misalignment (% of steps with negative dot product with the EMA direction, $\alpha = 0.9$): 85.00
Mean Step Size (average L2 norm of weight updates per step): 1.01
Acceleration (mean rate of change in step direction): 9.45
Jerk (rate of change of acceleration; measures optimization smoothness): 11.22
Figure 10. Trajectory statistics comparing cross-training vs direct training. Metrics computed from weight snapshots (25 per round × 10 rounds) across the functions in Figure 9. Ratios are computed as average of per-trajectory ratios (not ratio of averages) to give equal weight to each function-objective pair regardless of scale.
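The path-shape metrics in Figure 10 can be computed directly from the saved snapshots. Below is a sketch of the geometric ones, assuming `W` is a (T, d) array of flattened weight snapshots from one trajectory; the definitions are our reading of the table above, not a definitive implementation.

```python
import numpy as np

def trajectory_stats(W, alpha=0.9):
    """Geometric path metrics from weight snapshots W with shape (T, d)."""
    steps = np.diff(W, axis=0)                      # w_{t+1} - w_t
    step_norms = np.linalg.norm(steps, axis=1)
    path_len = step_norms.sum()                     # total L2 distance
    net_disp = np.linalg.norm(W[-1] - W[0])         # straight-line distance
    # Backtracking: fraction of steps that move closer to the initial weights.
    dist_to_start = np.linalg.norm(W - W[0], axis=1)
    backtrack = float(np.mean(np.diff(dist_to_start) < 0))
    # Momentum misalignment: steps opposing the EMA of past step directions.
    ema, misaligned = np.zeros(W.shape[1]), 0
    for s in steps:
        misaligned += (s @ ema) < 0
        ema = alpha * ema + (1 - alpha) * s
    # Higher-order smoothness: acceleration and jerk of the weight path.
    accel = np.diff(steps, axis=0)
    jerk = np.diff(accel, axis=0)
    return {
        "path_length": path_len,
        "tortuosity": path_len / net_disp,
        "backtracking_frac": backtrack,
        "momentum_misalignment_frac": misaligned / len(steps),
        "mean_step_size": step_norms.mean(),
        "mean_acceleration": np.linalg.norm(accel, axis=1).mean(),
        "mean_jerk": np.linalg.norm(jerk, axis=1).mean(),
    }
```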

5.3. What Didn't Work?

Not all detours help. We identified three failure modes: (1) Random transfers rarely help—without considering task similarity, the advantage disappears; (2) Too many transfers hurt—longer chains (A→B→C→D→target) show diminishing returns from catastrophic forgetting; (3) Dissimilar tasks don't transfer—preliminary CIFAR-100 experiments between different superclasses showed minimal gains. The detour advantage requires structural similarity between objectives.

Limitations. Our search algorithm is $O(n^2)$ per round, making it impractical for large-scale applications. The phenomenon is demonstrated primarily in underparametrized settings; whether it persists in heavily overparametrized modern architectures remains open. Finally, while we characterize when detours help, we do not yet have a reliable method to predict beneficial detours a priori.

5.4. When Should Detours Help?

Our results raise a fundamental question: when should we expect detours to help? Theory suggests an answer. In the overparametrized regime, loss landscapes become increasingly convex [23], [25]—and in convex optimization, direct gradient descent is provably optimal [27]. No detour can improve upon the straight path. This predicts that as models become more overparametrized, detour benefits should diminish.

By contrast, our experiments use small networks (3-layer MLPs with 64-128 hidden units) in the underparametrized regime, where loss landscapes exhibit sharp non-convexities and numerous local minima [11]. Here, training on a related objective can guide optimization around obstacles that trap direct descent. This aligns with our findings: detours help most on complex objectives where direct training struggles, and may explain why large overparametrized models succeed with direct supervised learning—their scale smooths the terrain where detours provide advantage.

These observations suggest testable predictions: detours should help most when (1) models are underparametrized, (2) landscapes are non-convex, and (3) tasks share structure. Future work should systematically vary these factors to delineate the boundaries of the detour advantage.

6. Conclusion

The Detour Advantage reveals a counterintuitive principle: the shortest path to an objective is not always through the objective itself. By taking strategic detours through related objectives, we achieve better final performance than direct optimization.

This finding has implications for understanding optimization and transfer learning. Just as GPS sometimes recommends indirect routes that arrive faster by avoiding traffic, neural network training may benefit from strategic detours through objective space that avoid poor local minima.

More broadly, our results suggest that the standard paradigm of "train on what you want to optimize" may be leaving performance on the table. In multi-objective settings, the choice of training sequence matters—and the optimal sequence is not always the obvious one. Future work should explore practical algorithms for efficiently discovering beneficial detours without exhaustive search, potentially using task similarity metrics or meta-learning to predict which detours will help.

Acknowledgments. We thank Jyo Pari, Akarsh Kumar, Pulkit Agrawal, Idan Shenfeld, Oliver Sieberling, Zachary Wojtowicz, and Nitish Dashora for discussions that helped shape this work.

References

[1] Fifty, C., Amid, E., Zhao, Z., Yu, T., Anil, R., & Finn, C. (2021). Efficiently identifying task groupings for multi-task learning. NeurIPS. https://arxiv.org/pdf/2101.10382

[2] Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359. https://ieeexplore.ieee.org/document/5288526

[3] Zhuang, F., et al. (2021). A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1), 43-76. https://arxiv.org/abs/1911.02685

[4] Wang, X., et al. (2019). Characterizing and avoiding negative transfer. CVPR. https://arxiv.org/abs/1811.09751

[5] Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098. https://arxiv.org/abs/1706.05098

[6] Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. ICML. https://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf

[7] Deleu, T., et al. (2025). Optimal task order for continual learning of multiple tasks. arXiv preprint arXiv:2502.03350. https://arxiv.org/abs/2502.03350

[8] Zamir, A. R., et al. (2018). Taskonomy: Disentangling task transfer learning. CVPR. https://arxiv.org/abs/1804.08328

[9] Garipov, T., et al. (2018). Loss surfaces, mode connectivity, and fast ensembling of DNNs. NeurIPS. https://arxiv.org/abs/1802.10026

[10] Li, H., et al. (2018). Visualizing the loss landscape of neural nets. NeurIPS. https://arxiv.org/abs/1712.09913

[11] Fort, S., & Scherlis, A. (2019). The goldilocks zone: Towards better understanding of neural network loss landscapes. AAAI. https://arxiv.org/abs/1807.02581

[12] Malladi, S., et al. (2023). A kernel-based view of language model fine-tuning. ICML. https://arxiv.org/abs/2210.05643

[13] Kirkpatrick, J., et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 3521-3526. https://www.pnas.org/doi/10.1073/pnas.1611835114

[14] Flesch, T., et al. (2025). Humans and neural networks show similar patterns of transfer and interference during continual learning. Nature Human Behaviour. https://www.nature.com/articles/s41562-025-02318-y

[15] Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. ICML. https://arxiv.org/abs/1703.03400

[16] Jamal, M. A., & Qi, G.-J. (2019). Task agnostic meta-learning for few-shot learning. CVPR. https://openaccess.thecvf.com/content_CVPR_2019/papers/Jamal_Task_Agnostic_Meta-Learning_for_Few-Shot_Learning_CVPR_2019_paper.pdf

[17] Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. NeurIPS. https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks

[18] Mildenhall, B., et al. (2020). NeRF: Representing scenes as neural radiance fields for view synthesis. ECCV. https://arxiv.org/abs/2003.08934

[19] Mouret, J.-B., & Clune, J. (2015). Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909. https://arxiv.org/abs/1504.04909

[20] Lehman, J., & Stanley, K. O. (2011). Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2), 189-223. https://direct.mit.edu/evco/article/19/2/189/1237/

[21] Spector, L. (2012). Assessment of problem modality by differential performance of lexicase selection in genetic programming: A preliminary report. In Proceedings of the 14th Annual Conference Companion on Genetic and Evolutionary Computation (GECCO '12), 401-408. ACM. https://dl.acm.org/doi/10.1145/2330784.2330846

[22] Boldi, R., Ding, L., & Spector, L. (2024). Objectives are all you need: Solving deceptive problems without explicit diversity maintenance. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO '24). https://arxiv.org/abs/2311.02283

[23] Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. NeurIPS. https://arxiv.org/abs/1806.07572

[24] Chizat, L., Oyallon, E., & Bach, F. (2019). On lazy training in differentiable programming. NeurIPS. https://arxiv.org/abs/1812.07956

[25] Lee, J., et al. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. NeurIPS. https://arxiv.org/abs/1902.06720

[26] Cooper, Y. (2021). The loss landscape of overparameterized neural networks. arXiv preprint arXiv:1804.10200. https://arxiv.org/abs/1804.10200

[27] Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. https://web.stanford.edu/~boyd/cvxbook/

[28] Pope, P. E., et al. (2021). The intrinsic dimension of images and its impact on learning. ICLR. https://arxiv.org/abs/2104.08894

Appendix

A. Cross-Training on Simpler Functions

While Figure 2 in the main text shows cross-training advantages on complex functions, we also tested simpler functions with more regular structure. These functions show smaller performance gains, as their smoother, more regular structure allows baseline models to learn them directly without requiring the inductive biases provided by cross-training.

[Interactive figure. Panel shown: Multiple Frequency Components (1 of 4). Legend: Ground Truth, Final Path, Round Choice, Baseline (No Transfer). Loss table columns: $f(x)$, $f'(x)$, $f''(x)$, $f'''(x)$, convexity, $f(f(x))$.]
Figure A1. Cross-training on simpler functions. Green lines (Final Path) show the actual training trajectory of the final selected model. Blue lines (Round Choice) show the best model selected at each round. Red dashed lines (Baseline) show direct training. The advantage is less extreme because these functions have regular, smooth structure that can be more easily learned by baseline models without requiring transfer from related objectives.

B. Top Cross-Task Transfers

Figure B1. Most frequent cross-task transfers (excluding self-selection). SortRev ↔ SortDesc dominate (they're essentially the same operation), followed by inverse operation pairs like Diff ↔ CumSum. These patterns reveal natural task families discovered automatically by validation-based selection.

C. Per-Task Accuracy Comparison

Figure C1. Per-task accuracy comparison. Gray bars show baseline performance, blue bars show cross-training. Tasks like Copy, CumMax, and CumMin achieve near-perfect accuracy with both methods. Position manipulation tasks (ShiftR1, Swap) show the largest visible improvements.

D. Interactive Weight Space Exploration

This appendix collects weight space trajectory visualizations across all tested functions and objectives, all using trajectory-aligned (PCA) projection.

Figure D1. Interactive weight space landscape visualizations. Contours represent the training loss landscape (darker = lower loss). Green solid lines show cross-training paths, red dashed lines show direct training paths. Circles mark starting points, stars mark final positions. All visualizations use trajectory-aligned projection (PCA).