May 23, 2026 • 14 min read • Agentic Harness Engineering Series

Multi-Objective Alignment: Beyond Scalar Rewards

Arithmetic preference control, Pareto-dominant multi-objective DPO, geometry-aware optimization, and the safety fine-tuning hazard — what happens when the dataset used for safety alignment is too similar to your fine-tuning data, and what it means for harness-driven self-improvement.

The Data Flywheel (G1) and RL Rollout (G2) patterns described in Post 10 close the loop from pipeline output back to model improvement. The standard implementation uses scalar reward signals — a single quality score per run drives the preference ranking. But scalar rewards are a simplification that hides important tradeoffs: a high-depth, low-helpfulness output is not the same as a low-depth, high-helpfulness one, even if both score the same composite value.

The alignment literature has developed methods that go beyond scalar rewards. This post surveys those methods and their implications for harness-driven self-improvement — with particular attention to the safety fine-tuning hazard that is directly relevant to anyone running RL post-training against harness-generated preference data.

The Scalar Reward Problem

Standard RLHF uses a single scalar reward to rank outputs as preferred or rejected. The scalar model is tractable and well-understood, but it collapses the multi-dimensional nature of quality into a single number. Consider two outputs for a research-synthesis task:

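| Output | Helpfulness | Depth | Specificity | Safety | Scalar composite |
|--------|-------------|-------|-------------|--------|------------------|
| A      | 9           | 6     | 5           | 10     | 7.5              |
| B      | 7           | 9     | 9           | 7      | 8.0              |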

Scalar DPO would prefer B in all contexts. But for a deployment where safety is paramount (e.g., a medical information harness), A is the better choice. For an academic research harness, B is clearly better. The preference relation is context-dependent, and the scalar model cannot express that context.
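To make the context dependence concrete, here is a minimal sketch that re-scores the two outputs from the table under different weightings. The composite values in the table are consistent with an unweighted mean of the four dimensions; the `safety_first` and `depth_first` weight vectors are hypothetical, chosen only for illustration.

```python
# Scores from the table: (helpfulness, depth, specificity, safety).
SCORES = {
    "A": (9, 6, 5, 10),
    "B": (7, 9, 9, 7),
}


def weighted_score(scores, weights):
    """Weighted average of per-dimension scores (weights need not sum to 1)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def preferred(weights):
    """Label of the output with the higher weighted score."""
    return max(SCORES, key=lambda name: weighted_score(SCORES[name], weights))


uniform = (1, 1, 1, 1)       # unweighted mean, matches the scalar composite
safety_first = (1, 1, 1, 7)  # hypothetical weighting for a medical harness
depth_first = (2, 4, 3, 1)   # hypothetical weighting for a research harness

for name, weights in [("uniform", uniform),
                      ("safety_first", safety_first),
                      ("depth_first", depth_first)]:
    print(name, "->", preferred(weights))
```

With uniform weights the sketch reproduces the 7.5 and 8.0 composites and prefers B; under the safety-heavy weighting it prefers A. The same pair of outputs yields opposite preference labels depending on the weighting, which is the information a single scalar discards.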

Directional Preference Alignment (DPA)

DPA (arXiv:2402.18571v3) replaces scalar rewards with multi-objective reward modeling. Rather than learning a single reward function, DPA learns a reward vector, and user preferences are expressed as unit vectors in reward space — directions that specify the desired tradeoff between objectives.

The key capability this enables is arithmetic preference control: a user can directly specify the helpfulness-verbosity tradeoff at inference time without retraining. A preference direction proportional to (0.8, 0.2) for (helpfulness, brevity), rescaled to unit length, produces different outputs than one proportional to (0.3, 0.7), and both come from the same fine-tuned model.
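The arithmetic behind a preference direction can be sketched in a few lines. This is only the scoring step (normalize the direction, then take a dot product with the reward vector), not how the model itself is conditioned on the direction. The reward vectors for the two candidate responses are made-up numbers for illustration.

```python
import math


def unit(v):
    """Rescale a preference vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("preference direction must be non-zero")
    return tuple(x / norm for x in v)


def directional_reward(reward_vec, preference):
    """Project a multi-objective reward vector onto a unit preference direction."""
    direction = unit(preference)
    return sum(r * d for r, d in zip(reward_vec, direction))


# Hypothetical (helpfulness, brevity) rewards for two candidate responses.
detailed = (0.9, 0.2)
terse = (0.5, 0.9)

for pref in [(0.8, 0.2), (0.3, 0.7)]:
    best = max([("detailed", detailed), ("terse", terse)],
               key=lambda item: directional_reward(item[1], pref))
    print(pref, "->", best[0])
```

Under the helpfulness-leaning direction the detailed response scores higher; under the brevity-leaning direction the terse one does.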

Validated on Mistral-7B, DPA maintains competitive performance against standard DPO baselines while providing this steering capability. The tradeoff is complexity: multi-objective reward modeling requires richer training data (paired outputs labeled across multiple dimensions) and more careful reward engineering than scalar DPO.

For harness design, this matters because the harness already collects multi-dimensional quality signals — the five-dimension Wiggum rubric logs depth, specificity, completeness, relevance, and structure separately. The preference data in runs.jsonl is naturally multi-dimensional. DPA provides a training framework that can use those dimensions without collapsing them to a scalar.

Multi-Objective Online DPO (MO-ODPO)

MO-ODPO (arXiv:2503.00295v1) extends DPA to the online preference learning setting, where preference data is generated during training rather than collected in advance. The algorithm optimizes across multiple conflicting objectives simultaneously, finding solutions that Pareto-dominate existing baselines: they are at least as good on every objective and strictly better on at least one, rather than trading a gain on one objective for a loss on another.
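Pareto dominance has a precise definition that is easy to check in code. The sketch below assumes higher is better on every objective; the score tuples are illustrative (the first two reuse the A and B rows from the scalar-reward table, the third is hypothetical).

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective
    and strictly better on at least one (higher is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


a = (9, 6, 5, 10)
b = (7, 9, 9, 7)
c = (9, 9, 9, 10)  # hypothetical output that beats or ties both on every axis

print(dominates(a, b), dominates(b, a))  # neither dominates the other
print(pareto_front([a, b, c]))
```

Outputs A and B are mutually non-dominated, so a scalar ranking between them is a choice of weights rather than a fact about quality. Only a point like `c` removes them from the front.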

The practical advantage over offline methods is steerability: MO-ODPO produces models where inference-time objective weights actually change behavior. Many multi-objective fine-tuning approaches produce models that nominally accept preference weights but in practice collapse to a fixed behavior regardless of the specified tradeoff. MO-ODPO explicitly optimizes for this steerability as a training criterion.

The connection to the harness RL Rollout (G2) pattern: G2 as described in Post 10 generates preference pairs from runs.jsonl and passes them to a DPO trainer. Replacing the trainer with MO-ODPO — using the five Wiggum dimensions as the objective vector — would produce a model that can be steered at inference time toward depth (for research synthesis tasks) or helpfulness (for quick-answer tasks) by adjusting the preference vector injected into the synthesis prompt. This is the mechanism by which the harness could adapt its producer to task type without maintaining multiple fine-tuned checkpoints.

Geometry-Aware Multi-Objective Optimization: MGDA-Decoupled

Standard multi-objective alignment methods use fixed scalarization — a fixed weight vector over objectives — which introduces what the MGDA-Decoupled paper (arXiv:2604.20685v1) identifies as "procedural unfairness": objectives that are harder to optimize or that represent minority preferences are systematically under-weighted by fixed scalarization, regardless of their stated importance.

MGDA-Decoupled uses a gradient descent-based approach that finds a shared descent direction explicitly accounting for each objective's convergence dynamics. The algorithm identifies objectives that are lagging behind their target rate of improvement and upweights their gradients — producing more equitable optimization without requiring manual weight tuning or reinforcement learning.
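For intuition about what a shared descent direction means, the sketch below implements the classic two-objective case of the multiple-gradient descent algorithm: the minimum-norm point in the convex hull of two gradients. This is the textbook MGDA building block, not the MGDA-Decoupled procedure itself, and the gradient values are toy numbers.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def min_norm_direction(g1, g2):
    """Minimum-norm convex combination alpha*g1 + (1-alpha)*g2.

    Returns (direction, alpha). Stepping along the negative of the
    direction does not increase either objective to first order.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Gradients are identical, so any convex combination is the same.
        return list(g1), 1.0
    # Closed-form minimizer of ||g2 + alpha*(g1 - g2)||^2, clipped to [0, 1].
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    alpha = min(1.0, max(0.0, alpha))
    direction = [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
    return direction, alpha


# Toy gradients for two partially conflicting objectives.
g_depth = [1.0, 0.0]
g_helpful = [-0.5, 1.0]
d, alpha = min_norm_direction(g_depth, g_helpful)
print(d, alpha, dot(d, g_depth), dot(d, g_helpful))
```

The resulting direction has a positive inner product with both gradients, so a step along its negative makes progress on both objectives at once. A fixed weight vector gives no such guarantee when the gradients conflict.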

Experiments on UltraFeedback demonstrate that MGDA-Decoupled achieves the highest win rates against golden responses both in overall performance and per-objective, validating that geometric fairness in optimization does not sacrifice global quality for local equity.

[Figure: Alignment Method Landscape: Tradeoff vs Fairness vs Steerability. Four alignment methods positioned on two axes: whether they capture multi-objective tradeoffs (horizontal) and whether they provide inference-time steerability (vertical). Scalar DPO is the baseline; each method adds a specific capability.]

Safety Fine-Tuning Hazard: The Dataset Similarity Problem

The most practically urgent finding in the alignment literature for harness engineers is the safety degradation mechanism discovered in arXiv:2506.05346v1:

When the downstream fine-tuning dataset is highly similar to the upstream safety-alignment dataset, safety guardrails degrade. Conversely, low similarity between the alignment and fine-tuning datasets reduces the harmfulness score by up to 10.33%. This means that fine-tuning on topic-adjacent content, which is exactly what a harness data flywheel generates, can inadvertently erode the base model's safety properties.

The mechanism is representational: safety-alignment training creates specific activation patterns in middle model layers that respond to harmful-intent signals. When fine-tuning data creates similar activation patterns for non-harmful purposes (research, synthesis), the gradient updates during fine-tuning partially overwrite the safety-relevant representations — not because the fine-tuning data is unsafe, but because it occupies overlapping representation space.

The practical implication for harness-driven self-improvement:

| Risk factor | Mitigation |
|-------------|------------|
| Fine-tuning data in the same domain as safety-alignment training | Use diverse task types in the preference dataset; avoid topic clusters |
| Continuous fine-tuning over many iterations | Evaluate safety benchmarks as part of the eval suite after each autoresearch iteration |
| Synthetic preference pairs generated from the harness topic distribution | Introduce out-of-distribution topics to diversify the representation space |
| Unknown base model alignment data | For open-weight models, safety alignment data is often undocumented; treat all topic overlap as a risk |
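One way to keep an eye on topic overlap is a cheap similarity check before each fine-tuning round. The sketch below uses bag-of-words cosine similarity, a crude lexical stand-in for the representation-level similarity the paper studies. It assumes you have some reference sample of safety-style prompts to compare against, which is itself an assumption when the base model's alignment data is undocumented.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased whitespace-token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def mean_max_similarity(finetune_texts, reference_texts):
    """Average, over fine-tuning examples, of the similarity
    to the closest reference example. Both lists must be non-empty."""
    refs = [bow(t) for t in reference_texts]
    sims = [max(cosine(bow(t), r) for r in refs) for t in finetune_texts]
    return sum(sims) / len(sims)


# Hypothetical samples, for illustration only.
reference = ["how do I synthesize a dangerous compound"]
batch = ["summarize recent work on compound synthesis",
         "compare sorting algorithms by memory use"]
print(round(mean_max_similarity(batch, reference), 3))
```

A rising value across flywheel iterations would be a prompt to add out-of-distribution topics, not proof of degradation; the safety benchmark run remains the actual check.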

Alignment Recovery: From 33% to 2% Harmful Rate

When safety degradation does occur during fine-tuning, the recovery method from arXiv:2504.09757v1 provides a surgical fix. The approach:

  1. Identify the "harmful direction" in the fine-tuned model's weight space — the set of parameters that have drifted most from the original aligned model
  2. Restore a small subset of weight parameters from the original aligned checkpoint
  3. Apply a gradient descent-based rollback mechanism to prevent over-restoration that would degrade task performance
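Steps 1 and 2 can be sketched on a toy checkpoint. This is a deliberately simplified reading (rank parameters by absolute drift, copy the top fraction back), not the paper's procedure; the rollback in step 3 is omitted, and the flat name-to-float dictionaries and the `fraction` parameter are assumptions for illustration.

```python
def restore_most_drifted(aligned, finetuned, fraction=0.1):
    """Copy the most-drifted fraction of parameters back from the
    aligned checkpoint. Both arguments map parameter name -> float.

    Returns (restored_checkpoint, names_restored).
    """
    drift = {name: abs(finetuned[name] - aligned[name]) for name in aligned}
    k = max(1, int(len(drift) * fraction))  # always restore at least one
    to_restore = sorted(drift, key=drift.get, reverse=True)[:k]
    restored = dict(finetuned)  # leave the input checkpoint untouched
    for name in to_restore:
        restored[name] = aligned[name]
    return restored, to_restore


# Toy checkpoints with four scalar "parameters".
aligned = {"w0": 0.0, "w1": 1.0, "w2": 2.0, "w3": 3.0}
finetuned = {"w0": 0.1, "w1": 1.0, "w2": 5.0, "w3": 3.2}
restored, names = restore_most_drifted(aligned, finetuned, fraction=0.25)
print(names, restored)
```

In a real deployment the same idea would operate on tensors, and each restoration would be followed by both a safety evaluation and a task evaluation to decide whether to keep it.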

Evaluation on 125 fine-tuned LLMs: the method reduces the harmful response rate from 33.25% to 1.74% while maintaining downstream task performance. Competing methods either achieve limited harmful rate reduction or significantly degrade normal functionality — the rollback mechanism is critical for avoiding the tradeoff.

The recovery approach requires access to the original aligned model's weights, which is available for open-weight models but not for API-only models. For local harness deployments using Ollama with open-weight base models, this recovery approach is directly applicable.

Synthetic Preference Data: rDPO

Refined DPO (arXiv:2402.08005v1) eliminates human annotation from the preference data generation pipeline. The approach uses a teacher LLM for self-critique prompting — generating both a response and a critique of that response — and then distills the resulting preference pairs to a student model via DPO.

Key outcomes: improvements in safety (resistance to role-playing attacks), reduced sycophancy, and robustness against adversarial prompting — all without any human-annotated preference labels. The method is validated on diverse behavioral alignment tasks.

This is exactly the mechanism the harness Data Flywheel uses: every Wiggum revision round generates a (pre-revision output, post-revision output) pair that is a naturally occurring preference pair. The pre-revision output is "rejected" (failed evaluation), the post-revision output is "chosen" (passed). No human annotation is required. The rDPO framework validates this approach at scale and demonstrates that self-critique-derived preference pairs produce genuine behavioral alignment improvements.
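Extracting those pairs from run logs might look like the sketch below. The record schema (a `prompt` string and a `rounds` list whose entries carry `output` and `passed`) is hypothetical; the actual fields in runs.jsonl may differ.

```python
import json


def preference_pairs(jsonl_lines):
    """Build (chosen, rejected) pairs from consecutive revision rounds
    where a failing output is followed by a passing one."""
    pairs = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        run = json.loads(line)
        rounds = run["rounds"]
        for before, after in zip(rounds, rounds[1:]):
            if not before["passed"] and after["passed"]:
                pairs.append({
                    "prompt": run["prompt"],
                    "chosen": after["output"],
                    "rejected": before["output"],
                })
    return pairs


# Hypothetical log lines for illustration.
sample = [
    json.dumps({"prompt": "p1", "rounds": [
        {"output": "draft", "passed": False},
        {"output": "revised", "passed": True}]}),
    json.dumps({"prompt": "p2", "rounds": [
        {"output": "first try", "passed": True}]}),
]
print(preference_pairs(sample))
```

Runs that pass on the first round contribute no pair, so the dataset is drawn only from cases where the evaluator actually changed the outcome.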

[Figure: Harness Self-Improvement Pipeline: Alignment Integration Points. Where the alignment methods described in this post integrate with the harness self-improvement pipeline. The Wiggum loop generates natural preference pairs; the Data Flywheel exports them; alignment training consumes them.]

Follow-up Likelihood as Reward (FLR)

FLR (arXiv:2409.13948v3) proposes using the likelihood of follow-up utterances as an implicit reward signal. The intuition: in human conversations, positive follow-up reactions (engagement, elaboration, task continuation) indicate response quality without requiring explicit annotation. A high follow-up likelihood suggests the user found the response useful enough to continue.
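The scoring idea can be sketched with the language model abstracted behind a callable. Everything here is an assumption made for illustration: the `logprob_fn(history, follow_up)` interface, the list of positive follow-ups, and the toy scorer that stands in for a real model.

```python
# Hypothetical positive follow-up utterances.
POSITIVE_FOLLOW_UPS = [
    "Thanks, that is exactly what I needed.",
    "Great, can you expand on the second point?",
]


def follow_up_reward(logprob_fn, context, response, follow_ups):
    """Average log-likelihood of the follow-ups given the dialogue so far."""
    history = context + "\n" + response
    return sum(logprob_fn(history, f) for f in follow_ups) / len(follow_ups)


def toy_logprob(history, follow_up):
    """Stand-in for a real LM: pretends positive follow-ups are
    likelier after a response that includes a citation marker."""
    return -1.0 if "[1]" in history else -4.0


cited = follow_up_reward(toy_logprob, "question", "answer [1]",
                         POSITIVE_FOLLOW_UPS)
uncited = follow_up_reward(toy_logprob, "question", "answer",
                           POSITIVE_FOLLOW_UPS)
print(cited, uncited)
```

Swapping `toy_logprob` for a function that queries an actual language model is the part that carries the method; the averaging around it is trivial.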

FLR matches the performance of reward models trained on large-scale human or GPT-4 annotated data across eight pairwise-preference and four rating-based benchmarks. The key advantage is scale: annotated preference data is expensive and slow to produce; follow-up likelihood can be computed from any existing conversation corpus.

For harness deployments, this creates an interesting signal opportunity. If the harness logs whether the user continued working after receiving an output (issued a follow-up task) vs abandoned the session, those continuation signals become implicit preference labels. The logs already capture this in many configurations — the question is whether the signal is extracted and used.

Alignment Fine-Tuning and Chain-of-Thought Quality

The Alignment Fine-Tuning (AFT) paper (arXiv:2309.02144v1) addresses a specific failure in CoT reasoning fine-tuning: assessment misalignment, where the fine-tuned model assigns higher scores to poor-quality reasoning paths, compounding over iterations into lower overall quality.

The fix is a constraint alignment loss that calibrates how the model scores its own reasoning steps — ensuring that the scoring function correctly distinguishes better from worse CoT paths before the DPO training signal is applied. The paper demonstrates this on four reasoning benchmarks and also identifies that the "constraint" aspect (ensuring negative scores are properly suppressed) is critical for other ranking-based methods like DPO, RRHF, and PRO.

For harness systems that use chain-of-thought prompting in the synthesis stage, this finding suggests a validation step: periodically audit whether the model's self-assessment of its reasoning steps is calibrated against actual output quality (as measured by the Wiggum evaluator). Miscalibration here produces a feedback loop where worse reasoning is rewarded, accelerating quality degradation.

The alignment methods surveyed here — DPA, MO-ODPO, MGDA-Decoupled, rDPO, FLR — are not in tension with the harness patterns described in earlier posts. They are the training-time complement to the inference-time patterns: the harness patterns produce high-quality outputs at inference; the alignment methods use those outputs to update the models so the next inference round requires less harness intervention. Together they form the complete self-improvement loop that Post 10 called the Data Flywheel.
