Fine-Tuning and Alignment Deep Cuts: Synthetic Data, Poisoning, and Safety Recovery

May 1, 2026 • 16 min read

Post 14 covered the high-level alignment methods: DPA, MO-ODPO, MGDA-Decoupled, and why high dataset similarity degrades safety guardrails. This post goes deeper: how to generate preference data without human annotators, how adversaries can poison the alignment phase with minimal label flips, how to recover safety alignment after fine-tuning has degraded it, and what actually predicts whether a supervised fine-tuning run will succeed.

Annotation-Free Reward Signals

The standard RLHF/DPO pipeline requires human-annotated preference data: pairs of responses labeled "preferred" and "rejected." Annotation is costly and becomes a bottleneck, particularly for domain-specific deployments where it requires subject matter expertise. Two papers remove the annotation step in different ways.

Follow-up Likelihood as Reward (FLR)

FLR (arXiv:2409.13948v3) observes that in human conversations, follow-up reactions are natural quality signals: a good response elicits engaged follow-up; a poor response elicits correction, dismissal, or silence. The paper operationalizes this as a reward: given two candidate responses, the preferred response is the one for which a model assigns higher likelihood to the follow-up utterance in the conversation.

FLR needs no human annotation. Evaluations across eight pairwise-preference and four rating-based benchmarks show that it is competitive with strong reward models trained on large-scale human or GPT-4-annotated preference data. The same signal can be used to mine preference data automatically from a base model's own generations.
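The comparison described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `flr_prefer` and the `logprob_fn` interface are hypothetical names, and `toy_logprob` is a deliberately crude stand-in for a real language model that would return log P(follow-up | context, response).

```python
from typing import Callable, Tuple

# Hypothetical scorer signature: (context, response, followup) -> log-likelihood
# of the follow-up utterance given the conversation so far.
LogProbFn = Callable[[str, str, str], float]


def flr_prefer(context: str, response_a: str, response_b: str,
               followup: str, logprob_fn: LogProbFn) -> Tuple[str, float]:
    """Return the preferred response label and the log-likelihood margin."""
    score_a = logprob_fn(context, response_a, followup)
    score_b = logprob_fn(context, response_b, followup)
    return ("a" if score_a >= score_b else "b", score_a - score_b)


def toy_logprob(context: str, response: str, followup: str) -> float:
    """Stand-in for a real LM: scores word overlap between response and follow-up."""
    overlap = set(response.lower().split()) & set(followup.lower().split())
    return -5.0 + len(overlap)


label, margin = flr_prefer(
    context="How do I fix the stale build?",
    response_a="Clear the cache and restart.",
    response_b="I do not know.",
    followup="Thanks, the cache fix worked.",
    logprob_fn=toy_logprob,
)
```

In practice `logprob_fn` would wrap a language model that sums token log-probabilities of the follow-up; the margin can then be stored alongside the pair as a confidence signal.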

For the harness, this is a direct enabler of the Data Flywheel described in Post 10. The flywheel currently accumulates JSONL run traces. Each run trace includes the producer's output and the evaluator's response to that output. The evaluator's response is precisely the follow-up to the producer—and if the evaluator gives a high score with substantive follow-up detail, that is a positive preference signal for the producer's response. FLR makes it possible to extract preference data from the existing run log without any additional annotation step.

Refined DPO with Synthetic Preference Pairs (rDPO)

rDPO (arXiv:2402.08005v1) eliminates human annotation through a different mechanism: a teacher LLM generates synthetic preference pairs via self-critique prompting, and a student LLM is trained on those pairs via DPO. The teacher generates both a "preferred" version (with improvements applied) and a "rejected" version (the original) for each training example.

The method improves safety and robustness against role-playing attacks, and reduces sycophancy in the student model. The released code makes this directly replicable. For the harness, rDPO represents a path to fine-tuning the producer model on domain-specific research tasks without requiring labeled preference data: the planner or evaluator model can act as the teacher, generating synthetic preferred/rejected pairs for producer outputs that already have dimensional rubric scores.
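To make the training signal concrete, the sketch below builds one synthetic pair and evaluates the standard DPO loss on it. The record layout in `make_pair` and the example log-probabilities are assumptions for illustration; only the loss formula itself is the standard DPO objective, and nothing here is taken from the rDPO code release.

```python
import math


def make_pair(prompt: str, original: str, revised: str) -> dict:
    """Hypothetical record layout: teacher revision is chosen, original is rejected."""
    return {"prompt": prompt, "chosen": revised, "rejected": original}


def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair, given summed sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    x = -beta * margin
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


pair = make_pair(
    prompt="Summarize the findings.",
    original="Everything looks great.",
    revised="Two of five claims lack sources; details follow.",
)

# Illustrative numbers only: the policy has shifted toward the chosen response.
loss = dpo_loss(policy_chosen_lp=-12.0, policy_rejected_lp=-15.0,
                ref_chosen_lp=-13.0, ref_rejected_lp=-13.0)
```

When the policy and reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen response relative to the rejected one.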

Poisoning the Alignment Phase

If the harness is using the Data Flywheel to generate training data for downstream fine-tuning, the alignment phase itself becomes an attack surface. The label-flipping poisoning paper (arXiv:2511.09105v1) formalizes the theoretical foundations of this attack.

The attack targets RLHF/DPO by flipping preference labels—swapping preferred and rejected—for a subset of training examples. The minimum cost required to achieve a given behavioral change is formulated as a convex optimization problem. The paper derives lower and upper bounds on the attack cost and proposes a post-processing method that reduces the number of label flips needed compared to existing attacks while preserving the poisoning effect.

Minimum-cost poisoning is a tractable optimization. The attack cost is a convex function of the dataset characteristics. In the regime where the reward model's feature dimension is small relative to dataset size (which is the common case for small fine-tuned models), the post-processing significantly reduces the required number of flips. An adversary with access to a fraction of the training pipeline can efficiently steer the resulting model's behavior.
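A toy example shows why a few flips can matter. It is not the paper's convex program or its post-processing step: it assumes a linear Bradley-Terry reward model, looks only at the log-likelihood gradient at w = 0, and uses a hypothetical greedy heuristic (`greedy_flips`) to pick which labels to flip.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def ascent_direction(diffs: List[List[float]], labels: List[int]) -> List[float]:
    """Gradient of the Bradley-Terry log-likelihood at w = 0.

    diffs[i] is the feature difference (response 1 minus response 2) and
    labels[i] is +1 if response 1 is preferred, -1 otherwise.
    At w = 0 each example contributes 0.5 * label * diff.
    """
    dim = len(diffs[0])
    return [0.5 * sum(y * d[k] for d, y in zip(diffs, labels)) for k in range(dim)]


def greedy_flips(diffs: List[List[float]], labels: List[int],
                 target: Sequence[float]) -> Tuple[List[int], List[int]]:
    """Flip the labels that most oppose `target` until the gradient aligns with it."""
    labels = list(labels)
    order = sorted(range(len(diffs)), key=lambda i: labels[i] * dot(diffs[i], target))
    flipped = []
    for i in order:
        if dot(ascent_direction(diffs, labels), target) > 0:
            break
        labels[i] = -labels[i]
        flipped.append(i)
    return labels, flipped


diffs = [[1.0, 0.0], [2.0, 0.0], [0.5, 0.0], [3.0, 0.0], [0.2, 0.0]]
clean_labels = [1, 1, 1, 1, 1]
poisoned_labels, flipped = greedy_flips(diffs, clean_labels, target=[-1.0, 0.0])
```

In this toy dataset, flipping the two examples with the largest feature differences is enough to reverse the initial update direction, which is the intuition behind treating flip count as a cost to minimize.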

For the harness's RL Rollout component (Post 10), this is a concrete threat: if the JSONL audit log that feeds into training data generation is accessible to an adversary, a small number of modified entries can systematically shift the producer's behavior in a targeted direction. The defense is data provenance: each training example in the flywheel should carry a cryptographic hash of the original run, keyed or stored outside the log so that an attacker cannot simply recompute it, and any modification invalidates the hash. This is the same provenance verification recommended for inter-agent messages in Post 20.
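One way to implement the provenance check is sketched below. It uses HMAC-SHA256 from the Python standard library rather than a bare hash, because a plain hash stored next to the record can be recomputed by anyone able to edit the record. The field names (`run_id`, `chosen`, `rejected`) and the key handling are assumptions, not the harness's actual schema.

```python
import hashlib
import hmac
import json


def sign_example(example: dict, key: bytes) -> str:
    """Return an HMAC-SHA256 tag over a canonical JSON encoding of the example."""
    payload = json.dumps(example, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_example(example: dict, tag: str, key: bytes) -> bool:
    """Constant-time check that the example still matches its tag."""
    return hmac.compare_digest(sign_example(example, key), tag)


# Hypothetical key; in a real deployment it would live outside the log store.
SIGNING_KEY = b"replace-with-a-secret-key"

record = {"run_id": "run-0001", "chosen": "response A", "rejected": "response B"}
tag = sign_example(record, SIGNING_KEY)

# A label flip (swapping chosen and rejected) no longer verifies.
tampered = dict(record, chosen="response B", rejected="response A")
```

Sorting keys and fixing separators keeps the encoding canonical, so the same record always yields the same tag regardless of dictionary ordering.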

Safety Recovery After Fine-Tuning

Post 14 documented that high similarity between safety alignment datasets and downstream fine-tuning datasets degrades safety guardrails by up to 10.33%. The "Alleviating Fear" paper (arXiv:2504.09757v1) addresses the complementary problem: once a model has been fine-tuned and its safety alignment has been compromised, how do you recover it without retraining from scratch?

The method was evaluated on 125 fine-tuned LLMs. A gradient descent-based process identifies the "harmful direction" in the fine-tuned model's weights, that is, the weight subspace tied to safety-relevant behavior, and a rollback mechanism prevents aggressive restoration that would impair downstream task performance. The harmful response rate drops from 33.25% to 1.74% while task performance is largely preserved. Existing safety restoration methods, by comparison, either achieve limited reductions in harmful rate or significantly damage functional capabilities.

The method requires access to the original aligned model's weights (to compute the harmful direction by comparison). For the harness, this means the aligned base model checkpoint should be preserved and not overwritten when fine-tuning on domain-specific data. If safety degradation is detected (via the harness's own safety evaluation patterns, Post 9), the restoration procedure can be applied without full retraining.

Safety alignment dynamics: dataset similarity impact on guardrail degradation (arXiv:2506.05346v1, covered in Post 14) and post-fine-tuning harmful rate recovery via gradient-based weight restoration (arXiv:2504.09757v1).

Alignment Fine-Tuning for CoT Assessment

AFT (arXiv:2309.02144v1) addresses a specific failure mode in reasoning fine-tuning: after fine-tuning on chain-of-thought (CoT) data, LLMs often exhibit assessment misalignment—they assign higher scores to poor-quality reasoning paths than to correct ones, despite the explicit training signal. This is a calibration failure, not a capability failure.

Alignment Fine-Tuning introduces a constraint alignment loss that calibrates how the model scores its own reasoning steps. The constraint penalizes cases where a negative (incorrect) reasoning path receives a higher score than a positive (correct) one, with the constraint explicitly designed to maintain score discrimination while preserving stability. Experiments on four reasoning benchmarks demonstrate AFT's effectiveness, and the analysis reveals that the overlooked "constraint" aspect is crucial for the performance of other ranking-based methods like DPO, RRHF, and PRO.
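The idea of a ranking loss with a stabilizing constraint can be illustrated as follows. This is a simplified stand-in and not the exact formulation from the AFT paper: the clamp-based floor and the `floor_gap` parameter are assumptions chosen to show the mechanism.

```python
import math
from typing import List


def ranking_loss(pos_scores: List[float], neg_scores: List[float]) -> float:
    """Pairwise loss that grows when any negative path outscores a positive one."""
    return math.log1p(sum(math.exp(n - p) for p in pos_scores for n in neg_scores))


def constrained_ranking_loss(pos_scores: List[float], neg_scores: List[float],
                             floor_gap: float = 2.0) -> float:
    """Same loss, but negatives already below a floor stop contributing.

    Assumption: the floor sits `floor_gap` below the lowest positive score.
    Clamping means there is no further reward for pushing an already
    well-separated negative even lower, which keeps scores from collapsing.
    """
    floor = min(pos_scores) - floor_gap
    clamped = [max(n, floor) for n in neg_scores]
    return ranking_loss(pos_scores, clamped)


# Scores here would be the model's own scores for sampled reasoning paths,
# for example mean token log-probabilities.
well_ordered = ranking_loss([0.0], [-1.0])
misordered = ranking_loss([0.0], [1.0])
```

With the clamp in place, a negative path scored at -10 and one scored at -100 produce the same loss, so optimization pressure goes to the pairs that are still misordered.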

For the harness, the Wiggum evaluator is tasked with scoring producer output across six dimensions, including depth_r1 and breadth_r1. If the evaluator model has been fine-tuned on domain data, it may exhibit the same assessment misalignment that AFT targets: scoring shallow outputs highly because they match the surface form of high-scoring examples in the training data. AFT-style constraint alignment on the evaluator fine-tuning run would address this directly.

What Actually Predicts SFT Effectiveness

The massive SFT experiments paper (arXiv:2506.14681v2) addresses a practical question that practitioners answer by trial and error: given a base model and a fine-tuning dataset, will the SFT run be effective? The study trains over 1,000 SFT models under controlled conditions across diverse datasets and base models to find reliable predictors.

Key predictors of SFT effectiveness:

(1) Perplexity consistently predicts SFT effectiveness, often outperforming dataset similarity measures between training data and evaluation benchmarks. Low perplexity of training data under the base model signals high alignment with the model's learned distribution and predicts strong post-training performance.

(2) Mid-layer weight changes correlate most strongly with performance gains. Shallow and deep layer modifications are less predictive than changes in the middle layers.

(3) Model-specific strategies outperform generic protocols; training-task synergies vary substantially across model families.

For the harness's Data Flywheel, this provides a filter: before submitting a batch of generated training examples to a fine-tuning run, compute the perplexity of those examples under the target base model. Low-perplexity batches predict effective fine-tuning; high-perplexity batches signal distributional mismatch that will reduce fine-tuning gains. This is a cheap pre-screening step that avoids wasted compute on training runs likely to underperform.
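The pre-screening step can be sketched as below. It assumes per-token natural-log probabilities under the base model are already available for each example (how they are obtained is left out), and the `max_ppl` threshold is a hypothetical tuning knob rather than a value from the paper.

```python
import math
from typing import Dict, List, Tuple


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp of the negative mean token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def screen_batches(batches: Dict[str, List[List[float]]],
                   max_ppl: float) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Split batches into (kept, dropped) by mean per-example perplexity."""
    kept, dropped = [], []
    for name, examples in batches.items():
        mean_ppl = sum(perplexity(lp) for lp in examples) / len(examples)
        (kept if mean_ppl <= max_ppl else dropped).append((name, mean_ppl))
    return kept, dropped


# Each inner list holds token log-probs for one training example.
batches = {
    "in_distribution": [[-0.5, -0.5], [-1.0, -1.0]],
    "mismatched": [[-4.0, -4.0]],
}
kept, dropped = screen_batches(batches, max_ppl=5.0)
```

Averaging per-example perplexity is one of several reasonable aggregations; a token-weighted average over the whole batch would favor longer examples instead.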

Federated Pluralistic Alignment

The federated RLHF paper (arXiv:2512.08786v2) addresses alignment in federated learning settings where raw preference data cannot be shared across parties. Standard aggregation methods (averaging preference weights across clients) fail to represent diverse viewpoints, systematically under-weighting minority preference groups.

The paper introduces an adaptive aggregation scheme that dynamically adjusts preference weights based on each group's historical alignment performance. Experiments on question-answering tasks using a PPO-based RLHF pipeline show consistent improvements in fairness while maintaining competitive alignment scores. The adaptive scheme achieves superior fairness compared to standard aggregation without sacrificing overall alignment quality.
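A small sketch makes the contrast with plain averaging concrete. The weighting rule here, a softmax over negated historical scores so that poorly served groups gain weight, is an assumption for illustration; the paper's actual update rule may differ, and `adaptive_weights`, `aggregate`, and `temperature` are hypothetical names.

```python
import math
from typing import Dict, List


def adaptive_weights(history: Dict[str, List[float]],
                     temperature: float = 1.0) -> Dict[str, float]:
    """Give more weight to groups whose past alignment scores were lower.

    history maps a group name to its past alignment scores (higher is better).
    """
    means = {g: sum(s) / len(s) for g, s in history.items()}
    logits = {g: -m / temperature for g, m in means.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {g: math.exp(v - top) for g, v in logits.items()}
    total = sum(exps.values())
    return {g: e / total for g, e in exps.items()}


def aggregate(group_rewards: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted combination of per-group reward signals for one response."""
    return sum(weights[g] * r for g, r in group_rewards.items())


history = {"majority": [0.9, 0.8, 0.85], "minority": [0.4, 0.5, 0.45]}
weights = adaptive_weights(history, temperature=0.5)
reward = aggregate({"majority": 1.0, "minority": 0.0}, weights)
```

Only weights and scalar scores cross the group boundary in this sketch, which matches the federated constraint that raw preference data stays with each party.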

For multi-operator harness deployments where different users or teams have different quality preferences, federated pluralistic alignment is a principled way to aggregate those preferences without forcing a single scalar objective.

Cross-Lingual and Multimodal Alignment

Middle-layer representation alignment (arXiv:2502.14830v3) proposes integrating a cross-lingual alignment objective directly into the task-specific training process. The insight is that middle layers offer the strongest potential for cross-lingual alignment—the abstract semantic representations learned by mid-layers generalize across languages in a way that the early (lexical) and late (task-specific) layers do not.

Experiments on slot filling, machine translation, and structured text generation show consistent improvements in cross-lingual transfer, particularly for lower-resource languages. The approach generalizes to languages not seen during the alignment phase. Separately trained alignment modules can be merged with task-specific modules without costly full model retraining.

This connects to the multilingual AaaJ finding in Post 19: the decisive factor for reliable evaluation across languages is localizing the judge-side instructions. The middle-layer alignment result suggests one possible explanation: without an explicit alignment objective, the judge's representation of its evaluation task in the middle layers remains language-dependent, even when its surface token predictions look robust across languages.

Vision-Flan (arXiv:2402.11690v1) introduces a visual instruction tuning dataset with 187 tasks and roughly 1.6 million instances, with an expert-written instruction for each task. The key finding for instruction tuning practice: a minimal quantity of GPT-4 synthesized data (1,000 instances) is sufficient to align VLM response format with human preferences, while the 1.6M diverse human-labeled examples provide core visual understanding capabilities. The two-stage approach (diverse human data first, targeted synthetic alignment second) significantly outperforms single-stage methods and achieves state-of-the-art results on multimodal benchmarks. For the harness, this suggests an ordering for any fine-tuning pipeline: diversity for capability first, then targeted synthesis for format alignment.

SFT effectiveness predictors: the relationship between perplexity and alignment quality (left), and mid-layer vs. other-layer weight change correlation with performance (right), from arXiv:2506.14681v2.

Design Implications

Finding	Source	Harness Implication
Follow-up likelihood is competitive with GPT-4-annotated reward models without annotation	FLR (2409.13948)	The evaluator's response to producer output is a native preference signal; mine it from existing run logs via FLR to generate training data without human labeling
Teacher LLM self-critique generates effective synthetic preference pairs	rDPO (2402.08005)	Use planner or evaluator as teacher to generate preferred/rejected pairs for producer outputs that already have dimensional rubric scores
Label-flipping poisoning is a convex optimization problem with tractable minimum cost	Poisoning (2511.09105)	Attach cryptographic hash to each Data Flywheel training example; invalidate on modification; treat the run log as a tamper-evident data store
Harmful rate drops 33.25%→1.74% via targeted weight restoration in 125 fine-tuned models	Alleviating Fear (2504.09757)	Preserve the aligned base model checkpoint; apply gradient-based safety restoration if post-fine-tuning safety evaluation detects degradation
Perplexity predicts SFT effectiveness better than dataset similarity; mid-layer changes correlate with gains	Massive SFT (2506.14681)	Pre-screen flywheel training batches by computing perplexity under the target base model; discard high-perplexity batches before submitting fine-tuning runs
AFT constraint alignment loss fixes CoT assessment misalignment	AFT (2309.02144)	Apply AFT-style constraint loss when fine-tuning the Wiggum evaluator to prevent it from assigning high scores to shallow outputs that match high-scoring surface forms
