This repository releases the core pipeline of Answer Divergence-Guided Selection (ADG) for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines dispersion magnitude and shape anisotropy, then performs bin-wise selection for semantic coverage.
Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.
For each instruction, ADG:
- samples multiple answers with relatively high-temperature decoding,
- maps answers into a representation space,
- computes geometry-aware scores from the sampled answers,
- ranks examples by the combined score,
- performs proportional selection within semantic bins.
This repository provides the practical pipeline for:
- multi-sample answer generation,
- instruction embedding and clustering,
- ADG scoring and subset selection,
- model training,
- benchmark evaluation,
- optional task-type analysis.
This repository includes the following components:

- `ADG/ADG_llama.py`: ADG scoring and selection for the LLaMA backbone.
- `ADG/ADG_qwen.py`: ADG scoring and selection for the Qwen backbone.
- `generation/generation.py`: generates multiple sampled answers for each instruction.
- `generation/embedding/embed.py`: builds instruction embeddings and performs clustering for bin-wise selection.
- `train/train_llama.sh`: training entry script for LLaMA.
- `train/train_qwen.sh`: training entry script for Qwen.
- `train/training/stanford_alpaca/`: training utilities and backbone-specific training scripts.
- `eval/eval.sh`: evaluation script based on lm-evaluation-harness.
- `analysis/analyse.py`: optional task-type classification script for analyzing selected data.
- `requirements.txt`: required Python packages for this repository.
.
├── README.md
├── README_zh.md
├── requirements.txt
├── ADG/
│ ├── ADG_llama.py
│ └── ADG_qwen.py
├── generation/
│ ├── generation.py
│ └── embedding/
│ └── embed.py
├── analysis/
│ └── analyse.py
├── eval/
│ └── eval.sh
└── train/
├── train_llama.sh
├── train_qwen.sh
└── training/
└── stanford_alpaca/
├── train_llama.py
├── train_qwen.py
├── utils.py
└── configs/
We recommend Python 3.10 or above.
Example:
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt

Depending on your environment, you may also need to install GPU-specific packages separately.
ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:
{
"id": 0,
"instruction": "Write a short explanation of transformers.",
"input": "",
"output": "Transformers are neural networks based on self-attention..."
}

Notes:

- `id` should uniquely identify each example.
- `instruction` is required.
- `input` is optional and can be empty or omitted.
- `output` is the reference response in the original instruction dataset.
- Other instruction datasets can be used as long as they are converted into this format.
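As a quick sanity check before generation, records can be validated against this schema. The helper below is a hypothetical sketch, not part of the repository:

```python
def validate_example(record: dict) -> bool:
    """Check that a record matches the expected instruction schema."""
    if not isinstance(record.get("id"), int):
        return False
    if not isinstance(record.get("instruction"), str) or not record["instruction"]:
        return False
    # "input" is optional; if present it must be a string (possibly empty).
    if "input" in record and not isinstance(record["input"], str):
        return False
    return isinstance(record.get("output"), str)

example = {
    "id": 0,
    "instruction": "Write a short explanation of transformers.",
    "input": "",
    "output": "Transformers are neural networks based on self-attention...",
}
print(validate_example(example))  # True
```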
After answer generation, the intermediate JSONL file contains records like:
{
"id": 0,
"instruction": "Write a short explanation of transformers.",
"output": "Transformers are neural networks based on self-attention...",
"generated_answers": [
"...",
"...",
"...",
"...",
"..."
]
}

The practical workflow is:
instruction pool
-> generation/generation.py
-> multi-sample answer JSONL
-> generation/embedding/embed.py
-> instruction embeddings + cluster labels
-> ADG/ADG_llama.py or ADG/ADG_qwen.py
-> top / middle / bottom selected subsets
-> train/train_*.sh
-> finetuned checkpoints
-> eval/eval.sh
Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.
Before running, update the following variables in generation/generation.py:
`MODEL_NAME`, `OUTPUT_DIR`, `OUTPUT_FILE`
Then run:
cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py --input_file /path/to/your/instruction_data.json --batch_size 32

Before running, update the following variables in generation/embedding/embed.py:
`MODEL_NAME`, `INPUT_JSONL`, `EMBEDDINGS_PATH`, `CLUSTERS_PATH`, `K_CLUSTERS`
Then run:
torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py

Choose the scoring script that matches your backbone.
For LLaMA, configure these variables in ADG/ADG_llama.py:
`model_name`, `INPUT_JSONL`, `OUTPUT_DIR`, `EMBEDDINGS_PATH`, `CLUSTERS_PATH`, `K_CLUSTERS`, `FINAL_SELECT_COUNT`
Then run:
python ADG/ADG_llama.py

For Qwen, configure these variables in ADG/ADG_qwen.py:
`model_name`, `INPUT_JSONL`, `OUTPUT_DIR`, `EMBEDDINGS_PATH`, `CLUSTERS_PATH`, `CHECKPOINT_DIR`, `FINAL_SELECT_COUNT`
Then run:
python ADG/ADG_qwen.py

The selector saves `top.json`, `middle.json`, and `bottom.json` under the configured `OUTPUT_DIR`.
Use the selected subset, typically top.json, for instruction tuning.
For LLaMA:
cd train
bash train_llama.sh

For Qwen:
cd train
bash train_qwen.sh

Before running, update paths such as `--model_name_or_path`, `--data_path`, and `--output_dir`.
This repository uses lm-evaluation-harness for benchmark evaluation.
Install it first if needed:
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .

Then configure MODEL_PATH and output paths in eval/eval.sh, and run:
cd eval
bash eval.sh

The evaluation script currently includes:
- BBH
- GSM8K
- MMLU
- TruthfulQA
- MBPP
- HumanEval
ADG is built around two complementary signals derived from multiple sampled answers:
- Dispersion magnitude: measures how widely the sampled answers spread in representation space.
- Shape anisotropy: measures whether the spread is multi-directional rather than dominated by a single direction.
The final ADG score combines these two parts, and the selected subset is obtained through semantic bin-wise ranking. This design helps avoid collapsing selection into only a few dense instruction regions.
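To illustrate these two signals on a matrix of sampled-answer embeddings (the exact formulas are defined in the paper; the function name and specific formulations below are our own illustrative choices, not the repository's implementation):

```python
import numpy as np

def adg_style_scores(answer_embeddings: np.ndarray):
    """Illustrative geometry scores over sampled-answer embeddings of shape
    (n_answers, dim).

    Dispersion magnitude: mean distance of the answers to their centroid.
    Shape anisotropy: how evenly the variance spreads across directions,
    taken here as 1 minus the top singular direction's share of total variance.
    """
    centered = answer_embeddings - answer_embeddings.mean(axis=0, keepdims=True)
    dispersion = float(np.linalg.norm(centered, axis=1).mean())
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    anisotropy = float(1.0 - var.max() / var.sum()) if var.sum() > 0 else 0.0
    return dispersion, anisotropy

rng = np.random.default_rng(0)
# A spread confined to one direction vs. a multi-directional cloud.
line = rng.normal(size=(5, 1)) * np.ones((1, 8)) * 0.1
cloud = rng.normal(size=(5, 8))
print(adg_style_scores(cloud)[1] > adg_style_scores(line)[1])  # True
```

A one-directional spread yields a near-zero anisotropy score, while a multi-directional cloud scores higher, matching the intuition described above.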
Main functionality:
- load the base model,
- sample multiple answers for each instruction,
- save generated answers in JSONL format,
- support distributed generation.
Main functionality:
- build instruction embeddings,
- run clustering,
- save instruction embeddings and cluster labels,
- provide the semantic bins used by ADG selection.
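To show how cluster labels define the semantic bins, here is a minimal pure-NumPy k-means sketch on toy embeddings; `embed.py` may use a different embedding model and clustering implementation:

```python
import numpy as np

def kmeans_labels(X: np.ndarray, k: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    """Minimal k-means: returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two well-separated toy "instruction embedding" groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, size=(10, 4)), rng.normal(5, 0.1, size=(10, 4))])
labels = kmeans_labels(X, k=2)  # each group collapses into a single bin
```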
Main functionality:
- read the generated-answer JSONL file,
- compute answer-geometry metrics,
- combine metrics into the ADG score,
- perform proportional cluster-based selection,
- save `top.json`, `middle.json`, and `bottom.json`.
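The proportional cluster-based step can be sketched as follows (the `binwise_select` helper is hypothetical; the repository's selector may apportion quotas and break ties differently):

```python
from collections import defaultdict

def binwise_select(ids, scores, clusters, budget):
    """Pick the highest-scoring examples from each cluster,
    with per-cluster quotas proportional to cluster size."""
    by_cluster = defaultdict(list)
    for i, s, c in zip(ids, scores, clusters):
        by_cluster[c].append((s, i))
    total = len(ids)
    selected = []
    for items in by_cluster.values():
        quota = round(budget * len(items) / total)
        items.sort(reverse=True)  # highest score first
        selected.extend(i for _, i in items[:quota])
    return selected

ids = list(range(8))
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
clusters = [0, 0, 0, 0, 1, 1, 1, 1]  # two equal-sized semantic bins
print(sorted(binwise_select(ids, scores, clusters, budget=4)))  # [0, 2, 4, 6]
```

Because every bin contributes its share of the budget, selection cannot collapse into a single dense instruction region even if one cluster holds all the top global scores.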
Main functionality:
- compute ADG metrics for Qwen-generated answers,
- support checkpoint-based resumption,
- perform the same top / middle / bottom selection pipeline.
Main functionality:
- classify instructions into coarse task categories,
- support optional data-level analysis of selected subsets.
Main functionality:
- launch distributed full fine-tuning,
- use the selected subset for instruction tuning.
Main functionality:
- run benchmark evaluation with lm-evaluation-harness,
- support reasoning, knowledge, and coding tasks.
Most scripts use placeholder paths. Update all required paths before running.
Make sure the generation backbone, embedding backbone, ADG scoring script, and training script are aligned.
The selector depends on:
- generated answer JSONL,
- instruction embeddings,
- clustering results.
Run the previous stages before starting ADG selection.
Generation, embedding, and scoring all use hidden-state-based processing. You may need to reduce batch size or adjust GPU allocation depending on your hardware.
eval/eval.sh depends on lm-evaluation-harness. Install it separately before running evaluation.
If you use this repository, please cite the paper.