This repository releases the core code, training pipeline, inference scripts, and evaluation utilities for GRIP (Generation-guided Retrieval with Information Planning), a unified Retrieval-as-Generation framework for dynamic retrieval-augmented generation.
Instead of treating retrieval as an external controller decision, GRIP internalizes retrieval behavior into token-level decoding through explicit control tokens such as [RETRIEVE], [INTERMEDIARY], [ANSWER], and [SOLVED]. This design enables the model to decide when to retrieve, how to reformulate follow-up queries, and when to stop, all within a single autoregressive trajectory.
GRIP is built around a simple idea: retrieval control should be part of generation itself.
Under the Retrieval as Generation paradigm, the model emits special control tokens during decoding to regulate retrieval behavior. A typical GRIP trajectory may:
- answer directly when internal knowledge is sufficient,
- emit an intermediate response when information is incomplete,
- trigger retrieval with the original or a refined query,
- continue multi-step retrieval when needed,
- terminate with a final answer once the question is resolved.
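The trajectory patterns above can be pictured as a simple decoding loop. The sketch below is only an illustration, not the released inference code: `generate` and `search` are hypothetical stand-ins for the model's next-segment decoding and the retriever.

```python
# Illustrative GRIP-style control loop (not the released inference/agent.py).
# `generate` and `search` are hypothetical stand-ins for model and retriever.
RETRIEVE, ANSWER, SOLVED = "[RETRIEVE]", "[ANSWER]", "[SOLVED]"

def run_trajectory(question, generate, search, max_steps=5):
    context = question
    for _ in range(max_steps):
        step = generate(context)      # model emits text plus control tokens
        context += "\n" + step        # intermediary text simply stays in context
        if SOLVED in step:            # question resolved: extract final answer
            return step.split(ANSWER)[-1].split(SOLVED)[0].strip()
        if RETRIEVE in step:          # model asked for more evidence
            query = step.split(RETRIEVE)[-1].strip()
            context += "\n" + "\n".join(search(query))
    return context                    # give up after max_steps
```

When internal knowledge suffices, the model emits `[ANSWER] ... [SOLVED]` on the first step and the loop never retrieves; otherwise each `[RETRIEVE]` round appends evidence before the next decoding step.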
This repository provides practical code for:
- structured training data construction,
- supervised fine-tuning for token-controlled retrieval behavior,
- rule-based RL fine-tuning with DAPO,
- local multi-step inference,
- benchmark evaluation on QA datasets,
- Wikipedia indexing for BM25-based retrieval.
- **Unified token-level retrieval control**: retrieval timing, query reformulation, and stopping are all represented as trainable decoding actions.
- **Self-Triggered Information Planning**: the model learns to judge information sufficiency and decide whether more evidence is needed.
- **Structured supervision for retrieval behaviors**: four training types teach the model direct answering, retrieval triggering, multi-hop planning, and answer completion.
- **One-step decision optimization for multi-step retrieval**: GRIP learns multi-step retrieval behavior through one-step decision optimization instead of long-horizon search-policy optimization, making it simpler and more stable while preserving adaptive depth and controllable stopping.
This repository includes the following components.

- `data_generation/first.sh`: entry script for the first-stage data construction pipeline.
- `data_generation/make_first_steps.py`: builds the initial A/B/C/D-style structured data with retrieval and answerability signals.
- `data_generation/use_gpt_for_data.py`: refines specific training cases with GPT-based query rewriting and intermediary correction.
- `data_generation/merge_dataset.py`: merges structured subsets into the final SFT and RL training data.
- `data_generation/index.py`: builds the Elasticsearch index for the Wikipedia passage corpus.
- `inference/agent.py`: main multi-step GRIP inference script.
- `inference/inference.sh`: example launch script for distributed inference.
- `eval/eval.py`: computes EM, F1, ROUGE, and other metrics from reference and prediction files.
- `eval/utils.py`: evaluation utilities.
- `train/examples/data_preprocess/grip/sft.py`: converts GRIP SFT training data into parquet format.
- `train/examples/data_preprocess/grip/rl.py`: converts GRIP RL training data into parquet format.
- `train/examples/sft/run_sft_llama.sh`: SFT training script for the LLaMA backbone.
- `train/recipe/dapo/dapo_4w_continue_rl_ep3_llama.sh`: RL fine-tuning script based on DAPO.
- `train/scripts/merge.sh`: merges sharded checkpoints into Hugging Face format after RL training.
- `requirements.txt`: main repository dependencies.
- `train/requirements.txt`: additional dependencies for the training framework.
```
.
├── README.md
├── README_zh.md
├── requirements.txt
├── data_generation/
│   ├── first.sh
│   ├── index.py
│   ├── make_first_steps.py
│   ├── merge_dataset.py
│   └── use_gpt_for_data.py
├── eval/
│   ├── eval.py
│   └── utils.py
├── inference/
│   ├── agent.py
│   └── inference.sh
└── train/
    ├── README.md
    ├── pyproject.toml
    ├── requirements.txt
    ├── setup.py
    ├── examples/
    │   ├── data_preprocess/grip/
    │   │   ├── sft.py
    │   │   └── rl.py
    │   └── sft/
    │       └── run_sft_llama.sh
    ├── recipe/
    │   └── dapo/
    │       └── dapo_4w_continue_rl_ep3_llama.sh
    ├── scripts/
    │   └── merge.sh
    └── verl/
```

The standard workflow uses the lowercase `train/` directory shown above.
We recommend creating two environments.

```bash
conda create -n grip python=3.9
conda activate grip
pip install -r requirements.txt
```

```bash
cd train
pip install -e .
pip install -r requirements.txt
```

Please keep the installation order consistent with the training framework requirements.
The current release includes the following Hugging Face resources:
- GRIP-Llama-3-8B: `WisdomShell/GRIP-Llama-3-8B`
- GRIP_SFT_Data: `WisdomShell/GRIP_SFT_Data`
- GRIP_RL_Data: `WisdomShell/GRIP_RL_Data`
You can also access the released collection from the repository badge.
The practical workflow is:

```
Wikipedia passages
  -> data_generation/index.py
  -> Elasticsearch index

Raw QA training data
  -> data_generation/first.sh
  -> A / B / C / D structured subsets
  -> data_generation/use_gpt_for_data.py
  -> refined subset C
  -> data_generation/merge_dataset.py
  -> SFT_data.jsonl + RL_data.jsonl
  -> train/examples/data_preprocess/grip/sft.py or rl.py
  -> parquet datasets
  -> SFT training
  -> RL training with DAPO
  -> train/scripts/merge.sh
  -> GRIP checkpoint
  -> inference/agent.py
  -> eval/eval.py
```
Download the Wikipedia passage dump:
```bash
mkdir wiki_data
cd wiki_data
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
gzip -d psgs_w100.tsv.gz
cd ..
```

Set up Elasticsearch and build the index:
```bash
mkdir ret
cd ret
wget -O elasticsearch-7.17.9.tar.gz https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.17.9-linux-x86_64.tar.gz
tar zxvf elasticsearch-7.17.9.tar.gz
rm elasticsearch-7.17.9.tar.gz
cd elasticsearch-7.17.9
nohup bin/elasticsearch &
cd ../..

python data_generation/index.py --data_path /path/to/psgs_w100.tsv --index_name wiki
```

Before generating GRIP training data, combine the raw QA training sets into JSONL format:
```json
{
  "question": "Who wrote The Old Man and the Sea?",
  "answer": ["Ernest Hemingway"]
}
```

The original project uses training data from:
- Natural Questions Open
- WebQuestions
- TriviaQA
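As an illustration, raw QA records can be normalized into the JSONL format shown above with a short script. The input field names (`question`, `answers`) are assumptions; each raw dataset has its own schema.

```python
# Illustrative merge of raw QA records into GRIP's JSONL format.
# Input field names ("question", "answers") are assumptions; the real raw
# datasets (NQ Open, WebQuestions, TriviaQA) each use their own schema.
import json

def to_grip_jsonl(records, out_path):
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            answers = rec["answers"]
            row = {
                "question": rec["question"],
                # the target format expects a list of gold answers
                "answer": answers if isinstance(answers, list) else [answers],
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```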
Update these variables in data_generation/first.sh:
- `MODEL_DIR`
- `INPUT_FILE`
- `BASE_OUTPUT_DIR`
- `ES_HOST`
- `ES_INDEX`

Then run:

```bash
bash data_generation/first.sh
```

This stage creates the initial structured subsets for the four GRIP behavior types.
Configure your OpenAI-compatible API settings in data_generation/use_gpt_for_data.py, especially:
- `base_url`
- `api_key`
- `INPUT_FILE`

Then run:

```bash
python data_generation/use_gpt_for_data.py
```

This stage rewrites specific training cases into refined `[INTERMEDIARY] ... [RETRIEVE] ...` patterns.
Update the paths in data_generation/merge_dataset.py:
- `input_dir`
- `output_jsonl_dir`

Then run:

```bash
python data_generation/merge_dataset.py
```

The script produces:

- `SFT_data.jsonl`
- `RL_data.jsonl`
Before running, update --data_path in:
train/examples/data_preprocess/grip/sft.py
Example:
```bash
python train/examples/data_preprocess/grip/sft.py \
  --data_path /path/to/SFT_data.jsonl \
  --save_dir datasets/GRIPSFT
```

This produces:

- `datasets/GRIPSFT/train.parquet`
- `datasets/GRIPSFT/test.parquet`
Before running, update --data_path and --data_source in:
train/examples/data_preprocess/grip/rl.py
Example:
```bash
python train/examples/data_preprocess/grip/rl.py \
  --data_path /path/to/RL_data.jsonl \
  --save_dir datasets/GRIPRL \
  --data_source GRIPRL
```

This produces:

- `datasets/GRIPRL/train.parquet`
- `datasets/GRIPRL/test.parquet`
Use:
`train/examples/sft/run_sft_llama.sh`
Key fields to update include:
- `NAME`
- `data.train_files`
- `data.val_files`
- `model.partial_pretrain`
- `trainer.default_local_dir`
Then run:
```bash
cd train
bash examples/sft/run_sft_llama.sh
cd ..
```

Use:

`train/recipe/dapo/dapo_4w_continue_rl_ep3_llama.sh`
Key fields to update include:
- `MODEL_PATH`
- `CKPTS_DIR`
- `TRAIN_FILE`
- `TEST_FILE`
Then run the RL script in your training environment.
After RL training, convert the saved shards into Hugging Face format:
```bash
cd train
bash scripts/merge.sh
cd ..
```

The test file should follow the format:

```json
{
  "question": "Test query",
  "answer": ["One or more gold answers"]
}
```

Update the paths in inference/inference.sh:

- `model_path`
- `input_file`
- `output_file`
Then run:
```bash
bash inference/inference.sh
```

The prediction file will contain records like:

```json
{
  "question": "Test query",
  "prediction": ["step 1", "step 2", "final answer"]
}
```

Evaluate predictions with:
```bash
python eval/eval.py \
  --references_path /path/to/test_dataset.jsonl \
  --predictions_path /path/to/prediction.jsonl
```

The evaluation script supports:
- EM
- F1
- ROUGE
It also handles answer fields named either `answer` or `answer_and_def_correct_predictions`.
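For reference, EM and token-level F1 are commonly computed SQuAD-style, as sketched below; the exact normalization used by `eval/eval.py` may differ.

```python
# Reference SQuAD-style EM and token-level F1; the exact normalization used
# by eval/eval.py may differ.
import re
import string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # drop English articles
    return " ".join(text.split())

def exact_match(prediction, golds):
    # 1.0 if the normalized prediction matches any normalized gold answer
    return float(any(normalize(prediction) == normalize(g) for g in golds))

def f1_score(prediction, golds):
    def single(pred, gold):
        p_toks, g_toks = normalize(pred).split(), normalize(gold).split()
        overlap = sum((Counter(p_toks) & Counter(g_toks)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(p_toks), overlap / len(g_toks)
        return 2 * precision * recall / (precision + recall)
    # best score over all gold answers
    return max(single(prediction, g) for g in golds)
```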
GRIP organizes structured supervision into four training types:

- **Type-α: Direct Answer**: the model answers directly and terminates.
- **Type-β: Retrieval Needed**: the model emits a partial response and triggers retrieval.
- **Type-γ: Multi-hop Planning**: the model iteratively generates new intermediary states and follow-up queries.
- **Type-θ: Answer Completion**: the model uses retrieved evidence to synthesize and finalize the answer.
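For intuition, hypothetical trajectories for each type might look like the strings below. The token layout is illustrative only and may differ from the released training data.

```python
# Hypothetical single-question trajectories for the four training types.
# The token layout is illustrative only; the released data may differ.
TRAJECTORIES = {
    "Type-alpha (Direct Answer)":
        "[ANSWER] Ernest Hemingway [SOLVED]",
    "Type-beta (Retrieval Needed)":
        "[INTERMEDIARY] I am unsure of the publication year. "
        "[RETRIEVE] The Old Man and the Sea publication year",
    "Type-gamma (Multi-hop Planning)":
        "[INTERMEDIARY] The author is Ernest Hemingway; I still need his "
        "birthplace. [RETRIEVE] Ernest Hemingway birthplace",
    "Type-theta (Answer Completion)":
        "[ANSWER] Oak Park, Illinois [SOLVED]",
}
```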
This design teaches the model retrieval control through language-native token trajectories rather than external controllers.
Main functionality of `data_generation/first.sh`:
- runs first-stage structured data generation,
- builds initial behavior-specific subsets,
- supports distributed generation,
- interacts with Elasticsearch retrieval.
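The Elasticsearch interaction can be approximated by a small BM25 query helper. This is a sketch under assumptions (index name `wiki`, text field `text`, a cluster at `localhost:9200`), not the repository's actual retrieval code.

```python
# Hedged sketch of querying the BM25 index built by data_generation/index.py.
# Index name "wiki" and field name "text" are assumptions; check index.py for
# the actual mapping.
def bm25_search(query, host="http://localhost:9200", index="wiki", k=5):
    """Return the top-k passage documents for a BM25 match query."""
    # Imported lazily so the helper can be defined without a running cluster.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(host)
    resp = es.search(
        index=index,
        body={"size": k, "query": {"match": {"text": query}}},
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```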
Main functionality of `data_generation/use_gpt_for_data.py`:
- refines subset C with GPT-based query generation,
- rewrites intermediary answers,
- resumes from partial progress if interrupted.
Main functionality of `data_generation/merge_dataset.py`:
- merges A / B / C / D subsets,
- builds `SFT_data.jsonl`,
- builds `RL_data.jsonl`.
Main functionality of `inference/agent.py`:
- runs local GRIP inference,
- supports multi-round retrieval,
- saves step-wise predictions.
Main functionality of `eval/eval.py`:
- computes EM, F1, and ROUGE,
- matches predictions with references,
- reports summary statistics.
Main functionality of `train/examples/data_preprocess/grip/sft.py`:
- converts SFT JSONL into parquet format.
Main functionality of `train/examples/data_preprocess/grip/rl.py`:
- converts RL JSONL into parquet format,
- prepares reward-model fields for RL training.
Main functionality of `train/examples/sft/run_sft_llama.sh`:
- launches GRIP supervised fine-tuning for LLaMA.
Main functionality of `train/recipe/dapo/dapo_4w_continue_rl_ep3_llama.sh`:
- launches GRIP RL fine-tuning with DAPO.
- The retrieval pipeline depends on a working Elasticsearch service and a built Wikipedia index.
- Most scripts contain placeholder paths; update all paths before running.
- `data_generation/use_gpt_for_data.py` requires valid API credentials and endpoint settings.
- The training framework under `train/` has its own dependency setup; install both the main repository dependencies and the training dependencies.
- Evaluation expects the prediction file to match the question field and prediction format used by `eval/eval.py`.
The recommended order is:
- index Wikipedia
- construct structured data
- merge SFT / RL datasets
- preprocess parquet data
- run SFT
- run RL
- merge checkpoints
- run inference and evaluation
This repository is part of our broader research line on controllable and adaptive Retrieval-Augmented Generation (RAG).
- **GRIP** [ACL 2026 Main Conference]: *Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning*. A training-based dynamic RAG framework that internalizes retrieval control into token-level decoding.
- **ETC** [AAAI 2026 Oral Paper]: *Modeling Uncertainty Trends for Timely Retrieval in Dynamic RAG*. A training-free dynamic RAG method that improves retrieval timing by modeling entropy trends during decoding.
- **SCD** [AAAI 2026 Oral Paper]: *Language Drift in Multilingual Retrieval-Augmented Generation*. A training-free multilingual RAG method that mitigates language drift through decoding-time control.
Together, these projects cover three complementary directions in RAG: training-based retrieval planning, training-free retrieval timing, and decoding-time control for multilingual generation.