This repository accompanies the ACL 2024 paper "Synthesizing Text-to-SQL Data from Weak and Strong LLMs." It leverages a lightweight weak model (DeepSeek 1.3B) to generate candidate SQL queries, validates them through execution, and builds preference pairs. These pairs are then used to train a powerful large model (CodeLlama) using cDPO.
```
CodeLlama SFT ──► DeepSeek 1.3B SFT ──► Sampling + Exec Verification ──► DPO Pairs ──► CodeLlama cDPO (SENSE)
   (7B/13B)           (1.3B)            (spider/bird-train)       (grouped-any-limit1)      (sense-13b)
```
- CodeLlama SFT — Supervised fine-tuning on spider, bird, and self-instruct data.
- DeepSeek 1.3B SFT — Same data; used as the weak model for sampling.
- Sampling + Exec Verification — DeepSeek 1.3B samples multiple SQLs per question; execution is verified against golden results.
- DPO Pairs — `construct_dpo_pairs.py` builds (chosen, rejected) pairs from exec results.
- cDPO — CodeLlama is trained with cDPO on these pairs, yielding the final model.
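The "Sampling + Exec Verification" step boils down to executing each sampled SQL and the gold SQL against the database and comparing result sets. A minimal sketch of that check (the function name and signature here are illustrative, not the repo's actual API; Spider/BIRD databases are SQLite files):

```python
import sqlite3

def exec_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """Return True if pred_sql and gold_sql yield the same result set
    on the given SQLite database (order-insensitive comparison)."""
    conn = sqlite3.connect(db_path)
    try:
        pred = frozenset(map(tuple, conn.execute(pred_sql).fetchall()))
        gold = frozenset(map(tuple, conn.execute(gold_sql).fetchall()))
        return pred == gold
    except sqlite3.Error:
        # A non-executable prediction counts as incorrect,
        # i.e. a candidate for the "rejected" side of a DPO pair.
        return False
    finally:
        conn.close()
```

Samples that pass become "chosen" candidates and failures become "rejected" candidates for the next step.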
```
├── models/                    # Model checkpoints and training configs
│   ├── 1-CODELLAMA-SFT.ARGS.md
│   ├── 2-DEEPSEEK-SFT.ARGS.md
│   └── 3-SENSE-CDPO.ARGS.md
├── data/
│   ├── sft/                   # SFT data (spider, bird, self-instruct)
│   ├── sampling/              # Sampling outputs and exec results
│   └── dpo/                   # DPO pairs (grouped-any-limit1)
├── sql_eval/                  # Evaluation harness
│   ├── DATA_WORKTHROUGH.md    # Data preparation guide
│   ├── run_eval_demo.sh       # Greedy evaluation
│   └── run_sampling_demo.sh   # Sampling evaluation
└── sql_eval/sql_suites/       # Data preprocessing, DPO pair construction
```
- Data Preparation — How to download and preprocess Spider, BIRD, Spider-DK, Spider-Syn, and Spider-Realistic. Includes filtering by executability.
- Model Training Arguments — Per-stage training configs:
  - `1-CODELLAMA-SFT.ARGS.md` — LR, epochs, data for CodeLlama SFT
  - `2-DEEPSEEK-SFT.ARGS.md` — LR, epochs, data for DeepSeek 1.3B SFT
  - `3-SENSE-CDPO.ARGS.md` — LR, DPO beta, cDPO eps, data for the final cDPO stage
From `sql_eval/`:

```bash
# Greedy decoding
bash run_eval_demo.sh

# Sampling (multiple SQLs per question)
bash run_sampling_demo.sh
```

By default `MODEL_PATH` points to `models/sense-13b` and `TASK_NAME` to `spider-train`. Adjust `main.py` arguments as needed.
From `sql_eval/sql_suites/`:

```bash
bash generate_pairs.sh
```

This reads exec results from `data/sampling/` and writes DPO pairs to `data/dpo/` in the `grouped-any-limit1` format.
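The `grouped-any-limit1` setting groups samples per question, deduplicates, and forms at most `neg_limit` pairs from one correct sample and the incorrect ones. An illustrative re-implementation of that logic (the actual `construct_dpo_pairs.py` may structure this differently):

```python
def build_pairs(samples, neg_limit=1, dedup=True, keep_gold=False, gold=None):
    """Build (chosen, rejected) DPO pairs for one question.

    samples: list of (sql, exec_correct) tuples from exec verification.
    Defaults mirror the grouped-any-limit1 setting:
    dedup=True, neg_filter=all, keep_gold=False, neg_limit=1.
    """
    if dedup:
        seen, uniq = set(), []
        for sql, ok in samples:
            if sql not in seen:
                seen.add(sql)
                uniq.append((sql, ok))
        samples = uniq
    pos = [sql for sql, ok in samples if ok]
    neg = [sql for sql, ok in samples if not ok]
    if keep_gold and gold is not None:
        pos.insert(0, gold)  # optionally use the gold SQL as the positive
    if not pos or not neg:
        return []  # need at least one correct and one incorrect sample
    chosen = pos[0]
    return [(chosen, rejected) for rejected in neg[:neg_limit]]
```

With `neg_limit=1` each question contributes at most one pair; raising it yields more (but more redundant) pairs per question.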
If you plan to build on this data or pipeline, here are some promising directions:

- Better preference learning algorithms — The released model uses cDPO with `cdpo_eps=0.1`. You can experiment with alternatives such as IPO, KTO, SimPO, or other preference optimization methods that may improve stability or sample efficiency.
- Adjusting DPO pair parameters — The current pairs use `grouped-any-limit1` (`dedup=True`, `neg_filter=all`, `keep_gold=False`, `neg_limit=1`). Try varying `neg_limit` (e.g., 3 or 5), `keep_gold` (include gold as positive), or `neg_filter` (exec-only negatives) to study their impact. See `construct_dpo_pairs.py` and `models/3-SENSE-CDPO.ARGS.md` for the construction logic.
- Improving self-instruct data quality — Self-instruct data is used for SFT but is not execution-verified. You can run execution verification on it, filter out non-executable or incorrect samples, and optionally rewrite failed SQLs (e.g., via feedback or stronger models) to obtain higher-quality SFT data.
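For context on the first direction: cDPO (conservative DPO) is commonly formulated as the DPO loss with label smoothing, where the smoothing coefficient plays the role of `cdpo_eps`. A framework-free sketch of the per-pair loss under that formulation:

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def cdpo_loss(chosen_logratio: float, rejected_logratio: float,
              beta: float = 0.1, eps: float = 0.1) -> float:
    """Label-smoothed (conservative) DPO loss for one preference pair.

    chosen_logratio / rejected_logratio:
        log pi_theta(y|x) - log pi_ref(y|x) for chosen / rejected responses.
    eps=0 recovers vanilla DPO; eps>0 hedges against noisy preference labels,
    which matters here because exec-verified pairs can still be noisy.
    """
    margin = beta * (chosen_logratio - rejected_logratio)
    return -(1 - eps) * log_sigmoid(margin) - eps * log_sigmoid(-margin)
```

Swapping in IPO, KTO, or SimPO amounts to replacing this per-pair objective while keeping the same (chosen, rejected) data.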
```bibtex
@inproceedings{yang2024synthesizing,
  title={Synthesizing text-to-SQL data from weak and strong LLMs},
  author={Yang, Jiaxi and Hui, Binyuan and Yang, Min and Yang, Jian and Lin, Junyang and Zhou, Chang},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={7864--7875},
  year={2024}
}
```