Gen-SQL: Efficient Text-to-SQL By Bridging Natural Language Question And Database Schema With Pseudo-Schema
- Python 3.10
- CUDA 12.1
Refer to requirements.txt for required Python packages.
- NLTK: Run the following code in Python interpreter to download nltk data.
>>> import nltk >>> nltk.download('punkt') >>> nltk.download('averaged_perceptron_tagger')
Retriever Model:
Put the retriever model in pretrained.
LLMs are omitted.
- Download BIRD dataset, and unzip
train.zipanddev.zipin the benchmark/BIRD directory.
If you are NOT interested in the mass datasets, you may safely skip the next steps.
- Merge databases in the root directory (of this project):
After a few minutes (less than 10 minutes on my server), you can find the merged databases in the directory named
python spider_code/merge_spider_db.py
spider_code/spider_ext.
- Merge databases in the root directory (of this project):
It may take about an hour until you can find the merged databases in the directory named
python bird_code/merge_bird_dev_db.py
bird_code/bird_ext.
- Start openai-compatible vllm server:
Here is the link for quick reference.
python -m vllm.entrypoints.openai.api_server --model [YOUR MODEL]
- Run code:
The results will be saved to output.
python main.py
- Execution accuracy:
-
Convert output:
Spider or Spider-mass:
python spider_code/convert_output_ext.py
BIRD or BIRD-mass:
python bird_code/convert_output_ext.py
-
Run evaluation:
Spider or Spider-mass:
. spider_code/eval-sql.shBIRD or BIRD-mass:
. bird_code/evaluation/run_evaluation.sh
-
@inproceedings{shi-etal-2025-gen,
title = "Gen-{SQL}: Efficient Text-to-{SQL} By Bridging Natural Language Question And Database Schema With Pseudo-Schema",
author = "Shi, Jie and
Xu, Bo and
Liang, Jiaqing and
Xiao, Yanghua and
Chen, Jia and
Xie, Chenhao and
Wang, Peng and
Wang, Wei",
editor = "Rambow, Owen and
Wanner, Leo and
Apidianaki, Marianna and
Al-Khalifa, Hend and
Eugenio, Barbara Di and
Schockaert, Steven",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
month = jan,
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.coling-main.256/",
pages = "3794--3807",
abstract = "With the prevalence of Large Language Models (LLMs), recent studies have shifted paradigms and leveraged LLMs to tackle the challenging task of Text-to-SQL. Because of the complexity of real world databases, previous works adopt the retrieve-then-generate framework to retrieve relevant database schema and then to generate the SQL query. However, efficient embedding-based retriever suffers from lower retrieval accuracy, and more accurate LLM-based retriever is far more expensive to use, which hinders their applicability for broader applications. To overcome this issue, this paper proposes Gen-SQL, a novel generate-ground-regenerate framework, where we exploit prior knowledge from the LLM to enhance embedding-based retriever and reduce cost. Experiments on several datasets are conducted to demonstrate the effectiveness and scalability of our proposed method. We release our code and data at https://github.com/jieshi10/gensql."
}