Retrieval-Augmented-Generation-Enhanced Large Language Model for Text-to-SQL

This repository contains all the relevant codes for building a RAG enhanced LLM for Text-to-SQL, evaluation data and also instructions on how to evaluate the performance by test-suite-sql-eval through Docker and customize your Text-to-SQL evaluation pipeline based on own data by Langsmith.

Tech Stack

Architecture

The text-to-SQL architecture, as shown in the figure below, receives two pivotal inputs: the user's natural language question and the database's Data Definition Language (DDL), which outlines the database's schema. The DDL plays a critical role, in equipping the Language Model (LLM) with the necessary context of the available tables and fields for formulating queries. The process commences with a Question Masking step that selectively conceals schema-specific keywords, thus preserving the integral framework of the question. This masked version then acts as a query to fetch the three most structurally similar question-and-answer pairs from a vector database. Retrieving these pairs facilitates the LLM in generating the SQL commands and accessing related documentation relevant to the user's question. Subsequently, a comprehensive prompt is crafted, integrating DDL, SQL Q&A pairs and documentation, which is presented to the LLM for SQL query generation. After the LLM generates the SQL query, it is executed within the user’s database. In the event of execution errors, the LLM is equipped with a self-corrective feature that allows for reiteration, refining the query until a successful execution is achieved.

Play with the codes

Install relevant packages

git clone /~https://github.com/mruiyangyou/RAG-enhanced-LLM-Agent-for-text-to-SQL-generation.git
cd RAG-enhanced-LLM-Agent-for-text-to-SQL-generation
pip install -r requirements.txt

Prepare api keys:

OpenAI(Chat)
Cohere(rerank)
Pinecone(vectordb)
Langsmith(monitoring)

export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
export LANGCHAIN_API_KEY="xxx"
export LANGCHAIN_PROJECT="xxx"
export PINECONE_API_KEY="xxx"
export COHERE_API_KEY="xxx"
export OPENAI_API_KEY="xxx"
export VANNA_API_KEY="xxx"

Project Structure

data

inventory: data for inventory 50 mentioned in the report
sql_test_suite: data for academic full mentioned in the report
sql_test_suite_academic40: data for academic 40 & academic 40 enhanced mentioned in the report
sql-material: data used during building the RAG

code

baseline: baseline model only relies on schema
chain: containing our RAG and RAG-rerank models
self-rag: self correct RAG model
routing: RAG based router(multi indexes)
pinconedb: a built class wrapped Pinecone-client's and langchain-pinecone's api.
utils: database related utils function
evaluation:
- evaluating scripts: general evaluation codes
- sql-test-suite: sql-test-suite by Docker

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
data		data
notebooks		notebooks
src		src
.gitignore		.gitignore
README.md		README.md
architecture.png		architecture.png
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Retrieval-Augmented-Generation-Enhanced Large Language Model for Text-to-SQL

Tech Stack

Architecture

Play with the codes

Project Structure

data

code

About

Releases

Packages

Languages

mruiyangyou/RAG-enhanced-LLM-Agent-for-text-to-SQL-generation

Folders and files

Latest commit

History

Repository files navigation

Retrieval-Augmented-Generation-Enhanced Large Language Model for Text-to-SQL

Tech Stack

Architecture

Play with the codes

Project Structure

data

code

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages