The TABLET benchmark for evaluating instruction learning with LLMs for tabular prediction.

dylan-slack/Tablet

TABLET: Learning From Instructions For Tabular Data

Welcome to the TABLET GitHub! The goal of this project is to benchmark progress on instruction learning for tabular prediction. Hopefully, we can create models that solve tabular prediction tasks using instructions and a few labeled examples.

Links

  • Check out the project website 🖥️
  • Read the TABLET paper
  • Play with the TABLET demo 🚀

Overview

While many prediction problems require tabular data, gathering sufficient training data is often challenging because of costs or privacy issues. Large language models (LLMs) offer considerable world knowledge from their pre-training and could help improve sample efficiency for these problems. Still, these models are often not well aligned with tabular prediction tasks because of biases from pre-training and a lack of information about the task, which hurts their performance in the zero-shot and few-shot settings.

What if we could use task instructions to help bridge this gap? That’s where TABLET comes in. TABLET is a living benchmark of tabular datasets annotated with task instructions for evaluating how well LLMs utilize instructions for improving performance on tabular prediction tasks.

What is TABLET?

TABLET is a living benchmark of tabular prediction tasks annotated with instructions. TABLET provides the tools to evaluate models on current tasks and contribute new tasks. The goal is to help researchers develop techniques that improve the sample efficiency of LLMs on tabular prediction.

Citation

If TABLET is useful to your work, please cite us.

@article{tabletSlack23,
         Author = {Dylan Slack and Sameer Singh},
         Title = {TABLET: Learning From Instructions For Tabular Data},
         Year = {2023},
         journal = {arXiv},
}

Installation

Getting the data

To download the data, clone the GitHub repository.

git clone https://github.com/dylan-slack/Tablet.git

Once this completes, the data is stored in this path.

Tablet/data/benchmark

Installing TABLET

Please use Python>=3.9. Because of a quirk in one of the packages, please do not use Python 3.9.7 specifically. Also, ensure you have pip>=23.0.1.

conda create -n tablet python=3.9.6
conda activate tablet
pip install --upgrade pip
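As a quick sanity check of the interpreter constraints above, the following minimal sketch tests a version tuple against them. The function name python_version_ok is hypothetical and is not part of the TABLET package; it only encodes the two rules stated here (at least 3.9, and not exactly 3.9.7).

```python
import sys


def python_version_ok(version):
    """Return True if a (major, minor, micro) version meets the stated constraints.

    Hypothetical helper for illustration; not part of the TABLET package.
    """
    major, minor, micro = version[:3]
    # The one explicitly excluded release.
    if (major, minor, micro) == (3, 9, 7):
        return False
    # Otherwise anything from 3.9 upward is accepted.
    return (major, minor) >= (3, 9)


if __name__ == "__main__":
    # Check the interpreter that is currently running.
    print(python_version_ok(sys.version_info))
```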

If you want to install the tablet package from source, navigate into the TABLET package directory and install.

cd Tablet
python3 -m pip install -e .

Otherwise, you can install from PyPI with pip. [Note: not released yet]

pip install tablet-benchmark

Completing the benchmark

Unfortunately, some naturally occurring instructions come from sources that are not permissively licensed and do not permit hosting elsewhere. We provide a guide for collecting these instructions in

Tablet/fill_missing_instructions.py

Once this is completed, you can run

python fill_missing_instructions.py

and the instructions will be added to the benchmark data.

Evaluate

The TABLET package offers several useful features for evaluating the performance of LLMs with instructions on tabular datasets. It provides code to evaluate arbitrary HuggingFace models on tasks, as well as tools to get the HuggingFace dataset for a particular task so you can run whatever evaluation you want.

Task Storage

First, let's look at how the task datasets are stored in TABLET. All the tasks are stored in

Tablet/data/benchmark/performance

For example, the Adult task is stored at

Tablet/data/benchmark/performance/Adult

Within this directory, there is a separate directory for each instruction annotation of the Adult task. For example, let's look at one of the prototypes-generated instructions. This instruction is stored at

Tablet/data/benchmark/performance/Adult/prototypes-synthetic-performance-0

Instructions collected through other sources have different paths. The rulesets-generated instructions all have the directory name

ruleset-synthetic-performance-*

And the naturally occurring instructions have

prototypes-naturallanguage-performance-*

Note that the use of prototypes in this directory name is only there to keep the format consistent with the other directory names.
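The naming scheme above can be sketched as a small lookup that maps a directory name to its instruction source. This is only an illustration: the labels and the function instruction_source are hypothetical, and the prototypes pattern is inferred from the single example path shown above rather than stated by the package.

```python
from fnmatch import fnmatch

# Directory-name patterns from this README. The keys are hypothetical labels
# chosen for this sketch; the first pattern is inferred from the example
# "prototypes-synthetic-performance-0".
PATTERNS = {
    "prototypes_generated": "prototypes-synthetic-performance-*",
    "rulesets_generated": "ruleset-synthetic-performance-*",
    "naturally_occurring": "prototypes-naturallanguage-performance-*",
}


def instruction_source(dir_name):
    """Return the label whose pattern matches dir_name, or None if none match."""
    for label, pattern in PATTERNS.items():
        if fnmatch(dir_name, pattern):
            return label
    return None
```

For instance, instruction_source("prototypes-naturallanguage-performance-0") falls under the naturally occurring instructions even though the name starts with prototypes.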

Within each directory, there are four files

../test.csv
../test.json
../train.csv
../train.json

These are the training and testing sets, stored both in tabular format (the .csv files) and in natural language format (the .json files). The json files contain each prompt component, such as the header, the data point serialization, and the instruction.
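If you want to inspect these files without the TABLET package, a minimal sketch using only the standard library might look like the following. The function load_task_dir is hypothetical, and the sketch makes no assumption about the internal layout of the json files: it returns whatever json.load produces.

```python
import csv
import json
from pathlib import Path


def load_task_dir(task_dir):
    """Load the train/test csv and json files from one instruction directory.

    Hypothetical helper for illustration; not part of the TABLET package.
    """
    task_dir = Path(task_dir)
    loaded = {}
    for split in ("train", "test"):
        # Tabular format: one dict per row, keyed by column name.
        with open(task_dir / f"{split}.csv", newline="") as f:
            rows = list(csv.DictReader(f))
        # Natural language format: returned as parsed, structure not assumed.
        with open(task_dir / f"{split}.json") as f:
            prompts = json.load(f)
        loaded[split] = {"rows": rows, "prompts": prompts}
    return loaded
```

A missing file raises FileNotFoundError, which doubles as a quick check that a directory contains all four files.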

Getting a HuggingFace Dataset for a Task

Here's how to use the TABLET package to get a HuggingFace dataset for a particular task. Let's say we want the Adult and Whooping Cough datasets at these locations

Tablet/data/benchmark/performance/Adult/prototypes-synthetic-performance-0
Tablet/data/benchmark/performance/A37/prototypes-synthetic-performance-0

We can get the test datasets as follows

from Tablet import evaluate

benchmark_path = "./data/benchmark/performance/"
tasks = ['A37/prototypes-synthetic-performance-0',
         'Adult/prototypes-synthetic-performance-0']
evaluator = evaluate.Evaluator(benchmark_path=benchmark_path,
                               tasks_to_run=tasks,
                               encoding_format="flan",
                               k_shot=0)
whooping_cough, adult = evaluator.get_test_hf_datasets()

The k_shot argument controls how many instances are sampled from the training data and included in the prompts as few-shot examples. Then, we can access the Adult test data and labels as

test_data, ground_truth_labels = adult['text'], adult['label'] 
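With the test prompts and labels in hand, you can score predictions from any model yourself. The sketch below is one simple way to do that, under the assumption that your model returns one string prediction per prompt. The accuracy function is hypothetical and is not the metric code TABLET uses internally; comparing values as stripped strings is a choice made for this example.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the labels.

    Hypothetical helper for illustration; values are compared as stripped
    strings, which is an assumption of this sketch.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(
        str(pred).strip() == str(label).strip()
        for pred, label in zip(predictions, labels)
    )
    return correct / len(labels)
```

Here predictions would be a list produced by your own model from test_data, and labels would be ground_truth_labels.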

Evaluating Performance on a Task

We can also directly evaluate performance on tasks. For instance, evaluating 2-shot Flan-T5 small on Adult with prototypes-generated instructions over 3 seeds looks like this

from Tablet import evaluate

benchmark_path = "./data/benchmark/performance/"
tasks = ['Adult/prototypes-synthetic-performance-0']
evaluator = evaluate.Evaluator(benchmark_path=benchmark_path,
                               tasks_to_run=tasks,
                               encoding_format="flan",
                               results_file="my_cool_results.txt",
                               k_shot=2)
evaluator.run_eval(how_many=3)

The results will be appended to my_cool_results.txt.

Contribute

In order to build models that can align themselves with tabular prediction problems extremely well from just instructions and perhaps a few examples, we need many tasks. These are useful for evaluating how well we're doing and could be useful for future supervision.

TABLET makes it easy to create new tasks by writing instructions or generating them with GPT-3 for new datasets. Here's how you do it.

Creating a new task

You must have the training and testing data for your task stored in pandas DataFrames. Then, you can call create_task from Tablet.create. This function will take care of creating the task for the naturally occurring instructions you provide and will also generate instructions using GPT-3, if you would like.

from Tablet import create

create.create_task(train_x,
                   eval_x,
                   train_y,
                   eval_y,
                   name=my_data_set_name,
                   header="Predict xyz.",
                   nl_instruction="Generally, people papers are grad students.",
                   categorical_columns=names_of_categorical_columns,
                   num=index_of_task,
                   num_gpt3_revisions=10,
                   openai_key_path=path_to_open_ai_key,
                   save_loc="./data/benchmark")

Here, train_x and eval_x are the train and test splits. Similarly, train_y and eval_y are the label columns. This function also accepts the name of the task (e.g., Adult or Wine), the header describing the high-level goal of the task, and the natural language instruction, passed as the nl_instruction argument. You must also specify the names of the categorical columns. The num argument is the index under which the task with this naturally occurring instruction will be stored (e.g., prototypes-naturallanguage-performance-{num}).

Further, if you wish to generate instructions with GPT-3, you will need to store an OpenAI key in a file, pass the location of this file to the openai_key_path argument, and use num_gpt3_revisions to specify how many instructions you wish to create for the prototypes and rulesets templates.

Submitting a task

To include your awesome new task, please make sure the task's files are under

./data/benchmark/performance/my_new_task

and submit a pull request.

Please also include a short readme.md in the folder describing the goal of the task and the licenses that the data and instructions are under. For instance, something like this is ideal:

Task: Predict how many sheep someone will need to count before they fall asleep.
Data License: Apache 2.0
Instruction License: MIT

We'll review it and add it to the benchmark. If you would like your name & website added to the lists of tasks on the homepage, please mention this in the pull request as well.