Create prompts with a given token length for testing LLMs and other transformer text models.
Pre-created prompts for popular model architectures are provided as `.jsonl` files in the `prompts` directory.
To generate one or a few prompts, or to test the functionality, you can use the Test Prompt Generator Space on Hugging Face.
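If you only need the ready-made prompts, you can load one of the provided `.jsonl` files directly. A minimal sketch; the file name and the per-line field names here are assumptions, so check the files in the `prompts` directory for the exact schema:

```python
import json

# Load pre-created prompts from one of the provided .jsonl files.
# NOTE: the file name and field names are illustrative; inspect the
# files in the prompts directory for the actual schema.
with open("prompts/llama.jsonl", encoding="utf-8") as f:
    prompts = [json.loads(line) for line in f]

print(prompts[0])  # each line is one JSON object describing a prompt
```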
```sh
pip install git+/~https://github.com/helena-intel/test-prompt-generator.git transformers
```
Some tokenizers may require additional dependencies, for example `sentencepiece` or `protobuf`.
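For example, to add both of these dependencies:

```sh
pip install sentencepiece protobuf
```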
Specify a tokenizer and the number of tokens the prompt should have. A prompt will be returned that, when tokenized with the given tokenizer, contains the requested number of tokens.
For `tokenizer`, use a model_id from the Hugging Face hub, a path to a local directory with tokenizer files, or one of the preset tokenizers: `['bert', 'blenderbot', 'bloom', 'bloomz', 'chatglm3', 'falcon', 'gemma', 'gpt-neox', 'llama', 'magicoder', 'mistral', 'mpt', 'opt', 'phi-2', 'pythia', 'qwen', 'redpajama', 'roberta', 'starcoder', 't5', 'vicuna', 'zephyr']`. The preset tokenizers should work for most models with that architecture, but if you want to be sure, use an exact model_id; the exact tokenizer used for each preset is defined in the source code.
Prompts are generated by truncating a given source text at the requested number of tokens. By default, the text of Alice in Wonderland is used; you can also provide your own source. A prefix can optionally be prepended, to create prompts like "Please summarize the following text: [text]". Prompts are returned by the Python function or command-line app and can optionally be saved to a `.jsonl` file.
```python
from test_prompt_generator import generate_prompt

# use preset value for opt tokenizer
prompt = generate_prompt(tokenizer_id="opt", num_tokens=32)

# use model_id
prompt = generate_prompt(tokenizer_id="facebook/opt-2.7b", num_tokens=32)
```
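To sanity-check the result, you can tokenize the returned prompt yourself. This is a quick check, assuming `transformers` is installed (it is part of the install command above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
num_tokens = len(tokenizer(prompt)["input_ids"])
print(num_tokens)  # expected: 32; depending on how special tokens are
                   # counted, add_special_tokens=False may be needed
```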
Add a `source_text_file` and a `prefix`. Instead of `source_text_file`, you can also pass `source_text` with a string containing the source text.
```python
from test_prompt_generator import generate_prompt

prompt = generate_prompt(
    tokenizer_id="mistral",
    num_tokens=32,
    source_text_file="source.txt",
    prefix="Please translate to Dutch:",
    output_file="prompt_32.jsonl",
)
```
Use multiple token sizes. When using multiple token sizes, `output_file` is required; the output file will contain one line for each token size, and `generate_prompt` returns a dictionary of prompts (see the note below).
```python
prompts = generate_prompt(
    tokenizer_id="mistral",
    num_tokens=[32, 64, 128],
    output_file="prompts.jsonl",
)
```
NOTE: When specifying one token size, the prompt will be returned as a string, making it easy to copy and use in a test scenario where you need a single prompt. When specifying multiple token sizes, a dictionary with the prompts will be returned. The output file is always in `.jsonl` format, regardless of the number of generated prompts.
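Building on the note above, a short sketch of consuming the returned dictionary; that it is keyed by token size is an assumption here:

```python
from test_prompt_generator import generate_prompt

prompts = generate_prompt(tokenizer_id="mistral", num_tokens=[32, 64, 128], output_file="prompts.jsonl")

# Assumption: the dictionary maps each requested token size to its prompt.
for size, prompt in prompts.items():
    print(f"{size} tokens: {prompt[:60]}...")
```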
The same functionality is available as a command-line app:

```sh
test-prompt-generator -t mistral -n 32
```
Use `test-prompt-generator --help` to see all options:
```
usage: test-prompt-generator [-h] -t TOKENIZER -n NUM_TOKENS [-p PREFIX] [-o OUTPUT_FILE] [--overwrite] [-v] [-f FILE]

options:
  -h, --help            show this help message and exit
  -t TOKENIZER, --tokenizer TOKENIZER
                        preset tokenizer id, model_id from Hugging Face hub, or path to local directory with tokenizer files. Options for presets are: ['bert', 'bloom', 'gemma', 'chatglm3',
                        'falcon', 'gpt-neox', 'llama', 'magicoder', 'mistral', 'opt', 'phi-2', 'pythia', 'roberta', 'qwen', 'starcoder', 't5']
  -n NUM_TOKENS, --num_tokens NUM_TOKENS
                        Number of tokens the generated prompt should have. To specify multiple token sizes, use e.g. `-n 16 32`
  -p PREFIX, --prefix PREFIX
                        Optional: prefix that the prompt should start with. Example: 'Translate to Dutch:'
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Optional: Path to store the prompt as .jsonl file
  --overwrite           Overwrite output_file if it already exists.
  -v, --verbose
  -f FILE, --file FILE  Optional: path to text file to generate prompts from. Default text_files/alice.txt
```
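For example, to generate prompts of several sizes with a prefix and save them to a file, using the flags documented above:

```sh
test-prompt-generator -t mistral -n 16 32 64 -p "Please summarize:" -o prompts.jsonl --overwrite
```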
This software is provided "as is" and for testing purposes only. The author makes no warranties, express or implied, regarding the software's operation, accuracy, or reliability.