AutoAWQForCausalLM requires the download of pile-val-backup #2458
Sorry buddy. AutoAWQ is another project, not maintained by vLLM. You should ask in the AutoAWQ repo, not here.
@e576082c you should really save the AWQ quantization and then you can reload it from the save dir (or even a repo if you push it manually). The dataset is used to calibrate the quantization from what I get, but @casper-hansen should correct me on this one. The proper way to infer your AWQ model is as follows.

**Quantize and save**

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/mnt/AI/models/safetensors/loyal-piano-m7"
quant_path = f"{model_path}-awq"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
# This takes a while so you really want to save the result
# to make inference loading faster
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
# The whole point of the awq format is to have a smaller footprint
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

**Load your AWQ model**

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/mnt/AI/models/safetensors/loyal-piano-m7-awq"

# Load the quantized model (use from_quantized for an AWQ checkpoint;
# from_pretrained is for the full-precision model)
model = AutoAWQForCausalLM.from_quantized(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
...
```

Also I'm pretty sure AWQ needs the full-precision model in PyTorch format, as I don't think it supports the safetensors format. At least I didn't have luck with it.
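Once the quantized copy is saved, you can point vLLM at that directory for fast inference (a minimal sketch, assuming the quant_path from above and a vLLM build with AWQ support):

```python
from vllm import LLM, SamplingParams

# Directory the AWQ weights were saved to above
quant_path = "/mnt/AI/models/safetensors/loyal-piano-m7-awq"

# quantization="awq" tells vLLM to load the pre-quantized AWQ weights
llm = LLM(model=quant_path, quantization="awq")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Write a haiku about quantization."], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```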
We need a quantization dataset and use the pile validation split for that. It works as intended; I would suggest you increase your disk capacity or clean up the disk. You also need to save the model, that's just a basic step and what you should expect. If it's a problem, I recommend renting a machine that fits the requirements for quantizing.
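If downloading pile-val-backup is the sticking point, AutoAWQ's quantize also takes a calib_data argument that, as far as I understand, accepts your own list of text samples instead of the default pile validation split. A hedged sketch (verify the exact signature against the AutoAWQ version you have installed; the path and samples are placeholders):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/mnt/AI/models/safetensors/loyal-piano-m7"  # placeholder path
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Own calibration samples instead of the default pile validation split.
# calib_data accepting a list of strings is an assumption here; check
# the AutoAWQ documentation for your installed version.
calib_samples = [
    "def reverse(s):\n    return s[::-1]",
    "La calibration multilingue est un bon test.",
    "Ein kurzer deutscher Satz als Beispiel.",
]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)
```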
Thanks for the code @wasertech, and thanks for the advice, everybody! I didn't know that AWQ quantization requires a calibration dataset (so far I have only used GGUF, which just works). Regardless, at least putting a warning in the docs about the dataset download would be nice (because of its unknown license).

I am skeptical about quantization methods that rely on a calibration dataset (is the result deterministic? what about multi-lingual tests? what's the point of a calibration dataset anyway? The model should be able to write program code in 4 programming languages and be fluent in 6 human languages, so 'calibrating' on an English dataset seems pointless to me. Yeah, I have not looked into the dataset, so I don't even have a clue about what might be inside of it; sorry if I guessed wrong.).

Anyway, I chose to use GGUF (the older version, not the new one) and made a simple bash script for llama.cpp to iterate over my test prompts (using the --file, --logdir, and --model arguments of llama.cpp).

Saving quantized models to disk is not something I would do; even if the code requires writing the file out to disk (probably cleaner that way, due to memory-management issues?), I saved the quantized file into a tmpfs anyway. (I have a lot of unused RAM, so why not use it?) My WD Red SMR disks (lol, 'NAS optimized') are slow, so I prefer to avoid using them. Even if I could have managed to make some free space on my disks, rewriting a 4-7 GB file for each test model would have taken a prohibitively long time.

Anyway, thanks again for the help. My problem is solved, I guess (llama.cpp is now chewing through my tests at 20 T/s. xD Gonna take a few more days to complete.) But I'll leave this issue open regarding the unknown license of the pile-val-backup dataset and the absence of warnings before its download.
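For reference, a test loop of the kind described above, written as a Python sketch rather than bash (binary and directory paths are placeholders; the --model, --file, and --logdir flags are the llama.cpp arguments mentioned in the comment):

```python
import subprocess
from pathlib import Path

LLAMA_MAIN = "/opt/llama.cpp/main"           # placeholder path to the llama.cpp binary
MODELS_DIR = Path("/mnt/tmpfs/gguf-models")  # quantized models kept in tmpfs
PROMPTS_DIR = Path("/mnt/AI/test-prompts")
LOG_DIR = Path("/mnt/AI/test-logs")

for model in sorted(MODELS_DIR.glob("*.gguf")):
    for prompt_file in sorted(PROMPTS_DIR.glob("*.txt")):
        logdir = LOG_DIR / model.stem / prompt_file.stem
        logdir.mkdir(parents=True, exist_ok=True)
        # One llama.cpp run per (model, prompt) pair, logging to its own directory
        subprocess.run(
            [LLAMA_MAIN,
             "--model", str(model),
             "--file", str(prompt_file),
             "--logdir", str(logdir)],
            check=True,
        )
```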
@e576082c please close this issue here with a link to a new issue in the AutoAWQ repo, because you are in the vLLM repo atm, so it's not the right place to raise this 😅
Closing as this issue should be raised with AutoAWQ instead.
I installed vLLM to automatically run some tests on a bunch of Mistral-7B models (which I cooked up locally and do NOT want to upload to Hugging Face before properly testing them). The plan is to:
So, back to the issue about vLLM, and how all of this might be related to vLLM:
For quick testing, I copy-pasted and modded some code from docs.vllm.ai/en/latest/quantization/auto_awq.html. My code isn't much different from the one in the official docs at vllm.ai, and this particular code triggers the download of "pile-val-backup".
Perhaps I messed up something in the code, but I honestly don't think so. Please have a look at it:
I suppose my code is quite trivial, almost the same as in the docs.
Ah, and before I forget:
The error I get is:
And finally, here is some generic code which currently works, but it's slow, so I'm not happy with it and would like to use vLLM instead:
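A minimal sketch of that kind of plain transformers generation (placeholder model path and prompt; not the exact script from this report):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/mnt/AI/models/safetensors/loyal-piano-m7"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Write a short Python function that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```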
So... vLLM doesn't work, while the generic code I put together from the Hugging Face docs does work, but it's too slow.
I would really like to try out vLLM, but I won't download a random, shady dataset (pile-val-backup), which AWQ requires for whatever reason.
Please remove the dependency on "pile-val-backup".