
compatible with llama-server docker #35

Closed

aiseei opened this issue Jan 10, 2025 · 20 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@aiseei

aiseei commented Jan 10, 2025

Thanks for this cool project!
Anyway, could we use this to set up, start, and stop Docker-based servers?

@mostlygeek
Owner

I found podman to be a better option than docker. I run qwen-2vl-7B with vLLM and podman and it works; try that. It should also be able to run docker containers and be compatible with the way llama-swap shuts down processes (it sends a SIGTERM).

@mostlygeek
Owner

Here is a configuration snippet I use to run vllm, podman and llama-swap together:

models:
  # run VLLM in podman on the 3090
  "qwen2-vl-7B-gptq-int8":
    aliases:
      - gpt-4-vision
    proxy: "http://127.0.0.1:9797"
    cmd: >
      podman run --rm
        -v /mnt/nvme/models:/models
        --device nvidia.com/gpu=GPU-<redacted>
        -p 9797:8000 --ipc=host
        --security-opt=label=disable
        docker.io/vllm/vllm-openai:v0.6.4
        --model "/models/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8"
        --served-model-name gpt-4-vision qwen2-vl-7B-gptq-int8
        --disable-log-stats
        --enforce-eager

@mostlygeek mostlygeek added the documentation Improvements or additions to documentation label Jan 11, 2025
@knguyen298

I am currently using it with docker, and it has been working great! Example below:

models:
  "Qwen Coder 32B":
    cmd: >
      docker run --pull=always --init --rm -v /ssd/llamacpp/models:/models --gpus '"device=1,2"' -p 9501:9501 --name "llama-swap-qwen-coder-32b"
      ghcr.io/ggerganov/llama.cpp:server-cuda
      --host 0.0.0.0
      --port 9501
      --flash-attn --metrics
      --model /models/Qwen2.5-Coder-32B-Instruct-Q6_K_L.gguf
      -ngl 99
      --ctx-size 32768
      --split-mode row
      --temp 0.1
      --keep -1

    proxy: http://127.0.0.1:9501

@mostlygeek
Owner

Have you seen any issues when swapping between two models with docker?

@knguyen298

Nope, swapping is pretty seamless; there's no lingering container after a swap. I have llama-swap running as a systemd service, and even forcibly restarting it while a Docker container is loaded doesn't leave the container lingering around.

@mostlygeek
Owner

Thanks @knguyen298. Closing this issue. I think the --init flag is the key: it runs docker-init as PID 1, which catches the SIGTERM from llama-swap and shuts the container down.

$ docker run  --init ubuntu:latest ps
    PID TTY          TIME CMD
      1 ?        00:00:00 docker-init
      7 ?        00:00:00 ps
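For comparison, without --init the command itself runs as PID 1 inside the container, so a clean shutdown depends entirely on that process handling SIGTERM. Something like this shows it (output illustrative):

$ docker run --rm ubuntu:latest ps
    PID TTY          TIME CMD
      1 ?        00:00:00 ps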

@mostlygeek
Owner

I played around with this a bit and I couldn't get it to swap correctly. 🤔 Was there some other setup you did?

@mostlygeek mostlygeek reopened this Jan 29, 2025
@knguyen298

What happens when you try to swap?

I also saw that you didn't have the --rm flag in your above example. I'm pretty sure it's needed so you don't have a container sitting around consuming VRAM, but I'm not 100% sure.
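If you want to double-check for leftovers, listing exited containers should show anything that a run without --rm left behind:

docker ps -a --filter "status=exited"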

@zenabius

zenabius commented Jan 30, 2025

I played around with this a bit and I couldn't get it to swap correctly. 🤔 Was there some other setup you did?

@mostlygeek @aiseei
Ollama and Llama.cpp docker versions:

models:
  "0-ollama-server":
    cmd: >
      docker run --rm 
      --gpus all
      --init 
      -p 9797:11434
      -e "HOME=/root/"
      -e "OLLAMA_DEBUG=0"
      -e "OLLAMA_TMPDIR=/root/.ollama/tmp/"
      -e "OLLAMA_MAX_LOADED_MODELS=3"
      -e "OLLAMA_KEEP_ALIVE=3600"
      -e "OLLAMA_ORIGINS=moz-extension://*"
      -e "OLLAMA_FLASH_ATTENTION=1"
      -e "OLLAMA_KV_CACHE_TYPE=q8_0"
      -v /mnt/llm/llama-swap-ollama:/root/.ollama 
      ollama/ollama
    proxy: "http://127.0.0.1:9797"
    # check this path for an HTTP 200 OK before serving requests
    # default: /health to match llama.cpp
    # use "none" to skip endpoint checking, but may cause HTTP errors
    # until the model is ready
    checkEndpoint: /api/version
    # automatically unload the model after this many seconds
    # ttl values must be a value greater than 0
    # default: 0 = never unload model
    ttl: 240

    aliases:
    - "minicpm-v"
    - "mxbai-embed-large"
    unlisted: true


    "1-Qwen2.5-72B-Instruct-draft":
    cmd: >
      docker run --rm 
      --gpus '"device=0,1,2"'
      --init 
      -p 9797:8080 
      -v /mnt/llm/GGUF:/models 
      ghcr.io/ggerganov/llama.cpp:server-cuda
      --host 0.0.0.0 --port 8080
      --tensor-split 34,36,30
      --parallel 2
      --flash-attn
      --slots
      --cache-type-k q8_0
      --cache-type-v q8_0
      --model /models/00.Models/Qwen2.5-72B-Instruct-GGUF/huihui-ai.Qwen2.5-72B-Instruct-abliterated.Q6_K-00001-of-00005.gguf
      --gpu-layers 999
      --ctx-size 24576
      --model-draft /models/01.Drafts/qwen2.5-0.5b-instruct-abliterated-q8_0.gguf
      --gpu-layers-draft 999
      --ctx-size-draft 4096
      --draft-max 24
      --draft-min 5
      --draft-p-min 0.6
      --device-draft CUDA2
      --top-p 0.95
      --temp 0.8
      --repeat-penalty 1.2
      --frequency-penalty 0.1
      --presence-penalty 0.2
      --no-mmap
      --no-webui
      --no-context-shift
    proxy: "http://127.0.0.1:9797"
    ttl: 3600
    aliases:
    - qwen

sudo usermod -aG docker llamaswpusr
sudo nano /etc/systemd/system/llama-swap.service

[Unit]
Description=llama-swap daemon
After=network.target

[Service]
User=llamaswpusr
Group=docker

# Set this to match your environment
ExecStart=/opt/llama-swap/bin/llama-swap --config /opt/llama-swap/config/config.yaml

# Grant Docker access
AmbientCapabilities=CAP_NET_RAW CAP_SYS_ADMIN
NoNewPrivileges=true

Restart=on-failure
RestartSec=3
StartLimitBurst=3
# 's' for seconds as per systemd documentation
StartLimitInterval=30s

[Install]
WantedBy=multi-user.target

systemctl daemon-reload
systemctl enable llama-swap.service
systemctl start llama-swap.service
systemctl status llama-swap.service

@mostlygeek
Owner

mostlygeek commented Jan 30, 2025

I've messed around with this quite a bit and I still have no idea how you all got it working! 😅
I still can't get running containers to reliably terminate before starting a new one.

llama-swap sends a SIGTERM to stop the upstream process before swapping. This is at odds with docker's client/server model: docker run ... is just a client that sends commands to the docker daemon, and the recommended way to stop a running container is to call docker stop ....

Here is a testing config I'm using:

models:

  "docker1":
    proxy: "http://127.0.0.1:9503"
    cmd: >
      docker run --gpus '"device=3"' --init --rm
      -p 9503:8080 -v /mnt/nvme/models:/models --name dockertest1
      ghcr.io/ggerganov/llama.cpp:server-cuda -ngl 99 --model '/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf'

  "docker2":
    proxy: "http://127.0.0.1:9503"
    cmd: >
      docker run --runtime nvidia --gpus '"device=3"' --init --rm
      -p 9503:8080 -v /mnt/nvme/models:/models --name dockertest2
      ghcr.io/ggerganov/llama.cpp:server-cuda -ngl 99 --model '/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf'

I can get docker1 to load by visiting http://x.x.x.x/upstream/docker1, which shows the llama.cpp webUI. However, when I access http://x.x.x.x/upstream/docker2 to trigger a swap, the docker1 container doesn't stop as expected, which I can confirm with docker ps.

So I'm thinking right now:

  • there's something different about my docker setup. It's Ubuntu 24.04, with the nvidia container toolkit installed
  • write a signal proxy to catch SIGTERM and call docker stop .... It would essentially be a docker-run <flags for real docker run> wrapper (see the sketch after this list)
  • add docker functionality to llama-swap.
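A rough sketch of what that signal proxy could look like (untested, just to illustrate the idea):

#!/usr/bin/env bash
# docker-run-proxy: start a named container and turn the SIGTERM that
# llama-swap sends into a `docker stop` of that container.
set -euo pipefail

NAME="$1"; shift                       # first arg is the container name, the rest go to docker run

docker run --rm --name "$NAME" "$@" &  # run the real container as a child process
CHILD=$!

stop_container() {
  docker stop "$NAME" >/dev/null 2>&1 || true
}
trap stop_container TERM INT

# wait is interrupted by the signal; by then the trap has already stopped the container
wait "$CHILD" || true

llama-swap's cmd would then be docker-run-proxy my_container <flags for real docker run>, and the SIGTERM it already sends would do the right thing.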

For native docker support, the config could look like:

models:
  "docker1":
    proxy: http://localhost:9505
    cmd: >
      docker run ... --name "my_container"
    cmd_stop: docker stop "my_container"

In this case, if cmd_stop exists it will be called instead of sending a SIGTERM signal.

@mostlygeek mostlygeek added the enhancement New feature or request label Jan 30, 2025
@mostlygeek
Owner

I pushed #40, which adds cmd_stop in the docker-support-35 branch. In my testing, cmd_stop reliably stops a docker container.
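For reference, the docker1 entry from my test config above would become something like:

models:
  "docker1":
    proxy: "http://127.0.0.1:9503"
    cmd: >
      docker run --gpus '"device=3"' --init --rm
      -p 9503:8080 -v /mnt/nvme/models:/models --name dockertest1
      ghcr.io/ggerganov/llama.cpp:server-cuda -ngl 99 --model '/models/Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf'
    cmd_stop: docker stop dockertest1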

mostlygeek added a commit that referenced this issue Jan 31, 2025
Add `cmd_stop` to model configuration to run a command instead of sending a SIGTERM to shutdown a process before swapping.
@mostlygeek
Owner

@zenabius Would you mind trying out the new release (v84) that I pushed with cmd_stop?

Also, as a bit of irony I wrote llama-swap because ollama didn't support the nvidia P40s. So discovering someone running it inside llama-swap is pretty neat!

@zenabius

zenabius commented Jan 31, 2025

@zenabius Would you mind trying out the new release (v84) that I pushed with cmd_stop?

Also, as a bit of irony I wrote llama-swap because ollama didn't support the nvidia P40s. So discovering someone running it inside llama-swap is pretty neat!

[llama-swap] 192.168.16.7 [2025-01-30 18:35:15] "GET /v1/models HTTP/1.1" 200 578 "Python/3.11 aiohttp/3.11.8" 63.84µs
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
 Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
build: 4589 (eb7cf15a) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 520,610,700,750 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

Web UI is disabled
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 15
main: loading model
srv    load_model: loading model '/models/00.Models/Mistral-Small-24B-Instruct-2501-Q6_K_L.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23916 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 363 tensors from /models/00.Models/Mistral-Small-24B-Instruct-2501-Q6_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Mistral Small 24B Instruct 2501
llama_model_loader: - kv   3:                            general.version str              = 2501
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Mistral-Small
llama_model_loader: - kv   6:                         general.size_label str              = 24B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Mistral Small 24B Base 2501
llama_model_loader: - kv  10:               general.base_model.0.version str              = 2501
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Mistralai
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/mistralai/Mist...
llama_model_loader: - kv  13:                          general.languages arr[str,10]      = ["en", "fr", "de", "es", "it", "pt", ...
llama_model_loader: - kv  14:                          llama.block_count u32              = 40
llama_model_loader: - kv  15:                       llama.context_length u32              = 32768
llama_model_loader: - kv  16:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv  17:                  llama.feed_forward_length u32              = 32768
llama_model_loader: - kv  18:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  19:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  20:                       llama.rope.freq_base f32              = 100000000.000000
llama_model_loader: - kv  21:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  22:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  23:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  24:                           llama.vocab_size u32              = 131072
llama_model_loader: - kv  25:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = tekken
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,269443]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  33:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  37:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 18
llama_model_loader: - kv  40:                      quantize.imatrix.file str              = /models_out/Mistral-Small-24B-Instruc...
llama_model_loader: - kv  41:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  42:             quantize.imatrix.entries_count i32              = 280
llama_model_loader: - kv  43:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q8_0:    2 tensors
llama_model_loader: - type q6_K:  280 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q6_K
print_info: file size   = 18.31 GiB (6.67 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 5120
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 32768
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 100000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 13B
print_info: model params     = 23.57 B
print_info: general.name     = Mistral Small 24B Instruct 2501
print_info: vocab type       = BPE
print_info: n_vocab          = 131072
print_info: n_merges         = 269443
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: LF token         = 1196 'Ä'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 150
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:    CUDA_Host model buffer size =   680.00 MiB
load_tensors:        CUDA0 model buffer size = 18072.21 MiB
request: GET /health 172.17.0.1 503
request: GET /health 172.17.0.1 503
request: GET /health 172.17.0.1 503
request: GET /health 172.17.0.1 503
request: GET /health 172.17.0.1 503
request: GET /health 172.17.0.1 503
request: GET /health 172.17.0.1 503
llama_init_from_model: n_seq_max     = 2
llama_init_from_model: n_ctx         = 32768
llama_init_from_model: n_ctx_per_seq = 16384
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 100000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (16384) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 32768, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 40, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  2720.00 MiB
llama_init_from_model: KV self size  = 2720.00 MiB, K (q8_0): 1360.00 MiB, V (q8_0): 1360.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     1.00 MiB
llama_init_from_model:      CUDA0 compute buffer size =   266.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    74.01 MiB
llama_init_from_model: graph nodes  = 1127
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
srv          init: initializing slots, n_slots = 2
slot         init: id  0 | task -1 | new slot n_ctx_slot = 16384
slot         init: id  1 | task -1 | new slot n_ctx_slot = 16384
main: model loaded
main: chat template, chat_template: {{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}{{ '[INST]' + message['content'] + '[/INST]' }}{% elif message['role'] == 'system' %}{{ '[SYSTEM_PROMPT]' + message['content'] + '[/SYSTEM_PROMPT]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token }}{% else %}{{ raise_exception('Only user, system and assistant roles are supported!') }}{% endif %}{% endfor %}, example_format: '[SYSTEM_PROMPT] You are a helpful assistant[/SYSTEM_PROMPT][INST] Hello[/INST] Hi there</s>[INST] How are you?[/INST]'
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
request: GET /health 172.17.0.1 200
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 332
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 332, n_tokens = 332, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 332, n_tokens = 332
request: GET /health 127.0.0.1 200
slot      release: id  0 | task 0 | stop processing: n_past = 365, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =   22091.03 ms /   332 tokens (   66.54 ms per token,    15.03 tokens per second)
      eval time =   13380.43 ms /    34 tokens (  393.54 ms per token,     2.54 tokens per second)
     total time =   35471.46 ms /   366 tokens
srv  update_slots: all slots are idle
request: POST /v1/chat/completions 172.17.0.1 200
[llama-swap] 192.168.16.7 [2025-01-30 18:35:58] "POST /v1/chat/completions HTTP/1.1" 200 8669 "Python/3.11 aiohttp/3.11.8" 43.734246604s
[llama-swap] 192.168.16.7 [2025-01-30 18:35:59] "GET /v1/models HTTP/1.1" 200 578 "Python/3.11 aiohttp/3.11.8" 87.184µs
slot launch_slot_: id  0 | task 35 | processing task
slot update_slots: id  0 | task 35 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 422
slot update_slots: id  0 | task 35 | kv cache rm [185, end)
slot update_slots: id  0 | task 35 | prompt processing progress, n_past = 422, n_tokens = 237, progress = 0.561611
slot update_slots: id  0 | task 35 | prompt done, n_past = 422, n_tokens = 237
slot      release: id  0 | task 35 | stop processing: n_past = 434, truncated = 0
slot print_timing: id  0 | task 35 | 
prompt eval time =     182.78 ms /   237 tokens (    0.77 ms per token,  1296.63 tokens per second)
      eval time =     342.29 ms /    13 tokens (   26.33 ms per token,    37.98 tokens per second)
     total time =     525.07 ms /   250 tokens
srv  update_slots: all slots are idle
[llama-swap] 192.168.16.7 [2025-01-30 18:35:59] "POST /v1/chat/completions HTTP/1.1" 200 600 "Python/3.11 aiohttp/3.11.8" 530.866349ms
request: POST /v1/chat/completions 172.17.0.1 200
[llama-swap] 192.168.16.7 [2025-01-30 18:36:10] "GET /v1/models HTTP/1.1" 200 578 "Python/3.11 aiohttp/3.11.8" 44.855µs
!!! Running stop command: docker stop small
small
INFO 01-30 18:36:16 __init__.py:183] Automatically detected platform cuda.
INFO 01-30 18:36:17 api_server.py:835] vLLM API server version 0.7.0
INFO 01-30 18:36:17 api_server.py:836] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/model/mistralai-Pixtral-12B-2409', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='mistral', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format='mistral', dtype='auto', kv_cache_dtype='auto', max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, pipeline_parallel_size=2, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'image': 4}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['vllm-vision-Pixtral-12B-2409'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 01-30 18:36:17 config.py:2314] Downcasting torch.float32 to torch.float16.
INFO 01-30 18:36:22 config.py:520] This model supports multiple tasks: {'reward', 'classify', 'embed', 'generate', 'score'}. Defaulting to 'generate'.
INFO 01-30 18:36:22 config.py:1328] Defaulting to use mp for distributed inference
WARNING 01-30 18:36:22 config.py:647] Async output processing can not be enabled with pipeline parallel
INFO 01-30 18:36:22 llm_engine.py:232] Initializing an LLM engine (v0.7.0) with config: model='/mnt/model/mistralai-Pixtral-12B-2409', speculative_config=None, tokenizer='/mnt/model/mistralai-Pixtral-12B-2409', skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=vllm-vision-Pixtral-12B-2409, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
WARNING 01-30 18:36:22 multiproc_worker_utils.py:298] Reducing Torch parallelism from 8 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 01-30 18:36:22 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=65) INFO 01-30 18:36:22 multiproc_worker_utils.py:227] Worker ready; awaiting tasks
INFO 01-30 18:36:23 cuda.py:225] Using Flash Attention backend.
(VllmWorkerProcess pid=65) INFO 01-30 18:36:23 cuda.py:225] Using Flash Attention backend.
INFO 01-30 18:36:24 utils.py:938] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=65) INFO 01-30 18:36:24 utils.py:938] Found nccl from library libnccl.so.2
INFO 01-30 18:36:24 pynccl.py:67] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=65) INFO 01-30 18:36:24 pynccl.py:67] vLLM is using nccl==2.21.5
INFO 01-30 18:36:24 model_runner.py:1110] Starting to load model /mnt/model/mistralai-Pixtral-12B-2409...
(VllmWorkerProcess pid=65) INFO 01-30 18:36:24 model_runner.py:1110] Starting to load model /mnt/model/mistralai-Pixtral-12B-2409...
INFO 01-30 18:36:29 config.py:2924] cudagraph sizes specified by model runner [] is overridden by config []
(VllmWorkerProcess pid=65) INFO 01-30 18:36:29 config.py:2924] cudagraph sizes specified by model runner [] is overridden by config []
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:18<00:00, 18.09s/it]

INFO 01-30 18:36:48 model_runner.py:1115] Loading model weights took 12.2486 GB
(VllmWorkerProcess pid=65) INFO 01-30 18:36:48 model_runner.py:1115] Loading model weights took 12.2486 GB
(VllmWorkerProcess pid=65) INFO 01-30 18:36:56 worker.py:266] Memory profiling takes 7.82 seconds
(VllmWorkerProcess pid=65) INFO 01-30 18:36:56 worker.py:266] the current vLLM instance can use total_gpu_memory (23.69GiB) x gpu_memory_utilization (0.90) = 21.32GiB
(VllmWorkerProcess pid=65) INFO 01-30 18:36:56 worker.py:266] model weights take 12.25GiB; non_torch_memory takes 0.14GiB; PyTorch activation peak memory takes 3.89GiB; the rest of the memory reserved for KV Cache is 5.04GiB.
INFO 01-30 18:36:57 worker.py:266] Memory profiling takes 8.40 seconds
INFO 01-30 18:36:57 worker.py:266] the current vLLM instance can use total_gpu_memory (23.69GiB) x gpu_memory_utilization (0.90) = 21.32GiB
INFO 01-30 18:36:57 worker.py:266] model weights take 12.25GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 3.60GiB; the rest of the memory reserved for KV Cache is 5.33GiB.
INFO 01-30 18:36:57 executor_base.py:108] # CUDA blocks: 4129, # CPU blocks: 3276
INFO 01-30 18:36:57 executor_base.py:113] Maximum concurrency for 32768 tokens per request: 2.02x
INFO 01-30 18:37:01 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 12.90 seconds
INFO 01-30 18:37:01 api_server.py:753] Using supplied chat template:
INFO 01-30 18:37:01 api_server.py:753] None
INFO 01-30 18:37:01 launcher.py:19] Available routes are:
INFO 01-30 18:37:01 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 01-30 18:37:01 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 01-30 18:37:01 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 01-30 18:37:01 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 01-30 18:37:01 launcher.py:27] Route: /health, Methods: GET
INFO 01-30 18:37:01 launcher.py:27] Route: /ping, Methods: GET, POST
INFO 01-30 18:37:01 launcher.py:27] Route: /tokenize, Methods: POST
INFO 01-30 18:37:01 launcher.py:27] Route: /detokenize, Methods: POST
INFO 01-30 18:37:01 launcher.py:27] Route: /v1/models, Methods: GET
INFO 01-30 18:37:01 launcher.py:27] Route: /version, Methods: GET
INFO 01-30 18:37:01 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 01-30 18:37:01 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 01-30 18:37:01 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 01-30 18:37:01 launcher.py:27] Route: /pooling, Methods: POST
INFO 01-30 18:37:01 launcher.py:27] Route: /score, Methods: POST
INFO 01-30 18:37:01 launcher.py:27] Route: /v1/score, Methods: POST
INFO 01-30 18:37:01 launcher.py:27] Route: /rerank, Methods: POST
INFO 01-30 18:37:01 launcher.py:27] Route: /v1/rerank, Methods: POST
INFO 01-30 18:37:01 launcher.py:27] Route: /v2/rerank, Methods: POST
INFO 01-30 18:37:01 launcher.py:27] Route: /invocations, Methods: POST
INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     172.17.0.1:41718 - "GET /health HTTP/1.1" 200 OK
INFO 01-30 18:37:02 chat_utils.py:330] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
WARNING 01-30 18:37:02 chat_utils.py:990] 'add_generation_prompt' is not supported for mistral tokenizer, so it will be ignored.
WARNING 01-30 18:37:02 chat_utils.py:994] 'continue_final_message' is not supported for mistral tokenizer, so it will be ignored.
INFO 01-30 18:37:02 logger.py:37] Received request chatcmpl-297e95a809c449f6b9e2a1bf35f84075: prompt: None, params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32740, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [1, 3, 1267, 4019, 27505, 1877, 1049, 1046, 1766, 1050, 1048, 1050, 1053, 1045, 1048, 1049, 1045, 1049, 1055, 3077, 7777, 12940, 28005, 1911, 2342, 4688, 1106, 4], lora_request: None, prompt_adapter_request: None.
INFO:     172.17.0.1:41718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 01-30 18:37:02 async_llm_engine.py:209] Added request chatcmpl-297e95a809c449f6b9e2a1bf35f84075.
/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py:519: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:1560.)
 object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
[rank0]:[W130 18:37:02.056104077 ProcessGroupNCCL.cpp:3057] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[rank1]:[W130 18:37:02.056189458 ProcessGroupNCCL.cpp:3057] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
INFO 01-30 18:37:03 async_llm_engine.py:177] Finished request chatcmpl-297e95a809c449f6b9e2a1bf35f84075.
[llama-swap] 192.168.16.7 [2025-01-30 18:37:03] "POST /v1/chat/completions HTTP/1.1" 200 1001 "Python/3.11 aiohttp/3.11.8" 52.238973762s
[llama-swap] 192.168.16.7 [2025-01-30 18:37:03] "GET /v1/models HTTP/1.1" 200 578 "Python/3.11 aiohttp/3.11.8" 58.09µs
INFO 01-30 18:37:03 logger.py:37] Received request chatcmpl-fa838fe757ee4c58a9819cb2d94f30d4: prompt: None, params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=50, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [1, 3, 17013, 120256, 1747, 4607, 15776, 1338, 11745, 1395, 1278, 7330, 1877, 1106, 1475, 1267, 66249, 1261, 104335, 4798, 1319, 2649, 2081, 2224, 1032, 1051, 6619, 1041, 1455, 32181, 31825, 1278, 2830, 14364, 1505, 17915, 1307, 1278, 7330, 1294, 14171, 7278, 1046, 4682, 3659, 1275, 1710, 1402, 2434, 1317, 22548, 9705, 1809, 10035, 95017, 26607, 1505, 6061, 58929, 1046, 18113, 1080, 2918, 1068, 101803, 54585, 14671, 1353, 62814, 57484, 1338, 63841, 1307, 26864, 1877, 1240, 1159, 1147, 1137, 18196, 29897, 120430, 1010, 1240, 1159, 1141, 1170, 113959, 53059, 89275, 1010, 1069, 5408, 1307, 10144, 123630, 1010, 12133, 14012, 3078, 106375, 1010, 13370, 71575, 42571, 1294, 64865, 1010, 1240, 1159, 1142, 1174, 9856, 15972, 13240, 8713, 4], lora_request: None, prompt_adapter_request: None.
INFO 01-30 18:37:03 async_llm_engine.py:209] Added request chatcmpl-fa838fe757ee4c58a9819cb2d94f30d4.
INFO 01-30 18:37:03 async_llm_engine.py:177] Finished request chatcmpl-fa838fe757ee4c58a9819cb2d94f30d4.
INFO:     172.17.0.1:41718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[llama-swap] 192.168.16.7 [2025-01-30 18:37:03] "POST /v1/chat/completions HTTP/1.1" 200 436 "Python/3.11 aiohttp/3.11.8" 375.147444ms
INFO 01-30 18:37:11 metrics.py:453] Avg prompt throughput: 14.2 tokens/s, Avg generation throughput: 1.2 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
[llama-swap] 192.168.16.7 [2025-01-30 18:37:14] "GET /v1/models HTTP/1.1" 200 578 "Python/3.11 aiohttp/3.11.8" 60.705µs
INFO 01-30 18:37:14 logger.py:37] Received request chatcmpl-4f06117407644dad8c33815f9701b389: prompt: None, params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32740, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [1, 3, 1267, 4019, 27505, 1877, 1049, 1046, 1766, 1050, 1048, 1050, 1053, 1045, 1048, 1049, 1045, 1049, 1055, 3077, 7777, 12940, 28005, 1911, 2342, 4688, 1106, 4], lora_request: None, prompt_adapter_request: None.
INFO:     172.17.0.1:47590 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 01-30 18:37:14 async_llm_engine.py:209] Added request chatcmpl-4f06117407644dad8c33815f9701b389.
INFO 01-30 18:37:14 async_llm_engine.py:177] Finished request chatcmpl-4f06117407644dad8c33815f9701b389.
[llama-swap] 192.168.16.7 [2025-01-30 18:37:14] "POST /v1/chat/completions HTTP/1.1" 200 2927 "Python/3.11 aiohttp/3.11.8" 520.776271ms
INFO 01-30 18:37:14 logger.py:37] Received request chatcmpl-85e1239cb2524a9c8ea90487eace34cb: prompt: None, params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=50, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [1, 3, 17013, 120256, 1747, 4607, 15776, 1338, 11745, 1395, 1278, 7330, 1877, 1106, 1475, 1267, 66249, 1261, 104335, 4798, 1319, 2649, 2081, 2224, 1032, 1051, 6619, 1041, 1455, 32181, 31825, 1278, 2830, 14364, 1505, 17915, 1307, 1278, 7330, 1294, 14171, 7278, 1046, 4682, 3659, 1275, 1710, 1402, 2434, 1317, 22548, 9705, 1809, 10035, 95017, 26607, 1505, 6061, 58929, 1046, 18113, 1080, 2918, 1068, 101803, 54585, 14671, 1353, 62814, 57484, 1338, 63841, 1307, 26864, 1877, 1240, 1159, 1147, 1137, 18196, 29897, 120430, 1010, 1240, 1159, 1141, 1170, 113959, 53059, 89275, 1010, 1069, 5408, 1307, 10144, 123630, 1010, 12133, 14012, 3078, 106375, 1010, 13370, 71575, 42571, 1294, 64865, 1010, 1240, 1159, 1142, 1174, 9856, 15972, 13240, 8713, 4], lora_request: None, prompt_adapter_request: None.
INFO 01-30 18:37:14 async_llm_engine.py:209] Added request chatcmpl-85e1239cb2524a9c8ea90487eace34cb.
INFO 01-30 18:37:17 metrics.py:453] Avg prompt throughput: 28.2 tokens/s, Avg generation throughput: 3.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 01-30 18:37:17 async_llm_engine.py:177] Finished request chatcmpl-85e1239cb2524a9c8ea90487eace34cb.
INFO:     172.17.0.1:47590 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[llama-swap] 192.168.16.7 [2025-01-30 18:37:17] "POST /v1/chat/completions HTTP/1.1" 200 436 "Python/3.11 aiohttp/3.11.8" 2.689992894s
INFO 01-30 18:37:31 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 01-30 18:37:41 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
[llama-swap] 192.168.16.7 [2025-01-30 18:37:46] "GET /v1/models HTTP/1.1" 200 578 "Python/3.11 aiohttp/3.11.8" 130.957µs
[llama-swap] 192.168.16.7 [2025-01-30 18:38:01] "GET /v1/models HTTP/1.1" 200 578 "Python/3.11 aiohttp/3.11.8" 78.257µs
[llama-swap] 192.168.16.7 [2025-01-30 18:38:10] "GET /v1/models HTTP/1.1" 200 578 "Python/3.11 aiohttp/3.11.8" 62.328µs
[llama-swap] 192.168.16.7 [2025-01-30 18:38:43] "GET /v1/models HTTP/1.1" 200 578 "Python/3.11 aiohttp/3.11.8" 83.678µs
[llama-swap] 192.168.0.11 [2025-01-30 18:39:03] "GET /logs HTTP/1.1" 200 4359 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 Edg/132.0.0.0" 91.893µs
[llama-swap] 192.168.0.11 [2025-01-30 18:39:03] "GET /logs/streamSSE HTTP/1.1" 200 36152 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 Edg/132.0.0.0" 3m47.157146088s
[llama-swap] 192.168.0.11 [2025-01-30 18:39:03] "GET /favicon.ico HTTP/1.1" 200 15406 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 Edg/132.0.0.0" 146.776µs
INFO 01-30 18:39:18 launcher.py:57] Shutting down FastAPI HTTP server.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO 01-30 18:39:18 async_llm_engine.py:63] Engine is gracefully shutting down.
(VllmWorkerProcess pid=65) INFO 01-30 18:39:18 multiproc_worker_utils.py:251] Worker exiting
INFO 01-30 18:39:20 multiproc_worker_utils.py:126] Killing local vLLM worker processes
[rank0]:[W130 18:39:20.565236225 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

This is "old style" shutdown, without "cmd_stop: docker stop"

INFO 01-30 18:39:18 launcher.py:57] Shutting down FastAPI HTTP server.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO 01-30 18:39:18 async_llm_engine.py:63] Engine is gracefully shutting down.
(VllmWorkerProcess pid=65) INFO 01-30 18:39:18 multiproc_worker_utils.py:251] Worker exiting
INFO 01-30 18:39:20 multiproc_worker_utils.py:126] Killing local vLLM worker processes
[rank0]:[W130 18:39:20.565236225 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

@zenabius

zenabius commented Jan 31, 2025

@mostlygeek
Everything seems to be fine; I'm just not using container names.
The main thing needed to stop a container the old-style way (via SIGTERM) is --init.

@zenabius

zenabius commented Jan 31, 2025

@mostlygeek I forgot to ask:
Is there any way to auto-load a model when the llama-swap service starts up?

Also, could you please add the "/audio/transcriptions" endpoint for OpenWebUI audio?

And there is one more problem: llama-swap doesn't understand OPTIONS requests.

OPTIONS /v1/chat/completions HTTP/1.1" 404 -1 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Cursor/0.45.5 Chrome/128.0.6613.186 Electron/32.2.6 Safari/537.36" 3.747µs
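For anyone who wants to reproduce it, an OPTIONS preflight request can be sent with plain curl, e.g.:

curl -i -X OPTIONS http://192.168.0.2:8080/v1/chat/completions \
  -H "Origin: http://localhost" \
  -H "Access-Control-Request-Method: POST" \
  -H "Access-Control-Request-Headers: authorization, content-type"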

$headers = @{
		"Content-Type" = "application/json"
		"Authorization" = "Bearer 123456"
}

$body = @{
		"messages" = @(
				@{
						"role" = "system"
						"content" = "You are a test assistant."
				},
				@{
						"role" = "user"
						"content" = "Testing. Just say hi and nothing else."
				}
		)
		"model" = "coder"
} | ConvertTo-Json

Invoke-WebRequest -Uri "http://192.168.0.2:8080/v1/chat/completions" -Method Post -Headers $headers -Body $body

@mostlygeek
Owner

@zenabius

  • Please file an issue for supporting OPTIONS. That should be easy to add.
  • llama-swap supports the OpenAI endpoints for the broadest compatibility.

@zenabius

@mostlygeek Sorry, /audio/transcriptions is for OpenWebUI audio transcriptions; I wrote it wrong.

@knguyen298

I'm genuinely not sure what's different about my config that makes it work for me. I believe I am running v83, based on the timestamp on my llama-swap file. I checked my dockerd config file; there's nothing there except the NVIDIA container runtime config.

I do see that my systemd unit has llama-swap running as a local user who is part of the docker group. In the README you have it as User=nobody; not sure if that makes a difference.

--init is the documented way for a Docker container to receive signals, including SIGTERM, so that should be all that's needed. I went ahead and checked swapping containers while watching docker ps, and confirmed there's no lingering container.

I'm running Ubuntu Server 22.04.5, with Docker version 27.4.1, build b9d17ea.
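For anyone wanting to verify on their end, keeping something like this running in a second terminal while triggering swaps makes any lingering container obvious:

watch -n 1 'docker ps --format "table {{.Names}}\t{{.Status}}"'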

@mostlygeek
Owner

mostlygeek commented Jan 31, 2025

I did as much research as I could, and the docs say the docker client will proxy signals and send them to PID 1 in the container. I haven't been able to get it working as expected.

Either way, there are two ways to shut down containers cleanly now: the SIGTERM and the cmd_stop. I do prefer having just one way though 🤷🏻‍♂️.

@mostlygeek
Owner

mostlygeek commented Jan 31, 2025

I figured out what my problem was: my Docker version was too old (~v24).
After upgrading my Ubuntu 24.04 box to 27.5.1, the docker run command propagates SIGTERM signals. Yeesh.
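If anyone else hits this, checking the engine version is a one-liner:

docker version --format '{{.Server.Version}}'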

Thanks @zenabius and @knguyen298 for the info.
