[Bug] Get "can't connect to retriever" error when concurrency exceeds 32 #1556

Open
2 of 8 tasks
leslieluyu opened this issue Feb 14, 2025 · 6 comments
Labels
bug Something isn't working Dev

Comments

@leslieluyu
Collaborator

Priority

P2-High

OS type

Ubuntu

Hardware type

Gaudi2

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source
  • Other

Deploy method

  • Docker
  • Docker Compose
  • Kubernetes Helm Charts
  • Kubernetes GMC
  • Other

Running nodes

Single Node

What's the version?

chatqna v1.2 norerank
chatqna-chatqna-ui-695995789c-sz67q opea/chatqna-ui:1.2
chatqna-data-prep-67f484b58f-xwvct opea/dataprep:1.2
chatqna-db8987c4c-slm6z opea/chatqna-without-rerank:1.2
chatqna-nginx-6d9df4b75b-swts2 opea/nginx:latest
chatqna-redis-vector-db-66c94f7fc5-csfx2 redis/redis-stack:7.2.0-v9
chatqna-retriever-usvc-5b64ff97c8-4fkd9 opea/retriever:1.2
chatqna-tei-7fc4845868-lr2wx ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
chatqna-tgi-f5fc79849-bhrsk ghcr.io/huggingface/tgi-gaudi:2.3.1
chatqna-tgi-f5fc79849-fv48f ghcr.io/huggingface/tgi-gaudi:2.3.1
chatqna-tgi-f5fc79849-jdmwj ghcr.io/huggingface/tgi-gaudi:2.3.1
chatqna-tgi-f5fc79849-jwsxb ghcr.io/huggingface/tgi-gaudi:2.3.1
chatqna-tgi-f5fc79849-nhkdj ghcr.io/huggingface/tgi-gaudi:2.3.1
chatqna-tgi-f5fc79849-q5glp ghcr.io/huggingface/tgi-gaudi:2.3.1
chatqna-tgi-f5fc79849-td5lk ghcr.io/huggingface/tgi-gaudi:2.3.1
chatqna-tgi-f5fc79849-zxtr9 ghcr.io/huggingface/tgi-gaudi:2.3.1

Description

Requests fail under heavy load.
When running the benchmark to collect performance data, the error "can't connect to retriever" appears (see the raw log below) once concurrency exceeds 32.
No errors occur at concurrency 16 or below (1, 2, 4, 8, 16).
This only happens in v1.2; it did not occur in previous versions (v1.1, v1.0, v0.8, etc.).

Reproduce steps

  1. Deploy ChatQnA v1.2 using the Helm chart
  2. Send requests using the benchmarking scripts
  3. Check the log of the chatqna backend
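
For anyone reproducing this without the full benchmark suite, the concurrency sweep described above can be sketched as follows. This is only an illustration, not the benchmarking scripts used in this report: `run_level` and `sweep` are hypothetical names, and `send_request` is a placeholder for whatever coroutine posts one query to `/v1/chatqna`.

```python
import asyncio


async def run_level(send_request, concurrency, total_requests):
    """Send total_requests with at most `concurrency` in flight.

    Returns a tuple (ok, failed). `send_request` is a caller-supplied
    coroutine function (hypothetical here) that raises on failure.
    """
    limit = asyncio.Semaphore(concurrency)
    ok = 0
    failed = 0

    async def one():
        nonlocal ok, failed
        async with limit:
            try:
                await send_request()
                ok += 1
            except Exception:
                # Any exception counts as a failed request, e.g. the
                # ConnectionRefusedError seen in the raw log.
                failed += 1

    await asyncio.gather(*(one() for _ in range(total_requests)))
    return ok, failed


async def sweep(send_request, levels=(1, 2, 4, 8, 16, 32), requests_per_level=64):
    """Run each concurrency level in turn and collect (ok, failed) per level."""
    results = {}
    for level in levels:
        results[level] = await run_level(send_request, level, requests_per_level)
    return results
```

Running `asyncio.run(sweep(my_request))` returns a dict keyed by concurrency level, which makes it easy to see at which level failures start.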

Raw log

chatqna-5d64d99997-w7928 chatqna INFO:     100.83.122.244:57853 - "POST /v1/chatqna HTTP/1.1" 500 Internal Server Error
chatqna-5d64d99997-w7928 chatqna ERROR:    Exception in ASGI application
chatqna-5d64d99997-w7928 chatqna     raise OSError(err, f'Connect call failed {address}')
chatqna-5d64d99997-w7928 chatqna ConnectionRefusedError: [Errno 111] Connect call failed ('172.21.103.154', 7000)
chatqna-5d64d99997-w7928 chatqna   File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 187, in __call__
chatqna-5d64d99997-w7928 chatqna   File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 165, in __call__
chatqna-5d64d99997-w7928 chatqna     raise client_error(req.connection_key, exc) from exc
chatqna-5d64d99997-w7928 chatqna aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host chatqna-retriever-usvc:7000 ssl:default [Connect call failed ('172.21.103.154', 7000)]
chatqna-5d64d99997-w7928 chatqna INFO:     100.83.122.244:36813 - "POST /v1/chatqna HTTP/1.1" 500 Internal Server Error
chatqna-5d64d99997-w7928 chatqna ERROR:    Exception in ASGI application
chatqna-5d64d99997-w7928 chatqna     raise OSError(err, f'Connect call failed {address}')
chatqna-5d64d99997-w7928 chatqna ConnectionRefusedError: [Errno 111] Connect call failed ('172.21.103.154', 7000)
chatqna-5d64d99997-w7928 chatqna   File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 187, in __call__
chatqna-5d64d99997-w7928 chatqna   File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 165, in __call__
chatqna-5d64d99997-w7928 chatqna     raise client_error(req.connection_key, exc) from exc
chatqna-5d64d99997-w7928 chatqna aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host chatqna-retriever-usvc:7000 ssl:default [Connect call failed ('172.21.103.154', 7000)]
chatqna-5d64d99997-w7928 chatqna INFO:     100.83.122.244:43839 - "POST /v1/chatqna HTTP/1.1" 500 Internal Server Error
chatqna-5d64d99997-w7928 chatqna ERROR:    Exception in ASGI application
chatqna-5d64d99997-w7928 chatqna     raise OSError(err, f'Connect call failed {address}')
chatqna-5d64d99997-w7928 chatqna ConnectionRefusedError: [Errno 111] Connect call failed ('172.21.103.154', 7000)
chatqna-5d64d99997-w7928 chatqna   File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 187, in __call__
chatqna-5d64d99997-w7928 chatqna   File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 165, in __call__
chatqna-5d64d99997-w7928 chatqna     raise client_error(req.connection_key, exc) from exc
chatqna-5d64d99997-w7928 chatqna aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host chatqna-retriever-usvc:7000 ssl:default [Connect call failed ('172.21.103.154', 7000)]
chatqna-5d64d99997-w7928 chatqna INFO:     100.83.122.244:1674 - "POST /v1/chatqna HTTP/1.1" 500 Internal Server Error
chatqna-5d64d99997-w7928 chatqna ERROR:    Exception in ASGI application
chatqna-5d64d99997-w7928 chatqna     raise OSError(err, f'Connect call failed {address}')
chatqna-5d64d99997-w7928 chatqna ConnectionRefusedError: [Errno 111] Connect call failed ('172.21.103.154', 7000)
chatqna-5d64d99997-w7928 chatqna   File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 187, in __call__
chatqna-5d64d99997-w7928 chatqna   File "/usr/local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 165, in __call__
chatqna-5d64d99997-w7928 chatqna     raise client_error(req.connection_key, exc) from exc
chatqna-5d64d99997-w7928 chatqna aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host chatqna-retriever-usvc:7000 ssl:default [Connect call failed ('172.21.103.154', 7000)]
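
To quantify failures from a log like the one above, a small parser can count connect failures per target and HTTP 500 responses per endpoint. This is a minimal sketch that assumes the line formats shown in this log; `summarize` is a hypothetical helper and not part of any OPEA tooling.

```python
import re
from collections import Counter

# Patterns based on the log lines in this report (an assumption about format).
CONNECT_FAIL = re.compile(r"Cannot connect to host (?P<host>[\w.-]+):(?P<port>\d+)")
HTTP_500 = re.compile(r'"POST (?P<path>\S+) HTTP/1\.1" 500')


def summarize(lines):
    """Return (connect_failures, http_500) counters for an iterable of log lines."""
    connect_failures = Counter()
    http_500 = Counter()
    for line in lines:
        m = CONNECT_FAIL.search(line)
        if m:
            connect_failures[(m["host"], int(m["port"]))] += 1
        m = HTTP_500.search(line)
        if m:
            http_500[m["path"]] += 1
    return connect_failures, http_500
```

Feeding it the output of `kubectl logs` for the chatqna pod, line by line, gives failure counts that can be compared across concurrency levels or versions.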

Attachments

No response

@leslieluyu leslieluyu added the bug Something isn't working label Feb 14, 2025
@xiguiw
Collaborator

xiguiw commented Feb 17, 2025

chatqna-retriever-usvc-5b64ff97c8-4fkd9 opea/retriever:1.2

What is chatqna-retriever-usvc?
It exists only in the Kubernetes deployment, not in docker-compose.

service_name: "chatqna-retriever-usvc" # Replace with your service name

@xiguiw
Collaborator

xiguiw commented Feb 17, 2025

@leslieluyu
Would you please kindly provide the log of chatqna-retriever-usvc? Thanks!

@leslieluyu
Collaborator Author

I just found that the issue was mainly caused by LOGFLAG=True, which puts a heavy load on the retriever.
Still, comparing chatqna v1.1 and v1.2 with LOGFLAG=True in both, the performance of v1.2 is worse.
At the least, v1.2 has many more failed requests than v1.1.
This issue is just for the record.
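
How the retriever reads LOGFLAG is not shown in this thread. As a sketch only, one way to gate per-request logging behind such a flag is shown below. It assumes the flag arrives as an environment variable; `env_flag` and `log_request` are hypothetical names. Strict parsing matters because any non-empty string, including "False", is truthy in Python.

```python
import logging
import os

logger = logging.getLogger("retriever_sketch")  # hypothetical logger name


def env_flag(name, default=False, environ=None):
    """Parse a boolean environment variable strictly.

    Only 1/true/yes/on (case-insensitive) enable the flag, so a value
    such as "False" does not switch logging on by accident.
    """
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def log_request(payload, enabled):
    """Log a request payload only when enabled; return whether it was logged."""
    if enabled:
        # Lazy %-formatting defers building the string to the logging call.
        logger.info("request: %s", payload)
    return enabled
```

With this shape, `env_flag("LOGFLAG")` is evaluated once at startup and the per-request cost when logging is off is a single boolean check.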

@xiguiw
Collaborator

xiguiw commented Feb 25, 2025

> ... mainly caused by LOGFLAG=True, the retriever will have heavy load.
> But still, compared chatqna v1.1 and v1.2 with the LOGFLAG=True, the perf data of v1.2 will be worse.
> At least the failed request number of v1.2 will be much more than v1.1.
> This issue is just for record

Could you share some numbers so we can get an idea of how LOGFLAG affects performance?

  1. With and without LOGFLAG, what is the maximum number of requests the retriever can support? Is the failure intermittent or consistent?

  2. What is the maximum number of requests the retriever supports in v1.1 and v1.2?
    Thanks!

The regression in v1.2 should be analyzed and root-caused.
@lvliang-intel Who can help with this?

@leslieluyu
Collaborator Author

leslieluyu commented Feb 28, 2025

@leslieluyu
[Xigui] I edited this comment. Only the retrieval performance gap is needed.
For example, what is the maximum number of requests the retriever can support in v1.1 and v1.2?
And what is the performance gap with and without logging (LOGFLAG)?

@xiguiw
Collaborator

xiguiw commented Mar 4, 2025

Compared to the LLM and reranking, the retriever consumes little compute and memory (both capacity and bandwidth).

For the performance regression, we need to double-check the retriever log. If requests are blocked by the retriever, the likely causes are sync vs. async I/O, timeouts, etc.

@leslieluyu
The logs above are from the backend macro service, i.e. the chatqna macro service container log.

chatqna-5d64d99997-w7928 chatqna INFO:     100.83.122.244:57853 - "POST /v1/chatqna HTTP/1.1" 500 Internal Server Error

Would you please share the retrieval container log? If it is not easy to separate the container logs, you can post all the logs here.
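
To illustrate the sync vs. async I/O concern raised in this comment: a blocking call inside an async handler holds the event loop, so requests are served one at a time no matter how many arrive. The sketch below uses only the standard library and is not taken from the retriever code; all names are hypothetical, and `time.sleep` stands in for any blocking I/O such as synchronous logging or a synchronous database client.

```python
import asyncio
import time


async def run_handlers(handler, n):
    """Run n handler calls concurrently; return the peak number in flight."""
    state = {"in_flight": 0, "peak": 0}

    async def tracked():
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        try:
            await handler()
        finally:
            state["in_flight"] -= 1

    await asyncio.gather(*(tracked() for _ in range(n)))
    return state["peak"]


async def blocking_handler():
    # Synchronous work never yields, so the event loop is held until it ends.
    time.sleep(0.01)


async def cooperative_handler():
    # Awaiting yields the loop, so other requests can make progress.
    await asyncio.sleep(0.01)


def peak_concurrency(handler, n=8):
    """Convenience wrapper: peak in-flight handlers for n concurrent calls."""
    return asyncio.run(run_handlers(handler, n))
```

Here `peak_concurrency(blocking_handler)` reports that only one request is ever in flight, while `peak_concurrency(cooperative_handler)` reaches all eight. Whether this is what happens in the v1.2 retriever can only be confirmed from its container log.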

Projects
None yet
Development

No branches or pull requests

4 participants