-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(request-response): Report failure when streams are at capacity #5417
fix(request-response): Report failure when streams are at capacity #5417
Conversation
@jxs We will appreciate it if this can make it to 0.54. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks Yianis!
This pull request has merge conflicts. Could you please resolve them @oblique? 🙏 |
Approvals have been dismissed because the PR was updated after the send-it
label was applied.
@jxs I was under impression that 0.26.3 is already released, but it is not on crates.io 🤔 |
Fixes potential hanging issue if use relies on response or failures to make progress Pull-Request: libp2p#5417.
This PR enforces that outbound requests are finished within the specified protocol timeout. The stable2412 version running libp2p 0.52.4 contains a bug which does not track request timeouts properly: - libp2p/rust-libp2p#5429 The issue has been detected while submitting libp2p -> litep2p requests in kusama. This aims to check that pending outbound requests have not timedout. Although the issue has been fixed in libp2p, there might be other cases where this may happen. For example: - libp2p/rust-libp2p#5417 For more context see: #7076 (comment) 1. Ideally, the force-timeout mechanism in this PR should never be triggered in production. However, origin/stable2412 occasionally encounters this issue. When this happens, 2 warnings may be generated: - one warning introduced by this PR wrt force timeout terminating the request - possible one warning when the libp2p decides (if at all) to provide the response back to substrate (as mentioned by @alexggh [here](/~https://github.com/paritytech/polkadot-sdk/pull/7222/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7L769) and [here](/~https://github.com/paritytech/polkadot-sdk/pull/7222/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7L842) 2. This implementation does not propagate to the substrate service the `RequestFinished { error: .. }`. That event is only used internally by substrate to increment metrics. However, we don't have the peer information available to propagate the event properly when we force-timeout the request. Considering this should most likely not happen in production (origin/master) and that we'll be able to extract information by warnings, I would say this is a good tradeoff for code simplicity: /~https://github.com/paritytech/polkadot-sdk/blob/06e3b5c6a7696048d65f1b8729f16b379a16f501/substrate/client/network/src/service.rs#L1543 ### Testing Added a new test to ensure the timeout is reached properly, even if libp2p does not produce a response in due time. I've also transitioned the tests to using `tokio::test` due to a limitation of [CI](/~https://github.com/paritytech/polkadot-sdk/actions/runs/12832055737/job/35784043867) ``` --- TRY 1 STDERR: sc-network request_responses::tests::max_response_size_exceeded --- thread 'request_responses::tests::max_response_size_exceeded' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/time/interval.rs:139:26: there is no reactor running, must be called from the context of a Tokio 1.x runtime ``` cc @paritytech/networking --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io> Co-authored-by: Bastian Köcher <git@kchr.de>
Description
Fixes potential hanging issue if use relies on response or failures to make progress
Notes & open questions
We are investigating a bug in our project (eigerco/lumina#256) and @zvolin found out that when an outbound request can not be scheduled it didn't produce any errors.
Inbound requests do not need to produce this kind of error because they only get reported to the user when they successfully been scheduled.
Change checklist