fix(request-response): Report failure when streams are at capacity #5417

Merged
merged 10 commits into from
Jun 4, 2024

Conversation

oblique
Contributor

@oblique oblique commented May 24, 2024

Description

Fixes a potential hang when the user relies on responses or failures to make progress.

Notes & open questions

We are investigating a bug in our project (eigerco/lumina#256), and @zvolin found that when an outbound request cannot be scheduled, no error is produced.

Inbound requests do not need to produce this kind of error because they are only reported to the user once they have been successfully scheduled.
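The behaviour described above can be sketched in a few lines. This is only an illustrative model, not the code from this PR or the libp2p API: `Handler`, `Event`, `send_request`, `poll_event`, and the `"streams at capacity"` reason string are all hypothetical names. The point it shows is that a request that cannot be scheduled yields an explicit failure event instead of being dropped silently.

```rust
use std::collections::VecDeque;

/// Hypothetical event type for this sketch; not the libp2p API.
#[derive(Debug, PartialEq)]
enum Event {
    /// The request was given a stream.
    Scheduled { request_id: u64 },
    /// The request could not be scheduled and is reported as failed.
    OutboundFailure { request_id: u64, reason: &'static str },
}

/// Hypothetical handler with a fixed cap on concurrent streams.
struct Handler {
    max_streams: usize,
    active_streams: usize,
    events: VecDeque<Event>,
}

impl Handler {
    fn new(max_streams: usize) -> Self {
        Handler {
            max_streams,
            active_streams: 0,
            events: VecDeque::new(),
        }
    }

    /// Try to schedule an outbound request.
    /// At capacity, queue a failure event so the caller is never left waiting.
    fn send_request(&mut self, request_id: u64) {
        if self.active_streams >= self.max_streams {
            self.events.push_back(Event::OutboundFailure {
                request_id,
                reason: "streams at capacity",
            });
            return;
        }
        self.active_streams += 1;
        self.events.push_back(Event::Scheduled { request_id });
    }

    /// Pop the next event for the user, if any.
    fn poll_event(&mut self) -> Option<Event> {
        self.events.pop_front()
    }
}
```

With a cap of one stream, a second `send_request` produces an `OutboundFailure` event, so a caller that waits on "response or failure" always makes progress.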

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@oblique oblique changed the title fix(request-response)!: Report outbound failure when max concurrent streams is reached fix(request-response)!: Report outbound failure when max concurrent streams reached May 24, 2024
@oblique oblique changed the title fix(request-response)!: Report outbound failure when max concurrent streams reached fix(request-response)!: Report outbound failure when streams are at capacity May 24, 2024
@oblique oblique changed the title fix(request-response)!: Report outbound failure when streams are at capacity fix(request-response)!: Report failure when streams are at capacity May 24, 2024
@oblique
Contributor Author

oblique commented May 24, 2024

@jxs We would appreciate it if this could make it into 0.54.

@jxs jxs requested a review from thomaseizinger May 24, 2024 16:24
@oblique oblique changed the title fix(request-response)!: Report failure when streams are at capacity fix(request-response): Report failure when streams are at capacity May 29, 2024
@thomaseizinger thomaseizinger requested a review from jxs June 1, 2024 22:00
jxs
jxs previously approved these changes Jun 3, 2024
Member

@jxs jxs left a comment

LGTM thanks Yianis!

protocols/request-response/tests/error_reporting.rs (three review threads, outdated and resolved)
@jxs jxs added the send-it label Jun 4, 2024
Contributor

mergify bot commented Jun 4, 2024

This pull request has merge conflicts. Could you please resolve them @oblique? 🙏

@mergify mergify bot dismissed jxs’s stale review June 4, 2024 10:05

Approvals have been dismissed because the PR was updated after the send-it label was applied.

@mergify mergify bot merged commit af42122 into libp2p:master Jun 4, 2024
72 checks passed
@oblique oblique deleted the fix/req-resp-max-streams-failure branch June 4, 2024 14:46
@nazar-pc
Contributor

nazar-pc commented Jun 4, 2024

@jxs I was under the impression that 0.26.3 is already released, but it is not on crates.io 🤔

@jxs
Member

jxs commented Jun 4, 2024

No, sorry @nazar-pc, I meant the present tense: that I was going to release it. It's now released, see here. cc @oblique

TimTinkers pushed a commit to unattended-backpack/rust-libp2p that referenced this pull request Sep 14, 2024
Fixes a potential hang when the user relies on responses or failures to make progress.

Pull-Request: libp2p#5417.
github-merge-queue bot pushed a commit to paritytech/polkadot-sdk that referenced this pull request Jan 22, 2025
This PR enforces that outbound requests are finished within the
specified protocol timeout.

The stable2412 version running libp2p 0.52.4 contains a bug which does
not track request timeouts properly:
- libp2p/rust-libp2p#5429

The issue has been detected while submitting libp2p -> litep2p requests
in Kusama. This aims to check that pending outbound requests have not
timed out. Although the issue has been fixed in libp2p, there might be
other cases where this may happen. For example:
- libp2p/rust-libp2p#5417

For more context see:
#7076 (comment)
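
A minimal sketch of the force-timeout idea, assuming a simple map from request ID to start time. `PendingRequests` and its methods are hypothetical names for illustration and are not taken from the polkadot-sdk or libp2p code. The current time is passed in explicitly so the logic stays deterministic:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical tracker for pending outbound requests.
struct PendingRequests {
    timeout: Duration,
    started: HashMap<u64, Instant>,
}

impl PendingRequests {
    fn new(timeout: Duration) -> Self {
        PendingRequests {
            timeout,
            started: HashMap::new(),
        }
    }

    /// Record when a request was sent.
    fn insert(&mut self, request_id: u64, now: Instant) {
        self.started.insert(request_id, now);
    }

    /// Called when the lower layer reports a response or failure.
    /// Returns false if the request is unknown or was already force-timed-out.
    fn finish(&mut self, request_id: u64) -> bool {
        self.started.remove(&request_id).is_some()
    }

    /// Remove and return the IDs of requests pending for at least `timeout`,
    /// even though the lower layer never reported anything for them.
    fn force_timeout(&mut self, now: Instant) -> Vec<u64> {
        let timeout = self.timeout;
        let mut expired: Vec<u64> = self
            .started
            .iter()
            .filter(|(_, started)| now.duration_since(**started) >= timeout)
            .map(|(id, _)| *id)
            .collect();
        expired.sort_unstable();
        for id in &expired {
            self.started.remove(id);
        }
        expired
    }
}
```

In this sketch, a late response for a request that was already force-timed-out makes `finish` return `false`, which is where a caller could log a warning.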


1. Ideally, the force-timeout mechanism in this PR should never be
triggered in production. However, origin/stable2412 occasionally
encounters this issue. When this happens, 2 warnings may be generated:
- one warning introduced by this PR wrt force timeout terminating the
request
- possibly one warning when libp2p decides (if at all) to provide
the response back to substrate (as mentioned by @alexggh
[here](https://github.com/paritytech/polkadot-sdk/pull/7222/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7L769)
and
[here](https://github.com/paritytech/polkadot-sdk/pull/7222/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7L842))

2. This implementation does not propagate to the substrate service the
`RequestFinished { error: .. }`. That event is only used internally by
substrate to increment metrics. However, we don't have the peer
information available to propagate the event properly when we
force-timeout the request. Considering this should most likely not
happen in production (origin/master) and that we'll be able to extract
information by warnings, I would say this is a good tradeoff for code
simplicity:


https://github.com/paritytech/polkadot-sdk/blob/06e3b5c6a7696048d65f1b8729f16b379a16f501/substrate/client/network/src/service.rs#L1543


### Testing

Added a new test to ensure the timeout is reached properly, even if
libp2p does not produce a response in due time.

I've also transitioned the tests to using `tokio::test` due to a
limitation of
[CI](https://github.com/paritytech/polkadot-sdk/actions/runs/12832055737/job/35784043867)

```
--- TRY 1 STDERR:        sc-network request_responses::tests::max_response_size_exceeded ---
thread 'request_responses::tests::max_response_size_exceeded' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/time/interval.rs:139:26:
there is no reactor running, must be called from the context of a Tokio 1.x runtime
```



cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Bastian Köcher <git@kchr.de>

4 participants