Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: improve connection handling #22

Merged
merged 10 commits into from
Dec 12, 2024

Conversation

divagant-martian
Copy link
Contributor

@divagant-martian divagant-martian commented Dec 10, 2024

Description

Progress towards better connection handling. Until #13 subscriptions were kept around for the duration of the run even when no longer needed. As a consequence, a notable chunk of code never ran that affects ending and accepting connections, with quick re-connection being a pain point. This PR tries to improve connection handling, but I don't think this is the end of the issues.

Changes are:

  • we keep around the sender for the alternative connection. The alternative connection is that one with a different origin (whether we are or not the dialer). This seems to be original intent of keeping both connections, but dropping the sender to the alternative one effectively ends it because the connection loops has no longer a channel to read messages from for sending. This is bad because the that also means that if there were sent messages in that connection those might be never read.
  • We do not close the connection when the proto state machine indicates to disconnect the peer. The effect of doing this is that all the (per-topic) Disconnect messages were never sent to the peer. This is what made the other node not realize we were gone and instead rely on timeouts.
  • We close the connection only after all the stream data for sending messages has been ackd by the remote. This is not ideal, there isn't one single Disconnect message, but this leaves us in a far better place than before.
  • It's easy to miss but this change is a relatively important: Disconnection was being handled twice: Once when the proto state machine indicates the peer must be disconnected. Here we dropped the peer data and closed the connection before, and once again when the connection loop indicates it's ended, which again removes the peer and informs the state machine of the peer gone. This is problematic because between the state machine deciding the peer must be disconnected and the connection loop ending, messages can be queued, like Join, which were before the lost to the abyss

Breaking Changes

n/a

Notes & open questions

I believe this is better than before. I also believe there are more related issues lurking.
The connection loop currently ignores errors on the receiving side. This is "justified" by arguing errors on the receiving side should appear on the sender side as well. This is weak argument but at this point I'm unsure what to do with it. Incremental improvement I guess?

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.

@divagant-martian divagant-martian changed the title do not close the connection on peer disconnection improve connection handling Dec 10, 2024
Copy link

github-actions bot commented Dec 10, 2024

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh-gossip/pr/22/docs/iroh_gossip/

Last updated: 2024-12-12T05:30:05Z

@divagant-martian divagant-martian changed the title improve connection handling fix: improve connection handling Dec 12, 2024
@divagant-martian divagant-martian marked this pull request as ready for review December 12, 2024 05:32
@divagant-martian
Copy link
Contributor Author

divagant-martian commented Dec 12, 2024

tested using deltachat's full workflow in

Copy link
Contributor

@dignifiedquire dignifiedquire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code wise this looks good to me, but I am not deeply enough read into the gossip network code to understand if this is the "right" way to fix the issue. given the regression tests pass now, I trust you though

@divagant-martian divagant-martian merged commit 61e64c7 into main Dec 12, 2024
24 of 26 checks passed
@divagant-martian divagant-martian deleted the fix-connection-loop-second-try branch December 12, 2024 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants