feat(iroh-bytes)!: refactor downloader queue and add progress reporting #2085
Conversation
use super::DownloadKind;

/// The channel that can be used to subscribe to progress updates.
pub type ProgressSubscriber = FlumeProgressSender<DownloadProgress>;
Once it is merged we can make this a `BoxedProgressSender` so we don't have flume in here...
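As a rough illustration of that follow-up (the types below are stand-ins for the crate's actual progress API, not its real definitions): a boxed trait object hides the concrete channel implementation, so the alias no longer mentions flume.

```rust
use std::sync::Arc;

/// Stand-in for the real progress event type.
enum DownloadProgress {
    Found { size: u64 },
    Done,
}

/// Stand-in for the crate's progress-sender abstraction.
trait ProgressSender: Send + Sync {
    fn send(&self, event: DownloadProgress);
}

/// Type-erased sender: callers no longer see whether flume, tokio, or anything
/// else backs the channel.
type BoxedProgressSender = Arc<dyn ProgressSender>;

/// The subscriber alias could then point at the boxed sender instead of
/// `FlumeProgressSender<DownloadProgress>`.
type ProgressSubscriber = BoxedProgressSender;
```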
let ActiveRequestInfo { node, temp_tag, .. } = active_request_info;

// get node info
let node_info = self
    .nodes
    .get_mut(&node)
    .expect("node exists in the mapping");
@ppodolsky triggered this `expect` when using this branch. I have a suspicion and am writing a test to confirm.
This is now fixed. I added the test, it failed, and now passes. The test is `fail_while_running`.
## Description

Adds retries to the downloader:

* After a node fails to dial, or fails during a transfer with IO errors, we move the node into a delay queue with an increasing timeout. Once the delay expires, the node may be dialed again.
* If the head of the download queue can only proceed with nodes which are currently in the retry queue, we *park* the download in a separate set from the main queue, and proceed with the next download.
* Once a retrying node succeeds to reconnect, we unpark all parked hashes for which this node is a candidate (see the sketch below).
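A minimal sketch of the parking mechanism described above; the type and method names (`Queue`, `park`, `unpark`) are illustrative, not the downloader's actual API.

```rust
use std::collections::{HashSet, VecDeque};
use std::hash::Hash;

/// Illustrative stand-in for the download queue: a strictly ordered main queue
/// plus a separate set of parked downloads.
struct Queue<K: Hash + Eq + Clone> {
    main: VecDeque<K>,
    parked: HashSet<K>,
}

impl<K: Hash + Eq + Clone> Queue<K> {
    /// Park a download whose candidate nodes are all waiting in the retry
    /// queue, so the next item in the main queue can proceed.
    fn park(&mut self, kind: &K) {
        self.main.retain(|k| k != kind);
        self.parked.insert(kind.clone());
    }

    /// Unpark a download once one of its candidate nodes reconnected after a
    /// retry; it goes back to the front so it is considered next.
    fn unpark(&mut self, kind: &K) {
        if self.parked.remove(kind) {
            self.main.push_front(kind.clone());
        }
    }
}
```

When a retrying node reconnects, the downloader would call `unpark` for every parked hash that lists that node as a candidate.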
Minor comments. TBH this was a less thorough review just due to reviewer exhaustion, which I assume you as the PR owner are suffering from as well. Taking into account that we just did a release, I think we have time to deal with any bugs in future PRs.
for entry in self.queue.iter_parked() {
    assert!(
        matches!(self.next_step(entry), NextStep::Park),
        "next step for parked node ist WaitForNodeRetry"
"next step for parked node ist WaitForNodeRetry" | |
"next step for parked node is WaitForNodeRetry" |
Writing this without first checking the enum changes, but this assertion message is confusing. I'm reading it as: "for each parked item in the queue, check that the next step is Park, but when it's not, say that it's WaitForNodeRetry"?
The comment is wrong: the enum variant was renamed earlier but not in the comment. Also sloppy language. Rephrased to "all parked downloads evaluate to the correct next step".
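For reference, a sketch of how the assertion reads with the rephrased message; the surrounding code is taken from the snippet above, so the exact final code may differ slightly.

```rust
for entry in self.queue.iter_parked() {
    assert!(
        matches!(self.next_step(entry), NextStep::Park),
        "all parked downloads evaluate to the correct next step"
    );
}
```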
async move {
    if kind == blob_fail.into() {
        tokio::time::sleep(Duration::from_millis(10)).await;
        Err(FailureAction::DropPeer(anyhow!("bad!")))
🤣
@@ -345,3 +361,134 @@ async fn long_queue() {
    res.expect("all downloads to succeed");
}
}

#[tokio::test]
async fn fail_while_running() {
All the behaviour here makes sense, but I don't understand why the test was needed, or rather, what particular scenario it's testing.
My wild guess is that it tests that concurrent downloads being handled by the same peer are not cancelled if one fails? Does it matter whether the peer was removed or not (`DropPeer`) after both are done?
In any case, could you add some docs to the test so that it can be updated more easily when we forget why it was added?
Your intuition was correct. The branch previously had a bug (spotted by @ppodolsky) where `DropPeer` would not first check whether other downloads are running, but would drop the peer unconditionally. The test recreates that scenario, and passes since the bug was fixed.

I added this comment to the test:

/// If a download errors with [`FailureAction::DropPeer`], make sure that the peer is not dropped
/// while other transfers are still running.

> Does it matter whether the peer was removed or not (DropPeer) after both are done?

Currently, a peer will only be dropped after a `FailureAction::DropPeer` if no other transfer is running. My intuition for that choice was:

- If the other transfers after the failed one succeed without error, the condition that caused DropPeer was likely transient, so we just keep the peer and ignore the previous DropPeer.
- If the failure condition remains, the other transfers will fail with DropPeer as well, and after the last transfer fails, the peer will be dropped.

As I'm typing this, the following scenario would be possible in the current implementation, but likely should be avoided:

- Queue hashA, hashB with nodeA. hashA, hashB are started.
- hashA fails with DropPeer -> the peer is not dropped because hashB is still running.
- Queue hashC with nodeA. hashC is started.
- hashB fails with DropPeer -> the peer is not dropped because hashC is still running.
- Queue hashD with nodeA. hashD is started.
- ... repeat as above

In plain words: if I keep queuing new downloads to a bad node frequently enough, the peer might never be dropped even though it is clearly faulty. Downloads will still succeed if other providers are available, but cycles will be wasted and resource capacity needlessly consumed.

I think I will add a test for this case. A fix could be to mark a node as `failed` (or count the failures) and not queue new downloads for nodes marked that way (but not drop them while transfers are running). After a successful transfer, the `failed` marker is cleared. A sketch of the current "only drop when no other transfer is running" rule follows below.
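A minimal, self-contained sketch of the rule described above; the type and field names (`Downloader`, `active_transfers`, `on_drop_peer`) are illustrative, not the actual downloader code.

```rust
use std::collections::HashMap;

type NodeId = [u8; 32]; // stand-in for the real node id type

struct Downloader {
    /// Transfers still running per connected node (the failed transfer has
    /// already been removed from this count when `on_drop_peer` is called).
    active_transfers: HashMap<NodeId, usize>,
}

impl Downloader {
    /// Called when a transfer to `node` failed with `FailureAction::DropPeer`.
    fn on_drop_peer(&mut self, node: NodeId) {
        match self.active_transfers.get(&node).copied() {
            // No other transfer is running: actually drop the connection and
            // move the node into the retry queue.
            Some(0) | None => {
                self.active_transfers.remove(&node);
                // ... disconnect and schedule a retry here ...
            }
            // Other transfers are still running: keep the peer. If the failure
            // is persistent, those transfers will also fail with DropPeer, and
            // the last one to fail drops the connection.
            Some(_) => {}
        }
    }
}
```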
iroh-bytes/src/downloader.rs (outdated)

/// Queue to manage dropping nodes.
/// Nodes to which we have an active or idle connection
connected_nodes: HashMap<NodeId, ConnectionInfo<D::Connection>>,
/// Nodes that failed and for which we track retries
Failed… dialing? Or had a failed download? Both?
It's both. Adjusted the comment.
* refactor: downloader
* test: add test for concurrent progress reporting
* chore: cleanup
* refactor: rename some things for clarity
* feat: limit concurrent dials per download
* chore: cleanup
* refactor: respect progress sender IDs to allow reusing the senders in subsequent operations
* chore: fmt & clippy
* cleanup: shared download progress
* feat: handle tags in the downloader
* fix: visibility
* fixup
* fixup
* chore: clippy
* fixup
* refactor: make export a separate operation from download
* fixup after merge
* refactor: remove id mapping
* refactor: move BlobId up to the progress events
* fix tests
* cleanup
* fix: invariants
* fixes and cleanups
* fix: nodes_have should only add to queued downloads
* chore: fmt
* chore: clippy!
* fix: do not queue downloads with no providers
* feat: set more than one node on download requests (n0-computer#2128) — Based on n0-computer#2085. The downloader already has the feature to try multiple nodes for a single hash or hashseq. With n0-computer#2085 we expose the downloader over the RPC protocol, by adding a `DownloadMode { Queued, Direct }` enum to the `BlobDownloadRequest`. This PR modifies the `BlobDownloadRequest` to include a list of provider nodes for a hash instead of just a single node. For queued requests that go to the downloader, the nodes are just passed to the downloader. For direct requests, the nodes are tried in order, until one succeeds.
* fix: address review by divma
* fix: remove now obsolete into_send method from ProgressSender
* chore: remove unused code
* fix: use Display for DownloadKind
* fix: first round of changes after review
* docs: downloader next_step
* more review
* more review
* fix: only remove hashes from provider map if they are really not needed
* refactor: move TagSet into util
* refactor: store intents in request info
* cleanup
* refactor: no allocation in next_step and simpler code
* fix: test after merge
* refactor: better method
* fix(iroh-bytes): do not log redundant file delete error (n0-computer#2199) — Do not log when a file cannot be deleted because it is not there. This makes the delete op idempotent, and also helps in case the file was not there in the first place. Fixes n0-computer#2142.
* fix(iroh-dns-server): fix bug in pkarr name parsing (n0-computer#2200) — The order of labels when parsing a DNS query as a pkarr name was reversed, so all queries for second-level subdomains under a pkarr pubkey failed. The first commit is the actual fix, the second commit adds a test for various record name and type variations, and the third and fourth commits improve the debug and info logging. Discovered by @rklaehn in n0-computer#2188; fixes the wrong behavior observed there.

Co-authored-by: Franz Heinzmann (Frando) <frando@unbiskant.org>
Co-authored-by: Rüdiger Klaehn <rklaehn@protonmail.com>
## Description

Based on #2085:

* do not start downloads without any known node
* track missing content hashes in the live sync session
* start downloads after content propagation properly
* reenable the sync_big test, let's see what CI says this time (it was flaky before)
## Description

This PR contains changes to the downloader:

* Remove the `Role::Provider` vs `Role::Candidate` distinction. We added this back when we did not have content propagation in docs, and it now does not make much sense for our architecture. Either we know/assume a node has something, or not. The "in between" state did not make sense anymore IMO.
* Rework the queuing logic to be based on a simple queue of pending downloads. Before, if a download could not be started because the concurrency limits were reached, the download was considered failed and inserted, with a delay, into the retry queue. Now, if a download cannot be started, we just wait until a slot frees up, and then start it. Note that the queue is, for now, quite simple: if the next download in the queue cannot be started (e.g. because all provider nodes are currently busy with other downloads), we do not try to start the second download in the queue, but instead wait until a slot is freed up. We could certainly optimize this by "jumping the queue" in certain cases; this would however also need more logic to make sure that downloads cannot be "forgotten". Therefore, for now, the queue is handled strictly in order.
* The retry behavior is refactored: we now retry nodes (not downloads, as before) up to a retry limit, with an increasing timeout. If a download can only proceed with a retrying node, it is parked and the next item in the queue is processed. The download is unparked if the retrying node successfully connects.
* Add progress reporting to downloads managed through the downloader. For this I wrote a `SharedProgress` handler that allows subscribing to already-running downloads: if an intent is registered for hash A and this download is started, and while it is running another intent is registered for the same hash, it will now receive a `DownloadProgress::InitialState` which contains a `TransferProgress`, which functions as a reducer for the progress events. This can be used from the client as well, and further events can be reduced/merged into that struct. The PR contains a test for this concurrent progress reporting.
* Expose the downloader in the iroh node. Download requests via the RPC API can now set a `DownloadMode` enum either to `Direct` or to `Queued`: the former will behave as currently (issue an iroh-bytes request directly, without a queue or concurrency limits) and the latter will add the download to the downloader queue.

## Breaking changes

Changes in `iroh`:

* The `BlobDownloadRequest` has a new field `mode` to select between direct and queued downloads, and now contains a list of `nodes` in place of a single `node` before (a usage sketch follows at the end of this description).

Changes in `iroh_bytes`:

* The `Role` enum is removed.
* `Downloader::queue` now takes a `DownloadRequest` with more options than before.
* `DownloadProgress` has a new variant `InitialState`, which is emitted when attaching to an already-running download.
* `ConcurrencyLimits` gained a new field.

Other changes:

* `SetTagOption` was moved from `iroh` to `iroh-bytes`.

## Notes & open questions

* Another followup improves content downloading in docs: refactor: improve content downloading in docs (#2127).
* A few more tests around the queuing behavior would be great.
* I have a partially done followup which adds a hook for content discovery to the downloader.

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.
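A hedged sketch of what a queued request might look like after this change. Only `nodes`, `mode`/`DownloadMode`, and `SetTagOption` are confirmed by the description above; the remaining field names (`hash`, `format`, `tag`) and the in-scope values are assumptions for illustration, not the crate's guaranteed API.

```rust
// Sketch only: assumes a blob `hash` and provider node addresses `provider_a`,
// `provider_b` are already in scope, and that the request keeps its existing
// blob fields (`hash`, `format`, `tag`) alongside the new `nodes` and `mode`.
let request = BlobDownloadRequest {
    hash,
    format: BlobFormat::Raw,
    // several candidate providers instead of the single `node` field used before
    nodes: vec![provider_a, provider_b],
    tag: SetTagOption::Auto,
    // `Queued` hands the request to the downloader (queue + concurrency limits);
    // `Direct` issues the iroh-bytes request immediately, trying the nodes in order.
    mode: DownloadMode::Queued,
};
```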