
fix(iroh-blobs): do not hit the network when downloading blobs which are complete #2586

Merged: 15 commits merged into main from refactor-download-check-local on Aug 5, 2024

Conversation

@Frando (Member) commented Aug 5, 2024

Description

Two changes to the downloader:

  • Never try to download from ourselves. If the only provider node added is our own node, fail with a "no providers" error.
  • The actual download request flow is turned into a generator (while keeping API compatibility for the existing get_to_db public function). A new get_to_db_in_steps function either runs to completion if the requested data is fully available locally, or yields a NeedsConn struct at the point where it needs a network connection to proceed. NeedsConn has an async proceed(self, conn: Connection) method, which must be called with a connection for the actual download to start.

This two-step process lets the downloader check whether we need to dial any nodes at all, or whether we are already done without doing anything, while emitting the exact same flow of events to the client (because we run the same loop). A caller-side sketch of the pattern follows below.
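Below is a minimal, self-contained sketch of this two-step pattern. The names (DownloadStep, NeedsConn, Connection, Stats) and signatures here are illustrative placeholders, not the actual iroh-blobs API; only the shape of the flow (check locally first, dial and proceed only if something is missing) is what this PR implements.

// Minimal sketch of the two-step download pattern (placeholder types, not iroh-blobs).
struct Connection; // stand-in for an established network connection
struct Stats { bytes_read: u64 }

// What the first step can return: either everything was already local,
// or a suspended request that still needs a connection.
enum DownloadStep {
    Complete(Stats),
    NeedsConn(NeedsConn),
}

struct NeedsConn { /* suspended request state */ }

impl NeedsConn {
    // Resume the suspended request with a connection and run the actual download.
    async fn proceed(self, _conn: Connection) -> Stats {
        Stats { bytes_read: 0 } // placeholder
    }
}

// First step: check the local store; do not touch the network here.
async fn start_download() -> DownloadStep {
    let fully_available_locally = true; // placeholder for the local-store check
    if fully_available_locally {
        DownloadStep::Complete(Stats { bytes_read: 0 })
    } else {
        DownloadStep::NeedsConn(NeedsConn {})
    }
}

async fn download() -> Stats {
    match start_download().await {
        // Nothing to do: no node is dialed at all.
        DownloadStep::Complete(stats) => stats,
        // Only now do we dial a provider and hand the connection to the request.
        DownloadStep::NeedsConn(needs_conn) => {
            let conn = Connection; // placeholder for dialing a provider
            needs_conn.proceed(conn).await
        }
    }
}

fn main() {
    // Drive the sketch with a blocking executor (futures crate) just to run it.
    let stats = futures::executor::block_on(download());
    println!("downloaded {} bytes", stats.bytes_read);
}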

To achieve this, get_to_db now uses a genawaiter generator internally. This means that the big loop implementing the iroh-blobs protocol request flow does not have to change at all; instead of calling a closure, we yield and resume, which makes it much easier to integrate into an external state machine like the downloader.
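For illustration, here is a small self-contained example of the genawaiter yield/resume pattern in question; the logic inside the generator is a placeholder, not the actual get_to_db loop.

use genawaiter::{rc::Gen, GeneratorState};

struct Conn; // placeholder for an actual connection type

fn main() {
    // The generator encapsulates the whole request flow. It yields once when it
    // needs a connection and receives one as the resume value.
    let mut gen = Gen::new(|co| async move {
        let fully_available_locally = false; // placeholder for the local-store check
        if fully_available_locally {
            return "complete without network";
        }
        // Suspend here; the caller resumes us with a connection.
        let _conn: Conn = co.yield_(()).await;
        // ... run the actual download protocol over `_conn` ...
        "downloaded"
    });

    // genawaiter discards the argument of the very first resume, since no yield
    // is pending yet; this call just runs the generator up to its first yield point.
    match gen.resume_with(Conn) {
        GeneratorState::Complete(msg) => println!("{msg}"),
        GeneratorState::Yielded(()) => {
            // Dial a provider only now, then resume the generator with the connection.
            match gen.resume_with(Conn) {
                GeneratorState::Complete(msg) => println!("{msg}"),
                GeneratorState::Yielded(_) => unreachable!(),
            }
        }
    }
}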

The changes needed in the downloader are a bit verbose because the downloader is generic over a Getter, with one impl for the actual networked getter and one test impl that does not use networking; the new NeedsConn state therefore has to be modeled with an additional associated type and trait (a rough sketch follows below).
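A rough sketch of what this shape can look like; the trait and method names below are hypothetical and chosen for illustration, not taken from the downloader code.

// Illustrative only: a getter that is generic over how the "needs a connection"
// state is represented, so a test impl can avoid networking entirely.
trait Getter {
    // Connection type used by this getter (a real connection in production,
    // a dummy type in the networking-free test impl).
    type Connection;
    // The suspended request state returned when local data is not sufficient.
    type NeedsConn: NeedsConnection<Self::Connection>;
}

trait NeedsConnection<C> {
    // Resume the suspended request with a connection and drive it to completion.
    fn proceed(self, conn: C);
}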

This PR adds three tests:

  • Downloading a missing blob from the local node fails without trying to connect to ourselves
  • Downloading an existing blob succeeds without trying to download
  • Downloading an existing collection succeeds without trying to download

Closes #2575
Replaced #2576

Notes and open questions

Breaking changes

None; only an addition to the public API of iroh_blobs: iroh_blobs::get::check_local_with_progress_if_complete

github-actions bot commented Aug 5, 2024

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/2586/docs/iroh/

Last updated: 2024-08-05T11:37:34Z

@Frando changed the title from "refactor(iroh-blobs): use generator to perform checking the local store and downloading from network in two steps" to "refactor(iroh-blobs): use generator to make checking locally and downloading a two-step process" on Aug 5, 2024
@Frando marked this pull request as ready for review on August 5, 2024 at 10:03
@Frando changed the title from "refactor(iroh-blobs): use generator to make checking locally and downloading a two-step process" to "refactor(iroh-blobs): do not hit the network when downloading blobs which are complete" on Aug 5, 2024
@dignifiedquire (Contributor)

Can you also add a test with a collection that needs some blobs to be downloaded?

@Frando (Member, Author) commented Aug 5, 2024

Wait, I am not sure this is correct. We can only resume the generator once, so we have to recreate it if a node fails. I do this in the direct_download case, but not in the queued downloader. Hmm, I need to think about this some more.

@Frando (Member, Author) commented Aug 5, 2024

OK, I pushed another commit that hopefully fixes things.

Here's the constraint of the current design in this PR: we create a generator which encapsulates the loop-over-hashseq flow for getting collections. We start the generator before connecting to a node, so it can yield Complete in case everything is available locally. The generator is then resumed with a connection to perform the actual download protocol, continuing the loop as needed. However, we can only ever resume this generator once. So if the node fails during the download and we want to try the next node, we have to recreate the generator, which is what the PR does now. This means, though, that in this case the FoundLocal events would be emitted again.

When using the iroh-blobs client, this is not necessarily an issue, because the TransferState that we reduce the events into by default handles this fine.

I see three options now:

  • a) Keep it as-is in the PR and document that events may be emitted multiple times when the download is auto-retried, and have to be deduplicated or ignored client-side if acting on each event. This is the easiest option.
  • b) Deduplicate the events when sending them out, by keeping a TransferState in the progress sender (see the sketch after this list). This is straightforward to add to the queued downloader, because there we keep a TransferState anyway to emit an initial state when a new intent is added to an already-queued download request. If we want it for direct downloads too, the "keep a transfer state" logic would have to be added to the get_to_db flow as well. That might be rather straightforward too, but it would increase the state being kept.
  • c) Do not use a generator, but rewrite the get_to_db loop so that it returns a cloneable struct of its state at the point where it needs a connection, and can be resumed with that struct. This likely means writing a manual state machine instead of looping dynamically. I'm not super keen on doing this, but it wouldn't be the end of the world either; the loop is not that complex.
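Roughly, the idea behind b), sketched with a toy seen-set instead of the real TransferState (whose internals aren't spelled out here):

use std::collections::HashSet;

// Toy stand-ins; the real code would use iroh-blobs' hash and progress event types.
type Hash = [u8; 32];

enum Event {
    FoundLocal(Hash),
    Other,
}

// Wraps an event sink and drops FoundLocal events that were already emitted,
// so restarting the generator for a retry does not duplicate them downstream.
struct DedupSender {
    seen_local: HashSet<Hash>,
}

impl DedupSender {
    fn new() -> Self {
        Self { seen_local: HashSet::new() }
    }

    fn send(&mut self, event: Event) -> Option<Event> {
        match event {
            Event::FoundLocal(hash) => {
                // Only pass the event through the first time we see this hash.
                if self.seen_local.insert(hash) {
                    Some(Event::FoundLocal(hash))
                } else {
                    None
                }
            }
            other => Some(other),
        }
    }
}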

WDYT? Also, cc @rklaehn

@dignifiedquire (Contributor) commented Aug 5, 2024

The problem I see with the current approach is that you would get Local events for blobs that were not local before. E.g. peer 1 has blob1 but not blob2, and peer 2 has blob2, so if I understand correctly you get the following events:

  • Remote: Blob1
  • < Error, blob2 not found with peer1 >
  • < Restart generator >
  • Local: Blob1
  • Remote: Blob2

@Frando (Member, Author) commented Aug 5, 2024

> The problem I see with the current approach is that you would get Local events for blobs that were not local before

True. In any case, we should write a test for the scenario you described.

My current thinking is that we could use the TransferState that already exists to check whether an event should be emitted or not, at least for FoundLocal. So that's option b).

And for Found (which should really be renamed to FoundRemote) we should IMO add the node id of the remote peer and emit them multiple times. Edit: Ugh, we can't do that, because iroh-blobs doesn't know about node ids; it works with regular QUIC connections too. Unfortunate. So it's either a new generic parameter around the connection, or mapping the node id in at the higher layers.

This is turning out a bit bigger than envisioned. I'd say let's keep discussing a bit and then I can implement whatever we arrive at. (I did this a bit "on the go" so far, without having run into these difficulties before. It is good to tackle, though: on current main we a) double-emit events for each node we try, and b) have no tests for that.)

@dignifiedquire (Contributor)

I think it is good if we have the lower levels of iroh-blobs emit events on a per-transfer basis.
For the events emitted from the downloader, I am thinking we could lean into the duplicate events and wrap them in an event like this:

enum DownloaderEvent {
  BlobTransferEvent {
    event: BlobEvent,
    target_node: NodeId,
  }
}

and allow the user full insight into what is actually happening.

@dignifiedquire changed the title from "refactor(iroh-blobs): do not hit the network when downloading blobs which are complete" to "fix(iroh-blobs): do not hit the network when downloading blobs which are complete" on Aug 5, 2024
@dignifiedquire added this pull request to the merge queue on Aug 5, 2024
Merged via the queue into main with commit 0784403 on Aug 5, 2024
29 checks passed
@dignifiedquire deleted the refactor-download-check-local branch on August 5, 2024 at 17:29
Successfully merging this pull request may close this issue: Queued blob download from oneself fails