-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(iroh-sync): Add actor to iroh-sync, remove deadlocks #1612
Conversation
@@ -345,10 +348,14 @@ struct Actor { | |||
|
|||
impl Actor { | |||
pub async fn run(mut self) -> anyhow::Result<()> { | |||
let mut i = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ I know the feeling so well when you have to add a counter to your actor loops
"sync[accept]: done, success" | ||
); | ||
|
||
let timings = Timings { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this again also. Added for debugging purposes. But might be useful nevertheless..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it, alternatively we could make these into metrics
iroh/src/sync_engine/rpc.rs
Outdated
@@ -180,6 +180,7 @@ impl<S: Store> SyncEngine<S> { | |||
let tag = bao_store | |||
.import_bytes(value.into(), BlobFormat::RAW) | |||
.await?; | |||
// TODO: Use sync handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all the store operations probably should? that way there is no need anymore for the spawn_blocking calls either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the plan I think. Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'm not that sure anymore: Redb is basically a global rwlock, correct?
By passing everything through a single channel onto a single thread, we basically make that a mutex. So no concurrent reads can happen on the database. Whereas before, with transforming the sync interators into streams with spawn_blocking
, this could at least theoretically happen in parallel if no thread is writing at the time. Depending on how many threads the blocking pool has.
So do we want a threadpool of our own here in the end?
Maybe these thoughts are premature, and it makes sense architecture wise to have all interactions go through the channel interface, to have a clean point between sync and async interactions. And we could then still add a threadpool for non-mutating access to the replica later...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my initial sense would be for now to make everything go to the actor, and potentially parallelise in there if possible later down the road
commit 0a254f2 Merge: 6d690b3 792c756 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Thu Oct 12 12:26:14 2023 +0200 Merge remote-tracking branch 'origin/main' into HEAD commit 6d690b3 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Thu Oct 12 12:00:06 2023 +0200 finalize the migration to the new sync actor architecture commit 7ea9ed2 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Thu Oct 12 10:29:02 2023 +0200 feat: ensure that insert events are only emitted after actual inserts commit a474a6b Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Thu Oct 12 02:11:47 2023 +0200 fix:doc links commit d604a48 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Thu Oct 12 01:32:00 2023 +0200 chore: doc links commit fe08f0b Merge: d8251d6 5dc3c44 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Thu Oct 12 01:19:39 2023 +0200 Merge branch 'main' into sync-no-deadlocks commit d8251d6 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Thu Oct 12 00:37:15 2023 +0200 cleanups and improvements commit fd0320d Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Oct 11 22:46:15 2023 +0200 add bulk test commit b6de02b Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Oct 11 22:26:43 2023 +0200 make the debug logs enjoyable 😆 commit dd63dfd Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Oct 11 22:45:57 2023 +0200 cleanup tests commit 5565a18 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Oct 11 15:25:43 2023 +0200 cleanup commit 36a5882 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Oct 11 15:04:25 2023 +0200 remove leftover from other branch commit e82df02 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Oct 11 14:56:30 2023 +0200 fix: shutdown commit be80fe6 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Oct 11 14:55:28 2023 +0200 chore: fmt commit 7b237a6 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Oct 11 14:47:43 2023 +0200 remove sync propagation and reports commit cd84d25 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Oct 11 14:38:45 2023 +0200 refactor: use seperate actors for sync and gossip commit 6950056 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Oct 11 09:08:51 2023 +0200 fix: return proper errors from iroh_gossip subscribe_all commit 239c699 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Tue Oct 10 15:06:58 2023 +0200 try to clear file locks by dropping things commit 8865641 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Tue Oct 10 00:57:09 2023 +0200 feat: add migration to populate latest table commit a06113e Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 9 23:44:19 2023 +0200 chore: clippy commit af45ed1 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 9 23:38:46 2023 +0200 tests: improve sync tests commit 6aca9aa Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 9 18:15:04 2023 +0200 chore: fmt commit 6a09d69 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 9 18:12:39 2023 +0200 fix: print commit 31f0d81 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 9 18:11:14 2023 +0200 test: account in sync_basic for sync happening two times in a row commit 71c07c7 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 9 16:16:28 2023 +0200 chore: fmt commit 0d996a9 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 9 16:14:02 2023 +0200 disable sync reports for now commit 4b27778 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 9 16:10:24 2023 +0200 improve error logging commit d19646c Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 9 14:36:15 2023 +0200 fix: actually store which entry is latest for each author commit de6e0d8 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 9 12:14:29 2023 +0200 chore: clippy commit 07f0e24 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 9 11:47:28 2023 +0200 maintain gossip join state commit bd8198b Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 9 11:25:34 2023 +0200 better logging commit 4f3c573 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 9 10:54:12 2023 +0200 tests: sync_big reduce number of nodes for CI commit 0f3c06b Merge: 44a6155 898b0f7 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Sun Oct 8 14:53:16 2023 +0200 Merge remote-tracking branch 'origin/main' into sync-propagation commit 44a6155 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Fri Oct 6 18:13:18 2023 +0200 tests: give sync_big more timeout commit 00e6c3d Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Fri Oct 6 18:11:05 2023 +0200 debug: log events to find out why test fails commit 714b628 Merge: e5c0f3a e2ee678 Author: Diva M <divma@protonmail.com> Date: Thu Oct 5 21:59:50 2023 -0500 Merge branch 'main' into sync-propagation commit e5c0f3a Merge: 99359b8 2fd31b7 Author: Diva M <divma@protonmail.com> Date: Thu Oct 5 21:56:49 2023 -0500 Merge branch 'main' into sync-propagation commit 99359b8 Author: Diva M <divma@protonmail.com> Date: Thu Oct 5 21:56:46 2023 -0500 fix bug commit 4a7932e Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Oct 4 12:21:49 2023 +0200 tests: sync_big better errors commit 813c370 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Oct 4 00:36:04 2023 +0200 fix: cleanups commit 318a0e4 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Oct 4 00:07:02 2023 +0200 chore: clippy commit 0197023 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Tue Oct 3 17:24:47 2023 +0200 fix: docs commit b8679c3 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Tue Oct 3 17:22:51 2023 +0200 refactor: improve logging commit bcdec8f Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Tue Oct 3 16:51:43 2023 +0200 fixes after rebase commit 4ce5114 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 2 12:48:56 2023 +0200 fix: try to fix connect abort logic commit 9bd5108 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 2 12:32:41 2023 +0200 fix: better cancellation logic commit d9d501d Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 2 12:17:03 2023 +0200 chore: clippy commit 84224cf Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 2 12:14:29 2023 +0200 fix: dont error log each failed download attempt commit fac22d7 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Mon Oct 2 11:56:23 2023 +0200 cleanups commit b2b53d0 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Thu Sep 28 16:11:37 2023 +0200 fix: tests commit 2c9feaa Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Thu Sep 28 15:35:37 2023 +0200 fix: rebase commit a13a807 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Thu Sep 28 15:14:54 2023 +0200 fix: after rebase commit 67aa476 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Thu Sep 28 15:12:52 2023 +0200 chore: fmt commit 76b345e Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Thu Sep 28 13:42:19 2023 +0200 refactor: rename some things commit e45e12f Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Sep 27 10:11:47 2023 +0200 fix: cleanup commit f3b6e19 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Sep 27 09:44:07 2023 +0200 fix: cleanup, docs commit 9e12175 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Sep 27 01:36:37 2023 +0200 chore: fmt & cleanup commit 8ea2c44 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Sep 27 01:35:25 2023 +0200 feat(wip): add AsyncReplica commit 4ea484f Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Sep 27 01:21:21 2023 +0200 chore: fmt commit 9341423 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Sep 27 00:18:37 2023 +0200 better debugging in iroh-sync commit abbedbf Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Wed Sep 27 00:04:14 2023 +0200 feat: better tracing in sync and gossip commit 166656c Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Tue Sep 26 23:42:14 2023 +0200 feat: propagate sync reports to neighbors, and try to improve select loop commit c9a1087 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Tue Sep 26 23:39:26 2023 +0200 deps: add console_subscriber as dev dependency commit e0215c2 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Tue Sep 26 23:37:51 2023 +0200 feat(iroh-sync): add state vector, add better results for sync operations commit 6ad40fb Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Fri Sep 22 14:22:48 2023 +0200 fix deps commit 350140a Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Fri Sep 22 14:05:37 2023 +0200 tests: improve sync_big test commit 8280974 Author: Franz Heinzmann (Frando) <frando@unbiskant.org> Date: Thu Sep 21 12:54:31 2023 +0200 tests: big sync test (wip)
0a254f2
to
1481cfe
Compare
f867408
to
3f9b183
Compare
fc92284
to
9187232
Compare
1ad4d0a
to
b4f6522
Compare
…state of replicas
## Description With #1612 we moved all operations on replicas into a dedicated actor, with all communication happening over a channel. A nice consequence of this is that the iroh-sync store only lives in the actor thread, and therefore we don't need a generic on the `SyncEngine` and the `Node` anymore. This PR removes the generic. ## Change checklist - [x] Self-review. - [x] Documentation updates if relevant. - [x] Tests if relevant.
## Description With #1612 we moved all operations on replicas into a dedicated actor, with all communication happening over a channel. A nice consequence of this is that the iroh-sync store only lives in the actor thread, and therefore we don't need a generic on the `SyncEngine` and the `Node` anymore. This PR removes the generic. ## Change checklist - [x] Self-review. - [x] Documentation updates if relevant. - [x] Tests if relevant.
## Description * Add an actor to `iroh_sync` that processes async requests for operations on Replicas * Major refactoring of how we access the iroh sync API from iroh: Before this our `live::Actor` would easily deadlock, because inserting entries from gossip, and processing insert events coming back from the replica, happened in the same actor loop. Also, because iroh-sync is fully synchronous, we have to take care to not lock or block the async executor. So, to solve both of these issues, all operations on replicas and the iroh sync store are moved into a separate actor which runs in a loop on a `std::thread` and just blocks while waiting for actions from a channel. This, in turn, moves some state keeping from iroh to iroh-sync, which is a good thing I think. * Remove Clone and all locks from `Replica`. As they are now usually owned by the `iroh_sync::actor` actor, no need for locks. We can get unique mut refs instead. * Change the subscription model on `Replica`s to be a `Vec<flume::Sender<Event>>`. This now allows to directly subscribe to replica events from multiple places. This allows us to have subscription to replica events from the client not go through any actor. The replica events are merged together with the events from the `sync_engine::live` actor, and that's it. Should improve latency and ease work on the actors. Because we merge the channels for replica events and live actor events for client subscriptions, this means, that closing a replica from the client does not end the subscription stream, because the subscription on the live actor will remain active. To end the subscription, the client would need to drop the receiver. I think this is fine for now. What we might want to do instead is to split the subscription from the client to two methods and channels, `subscribe_insert_events` and `subscribe_sync_events` or so, then we could close them separately. However the one for sync events would still stay open indefinitely, because you wouldn't want to drop and recreate, I think, in case you do `join` / `leave` / `join`. Or do we? * Track the open state of Replicas in the `iroh_sync::actor`. I opted for a simple implementation to start: The actor counts calls to `open` and decrements on calls to `close` and closes the replica once the count reaches zero. This works fine, however for replicas opened from the RPC client, because there's no async drop, I spawn a tokio task in drop to send the close call to the node. This means that if a client is force-killed, the replica would remain open indefinitely. It would be better to solve this cleaner - the only idea I had so far was to give out something like `ReplicaDescriptor`s on `open` and then send require regular keep-alive calls from the RPC client. I think it's fine to defer this change to a followup (which will be straightforward with the architecture in place now) because the impl in this PR is already quite an improvement over the state in `main` or #1612, where we don't ever close replicas. * event subscriptions for replica insert events are now handled directly in the `iroh_sync::actor`, which is much nicer IMO. For the RPC subscription this event stream is merged with a subscription for sync events from the `iroh::sync_engine::live` actor. ## Notes & open questions Replicas may remain open indefinitely if the RPC client dies without calling `Doc::close`. This should be fixed in a followup and will need work in quic-rpc to get a notification once connections close. ## Change checklist - [x] Self-review. - [x] Documentation updates if relevant. - [x] Tests if relevant. --------- Co-authored-by: Friedel Ziegelmayer <me@dignifiedquire.com>
Description
Add an actor to
iroh_sync
that processes async requests for operations on ReplicasMajor refactoring of how we access the iroh sync API from iroh: Before this our
live::Actor
would easily deadlock, because inserting entries from gossip, and processing insert events coming back from the replica, happened in the same actor loop. Also, because iroh-sync is fully synchronous, we have to take care to not lock or block the async executor. So, to solve both of these issues, all operations on replicas and the iroh sync store are moved into a separate actor which runs in a loop on astd::thread
and just blocks while waiting for actions from a channel. This, in turn, moves some state keeping from iroh to iroh-sync, which is a good thing I think.Remove Clone and all locks from
Replica
. As they are now usually owned by theiroh_sync::actor
actor, no need for locks. We can get unique mut refs instead.Change the subscription model on
Replica
s to be aVec<flume::Sender<Event>>
. This now allows to directly subscribe to replica events from multiple places. This allows us to have subscription to replica events from the client not go through any actor. The replica events are merged together with the events from thesync_engine::live
actor, and that's it. Should improve latency and ease work on the actors. Because we merge the channels for replica events and live actor events for client subscriptions, this means, that closing a replica from the client does not end the subscription stream, because the subscription on the live actor will remain active. To end the subscription, the client would need to drop the receiver. I think this is fine for now. What we might want to do instead is to split the subscription from the client to two methods and channels,subscribe_insert_events
andsubscribe_sync_events
or so, then we could close them separately. However the one for sync events would still stay open indefinitely, because you wouldn't want to drop and recreate, I think, in case you dojoin
/leave
/join
. Or do we?Track the open state of Replicas in the
iroh_sync::actor
. I opted for a simple implementation to start: The actor counts calls toopen
and decrements on calls toclose
and closes the replica once the count reaches zero. This works fine, however for replicas opened from the RPC client, because there's no async drop, I spawn a tokio task in drop to send the close call to the node. This means that if a client is force-killed, the replica would remain open indefinitely. It would be better to solve this cleaner - the only idea I had so far was to give out something likeReplicaDescriptor
s onopen
and then send require regular keep-alive calls from the RPC client. I think it's fine to defer this change to a followup (which will be straightforward with the architecture in place now) because the impl in this PR is already quite an improvement over the state inmain
or refactor(iroh-sync): Add actor to iroh-sync, remove deadlocks #1612, where we don't ever close replicas.event subscriptions for replica insert events are now handled directly in the
iroh_sync::actor
, which is much nicer IMO. For the RPC subscription this event stream is merged with a subscription for sync events from theiroh::sync_engine::live
actor.Notes & open questions
Replicas may remain open indefinitely if the RPC client dies without calling
Doc::close
. This should be fixed in a followup and will need work in quic-rpc to get a notification once connections close.Change checklist