Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(iroh-sync): Add actor to iroh-sync, remove deadlocks #1612

Merged
merged 23 commits into from
Oct 16, 2023

Conversation

Frando
Copy link
Member

@Frando Frando commented Oct 11, 2023

Description

  • Add an actor to iroh_sync that processes async requests for operations on Replicas

  • Major refactoring of how we access the iroh sync API from iroh: Before this our live::Actor would easily deadlock, because inserting entries from gossip, and processing insert events coming back from the replica, happened in the same actor loop. Also, because iroh-sync is fully synchronous, we have to take care to not lock or block the async executor. So, to solve both of these issues, all operations on replicas and the iroh sync store are moved into a separate actor which runs in a loop on a std::thread and just blocks while waiting for actions from a channel. This, in turn, moves some state keeping from iroh to iroh-sync, which is a good thing I think.

  • Remove Clone and all locks from Replica. As they are now usually owned by the iroh_sync::actor actor, no need for locks. We can get unique mut refs instead.

  • Change the subscription model on Replicas to be a Vec<flume::Sender<Event>>. This now allows to directly subscribe to replica events from multiple places. This allows us to have subscription to replica events from the client not go through any actor. The replica events are merged together with the events from the sync_engine::live actor, and that's it. Should improve latency and ease work on the actors. Because we merge the channels for replica events and live actor events for client subscriptions, this means, that closing a replica from the client does not end the subscription stream, because the subscription on the live actor will remain active. To end the subscription, the client would need to drop the receiver. I think this is fine for now. What we might want to do instead is to split the subscription from the client to two methods and channels, subscribe_insert_events and subscribe_sync_events or so, then we could close them separately. However the one for sync events would still stay open indefinitely, because you wouldn't want to drop and recreate, I think, in case you do join / leave / join. Or do we?

  • Track the open state of Replicas in the iroh_sync::actor. I opted for a simple implementation to start: The actor counts calls to open and decrements on calls to close and closes the replica once the count reaches zero. This works fine, however for replicas opened from the RPC client, because there's no async drop, I spawn a tokio task in drop to send the close call to the node. This means that if a client is force-killed, the replica would remain open indefinitely. It would be better to solve this cleaner - the only idea I had so far was to give out something like ReplicaDescriptors on open and then send require regular keep-alive calls from the RPC client. I think it's fine to defer this change to a followup (which will be straightforward with the architecture in place now) because the impl in this PR is already quite an improvement over the state in main or refactor(iroh-sync): Add actor to iroh-sync, remove deadlocks #1612, where we don't ever close replicas.

  • event subscriptions for replica insert events are now handled directly in the iroh_sync::actor, which is much nicer IMO. For the RPC subscription this event stream is merged with a subscription for sync events from the iroh::sync_engine::live actor.

Notes & open questions

Replicas may remain open indefinitely if the RPC client dies without calling Doc::close. This should be fixed in a followup and will need work in quic-rpc to get a notification once connections close.

Change checklist

  • Self-review.
  • Documentation updates if relevant.
  • Tests if relevant.

@@ -345,10 +348,14 @@ struct Actor {

impl Actor {
pub async fn run(mut self) -> anyhow::Result<()> {
let mut i = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ I know the feeling so well when you have to add a counter to your actor loops

"sync[accept]: done, success"
);

let timings = Timings {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this again also. Added for debugging purposes. But might be useful nevertheless..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, alternatively we could make these into metrics

@@ -180,6 +180,7 @@ impl<S: Store> SyncEngine<S> {
let tag = bao_store
.import_bytes(value.into(), BlobFormat::RAW)
.await?;
// TODO: Use sync handle
Copy link
Contributor

@dignifiedquire dignifiedquire Oct 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the store operations probably should? that way there is no need anymore for the spawn_blocking calls either

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the plan I think. Will do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm not that sure anymore: Redb is basically a global rwlock, correct?
By passing everything through a single channel onto a single thread, we basically make that a mutex. So no concurrent reads can happen on the database. Whereas before, with transforming the sync interators into streams with spawn_blocking, this could at least theoretically happen in parallel if no thread is writing at the time. Depending on how many threads the blocking pool has.

So do we want a threadpool of our own here in the end?

Maybe these thoughts are premature, and it makes sense architecture wise to have all interactions go through the channel interface, to have a clean point between sync and async interactions. And we could then still add a threadpool for non-mutating access to the replica later...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my initial sense would be for now to make everything go to the actor, and potentially parallelise in there if possible later down the road

@b5 b5 modified the milestones: v0.7.0, v0.8.0 Oct 11, 2023
commit 0a254f2
Merge: 6d690b3 792c756
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Thu Oct 12 12:26:14 2023 +0200

    Merge remote-tracking branch 'origin/main' into HEAD

commit 6d690b3
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Thu Oct 12 12:00:06 2023 +0200

    finalize the migration to the new sync actor architecture

commit 7ea9ed2
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Thu Oct 12 10:29:02 2023 +0200

    feat: ensure that insert events are only emitted after actual inserts

commit a474a6b
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Thu Oct 12 02:11:47 2023 +0200

    fix:doc links

commit d604a48
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Thu Oct 12 01:32:00 2023 +0200

    chore: doc links

commit fe08f0b
Merge: d8251d6 5dc3c44
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Thu Oct 12 01:19:39 2023 +0200

    Merge branch 'main' into sync-no-deadlocks

commit d8251d6
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Thu Oct 12 00:37:15 2023 +0200

    cleanups and improvements

commit fd0320d
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Oct 11 22:46:15 2023 +0200

    add bulk test

commit b6de02b
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Oct 11 22:26:43 2023 +0200

    make the debug logs enjoyable 😆

commit dd63dfd
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Oct 11 22:45:57 2023 +0200

    cleanup tests

commit 5565a18
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Oct 11 15:25:43 2023 +0200

    cleanup

commit 36a5882
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Oct 11 15:04:25 2023 +0200

    remove leftover from other branch

commit e82df02
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Oct 11 14:56:30 2023 +0200

    fix: shutdown

commit be80fe6
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Oct 11 14:55:28 2023 +0200

    chore: fmt

commit 7b237a6
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Oct 11 14:47:43 2023 +0200

    remove sync propagation and reports

commit cd84d25
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Oct 11 14:38:45 2023 +0200

    refactor: use seperate actors for sync and gossip

commit 6950056
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Oct 11 09:08:51 2023 +0200

    fix: return proper errors from iroh_gossip subscribe_all

commit 239c699
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Tue Oct 10 15:06:58 2023 +0200

    try to clear file locks by dropping things

commit 8865641
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Tue Oct 10 00:57:09 2023 +0200

    feat: add migration to populate latest table

commit a06113e
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 9 23:44:19 2023 +0200

    chore: clippy

commit af45ed1
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 9 23:38:46 2023 +0200

    tests: improve sync tests

commit 6aca9aa
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 9 18:15:04 2023 +0200

    chore: fmt

commit 6a09d69
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 9 18:12:39 2023 +0200

    fix: print

commit 31f0d81
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 9 18:11:14 2023 +0200

    test: account in sync_basic for sync happening two times in a row

commit 71c07c7
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 9 16:16:28 2023 +0200

    chore: fmt

commit 0d996a9
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 9 16:14:02 2023 +0200

    disable sync reports for now

commit 4b27778
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 9 16:10:24 2023 +0200

    improve error logging

commit d19646c
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 9 14:36:15 2023 +0200

    fix: actually store which entry is latest for each author

commit de6e0d8
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 9 12:14:29 2023 +0200

    chore: clippy

commit 07f0e24
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 9 11:47:28 2023 +0200

    maintain gossip join state

commit bd8198b
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 9 11:25:34 2023 +0200

    better logging

commit 4f3c573
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 9 10:54:12 2023 +0200

    tests: sync_big reduce number of nodes for CI

commit 0f3c06b
Merge: 44a6155 898b0f7
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Sun Oct 8 14:53:16 2023 +0200

    Merge remote-tracking branch 'origin/main' into sync-propagation

commit 44a6155
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Fri Oct 6 18:13:18 2023 +0200

    tests: give sync_big more timeout

commit 00e6c3d
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Fri Oct 6 18:11:05 2023 +0200

    debug: log events to find out why test fails

commit 714b628
Merge: e5c0f3a e2ee678
Author: Diva M <divma@protonmail.com>
Date:   Thu Oct 5 21:59:50 2023 -0500

    Merge branch 'main' into sync-propagation

commit e5c0f3a
Merge: 99359b8 2fd31b7
Author: Diva M <divma@protonmail.com>
Date:   Thu Oct 5 21:56:49 2023 -0500

    Merge branch 'main' into sync-propagation

commit 99359b8
Author: Diva M <divma@protonmail.com>
Date:   Thu Oct 5 21:56:46 2023 -0500

    fix bug

commit 4a7932e
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Oct 4 12:21:49 2023 +0200

    tests: sync_big better errors

commit 813c370
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Oct 4 00:36:04 2023 +0200

    fix: cleanups

commit 318a0e4
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Oct 4 00:07:02 2023 +0200

    chore: clippy

commit 0197023
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Tue Oct 3 17:24:47 2023 +0200

    fix: docs

commit b8679c3
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Tue Oct 3 17:22:51 2023 +0200

    refactor: improve logging

commit bcdec8f
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Tue Oct 3 16:51:43 2023 +0200

    fixes after rebase

commit 4ce5114
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 2 12:48:56 2023 +0200

    fix: try to fix connect abort logic

commit 9bd5108
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 2 12:32:41 2023 +0200

    fix: better cancellation logic

commit d9d501d
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 2 12:17:03 2023 +0200

    chore: clippy

commit 84224cf
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 2 12:14:29 2023 +0200

    fix: dont error log each failed download attempt

commit fac22d7
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Mon Oct 2 11:56:23 2023 +0200

    cleanups

commit b2b53d0
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Thu Sep 28 16:11:37 2023 +0200

    fix: tests

commit 2c9feaa
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Thu Sep 28 15:35:37 2023 +0200

    fix: rebase

commit a13a807
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Thu Sep 28 15:14:54 2023 +0200

    fix: after rebase

commit 67aa476
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Thu Sep 28 15:12:52 2023 +0200

    chore: fmt

commit 76b345e
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Thu Sep 28 13:42:19 2023 +0200

    refactor: rename some things

commit e45e12f
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Sep 27 10:11:47 2023 +0200

    fix: cleanup

commit f3b6e19
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Sep 27 09:44:07 2023 +0200

    fix: cleanup, docs

commit 9e12175
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Sep 27 01:36:37 2023 +0200

    chore: fmt & cleanup

commit 8ea2c44
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Sep 27 01:35:25 2023 +0200

    feat(wip): add AsyncReplica

commit 4ea484f
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Sep 27 01:21:21 2023 +0200

    chore: fmt

commit 9341423
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Sep 27 00:18:37 2023 +0200

    better debugging in iroh-sync

commit abbedbf
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Wed Sep 27 00:04:14 2023 +0200

    feat: better tracing in sync and gossip

commit 166656c
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Tue Sep 26 23:42:14 2023 +0200

    feat: propagate sync reports to neighbors, and try to improve select loop

commit c9a1087
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Tue Sep 26 23:39:26 2023 +0200

    deps: add console_subscriber as dev dependency

commit e0215c2
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Tue Sep 26 23:37:51 2023 +0200

    feat(iroh-sync): add state vector, add better results for sync operations

commit 6ad40fb
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Fri Sep 22 14:22:48 2023 +0200

    fix deps

commit 350140a
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Fri Sep 22 14:05:37 2023 +0200

    tests: improve sync_big test

commit 8280974
Author: Franz Heinzmann (Frando) <frando@unbiskant.org>
Date:   Thu Sep 21 12:54:31 2023 +0200

    tests: big sync test (wip)
@Frando Frando force-pushed the sync-no-deadlocks branch from 0a254f2 to 1481cfe Compare October 12, 2023 10:28
@Frando Frando changed the title (WIP) No more deadlocks in sync refactor(sync): No more deadlocks in sync Oct 12, 2023
@Frando Frando force-pushed the sync-no-deadlocks branch from f867408 to 3f9b183 Compare October 12, 2023 10:42
@b5 b5 modified the milestones: v0.8.0, v0.7.1 Oct 12, 2023
@Frando Frando force-pushed the sync-no-deadlocks branch from fc92284 to 9187232 Compare October 12, 2023 19:15
@Frando Frando force-pushed the sync-no-deadlocks branch 2 times, most recently from 1ad4d0a to b4f6522 Compare October 13, 2023 14:41
@Frando Frando changed the title refactor(iroh-sync): No more deadlocks in sync refactor(iroh-sync): Add actor to iroh-sync, remove deadlocks Oct 16, 2023
@dignifiedquire dignifiedquire added this pull request to the merge queue Oct 16, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 16, 2023
@dignifiedquire dignifiedquire added this pull request to the merge queue Oct 16, 2023
Merged via the queue into main with commit a70c6f1 Oct 16, 2023
@dignifiedquire dignifiedquire deleted the sync-no-deadlocks branch October 16, 2023 18:36
github-merge-queue bot pushed a commit that referenced this pull request Oct 17, 2023
## Description

With #1612 we moved all operations on replicas into a dedicated actor,
with all communication happening over a channel. A nice consequence of
this is that the iroh-sync store only lives in the actor thread, and
therefore we don't need a generic on the `SyncEngine` and the `Node`
anymore. This PR removes the generic.

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.
dignifiedquire pushed a commit that referenced this pull request Oct 18, 2023
## Description

With #1612 we moved all operations on replicas into a dedicated actor,
with all communication happening over a channel. A nice consequence of
this is that the iroh-sync store only lives in the actor thread, and
therefore we don't need a generic on the `SyncEngine` and the `Node`
anymore. This PR removes the generic.

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.
matheus23 pushed a commit that referenced this pull request Nov 14, 2024
## Description

* Add an actor to `iroh_sync` that processes async requests for
operations on Replicas
* Major refactoring of how we access the iroh sync API from iroh: Before
this our `live::Actor` would easily deadlock, because inserting entries
from gossip, and processing insert events coming back from the replica,
happened in the same actor loop. Also, because iroh-sync is fully
synchronous, we have to take care to not lock or block the async
executor. So, to solve both of these issues, all operations on replicas
and the iroh sync store are moved into a separate actor which runs in a
loop on a `std::thread` and just blocks while waiting for actions from a
channel. This, in turn, moves some state keeping from iroh to iroh-sync,
which is a good thing I think.
* Remove Clone and all locks from `Replica`. As they are now usually
owned by the `iroh_sync::actor` actor, no need for locks. We can get
unique mut refs instead.

* Change the subscription model on `Replica`s to be a
`Vec<flume::Sender<Event>>`. This now allows to directly subscribe to
replica events from multiple places. This allows us to have subscription
to replica events from the client not go through any actor. The replica
events are merged together with the events from the `sync_engine::live`
actor, and that's it. Should improve latency and ease work on the
actors. Because we merge the channels for replica events and live actor
events for client subscriptions, this means, that closing a replica from
the client does not end the subscription stream, because the
subscription on the live actor will remain active. To end the
subscription, the client would need to drop the receiver. I think this
is fine for now. What we might want to do instead is to split the
subscription from the client to two methods and channels,
`subscribe_insert_events` and `subscribe_sync_events` or so, then we
could close them separately. However the one for sync events would still
stay open indefinitely, because you wouldn't want to drop and recreate,
I think, in case you do `join` / `leave` / `join`. Or do we?

* Track the open state of Replicas in the `iroh_sync::actor`. I opted
for a simple implementation to start: The actor counts calls to `open`
and decrements on calls to `close` and closes the replica once the count
reaches zero. This works fine, however for replicas opened from the RPC
client, because there's no async drop, I spawn a tokio task in drop to
send the close call to the node. This means that if a client is
force-killed, the replica would remain open indefinitely. It would be
better to solve this cleaner - the only idea I had so far was to give
out something like `ReplicaDescriptor`s on `open` and then send require
regular keep-alive calls from the RPC client. I think it's fine to defer
this change to a followup (which will be straightforward with the
architecture in place now) because the impl in this PR is already quite
an improvement over the state in `main` or #1612, where we don't ever
close replicas.
* event subscriptions for replica insert events are now handled directly
in the `iroh_sync::actor`, which is much nicer IMO. For the RPC
subscription this event stream is merged with a subscription for sync
events from the `iroh::sync_engine::live` actor.


## Notes & open questions

Replicas may remain open indefinitely if the RPC client dies without
calling `Doc::close`. This should be fixed in a followup and will need
work in quic-rpc to get a notification once connections close.


## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.

---------

Co-authored-by: Friedel Ziegelmayer <me@dignifiedquire.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants