Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

MSC2246 async uploads #12484

Closed

Conversation

sumnerevans
Copy link
Contributor

@sumnerevans sumnerevans commented Apr 16, 2022

Implements MSC2246 by @tulir.

Note to reviewer: in general, you should be able to go commit-by-commit through this, but there are quite a few fixes at the end of the commit log since rebasing was getting difficult.

Outstanding questions:

  • Updating the content of an existing MXC URI must be atomic. Right now, it is not. I'm using LockStore, but there appears to be a bug (see MSC2246 async uploads #12484 (comment)).

    Blocked by LockStore: fix acquiring a lock via LockStore.try_acquire_lock #12832

  • Should the ?fi.mau.msc2246.max_stall_ms do anything when MSC2246 support is not enabled on the homeserver? My concern is that in the interim period when some homeservers have this enabled and some don't, that users could have a bad experience retrieving media over federation. (Decision: default to 20000ms)

  • Where is the best place to put tests for this. Should I create both unit tests and sytest tests? Not going to test this for now, since not in spec.

  • There's one failing sytest test. I tried to look at the logs, but I couldn't really understand what the problem was. If anyone has any suggestions on how to debug, please let me know. (Test has been fixed)

  • Where should the unused_expires_at config option be placed? (See first comment below)

Sign-off

Signed-off-by: Sumner Evans sumner@beeper.com

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

@sumnerevans sumnerevans force-pushed the sumner/msc2246-async-uploads branch 7 times, most recently from f643b25 to 43ef05e Compare April 16, 2022 03:52
@clokep clokep requested a review from a team April 19, 2022 11:28
@clokep
Copy link
Member

clokep commented Apr 19, 2022

@sumnerevans had some questions in #synapse-dev about how to implement parts of this. See the PR description / comments from @sumnerevans about areas of question.

@squahtx
Copy link
Contributor

squahtx commented Apr 19, 2022

I ran into a problem where I need to call async code from within the non-async callback that db_poll.runInteraction requires.

We have run_as_background_process to kick off an async operation without waiting for it to complete. There's no good way to await an async operation inside runInteraction.

Be warned that runInteraction can retry a transaction multiple times when there are concurrent writes to the same row. It's not a good idea to add side effects to a transaction for this reason. There is also txn.call_after which has similar caveats.

What async code do you have in mind?

@sumnerevans
Copy link
Contributor Author

@squahtx I was specifically trying to use db_pool.runInteraction. See here on my WIP not-working branch: /~https://github.com/sumnerevans/synapse/blob/sumner/msc2246-async-uploads-try-atomic-update/synapse/rest/media/v1/upload_resource.py#L141-L161 to wrap the verify_can_upload call and the update_content call to make them atomic.

However, the media repo's update_content function has to call self.media_storage.store_file which is async and it must also generate thumbnails which is async. Maybe there's a way to refactor it so that it doesn't have to do those async?

@squahtx
Copy link
Contributor

squahtx commented Apr 20, 2022

synapse-core: We've discussed the first question in #synapse-devs and decided that worker_locks / LockStore was the way forward.

The second, third and fourth questions could do with some guidance.

@squahtx
Copy link
Contributor

squahtx commented Apr 25, 2022

Should the ?fi.mau.msc2246.max_stall_ms do anything when MSC2246 support is not enabled on the homeserver? My concern is that in the interim period when some homeservers have this enabled and some don't, that users could have a bad experience retrieving media over federation.

Is the concern is that users on a homeserver without MSC2246 enabled will get 404s when fetching async uploaded media from a remote homeserver? I think the remote homeserver will wait for the default max_stall_ms before sending back a 404, so it's only a problem for larger media?

Where is the best place to put tests for this. Should I create both unit tests and sytest tests?
Unit tests are good to have if you've already written them.

For specced behaviour, complement or sytest tests would be desired. We're trying to move away from sytest long term, so complement tests would be preferred.

There's one failing sytest test. I tried to look at the logs, but I couldn't really understand what the problem was. If anyone has any suggestions on how to debug, please let me know.

The failing test is: Error: FAILURE: #736: Can quarantine media in rooms

The test case in question can be found at /~https://github.com/matrix-org/sytest/blob/develop/tests/51media/48admin-quarantine.pl. It:

  1. uploads media
  2. sends a message containing the media
  3. quarantines the media
  4. tests /download/ locally for a 404
  5. tests /thumbnail/ locally for a 404
  6. tests /download/ on a remote homeserver for a 404
  7. tests /thumbnail/ on a remote homeserver for a 404

If you click on Summary and scroll down, you can download synapse logs from the sytest run under the Artifacts section:
Sytest Logs - failure - (focal)

local homeserver:

2022-04-16 04:04:34,486 - synapse.rest.media.v1.media_repository - 277 - INFO - POST-7116 - Stored local media in file '/work/server-0/media_store/local_content/Pm/qE/NWnndmMKkejEdMNqdVKk'
2022-04-16 04:04:34,503 - synapse.rest.media.v1.upload_resource - 118 - INFO - POST-7116 - Uploaded content with URI 'mxc://localhost:8800/PmqENWnndmMKkejEdMNqdVKk'
2022-04-16 04:04:34,504 - synapse.access.https.8800 - 427 - INFO - POST-7116 - 127.0.0.1 - 8800 - {@anon-20220416_035706-875:localhost:8800} Processed request: 0.019sec/0.001sec (0.003sec, 0.002sec) (0.001sec/0.002sec/4) 63B 200 "POST /_matrix/media/v3/upload?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.48" [0 dbevts]
2022-04-16 04:04:34,540 - synapse.access.https.8800 - 427 - INFO - PUT-7117 - 127.0.0.1 - 8800 - {@anon-20220416_035706-875:localhost:8800} Processed request: 0.029sec/0.004sec (0.002sec, 0.001sec) (0.004sec/0.003sec/5) 59B 200 "PUT /_matrix/client/v3/rooms/!QJyLOtoEVSHjkzSZoO:localhost:8800/send/m.room.message/1025?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.48" [0 dbevts]
2022-04-16 04:04:34,554 - synapse.logging.context - 81 - WARNING - PUT-7117 - Re-starting finished log context PUT-7117
2022-04-16 04:04:34,569 - root - 59 - INFO - POST-7118 - Quarantining room: !QJyLOtoEVSHjkzSZoO:localhost:8800
2022-04-16 04:04:34,569 - synapse.storage.databases.main.room - 798 - INFO - POST-7118 - Quarantining media in room: !QJyLOtoEVSHjkzSZoO:localhost:8800
2022-04-16 04:04:34,572 - synapse.access.https.8800 - 427 - INFO - POST-7118 - 127.0.0.1 - 8800 - {@anon-20220416_035706-874:localhost:8800} Processed request: 0.026sec/0.001sec (0.004sec, 0.001sec) (0.005sec/0.010sec/9) 21B 200 "POST /_synapse/admin/v1/quarantine_media/!QJyLOtoEVSHjkzSZoO:localhost:8800?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.48" [0 dbevts]
2022-04-16 04:04:34,577 - synapse.rest.media.v1.media_repository - 328 - INFO - GET-7119 - Media is quarantined
2022-04-16 04:04:34,578 - synapse.access.https.8800 - 427 - INFO - GET-7119 - 127.0.0.1 - 8800 - {None} Processed request: 0.002sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/1) 94B 404 "GET /_matrix/media/v3/download/localhost:8800/PmqENWnndmMKkejEdMNqdVKk HTTP/1.1" "Perl + Net::Async::HTTP/0.48" [0 dbevts]
2022-04-16 04:04:34,582 - synapse.rest.media.v1.media_repository - 328 - INFO - GET-7120 - Media is quarantined
2022-04-16 04:04:34,583 - synapse.access.https.8800 - 427 - INFO - GET-7120 - 127.0.0.1 - 8800 - {None} Processed request: 0.002sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.001sec/1) 94B 404 "GET /_matrix/media/v3/thumbnail/localhost:8800/PmqENWnndmMKkejEdMNqdVKk?width=32&height=32&method=scale HTTP/1.1" "Perl + Net::Async::HTTP/0.48" [0 dbevts]
2022-04-16 04:04:34,592 - synapse.rest.media.v1.media_repository - 328 - INFO - GET-7121 - Media is quarantined
2022-04-16 04:04:34,593 - synapse.access.https.8800 - 427 - INFO - GET-7121 - 127.0.0.1 - 8800 - {None} Processed request: 0.002sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.001sec/1) 94B 404 "GET /_matrix/media/r0/download/localhost:8800/PmqENWnndmMKkejEdMNqdVKk?allow_remote=false HTTP/1.1" "Synapse/1.56.0" [0 dbevts]
2022-04-16 04:04:34,605 - synapse.rest.media.v1.media_repository - 328 - INFO - GET-7122 - Media is quarantined
2022-04-16 04:04:34,607 - synapse.access.https.8800 - 427 - INFO - GET-7122 - 127.0.0.1 - 8800 - {None} Processed request: 0.002sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/1) 94B 404 "GET /_matrix/media/r0/download/localhost:8800/PmqENWnndmMKkejEdMNqdVKk?allow_remote=false HTTP/1.1" "Synapse/1.56.0" [0 dbevts]

remote homeserver:

2022-04-16 04:04:34,594 - synapse.http.matrixfederationclient - 595 - INFO - GET-1001 - {GET-O-436} [localhost:8800] Got response headers: 404 Not Found
2022-04-16 04:04:34,595 - synapse.http.matrixfederationclient - 673 - WARNING - GET-1001 - {GET-O-436} [localhost:8800] Request failed: GET matrix://localhost:8800/_matrix/media/r0/download/localhost:8800/PmqENWnndmMKkejEdMNqdVKk?allow_remote=false: HttpResponseException('404: Not Found')
2022-04-16 04:04:34,595 - synapse.rest.media.v1.media_repository - 614 - WARNING - GET-1001 - HTTP error fetching remote media localhost:8800/PmqENWnndmMKkejEdMNqdVKk: b'{"errcode":"M_NOT_FOUND","error":"Not found [b\'localhost:8800\', b\'PmqENWnndmMKkejEdMNqdVKk\']"}'
2022-04-16 04:04:34,596 - synapse.access.https.8839 - 427 - INFO - GET-1001 - 127.0.0.1 - 8839 - {None} Processed request: 0.009sec/0.001sec (0.002sec, 0.000sec) (0.000sec/0.000sec/1) 109B 404 "GET /_matrix/media/v3/download/localhost:8800/PmqENWnndmMKkejEdMNqdVKk HTTP/1.1" "Perl + Net::Async::HTTP/0.48" [0 dbevts]
2022-04-16 04:04:34,608 - synapse.http.matrixfederationclient - 595 - INFO - GET-1002 - {GET-O-437} [localhost:8800] Got response headers: 404 Not Found
2022-04-16 04:04:34,608 - synapse.http.matrixfederationclient - 673 - WARNING - GET-1002 - {GET-O-437} [localhost:8800] Request failed: GET matrix://localhost:8800/_matrix/media/r0/download/localhost:8800/PmqENWnndmMKkejEdMNqdVKk?allow_remote=false: HttpResponseException('404: Not Found')
2022-04-16 04:04:34,608 - synapse.rest.media.v1.media_repository - 614 - WARNING - GET-1002 - HTTP error fetching remote media localhost:8800/PmqENWnndmMKkejEdMNqdVKk: b'{"errcode":"M_NOT_FOUND","error":"Not found [b\'localhost:8800\', b\'PmqENWnndmMKkejEdMNqdVKk\']"}'
2022-04-16 04:04:35,274 - synapse.handlers.presence - 788 - INFO - persist_presence_changes-5 - Persisting 1 unpersisted presence updates
2022-04-16 04:04:35,295 - synapse.util.caches.lrucache - 154 - INFO - LruCache._expire_old_entries-13 - Dropped 0 items from caches
2022-04-16 04:04:44,484 - synapse.http.site - 353 - INFO - GET-1002 - Connection from client lost before response was sent
2022-04-16 04:04:44,485 - synapse.access.https.8839 - 427 - INFO - GET-1002 - 127.0.0.1 - 8839 - {None} Processed request: 0.008sec/9.876sec (0.000sec, 0.000sec) (0.000sec/0.001sec/1) 0B 200! "GET /_matrix/media/v3/thumbnail/localhost:8800/PmqENWnndmMKkejEdMNqdVKk?method=scale&height=32&width=32 HTTP/1.1" "Perl + Net::Async::HTTP/0.48" [0 dbevts]

/thumbnail/ on the remote homeserver is not returning a 404. It's returning a 200 after about 10 seconds.

@DMRobertson
Copy link
Contributor

If I'm reading the conversation correctly, I think we've given all of Sumner's questions a response. I'll take this out of the review queue and mark it as waiting on changes. If that's not right, please let me know!

@DMRobertson DMRobertson removed the request for review from a team April 28, 2022 18:36
@sumnerevans
Copy link
Contributor Author

I tried to use LockStore.try_acquire_lock in order to linearize requests to update the same media ID. But the lock doesn't seem to actually be working (at least in the single-worker case).

Here's my attempt: /~https://github.com/sumnerevans/synapse/blob/sumner/msc2246-async-uploads-try-lock/synapse/rest/media/v1/upload_resource.py#L143-L171

If I send the requests sequentially as fast as possible, it does not properly linearize and so it updates the content twice.

So, the question is whether or not this is good enough or if I should continue making the locking mechanism more foolproof?

@squahtx
Copy link
Contributor

squahtx commented May 3, 2022

If I send the requests sequentially as fast as possible, it does not properly linearize and so it updates the content twice.

Does it sometimes linearize correctly? And are you testing using postgres or sqlite? I wonder if you've found a bug in LockStore.

So, the question is whether or not this is good enough or if I should continue making the locking mechanism more foolproof?

I can't see anything wrong with your attempt at a glance. I think we should try to figure out why it's broken. It might be something to do with how try_acquire_lock is being used, or it might be that LockStore has a bug that needs fixing.

@squahtx
Copy link
Contributor

squahtx commented May 3, 2022

I think LockStore has a bug. I came up with a hacky test case to demonstrate it:

tests/storage/databases/main/test_lock.py:

from twisted.internet import defer
from twisted.internet.defer import Deferred

...

class LockTestCase(unittest.HomeserverTestCase):
    ...

    def test_acquire_contention(self):
        # Track the number of tasks holding the lock.
        # Should be at most 1.
        in_lock = 0
        max_in_lock = 0

        release_lock: "Deferred[None]" = Deferred()
        async def task():
            nonlocal in_lock
            nonlocal max_in_lock

            lock = await self.store.try_acquire_lock("name", "key")
            if not lock:
                return

            async with lock:
                in_lock += 1
                max_in_lock = max(max_in_lock, in_lock)

                # Block to allow other tasks to attempt to take the lock.
                await release_lock

                in_lock -= 1

        # Start 3 tasks.
        defer.ensureDeferred(task())
        defer.ensureDeferred(task())
        defer.ensureDeferred(task())

        # All three are now waiting for the database.
        # Give the reactor a kick so that the database transaction returns.
        self.pump()

        release_lock.callback(None)

        # At most one task should have held the lock at a time.
        self.assertEqual(max_in_lock, 1)

The above test fails with "twisted.trial.unittest.FailTest: 3 != 1". ie. 3 things held the lock at once.

We'll probably want to fix that bug in a separate PR first. Sorry for the scope creep!
It might be worth discussing the approach to fixing try_acquire_lock in #synapse-dev, since locking can be tricky to get right.

@sumnerevans sumnerevans force-pushed the sumner/msc2246-async-uploads branch from 5b1d225 to 394b1c1 Compare May 22, 2022 04:36
@sumnerevans sumnerevans force-pushed the sumner/msc2246-async-uploads branch 3 times, most recently from f996c9f to 12eddde Compare May 30, 2022 13:33
@sumnerevans sumnerevans marked this pull request as ready for review May 30, 2022 15:32
@sumnerevans sumnerevans requested a review from a team as a code owner May 30, 2022 15:32
@sumnerevans sumnerevans force-pushed the sumner/msc2246-async-uploads branch 2 times, most recently from c8c752b to dbdff27 Compare May 30, 2022 17:01
@sumnerevans sumnerevans force-pushed the sumner/msc2246-async-uploads branch from dbdff27 to f13152c Compare May 31, 2022 15:21
sumnerevans and others added 11 commits June 3, 2022 12:30
Signed-off-by: Sumner Evans <sumner@beeper.com>
Signed-off-by: Sumner Evans <sumner@beeper.com>
Signed-off-by: Sumner Evans <sumner@beeper.com>
This allows for better handling of all of the versioned media endpoints

Signed-off-by: Sumner Evans <sumner@beeper.com>
Signed-off-by: Sumner Evans <sumner@beeper.com>
This depends on the new newly-fixed locking mechanism from
matrix-org#12832 to prevent races.

Signed-off-by: Sumner Evans <sumner@beeper.com>
The stall will default to 20s if MSC2246 is not enabled.

For remote media, the max_stall_ms parameter through to remote server.

This way, if servers over federation attempt to request the media, the
request will only 404 if the media is large (or the client doesn't
upload as soon as it can).

Signed-off-by: Sumner Evans <sumner@beeper.com>
Signed-off-by: Sumner Evans <sumner@beeper.com>
Signed-off-by: Sumner Evans <sumner@beeper.com>
Signed-off-by: Sumner Evans <sumner@beeper.com>
@sumnerevans sumnerevans force-pushed the sumner/msc2246-async-uploads branch from f13152c to 0ab87e5 Compare June 3, 2022 20:59
@erikjohnston erikjohnston self-assigned this Jun 15, 2022
@erikjohnston
Copy link
Member

@sumnerevans could you split this up into separate PRs please? It is a bit too unwieldy for me to follow I'm afraid. I'd suggest something like:

  1. Add config flags and create endpoint
  2. Uploading async media
  3. Stall parameter?

@@ -0,0 +1,84 @@
# Copyright 2014-2016 OpenMarket Ltd
# Copyright 2020-2021 The Matrix.org Foundation C.I.C.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new file so should be copy righted to 2022 to whatever legal entity you're contributing as

@DMRobertson DMRobertson added the X-Awaiting-Changes A contributed PR which needs changes and re-review before it can be merged label Feb 16, 2023
@sumnerevans sumnerevans mentioned this pull request Apr 20, 2023
11 tasks
@sumnerevans sumnerevans mentioned this pull request Apr 28, 2023
11 tasks
@clokep clokep mentioned this pull request May 23, 2023
12 tasks
@clokep
Copy link
Member

clokep commented May 24, 2023

@sumnerevans Should we close this in favor of #15503?

@sumnerevans
Copy link
Contributor Author

Closing in favor of #15503

@sumnerevans sumnerevans deleted the sumner/msc2246-async-uploads branch August 23, 2023 18:51
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
X-Awaiting-Changes A contributed PR which needs changes and re-review before it can be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants