-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
appservice_stream_position can get into such a state that ASes break silently #1834
Comments
so on my HS atm i have:
and it somehow got into a state that the value in appservice_stream_position was mangled (missing entirely, iirc) |
I've also been hit by this, now for the second time during the last week. The first time I "solved" this be destroying the DBs and starting from scratch as I was still setting things up anyway. Now on the second time my DB looked like this:
I was able to fix it by setting the stream_ordering column in appservice_stream_position to max(stream_ordering) from events like @ara4n said even though I was not missing the row.
|
@kegsay any idea? |
I also had this issue, but correcting the values fixed it. |
@ara4n no - I didn't do any of the stream position stuff when I did the original Synapse AS API implementation. You'd need to ask @erikjohnston. |
I had the same issue, fixed it via this. Then I joined a room (#riot:matrix.org) noone on the synapse server was in yet, and it got off-sync again and i had to re-apply the fix. Perhaps this'll help to hunt down the bug |
This is still very much an issue. There's absolutely no information in the appservice worker's log, as it seems to successfully send a transaction to an appservice, then it stops trying ever again. It does still receive rdata caches/events, and the following appears between rdata updates, but eventually it also stops receiving rdata:
A restart of the appservice worker did not help too much (it started receiving rdata again, but not sending transactions to appservices). I had to do the update query manually and restart the entire synapse instance. Are there particular flags/logging I can leave on to try and debug this if it happens again? (this is the 3rd time in ~5 months it has happened). |
This happened once again for me (t2bot.io), and I don't see anything useful in the logs. I have exported them for viewing if desired though (they have tokens and such, so I'd rather not post them publicly). |
and again :(
Logs are also available for this event. |
This is becoming a trend where every couple months I'm having to meddle with the database. This has once again happened. Is there a lead on a possible cause for this happening? |
ok so let me just try to understand the symptoms here:
Is that correct? If so, the only plausible conclusion is that synapse's appservice sender gets stuck on particular events; by updating appservice_stream_position, we skip the problematic event and things get going again. Since synapse handles events in batches of 100, the problem might lie anywhere in the next 100 events after the sticking point (ie, stream_orderings 65097573 to 65097672 in @turt2live 's latest example). I'm not quite sure how we can go about narrowing down the problem further :/. It's possible that some recent improvements to error handling in synapse (in particular, /~https://github.com/matrix-org/synapse/pull/3138/files#diff-df9c8550f76f9aca4912eaa72d55379eR121) will reveal the problem here. |
That's pretty much the problem. The additional issue is that there's no logging whatsoever for an issue occuring: it just... stops Improved error handling may catch it. I'm also looking at adding much more verbose logging in the area to see if it can catch what's going on. |
We got bitten by this again today, because of some events that existed in the |
Ran into this again today, fast forwarded with this query: update appservice_stream_position set stream_ordering = (select max(stream_ordering) from events); No synapse restart required |
got crazy setting up mautrix-telegram, thanks @TimePath for the postgresql command
after applying this, I see transactions between synapse and mautrix-telegram happening, phew |
please can people who observe this issue share logs from the homeserver. |
Checked my logs and I don't see anything relevant. We can coordinate privately about the logs you need |
closing this for lack of useful response from the reporters. |
it's because the logs are silent - this still happens. What logs are you hoping to see? |
ok, well that is useful information in itself. Last I heard we'd made some logging changes and were waiting to hear if that showed anything up. If you could share INFO-level logs from the time of the last successful transaction to a point where you would definitely have expected there to be another transaction, that would be a good starting point to check we're all on the same page. |
my logs have since rolled over (happened last week), but any day now it's due to happen again. |
I had this on my homeserver. Updating I can share logs from today, but not from around when it stopped working. I don't use Telegram much, so based on my Telegram/Matrix rooms, all I know is that it broke sometime between April 16th and August 30th. I only have about two weeks worth of logs on my HS. Edit: I hacked together a script to catch this next time it breaks: https://gist.github.com/Twi1ightSparkle/53d33ce50c813f84fcccb3abe8dcdfba |
@richvdh This just happened and I have logs this time. Would you like them personally or should I send them to someone else? |
Looking at logs and prometheus from @turt2live (thanks!), I'm seeing the following background process in flight: synapse/synapse/handlers/appservice.py Lines 86 to 179 in a7a9139
which is the loop that takes streams from the events table and dispatches to the various appservice classes for sending out. However, the outer loop is stuck, as the metric counts at the end of the loop aren't progressing: synapse/synapse/handlers/appservice.py Line 164 in a7a9139
But, the inner loop is progressing, at a rate of one iteration per minute, based on: synapse/synapse/handlers/appservice.py Lines 137 to 139 in a7a9139
During that time we see requests in the logs and metrics timing out after one minute:
which is probably being triggered from the loop by: synapse/synapse/handlers/appservice.py Lines 116 to 118 in a7a9139
Since we won't process the next batch of events until we've finished the current batch, this one minute lag per event processed is enough to cause all appservices to fall behind. (Side note: @turt2live has a patch to drop "old" events from being sent out, so we end up not seeing any transactions being sent out) Other observations:
I'm not entirely sure what's going on here, but there seems to suddenly be a large number of discord membership events all at once? Does the discord bridge then fail to handle the user querying correctly? Perhaps all user queries to the discord appservice are slow, and its only a noticeable problem due to the sheer number of membership requests all at once. |
It still seems to be happening, Synapse 1.31.0 stopped sending events to appservice rooms suddenly. |
Thanks so much, just ran into this issue! |
We ran into this for the Telegram bridge at SNT, and were able to identify the problem (at least for us): if all events that would be returned by PR incoming. |
Fixes matrix-org#1834 Signed-off-by: Willem Mulder <14mRh4X0r@gmail.com>
Fixes #1834. `get_new_events_for_appservice` internally calls `get_events_as_list`, which will filter out any rejected events. If all returned events are filtered out, `_notify_interested_services` will return without updating the last handled stream position. If there are 100 consecutive such events, processing will halt altogether. Breaking the loop is now done by checking whether we're up-to-date with `current_max` in the loop condition, instead of relying on an empty `events` list. Signed-off-by: Willem Mulder <14mRh4X0r@gmail.com>
I hit this and had to manually resolve it, would be nice if modern versions of Synapse were able to self-repair in this scenario. |
@t3chguy this issue is supposedly fixed - if you have evidence to the contrary, could you open a new issue with more details? |
I wasted 3h on arasphere last night trying to work out why it wouldn't send stuff to ASes. turns out the value in that table was somehow mangled. deleting it didn't help, but manually reinserting max streampos did. Plus it's inconsistently named wrt all the other AS stuff :(
The text was updated successfully, but these errors were encountered: