Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

appservice_stream_position can get into such a state that ASes break silently #1834

Closed
ara4n opened this issue Jan 19, 2017 · 30 comments · Fixed by #10107
Closed

appservice_stream_position can get into such a state that ASes break silently #1834

ara4n opened this issue Jan 19, 2017 · 30 comments · Fixed by #10107
Labels
A-Application-Service Related to AS support S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@ara4n
Copy link
Member

ara4n commented Jan 19, 2017

I wasted 3h on arasphere last night trying to work out why it wouldn't send stuff to ASes. turns out the value in that table was somehow mangled. deleting it didn't help, but manually reinserting max streampos did. Plus it's inconsistently named wrt all the other AS stuff :(

@ara4n
Copy link
Member Author

ara4n commented Apr 19, 2017

so on my HS atm i have:

sqlite> select max(stream_ordering) from events;
811112
sqlite> select * from appservice_stream_position;
X|809618

and it somehow got into a state that the value in appservice_stream_position was mangled (missing entirely, iirc)
and i unstuck it by manually inserting max(stream_ordering) from events into it

@varesa
Copy link

varesa commented Apr 19, 2017

I've also been hit by this, now for the second time during the last week.

The first time I "solved" this be destroying the DBs and starting from scratch as I was still setting things up anyway.

Now on the second time my DB looked like this:

sqlite> select max(stream_ordering) from events;                                                                                                                                                                                                                                                                               
82774
sqlite> select * from appservice_stream_position;
X|6481

I was able to fix it by setting the stream_ordering column in appservice_stream_position to max(stream_ordering) from events like @ara4n said even though I was not missing the row.

sqlite> update appservice_stream_position set stream_ordering=82774;

@ara4n
Copy link
Member Author

ara4n commented Apr 19, 2017

@kegsay any idea?

@Spydar007
Copy link

I also had this issue, but correcting the values fixed it.

@kegsay
Copy link
Member

kegsay commented Jul 14, 2017

@ara4n no - I didn't do any of the stream position stuff when I did the original Synapse AS API implementation. You'd need to ask @erikjohnston.

@Sorunome
Copy link
Contributor

I had the same issue, fixed it via this.

Then I joined a room (#riot:matrix.org) noone on the synapse server was in yet, and it got off-sync again and i had to re-apply the fix. Perhaps this'll help to hunt down the bug

@turt2live
Copy link
Member

This is still very much an issue. There's absolutely no information in the appservice worker's log, as it seems to successfully send a transaction to an appservice, then it stops trying ever again. It does still receive rdata caches/events, and the following appears between rdata updates, but eventually it also stops receiving rdata:

appservice - 2017-10-11 02:55:54,347 - synapse.storage.TIME - 219 - INFO - - Total database time: 0.000% {set_appservice_last_pos(0): 0.000%, get_users_in_room(0): 0.000%, get_user_by_id(0): 0.000%} {}

A restart of the appservice worker did not help too much (it started receiving rdata again, but not sending transactions to appservices). I had to do the update query manually and restart the entire synapse instance.

Are there particular flags/logging I can leave on to try and debug this if it happens again? (this is the 3rd time in ~5 months it has happened).

@turt2live
Copy link
Member

This happened once again for me (t2bot.io), and I don't see anything useful in the logs. I have exported them for viewing if desired though (they have tokens and such, so I'd rather not post them publicly).

@turt2live
Copy link
Member

and again :(

synapse=# select max(stream_ordering) from events;
   max
----------
 65145242
(1 row)

synapse=# select * from appservice_stream_position;
 lock | stream_ordering
------+-----------------
 X    |        65097572
(1 row)

Logs are also available for this event.

@turt2live
Copy link
Member

This is becoming a trend where every couple months I'm having to meddle with the database. This has once again happened.

Is there a lead on a possible cause for this happening?

@richvdh
Copy link
Member

richvdh commented May 9, 2018

ok so let me just try to understand the symptoms here:

  • synapse stops sending events to appservices
  • inspection of the db shows that appservice_stream_position is stuck at an old event
  • restarting synapse does not fix it
  • updating appservice_stream_position to a later value (and restarting synapse) does fix it

Is that correct?

If so, the only plausible conclusion is that synapse's appservice sender gets stuck on particular events; by updating appservice_stream_position, we skip the problematic event and things get going again.

Since synapse handles events in batches of 100, the problem might lie anywhere in the next 100 events after the sticking point (ie, stream_orderings 65097573 to 65097672 in @turt2live 's latest example). I'm not quite sure how we can go about narrowing down the problem further :/.

It's possible that some recent improvements to error handling in synapse (in particular, /~https://github.com/matrix-org/synapse/pull/3138/files#diff-df9c8550f76f9aca4912eaa72d55379eR121) will reveal the problem here.

@turt2live
Copy link
Member

That's pretty much the problem. The additional issue is that there's no logging whatsoever for an issue occuring: it just... stops

Improved error handling may catch it. I'm also looking at adding much more verbose logging in the area to see if it can catch what's going on.

@Half-Shot
Copy link
Collaborator

We got bitten by this again today, because of some events that existed in the events table, yet DataStore._get_events couldn't find any of them so the stream position couldn't advance. I believe we should skip over any events we couldn't fetch.

@TimePath
Copy link

Ran into this again today, fast forwarded with this query:

update appservice_stream_position set stream_ordering = (select max(stream_ordering) from events);

No synapse restart required

@pedro-nonfree
Copy link

got crazy setting up mautrix-telegram, thanks @TimePath for the postgresql command

[20:37:52] root@matrix:~# sudo -u postgres psql synapse
could not change directory to "/root": Permission denied
psql (11.7 (Debian 11.7-0+deb10u1))
Type "help" for help.

synapse=# update appservice_stream_position set stream_ordering = (select max(stream_ordering) from events);

after applying this, I see transactions between synapse and mautrix-telegram happening, phew

@richvdh
Copy link
Member

richvdh commented Aug 26, 2020

please can people who observe this issue share logs from the homeserver.

@pedro-nonfree
Copy link

Checked my logs and I don't see anything relevant. We can coordinate privately about the logs you need

@richvdh
Copy link
Member

richvdh commented Oct 20, 2020

closing this for lack of useful response from the reporters.

@richvdh richvdh closed this as completed Oct 20, 2020
@turt2live
Copy link
Member

it's because the logs are silent - this still happens. What logs are you hoping to see?

@turt2live turt2live reopened this Oct 20, 2020
@richvdh
Copy link
Member

richvdh commented Oct 20, 2020

ok, well that is useful information in itself. Last I heard we'd made some logging changes and were waiting to hear if that showed anything up.

If you could share INFO-level logs from the time of the last successful transaction to a point where you would definitely have expected there to be another transaction, that would be a good starting point to check we're all on the same page.

@turt2live
Copy link
Member

my logs have since rolled over (happened last week), but any day now it's due to happen again.

@Twi1ightSparkle
Copy link

Twi1ightSparkle commented Oct 23, 2020

I had this on my homeserver. Updating appservice_stream_position solved the problem.

I can share logs from today, but not from around when it stopped working. I don't use Telegram much, so based on my Telegram/Matrix rooms, all I know is that it broke sometime between April 16th and August 30th. I only have about two weeks worth of logs on my HS.
Based on the timestamp off the event that appservice_stream_position was stuck on, it broke May 20th

Edit: I hacked together a script to catch this next time it breaks: https://gist.github.com/Twi1ightSparkle/53d33ce50c813f84fcccb3abe8dcdfba

@turt2live
Copy link
Member

@richvdh This just happened and I have logs this time. Would you like them personally or should I send them to someone else?

@erikjohnston
Copy link
Member

Looking at logs and prometheus from @turt2live (thanks!), I'm seeing the following background process in flight:

async def _notify_interested_services(self, max_token: RoomStreamToken):
with Measure(self.clock, "notify_interested_services"):
self.is_processing = True
try:
limit = 100
while True:
(
upper_bound,
events,
) = await self.store.get_new_events_for_appservice(
self.current_max, limit
)
if not events:
break
events_by_room = {} # type: Dict[str, List[EventBase]]
for event in events:
events_by_room.setdefault(event.room_id, []).append(event)
async def handle_event(event):
# Gather interested services
services = await self._get_services_for_event(event)
if len(services) == 0:
return # no services need notifying
# Do we know this user exists? If not, poke the user
# query API for all services which match that user regex.
# This needs to block as these user queries need to be
# made BEFORE pushing the event.
await self._check_user_exists(event.sender)
if event.type == EventTypes.Member:
await self._check_user_exists(event.state_key)
if not self.started_scheduler:
async def start_scheduler():
try:
return await self.scheduler.start()
except Exception:
logger.error("Application Services Failure")
run_as_background_process("as_scheduler", start_scheduler)
self.started_scheduler = True
# Fork off pushes to these services
for service in services:
self.scheduler.submit_event_for_as(service, event)
now = self.clock.time_msec()
ts = await self.store.get_received_ts(event.event_id)
synapse.metrics.event_processing_lag_by_event.labels(
"appservice_sender"
).observe((now - ts) / 1000)
async def handle_room_events(events):
for event in events:
await handle_event(event)
await make_deferred_yieldable(
defer.gatherResults(
[
run_in_background(handle_room_events, evs)
for evs in events_by_room.values()
],
consumeErrors=True,
)
)
await self.store.set_appservice_last_pos(upper_bound)
now = self.clock.time_msec()
ts = await self.store.get_received_ts(events[-1].event_id)
synapse.metrics.event_processing_positions.labels(
"appservice_sender"
).set(upper_bound)
events_processed_counter.inc(len(events))
event_processing_loop_room_count.labels("appservice_sender").inc(
len(events_by_room)
)
event_processing_loop_counter.labels("appservice_sender").inc()
synapse.metrics.event_processing_lag.labels(
"appservice_sender"
).set(now - ts)
synapse.metrics.event_processing_last_ts.labels(
"appservice_sender"
).set(ts)
finally:
self.is_processing = False

which is the loop that takes streams from the events table and dispatches to the various appservice classes for sending out. However, the outer loop is stuck, as the metric counts at the end of the loop aren't progressing:

events_processed_counter.inc(len(events))

But, the inner loop is progressing, at a rate of one iteration per minute, based on:

synapse.metrics.event_processing_lag_by_event.labels(
"appservice_sender"
).observe((now - ts) / 1000)

During that time we see requests in the logs and metrics timing out after one minute:

INFO - notify_interested_services-743543- Error sending request to  GET https://<discord_appservice>/users/<user_id>?access_token=<redacted>: RequestTimedOutError 504: Timeout waiting for response from remote server

which is probably being triggered from the loop by:

await self._check_user_exists(event.sender)
if event.type == EventTypes.Member:
await self._check_user_exists(event.state_key)

Since we won't process the next batch of events until we've finished the current batch, this one minute lag per event processed is enough to cause all appservices to fall behind. (Side note: @turt2live has a patch to drop "old" events from being sent out, so we end up not seeing any transactions being sent out)

Other observations:

  • Things quite obviously get stuck going slow at a particular instance
  • At first we see events going out to the various different appservices, but then for the rest of the time we only see membership events for the discord bridge.
  • It looks like the user query hit to the discord appservice doesn't at first

I'm not entirely sure what's going on here, but there seems to suddenly be a large number of discord membership events all at once? Does the discord bridge then fail to handle the user querying correctly? Perhaps all user queries to the discord appservice are slow, and its only a noticeable problem due to the sheer number of membership requests all at once.

@richvdh richvdh added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. and removed z-info-needed labels Apr 9, 2021
@waclaw66
Copy link

It still seems to be happening, Synapse 1.31.0 stopped sending events to appservice rooms suddenly.
appservice_stream_position mismatched to select max(stream_ordering) from events.
Fixed by #1834 (comment).

@chagai95
Copy link
Contributor

chagai95 commented May 3, 2021

Thanks so much, just ran into this issue!

@14mRh4X0r
Copy link
Contributor

We ran into this for the Telegram bridge at SNT, and were able to identify the problem (at least for us): if all events that would be returned by get_new_events_for_appservice are rejected, the events list will be empty.

PR incoming.

14mRh4X0r added a commit to 14mRh4X0r/synapse that referenced this issue Jun 2, 2021
Fixes matrix-org#1834

Signed-off-by: Willem Mulder <14mRh4X0r@gmail.com>
erikjohnston pushed a commit that referenced this issue Jun 7, 2021
Fixes #1834.

`get_new_events_for_appservice` internally calls `get_events_as_list`, which will filter out any rejected events. If all returned events are filtered out, `_notify_interested_services` will return without updating the last handled stream position. If there are 100 consecutive such events, processing will halt altogether.

Breaking the loop is now done by checking whether we're up-to-date with `current_max` in the loop condition, instead of relying on an empty `events` list.


Signed-off-by: Willem Mulder <14mRh4X0r@gmail.com>
@MadLittleMods MadLittleMods added the A-Application-Service Related to AS support label Nov 10, 2021
@t3chguy
Copy link
Member

t3chguy commented Dec 21, 2021

I hit this and had to manually resolve it, would be nice if modern versions of Synapse were able to self-repair in this scenario.

@richvdh
Copy link
Member

richvdh commented Dec 22, 2021

@t3chguy this issue is supposedly fixed - if you have evidence to the contrary, could you open a new issue with more details?

@t3chguy
Copy link
Member

t3chguy commented Dec 22, 2021

#11629

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Application-Service Related to AS support S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

Successfully merging a pull request may close this issue.