Synapse workers Complement image results in flaky tests due to inconsistent worker process init #10065
Comments
Hmm, perhaps we could actually get away with doing the health checks in Python, and only telling Supervisor to start nginx when all health checks are passing, using Supervisor's XML-RPC API.
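For illustration, a minimal sketch of that idea in Python - assuming supervisord's XML-RPC interface is exposed on 127.0.0.1:9001, nginx is defined as a supervisord program with `autostart=false`, and the worker health URLs below are placeholders:

```python
# Sketch: wait for every worker's /health endpoint, then ask supervisord to
# start nginx via its XML-RPC API. Ports and process names are illustrative.
import time
import urllib.request
from xmlrpc.client import ServerProxy

WORKER_HEALTH_URLS = [
    "http://127.0.0.1:8081/health",  # hypothetical worker listeners
    "http://127.0.0.1:8082/health",
]

def wait_for_health(url: str, timeout: float = 120.0) -> None:
    """Poll a health endpoint until it returns 200, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return
        except OSError:
            pass
        time.sleep(0.5)
    raise RuntimeError(f"{url} never became healthy")

def main() -> None:
    for url in WORKER_HEALTH_URLS:
        wait_for_health(url)
    # All workers are up; only now bring up the reverse proxy.
    supervisord = ServerProxy("http://127.0.0.1:9001/RPC2")
    supervisord.supervisor.startProcess("nginx")

if __name__ == "__main__":
    main()
```

`startProcess` is part of supervisord's standard XML-RPC namespace, so no plugin would be needed; the awkward part is knowing the full set of worker health URLs up front.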
For the record, this is the mechanism sytest uses (see /~https://github.com/matrix-org/sytest/blob/release-v1.35/lib/SyTest/Homeserver/ProcessManager.pm#L300-L362). Although it's defined by systemd, there's nothing in particular to stop anyone else using the same mechanism - you just open a unix socket and set the `NOTIFY_SOCKET` environment variable so the child process can report readiness on it. That said, I'm not sure if it really helps you with Supervisord.
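To make that mechanism concrete, here's a rough sketch of the handshake in Python; the socket path and worker command line are made up, and this is only meant to illustrate the protocol, not how sytest or Synapse actually wire it up:

```python
# Parent opens a unix datagram socket, points the child at it via
# NOTIFY_SOCKET, and blocks until the child sends "READY=1" (the sd_notify
# convention). Paths and the worker command are placeholders.
import os
import socket
import subprocess

NOTIFY_PATH = "/tmp/worker-notify.sock"  # hypothetical socket path

if os.path.exists(NOTIFY_PATH):
    os.unlink(NOTIFY_PATH)
sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
sock.bind(NOTIFY_PATH)
sock.settimeout(60)  # give up if the worker never reports readiness

env = dict(os.environ, NOTIFY_SOCKET=NOTIFY_PATH)
worker = subprocess.Popen(["python", "-m", "some_worker_module"], env=env)

while True:
    data, _ = sock.recvfrom(4096)
    if b"READY=1" in data.split(b"\n"):
        break

print("worker reported ready; safe to start the next process")
```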
At that point, I'm kinda led to wonder if supervisord is actually doing much for you - forking processes isn't particularly hard; maybe it would be easier to manage the whole thing in python yourself (as sytest does, only in perl). Yet another alternative suggestion: give your docker image a HEALTHCHECK command. That command could know which workers should be running, and go and poll each of them (via some mechanism to be defined). Have Complement use the "health" of the docker container instead of polling /versions.
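As a sketch of the HEALTHCHECK idea: the image could ship a small script like the one below (worker ports are hypothetical) and point a `HEALTHCHECK ... CMD` instruction in the Dockerfile at it, so the container only reports healthy once every worker answers:

```python
#!/usr/bin/env python3
# Hypothetical container healthcheck: exit 0 only if every expected worker
# answers its /health endpoint; a Dockerfile would run it via something like
#   HEALTHCHECK --interval=2s --timeout=3s --retries=60 CMD python3 /healthcheck.py
import sys
import urllib.request

WORKER_PORTS = [8080, 8081, 8082]  # illustrative; one per configured worker

for port in WORKER_PORTS:
    try:
        with urllib.request.urlopen(f"http://127.0.0.1:{port}/health", timeout=2) as resp:
            if resp.status != 200:
                sys.exit(1)
    except OSError:
        sys.exit(1)

sys.exit(0)
```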
yet a third idea: we already have a thing for managing synapse worker processes: it's called `synctl`. Edit: oh wait, that probably doesn't help you since there's still no way to get supervisord to wait before starting nginx.
I would be happy with this on the Complement side. If no HEALTHCHECK exists, it'll fall back to polling `/_matrix/client/versions`.
I'd shop around a bit for supervisord plugins that could achieve waiting on a process to fully start up before starting the next one. Otherwise, just managing the processes in python should work with some light effort.
It isn't too bad, although we seem to depend on supervisor's "restart until you don't crash" behaviour to get past the "upgrade your database" error that happens. Any pointers to where sytest manages all this state?
Actually, I guess this is the systemd stuff talked about earlier? (Unless there's more to it?)
Sytest waits for a notification on the unix socket mentioned above (the systemd `NOTIFY_SOCKET` mechanism).
I attempted to implement what is proposed above in #10705.
Note that working in these scripts is a bit painful as it seems that not all output actually ends up at stdout/stderr. I think some of this might be related to Python buffering output, but I'm not 100% sure. Anyway -- I'm going to be unable to continue researching this due to other responsibilities.
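In case it is the buffering, a quick way to rule it out is to force unbuffered or line-buffered output, e.g.:

```python
# Either run the scripts with `python -u` / set PYTHONUNBUFFERED=1 in the
# container environment, or force flushing from the code itself:
import sys

sys.stdout.reconfigure(line_buffering=True)  # flush on every newline (Python 3.7+)
print("worker starting", flush=True)         # or flush individual writes
```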
I ran into this and accidentally went down a rabbithole when trying to confirm some complement-perf changes hadn't broken anything, and came up with these PRs that might make healthchecks more reliable for the synapse-workers image: #11429. Not sure if this is actually slower than doing the right thing in terms of ordering startup of the processes in the docker container, and I'm not sure if it's actually making everything more reliable.
I think waiting for all healthchecks to go green, and encouraging consumer homeservers to implement those healthchecks, is a good way forward. This would take care of implementation-specific startup quirks, and while querying /versions is a reasonable wet-finger approach, a healthcheck would be better, in my opinion.
I feel like most of this is done now that #11429 and matrix-org/complement#293 have landed.
The worker-version of Synapse running in Complement (as described here) currently uses Supervisor as an init system to start all worker processes in the container. See the current config template we're using.
This works, and eventually all processes start up. However, Complement checks whether a homeserver is ready to start testing by whether it responds successfully to a `GET /_matrix/client/versions` call. That endpoint may be answered successfully by a worker that has already started while other workers are still starting up. This inconsistency can lead to test failures: Complement calls a different endpoint that should be handled by a different worker, that worker hasn't started yet, nginx returns a 502, and the test fails. The result is flaky Complement tests - which nobody wants.
I believe the solution is to start groups of processes in the container using a priority system: the next group should only be started once the previous group has successfully responded to healthchecks (indicating those processes are ready to receive connections).
(Note that caddy [just used for custom CA stuff] and Postgres are started before even Supervisor is.) By starting nginx at the very end (nginx being the reverse proxy that actually routes matrix requests to the appropriate Synapse process), Complement will not receive a successful response to `/_matrix/client/versions` until everything else has started.

Initially I had hoped to use systemd as an init system to replace Supervisor, but systemd apparently doesn't work in docker containers. Additionally, we need each process to output its logs to stdout, as otherwise Complement won't be able to display the homeserver logs after a test failure; systemd would make this a bit tricky as it tries to capture logs. Currently this works by having Supervisor simply redirect all process logs to stdout, which the `ENTRYPOINT` of the docker container, configure_workers_and_start.py, then relays.

I don't believe we want to use `synctl` here, as the team has been trying to phase that out for a while now. We could simply do all of this via `subprocess` in `configure_workers_and_start.py` (a rough sketch of that approach is at the end of this issue), but I'm hoping there's a better, less manual way. Any ideas? I'd also love to be proven wrong as to whether Supervisor can actually do the above on its own.

@richvdh and @erikjohnston also mentioned that Synapse has a way for a process to signal that it's ready to receive connections (which may be better than just polling the /health endpoint), which may be useful for this discussion. Edit: I've just had a look, and it looks like we use a systemd-specific method called `sdnotify`, which won't be useful here unfortunately.
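For completeness, here is roughly what the `subprocess`-based approach mentioned above could look like inside configure_workers_and_start.py. It is only a sketch: the group definitions, ports and commands are placeholders, and real code would also need to handle the "upgrade your database" error that Supervisor's restart behaviour currently papers over.

```python
import subprocess
import time
import urllib.request

# Placeholder start order: each group is a list of (command, health URL) pairs.
# Real code would generate this from the requested worker types.
PROCESS_GROUPS = [
    [(["python", "-m", "synapse.app.homeserver", "-c", "/conf/homeserver.yaml"],
      "http://127.0.0.1:8080/health")],
    [(["python", "-m", "synapse.app.generic_worker", "-c", "/conf/worker1.yaml"],
      "http://127.0.0.1:8081/health"),
     (["python", "-m", "synapse.app.generic_worker", "-c", "/conf/worker2.yaml"],
      "http://127.0.0.1:8082/health")],
]

def wait_healthy(url: str, timeout: float = 120.0) -> None:
    """Poll a /health endpoint until it answers 200, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return
        except OSError:
            pass
        time.sleep(0.5)
    raise RuntimeError(f"{url} did not become healthy in time")

def main() -> None:
    # Children inherit our stdout/stderr, so their logs reach Complement
    # without any extra relaying.
    for group in PROCESS_GROUPS:
        for cmd, _ in group:
            subprocess.Popen(cmd)
        for _, health_url in group:
            wait_healthy(health_url)
    # Everything is up: start nginx in the foreground, so /versions only
    # starts answering once all workers are ready.
    subprocess.run(["nginx", "-g", "daemon off;"], check=True)

if __name__ == "__main__":
    main()
```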