Synapse workers Complement image results in flaky tests due to inconsistent worker process init #10065
Comments
Hmm, perhaps we could actually get away with doing the health checks in Python, and only telling Supervisor to start nginx when all health checks are passing, using Supervisor's XML-RPC API.
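For illustration, a minimal sketch of that idea in Python - assuming supervisord's XML-RPC interface is exposed on 127.0.0.1:9001, nginx is defined as a supervisord program with `autostart=false`, and the worker health URLs below are placeholders:

```python
# Sketch: wait for every worker's /health endpoint, then ask supervisord to
# start nginx via its XML-RPC API. Ports and process names are illustrative.
import time
import urllib.request
from xmlrpc.client import ServerProxy

WORKER_HEALTH_URLS = [
    "http://127.0.0.1:8081/health",  # hypothetical worker listeners
    "http://127.0.0.1:8082/health",
]

def wait_for_health(url: str, timeout: float = 120.0) -> None:
    """Poll a health endpoint until it returns 200, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return
        except OSError:
            pass
        time.sleep(0.5)
    raise RuntimeError(f"{url} never became healthy")

def main() -> None:
    for url in WORKER_HEALTH_URLS:
        wait_for_health(url)
    # All workers are up; only now bring up the reverse proxy.
    supervisord = ServerProxy("http://127.0.0.1:9001/RPC2")
    supervisord.supervisor.startProcess("nginx")

if __name__ == "__main__":
    main()
```

`startProcess` is part of supervisord's standard XML-RPC namespace, so no plugin would be needed; the awkward part is knowing the full set of worker health URLs up front.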
For the record, this is the mechanism sytest uses (see /~https://github.com/matrix-org/sytest/blob/release-v1.35/lib/SyTest/Homeserver/ProcessManager.pm#L300-L362). Although it's defined by systemd, there's nothing in particular to stop anyone else using the same mechanism - you just open a unix socket and set the `NOTIFY_SOCKET` environment variable so the child process can report readiness on it. That said, I'm not sure if it really helps you with Supervisord.
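To make that mechanism concrete, here's a rough sketch of the handshake in Python; the socket path and worker command line are made up, and this is only meant to illustrate the protocol, not how sytest or Synapse actually wire it up:

```python
# Parent opens a unix datagram socket, points the child at it via
# NOTIFY_SOCKET, and blocks until the child sends "READY=1" (the sd_notify
# convention). Paths and the worker command are placeholders.
import os
import socket
import subprocess

NOTIFY_PATH = "/tmp/worker-notify.sock"  # hypothetical socket path

if os.path.exists(NOTIFY_PATH):
    os.unlink(NOTIFY_PATH)
sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
sock.bind(NOTIFY_PATH)
sock.settimeout(60)  # give up if the worker never reports readiness

env = dict(os.environ, NOTIFY_SOCKET=NOTIFY_PATH)
worker = subprocess.Popen(["python", "-m", "some_worker_module"], env=env)

while True:
    data, _ = sock.recvfrom(4096)
    if b"READY=1" in data.split(b"\n"):
        break

print("worker reported ready; safe to start the next process")
```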
At that point, I'm kinda led to wonder if supervisord is actually doing much for you - forking processes isn't particularly hard; maybe it would be easier to manage the whole thing in python yourself (as sytest does, only in perl). Yet another alternative suggestion: give your docker image a HEALTHCHECK command. That command could know which workers should be running, and go and poll each of them (via some mechanism to be defined). Have Complement use the "health" of the docker container instead of polling /versions.
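As a sketch of the HEALTHCHECK idea: the image could ship a small script like the one below (worker ports are hypothetical) and point a `HEALTHCHECK ... CMD` instruction in the Dockerfile at it, so the container only reports healthy once every worker answers:

```python
#!/usr/bin/env python3
# Hypothetical container healthcheck: exit 0 only if every expected worker
# answers its /health endpoint; a Dockerfile would run it via something like
#   HEALTHCHECK --interval=2s --timeout=3s --retries=60 CMD python3 /healthcheck.py
import sys
import urllib.request

WORKER_PORTS = [8080, 8081, 8082]  # illustrative; one per configured worker

for port in WORKER_PORTS:
    try:
        with urllib.request.urlopen(f"http://127.0.0.1:{port}/health", timeout=2) as resp:
            if resp.status != 200:
                sys.exit(1)
    except OSError:
        sys.exit(1)

sys.exit(0)
```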
yet a third idea: we already have a thing for managing synapse worker processes: it's called `synctl`. Edit: oh wait, that probably doesn't help you since there's still no way to get supervisord to wait before starting nginx.
I would be happy with this on the Complement side. If no HEALTHCHECK exists, it'll fall back to polling `/_matrix/client/versions`.
I'd shop around a bit for supervisord plugins that could achieve waiting on a process to fully start up before starting the next one. Otherwise, just managing the processes in python should work with some light effort.
It isn't too bad, although we seem to depend on supervisor's "restart until you don't crash" behaviour to get past the "upgrade your database" error that happens. Any pointers to where sytest manages all this state?
Actually, I guess this is the systemd stuff talked about earlier? (Unless there's more to it?)
Sytest waits for a notification on the unix socket mentioned above (the systemd `NOTIFY_SOCKET` mechanism).
I attempted to implement what is proposed above in #10705.
Note that working in these scripts is a bit painful as it seems that not all output actually ends up at stdout/stderr. I think some of this might be related to Python buffering output, but I'm not 100% sure. Anyway -- I'm going to be unable to continue researching this due to other responsibilities.
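In case it is the buffering, a quick way to rule it out is to force unbuffered or line-buffered output, e.g.:

```python
# Either run the scripts with `python -u` / set PYTHONUNBUFFERED=1 in the
# container environment, or force flushing from the code itself:
import sys

sys.stdout.reconfigure(line_buffering=True)  # flush on every newline (Python 3.7+)
print("worker starting", flush=True)         # or flush individual writes
```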
I ran into this and accidentally went down a rabbithole when trying to confirm some complement-perf changes hadn't broken anything, and came up with these PRs that might make healthchecks more reliable for the synapse-workers image: #11429. Not sure if this is actually slower than doing the right thing in terms of ordering startup of the processes in the docker container, and I'm not sure if it's actually making everything more reliable.
I think waiting for all healthchecks to go green, and encouraging consumer homeservers to implement those healthchecks, is a good way forward. This would take care of implementation-specific startup quirks, and while querying /versions is a reasonable wet-finger approach, a healthcheck would be better, in my opinion.
I feel like most of this is done now that #11429 and matrix-org/complement#293 have landed.
The worker-version of Synapse running in Complement (as described here) currently uses Supervisor as an init system to start all worker processes in the container. See the current config template we're using.
This works, and eventually all processes start up. However, Complement checks whether a homeserver is ready to start testing by whether it responds successfully to a `GET /_matrix/client/versions` call. That endpoint may be answered successfully by a worker that has already started while other workers are still starting up. This inconsistency can lead to test failures: Complement calls a different endpoint that should be handled by a different worker, that worker hasn't started yet, nginx returns a 502, and the test fails. The result is flaky Complement tests - which nobody wants.
I believe the solution is to start groups of processes in the container using a priority system: the next group should only be started once the previous group has successfully responded to healthchecks (indicating those processes are ready to receive connections).
(Note that caddy [just used for custom CA stuff] and Postgres are started before even Supervisor is.) By starting nginx at the very end (nginx being the reverse proxy that actually routes matrix requests to the appropriate Synapse process), Complement will not receive a successful response to `/_matrix/client/versions` until everything else has started.

Initially I had hoped to use systemd as an init system to replace Supervisor, but systemd apparently doesn't work in docker containers. Additionally, we need each process to output its logs to stdout, as otherwise Complement won't be able to display the homeserver logs after a test failure; systemd would make this a bit tricky as it tries to capture logs. Currently this works by having Supervisor simply redirect all process logs to stdout, which the `ENTRYPOINT` of the docker container, configure_workers_and_start.py, then relays.

I don't believe we want to use `synctl` here, as the team has been trying to phase that out for a while now. We could simply do all of this via `subprocess` in `configure_workers_and_start.py` (a rough sketch of that approach is at the end of this issue), but I'm hoping there's a better, less manual way. Any ideas? I'd also love to be proven wrong as to whether Supervisor can actually do the above on its own.

@richvdh and @erikjohnston also mentioned that Synapse has a way for a process to signal that it's ready to receive connections (which may be better than just polling the /health endpoint), which may be useful for this discussion. Edit: I've just had a look, and it looks like we use a systemd-specific method called `sdnotify`, which won't be useful here unfortunately.
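For completeness, here is roughly what the `subprocess`-based approach mentioned above could look like inside configure_workers_and_start.py. It is only a sketch: the group definitions, ports and commands are placeholders, and real code would also need to handle the "upgrade your database" error that Supervisor's restart behaviour currently papers over.

```python
import subprocess
import time
import urllib.request

# Placeholder start order: each group is a list of (command, health URL) pairs.
# Real code would generate this from the requested worker types.
PROCESS_GROUPS = [
    [(["python", "-m", "synapse.app.homeserver", "-c", "/conf/homeserver.yaml"],
      "http://127.0.0.1:8080/health")],
    [(["python", "-m", "synapse.app.generic_worker", "-c", "/conf/worker1.yaml"],
      "http://127.0.0.1:8081/health"),
     (["python", "-m", "synapse.app.generic_worker", "-c", "/conf/worker2.yaml"],
      "http://127.0.0.1:8082/health")],
]

def wait_healthy(url: str, timeout: float = 120.0) -> None:
    """Poll a /health endpoint until it answers 200, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return
        except OSError:
            pass
        time.sleep(0.5)
    raise RuntimeError(f"{url} did not become healthy in time")

def main() -> None:
    # Children inherit our stdout/stderr, so their logs reach Complement
    # without any extra relaying.
    for group in PROCESS_GROUPS:
        for cmd, _ in group:
            subprocess.Popen(cmd)
        for _, health_url in group:
            wait_healthy(health_url)
    # Everything is up: start nginx in the foreground, so /versions only
    # starts answering once all workers are ready.
    subprocess.run(["nginx", "-g", "daemon off;"], check=True)

if __name__ == "__main__":
    main()
```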