Ratelimit presence updates #18000
base: develop
I tested this PR on our production server (running v1.120.2) with 340 active users today. Below is a test report; the patch appears to fix #16843 for us. It shows the changes in CPU usage patterns when enabling or disabling the patch during our peak usage times.

Test timeline

```
2024-12-05T11:20:27+0100 systemctl restart nginx.service
2024-12-05T12:15:44+0100 systemctl restart netdata.service
2024-12-05T12:19:04+0100 systemctl restart synapse-phys.target # 1 restart v1.120.2
2024-12-05T12:22:01+0100 mv env env_orig; mv env_patched env
2024-12-05T12:26:56+0100 vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml
2024-12-05T12:30:11+0100 mv env env_patched; mv env_orig env
2024-12-05T12:30:18+0100 systemctl restart synapse-phys.target # 2 restart v1.120.2 patched
2024-12-05T12:30:51+0100 vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml # set per_second: 0.02
2024-12-05T12:45:03+0100 systemctl restart synapse-phys.target # 3 restart v1.120.2
2024-12-05T12:55:56+0100 systemctl restart synapse-phys.target # 4 restart v1.120.2
2024-12-05T12:57:11+0100 mv env env_orig; mv env_patched env
2024-12-05T12:57:17+0100 systemctl restart synapse-phys.target # 5 restart v1.120.2 patched (0.02)
2024-12-05T13:05:00+0100 cd /etc/opt/synapse/
2024-12-05T13:05:06+0100 vim phys.ethz.ch.homeserver.yaml # set per_second: 0.002
2024-12-05T13:05:34+0100 systemctl restart synapse-phys.target # 6 restart v1.120.2 patched (0.002)
2024-12-05T13:52:17+0100 mv env env_patched; mv env_orig env
2024-12-05T13:52:20+0100 systemctl restart synapse-phys.target # 7 restart v1.120.2
2024-12-05T13:58:36+0100 vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml # set per_second: 0.01?
2024-12-05T13:59:22+0100 mv env env_orig; mv env_patched env
2024-12-05T13:59:25+0100 systemctl restart synapse-phys.target # 8 restart v1.120.2 patched (0.01?)
2024-12-05T14:05:43+0100 vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml # set ?
2024-12-05T14:17:19+0100 systemctl restart synapse-phys.target # 9 restart v1.120.2 patched (?)
2024-12-05T14:18:32+0100 vim env/lib/python3.11/site-packages/synapse/rest/client/sync.py # logger.info -> logger.debug (disabling the "User set_presence ratelimit exceeded; ignoring it." log spam, synapse log level was always set to info during these tests)
2024-12-05T14:19:09+0100 vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml # set per_second: 0.02?
2024-12-05T14:20:27+0100 systemctl restart synapse-phys.target # 10 restart v1.120.2 patched (0.02?)
2024-12-05T15:12:21+0100 mv env env_patched; mv env_orig env
2024-12-05T15:12:27+0100 systemctl restart synapse-phys.target # 11 restart v1.120.2
2024-12-05T15:13:41+0100 vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml # unset (use defaults: per_second: 0.1)
2024-12-05T15:18:38+0100 mv env env_orig; mv env_patched env
2024-12-05T15:18:41+0100 systemctl restart synapse-phys.target # 12 restart v1.120.2 patched (0.1)
```

Numbered screenshots of CPU usage patterns

[Screenshots 1-12 of CPU usage during the restarts above; images not reproduced here.]
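For context, the `per_second` values toggled in the timeline above live in a presence ratelimit section of homeserver.yaml. A minimal sketch of the setting, assuming the config key added by this PR is named `rc_presence` with a `per_user` sub-key (the key names and the `burst_count` value are assumptions, not confirmed from the PR text):

```yaml
# Hedged sketch of the homeserver.yaml knob tuned during the tests above.
# The key names (rc_presence, per_user) and burst_count are assumptions
# based on this PR; check the PR's config documentation for the real ones.
rc_presence:
  per_user:
    per_second: 0.02   # at most one presence update per user every 50 seconds
    burst_count: 1
```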
After 2.5 working days the issue (#16843) has not reappeared. The day with the red line was the day of testing #18000 (comment); after the red line the PR has been live on v1.120.2 using the defaults (per_second: 0.1).
The implementation looks good to me! The only thing missing is unit tests (though thank you for the proof-in-production graphs!).
At minimum, could you write two tests (a rough sketch follows at the end of this comment):
- Send a presence update, check that it went through, then immediately send another and check that it was ignored.
- Send a presence update, check that it went through, advance time by a sufficient amount, then send another presence update and check that it also worked.
Presence-related tests go in `PresenceUpdateTestCase`. Here is an example of advancing time in a test; time won't advance otherwise.
I'd recommend switching the presence state (`offline` -> `online`, etc.) so you can check that a change occurred.
Let me know or shout in #synapse-dev:matrix.org if you need further guidance!
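A rough sketch of what those two tests could look like, following Synapse's `HomeserverTestCase` conventions; this is an illustration, not the PR's actual code. The registered servlets, the assumption that a ratelimited update is silently ignored, and the 60-second advance are all guesses:

```python
# Hedged sketch of the two requested tests; not the PR's actual code.
# Assumes a ratelimited presence update is silently ignored (the server
# keeps the previous state) rather than returning an error.
from synapse.rest import admin
from synapse.rest.client import login, presence

from tests import unittest


class PresenceRatelimitSketchTestCase(unittest.HomeserverTestCase):
    servlets = [
        admin.register_servlets,
        login.register_servlets,
        presence.register_servlets,
    ]

    def _set_presence(self, user_id: str, tok: str, state: str):
        return self.make_request(
            "PUT",
            f"/presence/{user_id}/status",
            {"presence": state},
            access_token=tok,
        )

    def _get_presence(self, user_id: str, tok: str) -> str:
        channel = self.make_request(
            "GET", f"/presence/{user_id}/status", access_token=tok
        )
        return channel.json_body["presence"]

    def test_immediate_second_update_is_ignored(self) -> None:
        user_id = self.register_user("alice", "password")
        tok = self.login("alice", "password")

        self.assertEqual(self._set_presence(user_id, tok, "online").code, 200)
        # Immediately send another update and check that it was ignored.
        self._set_presence(user_id, tok, "unavailable")
        self.assertEqual(self._get_presence(user_id, tok), "online")

    def test_update_after_enough_time_works(self) -> None:
        user_id = self.register_user("alice", "password")
        tok = self.login("alice", "password")

        self.assertEqual(self._set_presence(user_id, tok, "online").code, 200)
        # Advance simulated time well past the ratelimit window, then retry.
        self.reactor.advance(60)
        self._set_presence(user_id, tok, "unavailable")
        self.assertEqual(self._get_presence(user_id, tok), "unavailable")
```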
@anoadragon453 I changed the docs as requested and implemented the unit tests. I hope that looks good.
Thanks for continuing to update this! A couple of things below.
@anoadragon453 I implemented the requested changes. I think the unit tests for the …
This is so that we don't get ratelimited while testing the logic of presence in our Complement test suite.
This looks good to me; thanks @rda0!
@rda0 looks like there are failures for the following tests:
presumably as we're now receiving 429s in spots we didn't expect before.
@anoadragon453 I was not sure about these two errors. I don't think it is 429s; the tests time out:

```
$ poetry run trial -j8 tests.rest.client.test_sync.SyncCacheTestCase.test_noop_sync_does_not_tightloop tests.rest.client.test_sync.DeviceListSyncTestCase_0__sync.test_not_receiving_local_device_list_changes
The "poetry.dev-dependencies" section is deprecated and will be removed in a future version. Use "poetry.group.dev.dependencies" instead.
Running 2 tests.
tests.rest.client.test_sync
DeviceListSyncTestCase_0__sync
test_not_receiving_local_device_list_changes ... [ERROR]
SyncCacheTestCase
test_noop_sync_does_not_tightloop ... [ERROR]
===============================================================================
[ERROR]
Traceback (most recent call last):
File "/home/matrix/synapse/tests/rest/client/test_sync.py", line 855, in test_not_receiving_local_device_list_changes
bob_sync_channel.await_result()
File "/home/matrix/synapse/tests/server.py", line 313, in await_result
raise TimedOutException("Timed out waiting for request to finish.")
tests.server.TimedOutException: Timed out waiting for request to finish.
tests.rest.client.test_sync.DeviceListSyncTestCase_0__sync.test_not_receiving_local_device_list_changes
===============================================================================
[ERROR]
Traceback (most recent call last):
File "/home/matrix/synapse/tests/rest/client/test_sync.py", line 685, in test_noop_sync_does_not_tightloop
channel.await_result(timeout_ms=200)
File "/home/matrix/synapse/tests/server.py", line 313, in await_result
raise TimedOutException("Timed out waiting for request to finish.")
tests.server.TimedOutException: Timed out waiting for request to finish.
tests.rest.client.test_sync.SyncCacheTestCase.test_noop_sync_does_not_tightloop
-------------------------------------------------------------------------------
Ran 2 tests in 0.005s
```

When I change `synapse/api/ratelimiting.py` line 321 (at 0a31cf1) to `await self.clock.sleep(0.1)`, both tests report OK. Alternatively, when I change `tests/rest/client/test_sync.py` line 685 (at 0a31cf1) to `channel.await_result(timeout_ms=600)`, the test `test_noop_sync_does_not_tightloop` reports OK.
I am unsure what to do here, or how this is supposed to work. Could you check that? It looks like the sleep in the rate limiter introduced in 52af16c now increases the response time in `synapse/rest/client/sync.py` (line 129 at 0a31cf1) whenever `set_presence` hits the rate limit. I was not aware of that.
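For context on why these show up as timeouts: in Synapse's test harness the reactor's clock is simulated, so a request blocked on the ratelimiter's `clock.sleep(0.5)` only completes once the test advances time. A hedged illustration, where `channel` stands for whatever request the test is awaiting:

```python
# Hedged illustration: with the simulated reactor used by HomeserverTestCase,
# a request stuck in `await self.clock.sleep(0.5)` inside the ratelimiter
# cannot finish until the test explicitly advances time past that sleep.
self.reactor.advance(0.5)  # let the ratelimiter's pause elapse
channel.await_result()     # the request can now complete
```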
I have now patched our production server again with the following patch, to see if it makes a difference for our issue in #18000 (comment).

```diff
diff --git a/synapse/api/ratelimiting.py b/synapse/api/ratelimiting.py
index b80630c5d3..229329a5ae 100644
--- a/synapse/api/ratelimiting.py
+++ b/synapse/api/ratelimiting.py
@@ -275,6 +275,7 @@ class Ratelimiter:
         update: bool = True,
         n_actions: int = 1,
         _time_now_s: Optional[float] = None,
+        pause: Optional[float] = 0.5,
     ) -> None:
         """Checks if an action can be performed. If not, raises a LimitExceededError
@@ -298,6 +299,8 @@ class Ratelimiter:
                 at all.
             _time_now_s: The current time. Optional, defaults to the current time according
                 to self.clock. Only used by tests.
+            pause: Time in seconds to pause when an action is being limited. Defaults to 0.5
+                to stop clients from "tight-looping" on retrying their request.

         Raises:
             LimitExceededError: If an action could not be performed, along with the time in
@@ -316,9 +319,8 @@ class Ratelimiter:
         )

         if not allowed:
-            # We pause for a bit here to stop clients from "tight-looping" on
-            # retrying their request.
-            await self.clock.sleep(0.5)
+            if pause:
+                await self.clock.sleep(pause)

             raise LimitExceededError(
                 limiter_name=self._limiter_name,
diff --git a/synapse/rest/client/sync.py b/synapse/rest/client/sync.py
index b935830bae..e6742e8909 100644
--- a/synapse/rest/client/sync.py
+++ b/synapse/rest/client/sync.py
@@ -247,9 +247,9 @@ class SyncRestServlet(RestServlet):
         # send any outstanding server notices to the user.
         await self._server_notices_sender.on_user_syncing(user.to_string())

-        # ignore the presence update if the ratelimit is exceeded
+        # ignore the presence update if the ratelimit is exceeded but do not pause the request
         try:
-            await self._presence_per_user_limiter.ratelimit(requester)
+            await self._presence_per_user_limiter.ratelimit(requester, pause=0)
         except LimitExceededError:
             affect_presence = False
             logger.debug("User set_presence ratelimit exceeded; ignoring it.")
```
This adds per-user rate limiting for presence updates.
This is my first contribution to Synapse. The default settings probably need guidance from the developers and should be tested on other server configurations. I will post a test report in the comments.