-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Finding a solution to put all user data in KV store #93
Comments
For privacy reasons there is some noise necessary to prevent traffic analysis. Note that the artificial requests to other shards won't be processed by the shards using real logic. the main processing cost still comes from the shards actually doing the lookups and the server running the UDF processing. So the privacy protection overhead is mainly from the network noise and won't be a multiple of the number of shards. There is a flag to deactivate the chaffing mechanism in non-prod environment (sending artificial downstream lookup requests to irrelevant shards). Could you evaluate the cost difference between switching it on/off so we can get closer to a more quantitative understanding? |
If we have let's say 100 shards 99% of the incoming traffic will be overhead. We also have shown here that there is a real compute impact even with a dummy ROMA script (if the KV server executes logic). Is our provided code supposed to check for the right shard, or is there some middleware managed by Chrome that would handle this ? It looks unlikely that this will have no impact. We would welcome some Chrome benchmarks with let's say 100 shards or 1000 shards where one can really see that the expected impact would be negligible.
In the meantime this flag to deactivate sharding should apply to PROD mode also waiting for a proper solution. |
The artificial requests won’t be processed by Roma. The server as it receives the requests can tell that it is not a real request and it will return directly, skipping the intermediate steps.
If by “checking for the right shard” you mean letting the server framework know which shard to communicate with, that is not necessary. The UDF simply invokes a lookup API call, which under the hood decides whether to look up locally or remotely.
Agreed that for the time being it is unlikely that this will have no impact. We expect though that currently it is somewhere in between “no impact” and “multiply the infrastructure cost by the number of shards”. Our approach is to start with a focused version of sharding – a minimum viable product. This allows us to gather valuable feedback from early testers on key aspects like functionality and usability. We recognize that cost and performance might not be ideal initially, but this feedback-driven approach lets us prioritize optimization efforts effectively. By starting with a smaller scope, we can avoid premature optimization and focus on addressing any fundamental issues that arise through real-world usage. |
For our production use cases we need to handle datasets that don’t easily fit into the memory of one machine in a cost efficient way.
We can easily distribute them across many machines because the total dataset is big but individual 1st party user data is relatively small and this would allow horizontal scaling of our user data on many KV instances with a reasonable memory/cpu footprint.
Currently the doc on sharding mentions:
This seems not scalable. If each server receives all requests it would multiply the infrastructure cost by the number of shards.
The easiest seems to officially deactivate this mechanism and define a timeline to work on a viable long term solution.
The long term solution should include on-device and B&A use cases.
The text was updated successfully, but these errors were encountered: