
Finding a solution to put all user data in KV store #93

Open
fhoering opened this issue Dec 12, 2024 · 3 comments

Comments

@fhoering
Contributor

For our production use cases we need a cost-efficient way to handle datasets that don't easily fit into the memory of one machine.

These datasets can easily be distributed across many machines: the total dataset is big, but each individual 1st-party user record is relatively small. This would allow horizontal scaling of our user data across many KV instances with a reasonable memory/CPU footprint.

Currently the doc on sharding mentions:

  • for any given kv server read request, downstream requests are always made to all shards. That is necessary to not reveal extra information about the looked up keys, as AdTechs know which keys live on which shards.
  • for any given kv server read request, when data shards are queried, the payloads of corresponding requests are of the same size, for the same reason.
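To make the quoted mechanism concrete, here is a rough sketch of the fan-out (the names, types, and payload size are made up for illustration and assume, for simplicity, that the looked-up keys live on a single shard; this is not the actual TKV code):

```typescript
// Illustrative sketch only: for every read request the serving instance
// queries every shard, and every downstream payload is padded to the same
// fixed size, so an observer cannot tell which shard holds the real keys.
interface ShardClient {
  lookup(payload: Uint8Array): Promise<Uint8Array>;
}

const DOWNSTREAM_PAYLOAD_SIZE = 4096; // hypothetical fixed payload size

function padToFixedSize(payload: Uint8Array): Uint8Array {
  const padded = new Uint8Array(DOWNSTREAM_PAYLOAD_SIZE);
  padded.set(payload.subarray(0, DOWNSTREAM_PAYLOAD_SIZE));
  return padded;
}

async function fanOutLookup(
  shards: ShardClient[],
  realShardIndex: number,
  realPayload: Uint8Array,
): Promise<Uint8Array> {
  // Real and chaff requests look identical on the wire: same size, sent to all shards.
  const responses = await Promise.all(
    shards.map((shard, i) =>
      shard.lookup(padToFixedSize(i === realShardIndex ? realPayload : new Uint8Array(0)))
    )
  );
  return responses[realShardIndex];
}
```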

This mechanism does not seem scalable. If each shard receives all requests, it would multiply the infrastructure cost by the number of shards.

The easiest option seems to be to officially deactivate this mechanism and define a timeline to work on a viable long-term solution.

The long-term solution should cover both on-device and B&A use cases.

peiwenhu commented Dec 12, 2024

For privacy reasons, some noise is necessary to prevent traffic analysis.

Note that the artificial requests to other shards won't be processed by the shards with real logic. The main processing cost still comes from the shards actually doing the lookups and from the server running the UDF processing. So the privacy protection overhead comes mainly from the network noise and won't be a multiple of the number of shards.

There is a flag to deactivate the chaffing mechanism (sending artificial downstream lookup requests to irrelevant shards) in non-prod environments. Could you evaluate the cost difference between switching it on and off, so we can get closer to a quantitative understanding?

@fhoering
Contributor Author

> Note that the artificial requests to other shards won't be processed by the shards with real logic. The main processing cost still comes from the shards actually doing the lookups and from the server running the UDF processing. So the privacy protection overhead comes mainly from the network noise and won't be a multiple of the number of shards.

If we have, let's say, 100 shards, 99% of the incoming traffic will be overhead. We have also shown here that there is a real compute impact even with a dummy ROMA script (if the KV server executes logic).
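To spell out the arithmetic behind the 99% figure (assuming, for simplicity, that the looked-up keys of a request all live on a single shard):

```typescript
// Back-of-the-envelope: with chaffing, every read fans out to all shards,
// but only one downstream request carries the real lookups.
function chaffFraction(numShards: number): number {
  return (numShards - 1) / numShards;
}

for (const n of [10, 100, 1000]) {
  console.log(`${n} shards -> ${(chaffFraction(n) * 100).toFixed(1)}% of downstream requests are chaff`);
}
// 10 shards -> 90.0% of downstream requests are chaff
// 100 shards -> 99.0% of downstream requests are chaff
// 1000 shards -> 99.9% of downstream requests are chaff
```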

Is our provided code supposed to check for the right shard, or is there some middleware managed by Chrome that would handle this?

It looks unlikely that this will have no impact. We would welcome some Chrome benchmarks with, let's say, 100 or 1000 shards, where one can really see that the expected impact would be negligible.

> There is a flag to deactivate the chaffing mechanism (sending artificial downstream lookup requests to irrelevant shards) in non-prod environments. Could you evaluate the cost difference between switching it on and off, so we can get closer to a quantitative understanding?

In the meantime, this flag to deactivate the chaffing mechanism should also apply to PROD mode while waiting for a proper solution.

@peiwenhu

> If we have, let's say, 100 shards, 99% of the incoming traffic will be overhead. We have also shown here that there is a real compute impact even with a dummy ROMA script (if the KV server executes logic).

The artificial requests won't be processed by Roma. When the server receives such a request, it can tell that it is not a real request and returns directly, skipping the intermediate steps.
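For readers following the thread, here is a rough sketch of that short-circuit (the field and function names are hypothetical and not the actual TKV wire format):

```typescript
// Illustrative only: a shard that recognizes an artificial (chaff) request
// returns a same-size dummy payload immediately, skipping Roma/UDF execution
// and the data lookup path entirely.
interface InternalLookupRequest {
  isChaff: boolean; // hypothetical marker; the real request format may differ
  keys: string[];
}

interface InternalLookupResponse {
  payload: Uint8Array; // padded to a fixed size in both cases
}

const RESPONSE_SIZE = 4096; // hypothetical fixed response size

function padTo(data: Uint8Array, size: number): Uint8Array {
  const out = new Uint8Array(size);
  out.set(data.subarray(0, size));
  return out;
}

function handleDownstreamLookup(
  req: InternalLookupRequest,
  doRealLookup: (keys: string[]) => Uint8Array,
): InternalLookupResponse {
  if (req.isChaff) {
    // No UDF execution, no data access: just return a dummy payload.
    return { payload: new Uint8Array(RESPONSE_SIZE) };
  }
  return { payload: padTo(doRealLookup(req.keys), RESPONSE_SIZE) };
}
```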

> Is our provided code supposed to check for the right shard, or is there some middleware managed by Chrome that would handle this?

If by “checking for the right shard” you mean letting the server framework know which shard to communicate with, that is not necessary. The UDF simply invokes a lookup API call, which under the hood decides whether to look up locally or remotely.
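As a rough sketch of what that looks like from the UDF author's side (the getValues binding name and the response shape below are illustrative; the exact contract is defined by the TKV UDF API docs):

```typescript
// Illustrative UDF sketch: the UDF only asks for keys. Whether a key is
// served from the local shard or fetched from a remote shard is decided by
// the server framework underneath the lookup call, not by the UDF author.
declare function getValues(keys: string[]): {
  kvPairs: Record<string, { value: string }>;
};

function HandleRequest(
  executionMetadata: unknown,
  ...udfArguments: { data: string[] }[]
) {
  const keys = udfArguments.flatMap((group) => group.data);
  const lookupResult = getValues(keys); // local vs. remote lookup resolved under the hood
  return { keyGroupOutputs: [{ keyValues: lookupResult.kvPairs }] };
}
```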

> It looks unlikely that this will have no impact. We would welcome some Chrome benchmarks with, let's say, 100 or 1000 shards, where one can really see that the expected impact would be negligible.
> In the meantime, this flag to deactivate the chaffing mechanism should also apply to PROD mode while waiting for a proper solution.

Agreed that, for the time being, it is unlikely that this will have no impact. We expect, though, that it is currently somewhere between “no impact” and “multiplying the infrastructure cost by the number of shards”.

Our approach is to start with a focused version of sharding – a minimum viable product. This allows us to gather valuable feedback from early testers on key aspects like functionality and usability. We recognize that cost and performance might not be ideal initially, but this feedback-driven approach lets us prioritize optimization efforts effectively. By starting with a smaller scope, we can avoid premature optimization and focus on addressing any fundamental issues that arise through real-world usage.

To ensure we meet privacy requirements on PROD, we can't disable this feature directly. However, a good first step for the migration might be to use the TKV in NONPROD mode with this disabled, effectively acting as a BYOS server.
