
Revisit default TBS storage size limit sampling.tail.storage_limit and storage limit handling #14933

Closed
7 tasks done
Tracked by #14931
carsonip opened this issue Dec 12, 2024 · 15 comments

Comments

@carsonip
Member

carsonip commented Dec 12, 2024

sampling.tail.storage_limit has a default of 3GB, which is usually insufficient for high-throughput use cases. A storage limit that is too low causes TBS to be bypassed and has other downstream effects. Reconsider the default for the 9.0 release.

Tasks

@carsonip
Member Author

I have 2 ideas for the new default:

  • to allow infinite disk usage
  • to detect available disk space (on startup / periodically) and set the limit dynamically

The argument for either of them is that, for example, if you are running a MySQL server you expect it to use storage space, but you don't expect to have to update a "storage limit" every time it is hit.

The additional argument for detection is to avoid making a node unhealthy in any environment.

@carsonip carsonip changed the title Revisit default TBS storage size limit sampling.tail.storage_size Revisit default TBS storage size limit sampling.tail.storage_limit Jan 13, 2025
@simitt
Contributor

simitt commented Jan 15, 2025

@carsonip the solution to detect available disk space would be more elegant. We would need to ensure that it works on all supported operating systems + Docker, and have a sensible fallback if disk size cannot be detected.
Do you have capacity to look into this?

@carsonip
Member Author

Do you have capacity to look into this?

Yes, it is in scope of this task. I'll write down some possible solutions soon.

@carsonip carsonip changed the title Revisit default TBS storage size limit sampling.tail.storage_limit Revisit default TBS storage size limit sampling.tail.storage_limit and storage limit handling Jan 15, 2025
@carsonip
Member Author

carsonip commented Jan 22, 2025

This task depends heavily on the outcome of #15235

Assuming #15235 goes well, the pebble storage usage profile would be very different from badger. It should be much more predictable, and with that we will be able to come up with either

  1. keep storage limit handling, but set a much more reasonable and usable default, and add documentation on how to choose a good storage limit.
  2. (on top of point 1 above) default to disk space detection, but fall back to a constant default. There may be complexity in a cross-platform solution.

@simitt what do you think of the other alternatives below? Are they a no-go?
  3. keep storage limit handling but default to unlimited (and maybe always set it to a number on ESS)
  4. remove storage limit handling altogether. This would simplify #15235.

@simitt
Contributor

simitt commented Jan 23, 2025

  1. keep storage limit handling but default to unlimited (and maybe always set it to a number on ESS)
  2. removing storage limit handling altogether. This will simplify TBS: Replace badger with pebble #15235.

These do not sound like customer-friendly options to me.
Options 1 and 2 both sound reasonable. If option 2 is risky or very complex, I wouldn't see a big problem with starting with option 1 and then potentially switching to option 2 later.

@carsonip
Member Author

Option 1 is the baseline. It should be trivial.
I'm also optimistic about option 2 as we move to pebble in #15246, because there seem to be cross-platform disk usage stats (available, used, total) in pebble's vfs: /~https://github.com/cockroachdb/pebble/blob/master/vfs/disk_usage_linux.go
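
For illustration, a minimal sketch of reading those stats through pebble's vfs package (assuming the GetDiskUsage API on vfs.Default; the storage path below is a placeholder, not apm-server's actual storage directory):

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	// Query disk stats for the filesystem that holds the TBS storage directory.
	// The path is a placeholder for the configured storage directory.
	du, err := vfs.Default.GetDiskUsage("/var/lib/apm-server/tail_sampling")
	if err != nil {
		log.Fatalf("failed to detect disk usage: %v", err)
	}
	fmt.Printf("total=%d bytes, used=%d bytes, available=%d bytes\n",
		du.TotalBytes, du.UsedBytes, du.AvailBytes)
}
```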

@carsonip
Member Author

carsonip commented Jan 29, 2025

I've created a PR to set storage_limit to 70% of (available disk size + existing pebble db size): #15450

However, @simitt raised some questions about the implementation details, which gave me pause and prompted me to zoom out and think about the problem again.

Why is there a storage limit in the first place? As a user I wouldn't want TBS to fill my disk and bring down a node unexpectedly. It is less about TBS using 5GB or 6GB. As long as a TBS-enabled apm-server doesn't fill up the disk, and there are no other services competing for disk space (e.g. ES or MySQL on the same node), the user shouldn't care. On the other hand, the current 3GB limit is definitely too low for production workloads, and users need to fiddle with the config to find an appropriate value.

If we are going to make a change to the default in 9.0, we want the most user-friendly default. The current 3GB limit isn't friendly. Neither is filling the disk and bringing down the node. That's why we want an option that is good enough for most use cases by default.

One attempt (like what I did in #15450) is to set the current storage limit config based on total / available disk space.

  • Auto-adjustment based on total disk space
    • The issue with this is that it disregards what's available. If the disk is already 50% full, setting the storage limit to 70% of total disk space means over-committing and risks filling up the disk.
  • Auto-adjustment based on available disk space
    • There's a caveat: as apm-server TBS storage consumes more space, available disk space shrinks. Say apm-server auto-adjusts to use 60% of available disk space and already uses 70GB of a 100GB disk; the storage limit becomes 30GB * 60% = 18GB, which is lower than what it already uses. It also means we'd have to factor in the existing db size.
    • And what if there's another service on the node that uses disk? It would make apm-server overestimate the available disk space, again risking filling the disk.

Since apm-server TBS uses an embedded database, it effectively is a database, so let's look at how databases handle this. For example, Elasticsearch has the cluster.routing.allocation.disk.watermark.high setting, which changes behavior based on how full the disk is, not on how much storage is "allocated" to the database.

Therefore, I propose we add a disk_threshold (for lack of a better name; accepting suggestions) that accepts a float, e.g. 0.9 means that when the disk reaches 90% usage, TBS returns "storage limit reached". This means we check disk_used < 0.9 * disk_total instead of db_size < 0.9 * disk_total, i.e. apm-server never writes to the last 10% of the disk. In 9.0 we switch the default to a 0.9 disk_threshold, while keeping the existing storage_limit config.
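
A rough sketch of what the proposed check could look like (the helper and parameter names are illustrative, not the actual implementation, and it assumes pebble's vfs.GetDiskUsage is available):

```go
package sampling

import "github.com/cockroachdb/pebble/vfs"

// diskThresholdExceeded reports whether TBS should return "storage limit
// reached" under the proposed semantics: block writes once overall disk
// usage crosses threshold * total, e.g. threshold=0.9 keeps the last 10%
// of the disk untouched. Note the check is disk_used vs disk_total, not
// db_size vs disk_total.
func diskThresholdExceeded(fs vfs.FS, storageDir string, threshold float64) (bool, error) {
	du, err := fs.GetDiskUsage(storageDir)
	if err != nil {
		// If disk usage cannot be detected, don't block here; a fallback
		// (e.g. the existing storage_limit) could still apply.
		return false, err
	}
	return float64(du.UsedBytes) >= threshold*float64(du.TotalBytes), nil
}
```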

@simitt @axw WDYT?

@axw
Member

axw commented Jan 30, 2025

@carsonip I like it. SGTM.

@lahsivjar
Contributor

@carsonip Having a storage threshold based on disk space sounds good to me; however, I am not sure whether this needs to be a dedicated config, as it sounds more like a protection mechanism against using too much disk than a limit. Are there cases where a user would set the limit to 10%? What factors will help decide this number?

If the answers to the above questions are not specific, then I think it would be best to have this (90% of the total disk) as an internal limit without any exposed configuration. By default, we can leave the existing storage_limit unset (0), meaning only the internal 90%-of-total-disk limit would apply. If the customer/user wants to limit the disk space, they can set the storage_limit value explicitly and that would take effect. We can decide whether we want both options to co-exist or prefer one over the other.

Additionally, I am not sure how beneficial the configuration would be for ESS. It would definitely allow more disk space for the APM Server deployments, but could open a new can of worms, for example having 2 ESS instances on the same node with one taking 90% of the disk space and the other left with only 10% to work with. I think for ESS we need dedicated disks sized based on the storage_limit configured by the user.

@carsonip
Member Author

I am not sure whether this needs to be a dedicated config, as it sounds more like a protection mechanism against using too much disk than a limit. Are there cases where a user would set the limit to 10%? What factors will help decide this number?

I agree that users may not find the config too useful, unless they want to aggressively increase the disk utilization to e.g. 95%, or 100%.

If the answers to the above questions are not specific, then I think it would be best to have this (90% of the total disk) as an internal limit without any exposed configuration. By default, we can leave the existing storage_limit unset (0), meaning only the internal 90%-of-total-disk limit would apply. If the customer/user wants to limit the disk space, they can set the storage_limit value explicitly and that would take effect. We can decide whether we want both options to co-exist or prefer one over the other.

Personally, it sounds fine to keep it non-configurable, as long as the default behavior is unlimited db size + not writing to the last 10% of the disk. We can make it configurable later.

On the point of the user specifying the old storage_limit of the db, I don't have a strong opinion on whether we enforce both limiting db size + not writing to the last 10% of the disk, or just the former. I think enforcing both is slightly more predictable, since the 2 mechanisms are related but not conflicting (one limits the db size, the other limits total disk utilization); see the sketch below.
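
To make the "enforce both" idea concrete, here is a sketch of how the two checks might combine (illustrative only; the function signature, dbSize, and the treatment of a zero storage_limit as "unset" are assumptions, not the final implementation):

```go
package sampling

import "github.com/cockroachdb/pebble/vfs"

// writeBlocked combines the two related but non-conflicting mechanisms:
// the existing storage_limit caps the db size (only when explicitly set,
// i.e. non-zero), while the disk threshold caps total disk utilization.
func writeBlocked(fs vfs.FS, storageDir string, dbSize, storageLimit uint64, diskThreshold float64) bool {
	if storageLimit > 0 && dbSize >= storageLimit {
		return true // explicit db size limit reached
	}
	du, err := fs.GetDiskUsage(storageDir)
	if err != nil {
		return false // disk usage unknown; rely on storage_limit alone
	}
	return float64(du.UsedBytes) >= diskThreshold*float64(du.TotalBytes)
}
```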

I am not sure how beneficial the configuration would be for ESS

As discussed, this is not a new problem. With or without this new mechanism, a neighbor can be noisy if disk limits are not enforced properly. We should involve cloud folks to see how disk limits are enforced on ESS.

@carsonip
Member Author

Created #15467 for the implementation in apm-server and elastic/integrations#12543 for the integration package default change. We'll also need an observability docs change.

@carsonip
Member Author

With a new default that limits based only on disk usage, not db size, we can avoid developing workarounds like /~https://github.com/elastic/cloud/issues/104788

@raultorrecilla

moving it from it-106 to it-107

@raultorrecilla

moving it from it-107 to it-108

@carsonip
Member Author

Closing as all tasks have been completed
