Revisit default TBS storage size limit `sampling.tail.storage_limit` and storage limit handling #14933
Comments
I have 2 ideas for the new default. The argument for either of them is that, e.g., if you are running a MySQL server, you expect it to use storage space, but you don't expect to have to update a "storage limit" every time it gets hit. The argument for a detection-based approach is to avoid causing a node to become unhealthy in any environment. (The ideas revolve around `sampling.tail.storage_size` and `sampling.tail.storage_limit`.)
@carsonip the solution to detect available disk space would be more elegant. We would need to ensure that it works on all supported operating systems + Docker, and have a sensible fallback if the disk size cannot be detected.
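A minimal sketch of such detection with a fallback, assuming a Go helper and a purely illustrative DB path (this is not apm-server code): `unix.Statfs` covers Linux and macOS, and anything else (or an error) simply keeps the existing 3GB default.

```go
// Minimal sketch, not apm-server code: detect the size of the filesystem
// backing the TBS database directory, with a fallback when detection fails.
// unix.Statfs works on Linux and macOS; on anything else (or on error) we
// keep the existing 3GB default. The path below is purely illustrative.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

const fallbackLimitBytes uint64 = 3 << 30 // existing 3GB default as the fallback

// detectDiskSize returns the total size in bytes of the filesystem containing path.
func detectDiskSize(path string) (uint64, error) {
	var fs unix.Statfs_t
	if err := unix.Statfs(path, &fs); err != nil {
		return 0, err
	}
	return fs.Blocks * uint64(fs.Bsize), nil
}

func main() {
	basis := fallbackLimitBytes
	if total, err := detectDiskSize("/var/lib/apm-server/tbs"); err == nil {
		basis = total
	}
	fmt.Printf("limit basis: %d bytes\n", basis)
}
```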
Yes, it is in the scope of this task. I'll write down some possible solutions soon.
This task highly depends on the outcome of #15235. Assuming #15235 goes well, the pebble storage usage profile will be very different from badger's. It should be much more predictable, and with that we will be able to come up with either
@simitt what do you think of the other alternatives? Are they a no-go?
These options do not sound like customer-friendly options to me.
Option 1 is the baseline. It should be trivial.
I've created a PR to set storage_limit to 70% of (available disk size + existing pebble db size): #15450. However, as @simitt raised some questions about the implementation details, it gave me pause to zoom out and think about the problem again.

Why is there a storage limit in the first place? As a user, I wouldn't want TBS to fill my disk and bring down a node unexpectedly. It is less about TBS using 5GB or 6GB. As long as a TBS-enabled apm-server doesn't fill up the disk, and there are no other services competing for disk space (e.g. ES or MySQL on the same node), the user shouldn't care. On the other hand, the current 3GB limit is definitely too low for production workloads, and users need to fiddle with the config to find an appropriate value.

If we are going to change the default in 9.0, we want the most user-friendly default. The current 3GB limit isn't friendly. Neither is filling the disk and bringing down the node. That's why we want an option that is good enough for most use cases by default. One attempt (like what I did in #15450) is to set the current storage limit config based on total / available disk space.

As apm-server TBS uses an embedded database, it is a database, so it is worth looking at how other databases handle this. Elasticsearch, for example, has disk watermarks that stop writes once disk usage crosses a threshold. Therefore, I propose we add a similar disk usage threshold, so TBS stops accepting writes when the disk gets close to full.
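For illustration, a hedged sketch of the #15450 approach described above (the 70% factor and the inclusion of the existing DB size come from the comment; the helper names and everything else are assumptions, not the actual PR code):

```go
// Hedged sketch of the idea behind #15450, not the actual PR code: compute a
// storage limit as 70% of (currently available disk space + existing pebble
// DB size), so a pre-existing DB does not shrink the computed budget.
// Helper names are assumptions made for this illustration.
package sketch

import (
	"io/fs"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// dirSize sums the sizes of regular files under dir (the existing DB size).
func dirSize(dir string) (uint64, error) {
	var total uint64
	err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			info, err := d.Info()
			if err != nil {
				return err
			}
			total += uint64(info.Size())
		}
		return nil
	})
	return total, err
}

// deriveStorageLimit returns 0.7 * (available disk bytes + existing DB bytes).
func deriveStorageLimit(dbDir string) (uint64, error) {
	var fsStat unix.Statfs_t
	if err := unix.Statfs(dbDir, &fsStat); err != nil {
		return 0, err
	}
	avail := fsStat.Bavail * uint64(fsStat.Bsize)
	dbSize, err := dirSize(dbDir)
	if err != nil {
		return 0, err
	}
	return uint64(0.7 * float64(avail+dbSize)), nil
}
```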
@carsonip I like it. SGTM.
@carsonip Having a storage threshold based on disk space sounds good to me; however, I am not sure if this needs to be a dedicated config, as this sounds more like a protection mechanism against using too much disk than a limit. Are there cases where users will set the limit to 10%? What factors will help decide this number? If the answers to the above questions are not specific, then I think it would be best to have this (90% of the total disk) as an internal limit without any exposed configuration. By default, we can set the existing `storage_limit` accordingly.

Additionally, I am not sure how beneficial the configuration would be for ESS. It would definitely allow more disk space for the APM Server deployments, but could open a new can of worms; for example, having 2 ESS instances on the same node with one taking 90% of the disk space and the other left with only 10% to work with. I think for ESS we need dedicated disks based on the storage_limit configured by the user.
I agree that users may not find the config too useful, unless they want to aggressively increase the disk utilization to e.g. 95% or 100%.

Personally, it sounds fine to keep it non-configurable, as long as the default behavior is an unlimited db size + not writing to the last 10% of the disk. We can make it configurable later. On the point of users specifying the old storage_limit of the db, I don't have a strong opinion on whether we enforce both limiting the db size + not writing to the last 10% of the disk, or just the former. I think enforcing both is slightly more predictable, since the 2 mechanisms are related but not conflicting (one limits the db size, the other limits the total disk utilization); see the sketch below.

As discussed, this is not a new problem. With or without this new mechanism, a neighbor can be noisy if disk limits are not enforced properly. We should involve the Cloud folks to see how disk limits are enforced on ESS.
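A minimal sketch of the combined gate, with assumed function and field names (not the apm-server API): a write is rejected when a user-configured DB size limit would be exceeded, or when it would push disk usage of the filesystem holding the DB above 90%.

```go
// Minimal sketch of the combined gate, with assumed names (not the apm-server
// API): reject a write if a user-configured DB size limit would be exceeded,
// or if it would push disk usage of the filesystem holding the DB above 90%.
// storageLimit == 0 means "no DB size limit", matching the proposed default.
package sketch

import "golang.org/x/sys/unix"

const diskUsageThreshold = 0.90 // never write into the last 10% of the disk

// writeAllowed reports whether a write of writeBytes should be accepted.
func writeAllowed(dbDir string, dbSize, writeBytes, storageLimit uint64) (bool, error) {
	// Optional explicit DB size limit (the old storage_limit semantics).
	if storageLimit > 0 && dbSize+writeBytes > storageLimit {
		return false, nil
	}
	// Disk usage threshold: how full would the filesystem be after this write?
	var fsStat unix.Statfs_t
	if err := unix.Statfs(dbDir, &fsStat); err != nil {
		return false, err
	}
	total := fsStat.Blocks * uint64(fsStat.Bsize)
	avail := fsStat.Bavail * uint64(fsStat.Bsize)
	usedAfter := float64(total-avail+writeBytes) / float64(total)
	return usedAfter <= diskUsageThreshold, nil
}
```

Under a scheme like this, setting the DB size limit to 0 still leaves the disk usage guard in place, which is the default behavior #15467 proposes.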
Created #15467 for the implementation in apm-server and elastic/integrations#12543 for the integration package default change. We'll also need an observability docs change.
With a new default that only limits based on disk usage, not db size, we can avoid developing workarounds like /~https://github.com/elastic/cloud/issues/104788
moving it from it-106 to it-107 |
moving it from it-107 to it-108
Closing as all tasks have been completed |
`sampling.tail.storage_limit` has a default of 3GB, which is usually insufficient for high-throughput use cases. A storage limit that is too low will cause TBS to be bypassed, with other downstream effects. Reconsider the default in the 9.0 release.

Tasks

- Default `sampling.tail.storage_limit` to 0 but limit disk usage to 90% #15467