-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(webserver): implement metrics endpoint #3140
feat(webserver): implement metrics endpoint #3140
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 nice thanks for getting this up so quickly!
userspace/falco/webserver.cpp
Outdated
metrics_collector.snapshot(); | ||
auto metrics_snapshot = metrics_collector.get_metrics(); | ||
|
||
for (auto& metric: metrics_snapshot) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also add all the wrapper metrics already in, see for example #3131 (comment) (see the libs unit tests) or the website https://falco.org/docs/metrics/falco-metrics/ or run Falco w/ the output rule metrics.
Or would you just want to do that in another PR? Also ok for me.
Some more notes:
For Prometheus we do not need the _prev
metrics values and even the drop percentage is not needed as it can be calculated via Prometheus queries, hence not a 1:1 translation in terms of the wrapper or additional metrics fields.
In addition, we distinguish between falco.
and scap.
prefixes -- probably best to preserve that and as namespace I personally would prefer falcosecurity
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a code organization point of view we can also iterate as it would be nice to not add metrics logic into the webserver file itself and rather maybe create a more formal falco_metrics_collector
class or something else.
Or maybe you planned that already in a follow up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional note I forgot: It appears we need to move the code logic to the stats_writer::collector::collector
so that the metrics get refreshed and reposted to /metrics
every x interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, we distinguish between falco. and scap. prefixes -- probably best to preserve that and as namespace I personally would prefer falcosecurity.
Agree with falcosecurity
.
From what i see, we are refreshing the stats whenever the /metrics
endpoint is called, instead of passing through the stats_writer::collector::collector
.
This ensures that we follow Prometheus settings (ie: if one setups prometheus to scrape every 2s, we honor that even if metrics.interval
is higher).
Don't know if this is the behavior we want though.
Otherwise, we might still follow metrics.interval
and then when prometheus tries to scrape more often, we will offer stale data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see the dilemma. At the same time I also don't see a problem with just recommending to having the same Falco metrics interval as the Prometheus scrape interval? How is this solved in other projects? At the end of the day metrics time series are always discrete ... I would also be fine to refresh the metrics in the Prometheus case whenever /metrics
was called as @FedeDP suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that comes to mind, doesn't the stats_writer class contain the required metrics generation logic ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metrics wrapper fields (the ones not part of the libs_metrics_collector
) are easy to duplicate if needed, or we create some sort of more formal class falco-metrics_collector
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the same time I also don't see a problem with just recommending to having the same Falco metrics interval as the Prometheus scrape interval?
Yep, but:
/metrics
endpoint is the only "interactive" metrics generator we have, sinceoutput_rule
andoutput_file
are passive. Therefore, it makes sense that is up to the caller to decide the frequency.- also, consider the case where prometheus is setup to scrape every 5seconds; consider also that
output_file
is enabled too. This has a cost in terms of IO perf if user has to set 5smetrics.interval
in the config file
At the same time, i quite don't like having a metric that does not obey to metrics.interval
though...
I would love to gather more feedback! cc @falcosecurity/falco-maintainers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we need to just make everything configurable, for example, we could add sub metrics configs similar to below:
metrics:
enabled: false
interval: 1h
# Typically, in production, you only use `output_rule` or `output_file`, but not both.
# However, if you have a very unique use case, you can use both together.
output_rule: true
# output_file: /tmp/falco_stats.jsonl
output_prometheus:
# Will only be enabled when `metrics.enabled` is set to true
enabled: false
# Take a new metrics snapshot automatically after each Prometheus server scrape
auto_refresh_enabled: false
# Alternative to `auto_refresh_enabled` refreshes /metrics with new metrics snapshot every x interval; supersedes and is separate from the global `metrics.interval` setting
refresh_interval: 5m
resource_utilization_enabled: true
state_counters_enabled: true
kernel_event_counters_enabled: true
libbpf_stats_enabled: true
convert_memory_to_mb: true
include_empty_values: false
/milestone 0.38.0 |
885cefa
to
d3775ae
Compare
falco.yaml
Outdated
@@ -982,6 +982,7 @@ metrics: | |||
libbpf_stats_enabled: true | |||
convert_memory_to_mb: true | |||
include_empty_values: false | |||
prometheus_enabled: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this is indeed exposed by the webserver, shouldn't it live under webserver:
? perhaps we might want to specify that it is useful only when metrics
are enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point and agreed
One related question: should the endpoint name also be configurable ? It usually is /metrics
but...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd leave /metrics
for now. We can always make int configurable in the future if there is demand for it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now we only have the heathz endpoint in our configuration 👇
Lines 662 to 699 in 12cd72a
# [Stable] `webserver` | |
# | |
# Falco supports an embedded webserver that runs within the Falco process, | |
# providing a lightweight and efficient way to expose web-based functionalities | |
# without the need for an external web server. The following endpoints are | |
# exposed: | |
# - /healthz: designed to be used for checking the health and availability of | |
# the Falco application (the name of the endpoint is configurable). | |
# - /versions: responds with a JSON object containing the version numbers of the | |
# internal Falco components (similar output as `falco --version -o | |
# json_output=true`). | |
# | |
# Please note that the /versions endpoint is particularly useful for other Falco | |
# services, such as `falcoctl`, to retrieve information about a running Falco | |
# instance. If you plan to use `falcoctl` locally or with Kubernetes, make sure | |
# the Falco webserver is enabled. | |
# | |
# The behavior of the webserver can be controlled with the following options, | |
# which are enabled by default: | |
# | |
# The `ssl_certificate` option specifies a combined SSL certificate and | |
# corresponding key that are contained in a single file. You can generate a | |
# key/cert as follows: | |
# | |
# $ openssl req -newkey rsa:2048 -nodes -keyout key.pem -x509 -days 365 -out | |
# certificate.pem $ cat certificate.pem key.pem > falco.pem $ sudo cp falco.pem | |
# /etc/falco/falco.pem | |
webserver: | |
enabled: true | |
# When the `threadiness` value is set to 0, Falco will automatically determine | |
# the appropriate number of threads based on the number of online cores in the system. | |
threadiness: 0 | |
listen_port: 8765 | |
# Can be an IPV4 or IPV6 address, defaults to IPV4 | |
listen_address: 0.0.0.0 | |
k8s_healthz_endpoint: /healthz | |
ssl_enabled: false | |
ssl_certificate: /etc/falco/falco.pem |
I believe we have to reaudit our configuration before making a decision, so I agree to leave the configuration part out of this PR. We can add it later, eventually.
37b8085
to
667b84b
Compare
There's something that is escaping me with regard to the |
Samuel, pulled your branch and wanted to test drive it, but it doesn't compile. Mind fixing the build? Thank you! I'll review the changes then. In addition, we still need to agree on the design wrt refreshing the metrics, @leogr to compile a final proposal based on suggestions we have so far. Thanks. |
667b84b
to
5374769
Compare
This endpoint currently returns only prometheus metrics. Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>
… documentation The cross reference makes it easier to pair the web server and the metrics configuration elements. Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>
…otonic Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>
4e38520
to
815af40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
Let's defer further changes to a follow up PR. Thanks @sgaist!
LGTM label has been added. Git tree hash: 7266865d0b743e48936d17fda8c04dabcbaf4973
|
Now just need to check CI fixes. |
The textual content was fixed but not the metrics name. Co-authored-by: Melissa Kilby <melissa.kilby.oss@gmail.com> Signed-off-by: Samuel Gaist <samuel.gaist@idiap.ch>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
LGTM label has been added. Git tree hash: b0bf6a5ca9cc69fc00fd9bae353af6d4009cb5fd
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM too.
I am waiting for someone of the previous reviewer to take another look before giving the final approval
/assign |
@@ -695,6 +695,9 @@ webserver: | |||
# Can be an IPV4 or IPV6 address, defaults to IPV4 | |||
listen_address: 0.0.0.0 | |||
k8s_healthz_endpoint: /healthz | |||
# Enable the metrics endpoint providing Prometheus values | |||
# It will only have an effect if metrics.enabled is set to true as well. | |||
prometheus_metrics_enabled: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leogr do we need to enforce a feat status for this? Like Incubating
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be preferable, IMO. It is not a blocker for this PR anyway. We may reaudit all options later, but still before the 0.38 release. Does it make sense?
cc @falcosecurity/falco-maintainers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leogr proposing to merge it as is and address the follow up items in a new PR. I believe we need another touch up PR anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: FedeDP, incertum, sgaist The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thank you very much Samuel! You did a terrific job! |
👏 |
What type of PR is this?
/kind feature
Any specific area of the project related to this PR?
/area engine
What this PR does / why we need it:
This PR implements the
/metrics
endpoint in the webserver providing currently prometheus metrics.Which issue(s) this PR fixes:
Fixes #3131
Fixes #1772
Special notes for your reviewer:
Does this PR introduce a user-facing change?: