Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TRACKING] metrics framework future refactors / cleanups #3194

Open
Tracked by #3148
incertum opened this issue May 14, 2024 · 18 comments
Open
Tracked by #3148

[TRACKING] metrics framework future refactors / cleanups #3194

incertum opened this issue May 14, 2024 · 18 comments
Assignees
Milestone

Comments

@incertum
Copy link
Contributor

incertum commented May 14, 2024

Motivation

Tracking pending cleanups, refactors or additional features for the Falco internal metrics framework https://falco.org/docs/metrics/

Tasks

Preview Give feedback
No tasks being tracked yet.
@incertum incertum added this to the 0.39.0 milestone May 14, 2024
@incertum
Copy link
Contributor Author

CC @FedeDP @sgaist @leogr

@incertum
Copy link
Contributor Author

incertum commented May 14, 2024

Great discussions happening in #3140, few follow up items

  • Falco's wrapper metrics num_evts is still missing in the Prometheus output since it requires greater code refactors.
  • Falco can run with a single event source (either syscalls or one plugin source) or with multiple event sources. Initially, the goal was to have metrics work with either syscalls or a plugin source. However, challenges arise when dealing with two or more event sources. For example, should the metrics display the number of events for the syscalls source or the plugin source? What should be the value of evt.source? Since we can only provide one view of the metrics at a time, instead of adding nested fields or implementing another solution, Falco metrics should focus on the syscalls or the primary plugin source only. When running Falco with syscalls and one plugin, the new plugin metrics API should be used to retrieve plugin metrics in addition to the syscalls metrics.
  • libs now includes a new metrics collector class that consolidates metrics across the libs codebase. Falco needs a similar solution. In feat(webserver): implement metrics endpoint #3140, @sgaist referred to it as a "proper Falco metrics model," especially since we now have more output channels for metrics (e.g., Prometheus, web server, output rule, output file). The goal is to simplify the codebase and reduce code duplication (e.g. see code duplication and fragmentation in falco_metrics.cpp, stats_writer.cpp, stats_manager.cpp).

@leogr
Copy link
Member

leogr commented May 16, 2024

  • The metrics framework should target the primary event source only, as the metrics snapshots can realistically only expose one current view, especially for Prometheus. Plugin metrics should instead be supported via the new plugin metrics support; see new(plugin_api): add plugin metrics support libs#1828
  • A consolidated and proper Falco metrics model is needed given that we now have even more outputs channels for the metrics (e.g. Prometheus)

Hey @incertum could you elaborate more on these two points?

@incertum
Copy link
Contributor Author

@leogr I rewrote the text #3194 (comment), is it more clear? happy to add more details.

@leogr
Copy link
Member

leogr commented May 17, 2024

Much clearer now, thank you!

Just one thought:

  • Since we can only provide one view of the metrics at a time

Why? I guess this is a current limitation, but we can fix it in the future. Am I wrong?
I believe that in the long run, all data sources should be first-citizen, and it shouldn't be technically impossible to accommodate this.

@incertum
Copy link
Contributor Author

Much clearer now, thank you!

Just one thought:

  • Since we can only provide one view of the metrics at a time

Why? I guess this is a current limitation, but we can fix it in the future. Am I wrong? I believe that in the long run, all data sources should be first-citizen, and it shouldn't be technically impossible to accommodate this.

We can emit multiple rules outputs or lines into the output file ( I would not do it though), but for Prometheus there is just one endpoint to scrape at a time ... IMO there should be more separate plugin specific metrics handling, something that was started in libs. Most metrics are syscalls source specific or generic (e.g. CPU and memory usages or rules counters) anyways. In a way right now I can only think of number of events as useful to be plugin / source specific in case you have multiple sources.

@incertum
Copy link
Contributor Author

CC @sboschman (metrics for Falco w/ plugin only)

@sboschman
Copy link
Contributor

sboschman commented Jun 3, 2024

From an operational point of view I like to have the falco metrics easily integrated with our metrics platform. So, I would like to thank everyone involved with exposing the falco metrics in a Prometheus compatible way.

I am not familiar with the falco code at all, so consider the following comments more as an outside view of things, not in any way directly mapping to any part of the code.

Falco metrics:

  1. General metrics; unrelated to syscall or any plugin
    • falco version
    • start_timestamp
    • ...
  2. Process resource utilization metrics; unrelated to syscall or any plugin, preferably in standard naming conform the default C library for prometheus
    • num_cpus
    • cpu_seconds_total
    • memory_bytes_total
    • memory_used_bytes
    • ...
  3. Falco rule engine metrics; syscall/plugin event source can be a labeled dimension of the time serie
    • events_processed_total{event_source="syscall"} or events_processed_total{event_source="k8saudit"}
    • rule_matches_total{event_source="syscall"} or rulle_matches_total{event_source="k8saudit"}
    • ...
  4. syscall specific metrics
    • ...
  5. custom plugin metrics; provided with the plugin API / SDK, implemented by the plugin
    • cloudtrail_xxx
    • github_repositories
    • gcpaudit_pubsub_errors_total
    • okta_xxx
    • ...

Notes:

  • Process metrics (2) are not Falco specific, any application/process should be able to provide these metrics in a standard way. If you are familiar with these standard metrics, you can easily apply your existing knowledge to any application/process. E.g. for Golang and Java we even have default dashboards for this base set of metrics.
  • Falco rule engine metrics (3) are dimensioned by event source. An overall total can easily be calculated by the metrics platform, e.g. with PromQL sum without(event_source) (events_processed_total{}) and has not to be explicitly exposed by Falco
  • I realise syscall is the original falco event source and the plugin framework, and support for other event sources, has been implemented later. As of 0.35 plugins can also output syscall events, so to me 'falco drivers' and 'plugins' are just a way to provide event input to the falco rule engine and syscall is just one of the event sources. Hence the metrics being split into items 3, 4 and 5.
  • Different plugins can provide the same event source, e.g. the k8saudit, k8saudit_eks and k8saudit_gke plugin all provide the k8s_audit event source. So (5) are plugin specific metrics, not event source specific metrics.

@incertum
Copy link
Contributor Author

incertum commented Jun 3, 2024

Few more thoughts:

@leogr
Copy link
Member

leogr commented Aug 28, 2024

Any update on this? 🤔

@leogr
Copy link
Member

leogr commented Aug 28, 2024

/assign @incertum

@incertum
Copy link
Contributor Author

Updated checkboxes in
#3194 (comment)

Still pending a fix / no open PR yet

  • Falco's wrapper metrics num_evts is still missing in the Prometheus output since it requires greater code refactors.
  • re the evt.source reporting and plugin metrics. Perhaps now that @mrgian did this tremendously awesome refactor of allowing custom plugin metrics, I believe plugin metrics should be exposed over plugins. Also coming back to this now, isn't num_evts the sum of syscalls and plugin events anyways? Maybe evt.source in the metrics should be the primary event source?
  • @sgaist suggested some more cleanup and consolidation of falco_metrics and stats_writer duplicative code pieces.

In flight:

  • multi-platform metrics support -> in flight @mrgian is on it, thanks again!

Lastly I do agree that there are many posts here. I may need to try to re-read everything again.

@leogr
Copy link
Member

leogr commented Sep 2, 2024

Question: do we expect to introduce any breaking change in the metrics system?
I guess no more. Still, I wanted to check.

@incertum
Copy link
Contributor Author

incertum commented Sep 2, 2024

Compared to Falco 0.38.2 no breaking changes that I am aware of.

@FedeDP
Copy link
Contributor

FedeDP commented Sep 9, 2024

Indeed more breaking changes will happen :D See #3309.

@FedeDP
Copy link
Contributor

FedeDP commented Sep 16, 2024

I will move to 0.40.0 since we weren't able to finish everything in time for the 0.39 release.
/milestone 0.40.0

@poiana poiana modified the milestones: 0.39.0, 0.40.0 Sep 16, 2024
@poiana
Copy link
Contributor

poiana commented Dec 15, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via /~https://github.com/falcosecurity/community.

/lifecycle stale

@leogr
Copy link
Member

leogr commented Dec 16, 2024

I guess we are in good shape here. Any updates?
/remove-lifecycle stale

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants