Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle network stats parsing gracefully #8212

Closed
wants to merge 9 commits into from

Conversation

mmynk
Copy link
Contributor

@mmynk mmynk commented Nov 6, 2023

Goal: Error in recording one of the network stats should not result in completely empty network stats.

A process died and recording snpm6 stats failed for that process which resulted in none of network stats being recorded. Error:

Sep 26 21:11:49 ip-10-66-6-18 below[21496]: Sep 27 04:11:49.866 ERRO Os { code: 2, kind: NotFound, message: "No such file or directory" }: "/proc/21496/net/snmp6"

Error in recording one of the network stats should not result in empty network stats
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 6, 2023
@facebook-github-bot
Copy link
Contributor

@brianc118 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@danobi danobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth logging the errors still or will it be too noisy?

@mmynk
Copy link
Contributor Author

mmynk commented Nov 6, 2023

@dxu i think it should be worth adding since we're only logging the errors, however it will be logged at every sample collection instance

Copy link
Contributor

@danobi danobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking closer, I think the root cause is misunderstood here.

procfs crate tries to read from /proc/net. That's a symlink to /proc/self/net. Which expands to /proc/1234/net, where 1234 is the current process pid. It's not possible for the current running process to not be running (given we don't use much threading in below). So in other words, it's not that the process died, it's that the snmp6 file doesn't exist.

That tracks cuz we disable ipv6 on the kernel cmdline in our product (a problem for a different time). So I think this PR should handle when ipv6 is disabled. Collecting these stats should really be infallible.

Also another thing I noticed, I'm not sure we should be logging from this crate. If anything, we should pass a logging facade into the crate so the caller can specify the logging configuration. Cuz I think the intention of the procfs crate is so that other people could consume it.

@facebook-github-bot
Copy link
Contributor

@brianc118 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@brianc118
Copy link
Contributor

Agreed with Daniel here, we probably should not be logging from this crate

below/procfs/src/lib.rs Outdated Show resolved Hide resolved
below/procfs/src/lib.rs Outdated Show resolved Hide resolved
@mmynk
Copy link
Contributor Author

mmynk commented Nov 7, 2023

@brianc118 could you please take a look again if the changes look good now? thanks!

@@ -11,11 +11,13 @@ repository = "/~https://github.com/facebookincubator/below"
license = "Apache-2.0"

[dependencies]
common = { package = "below-common", version = "0.7.1", path = "../common" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's preferable to not have dependency on this crate. We can just use

@@ -80,11 +81,12 @@ impl TestProcfs {
}

fn get_net_reader(&self) -> NetReader {
let logger = get_logger();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just do

    let plain = slog_term::PlainSyncDecorator::new(std::io::stderr());
    let logger = slog::Logger::root(slog_term::FullFormat::new(plain).build().fuse(), slog::o!())

to avoid using common

logger: &slog::Logger,
result: Result<BTreeMap<K, V>>,
) -> Result<Option<BTreeMap<K, V>>> {
let netstat_map = if let Err(Error::IoError(_, err)) = result {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we only handle ENOENT (std::io::ErrorKind::NotFound ) here? I don't know if we want to swallow all IO errors.

let netstat_map = self.read_kv_diff_line("netstat")?;
let snmp_map = self.read_kv_diff_line("snmp")?;
let snmp6_map = self.read_kv_same_line("snmp6")?;
let netstat_map = handle_io_error(&self.logger, self.read_kv_diff_line("netstat"))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment before this block for the reasons why it's valid for files to be missing? (e.g. the reasons Daniel mentioned below)

@mmynk
Copy link
Contributor Author

mmynk commented Nov 9, 2023

@brianc118 thanks! addressed the comments.

@facebook-github-bot
Copy link
Contributor

@brianc118 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mmynk
Copy link
Contributor Author

mmynk commented Nov 13, 2023

@brianc118 not sure why linter is failing, rustfmt doesn't complain on any of the changed files.

@facebook-github-bot
Copy link
Contributor

@brianc118 merged this pull request in 4bef99f.

@mmynk mmynk deleted the fix/net-stats branch November 13, 2023 22:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants