-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle network stats parsing gracefully #8212
Conversation
Error in recording one of the network stats should not result in empty network stats
@brianc118 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth logging the errors still or will it be too noisy?
@dxu i think it should be worth adding since we're only logging the errors, however it will be logged at every sample collection instance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking closer, I think the root cause is misunderstood here.
procfs crate tries to read from /proc/net
. That's a symlink to /proc/self/net
. Which expands to /proc/1234/net
, where 1234
is the current process pid. It's not possible for the current running process to not be running (given we don't use much threading in below). So in other words, it's not that the process died, it's that the snmp6
file doesn't exist.
That tracks cuz we disable ipv6 on the kernel cmdline in our product (a problem for a different time). So I think this PR should handle when ipv6 is disabled. Collecting these stats should really be infallible.
Also another thing I noticed, I'm not sure we should be logging from this crate. If anything, we should pass a logging facade into the crate so the caller can specify the logging configuration. Cuz I think the intention of the procfs crate is so that other people could consume it.
@brianc118 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Agreed with Daniel here, we probably should not be logging from this crate |
@brianc118 could you please take a look again if the changes look good now? thanks! |
below/procfs/Cargo.toml
Outdated
@@ -11,11 +11,13 @@ repository = "/~https://github.com/facebookincubator/below" | |||
license = "Apache-2.0" | |||
|
|||
[dependencies] | |||
common = { package = "below-common", version = "0.7.1", path = "../common" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's preferable to not have dependency on this crate. We can just use
@@ -80,11 +81,12 @@ impl TestProcfs { | |||
} | |||
|
|||
fn get_net_reader(&self) -> NetReader { | |||
let logger = get_logger(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just do
let plain = slog_term::PlainSyncDecorator::new(std::io::stderr());
let logger = slog::Logger::root(slog_term::FullFormat::new(plain).build().fuse(), slog::o!())
to avoid using common
below/procfs/src/lib.rs
Outdated
logger: &slog::Logger, | ||
result: Result<BTreeMap<K, V>>, | ||
) -> Result<Option<BTreeMap<K, V>>> { | ||
let netstat_map = if let Err(Error::IoError(_, err)) = result { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we only handle ENOENT (std::io::ErrorKind::NotFound
) here? I don't know if we want to swallow all IO errors.
below/procfs/src/lib.rs
Outdated
let netstat_map = self.read_kv_diff_line("netstat")?; | ||
let snmp_map = self.read_kv_diff_line("snmp")?; | ||
let snmp6_map = self.read_kv_same_line("snmp6")?; | ||
let netstat_map = handle_io_error(&self.logger, self.read_kv_diff_line("netstat"))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment before this block for the reasons why it's valid for files to be missing? (e.g. the reasons Daniel mentioned below)
@brianc118 thanks! addressed the comments. |
@brianc118 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@brianc118 not sure why linter is failing, |
@brianc118 merged this pull request in 4bef99f. |
Goal: Error in recording one of the network stats should not result in completely empty network stats.
A process died and recording
snpm6
stats failed for that process which resulted in none of network stats being recorded. Error: