Make InvalidCharError support UTF-8 #135

Open · tcharding wants to merge 1 commit into master from 01-09-utf8-error-char

Conversation

tcharding (Member) commented on Jan 9, 2025:

Currently the `InvalidCharError` error returns a `u8` when an invalid hex character is found. If that invalid character happens to be a `char` that encodes to more than one byte in UTF-8 (i.e. it is not ASCII), then the `u8` returned is not the same as the invalid character in the input string.

Add support for invalid hex characters that fall outside of a single ASCII byte.

Because there are so many edge cases, this is a best effort; I'm not claiming to have gotten it right for all input strings. We default to just returning the byte.

Currently a 4-byte encoded character that falls on an odd position in the hex string is not handled correctly, but this patch is definitely an improvement nonetheless.
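
To make the failure mode concrete, here is a minimal standalone sketch (illustration only, not the crate's code) showing how a byte-wise scan reports only the first UTF-8 byte of a non-ASCII character:

```rust
fn main() {
    // 'é' (U+00E9) encodes as the two UTF-8 bytes 0xC3 0xA9, so a byte-wise
    // scan reports 0xC3 rather than the character the user actually typed.
    let input = "aé";
    let bad_byte = input.bytes().find(|b| !b.is_ascii_hexdigit()).unwrap();
    assert_eq!(bad_byte, 0xC3); // first byte of 'é', not 'é' itself
    let bad_char = input.chars().find(|c| !c.is_ascii_hexdigit()).unwrap();
    assert_eq!(bad_char, 'é'); // what the error would ideally report
}
```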

Fix: #100

tcharding force-pushed the 01-09-utf8-error-char branch from 0b41da3 to 8369090 on January 9, 2025.
tcharding force-pushed the 01-09-utf8-error-char branch from 8369090 to ce28e9b on January 9, 2025.
tcharding (Member, Author) commented:

BTW, I did not read up on the intricacies of UTF-8 encoding; I probably should have.

tcharding (Member, Author) commented:

Oh, and I spent waaay too long on this today. Hope it was worth it.

```rust
/// Returns the valid byte or a bool indicating error for `hi` or `lo`.
fn hex_digits_to_byte(hi: u8, lo: u8) -> Result<u8, bool> {
    let hih = (hi as char).to_digit(16).ok_or((hi, true))?;
    let loh = (lo as char).to_digit(16).ok_or((lo, false))?;
```
apoelstra (Member) commented on ce28e9b:

I find the word "digit", when used to refer to a number while we are working with a character encoding that contains actual digits, extremely confusing.

tcharding (Member, Author) commented on Jan 9, 2025:

Amusingly, I don't even know what `hi` and `lo` are from just looking at this diff, and I worked on this code for hours yesterday. Maybe 'symbol' instead of 'digit', or perhaps `hex_ascii_value_to_byte`; then again, `to_byte` is better described by 'decode', because the two hex symbols encode a byte.

Perhaps: `decode_hex_ascii_value_pair`.

apoelstra (Member) commented:

Yep, `ascii_value_pair` sounds good to me.

And yeah, `hi` and `lo` are not great... they are obviously the two nybbles of the byte, but are they ASCII values or numbers? Not easy to tell when we are using `u8` for both.
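
A hedged sketch of how the renamed helper might look; the `(u8, bool)` error shape (the offending ASCII value plus an is-high-nybble flag) is assumed from the diff lines earlier in the thread:

```rust
// Decodes a pair of ASCII values (e.g. b'f', b'f') into the byte they
// encode, flagging which half of the pair was invalid on error.
fn decode_hex_ascii_value_pair(hi: u8, lo: u8) -> Result<u8, (u8, bool)> {
    let hih = (hi as char).to_digit(16).ok_or((hi, true))? as u8;
    let loh = (lo as char).to_digit(16).ok_or((lo, false))? as u8;
    Ok(hih << 4 | loh)
}

fn main() {
    assert_eq!(decode_hex_ascii_value_pair(b'f', b'f'), Ok(0xff));
    assert_eq!(decode_hex_ascii_value_pair(b'x', b'f'), Err((b'x', true)));
}
```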

```rust
let hi = byte >> 4;
let lo = byte & 0x0f;

let hi = char::from_digit(hi.into(), 16).unwrap();
```
apoelstra (Member) commented on ce28e9b:

This function is so badly named that I consider it a bug in stdlib. We should just do this manually by indexing into a static array containing "0123456789abcdef". (I am not interested in filing bugs against stdlib only to be told "this is not Haskell" and that I shouldn't complain about words being used wrong. Let's just forget this function and move on with our lives.)

This would also have clearer semantics regarding capitalization, and to the extent there is any performance difference at all, it is likely to be faster.
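
A sketch of the suggested table-lookup approach (the function name here is illustrative, not from the patch):

```rust
const HEX_TABLE: &[u8; 16] = b"0123456789abcdef";

fn byte_to_hex_chars(byte: u8) -> (char, char) {
    // Lowercase by construction: capitalization is decided by the table,
    // not by a flag passed to char::from_digit.
    let hi = HEX_TABLE[(byte >> 4) as usize] as char;
    let lo = HEX_TABLE[(byte & 0x0f) as usize] as char;
    (hi, lo)
}

fn main() {
    assert_eq!(byte_to_hex_chars(0xAB), ('a', 'b'));
}
```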

```rust
    }
}

#[cfg(feature = "std")]
impl std::error::Error for InvalidCharError {}

/// Invalid ASCII encoding of a hex digit.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InvalidDigitError {
```
apoelstra (Member) commented on ce28e9b:

BTW, here you use "digit" correctly to refer to a numeral character.

apoelstra (Member) commented:

...though the structure is still a little confusing since it actually holds two digits. But that's fine.


```rust
impl InvalidDigitError {
    /// Returns the invalid byte.
    pub fn invalid_byte(&self) -> u8 {
```
apoelstra (Member) commented on ce28e9b:

OK, here you have an error which holds two digits, which are supposed to comprise a byte but don't. You have a method called `invalid_byte` which returns one of the two digits. I think this name is confusing; it should be called `invalid_digit`.

```rust
rich_error.unwrap_or(InvalidCharError {
    invalid: if self.is_hi { self.hi.into() } else { self.lo.into() },
    pos: self.pos,
})
```
apoelstra (Member) commented on ce28e9b:

This logic is really confusing. It seems like if the original "bad digit" was an ASCII byte then that byte can just be promoted to `char` and we're done; if not, we need to read more. So you could start with `if self.invalid_byte() < 128 { return InvalidCharError { invalid: self.invalid_byte().into(), pos: self.pos() }; }` (and of course `self.invalid_byte` should be renamed, see above). That lets you do an early return in this case and save yourself at least one level of nesting. (It also fixes the bug here: `self.pos` is wrong, you actually meant to return `self.pos()`.)
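
A hedged sketch of that early-return shape, with placeholder types standing in for the crate's actual errors (field and method names follow the review discussion, not the patch):

```rust
struct InvalidCharError {
    invalid: char,
    pos: usize,
}

struct InvalidDigitError {
    hi: u8,
    lo: u8,
    is_hi: bool,
    pos: usize,
}

impl InvalidDigitError {
    fn invalid_digit(&self) -> u8 {
        if self.is_hi { self.hi } else { self.lo }
    }

    fn pos(&self) -> usize {
        self.pos
    }

    fn to_char_error(&self) -> InvalidCharError {
        // An ASCII value promotes directly to the identical char, so we can
        // return early and skip the UTF-8 machinery entirely.
        if self.invalid_digit() < 128 {
            return InvalidCharError { invalid: self.invalid_digit().into(), pos: self.pos() };
        }
        // Non-ASCII: the multi-byte UTF-8 handling would go here (elided).
        todo!("multi-byte UTF-8 handling")
    }
}
```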

```rust
// Try to map the UTF-8 values in `v` to an invalid character.
fn error(v: &[u8], pos: usize) -> Option<InvalidCharError> {
    core::str::from_utf8(v)
        .map(|s| InvalidCharError { invalid: s.chars().next().unwrap(), pos })
        .ok()
}
```
apoelstra (Member) commented on ce28e9b:

It seems like this function will return `None` in the case that you give it bad UTF-8 (for example, a non-ASCII character immediately followed by an ASCII character, which never happens in UTF-8 because it would make it ambiguous how to scan backward from the ASCII character to the nearest character boundary).

But then below you seem to interpret None as "promote one of the bad digits in self to a char". Which, aside from being logically wrong, might not even result in a valid character.
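
For instance (a standalone check, not code from the patch), a lead byte followed directly by an ASCII byte is rejected by `core::str::from_utf8`:

```rust
fn main() {
    // 0xC3 is a UTF-8 lead byte that must be followed by a continuation
    // byte in 0x80..=0xBF; an ASCII byte in that slot makes the slice
    // invalid UTF-8, so `error` as written would return `None` here.
    let v = [0xC3, b'a'];
    assert!(core::str::from_utf8(&v).is_err());
}
```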

apoelstra (Member) commented:

Done first pass on ce28e9b

tcharding (Member, Author) commented:

Thanks for the review, man. It seems this will take not only a lot of dev time but also a lot of review time - is it definitely worth doing? Or should we consider a bandaid solution, like an error message that says "invalid character is <whatever>; note that if the invalid character falls outside the single-byte ASCII range then this error will not include the actual character"?

apoelstra (Member) commented:

I think it'd be totally fine to say "invalid character c" if c is a printable ASCII byte and "invalid byte 0xAB" if c is not.

I wouldn't call it a "bandaid solution" to accurately describe the error condition without implementing a UTF-8 parser for the sake of providing more context (which might not even be valuable, BTW, because it assumes the input came from a UTF-8 string that the user can see in decoded form).

tankyleo (Contributor) commented:

> I think it'd be totally fine to say "invalid character c" if c is a printable ASCII byte and "invalid byte 0xAB" if c is not.

For reference, I attempted something along these lines in #124 - we discussed and ended up closing the PR.

apoelstra (Member) commented:

Right -- I complained about exposing an API that returns Option<char>. I continue to think this is not worth the extra API surface (it's not clear what the user would expect -- a char if the bad byte was ASCII? if it was printable? if it was alphanumeric? -- and even if we get it right and this can be quickly understood by the author and all readers of the code, it barely saves any characters because the inline logic is so simple).

But for purposes of implementing Display on this error I think it's reasonable to special-case printable characters.
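
A minimal sketch of that Display special-casing, assuming a simple error with a `u8` payload (names are illustrative, not the crate's API):

```rust
use core::fmt;

struct InvalidCharError {
    invalid: u8,
    pos: usize,
}

impl fmt::Display for InvalidCharError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.invalid.is_ascii_graphic() {
            // Printable ASCII: show the character itself.
            write!(f, "invalid character '{}' at position {}", self.invalid as char, self.pos)
        } else {
            // Anything else: show the raw byte value.
            write!(f, "invalid byte 0x{:02X} at position {}", self.invalid, self.pos)
        }
    }
}
```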


Successfully merging this pull request may close issue #100: "Return char instead of u8 from InvalidCharError::invalid_char".