Make InvalidCharError support UTF-8 #135

Open · tcharding wants to merge 1 commit into master from 01-09-utf8-error-char

Conversation

tcharding (Member) commented on Jan 9, 2025:

Currently the `InvalidCharError` error returns a `u8` when an invalid hex character is found. If that invalid character happens to be a `char` that encodes to more than one byte in UTF-8 (i.e. it is not ASCII), then the `u8` returned is not the same as the invalid character in the input string.

Add support for invalid hex characters that fall outside of a single ASCII byte.

Because there are so many edge cases, this is a best effort; I'm not claiming to have gotten it right for all input strings. We default to just returning the byte.

Currently a 4-byte encoded character that falls on an odd position in the hex string is not handled correctly, but this patch is definitely an improvement nonetheless.
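
To make the failure mode concrete, here is a minimal standalone sketch (illustration only, not the crate's code) showing how a byte-wise scan reports only the first UTF-8 byte of a non-ASCII character:

```rust
fn main() {
    // 'é' (U+00E9) encodes as the two UTF-8 bytes 0xC3 0xA9, so a byte-wise
    // scan reports 0xC3 rather than the character the user actually typed.
    let input = "aé";
    let bad_byte = input.bytes().find(|b| !b.is_ascii_hexdigit()).unwrap();
    assert_eq!(bad_byte, 0xC3); // first byte of 'é', not 'é' itself
    let bad_char = input.chars().find(|c| !c.is_ascii_hexdigit()).unwrap();
    assert_eq!(bad_char, 'é'); // what the error would ideally report
}
```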

Fix: #100

tcharding force-pushed the 01-09-utf8-error-char branch from 0b41da3 to 8369090 on January 9, 2025.
tcharding force-pushed the 01-09-utf8-error-char branch from 8369090 to ce28e9b on January 9, 2025.
tcharding (Member, Author) commented:

BTW, I did not read up on the intricacies of UTF-8 encoding; I probably should have.

tcharding (Member, Author) commented:

Oh, and I spent waaay too long on this today. Hope it was worth it.

```rust
/// Returns the valid byte or a bool indicating error for `hi` or `lo`.
fn hex_digits_to_byte(hi: u8, lo: u8) -> Result<u8, bool> {
    let hih = (hi as char).to_digit(16).ok_or((hi, true))?;
    let loh = (lo as char).to_digit(16).ok_or((lo, false))?;
```
apoelstra (Member) commented on ce28e9b:

I find the word "digit", when used to refer to a number while we are working with a character encoding that contains actual digits, extremely confusing.

tcharding (Member, Author) commented on Jan 9, 2025:

Amusingly, I don't even know what `hi` and `lo` are from just looking at this diff, and I worked on this code for hours yesterday. Maybe 'symbol' instead of 'digit', or perhaps `hex_ascii_value_to_byte`; then again, `to_byte` is better described by 'decode', because the two hex symbols encode a byte.

Perhaps: `decode_hex_ascii_value_pair`.

apoelstra (Member) commented:

Yep, `ascii_value_pair` sounds good to me.

And yeah, `hi` and `lo` are not great... they are obviously the two nybbles of the byte, but are they ASCII values or numbers? Not easy to tell when we are using `u8` for both.
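
A hedged sketch of how the renamed helper might look; the `(u8, bool)` error shape (the offending ASCII value plus an is-high-nybble flag) is assumed from the diff lines earlier in the thread:

```rust
// Decodes a pair of ASCII values (e.g. b'f', b'f') into the byte they
// encode, flagging which half of the pair was invalid on error.
fn decode_hex_ascii_value_pair(hi: u8, lo: u8) -> Result<u8, (u8, bool)> {
    let hih = (hi as char).to_digit(16).ok_or((hi, true))? as u8;
    let loh = (lo as char).to_digit(16).ok_or((lo, false))? as u8;
    Ok(hih << 4 | loh)
}

fn main() {
    assert_eq!(decode_hex_ascii_value_pair(b'f', b'f'), Ok(0xff));
    assert_eq!(decode_hex_ascii_value_pair(b'x', b'f'), Err((b'x', true)));
}
```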

```rust
let hi = byte >> 4;
let lo = byte & 0x0f;

let hi = char::from_digit(hi.into(), 16).unwrap();
```
apoelstra (Member) commented on ce28e9b:

This function is so badly named that I consider it a bug in stdlib. We should just do this manually by indexing into a static array containing "0123456789abcdef". (I am not interested in filing bugs against stdlib only to be told "this is not Haskell" and that I shouldn't complain about words being used wrong. Let's just forget this function and move on with our lives.)

This would also have clearer semantics regarding capitalization, and to the extent there is any performance difference at all, it is likely to be faster.
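
A sketch of the suggested table-lookup approach (the function name here is illustrative, not from the patch):

```rust
const HEX_TABLE: &[u8; 16] = b"0123456789abcdef";

fn byte_to_hex_chars(byte: u8) -> (char, char) {
    // Lowercase by construction: capitalization is decided by the table,
    // not by a flag passed to char::from_digit.
    let hi = HEX_TABLE[(byte >> 4) as usize] as char;
    let lo = HEX_TABLE[(byte & 0x0f) as usize] as char;
    (hi, lo)
}

fn main() {
    assert_eq!(byte_to_hex_chars(0xAB), ('a', 'b'));
}
```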

```rust
    }
}

#[cfg(feature = "std")]
impl std::error::Error for InvalidCharError {}

/// Invalid ASCII encoding of a hex digit.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InvalidDigitError {
```
apoelstra (Member) commented on ce28e9b:

BTW, here you use "digit" correctly to refer to a numeral character.

apoelstra (Member) commented:

...though the structure is still a little confusing since it actually holds two digits. But that's fine.


```rust
impl InvalidDigitError {
    /// Returns the invalid byte.
    pub fn invalid_byte(&self) -> u8 {
```
apoelstra (Member) commented on ce28e9b:

OK, here you have an error which holds two digits, which are supposed to comprise a byte but don't. You have a method called `invalid_byte` which returns one of the two digits. I think this name is confusing; it should be called `invalid_digit`.

```rust
rich_error.unwrap_or(InvalidCharError {
    invalid: if self.is_hi { self.hi.into() } else { self.lo.into() },
    pos: self.pos,
})
```
apoelstra (Member) commented on ce28e9b:

This logic is really confusing. It seems like if the original "bad digit" was an ASCII byte then that byte can just be promoted to `char` and we're done; if not, we need to read more. So you could start with `if self.invalid_byte() < 128 { return InvalidCharError { invalid: self.invalid_byte().into(), pos: self.pos() }; }` (and of course `self.invalid_byte` should be renamed, see above). That lets you do an early return in this case and save yourself at least one level of nesting. (It also fixes the bug here: `self.pos` is wrong, you actually meant to return `self.pos()`.)
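
A hedged sketch of that early-return shape, with placeholder types standing in for the crate's actual errors (field and method names follow the review discussion, not the patch):

```rust
struct InvalidCharError {
    invalid: char,
    pos: usize,
}

struct InvalidDigitError {
    hi: u8,
    lo: u8,
    is_hi: bool,
    pos: usize,
}

impl InvalidDigitError {
    fn invalid_digit(&self) -> u8 {
        if self.is_hi { self.hi } else { self.lo }
    }

    fn pos(&self) -> usize {
        self.pos
    }

    fn to_char_error(&self) -> InvalidCharError {
        // An ASCII value promotes directly to the identical char, so we can
        // return early and skip the UTF-8 machinery entirely.
        if self.invalid_digit() < 128 {
            return InvalidCharError { invalid: self.invalid_digit().into(), pos: self.pos() };
        }
        // Non-ASCII: the multi-byte UTF-8 handling would go here (elided).
        todo!("multi-byte UTF-8 handling")
    }
}
```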

```rust
// Try to map the UTF-8 values in `v` to an invalid character.
fn error(v: &[u8], pos: usize) -> Option<InvalidCharError> {
    core::str::from_utf8(v)
        .map(|s| InvalidCharError { invalid: s.chars().next().unwrap(), pos })
        .ok()
}
```
apoelstra (Member) commented on ce28e9b:

It seems like this function will return `None` in the case that you give it bad UTF-8 (for example, a non-ASCII character immediately followed by an ASCII character, which never happens in UTF-8 because it would make it ambiguous how to scan backward from the ASCII character to the nearest character boundary).

But then below you seem to interpret None as "promote one of the bad digits in self to a char". Which, aside from being logically wrong, might not even result in a valid character.
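
For instance (a standalone check, not code from the patch), a lead byte followed directly by an ASCII byte is rejected by `core::str::from_utf8`:

```rust
fn main() {
    // 0xC3 is a UTF-8 lead byte that must be followed by a continuation
    // byte in 0x80..=0xBF; an ASCII byte in that slot makes the slice
    // invalid UTF-8, so `error` as written would return `None` here.
    let v = [0xC3, b'a'];
    assert!(core::str::from_utf8(&v).is_err());
}
```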

apoelstra (Member) commented:

Done first pass on ce28e9b

tcharding (Member, Author) commented:

Thanks for the review, man. It seems this will take not only a lot of dev time but also a lot of review time - is it definitely worth doing? Or should we consider a bandaid solution, like an error message that says "invalid character is <whatever>; note that if the invalid character falls outside the single-byte ASCII range then this error will not include the actual character"?

apoelstra (Member) commented:

I think it'd be totally fine to say "invalid character c" if c is a printable ASCII byte and "invalid byte 0xAB" if c is not.

I wouldn't call it a "bandaid solution" to accurately describe the error condition without implementing a UTF-8 parser for the sake of providing more context (which might not even be valuable, BTW, because it assumes the input came from a UTF-8 string that the user can see in decoded form).

tankyleo (Contributor) commented:

> I think it'd be totally fine to say "invalid character c" if c is a printable ASCII byte and "invalid byte 0xAB" if c is not.

For reference, I attempted something along these lines in #124 - we discussed and ended up closing the PR.

apoelstra (Member) commented:

Right -- I complained about exposing an API that returns Option<char>. I continue to think this is not worth the extra API surface (it's not clear what the user would expect -- a char if the bad byte was ASCII? if it was printable? if it was alphanumeric? -- and even if we get it right and this can be quickly understood by the author and all readers of the code, it barely saves any characters because the inline logic is so simple).

But for purposes of implementing Display on this error I think it's reasonable to special-case printable characters.
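
A minimal sketch of that Display special-casing, assuming a simple error with a `u8` payload (names are illustrative, not the crate's API):

```rust
use core::fmt;

struct InvalidCharError {
    invalid: u8,
    pos: usize,
}

impl fmt::Display for InvalidCharError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.invalid.is_ascii_graphic() {
            // Printable ASCII: show the character itself.
            write!(f, "invalid character '{}' at position {}", self.invalid as char, self.pos)
        } else {
            // Anything else: show the raw byte value.
            write!(f, "invalid byte 0x{:02X} at position {}", self.invalid, self.pos)
        }
    }
}
```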


Successfully merging this pull request may close issue #100: "Return char instead of u8 from InvalidCharError::invalid_char".