Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add low-level APIs for converting integers to strings #46281

Closed
wants to merge 9 commits into from

Conversation

SimonSapin
Copy link
Contributor

@SimonSapin SimonSapin commented Nov 26, 2017

This adds the following new APIs on primitive integer types, unstable under the int_to_str feature gate:

impl $Int {
    pub fn to_uppercase_str_radix(self, buffer: &mut [u8], radix: u32) -> &mut str { /* … */ }
    pub fn to_lowercase_str_radix(self, buffer: &mut [u8], radix: u32) -> &mut str { /* … */ }
    pub fn to_str(self, buffer: &mut [u8]) -> &mut str { /* … */ }
    pub const MAX_STR_LEN: usize = /* … */;
}

They are the “mirror” APIs of from_str and from_str_radix.

The &mut [u8] -> &mut str pattern is the same as char::encode_utf8: take a (possibly stack-allocated) buffer, write into it, and return a slice of it. It is maximally flexible in terms of allocation/copying strategy.

to_str is equivalent to the existing impls of the fmt::Display trait (it uses the same optimized algorithm) but does not have the overhead of creating a fmt::Formatter and otherwise invoking the string formatting machinery. This overhead was deemed high enough to make a crate dedicated to avoiding it: /~https://github.com/dtolnay/itoa/

It always requires a buffer of at least MAX_STR_LEN bytes, even for values that would fit in a smaller buffer. This is to avoid adding bound checks to the algorithm. Alternatives are to add bound checks, or add them only for to_str and not Display with more (internal) generics magic.

to_lowercase_str_radix and to_lowercase_str_radix are similar to the existing impls of the fmt::LowerHex, UpperHex, Octal, and Binary traits; but they support more bases and behave differently for negative integers: they use a minus sign followed by the representation of the mathematically opposite integer, whereas the formatting traits use the two’s complement. For example:

assert_eq!(format!("{:X}", -26), "FFFFFFE6");
assert_eq!((-26).to_uppercase_str_radix(&mut [0; 8], 16), "-1A");

There’s two separate methods for upper v.s. lower case to avoid a bool parameter that wouldn’t be self-explanatory at call sites, or a dedicated enum that seems like more new API surface than is warranted. An alternative might be to pick one of upper or lower case and only provide that, but that seems arbitrary since neither is obviously superior IMO.

@SimonSapin
Copy link
Contributor Author

CC @rust-lang/libs

@@ -1315,6 +1420,98 @@ macro_rules! uint_impl {
from_str_radix(src, radix)
}

/// Write the representation in a given base in a pre-allocated buffer.
///
/// Digits are a subset (depending on `radix`) of `0-9A-F`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0-9A-Z, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Fixed.

@kennytm kennytm added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Nov 27, 2017
@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Nov 27, 2017

In general I'm in favor of exposing low-level APIs like that to be able to reuse the optimized libstd implementation as building block for tasks that can't or don't want to use std::fmt. However, this part:

to_lowercase_str_radix and to_lowercase_str_radix are similar to the existing impls of the fmt::LowerHex, UpperHex, Octal, and Binary traits; but [...] behave differently for negative integers

seems worrying. I can see the desire for the different behavior, but if these APIs are supposed to be the "high level of control" equivalent of std::fmt APIs, any difference in behavior is an unexpected footgun, especially when the difference is not due to accepting arbitrary bases rather than knowing the base ahead of time.


I'm also wondering whether it could be worthwhile to wait for const generics for the *_radix APIs. In the typical use cases I've seen, the radix is virtually always constant, and for the common case of a power-of-two radix, it being constant could enable nice optimizations.

@SimonSapin
Copy link
Contributor Author

For what it’s worth, I believe that the footgun is that the formatting traits use the two’s complement representation in the first place. I trace the git history of this code and found no evidence that this isn’t accidental (such as discussion of this behavior or why it is desirable). I’d propose changing that behavior, but it’s been #[stable] long enough that we probably can’t. So all that’s left to do is documenting it #46285 and moving on.

I think of of these new APIs less as the low-level equivalent of formatting traits, and more as the inverse of from_str and from_str_radix. I’ve only mentioned the formatting trait to explain why the existing functionality is not sufficient.

As to making the radix a type parameter instead of a runtime parameter, we already have a from_str_radix that uses the latter and is stable. We can always add something else later, maybe to optimize hexadecimal specifically.

@hanna-kruppe
Copy link
Contributor

For what it’s worth, I believe that the footgun is that the formatting traits use the two’s complement representation in the first place.

I fully agree, but if it's here to stay, internal consistency also matters.

I think of of these new APIs less as the low-level equivalent of formatting traits, and more as the inverse of from_str and from_str_radix

I didn't consider this. A valid perspective, but these new APIs are also solving the same problem as std::fmt solves, so they're also connected to std::fmt. So we're between a rock and a hard place -- either be consistent with the other integer formatting APIs, or be consistent with the inverse function. I'm almost tempted to suggest changing the behavior of std::fmt.

As to making the radix a type parameter instead of a runtime parameter, we already have a from_str_radix that uses the latter and is stable

Good point.

@shepmaster
Copy link
Member

Slipped through assignment...

r? @dtolnay

/// assert_eq!(i16::min_value().to_str(&mut [0; i16::MAX_STR_LEN]), "-32768")
/// ```
#[unstable(feature = "int_to_str", issue = /* FIXME */ "0")]
pub fn to_str(self, buffer: &mut [u8]) -> &mut str {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not thrilled about how an inherent method to_str (which is almost never what you want) competes in rustdoc searches and autocomplete with a trait method to_string (which is almost always what you want). Would it make sense to name this in a scarier way?

@dtolnay
Copy link
Member

dtolnay commented Dec 7, 2017

It would be neat to explore how typenum could work for specifying the radix, similar to what #46281 (comment) suggested.

I am sympathetic to these being the inverse of from_str_radix and not the inverse of LowerHex etc. I think that is the right choice for what to expose.

println!("{}", <i16>::from_str_radix("-ff", 16).unwrap());
println!("{}", <i16>::from_str_radix("ffff", 16).unwrap_err());

Overall I would prefer to have these bake in a high-quality integer formatting library on crates.io first. Let's run it by the rest of the team in case someone wants to justify why these need to be exposed by std.

@rfcbot fcp close

@rfcbot rfcbot added the proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. label Dec 7, 2017
@rfcbot
Copy link

rfcbot commented Dec 7, 2017

Team member @dtolnay has proposed to close this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@SimonSapin
Copy link
Contributor Author

I don’t mind renaming to_str but I don’t have a good name idea right now. Do you have a suggestion?

It would be neat to explore how typenum could work for specifying the radix

Do we really want something like typenum in the standard library?

I would prefer to have these bake in a high-quality integer formatting library on crates.io first

Hasn’t that already happened with https://crates.io/crates/itoa ?

We could also keep only the to_str (possibly renamed) part of this PR, and leave the _radix case for later.

@carols10cents
Copy link
Member

ping @BurntSushi @Kimundi @aturon @sfackler waiting for your ticky boxes here!

@carols10cents carols10cents added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 12, 2017
@SimonSapin
Copy link
Contributor Author

@dtolnay, FCP to close immediately after stating some concerns feels premature. What do you think of not keeping the _radix method, and renaming to_str to write_to for example?

Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback Simon! I see what you mean. I apologize for not expressing my perspective clearly on the first try. It isn't so much that I had a few concerns -- it is that I have a general sense that it would be more productive to iterate on this feature in a library than adding it here now.

Even restricted to just the base-10 method, it feels like there is enough design space that we shouldn't hope to get it right on the first try. Having it bake in a library is better for everyone involved.

  • A library would be immediately available on the stable channel, so we can migrate code that currently uses itoa or similar to collect feedback. I would be happy to migrate serde_json as long as the new one is at least as fast.
  • A library can be versioned by Cargo so we avoid breaking nightly users as we settle on the best approach.

This is similar to the path that failure is following. For something like this, I believe bringing an established high-quality library to the table can be a faster path to stability in the standard library than iterating on #[unstable] stuff.

As an example of there being more design space than I think we have explored, here is a totally different approach that brings many advantages. Compare:

let mut buf = unsafe { mem::uninitialized::<[u8; i16::MAX_STR_LEN]>() };
let s = i.write_to(&mut buf);
// wrapper around uninitialized buffer of the right size
let mut buf = NumberBuffer::new();
let s = buf.stringify(i);

Advantages

  • Easier to use correctly.
  • Impossible to use incorrectly.
  • No more awkward MAX_STR_LEN constant.
  • No bounds check that your &[u8] is big enough.
  • No ambiguity about whether the &[u8] needs to be big enough to hold any integer, or just your integer.
  • No ambiguity about alignment of the result within the buffer.
  • Cannot panic.
  • Safe code gets to use an uninitialized buffer.
  • Googleable type name.
  • Forward compatible with controlling the base in a way that allocates a buffer of exactly the right size for your base using a const fn, struct NumberBuffer<const RADIX: usize = 10>.
  • No conflict with to_string autocomplete on the primitives.
  • The &mut is provided by autoref.
  • No ambiguity about whether the method stomps on the front of your buffer as scratch space.

Disadvantages

  • No longer exposes the ability to write right-aligned into an existing buffer. Unclear if this is ever valuable in practice.
  • Other things, I'm sure.

@SimonSapin
Copy link
Contributor Author

As maintainer of the itoa crate, how do you feel about adding NumberBuffer (IntegerBuffer?) there?

@dtolnay
Copy link
Member

dtolnay commented Dec 14, 2017

I don't think I am going to have bandwidth for this in the near term. It would be better if someone else can drive this and changes are not blocked on me, so I think releasing a different library would make more sense.

@rfcbot rfcbot added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Dec 19, 2017
@rfcbot
Copy link

rfcbot commented Dec 19, 2017

🔔 This is now entering its final comment period, as per the review above. 🔔

@bors
Copy link
Contributor

bors commented Dec 22, 2017

☔ The latest upstream changes (presumably #46922) made this pull request unmergeable. Please resolve the merge conflicts.

@dtolnay
Copy link
Member

dtolnay commented Dec 23, 2017

I will go ahead and close this. Thanks for the PR and discussion! Again I am not opposed to adding a low-level number to string API in the standard library but I believe it will be more productive to iterate on this in a crate. The nb crate name is available in case anyone wants to run with the NumberBuffer design. 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants