ACP: Expose more Unicode casing data in libcore #530

Open
clarfonthey opened this issue Jan 31, 2025 · 5 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@clarfonthey

Proposal

Problem statement

While Rust has pretty robust Unicode support, it only offers casing data in a limited capacity, via methods like char::is_lowercase and str::to_lowercase. The standard library itself already contains data on the additional Unicode properties Cased and Case_Ignorable, but this data is not exposed publicly, so other code cannot reuse it to implement its own versions of methods like to_lowercase on custom string types.

Additionally, lowercase and uppercase alone are not enough to do proper case-insensitive matching: this requires case folding, which is entirely absent from the standard library. The compiler (mostly via its clap dependency) even brings in the external unicase crate to solve this problem.

Motivating examples or use cases

As mentioned, the standard library already includes the Cased and Case_Ignorable property data in its own code, but does not expose this publicly. There would not be a substantial maintenance burden to expose char::is_cased and char::is_case_ignorable methods in libcore, since it's just a matter of offering a public API surface.

While case folding data isn't directly included in the standard library, it is no different from the lowercase and uppercase mapping tables; it could easily be generated in the same way and offered through a very similar API.

While this code isn't strictly required in the standard library and the ecosystem has done mostly fine with crates like unicase, the primary benefit of including this data in the standard library is to expose data that is mostly already used by the compiler and to offer a solution to people who are averse to the idea of adding new dependencies.

Solution sketch

I'm going to separate this into a basic core of methods that I think should be added for this proposal, and a set of "stretch goal" methods which would be nice complements to these, but not strictly required.

The base methods:

impl char {
    // currently exported as unstable `core::unicode::Cased`
    // corresponds to unicode `Cased` property
    // is not equivalent to `is_lowercase() || is_uppercase()`:
    //   it also includes title-case ligature characters like Lj
    const fn is_cased(self) -> bool;

    // currently exported as unstable `core::unicode::Case_Ignorable`
    // corresponds to unicode `Case_Ignorable` property
    // indicates characters which are completely ignored when case mapping;
    //   is mostly used for implementing casing algorithms
    const fn is_case_ignorable(self) -> bool;

    // not included currently
    // represents full case-folding as defined by `CaseFolding.txt`
    // should use same code as `ToLowercase` and `ToUppercase`
    // note that Turkic mappings are excluded;
    //   they're excluded by default, and the mapping is only two characters, so
    //   anyone can trivially special-case those ones
    fn to_folded_case(self) -> ToFoldedCase;
}

impl str {
    // not included currently
    // equivalent to `chars().flat_map(char::to_folded_case).collect()`
    // analogue to `to_lowercase` and `to_uppercase`
    fn to_folded_case(&self) -> String;

    // not included currently
    // equivalent to `chars().flat_map(char::to_folded_case)`
    // analogue to ACP-accepted `lowercase_chars` and `uppercase_chars`
    fn folded_chars(&self) -> FoldedChars;
}

impl String {
    // not included currently
    // equivalent to `*self = self.to_folded_case()`
    // analogue to ACP-accepted `make_lowercase` and `make_uppercase`
    fn fold_case(&mut self);
}
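To illustrate the comment on `is_cased` above, here is a minimal sketch on stable Rust. The helper `is_lower_or_upper` and the function `demo_titlecase_gap` are hypothetical names, not part of the proposal; the sketch only shows that the titlecase ligature U+01C8 takes part in case mapping while being neither lowercase nor uppercase.

```rust
/// Hypothetical approximation of the proposed `char::is_cased`, built only
/// from stable methods. It is deliberately incomplete: it misses titlecase
/// letters, which is the gap the proposal points out.
fn is_lower_or_upper(c: char) -> bool {
    c.is_lowercase() || c.is_uppercase()
}

/// Checks the behaviour for U+01C8 (LATIN CAPITAL LETTER L WITH SMALL LETTER J).
fn demo_titlecase_gap() {
    let lj = '\u{01C8}';

    // Neither lowercase nor uppercase, so the approximation says "not cased"...
    assert!(!is_lower_or_upper(lj));

    // ...even though the character has both a lowercase and an uppercase mapping.
    assert_eq!(lj.to_lowercase().collect::<String>(), "\u{01C9}");
    assert_eq!(lj.to_uppercase().collect::<String>(), "\u{01C7}");
}
```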

It additionally would be nice to make the case-folding methods usable in const code, as a stretch goal:

impl char {
    // these methods are now made const;
    // they only weren't because they weren't useful as const before
    const fn to_lowercase(self) -> ToLowercase;
    const fn to_uppercase(self) -> ToUppercase;
    const fn to_folded_case(self) -> ToFoldedCase;
}

impl To{Lowercase,Uppercase,FoldedCase} {
    // it seems incredibly unlikely that the internal representation would be
    //   changed to make this difficult
    const fn as_chars(&self) -> &[char];

    // effectively same as `fmt::Write` impl, analogue to `char::encode_utf8`
    // allows usage in const code before const traits,
    //   and can be used for implementing own case methods
    const fn encode_utf8(&self, buffer: &mut [u8; 12]) -> &mut str;

    // analogue to `char::len_utf8`
    const fn len_utf8(&self) -> usize;

    // analogue to `char::encode_utf16`
    const fn encode_utf16(&self, buffer: &mut [u16; 6]) -> &mut [u16];

    // analogue to `char::len_utf16`
    const fn len_utf16(&self) -> usize;
}
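As a rough illustration of the proposed `encode_utf8`, the following non-const sketch does the same job as a free function on stable Rust. The name `encode_case_mapping_utf8` is hypothetical, and it assumes what the 12-byte buffer in the signature suggests: a case mapping yields at most three chars of at most four UTF-8 bytes each.

```rust
/// Hypothetical stand-in for the proposed `encode_utf8` method: writes the
/// chars of a case-mapping iterator into `buffer` and returns the filled part.
/// Assumes the iterator yields at most three chars (3 * 4 bytes = 12 bytes);
/// `char::encode_utf8` panics if the remaining buffer is too small.
fn encode_case_mapping_utf8<'a>(
    chars: impl Iterator<Item = char>,
    buffer: &'a mut [u8; 12],
) -> &'a mut str {
    let mut len = 0;
    for c in chars {
        // Encode each char directly after the previous one.
        len += c.encode_utf8(&mut buffer[len..]).len();
    }
    // Only whole chars were written, so the prefix is valid UTF-8.
    core::str::from_utf8_mut(&mut buffer[..len]).expect("only whole chars were encoded")
}
```

For example, passing `'ß'.to_uppercase()` fills the buffer with the two bytes of "SS".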

Perhaps it would also be useful to have title-case data, since the number of title-cased characters is small. However, this is less useful because many people will not want explicit title-case (for example, "This Title Is Title Case" should probably be "This Title is Title Case") and because title-case is much more language-dependent.

Note that this uses Unicode's choice of "Titlecase" as one word instead of two separate words.

impl char {
    // note: `is_cased()` is now explicitly `is_lowercase() || is_uppercase() || is_titlecase()`
    // titlecase follows the unicode property `Titlecase_Letter`
    const fn is_titlecase(self) -> bool;

    // equivalent to `to_uppercase` for most characters,
    //   but different specifically for ligature characters
    // marking as const here in case we include earlier proposal
    const fn to_titlecase(self) -> ToTitlecase;
}

impl str {
    // would implement title-case algorithm, *and* include the final sigma rules
    fn to_titlecase(&self) -> String;
    fn titlecase_chars(&self) -> TitlecaseChars;
}

impl String {
    fn make_titlecase(&mut self);
}

It might be nice to have an eq_ignore_case method for strings that uses case folding:

impl str {
    // uses case folding
    fn eq_ignore_case(&self, rhs: &str) -> bool;
}
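A minimal sketch of how such a comparison could be layered on a per-char fold, without allocating a folded `String`. Both names are hypothetical, and `toy_fold` is only a stand-in: it is NOT real Unicode case folding, since it special-cases 'ß' and otherwise defers to `to_lowercase`.

```rust
/// Toy stand-in for the proposed `char::to_folded_case`. This is an
/// assumption for illustration only, not the `CaseFolding.txt` mapping:
/// it handles 'ß' and falls back to lowercasing for everything else.
fn toy_fold(c: char) -> impl Iterator<Item = char> {
    let folded: Vec<char> = match c {
        'ß' => vec!['s', 's'],
        other => other.to_lowercase().collect(),
    };
    folded.into_iter()
}

/// Compares two strings by their folded char sequences. Because the folded
/// streams are compared (not char pairs), multi-char expansions line up.
fn eq_ignore_case_sketch(lhs: &str, rhs: &str) -> bool {
    lhs.chars().flat_map(toy_fold).eq(rhs.chars().flat_map(toy_fold))
}
```

With this toy fold, `eq_ignore_case_sketch("Straße", "STRASSE")` holds, which a plain lowercase comparison would not give.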

Building on case folding, the following methods are omitted but might be useful to include. I put them at the end since, generally, people will prefer to perform case folding in advance rather than doing it on demand every time, and we may want to encourage that specifically.

// note that char::eq_ignore_case is absent,
//   since case conversions can expand to multiple characters

impl str {
    fn cmp_ignore_ascii_case(&self, rhs: &str) -> Ordering;
    fn cmp_ignore_case(&self, rhs: &str) -> Ordering;
}
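The ASCII variant can already be sketched on stable Rust. The function name below is hypothetical, and ordering by ASCII-lowercased bytes is an assumption: the proposal does not say which normal form the ordering would use, and lowercasing versus uppercasing gives different results for bytes such as `_` that sit between the two letter ranges.

```rust
use std::cmp::Ordering;

/// Hypothetical sketch of `cmp_ignore_ascii_case`: lexicographic comparison
/// of the bytes after ASCII-lowercasing each one. Non-ASCII bytes are left
/// untouched, matching how `eq_ignore_ascii_case` treats them.
fn cmp_ignore_ascii_case_sketch(lhs: &str, rhs: &str) -> Ordering {
    let l = lhs.bytes().map(|b| b.to_ascii_lowercase());
    let r = rhs.bytes().map(|b| b.to_ascii_lowercase());
    l.cmp(r)
}
```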

I'm also adding this one just because I wanted it myself. I will be surprised if it's actually accepted, but it technically is included in unicase:

impl str {
    fn hash_ignore_ascii_case<H: Hasher>(&self, state: &mut H);
    fn hash_ignore_case<H: Hasher>(&self, state: &mut H);
}
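For the ASCII variant, a minimal sketch on stable Rust could look like the following. The function name is hypothetical, and the trailing `0xff` byte is an assumption chosen so that consecutive strings fed to the same hasher do not run together (the byte never occurs in UTF-8).

```rust
use std::hash::Hasher;

/// Hypothetical sketch of `hash_ignore_ascii_case`: feeds the ASCII-lowercased
/// bytes to the hasher, so strings that are equal under
/// `eq_ignore_ascii_case` produce the same hash.
fn hash_ignore_ascii_case_sketch<H: Hasher>(s: &str, state: &mut H) {
    for b in s.bytes() {
        state.write_u8(b.to_ascii_lowercase());
    }
    // Terminator byte (assumption, see above); 0xff is never valid in UTF-8.
    state.write_u8(0xff);
}
```

The property that matters is consistency with the matching equality: whenever `a.eq_ignore_ascii_case(b)` holds, both strings must hash identically.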

Alternatives

Right now, the alternatives already exist as crates on crates.io. The primary benefit of adding this to the standard library is that a lot of the work is already done for upper/lowercase mappings, and some of this data is already included but not exposed publicly. Adding case folding to the standard library would also make people aware of its existence, rather than having them simply convert everything to lowercase or uppercase to compare, which is technically incorrect. (The simplest example is that lower(ß) != lower(SS), but fold(ß) = fold(SS).)
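The ß example can be checked against today's stable `str::to_lowercase`. The helper name `eq_via_lowercase` is hypothetical and stands for the "just lowercase both sides" approach:

```rust
/// Hypothetical helper: the common but technically incorrect way of doing a
/// case-insensitive comparison by lowercasing both sides.
fn eq_via_lowercase(lhs: &str, rhs: &str) -> bool {
    lhs.to_lowercase() == rhs.to_lowercase()
}

/// Shows where lowercasing falls short of case folding.
fn demo_lowercase_is_not_folding() {
    // 'ß' is already lowercase, so it stays as it is...
    assert_eq!("ß".to_lowercase(), "ß");
    // ...while "SS" lowercases to "ss", so the two never meet.
    assert_eq!("SS".to_lowercase(), "ss");
    assert!(!eq_via_lowercase("ß", "SS"));
}
```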

Links and related work

I'm mostly creating this issue because it hasn't really been discussed since before 1.0.

If desired, I can dredge up some of the suggestions I found, but there aren't many, and I'm not sure they're relevant.

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
@clarfonthey clarfonthey added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Jan 31, 2025
@Amanieu
Member

Amanieu commented Feb 4, 2025

@Manishearth Do you have any thoughts on this? You are the unicode expert.

My personal inclination is to accept the first set of methods (the first code block in the ACP) since everything else can be built on top of it.

@clarfonthey
Author

clarfonthey commented Feb 5, 2025

I should also add, since I realised I didn't directly mention it here: Cased and Case_Ignorable are used to implement the Final_Sigma rule in str::to_lowercase, and are also required to implement the titlecase algorithm. They don't really have a purpose outside of these, and they could be confusing to people who don't understand this.

(Final_Sigma is the rule that the Greek uppercase sigma (Σ) becomes either (σ) or (ς) depending on whether it's at the end of a word, and this is the only locale-independent rule for casing in Unicode.)
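For reference, the Final_Sigma behaviour can be observed in today's stable `str::to_lowercase`. The function name below is hypothetical; the code points are written as escapes to avoid confusion with lookalike Latin letters.

```rust
/// Demonstrates the Final_Sigma rule as applied by `str::to_lowercase`.
fn demo_final_sigma() {
    // "ΣΑΣ" (sigma, alpha, sigma): the first sigma becomes σ (U+03C3),
    // the word-final one becomes ς (U+03C2).
    assert_eq!(
        "\u{03A3}\u{0391}\u{03A3}".to_lowercase(),
        "\u{03C3}\u{03B1}\u{03C2}"
    );

    // A lone Σ is not preceded by a cased letter, so it stays a regular σ.
    assert_eq!("\u{03A3}".to_lowercase(), "\u{03C3}");

    // `char::to_lowercase` has no context and always yields σ.
    assert_eq!('\u{03A3}'.to_lowercase().collect::<String>(), "\u{03C3}");
}
```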

While the data for Cased and Case_Ignorable isn't massive, it is nontrivial, and it is already included in the standard library, just not exposed. The data for titlecase characters is substantially smaller since there are so few of them, which is an argument either for or against including it in libstd, depending on your point of view. (It's currently omitted.)

There are also other Unicode properties in libstd, like Grapheme_Extended, which are used by the Debug implementation, but these are indirectly exposed via escape_debug and are less generally useful, which is why I don't touch on them here.

@Manishearth
Member

(I still plan to respond here, but am traveling)

@Manishearth
Member

I don't think exposing this data is a bad idea, though I do wonder if it's too niche.

I think "implement Unicode algorithms using the same unicode version as libstd" is useful. This is a good reason to limit the amount of unicode stuff going on in std, though. It's a good thing we moved segmentation and normalization out of std.

@clarfonthey
Author

I definitely think that it's valid to consider it niche; my main justification is that the standard library already has to maintain this code in one form, so, there's a lower burden to offer an API for it than there would be if it didn't need it at all.

There may also be some justification in reducing binary size by cutting down the number of times this data is effectively duplicated across crates, but I'm not so sure that's as big a concern here.


3 participants