ACP: Expose more Unicode casing data in libcore #530
Comments
@Manishearth Do you have any thoughts on this? You are the unicode expert. My personal inclination is to accept the first set of methods (the first code block in the ACP) since everything else can be built on top of it.
I should also add, since I realised I didn't directly mention it here: ( While the data for There are also other Unicode properties in libstd, like
(I still plan to respond here, but am traveling)
I don't think exposing this data is a bad idea, though I do wonder if it's too niche. I think "implement Unicode algorithms using the same unicode version as libstd" is useful. This is a good reason to limit the amount of unicode stuff going on in std, though. It's a good thing we moved segmentation and normalization out of std.
I definitely think that it's valid to consider it niche; my main justification is that the standard library already has to maintain this code in one form, so, there's a lower burden to offer an API for it than there would be if it didn't need it at all. There may be some justification also in reducing binary size by avoiding the amount of times this data is effectively duplicated across crates, but I'm not so sure that's as big a concern here.
Proposal
Problem statement
While Rust has pretty robust Unicode support, it only offers casing data in a limited capacity, via methods like `char::is_lowercase` and `str::to_lowercase`. While the standard library itself already contains data on the additional Unicode properties `Cased` and `Case_Ignorable`, this data is not exposed publicly, and code cannot reuse it to implement its own versions of methods like `to_lowercase` on custom string types.
Additionally, lowercase and uppercase alone are not enough to do proper case-insensitive matching: this requires case folding, which is entirely absent from the standard library. The compiler (mostly via its `clap` dependency) even brings in the external `unicase` crate to solve this problem.
Motivating examples or use cases
As mentioned, the standard library already includes the `Cased` and `Case_Ignorable` property data in its own code, but does not expose this publicly. There would not be a substantial maintenance burden to expose `char::is_cased` and `char::is_case_ignorable` methods in libcore, since it's just a matter of offering a public API surface.
While case folding data isn't directly included in the standard library, it is no different from the lowercase and uppercase mapping tables and could easily be generated in the code as well and offered in a very similar API fashion.
While this code isn't strictly required in the standard library and the ecosystem has done mostly fine with crates like `unicase`, the primary benefit of including this data in the standard library is to expose data that is mostly already used by the compiler and to offer a solution to people who are averse to the idea of adding new dependencies.
Solution sketch
I'm going to separate this into a basic core of methods that I think should be added for this proposal, and a set of "stretch goal" methods which would be nice complements to these, but not strictly required.
The base methods:
It additionally would be nice to make the case-folding methods usable in const code, as a stretch goal:
Perhaps it would be also useful to have title-case data as well, since the number of title-cased characters is small. However, this is less useful because many people will not want explicit title-case (for example, "This Title Is Title Case" should probably be "This Title is Title Case") and because title-case is much more language-dependent.
Note that this uses Unicode's choice of "Titlecase" as one word instead of two separate words.
It might be nice to have an `eq_ignore_case` method for strings that uses case folding:
Using case folding, these methods are omitted but might be useful to include. I put them at the end since generally, people will prefer to perform case-folding in advance rather than doing them on-demand every time, and we may want to encourage that specifically.
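To make the difference between comparing on demand and folding in advance concrete, here is a rough sketch using only what the standard library offers today. Both `approx_fold` and `eq_ignore_case_approx` are hypothetical names for this example. Uppercasing and then lowercasing is only a crude approximation of case folding, not Unicode's actual case folding mapping, and it would differ from real folding in some cases (for instance around Greek sigma).

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for case folding. This is NOT Unicode case folding:
/// it merely uppercases and then lowercases, which happens to agree with
/// folding for simple cases such as 'ß'.
fn approx_fold(s: &str) -> String {
    s.to_uppercase().to_lowercase()
}

/// On-demand comparison: folds both inputs on every call.
fn eq_ignore_case_approx(a: &str, b: &str) -> bool {
    approx_fold(a) == approx_fold(b)
}

fn main() {
    // Lowercasing alone does not make these two spellings equal.
    assert_ne!("Straße".to_lowercase(), "STRASSE".to_lowercase());
    // The approximate fold does.
    assert!(eq_ignore_case_approx("Straße", "STRASSE"));

    // Folding in advance: store folded keys once, then fold only the query.
    let known: HashSet<String> = ["Straße", "Hello"]
        .iter()
        .map(|s| approx_fold(s))
        .collect();
    assert!(known.contains(&approx_fold("STRASSE")));
    assert!(known.contains(&approx_fold("hELLO")));
}
```

Pre-folding pays the allocation cost once per stored string, whereas the on-demand comparison allocates two new strings on every call.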
I'm also adding this one just because I wanted it myself. I will be surprised if it's actually accepted, but it technically is included in `unicase`:
Alternatives
Right now, the alternatives already exist as crates on crates.io. The primary benefit of adding this to the standard library is that a lot of the work is already done for upper/lowercase mappings, and some of this data is already included but not exposed publicly. But also, adding case folding to the standard library will make people aware of its existence, rather than simply converting everything to lowercase or uppercase to compare, which is technically incorrect. (The simplest example is that `lower(ß) != lower(SS)`, but `fold(ß) = fold(SS)`.)
Links and related work
I'm mostly creating this issue because it hasn't really been discussed since before 1.0.
If desired, I can dredge up some of the suggestions I found, but there aren't many, and I'm not sure they're relevant.
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
Second, if there's a concrete solution: