-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor unicode.py script #60081
Refactor unicode.py script #60081
Conversation
r? @KodrAus (rust_highfive has picked a reviewer for you, use r? to override) |
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
should each of your functions have some doc-strings? |
Possibly, I can annotate the most obvious ones but I would have to ask the authors about some of the Unicode table calculations -- to explain what is being done and why. |
Following the suggestion above, I have added docstrings and type annotations for all functions. To fully comprehend what's going on I had to give the code a few good, thorough reads. This spawned more cleanups and the entire script is now refactored. |
I don't think I've got the python experience to be able to properly review this so I'll assign somebody else based on the commit history... r? @SimonSapin |
Sorry, I don’t have bandwidth available at the moment for non-trivial Rust reviews. |
r? @varkor, based on git-shortlog |
I'll try to take a look soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little divided about pull requests like this. On the one hand, refactoring and cleaning code up is a good thing. However, for scripts like this, which are quite large, but also very rarely used, and also written in Python, it's (unfortunately) possibly not worth the effort to clean up. For instance, it's very likely this script will never need to be modified (functionally), so readability isn't a huge issue, and it will continue to work in older versions of Python indefinitely. I do appreciate the desire to improve the codebase as a whole, but (as can be seen) reviewing such refactorings can be difficult, as it takes a lot of time.
The one thing I might be tempted to do is rewrite this (and the other table-generating scripts) in Rust, which would then be accessible to more of the reviewers and also give us more type safety and reassurance that we aren't accidentally breaking anything.
That said, I've taken a look at these changes. Overall I think they are an improvement. Honestly, as long as we continue to generate the same tables and the changes look mostly like moving code around, I'm happy. I would like the comment style to be normalised a bit more. One thing that would be nice to have is a test to check that the current table.rs
matches the output of unicode.py -v 11.0.0
, so we know that the script is (mostly) correct, but as that requires a connection to the Unicode website, it's probably not feasible.
If you could fix the few comments I had, I'll approve.
Such an effort might be able to rely on (parts of?) ucd-generate or unic-ucd. |
Ping from triage @pawroman, there are still outstanding review items that need to be addressed |
Thanks for being patient - I couldn't find the time to look into this recently. @varkor Thanks for reviewing - I will address the remarks soon. I am tempted to rewrite this in Rust myself, however I wanted to clean up the Python code first (so I understand it better) and then perhaps work on a Rust re-implementation. |
Co-Authored-By: varkor <github@varkor.com>
@pawroman: if you could squash the last two commits (so the submodules never change), I'm happy to approve with the new changes 👍 |
@bors r=varkor |
📌 Commit 2b47a08 has been approved by |
…varkor Refactor unicode.py script Hi, I noticed that the `unicode.py` script used some deprecated escapes in regular expressions. E.g. `\d`, `\w`, `\.` will be illegal in the future without "raw strings". This is now fixed. I have also cleaned up the script quite a bit. ## Escape deprecation OK (note the `r`): `re.compile(r"\d")` Deprecated (from Python 3.6 onwards, see [here][link1] and [here][link2]): `re.compile("\d")`. [link1]: https://docs.python.org/3.6/whatsnew/3.6.html#deprecated-python-behavior [link2]: https://bugs.python.org/issue27364 This was evident running the script using Python 3.7 like so: ``` $ python3 -Wall unicode.py unicode.py:227: DeprecationWarning: invalid escape sequence \w re1 = re.compile("^ *([0-9A-F]+) *; *(\w+)") unicode.py:228: DeprecationWarning: invalid escape sequence \. re2 = re.compile("^ *([0-9A-F]+)\.\.([0-9A-F]+) *; *(\w+)") unicode.py:453: DeprecationWarning: invalid escape sequence \d pattern = "for Version (\d+)\.(\d+)\.(\d+) of the Unicode" ``` The documentation states that > A backslash-character pair that is not a valid escape sequence now generates a DeprecationWarning. Although this will eventually become a SyntaxError, that will not be for several Python releases. ## Testing To test my changes, I had to add support for choosing the Unicode version to use. The script will default to latest release (which is 12.0.0 at the moment, repo has 11.0.0 checked in). The script generates the exact same output for version 11.0.0 with Python 2.7 and 3.7 and no longer generates any deprecation warnings: ``` $ python3 -Wall unicode.py -v 11.0.0 Using Unicode version: 11.0.0 Regenerated tables.rs. $ git diff tables.rs $ python2 -Wall unicode.py -v 11.0.0 Using Unicode version: 11.0.0 Regenerated tables.rs. $ git diff tables.rs $ python2 --version Python 2.7.16 $ python3 --version Python 3.7.3 ``` ## Extra functionality Furthermore, the script will check and download the latest Unicode version by default (without the `-v` argument). The `--help` is below: ``` $ ./unicode.py --help usage: unicode.py [-h] [-v VERSION] Regenerate Unicode tables (tables.rs). optional arguments: -h, --help show this help message and exit -v VERSION, --version VERSION Unicode version to use (if not specified, defaults to latest available final release). ``` ## Cleanups I have cleaned up the code quite a bit, with Python best practices and code style in mind. I'm happy to provide more details and rationale for all my changes if the reviewers so desire. One externally visible change is that the Unicode data will now be downloaded into `src/libcore/unicode/downloaded` directory suffixed by Unicode version: ``` $ pwd .../rust/src/libcore/unicode $ exa -T downloaded/ downloaded ├── 11.0.0 │ ├── DerivedCoreProperties.txt │ ├── DerivedNormalizationProps.txt │ ├── PropList.txt │ ├── ReadMe.txt │ ├── Scripts.txt │ ├── SpecialCasing.txt │ └── UnicodeData.txt └── 12.0.0 ├── DerivedCoreProperties.txt ├── DerivedNormalizationProps.txt ├── PropList.txt ├── ReadMe.txt ├── Scripts.txt ├── SpecialCasing.txt └── UnicodeData.txt ```
Rollup of 4 pull requests [2] Successful merges: - rust-lang#59800 (rustc: Remove `dylib` crate type from most rustc crates) - rust-lang#60081 (Refactor unicode.py script) - rust-lang#62270 (Move async-await tests from run-pass to ui) - rust-lang#62281 (Add support for pc-relative addressing on 64-bit RISC-V) Failed merges: r? @ghost
…varkor Refactor unicode.py script Hi, I noticed that the `unicode.py` script used some deprecated escapes in regular expressions. E.g. `\d`, `\w`, `\.` will be illegal in the future without "raw strings". This is now fixed. I have also cleaned up the script quite a bit. ## Escape deprecation OK (note the `r`): `re.compile(r"\d")` Deprecated (from Python 3.6 onwards, see [here][link1] and [here][link2]): `re.compile("\d")`. [link1]: https://docs.python.org/3.6/whatsnew/3.6.html#deprecated-python-behavior [link2]: https://bugs.python.org/issue27364 This was evident running the script using Python 3.7 like so: ``` $ python3 -Wall unicode.py unicode.py:227: DeprecationWarning: invalid escape sequence \w re1 = re.compile("^ *([0-9A-F]+) *; *(\w+)") unicode.py:228: DeprecationWarning: invalid escape sequence \. re2 = re.compile("^ *([0-9A-F]+)\.\.([0-9A-F]+) *; *(\w+)") unicode.py:453: DeprecationWarning: invalid escape sequence \d pattern = "for Version (\d+)\.(\d+)\.(\d+) of the Unicode" ``` The documentation states that > A backslash-character pair that is not a valid escape sequence now generates a DeprecationWarning. Although this will eventually become a SyntaxError, that will not be for several Python releases. ## Testing To test my changes, I had to add support for choosing the Unicode version to use. The script will default to latest release (which is 12.0.0 at the moment, repo has 11.0.0 checked in). The script generates the exact same output for version 11.0.0 with Python 2.7 and 3.7 and no longer generates any deprecation warnings: ``` $ python3 -Wall unicode.py -v 11.0.0 Using Unicode version: 11.0.0 Regenerated tables.rs. $ git diff tables.rs $ python2 -Wall unicode.py -v 11.0.0 Using Unicode version: 11.0.0 Regenerated tables.rs. $ git diff tables.rs $ python2 --version Python 2.7.16 $ python3 --version Python 3.7.3 ``` ## Extra functionality Furthermore, the script will check and download the latest Unicode version by default (without the `-v` argument). The `--help` is below: ``` $ ./unicode.py --help usage: unicode.py [-h] [-v VERSION] Regenerate Unicode tables (tables.rs). optional arguments: -h, --help show this help message and exit -v VERSION, --version VERSION Unicode version to use (if not specified, defaults to latest available final release). ``` ## Cleanups I have cleaned up the code quite a bit, with Python best practices and code style in mind. I'm happy to provide more details and rationale for all my changes if the reviewers so desire. One externally visible change is that the Unicode data will now be downloaded into `src/libcore/unicode/downloaded` directory suffixed by Unicode version: ``` $ pwd .../rust/src/libcore/unicode $ exa -T downloaded/ downloaded ├── 11.0.0 │ ├── DerivedCoreProperties.txt │ ├── DerivedNormalizationProps.txt │ ├── PropList.txt │ ├── ReadMe.txt │ ├── Scripts.txt │ ├── SpecialCasing.txt │ └── UnicodeData.txt └── 12.0.0 ├── DerivedCoreProperties.txt ├── DerivedNormalizationProps.txt ├── PropList.txt ├── ReadMe.txt ├── Scripts.txt ├── SpecialCasing.txt └── UnicodeData.txt ```
⌛ Testing commit 2b47a08 with merge 5a94ebbc9dc43dfa82b165b2c41a115b0e869804... |
…varkor Refactor unicode.py script Hi, I noticed that the `unicode.py` script used some deprecated escapes in regular expressions. E.g. `\d`, `\w`, `\.` will be illegal in the future without "raw strings". This is now fixed. I have also cleaned up the script quite a bit. ## Escape deprecation OK (note the `r`): `re.compile(r"\d")` Deprecated (from Python 3.6 onwards, see [here][link1] and [here][link2]): `re.compile("\d")`. [link1]: https://docs.python.org/3.6/whatsnew/3.6.html#deprecated-python-behavior [link2]: https://bugs.python.org/issue27364 This was evident running the script using Python 3.7 like so: ``` $ python3 -Wall unicode.py unicode.py:227: DeprecationWarning: invalid escape sequence \w re1 = re.compile("^ *([0-9A-F]+) *; *(\w+)") unicode.py:228: DeprecationWarning: invalid escape sequence \. re2 = re.compile("^ *([0-9A-F]+)\.\.([0-9A-F]+) *; *(\w+)") unicode.py:453: DeprecationWarning: invalid escape sequence \d pattern = "for Version (\d+)\.(\d+)\.(\d+) of the Unicode" ``` The documentation states that > A backslash-character pair that is not a valid escape sequence now generates a DeprecationWarning. Although this will eventually become a SyntaxError, that will not be for several Python releases. ## Testing To test my changes, I had to add support for choosing the Unicode version to use. The script will default to latest release (which is 12.0.0 at the moment, repo has 11.0.0 checked in). The script generates the exact same output for version 11.0.0 with Python 2.7 and 3.7 and no longer generates any deprecation warnings: ``` $ python3 -Wall unicode.py -v 11.0.0 Using Unicode version: 11.0.0 Regenerated tables.rs. $ git diff tables.rs $ python2 -Wall unicode.py -v 11.0.0 Using Unicode version: 11.0.0 Regenerated tables.rs. $ git diff tables.rs $ python2 --version Python 2.7.16 $ python3 --version Python 3.7.3 ``` ## Extra functionality Furthermore, the script will check and download the latest Unicode version by default (without the `-v` argument). The `--help` is below: ``` $ ./unicode.py --help usage: unicode.py [-h] [-v VERSION] Regenerate Unicode tables (tables.rs). optional arguments: -h, --help show this help message and exit -v VERSION, --version VERSION Unicode version to use (if not specified, defaults to latest available final release). ``` ## Cleanups I have cleaned up the code quite a bit, with Python best practices and code style in mind. I'm happy to provide more details and rationale for all my changes if the reviewers so desire. One externally visible change is that the Unicode data will now be downloaded into `src/libcore/unicode/downloaded` directory suffixed by Unicode version: ``` $ pwd .../rust/src/libcore/unicode $ exa -T downloaded/ downloaded ├── 11.0.0 │ ├── DerivedCoreProperties.txt │ ├── DerivedNormalizationProps.txt │ ├── PropList.txt │ ├── ReadMe.txt │ ├── Scripts.txt │ ├── SpecialCasing.txt │ └── UnicodeData.txt └── 12.0.0 ├── DerivedCoreProperties.txt ├── DerivedNormalizationProps.txt ├── PropList.txt ├── ReadMe.txt ├── Scripts.txt ├── SpecialCasing.txt └── UnicodeData.txt ```
@bors retry |
Rollup of 6 pull requests Successful merges: - #60081 (Refactor unicode.py script) - #61862 (Make the Weak::{into,as}_raw methods) - #62243 (Improve documentation for built-in macros) - #62422 (Remove some uses of mem::uninitialized) - #62432 (Update rustfmt to 1.3.2) - #62436 (normalize use of backticks/lowercase in compiler messages for librustc_mir) Failed merges: r? @ghost
…varkor Refactor unicode.py script Hi, I noticed that the `unicode.py` script used some deprecated escapes in regular expressions. E.g. `\d`, `\w`, `\.` will be illegal in the future without "raw strings". This is now fixed. I have also cleaned up the script quite a bit. ## Escape deprecation OK (note the `r`): `re.compile(r"\d")` Deprecated (from Python 3.6 onwards, see [here][link1] and [here][link2]): `re.compile("\d")`. [link1]: https://docs.python.org/3.6/whatsnew/3.6.html#deprecated-python-behavior [link2]: https://bugs.python.org/issue27364 This was evident running the script using Python 3.7 like so: ``` $ python3 -Wall unicode.py unicode.py:227: DeprecationWarning: invalid escape sequence \w re1 = re.compile("^ *([0-9A-F]+) *; *(\w+)") unicode.py:228: DeprecationWarning: invalid escape sequence \. re2 = re.compile("^ *([0-9A-F]+)\.\.([0-9A-F]+) *; *(\w+)") unicode.py:453: DeprecationWarning: invalid escape sequence \d pattern = "for Version (\d+)\.(\d+)\.(\d+) of the Unicode" ``` The documentation states that > A backslash-character pair that is not a valid escape sequence now generates a DeprecationWarning. Although this will eventually become a SyntaxError, that will not be for several Python releases. ## Testing To test my changes, I had to add support for choosing the Unicode version to use. The script will default to latest release (which is 12.0.0 at the moment, repo has 11.0.0 checked in). The script generates the exact same output for version 11.0.0 with Python 2.7 and 3.7 and no longer generates any deprecation warnings: ``` $ python3 -Wall unicode.py -v 11.0.0 Using Unicode version: 11.0.0 Regenerated tables.rs. $ git diff tables.rs $ python2 -Wall unicode.py -v 11.0.0 Using Unicode version: 11.0.0 Regenerated tables.rs. $ git diff tables.rs $ python2 --version Python 2.7.16 $ python3 --version Python 3.7.3 ``` ## Extra functionality Furthermore, the script will check and download the latest Unicode version by default (without the `-v` argument). The `--help` is below: ``` $ ./unicode.py --help usage: unicode.py [-h] [-v VERSION] Regenerate Unicode tables (tables.rs). optional arguments: -h, --help show this help message and exit -v VERSION, --version VERSION Unicode version to use (if not specified, defaults to latest available final release). ``` ## Cleanups I have cleaned up the code quite a bit, with Python best practices and code style in mind. I'm happy to provide more details and rationale for all my changes if the reviewers so desire. One externally visible change is that the Unicode data will now be downloaded into `src/libcore/unicode/downloaded` directory suffixed by Unicode version: ``` $ pwd .../rust/src/libcore/unicode $ exa -T downloaded/ downloaded ├── 11.0.0 │ ├── DerivedCoreProperties.txt │ ├── DerivedNormalizationProps.txt │ ├── PropList.txt │ ├── ReadMe.txt │ ├── Scripts.txt │ ├── SpecialCasing.txt │ └── UnicodeData.txt └── 12.0.0 ├── DerivedCoreProperties.txt ├── DerivedNormalizationProps.txt ├── PropList.txt ├── ReadMe.txt ├── Scripts.txt ├── SpecialCasing.txt └── UnicodeData.txt ```
Rollup of 5 pull requests Successful merges: - #60081 (Refactor unicode.py script) - #61862 (Make the Weak::{into,as}_raw methods) - #62243 (Improve documentation for built-in macros) - #62422 (Remove some uses of mem::uninitialized) - #62436 (normalize use of backticks/lowercase in compiler messages for librustc_mir) Failed merges: r? @ghost
@bors r- |
@bors retry |
Hi, I noticed that the
unicode.py
script used some deprecated escapes in regular expressions. E.g.\d
,\w
,\.
will be illegal in the future without "raw strings". This is now fixed. I have also cleaned up the script quite a bit.Escape deprecation
OK (note the
r
):re.compile(r"\d")
Deprecated (from Python 3.6 onwards, see here and here):
re.compile("\d")
.This was evident running the script using Python 3.7 like so:
The documentation states that
Testing
To test my changes, I had to add support for choosing the Unicode version to use. The script will default to latest release (which is 12.0.0 at the moment, repo has 11.0.0 checked in).
The script generates the exact same output for version 11.0.0 with Python 2.7 and 3.7 and no longer generates any deprecation warnings:
Extra functionality
Furthermore, the script will check and download the latest Unicode version by default (without the
-v
argument). The--help
is below:Cleanups
I have cleaned up the code quite a bit, with Python best practices and code style in mind. I'm happy to provide more details and rationale for all my changes if the reviewers so desire.
One externally visible change is that the Unicode data will now be downloaded into
src/libcore/unicode/downloaded
directory suffixed by Unicode version: