-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Autoformat confusable units #4430
Conversation
PR Check ResultsEcosystemℹ️ ecosystem check detected changes. (+2, -0, 0 error(s)) zulip (+2, -0)
+ zerver/tests/test_invite.py:1197:32: RUF001 [*] String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
+ zerver/tests/test_signup.py:1033:29: RUF001 [*] String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
BenchmarkLinux
Windows
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution.
I'm not very familiar with unicode confusables. So bare with me ;)
Our current table is based on the VS code table and the standard seems to differentiate between mixed script and other confusables. I believe we currently only test for mixed-script confusables (similar to unicode-security). Do you know if these added scripts are also mixed script confusables or general confusables?
|
||
use once_cell::sync::Lazy; | ||
use rustc_hash::FxHashMap; | ||
|
||
/// Via: </~https://github.com/hediet/vscode-unicode-data/blob/main/out/ambiguous.json> | ||
/// See: </~https://github.com/microsoft/vscode/blob/095ddabc52b82498ee7f718a34f9dd11d59099a8/src/vs/base/common/strings.ts#L1094> | ||
pub(crate) static CONFUSABLES: Lazy<FxHashMap<u32, u8>> = Lazy::new(|| { | ||
pub(crate) static CONFUSABLES: Lazy<FxHashMap<u32, u32>> = Lazy::new(|| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I wonder if we should instead change the type to <char, char>
. Having the char
s in place could make the table a bit more readable and avoids the unwrap
calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MichaReiser - That seems reasonable to me, but will it increase the size of this map?
Thanks for taking a look! The VS Code list only contains "... characters (key) that are confusable with a basic ascii [0x20 to 0x7E] character (value)". /~https://github.com/hediet/vscode-unicode-data/blob/a2ee3e17acaad2957f910cf6c46f6ea2c6eae753/index.ts#L159 I wonder why they did that. Was it an easy starting point and they haven't been motivated to build it out further? Would a full table would be too slow in node.js? 🤷 I too am new to Unicode security, and am still trying to figure out how to apply domain name spoofing concepts to code linting and autoformatting. TR 39 suggested reading TR 36 first, where I found this:
So it seems like normalization comes first in the process, before mixed script or single script handling. The substitutions I'm adding are normalizations. A longer discussion of normalization from TR 36:
|
@@ -2116,5 +2116,8 @@ pub(crate) static CONFUSABLES: Lazy<FxHashMap<u32, u8>> = Lazy::new(|| { | |||
(1059, 89), | |||
(65283, 35), | |||
(65307, 59), | |||
(0x212B, 0x00C5), // ANGSTROM SIGN → LATIN CAPITAL LETTER A WITH RING ABOVE | |||
(0x2126, 0x03A9), // OHM SIGN → GREEK CAPITAL LETTER OMEGA | |||
(0x00B5, 0x03BC), // MICRO SIGN → GREEK SMALL LETTER MU |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on Section 2.5 in https://www.unicode.org/reports/tr25/, should this also include Kelvin sign?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The list from VS Code covers Kelvin with (8490, 75)
.
6aaf205
to
a9cbfdd
Compare
I believe we were concerned about increasing the size of the map here but with use of |
This reverts commit e8879d5.
a9cbfdd
to
2bb2135
Compare
2bb2135
to
292bf7f
Compare
Thanks, and sorry for the extensive delay here. |
|
code | total | + violation | - violation | + fix | - fix |
---|---|---|---|---|---|
RUF001 | 6 | 6 | 0 | 0 | 0 |
RUF003 | 3 | 3 | 0 | 0 | 0 |
Linter (preview)
ℹ️ ecosystem check detected linter changes. (+9 -0 violations, +0 -0 fixes in 41 projects)
bokeh/bokeh (+1 -0 violations, +0 -0 fixes)
ruff check --no-cache --exit-zero --select ALL --preview
+ examples/interaction/tools/subcoordinates_zoom.py:22:23: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
pandas-dev/pandas (+6 -0 violations, +0 -0 fixes)
ruff check --no-cache --exit-zero --preview
+ pandas/_libs/tslibs/timedeltas.pyi:58:6: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)? + pandas/core/indexes/base.py:5352:26: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)? + pandas/core/indexes/base.py:5353:46: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)? + pandas/core/indexes/base.py:5354:69: RUF003 Comment contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)? + pandas/tests/io/test_clipboard.py:67:34: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)? + pandas/tests/scalar/timedelta/test_constructors.py:306:25: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
zulip/zulip (+2 -0 violations, +0 -0 fixes)
ruff check --no-cache --exit-zero --select ALL --preview
+ zerver/tests/test_invite.py:1281:32: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)? + zerver/tests/test_signup.py:1021:29: RUF001 String contains ambiguous `µ` (MICRO SIGN). Did you mean `μ` (GREEK SMALL LETTER MU)?
Changes by rule (2 rules affected)
code | total | + violation | - violation | + fix | - fix |
---|---|---|---|---|---|
RUF001 | 6 | 6 | 0 | 0 | 0 |
RUF003 | 3 | 3 | 0 | 0 | 0 |
Formatter (stable)
✅ ecosystem check detected no format changes.
Formatter (preview)
✅ ecosystem check detected no format changes.
Note to self: revisit these ecosystem changes today. |
\cc @zanieb - Do you have an opinion on whether this should be a preview change...? |
@charliermarsh looks like it creates new violations and it's not a bug fix so it should probably be preview? |
Done in #8473. |
Thanks! I've looked over the hits and made a best guess at how to proceed. Over time I'll try to test and submit Pull Requests.
I would suggest applying the fix for this abbreviation of microvolts in tooltip text.
timedeltas.pyi and timedeltas.pyx list many aliases for microseconds (among other units). test_constructors.py tests using the aliases. I would suggest
This multiline comment documents performance analysis timings. I would suggest applying the fix.
This is part of
These tests use the same copy+pasted string to test "non-ASCII characters", 3 from the Basic Latin block and 6 from the Latin-1 Supplement block (5 of which aren't flagged as confusable). Unless the project would like to begin testing characters from higher blocks, I would suggest dropping the character or adding |
Bokeh updated in bokeh/bokeh#13668 |
Thanks @covracer! |
I've seen errors crop up from using the different micro and mu characters. Follow matching recommendations on which character to prefer for micro, ohm, and angstrom. References: