-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New fmt::Arguments representation. #115129
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
bdcae60
to
f8c2d63
Compare
This comment has been minimized.
This comment has been minimized.
tests/mir-opt/inline/inline_into_box_place.main.Inline.panic-abort.diff
Outdated
Show resolved
Hide resolved
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
⌛ Trying commit 6091d4dbd9e0cae96c550bd08ed860d3c76df120 with merge 0650fbf1bf480ad163fba30602ccdf6013aacd39... |
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
tests/mir-opt/inline/inline_into_box_place.main.Inline.panic-abort.diff
Outdated
Show resolved
Hide resolved
tests/mir-opt/pre-codegen/slice_iter.enumerated_loop.PreCodegen.after.panic-abort.mir
Outdated
Show resolved
Hide resolved
tests/mir-opt/pre-codegen/slice_iter.forward_loop.PreCodegen.after.panic-unwind.mir
Outdated
Show resolved
Hide resolved
Finished benchmarking commit (0650fbf1bf480ad163fba30602ccdf6013aacd39): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 635.819s -> 633.195s (-0.41%) |
That looks pretty good, but I'm a bit surprised to see a few things like |
Let's see what happens if I add a few @bors try @rust-timer queue |
⌛ Trying commit d1922d5ad206618355b837e688e857dd38e5dffc with merge 0208b67f46c3dc86f3df10c3e01fb70047717ff6... |
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (3373eb1ec6feda72af1b09f19aba78a9e3a88f39): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 630.718s -> 629.174s (-0.24%) |
Well that's quite a mixed bag of perf results. (: Also a few runtime benchmarks degraded quite a bit. If anyone feels like helping out: I could use some help figuring out what is causing these results. ^^ |
GitHub is now somehow super eager to delete old try commits. I'll rerun perf so that I can get a baseline commit SHA to compare against. @bory try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
@bors try |
New fmt::Arguments representation. This changes the internal representation of `fmt::Arguments` to reduce it in size and make it possible to convert a `&'static str` to a `fmt::Arguments` without additional indirection. Part of rust-lang#99012 fmt::Arguments stores three slices: 1. The slice of string literals between the placeholders. 2. The details of the placeholders (format options, which argument to display there, etc.). 3. The arguments. 2 is omitted when it's just all arguments once in order without additional options. Before this change, fmt::Arguments stores each of these as a &[] (or Option of that), resulting in a total size of six words/pointers. However, we don't need to store the length of each of these slices separately, as we may (unsafely) assume that the indexes of the fmt::Arguments generated by format_args!() are never out of bounds, and the number of placeholders and number of strings between them must be nearly equal. This PR changes the struct to store three pointers without length, and a single 'number of parts' counter that is the sum of the number of placeholders and string pieces. Additionally, this PR adds a special case for 1 part (that is, one string piece and no placeholders) to store the string pointer directly instead of through a slice of length 1. This makes it possible for `fmt::Arguments::new_str(&'static str)` to exist, which unlike before (`new_const(&[&'static str])`) doesn't need to borrow anything other than the string itself.
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (cf92f86): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 633.484s -> 629.614s (-0.61%) |
I analyzed the # baseline commit
$ rustup-toolchain-install-master e7f9f48d767c1e17e125d83e3102d568f87b868d
# PR commit
$ rustup-toolchain-install-master cf92f86bcc15ebf3a31a9467a890c989b7665512
# analyse
$ RUST_LOG=debug cargo run --bin collector codegen_diff [asm/llvm/mir] fmt `rustup +e7f9f48d767c1e17e125d83e3102d568f87b868d which rustc` `rustup +cf92f86bcc15ebf3a31a9467a890c989b7665512 which rustc` > diff.txt I haven't really found any interesting differences between ASM and LLVM. The only thing I noticed is that MIR got bigger for one function, probably because of different MIR inlining (?). MIR diff
|
/// | ||
/// The placaeholders pointer can be null for default placeholders: | ||
/// a placeholder for each argument once, in order, with default formatting options. | ||
strings_and_placeholders: (*const &'static str, *const rt::Placeholder), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use an enum? A variant for the num_parts == 1
case, and the other for the general case.
That's how the type is used AFAIU.
pub fn new_v1(pieces: &'a [&'static str], args: &'a [rt::Argument<'a>]) -> Arguments<'a> { | ||
if pieces.len() < args.len() || pieces.len() > args.len() + 1 { | ||
panic!("invalid args"); | ||
pub fn new_v1(strings: &'a [&'static str], args: &'a [rt::Argument<'a>]) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the posted MIR snippet, would it be worth optimizing new_v1
to be easier to optimize? For instance by checking more invariants in rustc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by checking more invariants in rustc
@cjgillot What do you mean?
Edit: bring up an optimization opportunity. Have you considered doing a hybrid of structured arrays and bytecode, like this? You can consider this as an alternative to this PR as it's mutually exclusive with some of the proposed functions here.
Multiple things are being suggested here:
Emit's considerably more complicated, but it not only trims the runtime size to 2 pointers + arguments stack array for arguments, but also shaves the runtime array sizes so much, it's about the size of a C format string (but much simpler to process). Here's how it compares with purely static strings:
This would help in three areas:
As for why I'm not also inlining function references, those generally lack stable runtime positions anyways, and so they can't be fully static. One could lazily resolve it, but it's not worth it for such a small number of local functions. Simple And yes, this function could, in theory, be made polymorphic over UTF-8 strings and byte strings with few changes. Only real difference is character filling. Data can be made all 32-bit and unconditionally present if the above variable-length encoding isn't okay.
This increases static size overhead to the following:
And of course, there is a middle ground, where whole fields are only conditionally present, but lengths are always 32-bit:
This results in the following static sizes:
|
@dead-claudia I think most of the brainstorming is at #99012, this PR is for a specific experiment. That being said, the ideas sound great! One thing is that I don't think we want to use invalid UTF-8 characters to store padding information, see #99012 (comment). Also it seems like it may be a good idea to reserve some bits for future fill options I don't think there would be any opposition to storing indices as u8, the only downside being the unaligned reads that result. Are you up for trying this? I'd be interested to see an experiment PR similar to this one that tests out your ideas to see how it stacks up for performance
For anyone's reference, here an older C printf implementation ( |
Yeah, didn't know about that issue. Thanks!
To be clear, that character + flags field stores a literal Unicode scalar code point, not a UTF-8 character. It can easily accommodate a byte, and technically speaking, that field for bytes could be shrunk down to 8 bits for the fill character, 8 bits for the flags as it stands currently.
There's a whole 5 more that could feasibly fit in that character + flags bit set (4 if you want a "continuation" bit). There's room to grow. 🙂
True, but how often is this really being called? Also, while unaligned reads are slower, this one's already very likely in cache, and so in practice there's likely at most 1 memory bus fetch anyways for most platforms. This is also why this person found little difference in practice - most their accesses were already likely pre-fetched into cache. With architectures like 32-bit ARM that at best only half-support unaligned access, the relative performance might be close enough to not matter, considering the frequent dynamic function calls and the like. Also, with the compact encoding, most small accesses (where unaligned accesses would have the most impact) would do 8-bit loads anyways, so it'd already be 1-byte aligned in the common case. And this can be bypassed if no fill is needed. Also, LLVM only generates two loads for misaligned loads, so the impact should at worst be 2-3x worse even on those. I do need to inspect their code gen, though. Worst case scenario (if it causes perf issues), this structure's easy to tweak to ensure proper padding - just pad the string lengths to compensate. For one, the existing bits should probably be moved even now so it all exists on the same byte, but I obviously didn't think that far when laying that out initially. 🙃
I'd love to, but it'll be a while (few months minimum) before I'd have any real time to play around and try to implement this. |
Oh, right, good points. I just read
Oh, I don't have any perf concerns, the APIs for unaligned things are just messier and can overall be a bit more eh (e.g.
Too bad :) maybe somebody will be interested in picking it up as an experiment in the meantime |
@dead-claudia you refer to "start data" a few times. What exactly is this? I assume it stores the argument information, but it's not clear to me how it gets processed (perhaps an array of indices that use dynamic data?) |
Sorry, should've explained it then. It's the byte contents of the first string segment. For instance, |
Closing this as inactive. Feel free to reöpen this pr or create a new pr if you get the time to work on this. Thanks |
This changes the internal representation of
fmt::Arguments
to reduce it in size and make it possible to convert a&'static str
to afmt::Arguments
without additional indirection.Part of #99012
fmt::Arguments stores three slices:
2 is omitted when it's just all arguments once in order without additional options.
Before this change, fmt::Arguments stores each of these as a &[] (or Option of that), resulting in a total size of six words/pointers.
However, we don't need to store the length of each of these slices separately, as we may (unsafely) assume that the indexes of the fmt::Arguments generated by format_args!() are never out of bounds, and the number of placeholders and number of strings between them must be nearly equal.
This PR changes the struct to store three pointers without length, and a single 'number of parts' counter that is the sum of the number of placeholders and string pieces.
Additionally, this PR adds a special case for 1 part (that is, one string piece and no placeholders) to store the string pointer directly instead of through a slice of length 1.
This makes it possible for
fmt::Arguments::new_str(&'static str)
to exist, which unlike before (new_const(&[&'static str])
) doesn't need to borrow anything other than the string itself.