Don't zero memory when we don't have to. #578
Conversation
Current coverage is 95.65% (diff: 100%)

@@            master   #578   diff @@
=====================================
  Files           16     16
  Lines         5614   5615     +1
  Methods          0      0
  Messages         0      0
  Branches       403    403
=====================================
+ Hits          5370   5371     +1
  Misses         167    167
  Partials        77     77
|
Can you show a simple benchmark and numbers? |
@alex Would you like the allocator tested directly or do you want the full PyOpenSSL stack tested? |
Full stack please.
|
I'd like to know what other places it makes sense to use the dirty allocator, but I'm also fine with those being a separate PR. Look forward to seeing the benchmark :) |
Ok, so here's a quick and dirty benchmark. It uses 1GB as the allocation size because I like showing off how sizeable this can be, but we can discuss whether we need a more representative one. The benchmark is below:

import socket
from OpenSSL import SSL, crypto
CERT = crypto.load_certificate(
    crypto.FILETYPE_PEM,
    """-----BEGIN CERTIFICATE-----
MIIDUjCCAjoCCQCQmNzzpQTCijANBgkqhkiG9w0BAQUFADBrMQswCQYDVQQGEwJH
QjEPMA0GA1UECBMGTG9uZG9uMQ8wDQYDVQQHEwZMb25kb24xETAPBgNVBAoTCGh5
cGVyLWgyMREwDwYDVQQLEwhoeXBleS1oMjEUMBIGA1UEAxMLZXhhbXBsZS5jb20w
HhcNMTUwOTE2MjAyOTA0WhcNMTYwOTE1MjAyOTA0WjBrMQswCQYDVQQGEwJHQjEP
MA0GA1UECBMGTG9uZG9uMQ8wDQYDVQQHEwZMb25kb24xETAPBgNVBAoTCGh5cGVy
LWgyMREwDwYDVQQLEwhoeXBleS1oMjEUMBIGA1UEAxMLZXhhbXBsZS5jb20wggEi
MA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQC74ZeB4Jdb5cnC9KXXLJuzjwTg
45q5EeShDYQe0TbKgreiUP6clU3BR0fFAVedN1q/LOuQ1HhvrDk1l4TfGF2bpCIq
K+U9CnzcQknvdpyyVeOLtSsCjOPk4xydHwkQxwJvHVdtJx4CzDDqGbHNHCF/9gpQ
lsa3JZW+tIZLK0XMEPFQ4XFXgegxTStO7kBBPaVIgG9Ooqc2MG4rjMNUpxa28WF1
SyqWTICf2N8T/C+fPzbQLKCWrFrKUP7WQlOaqPNQL9bCDhSTPRTwQOc2/MzVZ9gT
Xr0Z+JMTXwkSMKO52adE1pmKt00jJ1ecZBiJFyjx0X6hH+/59dLbG/7No+PzAgMB
AAEwDQYJKoZIhvcNAQEFBQADggEBAG3UhOCa0EemL2iY+C+PR6CwEHQ+n7vkBzNz
gKOG+Q39spyzqU1qJAzBxLTE81bIQbDg0R8kcLWHVH2y4zViRxZ0jHUFKMgjONW+
Aj4evic/2Y/LxpLxCajECq/jeMHYrmQONszf9pbc0+exrQpgnwd8asfsM3d/FJS2
5DIWryCKs/61m9vYL8icWx/9cnfPkBoNv1ER+V1L1TH3ARvABh406SBaeqLTm/kG
MNuKytKWJsQbNlxzWHVgkKzVsBKvYj0uIEJpClIhbe6XNYRDy8T8mKXVWhJuxH4p
/agmCG3nxO8aCrUK/EVmbWmVIfCH3t7jlwMX1nJ8MsRE7Ydnk8I=
-----END CERTIFICATE-----"""
)
PKEY = crypto.load_privatekey(
    crypto.FILETYPE_PEM,
    """-----BEGIN RSA PRIVATE KEY-----
MIIEpQIBAAKCAQEAu+GXgeCXW+XJwvSl1yybs48E4OOauRHkoQ2EHtE2yoK3olD+
nJVNwUdHxQFXnTdavyzrkNR4b6w5NZeE3xhdm6QiKivlPQp83EJJ73acslXji7Ur
Aozj5OMcnR8JEMcCbx1XbSceAsww6hmxzRwhf/YKUJbGtyWVvrSGSytFzBDxUOFx
V4HoMU0rTu5AQT2lSIBvTqKnNjBuK4zDVKcWtvFhdUsqlkyAn9jfE/wvnz820Cyg
lqxaylD+1kJTmqjzUC/Wwg4Ukz0U8EDnNvzM1WfYE169GfiTE18JEjCjudmnRNaZ
irdNIydXnGQYiRco8dF+oR/v+fXS2xv+zaPj8wIDAQABAoIBAQCsdq278+0c13d4
tViSh4k5r1w8D9IUdp9XU2/nVgckqA9nOVAvbkJc3FC+P7gsQgbUHKj0XoVbhU1S
q461t8kduPH/oiGhAcKR8WurHEdE0OC6ewhLJAeCMRQwCrAorXXHh7icIt9ClCuG
iSWUcXEy5Cidx3oL3r1xvIbV85fzdDtE9RC1I/kMjAy63S47YGiqh5vYmJkCa8rG
Dsd1sEMDPr63XJpqJj3uHRcPvySgXTa+ssTmUH8WJlPTjvDB5hnPz+lkk2JKVPNu
8adzftZ6hSun+tsc4ZJp8XhGu/m/7MjxWh8MeupLHlXcOEsnj4uHQQsOM3zHojr3
aDCZiC1pAoGBAOAhwe1ujoS2VJ5RXJ9KMs7eBER/02MDgWZjo54Jv/jFxPWGslKk
QQceuTe+PruRm41nzvk3q4iZXt8pG0bvpgigN2epcVx/O2ouRsUWWBT0JrVlEzha
TIvWjtZ5tSQExXgHL3VlM9+ka40l+NldLSPn25+prizaqhalWuvTpP23AoGBANaY
VhEI6yhp0BBUSATEv9lRgkwx3EbcnXNXPQjDMOthsyfq7FxbdOBEK1rwSDyuE6Ij
zQGcTOfdiur5Ttg0OQilTJIXJAlpoeecOQ9yGma08c5FMXVJJvcZUuWRZWg1ocQj
/hx0WVE9NwOoKwTBERv8HX7vJOFRZyvgkJwFxoulAoGAe4m/1XoZrga9z2GzNs10
AdgX7BW00x+MhH4pIiPnn1yK+nYa9jg4647Asnv3IfXZEnEEgRNxReKbi0+iDFBt
aNW+lDGuHTi37AfD1EBDnpEQgO1MUcRb6rwBkTAWatsCaO00+HUmyX9cFLm4Vz7n
caILyQ6CxZBlLgRIgDHxADMCgYEAtubsJGTHmZBmSCStpXLUWbOBLNQqfTM398DZ
QoirP1PsUQ+IGUfSG/u+QCogR6fPEBkXeFHxsoY/Cvsm2lvYaKgK1VFn46Xm2vNq
JuIH4pZCqp6LAv4weddZslT0a5eaowRSZ4o7PmTAaRuCXvD3VjTSJwhJFMo+90TV
vEWn7gkCgYEAkk+unX9kYmKoUdLh22/tzQekBa8WqMxXDwzBCECTAs2GlpL/f73i
zD15TnaNfLP6Q5RNb0N9tb0Gz1wSkwI1+jGAQLnh2K9X9cIVIqJn8Mf/KQa/wUDV
Tb1j7FoGUEgX7vbsyWuTd8P76kNYyGqCss1XmbttcSolqpbIdlSUcO0=
-----END RSA PRIVATE KEY-----"""
)
GIGABYTE = 2 ** 30
def handshake(client, server):
    conns = [client, server]
    while conns:
        for conn in conns:
            try:
                conn.do_handshake()
            except SSL.WantReadError:
                pass
            else:
                conns.remove(conn)
server_ctx = SSL.Context(SSL.SSLv23_METHOD)
server_ctx.use_privatekey(PKEY)
server_ctx.use_certificate(CERT)
client_ctx = SSL.Context(SSL.SSLv23_METHOD)
server_sock, client_sock = socket.socketpair()
server_sock.setblocking(False)
client_sock.setblocking(False)
server_conn = SSL.Connection(server_ctx, server_sock)
client_conn = SSL.Connection(client_ctx, client_sock)
server_conn.set_accept_state()
client_conn.set_connect_state()
handshake(client_conn, server_conn)
server_conn.setblocking(True)
client_conn.setblocking(True)
for _ in range(100):
    client_conn.sendall(b'\x00')
    server_conn.recv(GIGABYTE)

The results from the current PyPI release of PyOpenSSL:
The results from this branch:
That's...quite a difference. |
With a more reasonable allocation size:
Master:
This branch:
The difference in this case is much more subdued, which isn't an enormous surprise. With much longer cycle counts (e.g. gigabytes of data transferred) we'd probably see bigger effects though. |
Well, it’s still a solid 4% performance increase. |
It's really the best kind of performance increase because it's simply across-the-board. Even if your allocations are small, if you do enough of them this begins to matter. |
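To illustrate why this is across-the-board, here is a rough sketch (my own, not from the PR; assumes a POSIX libc reachable via `ctypes`, and the sizes are arbitrary picks) showing that the raw cost of zeroing a buffer grows with its size, which is exactly the cost the new allocator avoids:

```python
# Sketch: measure the raw cost of zeroing memory with libc's memset.
# Exact timings are machine-dependent; the point is that zeroing cost
# scales with the allocation size.
import ctypes
import time

libc = ctypes.CDLL(None)  # POSIX: resolves libc symbols from the process
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.memset.restype = ctypes.c_void_p
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

for size in (2 ** 20, 2 ** 24, 2 ** 28):  # 1 MB, 16 MB, 256 MB
    buf = libc.malloc(size)
    start = time.perf_counter()
    libc.memset(buf, 0, size)
    elapsed = time.perf_counter() - start
    libc.free(buf)
    print("zeroing %9d bytes took %.6f s" % (size, elapsed))
```

Even small per-allocation savings add up once an application makes many allocations on its hot path.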
Good work, @Lukasa! I have a feeling the speedup is even greater on 32-bit systems, because it takes twice as many instructions to zero the same amount of memory.
It uses clear and standard CFFI functions, and the tests pass. But, uh, https://travis-ci.org/pyca/pyopenssl/jobs/179282980 is maybe a thing.
Oh, uh, |
Please don’t merge until the Twisted test suite has actually been run. That’s a nice stress test for the changes too.
See #579 for a PR that re-adds Twisted testing. I'll do a catch-up merge once we've dealt with that. |
Ok, I've added a changelog entry and merged in the current master. |
Ok, tested against Twisted now! |
Shouldn’t we mention the effect of the changes? I wouldn’t use concrete numbers, but since people have been complaining about perf/mem since 0.14, it feels like we should push a bit more for it. |
How do you like the new message @hynek? |
Content: A+. I will require a signature by a legal guardian on this report, Cory! (semantic new lines pls :)) |
ACK |
Let’s welcome more vroom vroom to our lives! |
Wooo! 🚀 |
Before you do this, you should probably see if the cffi folks fix the underlying bug and make it unnecessary :-): https://bitbucket.org/cffi/cffi/issues/295/cffinew-is-way-slower-than-it-should-be-it |
Huh, once again @njsmith schools me. I had no idea calloc did that. |
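The optimization @njsmith is referring to can be sketched directly: on platforms with a fast calloc path, a large calloc hands back pages the kernel has already zeroed, so it can be far cheaper than malloc followed by memset. This is my own illustration (assumes a POSIX libc via `ctypes`; the 256 MB size is an arbitrary pick), not code from the thread:

```python
# Sketch: compare calloc against malloc+memset for one large block.
# On platforms with an optimized calloc the first path can be near-free;
# on platforms without one (as reported for macOS in this thread) it isn't.
import ctypes
import time

libc = ctypes.CDLL(None)
libc.calloc.restype = ctypes.c_void_p
libc.calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.memset.restype = ctypes.c_void_p
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

SIZE = 256 * 1024 * 1024  # 256 MB

start = time.perf_counter()
p = libc.calloc(SIZE, 1)
calloc_time = time.perf_counter() - start
libc.free(p)

start = time.perf_counter()
p = libc.malloc(SIZE)
libc.memset(p, 0, SIZE)
memset_time = time.perf_counter() - start
libc.free(p)

print("calloc:        %.6f s" % calloc_time)
print("malloc+memset: %.6f s" % memset_time)
```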
So I had a fuzzy recollection that this was a thing, but presumed that the information I found while investigating this issue meant it wasn't. So I thought to myself: why did I think that? What a weird thing to think! Presumably I investigated this? And, as it turns out, I did. Consider the following C program (this may be the simplest C repro I've ever written):

#include <stdlib.h>

#define ALLOCATION_SIZE (1 * 1024 * 1024)

int main (int argc, char *argv[]) {
    for (int i = 0; i < 10000; i++) {
        void *temp = calloc(ALLOCATION_SIZE, 1);
        free(temp);
    }
    return 0;
}

This code does 10,000 allocations of 1MB each, freeing each one immediately after it's made. And it doesn't perform well on my Mac. At all. The following are the results of timing the program execution using
Platform is macOS Sierra version 10.12.2 Beta (16C53a). This suggests that at the very least there are some platforms without an optimised calloc path. This is also why I discarded any notion of investigating allocation strategies: on my local machine calloc doesn't appear to be in any sense optimised. So unless I'm continuing to misunderstand something I think that regardless of what cffi does with their internal implementation, we need to continue to use this fix because at least some platforms do not have a calloc fast-path. |
Further investigation shows that the process fault count doesn't increase for most of its runtime. So the kernel isn't needing to fault pages in. I can't fully explain that at this time. More debugging is needed there. |
Ah, and that's right, we got here from the fact that @dreid showed me a sample from the CFFI program that looked very similar to this sample I just took of my C repro program:
Note the presence of Exactly why this doesn't need to fault pages in is somewhat unclear, but it's possible that a free list is being maintained. I'm downloading the library source as we speak. |
The other odd thing about those numbers is that your benchmark appears to be doing worse than O(n). I can also confirm that on "Darwin zibi.bids.berkeley.edu 15.5.0 Darwin Kernel Version 15.5.0: Tue Apr 19 18:36:36 PDT 2016; root:xnu-3248.50.21~8/RELEASE_X86_64 x86_64", a.k.a. MacOS 10.11.5, allocating a giant all-zeros matrix using numpy (which uses calloc) happens instantly and doesn't affect the process's RSS... |
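The numpy observation can be reproduced without numpy. A hedged sketch of my own (Linux-specific: it reads `/proc/self/statm`, which doesn't exist on macOS, and the sizes are arbitrary): a large calloc on a platform with the fast path shouldn't grow the resident set until the pages are actually touched.

```python
# Sketch: show that an untouched calloc'd region costs (almost) no
# resident memory; pages are faulted in only when written.
import ctypes

libc = ctypes.CDLL(None)
libc.calloc.restype = ctypes.c_void_p
libc.calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

def resident_pages():
    # Second field of /proc/self/statm is the resident set, in pages.
    with open("/proc/self/statm") as f:
        return int(f.read().split()[1])

before = resident_pages()
p = libc.calloc(1 << 30, 1)      # 1 GiB, left untouched
after_alloc = resident_pages()

ctypes.memset(p, 0, 1 << 24)     # now touch only the first 16 MiB
after_touch = resident_pages()
libc.free(p)

print("resident pages: before=%d after_alloc=%d after_touch=%d"
      % (before, after_alloc, after_touch))
```

On a glibc/Linux box the `after_alloc` figure typically barely moves, while `after_touch` jumps by roughly the number of pages written.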
@njsmith How giant is giant? Because... So
The default zone is what the library calls a "scalable zone", which uses
We then switch on size. macOS appears to have three different allocation strategies: tiny (<1024 bytes on 64-bit systems), small (<127kB on machines with more than 1GB RAM; yes, these are really computed using these two totally different metrics), and large. We're clearly well into the realm of large, so we now call

This is where it all starts to kick off. This is quite a function. However, the long and short of it is that there appears to be a "free cache" of large memory pages. This is conditionally

In this case, if an entry can be found in the "free cache", we do a whole bunch of bookkeeping but then ask the following question:

if (was_madvised_reusable && -1 == madvise(addr, size, MADV_FREE_REUSE))

If that boolean comes out False, then we'll drop into a call to

This appears to be the source of the problem. I was able to verify that if I attempted the next increment up from my highest increment (1GB allocation), then I circumvented the cache and the program executed nearly instantly.

The TL;DR here seems to be that there is a region of

I have to admit, I think this behaviour is a nasty edge case in macOS's allocator, but nonetheless it's a real one and I have no reason to suspect that it's a new problem.

TL;DR: I think we still need to do this. |
I should note it looks like |
Ah, I see, yeah, I was using a 2 GiB test array. I can reproduce your example timings on my 10.11.5 test box. Wow, that's pretty terrible. |
Yup. So while I agree that CFFI should endeavour to use calloc, we have at least one platform where calloc behaves pretty nastily. In practice, I think macOS has room to be smarter here. As you've pointed out, it's actually the memory management subsystem that is forcing these pages into memory, which ultimately seems like a bug: there's no reason macOS shouldn't be able to spot that these pages are unused and so avoid memsetting them. I think this may require that I file a radar. Joy. |
Alternatively, PyOpenSSL should round all allocations between 1 and 200 MB up to 1GB, just to be safe. ;) |
Yeah, I was just going to suggest that it might call for filing a radar... good luck with that :-). Sometimes it works... |
I guess I'm morbidly curious now what happens on windows, though I don't have a windows dev system handy. |
That leaves two questions: how does this behave on Linux, and are we reverting in the hope that someone else fixes their bugs? ISTM that the changes we made here are still “the right thing to do”, since we don’t multiply and don’t need clearing in any case. |
I think we still need these changes. |
I have filed this as rdar://29508271 |
@hynek: Linux's |
Upstream changes:

* Added OpenSSL.X509Store.set_time() to set a custom verification time when verifying certificate chains. pyca/pyopenssl#567
* Added a collection of functions for working with OCSP stapling. None of these functions make it possible to validate OCSP assertions, only to staple them into the handshake and to retrieve the stapled assertion if provided. Users will need to write their own code to handle OCSP assertions. We specifically added: Context.set_ocsp_server_callback, Context.set_ocsp_client_callback, and Connection.request_ocsp. pyca/pyopenssl#580
* Changed the SSL module's memory allocation policy to avoid zeroing memory it allocates when unnecessary. This reduces CPU usage and memory allocation time by an amount proportional to the size of the allocation. For applications that process a lot of TLS data or that use very large allocations, this can provide considerable performance improvements. pyca/pyopenssl#578
* Automatically set SSL_CTX_set_ecdh_auto() on OpenSSL.SSL.Context. pyca/pyopenssl#575
* Fix empty exceptions from OpenSSL.crypto.load_privatekey(). pyca/pyopenssl#581

The full upstream changelog can be found at: https://pyopenssl.readthedocs.io/en/17.0.0/changelog.html

I've also added a patch from pyca/pyopenssl#637 in order to fix the tests, which was the main reason for the version bump, because that patch won't apply for 16.2.0.

According to the upstream changelog there should be no backwards-incompatible changes, but I've tested building against some of the packages depending on pyopenssl anyway. Regardless, the build for pyopenssl fails right now anyway, so the worst that could happen via this commit is that we break something that's already broken.

Signed-off-by: aszlig <aszlig@redmoonstudios.org>
Mostly interesting for:

* pyopenssl: pyca/pyopenssl#578
* certifi
Resolves #577.
This provides and stores an allocator that is exactly the same as the standard CFFI allocator, except that it skips zeroing the memory before use.
This PR goes through the SSL.py module and changes all allocations of buffers that OpenSSL writes data into to use the new allocator. Given that OpenSSL always tells us how many bytes it has written, this change saves us the substantial overhead of zeroing memory we don't need to zero. This change can often save hundreds of milliseconds per call, which is quite an improvement!

There are probably some other places we could use the new allocator. Should I go through all of PyOpenSSL to check every single allocation to see if we can use the new allocator? Do we need some kind of testing? If so... uh... what kind?
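An allocator like the one described can be built with CFFI's public API: `ffi.new_allocator(should_clear_after_alloc=False)` returns an `ffi.new()`-style allocator that skips the zeroing step. This is a sketch of the idea, assuming the `cffi` package; the buffer size and the `written` value are illustrative, not taken from the PR:

```python
# Sketch: an allocator equivalent in spirit to the one this PR adds --
# identical to ffi.new() except that the returned memory is not zeroed.
from cffi import FFI

ffi = FFI()

# should_clear_after_alloc=False is the documented switch that skips the
# zero-fill; allocation and freeing otherwise match the default allocator.
no_zero_buffer = ffi.new_allocator(should_clear_after_alloc=False)

buf = no_zero_buffer("char[]", 8192)  # contents are uninitialized

# Because OpenSSL reports how many bytes it wrote, reading only that many
# bytes back out of the uninitialized buffer is safe:
written = 5  # e.g. the return value of a read call (illustrative)
data = ffi.buffer(buf, written)[:]
```

The memory is still freed automatically when `buf` is garbage collected, just as with `ffi.new()`; only the upfront memset is skipped.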