-
Notifications
You must be signed in to change notification settings - Fork 7.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TW#12797] Hardware crypto (MPI/bignum) acceleration causes hangs in mbedtls_ssl_handshake() #630
Comments
on WiFiClientSecure of arduino core, this occurs more often if put a delay inside the loop: while ((ret = mbedtls_ssl_handshake(&ssl_client->ssl_ctx)) != 0) {
if (ret != MBEDTLS_ERR_SSL_WANT_READ && ret != MBEDTLS_ERR_SSL_WANT_WRITE) {
return handle_error(ret);
}
delay(10);
vPortYield();
} |
To repeat the suggestion I made last year on the linked esp-open-rtos thread, if you set a socket timeout via mbedtls_ssl_conf_read_timeout() then does the hang go away? Also, if you wait a long time (five minutes plus), does the handshake eventually return? (I've run the HTTPS example here for very long periods without anything going wrong, but I have fairly stable internet. It makes sense that a jammed or wedged socket may cause a long delay.) We can look into adding a timeout to IDF example and possibly setting one by default in Arduino, but it would be great to verify if this fixes the issue first (as I can't reproduce, at least not so far). |
I am experiencing the same issue. By setting the mbedtls debug treshold, during handshake, this is the error detected
|
@projectgus Even if timeout enabled it still hangs and WDT doesn't reset. Here is an Arduino sketch with an example: (Sorry I'm not much familiar with IDF yet) For me, after 100 ~ 400 iterations it hangs |
@projectgus is there any progress on this issue? Could I provide any more info to shed more light on this? I have basically tried everything I could think of while still running into the same problem. Right now I've done some dirty hotfix: before the handshake I start a timer that will eventually reset the MCU after 60 seconds or so. Upon successful handshake I stop the timer. I tried "softer" solutions like stopping/deleting mbedtls instance but this did not work (result in crash or have now effect at all). So I would be really grateful if we could find some fix for that to handle this condition more gracefully. |
Same issue here. Is this issue dead? Should a new one be opened?
Apparently, it's not quite that simple - With blocking I/O, this will only work if a non-NULL f_recv_timeout was set with mbedtls_ssl_set_bio(). So, to verify the time was actually being called, created a local copy called mbedtls_net_recv_timeout_te and added some debug statements. Then added the following just above the line:
Nope, let it run for over 2.5 hrs after it hung and it's never came back.
Hmm, not sure how long you've let it run. Running the "esp-idf/examples/protocols/https_request" example, the "https_get_task" hangs in as little as a 1 to 8 minutes, sometimes as long as 40 or 50. Don't think it's ever taken an hour. Though come to think of it, did decrease the vTaskDelay from 1000 to 100 to speed up the cycle time. This post describes more of the background. This pastebin is a portion of the monitor output from the latest run. Sections that were cutout to make it small enough to fit were replaced with ...\n...\n.... "Enable mbedTLS debugging" is turned on, plus additional messages added to try to track down where the hang's happening. Yea, its a mess, sorry. Seems like the hang occurs in the call to "verify_func" which is wrapped in pk_internal.h - it's being called in mbedtls_pk_verify in pk.c. |
Thanks for all these additional details. One more question: if you go into menuconfig and disable "Enable hardware MPI (bignum) acceleration" in the mbedTLS config, does anything change? Thanks, |
With that option disabled, the test has been running for 2 hours so far without hanging. |
Disabling "Enable hardware MPI (bignum) acceleration" seems to have addressed the issue!
One other question, and this may be nothing or my misunderstanding - but I'm wondering why mbedtls_net_recv_timeout_te is being called so frequently? Twice right after connecting to wifi, plus more... Maybe it's always supposed to be called every time data's received whether there's a timeout or not?!? To see what I'm referring to, first, note that the "I (XXXX) example: The Time is: 1970-01-01 00:00:XX" message are printed by app_main on CPU0 whereas any messages generated by "https_get_task" (including mbedtls) are on CPU1. Then look at the pastebin mentioned above, you'll see we get connected, the the timeout is called twice - with at most a few 100ms delay. In fact, you'll see that "connected." happens 12 times while "TIMED OUT ***" happens 232 times!
The code that sets it up looks like:
|
Fantastic, I'm glad to hear this. There is a related bug we are tracking with MPI, that may be related to this. Will update this issue once we have a fix. I'll also try again to reproduce, maybe there's been a regression that I somehow didn't have the last time I ran the HTTPS example for an extended period (or maybe howsmyssl.com server has changed TLS parameters in a way which exposes this).
Yes. mbedtls_net_recv_timeout should be read as "receive data into buffer, with timeout". It should return after any number of bytes are received, or once the timeout is reached with no bytes received. |
I'm facing the same issue, disabling hardware bignum acceleration didn't solve it. Using RSA 4096 keys EDIT: on my issue, there seems to be a problem with the wifi module instead of mbedtls, since it sometimes also gets stuck in socker.recv() if the esp32 is far from the wi-fi modem. |
my issue was fixed by using the esp-wroom-32 chip instead of the esp32 from sparkfun |
@projectgus disabling MPI seams to fix this issue for me as well. Also interestingly enough HTTPS connections, especially the whole handshake section seams to be twice as fast now. Shouldn't it be the other way round? @LupascuAndrei also interesting that you seamed to have some kind of hardware issue. I think it would be also interesting if someone could pin that down so we can avoid similar issues when developing hardware for the ESP32 chips. |
after further investigations, when "failing", the handshake just takes way too long to complete I reproduced the same issue with ICMP pings (pings take an average of 10ms, but sometimes working pings can take up to > 10s ) LATER EDIT: i couldn't reproduce this anymore in esp-idf 2.1 |
It seems like everyone who has commented on the thread worked around the issue either by updating or by disabling hardware bignum acceleration. For IDF V2.1 & in master we currently have hardware crypto acceleration disabled by default when in dual core mode. There are some issues we need to work around, they are not particularly complex to work around but we need to make sure the workarounds do not increase the performance overhead to the point where hardware crypto acceleration is no longer significantly faster than software. I'm going to keep this issue open and rename to reflect the hardware acceleration issue. If you have an issue with the SSL handshake on current IDF (v2.1-rc or newer, or current master) and still see it with hardware MPI/bignum acceleration disabled, please open a new issue to track that. |
If scheduler switches cores in narrow window during esp_dport_access_stall_other_cpu_start(), could cause the stall interrupt to occur on the running CPU - halting the CPU until WDT cleans up. Related to #630
Please retest with the current IDF master branch (since afb6119). If an issue still occurs with hardware encryption enabled, please reopen this issue or open a new one with as many details as possible. |
Observed behavior
After opening and closing many ssl sessions, mbedtls_ssl_handshake() eventually ends up beeing stuck at some point. This happens while using esp-idf example code for tls as well as with WiFiClientSecure in the esp32-arduino environment.
Expected behavior
If something goes wrong during mbedtls_ssl_handshake() we expect an error code to be returned in order to safely proceed without having a stuck task. At the very least it should timeout somehow.
Steps to reproduce
This is not easy to reproduce. In our case we have some https connection every minute or so and some high priority tasks that run on the same core (both pinned ad CPU1) . After a day or so we only see output from other tasks in the console and the https task is stuck during handshake:
this is from the arduino lib which looks like:
@copercini can confirm that it is indeed not the while loop which is stuck but rather the mbedtls_ssl_handshake() itself.
This seems to be a know issue on FreeRTOS even on the ESP8266, especially when there are tasks present with higher priority (see SuperHouse/esp-open-rtos#182).
The text was updated successfully, but these errors were encountered: