Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TW#12797] Hardware crypto (MPI/bignum) acceleration causes hangs in mbedtls_ssl_handshake() #630

Closed
Curclamas opened this issue May 21, 2017 · 16 comments

Comments

@Curclamas
Copy link
Contributor

Observed behavior

After opening and closing many ssl sessions, mbedtls_ssl_handshake() eventually ends up beeing stuck at some point. This happens while using esp-idf example code for tls as well as with WiFiClientSecure in the esp32-arduino environment.

Expected behavior

If something goes wrong during mbedtls_ssl_handshake() we expect an error code to be returned in order to safely proceed without having a stuck task. At the very least it should timeout somehow.

Steps to reproduce

This is not easy to reproduce. In our case we have some https connection every minute or so and some high priority tasks that run on the same core (both pinned ad CPU1) . After a day or so we only see output from other tasks in the console and the https task is stuck during handshake:

[I][ssl_client.cpp:45] start_ssl_client(): Free heap before TLS 208096
[I][ssl_client.cpp:47] start_ssl_client(): Starting socket
[I][ssl_client.cpp:75] start_ssl_client(): Seeding the random number generator
[I][ssl_client.cpp:84] start_ssl_client(): Setting up the SSL/TLS structure...
[I][ssl_client.cpp:97] start_ssl_client(): Loading CA cert
[I][ssl_client.cpp:153] start_ssl_client(): Performing the SSL/TLS handshake...

this is from the arduino lib which looks like:

    while ((ret = mbedtls_ssl_handshake(&ssl_client->ssl_ctx)) != 0) {
        if (ret != MBEDTLS_ERR_SSL_WANT_READ && ret != MBEDTLS_ERR_SSL_WANT_WRITE) {
            return handle_error(ret);
        }
    }

@copercini can confirm that it is indeed not the while loop which is stuck but rather the mbedtls_ssl_handshake() itself.
This seems to be a know issue on FreeRTOS even on the ESP8266, especially when there are tasks present with higher priority (see SuperHouse/esp-open-rtos#182).

@copercini
Copy link

copercini commented May 21, 2017

on WiFiClientSecure of arduino core, this occurs more often if put a delay inside the loop:

while ((ret = mbedtls_ssl_handshake(&ssl_client->ssl_ctx)) != 0) {
        if (ret != MBEDTLS_ERR_SSL_WANT_READ && ret != MBEDTLS_ERR_SSL_WANT_WRITE) {
            return handle_error(ret);
        }
        delay(10);
        vPortYield();
}

@projectgus
Copy link
Contributor

projectgus commented May 22, 2017

To repeat the suggestion I made last year on the linked esp-open-rtos thread, if you set a socket timeout via mbedtls_ssl_conf_read_timeout() then does the hang go away? Also, if you wait a long time (five minutes plus), does the handshake eventually return?

(I've run the HTTPS example here for very long periods without anything going wrong, but I have fairly stable internet. It makes sense that a jammed or wedged socket may cause a long delay.)

We can look into adding a timeout to IDF example and possibly setting one by default in Arduino, but it would be great to verify if this fixes the issue first (as I can't reproduce, at least not so far).

@FayeY FayeY changed the title Bug: mbedtls_ssl_handshake() hangs after several calls [TW#12797] Bug: mbedtls_ssl_handshake() hangs after several calls May 23, 2017
@LupascuAndrei
Copy link

I am experiencing the same issue. By setting the mbedtls debug treshold, during handshake, this is the error detected

ssl->f_recv(_timeout)() returned 5 (-0xfffffffb)
<= fetch input
dumping 'input record header' (5 bytes)
0000: 16 03 03 01 6d ....m
input record: msgtype = 22, version = [3:3], msglen = 365
=> fetch input
in_left: 5, nb_want: 370
in_left: 5, nb_want: 370
ssl->f_recv(_timeout)() returned 365 (-0xfffffe93)
<= fetch input

@copercini
Copy link

copercini commented May 30, 2017

@projectgus Even if timeout enabled it still hangs and WDT doesn't reset.

Here is an Arduino sketch with an example: (Sorry I'm not much familiar with IDF yet)
WiFiClientSecureBurnIn.txt

For me, after 100 ~ 400 iterations it hangs

@Curclamas
Copy link
Contributor Author

@projectgus is there any progress on this issue? Could I provide any more info to shed more light on this? I have basically tried everything I could think of while still running into the same problem. Right now I've done some dirty hotfix: before the handshake I start a timer that will eventually reset the MCU after 60 seconds or so. Upon successful handshake I stop the timer. I tried "softer" solutions like stopping/deleting mbedtls instance but this did not work (result in crash or have now effect at all).

So I would be really grateful if we could find some fix for that to handle this condition more gracefully.

@mgleason
Copy link

Same issue here. Is this issue dead? Should a new one be opened?

if you set a socket timeout via mbedtls_ssl_conf_read_timeout() then does the hang go away?

Apparently, it's not quite that simple - With blocking I/O, this will only work if a non-NULL f_recv_timeout was set with mbedtls_ssl_set_bio(). So, to verify the time was actually being called, created a local copy called mbedtls_net_recv_timeout_te and added some debug statements. Then added the following just above the line: ESP_LOGI(TAG, "Performing the SSL/TLS handshake...");

mbedtls_ssl_set_bio(&ssl, &server_fd, mbedtls_net_send, mbedtls_net_recv, mbedtls_net_recv_timeout_te); mbedtls_ssl_conf_read_timeout(&conf, 10000);

Also, if you wait a long time (five minutes plus), does the handshake eventually return?

Nope, let it run for over 2.5 hrs after it hung and it's never came back.

(I've run the HTTPS example here for very long periods without anything going wrong, but I have fairly stable internet. It makes sense that a jammed or wedged socket may cause a long delay.)

Hmm, not sure how long you've let it run. Running the "esp-idf/examples/protocols/https_request" example, the "https_get_task" hangs in as little as a 1 to 8 minutes, sometimes as long as 40 or 50. Don't think it's ever taken an hour. Though come to think of it, did decrease the vTaskDelay from 1000 to 100 to speed up the cycle time. This post describes more of the background.

This pastebin is a portion of the monitor output from the latest run. Sections that were cutout to make it small enough to fit were replaced with ...\n...\n.... "Enable mbedTLS debugging" is turned on, plus additional messages added to try to track down where the hang's happening. Yea, its a mess, sorry.

Seems like the hang occurs in the call to "verify_func" which is wrapped in pk_internal.h - it's being called in mbedtls_pk_verify in pk.c.

@projectgus
Copy link
Contributor

Thanks for all these additional details.

One more question: if you go into menuconfig and disable "Enable hardware MPI (bignum) acceleration" in the mbedTLS config, does anything change?

Thanks,

@mgleason
Copy link

With that option disabled, the test has been running for 2 hours so far without hanging.

@mgleason
Copy link

mgleason commented Jun 14, 2017

Disabling "Enable hardware MPI (bignum) acceleration" seems to have addressed the issue!

  • Stopped monitoring after it ran for 13Hrs 26Min straight - without hanging.

One other question, and this may be nothing or my misunderstanding - but I'm wondering why mbedtls_net_recv_timeout_te is being called so frequently? Twice right after connecting to wifi, plus more... Maybe it's always supposed to be called every time data's received whether there's a timeout or not?!?

To see what I'm referring to, first, note that the "I (XXXX) example: The Time is: 1970-01-01 00:00:XX" message are printed by app_main on CPU0 whereas any messages generated by "https_get_task" (including mbedtls) are on CPU1. Then look at the pastebin mentioned above, you'll see we get connected, the the timeout is called twice - with at most a few 100ms delay. In fact, you'll see that "connected." happens 12 times while "TIMED OUT ***" happens 232 times!

  1. I (2776) example: The Time is: 1970-01-01 00:00:03
  2. I (2826) example: Connected.
    ...
  3. I (2876) example: The Time is: 1970-01-01 00:00:03
    ...
  4. I (2956) example: *** TIMED OUT *** - called mbedtls_net_recv_timeout
  5. I (2976) example: The Time is: 1970-01-01 00:00:03
    ...
  6. I (3006) example: *** TIMED OUT *** - called mbedtls_net_recv_timeout

The code that sets it up looks like:

ESP_LOGI(TAG, "Connected.");
mbedtls_ssl_set_bio(&ssl, &server_fd, mbedtls_net_send, mbedtls_net_recv, mbedtls_net_recv_timeout_te); 
mbedtls_ssl_conf_read_timeout(&conf, 10000);
ESP_LOGI(TAG, "Performing the SSL/TLS handshake...");
...

int mbedtls_net_recv_timeout_te( void *ctx, unsigned char *buf, size_t len,
                              uint32_t timeout )
{
    int ret;
    struct timeval tv;
    fd_set read_fds;
    int fd = ((mbedtls_net_context *) ctx)->fd;

    ESP_LOGI(TAG, "*** TIMED OUT *** - called mbedtls_net_recv_timeout");
...

@projectgus
Copy link
Contributor

projectgus commented Jun 15, 2017

Disabling "Enable hardware MPI (bignum) acceleration" seems to have addressed the issue!

Fantastic, I'm glad to hear this. There is a related bug we are tracking with MPI, that may be related to this. Will update this issue once we have a fix.

I'll also try again to reproduce, maybe there's been a regression that I somehow didn't have the last time I ran the HTTPS example for an extended period (or maybe howsmyssl.com server has changed TLS parameters in a way which exposes this).

Maybe it's always supposed to be called every time data's received whether there's a timeout or not?!?

Yes. mbedtls_net_recv_timeout should be read as "receive data into buffer, with timeout". It should return after any number of bytes are received, or once the timeout is reached with no bytes received.

@LupascuAndrei
Copy link

LupascuAndrei commented Jun 15, 2017

I'm facing the same issue, disabling hardware bignum acceleration didn't solve it.
I can also reproduce it by making only one handshake per run ( handshake -> restart -> handshake ->restart ... )

Using RSA 4096 keys

EDIT: on my issue, there seems to be a problem with the wifi module instead of mbedtls, since it sometimes also gets stuck in socker.recv() if the esp32 is far from the wi-fi modem.

@LupascuAndrei
Copy link

LupascuAndrei commented Jun 21, 2017

my issue was fixed by using the esp-wroom-32 chip instead of the esp32 from sparkfun
edit : it only reduced the rate that the ssl handshake fails

@Curclamas
Copy link
Contributor Author

@projectgus disabling MPI seams to fix this issue for me as well. Also interestingly enough HTTPS connections, especially the whole handshake section seams to be twice as fast now. Shouldn't it be the other way round?
Do you have any possible explanation? Is there a fix in the work regarding the MPI-Issue and what is the actual cause? Do you need benchmarks for the "MPI acutally slower"-thing? I mean this leaves room for some hope that it could actually be even faster with the MPI-Bugs fixed.

@LupascuAndrei also interesting that you seamed to have some kind of hardware issue. I think it would be also interesting if someone could pin that down so we can avoid similar issues when developing hardware for the ESP32 chips.

@LupascuAndrei
Copy link

LupascuAndrei commented Jun 22, 2017

after further investigations, when "failing", the handshake just takes way too long to complete
if i increase the ssl handshake timeout limit on the server i'm trying to connect, mbedtls never fails to connect anymore
the average, working handshake takes ~ 2s for 2048 RSA keys
but a handshake can take up to ~90s if i remove the handshake timeout limit.

I reproduced the same issue with ICMP pings (pings take an average of 10ms, but sometimes working pings can take up to > 10s )

LATER EDIT: i couldn't reproduce this anymore in esp-idf 2.1

@projectgus
Copy link
Contributor

It seems like everyone who has commented on the thread worked around the issue either by updating or by disabling hardware bignum acceleration.

For IDF V2.1 & in master we currently have hardware crypto acceleration disabled by default when in dual core mode. There are some issues we need to work around, they are not particularly complex to work around but we need to make sure the workarounds do not increase the performance overhead to the point where hardware crypto acceleration is no longer significantly faster than software.

I'm going to keep this issue open and rename to reflect the hardware acceleration issue.

If you have an issue with the SSL handshake on current IDF (v2.1-rc or newer, or current master) and still see it with hardware MPI/bignum acceleration disabled, please open a new issue to track that.

@projectgus projectgus changed the title [TW#12797] Bug: mbedtls_ssl_handshake() hangs after several calls [TW#12797] Hardware crypto (MPI/bignum) acceleration causes hangs in mbedtls_ssl_handshake() Jul 20, 2017
igrr pushed a commit that referenced this issue Sep 1, 2017
If scheduler switches cores in narrow window during esp_dport_access_stall_other_cpu_start(), could cause the stall
interrupt to occur on the running CPU - halting the CPU until WDT cleans up.

Related to #630
@projectgus
Copy link
Contributor

Please retest with the current IDF master branch (since afb6119). If an issue still occurs with hardware encryption enabled, please reopen this issue or open a new one with as many details as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants