test: fix flaky test-http-chunked-304 on SmartOS #3903


claudiorodriguez
Contributor

SmartOS has an issue where it will trigger ECONNREFUSED when it
should not. See https://smartos.org/bugview/OS-2767.

This change adds logic to test-http-chunked-304 to work around
the issue. See also similar issue: #2663

Fixes: #3864

@mscdex mscdex added http Issues or PRs related to the http subsystem. test Issues and PRs related to the tests. smartos Issues and PRs related to the SmartOS platform. labels Nov 18, 2015
@claudiorodriguez claudiorodriguez force-pushed the test-flaky-http-chunked-smartos branch from e7efad6 to ade9bfb Compare November 18, 2015 20:39
@Trott
Member

Trott commented Nov 18, 2015

If CI is happy, this LGTM. EDIT: Stress test CI is not happy.

/cc @indutny since he wrote the test and it has had no substantial modification by anyone else until now.

CI stress test with this change:

CI stress test without this change (should show failures):

  • Not running yet, but I'll edit this when it is running. Will probably be https://ci.nodejs.org/job/node-stress-single-test/22/nodes=smartos14-32/console but right now that just redirects to the above

CI for this PR:

@indutny
Member

indutny commented Nov 18, 2015

What exactly happens on SmartOS? How could we get ECONNREFUSED after starting the server?

@Trott
Member

Trott commented Nov 18, 2015

@indutny: Does the explanation at the bottom of https://smartos.org/bugview/OS-2767 offer a plausible mechanism for spurious ECONNREFUSED in this situation?

@Trott
Member

Trott commented Nov 18, 2015

Bad news anyway: The stress test still gets failures with this fix in place.

events.js:141
      throw er; // Unhandled 'error' event
      ^

Error: connect ECONNREFUSED 127.0.0.1:12346
    at Object.exports._errnoException (util.js:915:11)
    at exports._exceptionWithHostPort (util.js:938:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1065:14)

@indutny
Member

indutny commented Nov 18, 2015

Gosh, really? I think most of the tests should be broken on SmartOS then. Perhaps we should mark the whole platform as flaky, or restart tests on ECONNRESET?

@claudiorodriguez
Contributor Author

Hmmm. Now that I'm at home and looking at the code, it kinda makes sense to me: since we're talking about "connection refused", the "connect" listener isn't called at all in that case, and since the "error" listener only gets bound inside the "connect" listener... yeah. Let me just make another change. I wish this issue was easier to test.
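
To illustrate the point in this comment: if the 'error' listener is attached only inside the connect callback, a refused connection never reaches it and Node throws an unhandled 'error' event, as in the stack trace above. The following is a minimal sketch, not the code from this PR; `connectWithHandlers` and its parameters are hypothetical names introduced for illustration.

```javascript
'use strict';
const net = require('net');

// Hypothetical helper (not from the PR): attach the 'error' listener right
// after the socket is created, so a refused connection is delivered to
// onError instead of surfacing as an unhandled 'error' event.
function connectWithHandlers(port, onConnect, onError) {
  const conn = net.createConnection(port);
  // Bound immediately, not inside the 'connect' listener.
  conn.on('error', onError);
  conn.on('connect', function() {
    onConnect(conn);
  });
  return conn;
}
```

A test could call `connectWithHandlers(common.PORT, ...)` inside the `server.listen()` callback and write the request from `onConnect`, while `onError` decides whether to fail or retry.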

@Trott
Member

Trott commented Nov 18, 2015

OK, in that case, I'm going to terminate the running CI because it's tying up other tests that are waiting for a 32-bit SmartOS machine.

@claudiorodriguez claudiorodriguez force-pushed the test-flaky-http-chunked-smartos branch from ade9bfb to 210b623 Compare November 18, 2015 23:11
@claudiorodriguez
Contributor Author

Yeah, good call. Just made a commit. I'd say if a failure comes up, also terminate it and look for an alternate solution.

@Trott
Member

Trott commented Nov 19, 2015

@@ -24,20 +24,39 @@ function test(statusCode, next) {
  });

  server.listen(common.PORT, function() {
    var conn = net.createConnection(common.PORT, function() {
      conn.write('GET / HTTP/1.1\r\n\r\n');
      var
Member


Linting fails due to trailing space on this line. Can you fix that and use make jslint to confirm no other oddities slipped in?

@Trott
Member

Trott commented Nov 19, 2015

Stress test looks good. Can you fix the minor linting problem?

@claudiorodriguez claudiorodriguez force-pushed the test-flaky-http-chunked-smartos branch from 210b623 to 89d67c3 Compare November 19, 2015 11:59
@claudiorodriguez
Contributor Author

Oops, sorry, still getting used to the project. Fixed and jslint is coming up clear.

@indutny
Member

indutny commented Nov 19, 2015

@Trott why do we want to land this? Why other tests are not failing with ECONNRESET?

@Trott
Member

Trott commented Nov 19, 2015

@indutny I'll take your second question first:

Why other tests are not failing with ECONNRESET?

(It's ECONNREFUSED rather than ECONNRESET.)

We definitely see it on other tests. In #2663 we saw it a lot on test-net-server-max-connections.js because that test opens hundreds of connections. It was fixed in #3830.

It doesn't pop up as often on other tests probably because they typically only open a few connections and not hundreds. But we do see it.

Here are two more occurrences on different tests since this test was designated flaky a few days ago.

not ok 370 test-http-flush-headers.js
#events.js:141
#      throw er; // Unhandled 'error' event
#      ^
#
#Error: connect ECONNREFUSED 127.0.0.1:12346
#    at Object.exports._errnoException (util.js:872:11)
#    at exports._exceptionWithHostPort (util.js:895:20)
#    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1065:14)
not ok 522 test-net-dns-custom-lookup.js
#events.js:141
#      throw er; // Unhandled 'error' event
#      ^
#
#Error: connect ECONNREFUSED 127.0.0.1:12346
#    at Object.exports._errnoException (util.js:915:11)
#    at exports._exceptionWithHostPort (util.js:938:20)
#    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1065:14)

why do we want to land this?

It's a way to let the test not be flaky on SmartOS. I admit that part of it doesn't pass the smell test for me. The retry logic seemed OK in the max-server-connections test mentioned above, but to move it to a bunch of other tests... ¯\_(ツ)_/¯ I'm certainly open to other ideas. Things I've thought of:

  • I'm not sure how SmartOS is versioned and how EOL works with it and all, but we could decide we don't support the OS/hardware combinations where we see this problem popping up and stop testing on it.
  • Or we could just skip the flaky tests on SmartOS rather than introducing the more convoluted retry logic we're introducing here.
  • Maybe there's an OS tweak we could do that will make this problem go away. SmartOS is a Joyent thing, right? Is there someone we can refer this issue to?
  • We could mark the tests as flaky on SmartOS with no intention of making them not-flaky ever. I don't like that option, but it is an option.

Other ideas?
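
For readers unfamiliar with the retry approach discussed here, the following is a rough sketch of what per-test retry logic can look like. It is not the code from this PR or from #3830; `retryOnRefused`, `connectAttempt`, and the retry count are hypothetical and chosen for illustration only.

```javascript
'use strict';
const net = require('net');

// Hypothetical helper: call attempt(cb) and run it again when cb receives an
// error whose code is ECONNREFUSED, up to maxRetries additional times.
// Any other error, or success, is passed straight to done.
function retryOnRefused(attempt, maxRetries, done) {
  let retriesLeft = maxRetries;
  (function run() {
    attempt(function(err, result) {
      if (err && err.code === 'ECONNREFUSED' && retriesLeft > 0) {
        retriesLeft--;
        return run();
      }
      done(err, result);
    });
  })();
}

// Hypothetical attempt function for a plain TCP connection. After a
// successful connect the temporary 'error' listener is removed, so the
// caller is expected to attach its own error handling to the socket.
function connectAttempt(port) {
  return function(cb) {
    const conn = net.createConnection(port);
    conn.once('error', cb);
    conn.once('connect', function() {
      conn.removeListener('error', cb);
      cb(null, conn);
    });
  };
}
```

Something like `retryOnRefused(connectAttempt(common.PORT), 3, callback)` would then replace a direct `net.createConnection()` call. Repeating this in every affected test is the duplication the thread is worried about.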

@indutny
Member

indutny commented Nov 19, 2015

Or we could just skip the flaky tests on SmartOS rather than introducing the more convoluted retry logic we're introducing here.

This is what I propose cc @bnoordhuis

@claudiorodriguez
Contributor Author

Even if we go that way I'd suggest keeping the change that extracts the listener bindings out of the connect listener, and also keeping the error handling - basically everything in this PR except the reconnect logic (on a new PR).

@indutny
Member

indutny commented Nov 19, 2015

What's the reasoning for this, @fansworld-claudio? The test appears to be working fine everywhere else.

@claudiorodriguez
Contributor Author

Making it easier to implement future changes I guess, mainly. It's true that it works right now, in the context it's executed. I guess not then.

@indutny
Member

indutny commented Nov 19, 2015

Well, it is a test, so I'm not sure if future-proofing it is reasonable. Sorry for this!

@claudiorodriguez
Contributor Author

Don't worry, like I said I'm still getting used to the project and it makes perfect sense.
On the subject of skipping the tests, I agree that the reconnect logic is kind of an ugly workaround, but skipping an entire platform also seems less than ideal. Is it possible to narrow down the criteria a bit?

@Trott
Member

Trott commented Nov 19, 2015

Unfortunately, we have two SmartOS builds in the test CI and both of them are affected. This would seem to only affect tests where a server is set up and then an attempt is made to connect to that server. That's a lot of tests under net and http, but I wouldn't want to skip any of them prematurely because there are probably only certain conditions that trip the bug. So maybe we just add the standard three-or-four lines of skip code at the beginning of a test if:

  • the test fails with ECONNREFUSED for no apparent reason on CI and
  • we run node-stress-single-test to confirm the flakiness.

The standard three-or-four lines I'm talking about look like this:

if (common.isSunOS) {
  console.log('1..0 # Skipped: Reason for skipping here.');
  return;
}

This will be a lot of tests in the end, and we will want to be able to roll them all back should the bug get fixed in SmartOS. So we'll want to make sure the reason provided for skipping is identical in all the tests so that we can find them easily.
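
One way to keep the skip reason identical everywhere, sketched under assumptions: the constant name, the reason text, and the helper below are hypothetical, not part of `test/common.js`. The platform check assumes `process.platform` reports `'sunos'` on SmartOS, which is what `common.isSunOS` tested for at the time.

```javascript
'use strict';

// Hypothetical constant: one exact string reused by every skipped test, so a
// single search finds them all if the SmartOS bug is fixed later.
const SMARTOS_SKIP_REASON = 'spurious ECONNREFUSED on SmartOS (OS-2767)';

// Hypothetical helper: returns the TAP skip line for SmartOS, or null when
// the test should run normally.
function smartosSkipLine(platform) {
  if (platform !== 'sunos') return null;
  return '1..0 # Skipped: ' + SMARTOS_SKIP_REASON;
}

// Intended use at the top of an affected test file:
//   var line = smartosSkipLine(process.platform);
//   if (line) { console.log(line); return; }
```

With a shared string, rolling the skips back later is a search for the reason text rather than a hunt through individual tests.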

@indutny
Member

indutny commented Nov 19, 2015

We could just retry the test several times if it fails with ECONNREFUSED, maybe 2 or 3 times before failing.

@Trott
Member

Trott commented Nov 20, 2015

Retrying the test if we're on SmartOS and get an ECONNREFUSED is basically what this PR does, isn't it?

@indutny
Member

indutny commented Nov 20, 2015

@Trott I mean doing this automatically on CI, regardless of particular test name. Again, there are lots of tests with similar semantics, and doing this hack for all of them seems to be pointless to me.

@Trott
Member

Trott commented Nov 20, 2015

That would probably mean putting the logic for it in the Python test wrapper/harness. That should work.
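
The harness-level idea can be sketched as follows. This is written in JavaScript only to match the test code in this thread. The real harness is Python, and nothing here is taken from it or from #3941. `runNodeTest`, `runWithRetry`, and the `maxRuns` parameter are hypothetical names.

```javascript
'use strict';
const spawnSync = require('child_process').spawnSync;

// Hypothetical: run one test file in a child Node process and capture its
// exit status and stderr as a string.
function runNodeTest(testFile) {
  return spawnSync(process.execPath, [testFile], { encoding: 'utf8' });
}

// Hypothetical: call runOnce() up to maxRuns times (assumed >= 1). Stop as
// soon as a run passes, or fails for a reason other than ECONNREFUSED.
function runWithRetry(runOnce, maxRuns) {
  let result;
  let runs = 0;
  while (runs < maxRuns) {
    result = runOnce();
    runs++;
    const refused = /ECONNREFUSED/.test(String(result.stderr || ''));
    if (result.status === 0 || !refused) break;
  }
  return { result: result, runs: runs };
}
```

A harness would call something like `runWithRetry(function() { return runNodeTest(file); }, 3)` for every test, possibly only on SmartOS, so no individual test needs its own reconnect logic.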

@indutny
Member

indutny commented Nov 20, 2015

Exactly! ;)

@Trott
Member

Trott commented Nov 20, 2015

#3941

@claudiorodriguez
Contributor Author

Closing this, then.
