raft: Avoid returning errors from ProcessRaftMessage #1779

aaronlehmann · 2016-11-30T20:21:10Z

If the ProcessRaftMessage RPC returns an error, the client treats that
as a potential transport-level error, and tries to reestablish a
connection.

In some cases this can cause a feedback loop. If ProcessRaftMessage
can't successfully check the health of the sending node, it returns an
error. That causes the sending node to bounce its outgoing connection,
which results in another health check failure.

To solve this, only return an error from ProcessRaftMessage when it is
necessary to communicate to the client that it has been removed from the
cluster. Ideally, I would fix this by having the client check
specifically for a transport-level error before bouncing the connection,
but there doesn't seem to be a reliable way to do this. Transport errors
can end up with many different codes that are commonly returned by RPC
handlers, including Internal, Unavailable, FailedPrecondition,
DeadlineExceeded, and Cancelled.

cc @LK4D4 @cyli

cyli · 2016-11-30T21:39:50Z

manager/state/raft/raft.go

@@ -939,19 +939,15 @@ func (n *Node) ProcessRaftMessage(ctx context.Context, msg *api.ProcessRaftMessa
 		// current architecture depends on only the leader
 		// making proposals, so in-flight proposals can be
 		// guaranteed not to conflict.
-		return nil, grpc.Errorf(codes.InvalidArgument, "proposals not accepted")
+		return &api.ProcessRaftMessageResponse{}, nil


Should something be logged here to help with debugging?

cyli · 2016-11-30T21:40:22Z

manager/state/raft/raft.go

-	if err := n.raftNode.Step(ctx, *msg.Message); err != nil {
-		return nil, err
+	if n.IsMember() {
+		n.raftNode.Step(ctx, *msg.Message)


Do we need to handle this error? It was previously returned. If we don't actually care about the error, should we at least log it?

cyli · 2016-11-30T21:46:03Z

@aaronlehmann I don't actually have a preference, but to achieve your ideal implementation, is there an error code that transport errors cannot be? If so, can we return that error code and the resultant error, and the client can explicitly only bounce if it's not that particular error code? Alternately, can we wrap the non-transport errors in a type that can be checked for?

codecov-io · 2016-11-30T21:59:06Z

Current coverage is 55.02% (diff: 57.14%)

Merging #1779 into master will decrease coverage by 0.07%

@@             master      #1779   diff @@
==========================================
  Files           102        102          
  Lines         16875      16884     +9   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits           9297       9290     -7   
- Misses         6425       6436    +11   
- Partials       1153       1158     +5

Powered by Codecov. Last update a1801a7...e9b7cb1

aaronlehmann · 2016-11-30T22:18:16Z

I've added debug logging.

> I don't actually have a preference, but to achieve your ideal implementation, is there an error code that transport errors cannot be?

Probably, but it doesn't seem safe to rely on this going forward. Also, this error code wouldn't be the right one to return for some of these situations.

> If so, can we return that error code and the resultant error, and the client can explicitly only bounce if it's not that particular error code? Alternately, can we wrap the non-transport errors in a type that can be checked for?

It's not possible to wrap these errors because they cross an RPC boundary.

cyli · 2016-11-30T23:10:41Z

> It's not possible to wrap these errors because they cross an RPC boundary.

Ah right, we check the desc text to see if the node has been removed. :|

Well, LGTM, since I can't think of a way around that unless we wanted to parse the description to tell what type it was, which seems fragile, or unless we wanted to add a field to ProcessRaftMessageResponse.

LK4D4 · 2016-12-02T00:36:29Z

LGTM

GordonTheTurtle added the dco/no label Nov 30, 2016

aaronlehmann force-pushed the raft-feedback-loop-2 branch from 8e60017 to a3f7705 Compare November 30, 2016 20:21

GordonTheTurtle added the dco/no label Nov 30, 2016

aaronlehmann force-pushed the raft-feedback-loop-2 branch from a3f7705 to e937cdb Compare November 30, 2016 20:21

GordonTheTurtle removed the dco/no label Nov 30, 2016

aaronlehmann added the priority/P1 label Nov 30, 2016

aaronlehmann added this to the 1.13.0 milestone Nov 30, 2016

aaronlehmann added the area/raft label Nov 30, 2016

cyli reviewed Nov 30, 2016

aaronlehmann force-pushed the raft-feedback-loop-2 branch from e937cdb to e9b7cb1 Compare November 30, 2016 22:17

aaronlehmann mentioned this pull request Dec 1, 2016

raft: transport package #1748

Merged

LK4D4 merged commit 32eea3b into moby:master Dec 2, 2016

aaronlehmann deleted the raft-feedback-loop-2 branch December 2, 2016 00:40

aaronlehmann added a commit to aaronlehmann/swarmkit that referenced this pull request Dec 2, 2016

Fix race in ProcessRaftMessage logging

9c19379

Fix silly race in ProcessRaftMessage logging introduced by moby#1779. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

aaronlehmann mentioned this pull request Dec 2, 2016

Fix race in ProcessRaftMessage logging #1786

Merged

aaronlehmann added a commit that referenced this pull request Dec 2, 2016

Fix race in ProcessRaftMessage logging

860e6e8

Fix silly race in ProcessRaftMessage logging introduced by #1779. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com> (cherry picked from commit 9c19379)

aaronlehmann added the process/cherry-picked label Dec 2, 2016

aaronlehmann mentioned this pull request Dec 2, 2016

[1.13] Vendor swarmkit moby/moby#29049

Merged

AkihiroSuda mentioned this pull request Dec 5, 2016

[master] Vendor swarmkit moby/moby#29117

Merged

aaronlehmann mentioned this pull request Dec 6, 2016

Ensure that the node status is no longer pending when checking lock/unlock. moby/moby#29155

Closed
