Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: Avoid returning errors from ProcessRaftMessage #1779

Merged
merged 1 commit into from
Dec 2, 2016

Conversation

aaronlehmann
Copy link
Collaborator

If the ProcessRaftMessage RPC returns an error, the client treats that
as a potential transport-level error, and tries to reestablish a
connection.

In some cases this can cause a feedback loop. If ProcessRaftMessage
can't successfully check the health of the sending node, it returns an
error. That causes the sending node to bounce its outgoing connection,
which results in another health check failure.

To solve this, only return an error from ProcessRaftMessage when it is
necessary to communicate to the client that it has been removed from the
cluster. Ideally, I would fix this by having the client check
specifically for a transport-level error before bouncing the connection,
but there doesn't seem to be a reliable way to do this. Transport errors
can end up with many different codes that are commonly returned by RPC
handlers, including Internal, Unavailable, FailedPrecondition,
DeadlineExceeded, and Cancelled.

cc @LK4D4 @cyli

@@ -939,19 +939,15 @@ func (n *Node) ProcessRaftMessage(ctx context.Context, msg *api.ProcessRaftMessa
// current architecture depends on only the leader
// making proposals, so in-flight proposals can be
// guaranteed not to conflict.
return nil, grpc.Errorf(codes.InvalidArgument, "proposals not accepted")
return &api.ProcessRaftMessageResponse{}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should something be logged here to help with debugging?

if err := n.raftNode.Step(ctx, *msg.Message); err != nil {
return nil, err
if n.IsMember() {
n.raftNode.Step(ctx, *msg.Message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to handle this error? It was previously returned. If we don't actually care about the error, should we at least log it?

@cyli
Copy link
Contributor

cyli commented Nov 30, 2016

@aaronlehmann I don't actually have a preference, but to achieve your ideal implementation, is there an error code that transport errors cannot be? If so, can we return that error code and the resultant error, and the client can explicitly only bounce if it's not that particular error code? Alternately, can we wrap the non-transport errors in a type that can be checked for?

@codecov-io
Copy link

codecov-io commented Nov 30, 2016

Current coverage is 55.02% (diff: 57.14%)

Merging #1779 into master will decrease coverage by 0.07%

@@             master      #1779   diff @@
==========================================
  Files           102        102          
  Lines         16875      16884     +9   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits           9297       9290     -7   
- Misses         6425       6436    +11   
- Partials       1153       1158     +5   

Sunburst

Powered by Codecov. Last update a1801a7...e9b7cb1

If the ProcessRaftMessage RPC returns an error, the client treats that
as a potential transport-level error, and tries to reestablish a
connection.

In some cases this can cause a feedback loop. If ProcessRaftMessage
can't successfully check the health of the sending node, it returns an
error. That causes the sending node to bounce its outgoing connection,
which results in another health check failure.

To solve this, only return an error from ProcessRaftMessage when it is
necessary to communicate to the client that it has been removed from the
cluster. Ideally, I would fix this by having the client check
specifically for a transport-level error before bouncing the connection,
but there doesn't seem to be a reliable way to do this. Transport errors
can end up with many different codes that are commonly returned by RPC
handlers, including Internal, Unavailable, FailedPrecondition,
DeadlineExceeded, and Cancelled.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann
Copy link
Collaborator Author

I've added debug logging.

I don't actually have a preference, but to achieve your ideal implementation, is there an error code that transport errors cannot be?

Probably, but it doesn't seem safe to rely on this going forward. Also, this error code wouldn't be the right one to return for some of these situations.

If so, can we return that error code and the resultant error, and the client can explicitly only bounce if it's not that particular error code? Alternately, can we wrap the non-transport errors in a type that can be checked for?

It's not possible to wrap these errors because they cross an RPC boundary.

@cyli
Copy link
Contributor

cyli commented Nov 30, 2016

It's not possible to wrap these errors because they cross an RPC boundary.

Ah right, we check the desc text to see if the node has been removed. :|

Well, LGTM, since I can't think of a way around that unless we wanted to parse the description to tell what type it was, which seems fragile. Or we wanted to add to ProcessRaftMessageResponse.

@LK4D4
Copy link
Contributor

LK4D4 commented Dec 2, 2016

LGTM

@LK4D4 LK4D4 merged commit 32eea3b into moby:master Dec 2, 2016
@aaronlehmann aaronlehmann deleted the raft-feedback-loop-2 branch December 2, 2016 00:40
aaronlehmann added a commit to aaronlehmann/swarmkit that referenced this pull request Dec 2, 2016
Fix silly race in ProcessRaftMessage logging introduced by moby#1779.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
aaronlehmann added a commit that referenced this pull request Dec 2, 2016
Fix silly race in ProcessRaftMessage logging introduced by #1779.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
(cherry picked from commit 9c19379)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants