-
Notifications
You must be signed in to change notification settings - Fork 617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
raft: Avoid returning errors from ProcessRaftMessage #1779
Conversation
8e60017
to
a3f7705
Compare
a3f7705
to
e937cdb
Compare
@@ -939,19 +939,15 @@ func (n *Node) ProcessRaftMessage(ctx context.Context, msg *api.ProcessRaftMessa | |||
// current architecture depends on only the leader | |||
// making proposals, so in-flight proposals can be | |||
// guaranteed not to conflict. | |||
return nil, grpc.Errorf(codes.InvalidArgument, "proposals not accepted") | |||
return &api.ProcessRaftMessageResponse{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should something be logged here to help with debugging?
if err := n.raftNode.Step(ctx, *msg.Message); err != nil { | ||
return nil, err | ||
if n.IsMember() { | ||
n.raftNode.Step(ctx, *msg.Message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to handle this error? It was previously returned. If we don't actually care about the error, should we at least log it?
@aaronlehmann I don't actually have a preference, but to achieve your ideal implementation, is there an error code that transport errors cannot be? If so, can we return that error code and the resultant error, and the client can explicitly only bounce if it's not that particular error code? Alternately, can we wrap the non-transport errors in a type that can be checked for? |
Current coverage is 55.02% (diff: 57.14%)@@ master #1779 diff @@
==========================================
Files 102 102
Lines 16875 16884 +9
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
- Hits 9297 9290 -7
- Misses 6425 6436 +11
- Partials 1153 1158 +5
|
If the ProcessRaftMessage RPC returns an error, the client treats that as a potential transport-level error, and tries to reestablish a connection. In some cases this can cause a feedback loop. If ProcessRaftMessage can't successfully check the health of the sending node, it returns an error. That causes the sending node to bounce its outgoing connection, which results in another health check failure. To solve this, only return an error from ProcessRaftMessage when it is necessary to communicate to the client that it has been removed from the cluster. Ideally, I would fix this by having the client check specifically for a transport-level error before bouncing the connection, but there doesn't seem to be a reliable way to do this. Transport errors can end up with many different codes that are commonly returned by RPC handlers, including Internal, Unavailable, FailedPrecondition, DeadlineExceeded, and Cancelled. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
e937cdb
to
e9b7cb1
Compare
I've added debug logging.
Probably, but it doesn't seem safe to rely on this going forward. Also, this error code wouldn't be the right one to return for some of these situations.
It's not possible to wrap these errors because they cross an RPC boundary. |
Ah right, we check the desc text to see if the node has been removed. :| Well, LGTM, since I can't think of a way around that unless we wanted to parse the description to tell what type it was, which seems fragile. Or we wanted to add to |
LGTM |
Fix silly race in ProcessRaftMessage logging introduced by moby#1779. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
If the ProcessRaftMessage RPC returns an error, the client treats that
as a potential transport-level error, and tries to reestablish a
connection.
In some cases this can cause a feedback loop. If ProcessRaftMessage
can't successfully check the health of the sending node, it returns an
error. That causes the sending node to bounce its outgoing connection,
which results in another health check failure.
To solve this, only return an error from ProcessRaftMessage when it is
necessary to communicate to the client that it has been removed from the
cluster. Ideally, I would fix this by having the client check
specifically for a transport-level error before bouncing the connection,
but there doesn't seem to be a reliable way to do this. Transport errors
can end up with many different codes that are commonly returned by RPC
handlers, including Internal, Unavailable, FailedPrecondition,
DeadlineExceeded, and Cancelled.
cc @LK4D4 @cyli