ringhash: Delegate subchannel creation to pickfirst #8047
Conversation
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## master #8047 +/- ##
==========================================
+ Coverage 82.10% 82.23% +0.12%
==========================================
Files 387 387
Lines 39067 38928 -139
==========================================
- Hits 32076 32011 -65
+ Misses 5662 5598 -64
+ Partials 1329 1319 -10
LGTM. Just some comment changes that I noticed...
}
	return addrsUpdated

	b.updatePickerLocked(endpointEnteredTF)
Could you please write a comment or two about why we trigger connecting to the next idle endpoint only when this condition is true? Thanks.
There is a comment in updatePickerLocked, where the boolean is used. endpointEnteredTF was added based on the pseudocode in A61:
if (in_transient_failure && endpoint_entered_transient_failure) {
On re-reading the existing comment in updatePickerLocked, I think that the endpoint_entered_transient_failure condition can cause ring hash to remain in TF in certain situations where the aggregated state is TF, but no endpoint actually enters TF:
- Scenario 1: There are four endpoints in the following states: TF, TF, READY, and IDLE. If the READY endpoint fails, it transitions to IDLE, resulting in the new states: TF, TF, IDLE, IDLE.
- Scenario 2: There are four endpoints in the following states: TF, TF, CONNECTING, and IDLE. If the CONNECTING endpoint is removed, the new states become: TF, TF, IDLE.
I've asked about this in the maintainers group.
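For reference, here is a minimal Go sketch of the A61 condition under discussion, assuming an aggregated state computed by the balancer and the endpointEnteredTF boolean from this diff (the package, function name, and signature are illustrative, not the actual grpc-go code):

package ringhashsketch

import "google.golang.org/grpc/connectivity"

// shouldConnectNextIdle mirrors the A61 pseudocode: connect to the next idle
// endpoint only if the aggregated state is TRANSIENT_FAILURE and some endpoint
// just entered TRANSIENT_FAILURE. In the two scenarios above the aggregate is
// TF but no endpoint newly enters TF, so this stays false and ring hash can
// remain stuck in TF.
func shouldConnectNextIdle(aggregatedState connectivity.State, endpointEnteredTF bool) bool {
	return aggregatedState == connectivity.TransientFailure && endpointEnteredTF
}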
This is a great finding. Please let us know what the outcome of this discussion is, as IIUC this is happening on a Google-private channel. Also, it would be nice to demonstrate the behavior of those scenarios via end-to-end tests.
Update: These edge cases were overlooked when writing A61. There is a discussion going on about the best way to handle such cases. I plan to remove the in_transient_failure condition in Go, add unit tests to verify ringhash recovers from TF in both these cases, and add a TODO to update the implementation based on the result of the discussion later.
@@ -163,14 +164,22 @@ func (s) TestRingHash_ReconnectToMoveOutOfTransientFailure(t *testing.T) {

// startTestServiceBackends starts num stub servers. It returns their addresses.
Nit: The return type has changed. Please update the comment.
Updated the comment.
var otherEndpointAddr string
var backendIdx int
switch initialAddr {
case backendAddrs[0]:
	otherEndpointAddr = backendAddrs[1]
	backendIdx = 0
case backendAddrs[1]:
	otherEndpointAddr = backendAddrs[0]
	backendIdx = 1
case backendAddrs[2]:
	otherEndpointAddr = backendAddrs[3]
	backendIdx = 2
case backendAddrs[3]:
	otherEndpointAddr = backendAddrs[2]
	backendIdx = 3
default:
	t.Fatalf("Request sent to unknown backend: %q", initialAddr)
I don't quite understand this. Shouldn't the first RPC always reach the first address in the endpoint, i.e. backendAddrs[0]?
Yes, you're correct. The cases for the second address should never be used. Removed the extra cases now.
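A hedged sketch of how the simplified selection could look after dropping the unreachable cases (the helper name and shape are made up for illustration; the test uses backendAddrs, initialAddr, and the standard testing package, and the PR may structure this differently):

// otherAddrInEndpoint returns the other address in the first endpoint, under
// the assumption stated above that the first RPC always lands on
// backendAddrs[0].
func otherAddrInEndpoint(t *testing.T, backendAddrs []string, initialAddr string) string {
	t.Helper()
	if initialAddr != backendAddrs[0] {
		t.Fatalf("Request sent to unknown backend: %q", initialAddr)
	}
	// backendAddrs[0] and backendAddrs[1] form the first endpoint, so the
	// "other" address within that endpoint is backendAddrs[1].
	return backendAddrs[1]
}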
just minor nits, otherwise lgtm.
ring   *ring
logger *grpclog.PrefixLogger
// endpointStates is a cache of endpoint connectivity states and pickers.
// The cache is used to avoid locking the mutex for reading the state at RPC
I had trouble understanding this comment. What mutex are you referring to?
The ringhash balancer stores endpoint states in a resolver.EndpointMap, with access guarded by ringhashBalancer.mu. The endpointStates cache in the picker helps avoid locking the ringhash balancer's mutex when reading the latest state at RPC time. This also means that the picker needs to be regenerated every time a child generates a new picker.
Mentioned this in the comment.
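A minimal sketch of the caching idea described above, with illustrative names rather than the exact grpc-go types:

package ringhashsketch

import "google.golang.org/grpc/balancer"

type picker struct {
	// endpointStates is an immutable snapshot of the children's connectivity
	// states and pickers, built while holding the ringhash balancer's mutex
	// (ringhashBalancer.mu). Because the snapshot is never mutated after
	// construction, Pick can read it without taking that mutex; the trade-off
	// is that a new picker must be generated whenever any child reports a new
	// picker or state.
	endpointStates map[string]balancer.State
}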
// Per gRFC A61, because of sticky-TF with PickFirst's auto reconnect on TF,
// we ignore all TF subchannels and find the first ring entry in READY,
// CONNECTING or IDLE. If that entry is in IDLE, we need to initiate a
// connection. The idlePicker returned by the LazyLB or the new Pickfirst
Why "or the new PickFirst"? Isn't exitIdle
always triggered by the lazy balancer?
The lazy balancer's idle picker is used the first time an endpoint exits idle state. If a subchannel fails after connecting to the backend, the pickfirst balancer will enter idle, but this time ringhash would get pickfirst's idle picker instead of the lazy balancer's idle picker.
OK, makes sense.
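A hedged sketch of the ring scan that comment describes, assuming the per-entry connectivity states are available as a slice (this illustrates the gRFC A61 rule quoted above, not the actual picker code):

package ringhashsketch

import "google.golang.org/grpc/connectivity"

// firstUsableEntry walks the ring starting at the hashed index, skipping
// entries in TRANSIENT_FAILURE (sticky-TF: pick_first keeps reconnecting on
// its own) and returning the first entry in READY, CONNECTING or IDLE.
// needConnect reports whether the caller must ask the entry to exit idle
// before the RPC can proceed.
func firstUsableEntry(states []connectivity.State, start int) (idx int, needConnect, ok bool) {
	for i := 0; i < len(states); i++ {
		j := (start + i) % len(states)
		switch states[j] {
		case connectivity.Ready, connectivity.Connecting:
			return j, false, true
		case connectivity.Idle:
			return j, true, true
		}
	}
	return 0, false, false
}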
// the previous, i.e. any of the following is true:
//   - an endpoint was added
//   - an endpoint was removed
//   - an endpoint's weight was updated
You should add that:
- the first address of the endpoint has changed.
Updated the comment.
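A hedged sketch of the resulting "does the ring need to be regenerated" check, using illustrative types (the real code compares resolver endpoints rather than this simplified struct):

package ringhashsketch

// epInfo captures the two endpoint properties that affect the ring: the first
// address (which identifies the endpoint on the ring) and its weight.
type epInfo struct {
	firstAddr string
	weight    uint32
}

// ringNeedsUpdate reports whether the ring must be regenerated: an endpoint
// was added or removed, a weight changed, or an endpoint's first address
// changed. Map keys are assumed to be some stable endpoint identifier.
func ringNeedsUpdate(oldEPs, newEPs map[string]epInfo) bool {
	if len(oldEPs) != len(newEPs) {
		return true // endpoint added or removed
	}
	for key, o := range oldEPs {
		n, ok := newEPs[key]
		if !ok {
			return true // endpoint replaced
		}
		if o.weight != n.weight || o.firstAddr != n.firstAddr {
			return true // weight or first address changed
		}
	}
	return false
}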
// when the first endpoint fails. The test then verifies that the balancer
// doesn't automatically re-connect to the endpoint when the READY endpoint
// enters IDLE.
func (s) TestRingHash_NoReconnectWhenEndpointEntersIdle(t *testing.T) {
I got this test to time out once with the following output:
--- FAIL: Test (11.78s)
--- FAIL: Test/RingHash_NoReconnectWhenEndpointEntersIdle (10.01s)
tlogger.go:116: INFO restartable_listener.go:84 [testutils] Stopping restartable listener "127.0.0.1:63116" (t=+1.977333ms)
tlogger.go:116: INFO server.go:690 [core] [Server #44496]Server created (t=+2.222708ms)
stubserver.go:300: Started test service backend at "127.0.0.1:63117"
ringhash_balancer_test.go:2247: Created new snapshot cache...
tlogger.go:116: INFO server.go:886 [core] [Server #44496 ListenSocket #44497]ListenSocket created (t=+2.287291ms)
tlogger.go:116: INFO server.go:690 [core] [Server #44498]Server created (t=+2.485ms)
ringhash_balancer_test.go:2247: Registered Aggregated Discovery Service (ADS)...
ringhash_balancer_test.go:2247: xDS management server serving at: 127.0.0.1:63118...
tlogger.go:116: INFO server.go:886 [core] [Server #44498 ListenSocket #44499]ListenSocket created (t=+2.538875ms)
server.go:229: Created new resource snapshot...
server.go:235: Updated snapshot cache with resource snapshot...
tlogger.go:116: INFO clientconn.go:1659 [core] original dial target is: "xds:///test.server" (t=+3.567791ms)
tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44500]Channel created (t=+3.585125ms)
tlogger.go:116: INFO clientconn.go:193 [core] [Channel #44500]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"xds", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/test.server", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}} (t=+3.593916ms)
tlogger.go:116: INFO clientconn.go:194 [core] [Channel #44500]Channel authority set to "test.server" (t=+3.602666ms)
tlogger.go:116: INFO blocking_context_dialer.go:106 [testutils] Hold 0x14000f32630: Waiting for a connection attempt to addr "127.0.0.1:63116" (t=+3.619708ms)
tlogger.go:116: INFO xds_resolver.go:111 [xds] [xds-resolver 0x14000f12360] Creating resolver for target: xds:///test.server (t=+3.641875ms)
tlogger.go:116: INFO pool.go:242 [xds] xDS node ID: 69b48cbe-e27f-436d-9f7a-9093e09cb9ba (t=+3.709ms)
tlogger.go:116: INFO clientconn.go:1659 [core] original dial target is: "passthrough:///127.0.0.1:63118" (t=+3.724958ms)
tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44501]Channel created (t=+3.732833ms)
tlogger.go:116: INFO clientconn.go:193 [core] [Channel #44501]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"passthrough", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/127.0.0.1:63118", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}} (t=+3.739041ms)
tlogger.go:116: INFO clientconn.go:194 [core] [Channel #44501]Channel authority set to "127.0.0.1:63118" (t=+3.744375ms)
tlogger.go:116: INFO resolver_wrapper.go:211 [core] [Channel #44501]Resolver state updated: {
"Addresses": [
{
"Addr": "127.0.0.1:63118",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "127.0.0.1:63118",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} (resolver returned new addresses) (t=+3.772791ms)
tlogger.go:116: INFO balancer_wrapper.go:122 [core] [Channel #44501]Channel switches to new LB policy "pick_first" (t=+3.785625ms)
tlogger.go:116: INFO clientconn.go:843 [core] [Channel #44501 SubChannel #44502]Subchannel created (t=+3.797333ms)
tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44501]Channel Connectivity change to CONNECTING (t=+3.80375ms)
tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44501]Channel exiting idle mode (t=+3.809791ms)
tlogger.go:116: INFO clientconn.go:1204 [core] [Channel #44501 SubChannel #44502]Subchannel Connectivity change to CONNECTING (t=+3.831291ms)
tlogger.go:116: INFO clientconn.go:1324 [core] [Channel #44501 SubChannel #44502]Subchannel picks a new address "127.0.0.1:63118" to connect (t=+3.840375ms)
tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44500]Channel exiting idle mode (t=+3.848541ms)
tlogger.go:116: INFO clientconn.go:1204 [core] [Channel #44501 SubChannel #44502]Subchannel Connectivity change to READY (t=+4.021541ms)
tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44501]Channel Connectivity change to READY (t=+4.036375ms)
logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.listener.v3.Listener[test.server] and known map[]. Diff [test.server]
logging.go:30: respond type.googleapis.com/envoy.config.listener.v3.Listener[test.server] version "" with version "1"
logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.listener.v3.Listener[test.server] and known map[test.server:{}]. Diff []
logging.go:30: open watch 1 for type.googleapis.com/envoy.config.listener.v3.Listener[test.server] from nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba", version "1"
logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.route.v3.RouteConfiguration[new_route] and known map[]. Diff [new_route]
logging.go:30: respond type.googleapis.com/envoy.config.route.v3.RouteConfiguration[new_route] version "" with version "1"
tlogger.go:116: INFO xds_resolver.go:295 [xds] [xds-resolver 0x14000f12360] For Listener resource "test.server" and RouteConfiguration resource "new_route", generated service config: {
"loadBalancingConfig": [
{
"xds_cluster_manager_experimental": {
"children": {
"cluster:cluster": {
"childPolicy": [
{
"cds_experimental": {
"cluster": "cluster"
}
}
]
}
}
}
}
]
} (t=+4.334291ms)
tlogger.go:116: INFO resolver_wrapper.go:211 [core] [Channel #44500]Resolver state updated: {
"Addresses": null,
"Endpoints": [],
"ServiceConfig": {
"Config": {
"Config": null,
"Methods": {}
},
"Err": null
},
"Attributes": {
"\u003c%!p(resolver.csKeyType=grpc.internal.resolver.configSelector)\u003e": "\u003c0x14000be5c70\u003e",
"\u003c%!p(xdsclient.clientKeyType=grpc.xds.internal.client.Client)\u003e": "\u003c0x14000f2c800\u003e"
}
} (service config updated) (t=+4.523416ms)
tlogger.go:116: INFO balancer_wrapper.go:122 [core] [Channel #44500]Channel switches to new LB policy "xds_cluster_manager_experimental" (t=+4.544916ms)
tlogger.go:116: INFO clustermanager.go:59 [xds] [xds-cluster-manager-lb 0x14000bec780] Created (t=+4.56025ms)
tlogger.go:116: INFO balancergroup.go:299 [xds] [xds-cluster-manager-lb 0x14000bec780] Adding child policy of type "cds_experimental" for child "cluster:cluster" (t=+4.572625ms)
tlogger.go:116: INFO balancergroup.go:102 [xds] [xds-cluster-manager-lb 0x14000bec780] Creating child policy of type "cds_experimental" for child "cluster:cluster" (t=+4.579958ms)
tlogger.go:116: INFO cdsbalancer.go:114 [xds] [cds-lb 0x1400068ec00] Created (t=+4.592041ms)
logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.route.v3.RouteConfiguration[new_route] and known map[new_route:{}]. Diff []
logging.go:30: open watch 2 for type.googleapis.com/envoy.config.route.v3.RouteConfiguration[new_route] from nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba", version "1"
tlogger.go:116: INFO cdsbalancer.go:126 [xds] [cds-lb 0x1400068ec00] xDS credentials in use: false (t=+4.598416ms)
tlogger.go:116: INFO cdsbalancer.go:301 [xds] [cds-lb 0x1400068ec00] Received balancer config update: {
"LoadBalancingConfig": null,
"Cluster": "cluster"
} (t=+4.637208ms)
tlogger.go:116: INFO balancerstateaggregator.go:191 [xds] [xds-cluster-manager-lb 0x14000bec780] Child pickers: map[cluster:cluster:picker:0x14000bf0a00,state:CONNECTING,stateToAggregate:CONNECTING] (t=+4.666041ms)
tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44500]Channel Connectivity change to CONNECTING (t=+4.67675ms)
logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.cluster.v3.Cluster[cluster] and known map[]. Diff [cluster]
logging.go:30: respond type.googleapis.com/envoy.config.cluster.v3.Cluster[cluster] version "" with version "1"
tlogger.go:116: INFO cdsbalancer.go:425 [xds] [cds-lb 0x1400068ec00] Received Cluster resource: {
"ClusterType": 0,
"ClusterName": "cluster",
"EDSServiceName": "cluster",
"LRSServerConfig": null,
"SecurityCfg": null,
"MaxRequests": null,
"DNSHostName": "",
"PrioritizedClusterNames": null,
"LBPolicy": [
{
"ring_hash_experimental": {
"minRingSize": 1024,
"maxRingSize": 8388608
}
}
],
"OutlierDetection": null,
"Raw": {
"type_url": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
"value": "CgdjbHVzdGVyGg0KAhoAEgdjbHVzdGVyMAIQAw=="
},
"TelemetryLabels": {
"csm.service_name": "unknown",
"csm.service_namespace_name": "unknown"
}
} (t=+4.855291ms)
tlogger.go:116: INFO clusterresolver.go:85 [xds] [xds-cluster-resolver-lb 0x14000530820] Created (t=+4.888083ms)
tlogger.go:116: INFO cdsbalancer.go:466 [xds] [cds-lb 0x1400068ec00] Created child policy 0x14000530820 of type cluster_resolver_experimental (t=+4.898ms)
logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.cluster.v3.Cluster[cluster] and known map[cluster:{}]. Diff []
logging.go:30: open watch 3 for type.googleapis.com/envoy.config.cluster.v3.Cluster[cluster] from nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba", version "1"
ringhash_balancer_test.go:2287: Timeout waiting for connection attempt to backend "127.0.0.1:63116".
tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44500]Channel Connectivity change to SHUTDOWN (t=+10.003514958s)
tlogger.go:116: INFO resolver_wrapper.go:111 [core] [Channel #44500]Closing the name resolver (t=+10.00356s)
tlogger.go:116: INFO balancer_wrapper.go:160 [core] [Channel #44500]ccBalancerWrapper: closing (t=+10.003606416s)
tlogger.go:116: INFO watch_service.go:56 [xds] [xds-resolver 0x14000f12360] Canceling watch on Listener resource "test.server" (t=+10.003708333s)
tlogger.go:116: INFO clusterresolver.go:355 [xds] [xds-cluster-resolver-lb 0x14000530820] Shutdown (t=+10.003748625s)
tlogger.go:116: INFO cdsbalancer.go:391 [xds] [cds-lb 0x1400068ec00] Shutdown (t=+10.003777666s)
tlogger.go:116: INFO clustermanager.go:197 [xds] [xds-cluster-manager-lb 0x14000bec780] Shutdown (t=+10.003803291s)
tlogger.go:116: WARNING ads_stream.go:514 [xds] [xds-client 0x14000f30990] [xds-channel 0x14000e98600] [ads-stream 0x1400161d300] ADS stream closed: rpc error: code = Canceled desc = context canceled (t=+10.00383s)
tlogger.go:116: INFO ads_stream.go:178 [xds] [xds-client 0x14000f30990] [xds-channel 0x14000e98600] [ads-stream 0x1400161d300] Stopping ADS stream (t=+10.003872291s)
tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44501]Channel Connectivity change to SHUTDOWN (t=+10.003931375s)
tlogger.go:116: INFO resolver_wrapper.go:111 [core] [Channel #44501]Closing the name resolver (t=+10.003957541s)
tlogger.go:116: INFO balancer_wrapper.go:160 [core] [Channel #44501]ccBalancerWrapper: closing (t=+10.0039875s)
tlogger.go:116: WARNING ads_stream.go:471 [xds] [xds-client 0x14000f30990] [xds-channel 0x14000e98600] [ads-stream 0x1400161d300] Sending ADS request for type "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment", resources: [], version: "", nonce: "" failed: EOF (t=+10.004030416s)
tlogger.go:116: INFO clientconn.go:1204 [core] [Channel #44501 SubChannel #44502]Subchannel Connectivity change to SHUTDOWN (t=+10.004085583s)
tlogger.go:116: INFO clientconn.go:1532 [core] [Channel #44501 SubChannel #44502]Subchannel deleted (t=+10.004120916s)
tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44501]Channel deleted (t=+10.004357083s)
tlogger.go:116: INFO channel.go:145 [xds] [xds-client 0x14000f30990] [xds-channel 0x14000e98600] Shutdown (t=+10.004390083s)
tlogger.go:116: INFO watch_service.go:91 [xds] [xds-resolver 0x14000f12360] Canceling watch on RouteConfiguration resource "new_route" (t=+10.004416791s)
tlogger.go:116: INFO clientimpl.go:228 [xds] [xds-client 0x14000f30990] Shutdown (t=+10.004449208s)
tlogger.go:116: INFO xds_resolver.go:264 [xds] [xds-resolver 0x14000f12360] Shutdown (t=+10.00447575s)
tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44500]Channel deleted (t=+10.004502041s)
tlogger.go:116: INFO server.go:822 [core] [Server #44498 ListenSocket #44499]ListenSocket deleted (t=+10.00461225s)
tlogger.go:116: INFO server.go:822 [core] [Server #44496 ListenSocket #44497]ListenSocket deleted (t=+10.004753s)
I haven't had time to dig into it, but you should be able to reproduce with go test -count 100 ./xds/internal/balancer/ringhash/e2e -run Test/RingHash_NoReconnectWhenEndpointEntersIdle. I haven't been able to reproduce it again, so maybe we can just ignore it.
In the success case, the next few log lines after "Created child policy 0x14000530820 of type cluster_resolver_experimental" are similar to the following:
logging.go:30: nodeID "267a4abd-bf2b-4af8-9119-b88eeb7e1d98" requested type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[cluster] and known map[]. Diff [cluster]
logging.go:30: respond type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[cluster] version "" with version "1"
tlogger.go:116: INFO balancer.go:76 [xds] [priority-lb 0xc000378230] Created (t=+9.91661ms)
It seems like the clusterresolver didn't start a watch for EDS resources. The root cause is unclear, but it seems unrelated to the changes in this PR. I tried to repro the failure with debug logs enabled by running the test 10^5 times on Google's test cluster, but all the runs passed.
OK, not sure what happened when it failed locally with -count 100; I tried to run it again and it did not fail.
There is an issue open for a failure with similar logs: #7993
As per gRFC A61, this PR switches the ringhash balancer to defer to pick first instead of creating and handling SubConns itself. This is part of Dualstack, and it makes ringhash a petiole policy.
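Conceptually, "delegate subchannel creation to pickfirst" means the ringhash policy keeps one pick_first child per endpoint and forwards that endpoint to it, instead of creating SubConns itself. A rough sketch of that shape is below; the child interface and sync function are illustrative, not the PR's actual code, which goes through endpointsharding and a lazy wrapper:

package ringhashsketch

import (
	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/resolver"
)

// child stands in for a per-endpoint pick_first child policy.
type child interface {
	UpdateClientConnState(balancer.ClientConnState) error
}

// syncChildren creates or updates one child per endpoint, keyed by the
// endpoint's first address, and removes children for endpoints that
// disappeared. Each child only ever sees its own endpoint; pick_first decides
// which address within the endpoint to connect to.
func syncChildren(children map[string]child, newChild func() child, endpoints []resolver.Endpoint) {
	seen := make(map[string]bool, len(endpoints))
	for _, ep := range endpoints {
		if len(ep.Addresses) == 0 {
			continue
		}
		key := ep.Addresses[0].Addr
		seen[key] = true
		c, ok := children[key]
		if !ok {
			c = newChild()
			children[key] = c
		}
		_ = c.UpdateClientConnState(balancer.ClientConnState{
			ResolverState: resolver.State{Endpoints: []resolver.Endpoint{ep}},
		})
	}
	for key := range children {
		if !seen[key] {
			delete(children, key)
		}
	}
}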
RELEASE NOTES:
- ringhash: use the Endpoints field instead of the Addresses field in resolver state updates.