
ringhash: Delegate subchannel creation to pickfirst #8047

Merged: 20 commits into grpc:master from ringhash-delegate-to-pf on Feb 13, 2025

Conversation

@arjan-bal (Contributor) commented Jan 28, 2025

As per gRFC A61, this PR switches the ringhash balancer to delegate to pick_first instead of creating and handling SubConns itself. This is part of the Dualstack effort, and makes ringhash a petiole policy (one that manages per-endpoint leaf pick_first children rather than individual SubConns).

RELEASE NOTES:

  • ringhash: ringhash uses pick first child policies to manage endpoints instead of creating SubConns.
  • ringhash: ringhash uses the Endpoints field instead of the Addresses field in resolver state updates.
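
The delegation described above can be sketched roughly as follows. This is an illustrative Go sketch under simplified, assumed types (`endpoint`, `child`, and `balancerSketch` are not grpc-go's actual API): the petiole keeps one pick_first child per endpoint and forwards each endpoint's address list to its child, which then owns SubConn creation.

```go
package main

import "fmt"

// endpoint is a simplified stand-in for resolver.Endpoint.
type endpoint struct {
	Addresses []string
}

// child is a stand-in for a pick_first child policy; in the real balancer it
// would create and manage SubConns for its endpoint.
type child struct {
	addrs []string
}

func (c *child) update(addrs []string) { c.addrs = addrs }

// balancerSketch keys children by the endpoint's first address, mirroring how
// a petiole policy tracks endpoints across resolver updates.
type balancerSketch struct {
	children map[string]*child
}

func (b *balancerSketch) updateEndpoints(endpoints []endpoint) {
	seen := make(map[string]bool)
	for _, ep := range endpoints {
		if len(ep.Addresses) == 0 {
			continue
		}
		key := ep.Addresses[0]
		seen[key] = true
		c, ok := b.children[key]
		if !ok {
			c = &child{} // create a pick_first child for the new endpoint
			b.children[key] = c
		}
		c.update(ep.Addresses) // the child manages connections for these addresses
	}
	// Remove children for endpoints that disappeared from the update.
	for key := range b.children {
		if !seen[key] {
			delete(b.children, key)
		}
	}
}

func main() {
	b := &balancerSketch{children: make(map[string]*child)}
	b.updateEndpoints([]endpoint{{Addresses: []string{"10.0.0.1:80", "10.0.0.2:80"}}})
	b.updateEndpoints([]endpoint{{Addresses: []string{"10.0.0.1:80"}}, {Addresses: []string{"10.0.0.3:80"}}})
	fmt.Println(len(b.children)) // 2: one child kept, one added, one removed
}
```

The point of the pattern is that ringhash no longer reacts to SubConn state changes directly; it only aggregates the connectivity states its pick_first children report.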

@arjan-bal arjan-bal added Type: Feature New features or improvements in behavior Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. labels Jan 28, 2025
@arjan-bal arjan-bal added this to the 1.71 Release milestone Jan 28, 2025
codecov bot commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 97.14286% with 6 lines in your changes missing coverage. Please review.

Project coverage is 82.23%. Comparing base (ad5cd32) to head (ec55d5b).
Report is 5 commits behind head on master.

Files with missing lines                     Patch %   Lines
xds/internal/balancer/ringhash/ringhash.go   97.43%    4 Missing ⚠️
xds/internal/balancer/ringhash/picker.go     88.88%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8047      +/-   ##
==========================================
+ Coverage   82.10%   82.23%   +0.12%     
==========================================
  Files         387      387              
  Lines       39067    38928     -139     
==========================================
- Hits        32076    32011      -65     
+ Misses       5662     5598      -64     
+ Partials     1329     1319      -10     
Files with missing lines                            Coverage Δ
balancer/weightedroundrobin/weightedroundrobin.go   100.00% <100.00%> (ø)
xds/internal/balancer/ringhash/logging.go           100.00% <ø> (ø)
xds/internal/balancer/ringhash/ring.go              92.95% <100.00%> (+0.65%) ⬆️
xds/internal/balancer/ringhash/picker.go            90.00% <88.88%> (+5.66%) ⬆️
xds/internal/balancer/ringhash/ringhash.go          93.80% <97.43%> (+6.18%) ⬆️

... and 26 files with indirect coverage changes

@dfawley (Member) left a comment

I guess this also needs to wait for #8052, right?

This is a pretty large change, and I haven't done a very thorough review, but it mostly looks good to me. I would like to request @easwars or maybe @atollena to review more closely, since they're more familiar with this logic than me.

arjan-bal and others added 2 commits January 31, 2025 16:28
Co-authored-by: Doug Fawley <dfawley@google.com>
@arjan-bal (Contributor, Author)

I guess this also needs to wait for #8052, right?

This PR depends on the lazy balancer and #8052 changes it. Either PR can be merged first. I'll rebase the other.

@dfawley (Member) left a comment

LGTM. Just some comment changes that I noticed...

@dfawley dfawley assigned arjan-bal and unassigned dfawley Feb 3, 2025
arjan-bal and others added 2 commits February 4, 2025 19:09
Co-authored-by: Doug Fawley <dfawley@google.com>
@arjan-bal arjan-bal removed their assignment Feb 4, 2025
@easwars easwars assigned arjan-bal and unassigned easwars Feb 4, 2025
@arjan-bal arjan-bal force-pushed the ringhash-delegate-to-pf branch from 7fe6692 to 5202208 Compare February 11, 2025 10:00
@arjan-bal arjan-bal assigned arjan-bal and unassigned atollena Feb 11, 2025
@arjan-bal arjan-bal force-pushed the ringhash-delegate-to-pf branch from 7065793 to 0b91fe0 Compare February 11, 2025 18:16
@arjan-bal arjan-bal assigned atollena and unassigned arjan-bal Feb 11, 2025
@arjan-bal arjan-bal requested a review from atollena February 11, 2025 18:25
}
return addrsUpdated

b.updatePickerLocked(endpointEnteredTF)
Contributor

Could you please write a comment or two about why we trigger connecting to the next idle endpoint only when this condition is true. Thanks.

Contributor Author

There is a comment in updatePickerLocked, where the boolean is used. endpointEnteredTF was added based on the pseudocode in A61:

if (in_transient_failure && endpoint_entered_transient_failure) {

On re-reading the existing comment in updatePickerLocked, I think that the endpoint_entered_transient_failure condition can cause the ring hash to remain in TF in certain situations where the aggregated state is TF, but no endpoint actually enters TF:

  • Scenario 1: There are four endpoints in the following states: TF, TF, READY, and IDLE. If the READY endpoint fails, it transitions to IDLE, resulting in the new states: TF, TF, IDLE, IDLE.
  • Scenario 2: There are four endpoints in the following states: TF, TF, CONNECTING, and IDLE. If the CONNECTING endpoint is removed, the new states become: TF, TF, IDLE.

I've asked about this in the maintainers group.
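
For illustration, here is a small Go sketch of the ring hash aggregation rule as described in gRFC A42 (the precedence order is restated from that gRFC; the State type is local to this sketch, not grpc-go's connectivity package). It shows how Scenario 1's transition lands in TRANSIENT_FAILURE even though no endpoint newly entered TF:

```go
package main

import "fmt"

// State is a local stand-in for connectivity.State.
type State int

const (
	Idle State = iota
	Connecting
	Ready
	TransientFailure
)

// aggregate implements the precedence order from gRFC A42:
// any READY > two or more TRANSIENT_FAILURE > any CONNECTING > any IDLE >
// TRANSIENT_FAILURE.
func aggregate(states []State) State {
	var ready, connecting, idle, tf int
	for _, s := range states {
		switch s {
		case Ready:
			ready++
		case Connecting:
			connecting++
		case Idle:
			idle++
		case TransientFailure:
			tf++
		}
	}
	switch {
	case ready > 0:
		return Ready
	case tf > 1:
		return TransientFailure
	case connecting > 0:
		return Connecting
	case idle > 0:
		return Idle
	default:
		return TransientFailure
	}
}

func main() {
	// Scenario 1: TF, TF, READY, IDLE; the READY endpoint fails to IDLE.
	before := aggregate([]State{TransientFailure, TransientFailure, Ready, Idle})
	after := aggregate([]State{TransientFailure, TransientFailure, Idle, Idle})
	// The aggregated state drops from READY to TRANSIENT_FAILURE, but no
	// endpoint entered TF in this transition, so the
	// endpoint_entered_transient_failure gate would not fire.
	fmt.Println(before == Ready, after == TransientFailure) // true true
}
```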

Collaborator

This is a great finding. Please let us know what the outcome of this discussion is, as IIUC it is happening on a Google-private channel. Also, it would be nice to demonstrate the behavior of those scenarios via end-to-end tests.

Contributor Author

Update: These edge cases were overlooked when writing A61. There is a discussion going on about the best way to handle such cases. I plan to remove the in_transient_failure condition in Go, add unit tests to verify that ringhash recovers from TF in both these cases, and add a TODO to update the implementation later based on the result of the discussion.

@@ -163,14 +164,22 @@ func (s) TestRingHash_ReconnectToMoveOutOfTransientFailure(t *testing.T) {

// startTestServiceBackends starts num stub servers. It returns their addresses.
Contributor

Nit: The return type has changed. Please update the comment.

Contributor Author

Updated the comment.

Comment on lines 2188 to 2204
var otherEndpointAddr string
var backendIdx int
switch initialAddr {
case backendAddrs[0]:
otherEndpointAddr = backendAddrs[1]
backendIdx = 0
case backendAddrs[1]:
otherEndpointAddr = backendAddrs[0]
backendIdx = 1
case backendAddrs[2]:
otherEndpointAddr = backendAddrs[3]
backendIdx = 2
case backendAddrs[3]:
otherEndpointAddr = backendAddrs[2]
backendIdx = 3
default:
t.Fatalf("Request sent to unknown backend: %q", initialAddr)
Contributor

I don't quite understand this. Shouldn't the first RPC always reach the first address in the endpoint, i.e. backendAddrs[0]?

Contributor Author

Yes, you're correct. The cases for the second address should never be used. Removed the extra cases now.

@easwars easwars assigned arjan-bal and unassigned atollena and easwars Feb 11, 2025
@arjan-bal arjan-bal force-pushed the ringhash-delegate-to-pf branch from a0697d9 to 9e7758e Compare February 12, 2025 06:31
@atollena (Collaborator) left a comment

just minor nits, otherwise lgtm.

ring *ring
logger *grpclog.PrefixLogger
// endpointStates is a cache of endpoint connectivity states and pickers.
// The cache is used to avoid locking the mutex for reading the state at RPC
Collaborator

I had trouble understanding this comment. What mutex are you referring to?

Contributor Author

The ringhash balancer stores endpoint states in a resolver.EndpointMap, with access guarded by ringhashBalancer.mu. The endpointStates cache in the picker helps avoid locking the ringhash balancer's mutex when reading the latest state at RPC time. This also means that the picker needs to be regenerated every time a child generates a new picker.

Mentioned this in the comment.
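
A minimal sketch of that pattern, with illustrative names (not the actual ringhash types): the balancer mutates its state map only under the mutex, and publishes an immutable snapshot into each new picker, so Pick reads without ever taking the balancer's lock.

```go
package main

import (
	"fmt"
	"sync"
)

// endpointState is a simplified stand-in for a cached child state.
type endpointState struct {
	state string // e.g. "READY", "IDLE"
}

type picker struct {
	// Immutable after construction: safe for concurrent lock-free reads at
	// RPC time.
	endpointStates map[string]endpointState
}

func (p *picker) pick(key string) string {
	return p.endpointStates[key].state // no lock needed
}

type ringhashBalancer struct {
	mu             sync.Mutex
	endpointStates map[string]endpointState
	current        *picker
}

// onChildUpdate is called whenever a child reports a new state; the balancer
// must regenerate its picker so the published snapshot stays current.
func (b *ringhashBalancer) onChildUpdate(key string, s endpointState) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.endpointStates[key] = s
	snapshot := make(map[string]endpointState, len(b.endpointStates))
	for k, v := range b.endpointStates {
		snapshot[k] = v
	}
	b.current = &picker{endpointStates: snapshot}
}

func main() {
	b := &ringhashBalancer{endpointStates: make(map[string]endpointState)}
	b.onChildUpdate("ep1", endpointState{state: "READY"})
	p := b.current
	b.onChildUpdate("ep1", endpointState{state: "IDLE"}) // old picker unaffected
	fmt.Println(p.pick("ep1"), b.current.pick("ep1"))    // READY IDLE
}
```

The trade-off is exactly the one the comment mentions: every child state change costs a snapshot copy and a new picker, in exchange for lock-free reads on the RPC hot path.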

// Per gRFC A61, because of sticky-TF with PickFirst's auto reconnect on TF,
// we ignore all TF subchannels and find the first ring entry in READY,
// CONNECTING or IDLE. If that entry is in IDLE, we need to initiate a
// connection. The idlePicker returned by the LazyLB or the new Pickfirst
Collaborator

Why "or the new PickFirst"? Isn't exitIdle always triggered by the lazy balancer?

Contributor Author

The lazy balancer's idle picker is used the first time an endpoint exits idle state. If a subchannel fails after connecting to the backend, the pickfirst balancer will enter idle, but this time ringhash would get pickfirst's idle picker instead of the lazy balancer's idle picker.
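
A hedged sketch of the idle-picker behavior being described, using hypothetical names (not grpc-go's actual picker types): regardless of whether the lazy balancer or pick_first produced the idle picker, picking an IDLE ring entry must trigger exitIdle exactly once.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// idlePicker is an illustrative picker returned while a child is IDLE. Its
// exitIdle callback stands in for kicking the child policy out of IDLE.
type idlePicker struct {
	exitIdle func()
	once     sync.Once
}

// pick reports that no connection is ready, but initiates a connection
// attempt on the first call only.
func (p *idlePicker) pick() error {
	p.once.Do(p.exitIdle)
	return errors.New("no ready connection; connecting")
}

func main() {
	attempts := 0
	p := &idlePicker{exitIdle: func() { attempts++ }}
	p.pick()
	p.pick()
	fmt.Println(attempts) // 1: repeated picks do not restart the attempt
}
```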

Collaborator

OK, makes sense.

// the previous, i.e. any of the following is true:
// - an endpoint was added
// - an endpoint was removed
// - an endpoint's weight was updated
Collaborator

You should add that:

- the first address of the endpoint has changed.

Contributor Author

Updated the comment.
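
For illustration, the ring-regeneration conditions in that comment could be checked like this (the types, the map key, and the weight field are assumptions for this sketch, not the actual implementation):

```go
package main

import "fmt"

// endpointInfo holds the two properties that feed ring construction in this
// sketch: the endpoint's first address (its ring key) and its weight.
type endpointInfo struct {
	firstAddr string
	weight    uint32
}

// ringNeedsUpdate reports whether the ring must be regenerated: an endpoint
// was added or removed, a weight changed, or a first address changed.
func ringNeedsUpdate(old, updated map[string]endpointInfo) bool {
	if len(old) != len(updated) {
		return true // endpoint added or removed
	}
	for key, n := range updated {
		o, ok := old[key]
		if !ok {
			return true // endpoint replaced by a new one
		}
		if o.weight != n.weight || o.firstAddr != n.firstAddr {
			return true // weight or ring key changed
		}
	}
	return false
}

func main() {
	old := map[string]endpointInfo{"ep1": {"10.0.0.1:80", 1}}
	same := map[string]endpointInfo{"ep1": {"10.0.0.1:80", 1}}
	reweighted := map[string]endpointInfo{"ep1": {"10.0.0.1:80", 2}}
	fmt.Println(ringNeedsUpdate(old, same), ringNeedsUpdate(old, reweighted)) // false true
}
```

Skipping regeneration when none of these conditions hold matters because building the ring is O(entries log entries) and rings can be large (the config above allows up to 8388608 entries).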


// when the first endpoint fails. The test then verifies that the balancer
// doesn't automatically re-connect to the endpoint when the READY endpoint
// enters IDLE.
func (s) TestRingHash_NoReconnectWhenEndpointEntersIdle(t *testing.T) {
Collaborator

I got this test to timeout once with the following output:

--- FAIL: Test (11.78s)
    --- FAIL: Test/RingHash_NoReconnectWhenEndpointEntersIdle (10.01s)
        tlogger.go:116: INFO restartable_listener.go:84 [testutils] Stopping restartable listener "127.0.0.1:63116"  (t=+1.977333ms)
        tlogger.go:116: INFO server.go:690 [core] [Server #44496]Server created  (t=+2.222708ms)
        stubserver.go:300: Started test service backend at "127.0.0.1:63117"
        ringhash_balancer_test.go:2247: Created new snapshot cache...
        tlogger.go:116: INFO server.go:886 [core] [Server #44496 ListenSocket #44497]ListenSocket created  (t=+2.287291ms)
        tlogger.go:116: INFO server.go:690 [core] [Server #44498]Server created  (t=+2.485ms)
        ringhash_balancer_test.go:2247: Registered Aggregated Discovery Service (ADS)...
        ringhash_balancer_test.go:2247: xDS management server serving at: 127.0.0.1:63118...
        tlogger.go:116: INFO server.go:886 [core] [Server #44498 ListenSocket #44499]ListenSocket created  (t=+2.538875ms)
        server.go:229: Created new resource snapshot...
        server.go:235: Updated snapshot cache with resource snapshot...
        tlogger.go:116: INFO clientconn.go:1659 [core] original dial target is: "xds:///test.server"  (t=+3.567791ms)
        tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44500]Channel created  (t=+3.585125ms)
        tlogger.go:116: INFO clientconn.go:193 [core] [Channel #44500]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"xds", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/test.server", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}  (t=+3.593916ms)
        tlogger.go:116: INFO clientconn.go:194 [core] [Channel #44500]Channel authority set to "test.server"  (t=+3.602666ms)
        tlogger.go:116: INFO blocking_context_dialer.go:106 [testutils] Hold 0x14000f32630: Waiting for a connection attempt to addr "127.0.0.1:63116"  (t=+3.619708ms)
        tlogger.go:116: INFO xds_resolver.go:111 [xds] [xds-resolver 0x14000f12360] Creating resolver for target: xds:///test.server  (t=+3.641875ms)
        tlogger.go:116: INFO pool.go:242 [xds] xDS node ID: 69b48cbe-e27f-436d-9f7a-9093e09cb9ba  (t=+3.709ms)
        tlogger.go:116: INFO clientconn.go:1659 [core] original dial target is: "passthrough:///127.0.0.1:63118"  (t=+3.724958ms)
        tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44501]Channel created  (t=+3.732833ms)
        tlogger.go:116: INFO clientconn.go:193 [core] [Channel #44501]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"passthrough", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/127.0.0.1:63118", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}  (t=+3.739041ms)
        tlogger.go:116: INFO clientconn.go:194 [core] [Channel #44501]Channel authority set to "127.0.0.1:63118"  (t=+3.744375ms)
        tlogger.go:116: INFO resolver_wrapper.go:211 [core] [Channel #44501]Resolver state updated: {
              "Addresses": [
                {
                  "Addr": "127.0.0.1:63118",
                  "ServerName": "",
                  "Attributes": null,
                  "BalancerAttributes": null,
                  "Metadata": null
                }
              ],
              "Endpoints": [
                {
                  "Addresses": [
                    {
                      "Addr": "127.0.0.1:63118",
                      "ServerName": "",
                      "Attributes": null,
                      "BalancerAttributes": null,
                      "Metadata": null
                    }
                  ],
                  "Attributes": null
                }
              ],
              "ServiceConfig": null,
              "Attributes": null
            } (resolver returned new addresses)  (t=+3.772791ms)
        tlogger.go:116: INFO balancer_wrapper.go:122 [core] [Channel #44501]Channel switches to new LB policy "pick_first"  (t=+3.785625ms)
        tlogger.go:116: INFO clientconn.go:843 [core] [Channel #44501 SubChannel #44502]Subchannel created  (t=+3.797333ms)
        tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44501]Channel Connectivity change to CONNECTING  (t=+3.80375ms)
        tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44501]Channel exiting idle mode  (t=+3.809791ms)
        tlogger.go:116: INFO clientconn.go:1204 [core] [Channel #44501 SubChannel #44502]Subchannel Connectivity change to CONNECTING  (t=+3.831291ms)
        tlogger.go:116: INFO clientconn.go:1324 [core] [Channel #44501 SubChannel #44502]Subchannel picks a new address "127.0.0.1:63118" to connect  (t=+3.840375ms)
        tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44500]Channel exiting idle mode  (t=+3.848541ms)
        tlogger.go:116: INFO clientconn.go:1204 [core] [Channel #44501 SubChannel #44502]Subchannel Connectivity change to READY  (t=+4.021541ms)
        tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44501]Channel Connectivity change to READY  (t=+4.036375ms)
        logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.listener.v3.Listener[test.server] and known map[]. Diff [test.server]
        logging.go:30: respond type.googleapis.com/envoy.config.listener.v3.Listener[test.server] version "" with version "1"
        logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.listener.v3.Listener[test.server] and known map[test.server:{}]. Diff []
        logging.go:30: open watch 1 for type.googleapis.com/envoy.config.listener.v3.Listener[test.server] from nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba", version "1"
        logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.route.v3.RouteConfiguration[new_route] and known map[]. Diff [new_route]
        logging.go:30: respond type.googleapis.com/envoy.config.route.v3.RouteConfiguration[new_route] version "" with version "1"
        tlogger.go:116: INFO xds_resolver.go:295 [xds] [xds-resolver 0x14000f12360] For Listener resource "test.server" and RouteConfiguration resource "new_route", generated service config: {
              "loadBalancingConfig": [
                {
                  "xds_cluster_manager_experimental": {
                    "children": {
                      "cluster:cluster": {
                        "childPolicy": [
                          {
                            "cds_experimental": {
                              "cluster": "cluster"
                            }
                          }
                        ]
                      }
                    }
                  }
                }
              ]
            }  (t=+4.334291ms)
        tlogger.go:116: INFO resolver_wrapper.go:211 [core] [Channel #44500]Resolver state updated: {
              "Addresses": null,
              "Endpoints": [],
              "ServiceConfig": {
                "Config": {
                  "Config": null,
                  "Methods": {}
                },
                "Err": null
              },
              "Attributes": {
                "\u003c%!p(resolver.csKeyType=grpc.internal.resolver.configSelector)\u003e": "\u003c0x14000be5c70\u003e",
                "\u003c%!p(xdsclient.clientKeyType=grpc.xds.internal.client.Client)\u003e": "\u003c0x14000f2c800\u003e"
              }
            } (service config updated)  (t=+4.523416ms)
        tlogger.go:116: INFO balancer_wrapper.go:122 [core] [Channel #44500]Channel switches to new LB policy "xds_cluster_manager_experimental"  (t=+4.544916ms)
        tlogger.go:116: INFO clustermanager.go:59 [xds] [xds-cluster-manager-lb 0x14000bec780] Created  (t=+4.56025ms)
        tlogger.go:116: INFO balancergroup.go:299 [xds] [xds-cluster-manager-lb 0x14000bec780] Adding child policy of type "cds_experimental" for child "cluster:cluster"  (t=+4.572625ms)
        tlogger.go:116: INFO balancergroup.go:102 [xds] [xds-cluster-manager-lb 0x14000bec780] Creating child policy of type "cds_experimental" for child "cluster:cluster"  (t=+4.579958ms)
        tlogger.go:116: INFO cdsbalancer.go:114 [xds] [cds-lb 0x1400068ec00] Created  (t=+4.592041ms)
        logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.route.v3.RouteConfiguration[new_route] and known map[new_route:{}]. Diff []
        logging.go:30: open watch 2 for type.googleapis.com/envoy.config.route.v3.RouteConfiguration[new_route] from nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba", version "1"
        tlogger.go:116: INFO cdsbalancer.go:126 [xds] [cds-lb 0x1400068ec00] xDS credentials in use: false  (t=+4.598416ms)
        tlogger.go:116: INFO cdsbalancer.go:301 [xds] [cds-lb 0x1400068ec00] Received balancer config update: {
              "LoadBalancingConfig": null,
              "Cluster": "cluster"
            }  (t=+4.637208ms)
        tlogger.go:116: INFO balancerstateaggregator.go:191 [xds] [xds-cluster-manager-lb 0x14000bec780] Child pickers: map[cluster:cluster:picker:0x14000bf0a00,state:CONNECTING,stateToAggregate:CONNECTING]  (t=+4.666041ms)
        tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44500]Channel Connectivity change to CONNECTING  (t=+4.67675ms)
        logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.cluster.v3.Cluster[cluster] and known map[]. Diff [cluster]
        logging.go:30: respond type.googleapis.com/envoy.config.cluster.v3.Cluster[cluster] version "" with version "1"
        tlogger.go:116: INFO cdsbalancer.go:425 [xds] [cds-lb 0x1400068ec00] Received Cluster resource: {
              "ClusterType": 0,
              "ClusterName": "cluster",
              "EDSServiceName": "cluster",
              "LRSServerConfig": null,
              "SecurityCfg": null,
              "MaxRequests": null,
              "DNSHostName": "",
              "PrioritizedClusterNames": null,
              "LBPolicy": [
                {
                  "ring_hash_experimental": {
                    "minRingSize": 1024,
                    "maxRingSize": 8388608
                  }
                }
              ],
              "OutlierDetection": null,
              "Raw": {
                "type_url": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
                "value": "CgdjbHVzdGVyGg0KAhoAEgdjbHVzdGVyMAIQAw=="
              },
              "TelemetryLabels": {
                "csm.service_name": "unknown",
                "csm.service_namespace_name": "unknown"
              }
            }  (t=+4.855291ms)
        tlogger.go:116: INFO clusterresolver.go:85 [xds] [xds-cluster-resolver-lb 0x14000530820] Created  (t=+4.888083ms)
        tlogger.go:116: INFO cdsbalancer.go:466 [xds] [cds-lb 0x1400068ec00] Created child policy 0x14000530820 of type cluster_resolver_experimental  (t=+4.898ms)
        logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.cluster.v3.Cluster[cluster] and known map[cluster:{}]. Diff []
        logging.go:30: open watch 3 for type.googleapis.com/envoy.config.cluster.v3.Cluster[cluster] from nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba", version "1"
        ringhash_balancer_test.go:2287: Timeout waiting for connection attempt to backend "127.0.0.1:63116".
        tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44500]Channel Connectivity change to SHUTDOWN  (t=+10.003514958s)
        tlogger.go:116: INFO resolver_wrapper.go:111 [core] [Channel #44500]Closing the name resolver  (t=+10.00356s)
        tlogger.go:116: INFO balancer_wrapper.go:160 [core] [Channel #44500]ccBalancerWrapper: closing  (t=+10.003606416s)
        tlogger.go:116: INFO watch_service.go:56 [xds] [xds-resolver 0x14000f12360] Canceling watch on Listener resource "test.server"  (t=+10.003708333s)
        tlogger.go:116: INFO clusterresolver.go:355 [xds] [xds-cluster-resolver-lb 0x14000530820] Shutdown  (t=+10.003748625s)
        tlogger.go:116: INFO cdsbalancer.go:391 [xds] [cds-lb 0x1400068ec00] Shutdown  (t=+10.003777666s)
        tlogger.go:116: INFO clustermanager.go:197 [xds] [xds-cluster-manager-lb 0x14000bec780] Shutdown  (t=+10.003803291s)
        tlogger.go:116: WARNING ads_stream.go:514 [xds] [xds-client 0x14000f30990] [xds-channel 0x14000e98600] [ads-stream 0x1400161d300] ADS stream closed: rpc error: code = Canceled desc = context canceled  (t=+10.00383s)
        tlogger.go:116: INFO ads_stream.go:178 [xds] [xds-client 0x14000f30990] [xds-channel 0x14000e98600] [ads-stream 0x1400161d300] Stopping ADS stream  (t=+10.003872291s)
        tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44501]Channel Connectivity change to SHUTDOWN  (t=+10.003931375s)
        tlogger.go:116: INFO resolver_wrapper.go:111 [core] [Channel #44501]Closing the name resolver  (t=+10.003957541s)
        tlogger.go:116: INFO balancer_wrapper.go:160 [core] [Channel #44501]ccBalancerWrapper: closing  (t=+10.0039875s)
        tlogger.go:116: WARNING ads_stream.go:471 [xds] [xds-client 0x14000f30990] [xds-channel 0x14000e98600] [ads-stream 0x1400161d300] Sending ADS request for type "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment", resources: [], version: "", nonce: "" failed: EOF  (t=+10.004030416s)
        tlogger.go:116: INFO clientconn.go:1204 [core] [Channel #44501 SubChannel #44502]Subchannel Connectivity change to SHUTDOWN  (t=+10.004085583s)
        tlogger.go:116: INFO clientconn.go:1532 [core] [Channel #44501 SubChannel #44502]Subchannel deleted  (t=+10.004120916s)
        tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44501]Channel deleted  (t=+10.004357083s)
        tlogger.go:116: INFO channel.go:145 [xds] [xds-client 0x14000f30990] [xds-channel 0x14000e98600] Shutdown  (t=+10.004390083s)
        tlogger.go:116: INFO watch_service.go:91 [xds] [xds-resolver 0x14000f12360] Canceling watch on RouteConfiguration resource "new_route"  (t=+10.004416791s)
        tlogger.go:116: INFO clientimpl.go:228 [xds] [xds-client 0x14000f30990] Shutdown  (t=+10.004449208s)
        tlogger.go:116: INFO xds_resolver.go:264 [xds] [xds-resolver 0x14000f12360] Shutdown  (t=+10.00447575s)
        tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44500]Channel deleted  (t=+10.004502041s)
        tlogger.go:116: INFO server.go:822 [core] [Server #44498 ListenSocket #44499]ListenSocket deleted  (t=+10.00461225s)
        tlogger.go:116: INFO server.go:822 [core] [Server #44496 ListenSocket #44497]ListenSocket deleted  (t=+10.004753s)

I haven't had time to dig into it, but you should be able to reproduce with go test -count 100 ./xds/internal/balancer/ringhash/e2e -run Test/RingHash_NoReconnectWhenEndpointEntersIdle. I haven't been able to reproduce again, so maybe we can just ignore it.

Contributor Author

In the success case, the next few log lines after Created child policy 0x14000530820 of type cluster_resolver_experimental are similar to the following:

logging.go:30: nodeID "267a4abd-bf2b-4af8-9119-b88eeb7e1d98" requested type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[cluster] and known map[]. Diff [cluster]
logging.go:30: respond type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[cluster] version "" with version "1"
tlogger.go:116: INFO balancer.go:76 [xds] [priority-lb 0xc000378230] Created (t=+9.91661ms)

It seems that the clusterresolver didn't start a watch for EDS resources. The root cause is unclear, but it appears unrelated to the changes in this PR. I tried to repro the failure with debug logs enabled by running the test 10^5 times on Google's test cluster, but all the runs passed.

Collaborator

OK, not sure what happened when it failed locally with -count 100, I tried to run that again and it did not fail.

Contributor Author

There is an issue open for a failure with similar logs: #7993

arjan-bal and others added 2 commits February 12, 2025 15:23
Co-authored-by: Antoine Tollenaere <atollena@gmail.com>
@arjan-bal arjan-bal force-pushed the ringhash-delegate-to-pf branch from 5ee0a8f to ec55d5b Compare February 13, 2025 15:40
@arjan-bal arjan-bal merged commit fabe274 into grpc:master Feb 13, 2025
15 checks passed
@arjan-bal arjan-bal deleted the ringhash-delegate-to-pf branch February 13, 2025 15:56
janardhanvissa pushed a commit to janardhanvissa/grpc-go that referenced this pull request Feb 15, 2025