
ringhash: Delegate subchannel creation to pickfirst #8047

Merged: 20 commits into grpc:master from ringhash-delegate-to-pf on Feb 13, 2025

Conversation

@arjan-bal (Contributor) commented Jan 28, 2025

As per gRFC A61, this PR switches the ringhash balancer to delegate to pick_first instead of creating and handling SubConns itself. This is part of the Dualstack effort, and makes ringhash a petiole policy (one that manages per-endpoint leaf pick_first children rather than individual SubConns).

RELEASE NOTES:

  • ringhash: ringhash uses pick first child policies to manage endpoints instead of creating SubConns.
  • ringhash: ringhash uses the Endpoints field instead of the Addresses field in resolver state updates.
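
The delegation described above can be sketched roughly as follows. This is an illustrative Go sketch under simplified, assumed types (`endpoint`, `child`, and `balancerSketch` are not grpc-go's actual API): the petiole keeps one pick_first child per endpoint and forwards each endpoint's address list to its child, which then owns SubConn creation.

```go
package main

import "fmt"

// endpoint is a simplified stand-in for resolver.Endpoint.
type endpoint struct {
	Addresses []string
}

// child is a stand-in for a pick_first child policy; in the real balancer it
// would create and manage SubConns for its endpoint.
type child struct {
	addrs []string
}

func (c *child) update(addrs []string) { c.addrs = addrs }

// balancerSketch keys children by the endpoint's first address, mirroring how
// a petiole policy tracks endpoints across resolver updates.
type balancerSketch struct {
	children map[string]*child
}

func (b *balancerSketch) updateEndpoints(endpoints []endpoint) {
	seen := make(map[string]bool)
	for _, ep := range endpoints {
		if len(ep.Addresses) == 0 {
			continue
		}
		key := ep.Addresses[0]
		seen[key] = true
		c, ok := b.children[key]
		if !ok {
			c = &child{} // create a pick_first child for the new endpoint
			b.children[key] = c
		}
		c.update(ep.Addresses) // the child manages connections for these addresses
	}
	// Remove children for endpoints that disappeared from the update.
	for key := range b.children {
		if !seen[key] {
			delete(b.children, key)
		}
	}
}

func main() {
	b := &balancerSketch{children: make(map[string]*child)}
	b.updateEndpoints([]endpoint{{Addresses: []string{"10.0.0.1:80", "10.0.0.2:80"}}})
	b.updateEndpoints([]endpoint{{Addresses: []string{"10.0.0.1:80"}}, {Addresses: []string{"10.0.0.3:80"}}})
	fmt.Println(len(b.children)) // 2: one child kept, one added, one removed
}
```

The point of the pattern is that ringhash no longer reacts to SubConn state changes directly; it only aggregates the connectivity states its pick_first children report.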

@arjan-bal arjan-bal added Type: Feature New features or improvements in behavior Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. labels Jan 28, 2025
@arjan-bal arjan-bal added this to the 1.71 Release milestone Jan 28, 2025
codecov bot commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 97.14286% with 6 lines in your changes missing coverage. Please review.

Project coverage is 82.23%. Comparing base (ad5cd32) to head (ec55d5b).
Report is 5 commits behind head on master.

Files with missing lines                     Patch %   Lines
xds/internal/balancer/ringhash/ringhash.go   97.43%    4 Missing ⚠️
xds/internal/balancer/ringhash/picker.go     88.88%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8047      +/-   ##
==========================================
+ Coverage   82.10%   82.23%   +0.12%     
==========================================
  Files         387      387              
  Lines       39067    38928     -139     
==========================================
- Hits        32076    32011      -65     
+ Misses       5662     5598      -64     
+ Partials     1329     1319      -10     
Files with missing lines                            Coverage Δ
balancer/weightedroundrobin/weightedroundrobin.go   100.00% <100.00%> (ø)
xds/internal/balancer/ringhash/logging.go           100.00% <ø> (ø)
xds/internal/balancer/ringhash/ring.go              92.95% <100.00%> (+0.65%) ⬆️
xds/internal/balancer/ringhash/picker.go            90.00% <88.88%> (+5.66%) ⬆️
xds/internal/balancer/ringhash/ringhash.go          93.80% <97.43%> (+6.18%) ⬆️

... and 26 files with indirect coverage changes

@dfawley (Member) left a comment

I guess this also needs to wait for #8052, right?

This is a pretty large change, and I haven't done a very thorough review, but it mostly looks good to me. I would like to request @easwars or maybe @atollena to review more closely, since they're more familiar with this logic than me.

arjan-bal and others added 2 commits January 31, 2025 16:28
Co-authored-by: Doug Fawley <dfawley@google.com>
@arjan-bal (Contributor, Author)

I guess this also needs to wait for #8052, right?

This PR depends on the lazy balancer and #8052 changes it. Either PR can be merged first. I'll rebase the other.

@dfawley (Member) left a comment

LGTM. Just some comment changes that I noticed...

@dfawley dfawley assigned arjan-bal and unassigned dfawley Feb 3, 2025
arjan-bal and others added 2 commits February 4, 2025 19:09
Co-authored-by: Doug Fawley <dfawley@google.com>
@arjan-bal arjan-bal removed their assignment Feb 4, 2025
@easwars easwars assigned arjan-bal and unassigned easwars Feb 4, 2025
@arjan-bal arjan-bal force-pushed the ringhash-delegate-to-pf branch from 7fe6692 to 5202208 Compare February 11, 2025 10:00
@arjan-bal arjan-bal assigned arjan-bal and unassigned atollena Feb 11, 2025
@arjan-bal arjan-bal force-pushed the ringhash-delegate-to-pf branch from 7065793 to 0b91fe0 Compare February 11, 2025 18:16
@arjan-bal arjan-bal assigned atollena and unassigned arjan-bal Feb 11, 2025
@arjan-bal arjan-bal requested a review from atollena February 11, 2025 18:25
}
return addrsUpdated

b.updatePickerLocked(endpointEnteredTF)
Contributor

Could you please write a comment or two about why we trigger connecting to the next idle endpoint only when this condition is true. Thanks.

Contributor Author

There is a comment in updatePickerLocked, where the boolean is used. endpointEnteredTF was added based on the pseudocode in A61:

if (in_transient_failure && endpoint_entered_transient_failure) {

On re-reading the existing comment in updatePickerLocked, I think that the endpoint_entered_transient_failure condition can cause the ring hash to remain in TF in certain situations where the aggregated state is TF, but no endpoint actually enters TF:

  • Scenario 1: There are four endpoints in the following states: TF, TF, READY, and IDLE. If the READY endpoint fails, it transitions to IDLE, resulting in the new states: TF, TF, IDLE, IDLE.
  • Scenario 2: There are four endpoints in the following states: TF, TF, CONNECTING, and IDLE. If the CONNECTING endpoint is removed, the new states become: TF, TF, IDLE.

I've asked about this in the maintainers group.
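
For illustration, here is a small Go sketch of the ring hash aggregation rule as described in gRFC A42 (the precedence order is restated from that gRFC; the State type is local to this sketch, not grpc-go's connectivity package). It shows how Scenario 1's transition lands in TRANSIENT_FAILURE even though no endpoint newly entered TF:

```go
package main

import "fmt"

// State is a local stand-in for connectivity.State.
type State int

const (
	Idle State = iota
	Connecting
	Ready
	TransientFailure
)

// aggregate implements the precedence order from gRFC A42:
// any READY > two or more TRANSIENT_FAILURE > any CONNECTING > any IDLE >
// TRANSIENT_FAILURE.
func aggregate(states []State) State {
	var ready, connecting, idle, tf int
	for _, s := range states {
		switch s {
		case Ready:
			ready++
		case Connecting:
			connecting++
		case Idle:
			idle++
		case TransientFailure:
			tf++
		}
	}
	switch {
	case ready > 0:
		return Ready
	case tf > 1:
		return TransientFailure
	case connecting > 0:
		return Connecting
	case idle > 0:
		return Idle
	default:
		return TransientFailure
	}
}

func main() {
	// Scenario 1: TF, TF, READY, IDLE; the READY endpoint fails to IDLE.
	before := aggregate([]State{TransientFailure, TransientFailure, Ready, Idle})
	after := aggregate([]State{TransientFailure, TransientFailure, Idle, Idle})
	// The aggregated state drops from READY to TRANSIENT_FAILURE, but no
	// endpoint entered TF in this transition, so the
	// endpoint_entered_transient_failure gate would not fire.
	fmt.Println(before == Ready, after == TransientFailure) // true true
}
```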

Collaborator

This is a great finding. Please let us know what the outcome of this discussion is, as IIUC it is happening on a Google-private channel. Also, it would be nice to demonstrate the behavior of those scenarios via end-to-end tests.

Contributor Author

Update: These edge cases were overlooked when writing A61. There is a discussion going on about the best way to handle such cases. I plan to remove the in_transient_failure condition in Go, add unit tests to verify that ringhash recovers from TF in both these cases, and add a TODO to update the implementation later based on the result of the discussion.

@@ -163,14 +164,22 @@ func (s) TestRingHash_ReconnectToMoveOutOfTransientFailure(t *testing.T) {

// startTestServiceBackends starts num stub servers. It returns their addresses.
Contributor

Nit: The return type has changed. Please update the comment.

Contributor Author

Updated the comment.

Comment on lines 2188 to 2204
var otherEndpointAddr string
var backendIdx int
switch initialAddr {
case backendAddrs[0]:
otherEndpointAddr = backendAddrs[1]
backendIdx = 0
case backendAddrs[1]:
otherEndpointAddr = backendAddrs[0]
backendIdx = 1
case backendAddrs[2]:
otherEndpointAddr = backendAddrs[3]
backendIdx = 2
case backendAddrs[3]:
otherEndpointAddr = backendAddrs[2]
backendIdx = 3
default:
t.Fatalf("Request sent to unknown backend: %q", initialAddr)
Contributor

I don't quite understand this. Shouldn't the first RPC always reach the first address in the endpoint, i.e. backendAddrs[0]?

Contributor Author

Yes, you're correct. The cases for the second address should never be used. Removed the extra cases now.

@easwars easwars assigned arjan-bal and unassigned atollena and easwars Feb 11, 2025
@arjan-bal arjan-bal force-pushed the ringhash-delegate-to-pf branch from a0697d9 to 9e7758e Compare February 12, 2025 06:31
@atollena (Collaborator) left a comment

just minor nits, otherwise lgtm.

ring *ring
logger *grpclog.PrefixLogger
// endpointStates is a cache of endpoint connectivity states and pickers.
// The cache is used to avoid locking the mutex for reading the state at RPC
Collaborator

I had trouble understanding this comment. What mutex are you referring to?

Contributor Author

The ringhash balancer stores endpoint states in a resolver.EndpointMap, with access guarded by ringhashBalancer.mu. The endpointStates cache in the picker helps avoid locking the ringhash balancer's mutex when reading the latest state at RPC time. This also means that the picker needs to be regenerated every time a child generates a new picker.

Mentioned this in the comment.
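
A minimal sketch of that pattern, with illustrative names (not the actual ringhash types): the balancer mutates its state map only under the mutex, and publishes an immutable snapshot into each new picker, so Pick reads without ever taking the balancer's lock.

```go
package main

import (
	"fmt"
	"sync"
)

// endpointState is a simplified stand-in for a cached child state.
type endpointState struct {
	state string // e.g. "READY", "IDLE"
}

type picker struct {
	// Immutable after construction: safe for concurrent lock-free reads at
	// RPC time.
	endpointStates map[string]endpointState
}

func (p *picker) pick(key string) string {
	return p.endpointStates[key].state // no lock needed
}

type ringhashBalancer struct {
	mu             sync.Mutex
	endpointStates map[string]endpointState
	current        *picker
}

// onChildUpdate is called whenever a child reports a new state; the balancer
// must regenerate its picker so the published snapshot stays current.
func (b *ringhashBalancer) onChildUpdate(key string, s endpointState) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.endpointStates[key] = s
	snapshot := make(map[string]endpointState, len(b.endpointStates))
	for k, v := range b.endpointStates {
		snapshot[k] = v
	}
	b.current = &picker{endpointStates: snapshot}
}

func main() {
	b := &ringhashBalancer{endpointStates: make(map[string]endpointState)}
	b.onChildUpdate("ep1", endpointState{state: "READY"})
	p := b.current
	b.onChildUpdate("ep1", endpointState{state: "IDLE"}) // old picker unaffected
	fmt.Println(p.pick("ep1"), b.current.pick("ep1"))    // READY IDLE
}
```

The trade-off is exactly the one the comment mentions: every child state change costs a snapshot copy and a new picker, in exchange for lock-free reads on the RPC hot path.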

// Per gRFC A61, because of sticky-TF with PickFirst's auto reconnect on TF,
// we ignore all TF subchannels and find the first ring entry in READY,
// CONNECTING or IDLE. If that entry is in IDLE, we need to initiate a
// connection. The idlePicker returned by the LazyLB or the new Pickfirst
Collaborator

Why "or the new PickFirst"? Isn't exitIdle always triggered by the lazy balancer?

Contributor Author

The lazy balancer's idle picker is used the first time an endpoint exits idle state. If a subchannel fails after connecting to the backend, the pickfirst balancer will enter idle, but this time ringhash would get pickfirst's idle picker instead of the lazy balancer's idle picker.
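
A hedged sketch of the idle-picker behavior being described, using hypothetical names (not grpc-go's actual picker types): regardless of whether the lazy balancer or pick_first produced the idle picker, picking an IDLE ring entry must trigger exitIdle exactly once.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// idlePicker is an illustrative picker returned while a child is IDLE. Its
// exitIdle callback stands in for kicking the child policy out of IDLE.
type idlePicker struct {
	exitIdle func()
	once     sync.Once
}

// pick reports that no connection is ready, but initiates a connection
// attempt on the first call only.
func (p *idlePicker) pick() error {
	p.once.Do(p.exitIdle)
	return errors.New("no ready connection; connecting")
}

func main() {
	attempts := 0
	p := &idlePicker{exitIdle: func() { attempts++ }}
	p.pick()
	p.pick()
	fmt.Println(attempts) // 1: repeated picks do not restart the attempt
}
```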

Collaborator

OK, makes sense.

// the previous, i.e. any of the following is true:
// - an endpoint was added
// - an endpoint was removed
// - an endpoint's weight was updated
Collaborator

You should add that:

- the first address of the endpoint has changed.

Contributor Author

Updated the comment.
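
For illustration, the ring-regeneration conditions in that comment could be checked like this (the types, the map key, and the weight field are assumptions for this sketch, not the actual implementation):

```go
package main

import "fmt"

// endpointInfo holds the two properties that feed ring construction in this
// sketch: the endpoint's first address (its ring key) and its weight.
type endpointInfo struct {
	firstAddr string
	weight    uint32
}

// ringNeedsUpdate reports whether the ring must be regenerated: an endpoint
// was added or removed, a weight changed, or a first address changed.
func ringNeedsUpdate(old, updated map[string]endpointInfo) bool {
	if len(old) != len(updated) {
		return true // endpoint added or removed
	}
	for key, n := range updated {
		o, ok := old[key]
		if !ok {
			return true // endpoint replaced by a new one
		}
		if o.weight != n.weight || o.firstAddr != n.firstAddr {
			return true // weight or ring key changed
		}
	}
	return false
}

func main() {
	old := map[string]endpointInfo{"ep1": {"10.0.0.1:80", 1}}
	same := map[string]endpointInfo{"ep1": {"10.0.0.1:80", 1}}
	reweighted := map[string]endpointInfo{"ep1": {"10.0.0.1:80", 2}}
	fmt.Println(ringNeedsUpdate(old, same), ringNeedsUpdate(old, reweighted)) // false true
}
```

Skipping regeneration when none of these conditions hold matters because building the ring is O(entries log entries) and rings can be large (the config above allows up to 8388608 entries).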


// when the first endpoint fails. The test then verifies that the balancer
// doesn't automatically re-connect to the endpoint when the READY endpoint
// enters IDLE.
func (s) TestRingHash_NoReconnectWhenEndpointEntersIdle(t *testing.T) {
Collaborator

I got this test to timeout once with the following output:

--- FAIL: Test (11.78s)
    --- FAIL: Test/RingHash_NoReconnectWhenEndpointEntersIdle (10.01s)
        tlogger.go:116: INFO restartable_listener.go:84 [testutils] Stopping restartable listener "127.0.0.1:63116"  (t=+1.977333ms)
        tlogger.go:116: INFO server.go:690 [core] [Server #44496]Server created  (t=+2.222708ms)
        stubserver.go:300: Started test service backend at "127.0.0.1:63117"
        ringhash_balancer_test.go:2247: Created new snapshot cache...
        tlogger.go:116: INFO server.go:886 [core] [Server #44496 ListenSocket #44497]ListenSocket created  (t=+2.287291ms)
        tlogger.go:116: INFO server.go:690 [core] [Server #44498]Server created  (t=+2.485ms)
        ringhash_balancer_test.go:2247: Registered Aggregated Discovery Service (ADS)...
        ringhash_balancer_test.go:2247: xDS management server serving at: 127.0.0.1:63118...
        tlogger.go:116: INFO server.go:886 [core] [Server #44498 ListenSocket #44499]ListenSocket created  (t=+2.538875ms)
        server.go:229: Created new resource snapshot...
        server.go:235: Updated snapshot cache with resource snapshot...
        tlogger.go:116: INFO clientconn.go:1659 [core] original dial target is: "xds:///test.server"  (t=+3.567791ms)
        tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44500]Channel created  (t=+3.585125ms)
        tlogger.go:116: INFO clientconn.go:193 [core] [Channel #44500]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"xds", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/test.server", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}  (t=+3.593916ms)
        tlogger.go:116: INFO clientconn.go:194 [core] [Channel #44500]Channel authority set to "test.server"  (t=+3.602666ms)
        tlogger.go:116: INFO blocking_context_dialer.go:106 [testutils] Hold 0x14000f32630: Waiting for a connection attempt to addr "127.0.0.1:63116"  (t=+3.619708ms)
        tlogger.go:116: INFO xds_resolver.go:111 [xds] [xds-resolver 0x14000f12360] Creating resolver for target: xds:///test.server  (t=+3.641875ms)
        tlogger.go:116: INFO pool.go:242 [xds] xDS node ID: 69b48cbe-e27f-436d-9f7a-9093e09cb9ba  (t=+3.709ms)
        tlogger.go:116: INFO clientconn.go:1659 [core] original dial target is: "passthrough:///127.0.0.1:63118"  (t=+3.724958ms)
        tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44501]Channel created  (t=+3.732833ms)
        tlogger.go:116: INFO clientconn.go:193 [core] [Channel #44501]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"passthrough", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/127.0.0.1:63118", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}  (t=+3.739041ms)
        tlogger.go:116: INFO clientconn.go:194 [core] [Channel #44501]Channel authority set to "127.0.0.1:63118"  (t=+3.744375ms)
        tlogger.go:116: INFO resolver_wrapper.go:211 [core] [Channel #44501]Resolver state updated: {
              "Addresses": [
                {
                  "Addr": "127.0.0.1:63118",
                  "ServerName": "",
                  "Attributes": null,
                  "BalancerAttributes": null,
                  "Metadata": null
                }
              ],
              "Endpoints": [
                {
                  "Addresses": [
                    {
                      "Addr": "127.0.0.1:63118",
                      "ServerName": "",
                      "Attributes": null,
                      "BalancerAttributes": null,
                      "Metadata": null
                    }
                  ],
                  "Attributes": null
                }
              ],
              "ServiceConfig": null,
              "Attributes": null
            } (resolver returned new addresses)  (t=+3.772791ms)
        tlogger.go:116: INFO balancer_wrapper.go:122 [core] [Channel #44501]Channel switches to new LB policy "pick_first"  (t=+3.785625ms)
        tlogger.go:116: INFO clientconn.go:843 [core] [Channel #44501 SubChannel #44502]Subchannel created  (t=+3.797333ms)
        tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44501]Channel Connectivity change to CONNECTING  (t=+3.80375ms)
        tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44501]Channel exiting idle mode  (t=+3.809791ms)
        tlogger.go:116: INFO clientconn.go:1204 [core] [Channel #44501 SubChannel #44502]Subchannel Connectivity change to CONNECTING  (t=+3.831291ms)
        tlogger.go:116: INFO clientconn.go:1324 [core] [Channel #44501 SubChannel #44502]Subchannel picks a new address "127.0.0.1:63118" to connect  (t=+3.840375ms)
        tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44500]Channel exiting idle mode  (t=+3.848541ms)
        tlogger.go:116: INFO clientconn.go:1204 [core] [Channel #44501 SubChannel #44502]Subchannel Connectivity change to READY  (t=+4.021541ms)
        tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44501]Channel Connectivity change to READY  (t=+4.036375ms)
        logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.listener.v3.Listener[test.server] and known map[]. Diff [test.server]
        logging.go:30: respond type.googleapis.com/envoy.config.listener.v3.Listener[test.server] version "" with version "1"
        logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.listener.v3.Listener[test.server] and known map[test.server:{}]. Diff []
        logging.go:30: open watch 1 for type.googleapis.com/envoy.config.listener.v3.Listener[test.server] from nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba", version "1"
        logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.route.v3.RouteConfiguration[new_route] and known map[]. Diff [new_route]
        logging.go:30: respond type.googleapis.com/envoy.config.route.v3.RouteConfiguration[new_route] version "" with version "1"
        tlogger.go:116: INFO xds_resolver.go:295 [xds] [xds-resolver 0x14000f12360] For Listener resource "test.server" and RouteConfiguration resource "new_route", generated service config: {
              "loadBalancingConfig": [
                {
                  "xds_cluster_manager_experimental": {
                    "children": {
                      "cluster:cluster": {
                        "childPolicy": [
                          {
                            "cds_experimental": {
                              "cluster": "cluster"
                            }
                          }
                        ]
                      }
                    }
                  }
                }
              ]
            }  (t=+4.334291ms)
        tlogger.go:116: INFO resolver_wrapper.go:211 [core] [Channel #44500]Resolver state updated: {
              "Addresses": null,
              "Endpoints": [],
              "ServiceConfig": {
                "Config": {
                  "Config": null,
                  "Methods": {}
                },
                "Err": null
              },
              "Attributes": {
                "\u003c%!p(resolver.csKeyType=grpc.internal.resolver.configSelector)\u003e": "\u003c0x14000be5c70\u003e",
                "\u003c%!p(xdsclient.clientKeyType=grpc.xds.internal.client.Client)\u003e": "\u003c0x14000f2c800\u003e"
              }
            } (service config updated)  (t=+4.523416ms)
        tlogger.go:116: INFO balancer_wrapper.go:122 [core] [Channel #44500]Channel switches to new LB policy "xds_cluster_manager_experimental"  (t=+4.544916ms)
        tlogger.go:116: INFO clustermanager.go:59 [xds] [xds-cluster-manager-lb 0x14000bec780] Created  (t=+4.56025ms)
        tlogger.go:116: INFO balancergroup.go:299 [xds] [xds-cluster-manager-lb 0x14000bec780] Adding child policy of type "cds_experimental" for child "cluster:cluster"  (t=+4.572625ms)
        tlogger.go:116: INFO balancergroup.go:102 [xds] [xds-cluster-manager-lb 0x14000bec780] Creating child policy of type "cds_experimental" for child "cluster:cluster"  (t=+4.579958ms)
        tlogger.go:116: INFO cdsbalancer.go:114 [xds] [cds-lb 0x1400068ec00] Created  (t=+4.592041ms)
        logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.route.v3.RouteConfiguration[new_route] and known map[new_route:{}]. Diff []
        logging.go:30: open watch 2 for type.googleapis.com/envoy.config.route.v3.RouteConfiguration[new_route] from nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba", version "1"
        tlogger.go:116: INFO cdsbalancer.go:126 [xds] [cds-lb 0x1400068ec00] xDS credentials in use: false  (t=+4.598416ms)
        tlogger.go:116: INFO cdsbalancer.go:301 [xds] [cds-lb 0x1400068ec00] Received balancer config update: {
              "LoadBalancingConfig": null,
              "Cluster": "cluster"
            }  (t=+4.637208ms)
        tlogger.go:116: INFO balancerstateaggregator.go:191 [xds] [xds-cluster-manager-lb 0x14000bec780] Child pickers: map[cluster:cluster:picker:0x14000bf0a00,state:CONNECTING,stateToAggregate:CONNECTING]  (t=+4.666041ms)
        tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44500]Channel Connectivity change to CONNECTING  (t=+4.67675ms)
        logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.cluster.v3.Cluster[cluster] and known map[]. Diff [cluster]
        logging.go:30: respond type.googleapis.com/envoy.config.cluster.v3.Cluster[cluster] version "" with version "1"
        tlogger.go:116: INFO cdsbalancer.go:425 [xds] [cds-lb 0x1400068ec00] Received Cluster resource: {
              "ClusterType": 0,
              "ClusterName": "cluster",
              "EDSServiceName": "cluster",
              "LRSServerConfig": null,
              "SecurityCfg": null,
              "MaxRequests": null,
              "DNSHostName": "",
              "PrioritizedClusterNames": null,
              "LBPolicy": [
                {
                  "ring_hash_experimental": {
                    "minRingSize": 1024,
                    "maxRingSize": 8388608
                  }
                }
              ],
              "OutlierDetection": null,
              "Raw": {
                "type_url": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
                "value": "CgdjbHVzdGVyGg0KAhoAEgdjbHVzdGVyMAIQAw=="
              },
              "TelemetryLabels": {
                "csm.service_name": "unknown",
                "csm.service_namespace_name": "unknown"
              }
            }  (t=+4.855291ms)
        tlogger.go:116: INFO clusterresolver.go:85 [xds] [xds-cluster-resolver-lb 0x14000530820] Created  (t=+4.888083ms)
        tlogger.go:116: INFO cdsbalancer.go:466 [xds] [cds-lb 0x1400068ec00] Created child policy 0x14000530820 of type cluster_resolver_experimental  (t=+4.898ms)
        logging.go:30: nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba" requested type.googleapis.com/envoy.config.cluster.v3.Cluster[cluster] and known map[cluster:{}]. Diff []
        logging.go:30: open watch 3 for type.googleapis.com/envoy.config.cluster.v3.Cluster[cluster] from nodeID "69b48cbe-e27f-436d-9f7a-9093e09cb9ba", version "1"
        ringhash_balancer_test.go:2287: Timeout waiting for connection attempt to backend "127.0.0.1:63116".
        tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44500]Channel Connectivity change to SHUTDOWN  (t=+10.003514958s)
        tlogger.go:116: INFO resolver_wrapper.go:111 [core] [Channel #44500]Closing the name resolver  (t=+10.00356s)
        tlogger.go:116: INFO balancer_wrapper.go:160 [core] [Channel #44500]ccBalancerWrapper: closing  (t=+10.003606416s)
        tlogger.go:116: INFO watch_service.go:56 [xds] [xds-resolver 0x14000f12360] Canceling watch on Listener resource "test.server"  (t=+10.003708333s)
        tlogger.go:116: INFO clusterresolver.go:355 [xds] [xds-cluster-resolver-lb 0x14000530820] Shutdown  (t=+10.003748625s)
        tlogger.go:116: INFO cdsbalancer.go:391 [xds] [cds-lb 0x1400068ec00] Shutdown  (t=+10.003777666s)
        tlogger.go:116: INFO clustermanager.go:197 [xds] [xds-cluster-manager-lb 0x14000bec780] Shutdown  (t=+10.003803291s)
        tlogger.go:116: WARNING ads_stream.go:514 [xds] [xds-client 0x14000f30990] [xds-channel 0x14000e98600] [ads-stream 0x1400161d300] ADS stream closed: rpc error: code = Canceled desc = context canceled  (t=+10.00383s)
        tlogger.go:116: INFO ads_stream.go:178 [xds] [xds-client 0x14000f30990] [xds-channel 0x14000e98600] [ads-stream 0x1400161d300] Stopping ADS stream  (t=+10.003872291s)
        tlogger.go:116: INFO clientconn.go:549 [core] [Channel #44501]Channel Connectivity change to SHUTDOWN  (t=+10.003931375s)
        tlogger.go:116: INFO resolver_wrapper.go:111 [core] [Channel #44501]Closing the name resolver  (t=+10.003957541s)
        tlogger.go:116: INFO balancer_wrapper.go:160 [core] [Channel #44501]ccBalancerWrapper: closing  (t=+10.0039875s)
        tlogger.go:116: WARNING ads_stream.go:471 [xds] [xds-client 0x14000f30990] [xds-channel 0x14000e98600] [ads-stream 0x1400161d300] Sending ADS request for type "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment", resources: [], version: "", nonce: "" failed: EOF  (t=+10.004030416s)
        tlogger.go:116: INFO clientconn.go:1204 [core] [Channel #44501 SubChannel #44502]Subchannel Connectivity change to SHUTDOWN  (t=+10.004085583s)
        tlogger.go:116: INFO clientconn.go:1532 [core] [Channel #44501 SubChannel #44502]Subchannel deleted  (t=+10.004120916s)
        tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44501]Channel deleted  (t=+10.004357083s)
        tlogger.go:116: INFO channel.go:145 [xds] [xds-client 0x14000f30990] [xds-channel 0x14000e98600] Shutdown  (t=+10.004390083s)
        tlogger.go:116: INFO watch_service.go:91 [xds] [xds-resolver 0x14000f12360] Canceling watch on RouteConfiguration resource "new_route"  (t=+10.004416791s)
        tlogger.go:116: INFO clientimpl.go:228 [xds] [xds-client 0x14000f30990] Shutdown  (t=+10.004449208s)
        tlogger.go:116: INFO xds_resolver.go:264 [xds] [xds-resolver 0x14000f12360] Shutdown  (t=+10.00447575s)
        tlogger.go:116: INFO clientconn.go:319 [core] [Channel #44500]Channel deleted  (t=+10.004502041s)
        tlogger.go:116: INFO server.go:822 [core] [Server #44498 ListenSocket #44499]ListenSocket deleted  (t=+10.00461225s)
        tlogger.go:116: INFO server.go:822 [core] [Server #44496 ListenSocket #44497]ListenSocket deleted  (t=+10.004753s)

I haven't had time to dig into it, but you should be able to reproduce with go test -count 100 ./xds/internal/balancer/ringhash/e2e -run Test/RingHash_NoReconnectWhenEndpointEntersIdle. I haven't been able to reproduce again, so maybe we can just ignore it.

Contributor Author

In the success case, the next few log lines after Created child policy 0x14000530820 of type cluster_resolver_experimental are similar to the following:

logging.go:30: nodeID "267a4abd-bf2b-4af8-9119-b88eeb7e1d98" requested type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[cluster] and known map[]. Diff [cluster]
logging.go:30: respond type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[cluster] version "" with version "1"
tlogger.go:116: INFO balancer.go:76 [xds] [priority-lb 0xc000378230] Created (t=+9.91661ms)

It seems that the clusterresolver didn't start a watch for EDS resources. The root cause is unclear, but it appears unrelated to the changes in this PR. I tried to repro the failure with debug logs enabled by running the test 10^5 times on Google's test cluster, but all the runs passed.

Collaborator

OK, not sure what happened when it failed locally with -count 100, I tried to run that again and it did not fail.

Contributor Author

There is an issue open for a failure with similar logs: #7993

arjan-bal and others added 2 commits February 12, 2025 15:23
Co-authored-by: Antoine Tollenaere <atollena@gmail.com>
@arjan-bal arjan-bal force-pushed the ringhash-delegate-to-pf branch from 5ee0a8f to ec55d5b Compare February 13, 2025 15:40
@arjan-bal arjan-bal merged commit fabe274 into grpc:master Feb 13, 2025
15 checks passed
@arjan-bal arjan-bal deleted the ringhash-delegate-to-pf branch February 13, 2025 15:56
janardhanvissa pushed a commit to janardhanvissa/grpc-go that referenced this pull request Feb 15, 2025