Fix AP upgrade version issue #27277
Conversation
@@ -582,6 +583,7 @@ func NewRaftBackend(conf map[string]string, logger log.Logger) (physical.Backend
    failGetInTxn:                new(uint32),
    raftLogVerifierEnabled:      backendConfig.RaftLogVerifierEnabled,
    raftLogVerificationInterval: backendConfig.RaftLogVerificationInterval,
    effectiveSDKVersion:         version.GetVersion().Version,
Setting it by default to the binary version is a defensive move so we have something that's 99% of the time going to be correct anyway even if the plumbing of the effective version is still incorrect in some edge case (or becomes incorrect later).
Good call
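To make the intent of that defensive default concrete, here is a minimal, self-contained sketch. The type, constant, and method names are simplified stand-ins (not Vault's actual constructor or setter): the backend starts from the binary's version, and any later plumbing merely overrides it.

package main

import "fmt"

// binaryVersion stands in for version.GetVersion().Version.
const binaryVersion = "1.17.0"

// raftBackend is a simplified stand-in for the real raft backend type.
type raftBackend struct {
    effectiveSDKVersion string
}

// newRaftBackend defaults the effective version to the binary's version so
// that a missed plumbing path still reports something almost always correct.
func newRaftBackend() *raftBackend {
    return &raftBackend{effectiveSDKVersion: binaryVersion}
}

// overrideEffectiveSDKVersion models the later, optional override (config,
// env, or tests); an empty value leaves the defensive default in place.
func (b *raftBackend) overrideEffectiveSDKVersion(v string) {
    if v != "" {
        b.effectiveSDKVersion = v
    }
}

func main() {
    b := newRaftBackend()
    fmt.Println(b.effectiveSDKVersion) // default: the binary version

    b.overrideEffectiveSDKVersion("1.17.0+ent")
    fmt.Println(b.effectiveSDKVersion) // overridden once plumbing runs
}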
// EffectiveSDKVersion is typically the version string baked into the binary.
// We pass it in though because it can be overridden in tests or via ENV in
// core.
EffectiveSDKVersion string
I preferred making this an option when we set up the cluster, since that is an existing mechanism for runtime config that isn't known at NewRaftBackend time, rather than a custom Set* method that needs correct plumbing.
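As a rough illustration of that design choice (a setup-time option rather than a setter), here is a sketch with simplified stand-in types. The option field names echo the diff below, but the struct and logic are illustrative, not Vault's real implementation.

package main

import "fmt"

// SetupOpts is a simplified stand-in for the options struct shown in the
// diff below; only the fields relevant to this discussion are modeled.
type SetupOpts struct {
    StartAsLeader       bool
    EffectiveSDKVersion string
}

type raftBackend struct {
    effectiveSDKVersion string // defaulted to the binary version at construction
}

// SetupCluster runs on every node type when the cluster is set up, so
// applying the effective version here avoids a separate Set* method that
// each code path must remember to call.
func (b *raftBackend) SetupCluster(opts SetupOpts) error {
    if opts.EffectiveSDKVersion != "" {
        b.effectiveSDKVersion = opts.EffectiveSDKVersion
    }
    return nil
}

func main() {
    b := &raftBackend{effectiveSDKVersion: "1.17.0"} // binary default
    _ = b.SetupCluster(SetupOpts{StartAsLeader: false, EffectiveSDKVersion: "1.17.0+ent"})
    fmt.Println(b.effectiveSDKVersion)
}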
TLSKeyring:          raftTLS,
ClusterListener:     c.getClusterListener(),
StartAsLeader:       creating,
EffectiveSDKVersion: c.effectiveSDKVersion,
Note that c.effectiveSDKVersion is only ever set in the CreateCore method before the Core is returned, so if it's going to be set it will have been by now and doesn't need any locking to read (we read it without locks in other places too). That means this will always be the correct value by the time this call is made - configured by coreConfig or defaulted in CreateCore.
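A tiny sketch of the write-once-before-publish pattern described above (illustrative names only, not Vault's CreateCore): because the field is assigned before the pointer escapes, later reads are race-free without a mutex.

package main

import "fmt"

// core is a stand-in for vault.Core, reduced to the field under discussion.
type core struct {
    effectiveSDKVersion string
}

// createCore assigns effectiveSDKVersion exactly once, before returning the
// pointer, so no other goroutine can observe it mid-write.
func createCore(configured, binaryVersion string) *core {
    c := &core{effectiveSDKVersion: binaryVersion} // default
    if configured != "" {
        c.effectiveSDKVersion = configured // e.g. supplied via core config
    }
    return c // only after this return can anything else read the field
}

func main() {
    c := createCore("", "1.17.0")
    fmt.Println(c.effectiveSDKVersion) // safe, lock-free read
}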
These changes look great Paul, and make sense to me.
I tried writing a test for this plumbing at least to catch any future regressions. It looked like this:

// TestRaft_Autopilot_BinaryVersionPlumbing is an apparently trivial test that
// ensures that the default plumbing in Vault core to configure the binary
// version of the raft library is working. Hopefully this will trivially pass
// from now on, however it would have caught a regression in the past!
func TestRaft_Autopilot_BinaryVersionPlumbing(t *testing.T) {
    t.Parallel()

    coreCfg, clusterOpts := raftClusterBuilder(t, &RaftClusterOpts{
        EnableAutopilot: true,
        // We need 2 nodes because the code path that regressed was different on a
        // standby vs active node so we'd not detect the problem if we only test on
        // an active node.
        NumCores: 2,
    })

    // Default options should not set EffectiveSDKVersion(Map) which would defeat
    // the point of this test by plumbing versions via config.
    require.Nil(t, clusterOpts.EffectiveSDKVersionMap)
    require.Empty(t, coreCfg.EffectiveSDKVersion)

    c := vault.NewTestCluster(t, coreCfg, &clusterOpts)
    vault.TestWaitActive(t, c.Cores[0].Core)
    client := c.Cores[0].Client

    // Have to wait for the follower to heartbeat at least once or we're just
    // effectively reading the leader's own version back. We know when this has
    // happened because its AppliedIndex will be > 0. Just to be robust against
    // assuming core-0 is the leader, check that all servers have an AppliedIndex
    // > 0.
    corehelpers.RetryUntil(t, 3*time.Minute, func() error {
        state, err := client.Sys().RaftAutopilotState()
        if err != nil {
            return err
        }
        if state == nil {
            return fmt.Errorf("no state")
        }
        wantVersion := version.GetVersion().Version
        for _, s := range state.Servers {
            if s.LastIndex == 0 {
                return fmt.Errorf("server %q has not sent a heartbeat", s.ID)
            }
            if s.Version != wantVersion {
                return fmt.Errorf("want version %s, got %s", wantVersion, s.Version)
            }
        }
        return nil
    })
}

The problem is that this test passes on main, too. To get a test case that really reproduces the bug this PR fixes, we'd need the followers to be a different binary version. We could possibly do that with a docker-based test I guess, but it would be a huge pain without shimming the versions through config... and once you do that you no longer test the actual default code path that caused the regression!

My feeling is that we have a case of diminishing returns jumping through hoops to get a docker integration test that actually checks out and builds a different version of the binary on some nodes in order to really exercise this. Our manual testing (which we will now codify in a script with a thoroughly considered matrix of scenarios) will need to be enough. Open to ideas if anyone else can come up with something that is worth adding to CI tests though!
This reverts commit e25fcd8.
OK I figured out a way to test this that reliably fails on main 😅. I feel OK about this since the backend already exposes the effective version for a test to read directly. This fails on main because the standby's effective version is never set there, so it reports an empty string.
c := vault.NewTestCluster(t, coreCfg, &clusterOpts)
defer c.Cleanup()
// Wait for follower to be perf standby |
Since this is CE Vault, the follower will never become a perf standby, will it?
No, in CE that helper just waits for the active node. So yeah, it's possible this test would not catch the regression in CE before this fix or would fail for the wrong reason, but either way I think it passes now and actively catches the issue in Enterprise so it seems worthwhile keeping it.
I'll update the comment 👍
I think we want to backport this to 1.17 too, right? Though perhaps not merge that backport until after 1.17.0 is released.
* Fix AP upgrade version issue
* add heartbeat logging at trace level
* add log to show when heartbeats resume
* Test the plumbing
* Revert "Test the plumbing" (this reverts commit e25fcd8)
* Add CHANGELOG
* Add plumbing test
* Update misleading comment

Co-authored-by: Josh Black <raskchanky@gmail.com>
This is a subtle issue introduced by #26449 while fixing a different issue. It only impacts Vault Enterprise but is generally simpler to reason about.
That PR was a correct fix, however the effectiveSDKVersion was only ever being set on active nodes, not on perf standbys in Enterprise.

The big changes here are:

* Set the version in SetupCluster, which is necessary on all types of servers.
* Default RaftBackend.effectiveSDKVersion back to the binary version, so even if we fail to plumb this from core in some edge case in the future it's still the right value in almost all cases (other than testing environments).

I've manually tested this locally. To reproduce the issue it fixes you need to:
* Run a cluster on an older binary version.
* Join new servers built from a newer binary.
* Not set autopilot_upgrade_version in config on the new servers (doing so bypasses the bug).

In this case you'll see the new servers join but the cluster never recognizes a new "target version" because their Echoes are sending an empty version string. You also see some logs about empty upgrade_versions which are legit in this case but are also confusing, because you can see them even after the fix due to timing (if AP runs before an Echo has come in but after the node joined raft on the leader, you'll see that log even on a non-broken cluster and then you don't see a corresponding "OK" log afterwards).
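To illustrate why an empty version string in the follower's echo is so confusing to the upgrade logic, here is a toy model. This is not Vault's actual autopilot code, and the version comparison is deliberately naive: the point is only that a node reporting no version never registers as a newer target.

package main

import "fmt"

// serverInfo is a toy stand-in for what the leader learns from each node's
// heartbeat/echo.
type serverInfo struct {
    id      string
    version string
}

// highestReportedVersion loosely mimics deriving an upgrade "target version"
// from reported versions; real code parses and compares semver properly.
func highestReportedVersion(servers []serverInfo) string {
    highest := ""
    for _, s := range servers {
        if s.version == "" {
            continue // an un-plumbed node looks like it has no version at all
        }
        if s.version > highest {
            highest = s.version
        }
    }
    return highest
}

func main() {
    servers := []serverInfo{
        {id: "old-active", version: "1.16.2"},
        {id: "new-standby", version: ""}, // the bug: the echo carries no version
    }
    fmt.Println(highestReportedVersion(servers)) // "1.16.2": the upgrade is never detected

    servers[1].version = "1.17.0" // after the fix the echo carries the binary version
    fmt.Println(highestReportedVersion(servers)) // "1.17.0"
}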