Huge regression: Upgrading HA DynamoDB backend based Vault cluster to 1.11.0 and upwards breaks functional cluster #17737
Comments
Doing a … with the corresponding commit 106f548a41c5aa3a7310e1b53f2196f989913eff: commenting out the return line just makes the cluster functional again. @ncabatoff any insights on this please?
Maybe the test … This explains why the test is always true in AWS behind a LB, and therefore why the Vault HA cluster is not functional any more.
and therefore … in the received JSON advertisement structure, and the same "addresses" are in the Core structure.
As a first suggestion, a more complete test which might solve the issue would be to do a … My very basic 2 cents.
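For illustration only, here is a minimal Go sketch of what such a more complete check could look like, assuming the standby can compare a node-unique identifier in addition to the address; the advertisement type, the NodeID field, the node names and the cluster URL/port are hypothetical stand-ins, not Vault's actual structures:

```go
package main

import "fmt"

// advertisement is a hypothetical stand-in for the structure a standby node
// receives from the active node; Vault's real types differ.
type advertisement struct {
	ClusterAddr string
	NodeID      string
}

// isSelf treats the advertisement as "myself" only when both the advertised
// cluster address AND a node-unique identifier match the local node.
// Comparing the address alone is not enough when every node advertises the
// same load-balancer address.
func isSelf(adv advertisement, localClusterAddr, localNodeID string) bool {
	return adv.ClusterAddr == localClusterAddr && adv.NodeID == localNodeID
}

func main() {
	// All nodes share the LB address as their cluster address (port assumed).
	lb := "https://vault.internal.mycluster.io:8201"
	adv := advertisement{ClusterAddr: lb, NodeID: "node-a"}

	fmt.Println(isSelf(adv, lb, "node-a")) // true: this node is the active one
	fmt.Println(isSelf(adv, lb, "node-b")) // false: a standby no longer matches
}
```

With something along these lines, only the node that really is the advertised active node would short-circuit, even when every node is configured with the load balancer address as its cluster address.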
@ccapurso many thanks for categorising this.
AFAICT I do not think that having DynamoDB as the Vault backend has anything to do with the issue. I am pretty much sure I can reproduce this with the Raft backend. Will try to demonstrate this if time permits.
Hi @obourdon, we saw the same issue today on our cluster running 1.11.3 with the DynamoDB backend on AWS behind an NLB. Do you know how to reproduce this bug? We have been running 1.11.x for the past month and only noticed it today.
@ArchiFleKs on my side, starting from a working HA cluster (Vault <= 1.10.7), I just replace the Vault binary on any one node with any version >= 1.11.0, relaunch the associated Linux service and then run the command … I also pretty much guess that this might change if the node you replace the Vault binary on is the current leader/active node. Any Vault call which is targeted at the now modified node will fail (in my case 1 out of 3 times, because the cluster has 3 nodes and the LB dispatches requests evenly). The situation can be completely reversed, even with an entirely failing cluster, by replacing Vault with any version <= 1.10.7 on all failing nodes (at least it was on my setup, where I now have everything back to normal).
Thanks for your input. We are currently in the process of downgrading; what about 1.10.8, released 2 days ago?
@ArchiFleKs it seems like it should work perfectly, like 1.10.7 (at least it does not include the failing 1.11.0 code).
I do not see any activity on this huge regression issue. Can someone please take care of this?
Still no feedback on this?
@obourdon I think the reason this change is impacting you is because every node has the same cluster_address, … We know we're not the active node because at the top of the function we return …
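To restate the failure mode being described here in code, a deliberately simplified sketch (not Vault's actual implementation) of an address-only self check: when every node's cluster_address is the load balancer address (the URL, port and node names below are assumptions), the comparison matches on every standby, so none of them sets up forwarding to the active node:

```go
package main

import "fmt"

// setupForwarding is a simplified stand-in for the standby-side decision
// being discussed: compare the cluster address advertised by the active node
// with the local node's own configured cluster_address and bail out early
// when they match, interpreting that as "the active node is myself".
func setupForwarding(advertisedClusterAddr, localClusterAddr string) string {
	if advertisedClusterAddr == localClusterAddr {
		return "skipped: advertised address looks like self"
	}
	return "forwarding to active node configured"
}

func main() {
	// Behind the NLB every node is configured with the same cluster address,
	// so the comparison is also true on node-b and node-c, not just on the
	// active node-a, and requests landing on a standby start failing.
	lb := "https://vault.internal.mycluster.io:8201"
	for _, node := range []string{"node-a (active)", "node-b", "node-c"} {
		fmt.Printf("%s: %s\n", node, setupForwarding(lb, lb))
	}
}
```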
I encountered the same issue, but we are using a specific …
If you're referring to the error …
I must admit that I do not get some of the points described above. AFAIU the only way to describe the cluster address when behind a load balancer using an AWS ASG is to specify the LB address, as you do not know in advance what your instance IPs will be, and if one instance fails and a new one is respawned by the ASG you should not have to change your Vault cluster configuration files. I indeed do not think that the issue is related to the fact that the backend is based on DynamoDB, judging by the location of the failing code, and I still do not understand why this change was done in the first place and what it is supposed to fix.
@obourdon Yes, I encountered the same issue at the same time you reported this bug, but since then I have kept Vault > 1.10 running in our dev environment and I was not able to reproduce the issue, nor did I encounter it a second time. And yes, we are using a different …
@ncabatoff you're right, that's what led me to believe it might not have been the same issue despite using the same backend (DynamoDB). And it might have been just a transient issue, because I was not able to reproduce it in the end.
@ArchiFleKs many thanks for all the details above, and indeed we also currently keep the Vault version strictly under 1.11 to make sure that the source code does not contain the issue. It would be very nice if Vault could support the same … @ncabatoff if you could spend some time explaining in more detail why this code was changed in the first place, as it is leading to a definitely NOT transient but permanent and definitive Vault cluster failure, we would greatly appreciate it.
Still no activity on this one.
We wound up doing it in #9109. It was to prevent attempting to connect to ourselves, which was causing problems in some tests.
@ncabatoff could you be a bit more specific and point us to the failing tests, please, and to how we could potentially run them locally on our development machines?
Any update on this?
I missed the release window on this one; please note that the fix won't be in the next releases, but rather the ones next month.
Describe the bug
A fully functional HA Vault cluster whose backend is DynamoDB can be updated from 1.9.4 up to 1.10.7 without any problem.
Upgrading one more time, to 1.11.0, just makes the cluster non-functional.
To Reproduce
Steps to reproduce the behavior:
curl http://127.0.0.1:8200/v1/sys/leader
curl -sk --header "X-Vault-Token: XXXX" https://vault.internal.mycluster.io/v1/sys/ha-status | jq -rS .
Expected behavior
With version 1.10.7 and below we get:
and
Environment:
Vault server version (vault status): 1.11.0
Vault CLI version (vault version): 1.11.0
Vault server configuration file(s):
Additional context
The functional cluster was originally running Vault 1.9.4 and I tried to upgrade it directly to 1.12.0, which failed with the behaviour described above.
I therefore decided to roll back to 1.9.4 and to upgrade one version at a time until it breaks.
The following versions were tested successfully:
1.11.0 and upwards just breaks the cluster.
Note also that, as the Vault cluster resides within an AWS infrastructure, it sits behind a properly configured Network Load Balancer, as can be seen in the attached files.
Putting the Vault processes in debug/trace mode did not help find any valuable information that could solve the issue.
I could not see anything in the changelog of 1.11.0 which could help either.