In pods with multiple EFS PVC mounts, the EFS-CSI driver seems to sometimes mix up mount points #635
Comments
Sounds pretty much like #282, doesn't it?
We have experienced a similar bug multiple times, although I am not 100% sure it is exactly the same as described here; it is more in line with the behaviour described in the closed issue #282. In our case, I noticed while browsing via shell on one of the pods that the same volume was actually mounted in two places, whereas those mount points should have had different mounts (one of them was correct, the other was the wrong volume).

Our workload writes files into dated folders, so I could see that this issue has recurred multiple times over a few months, by finding dated files from months back in the wrong volumes, but it has gone unnoticed most of the time. I don't have logs, and we usually only notice this after the fact, when some file read causes a mysterious "file not found" error, or some data suddenly appears in the wrong volume. It happens rarely and at random times, to random pods, while other pods from the same deployment/replicaset are working as expected.

What I have noticed is this: inside the pod, these volumes are mounted as nfs4 mounts from localhost with random port numbers. When the issue happens, I observed that multiple mounts, which should be different volumes, are mounted with the exact same port number, and those are also the volumes which have been mixed up. My uneducated guess would be that the driver assigns a random port number but does not check for conflicting ports (or if it does, this check fails). I might be wrong, but I suspect it still has something to do with the ports being assigned the same.

I think this is potentially a very serious issue, as some workloads might be deleting data from EFS volumes, and deleting data from the wrong volume would be a serious problem. @jgoeres I would suggest checking the port numbers when you observe this bug, to see whether the behaviour is the same.
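For reference, this is roughly how one can spot duplicated ports from inside an affected pod. It is only a sketch: the assumption that the stunnel port shows up as a `port=` option on an nfs4 mount from 127.0.0.1 matches what I saw in /proc/mounts, but your mounts may look slightly different.

```python
#!/usr/bin/env python3
"""Sketch: flag nfs4 mounts from localhost that share the same port.
Assumes the stunnel port is visible as a port= mount option."""
import re
from collections import defaultdict

def find_duplicate_ports(mounts_file="/proc/mounts"):
    ports = defaultdict(list)  # port -> list of mount points
    with open(mounts_file) as f:
        for line in f:
            source, target, fstype, options = line.split()[:4]
            if fstype != "nfs4" or not source.startswith("127.0.0.1"):
                continue
            m = re.search(r"(?:^|,)port=(\d+)(?:,|$)", options)
            if m:
                ports[m.group(1)].append(target)
    # Only ports that back more than one mount point are suspicious.
    return {p: t for p, t in ports.items() if len(t) > 1}

if __name__ == "__main__":
    for port, targets in find_duplicate_ports().items():
        print(f"port {port} is shared by mounts: {', '.join(targets)}")
```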
Since our setup described in my initial post is a bit involved, we tried to build a simple reproducer. We create a configurable number of PVCs via an EFS-CSI-driver-backed storage class. A "controller" pod mounts these and puts a unique marker file in each. It then proceeds to launch "validator" pods in rapid succession. The validators check whether the marker files are in the expected share. If they are, they report success to the controller and exit (to be restarted a second later by the controller); if not, they report an error to the controller and stay running (the idea being that this allows us to exec into the pod and check for the root cause).
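The core of the validator check amounts to something like the following simplified sketch. The marker names and mount points mirror our test (marker_a under /a, and so on); the reporting back to the controller is omitted here.

```python
#!/usr/bin/env python3
"""Sketch of the validator check: each mount point /a ... /e is expected
to contain exactly its own marker file."""
import os
import sys
import time

MARKERS = {f"/{d}": f"marker_{d}" for d in "abcde"}  # mount point -> expected marker

def check_mounts():
    errors = []
    for mount, expected in MARKERS.items():
        present = set(os.listdir(mount))
        if expected not in present:
            errors.append(f"{mount}: expected {expected}, found {sorted(present)}")
    return errors

if __name__ == "__main__":
    errors = check_mounts()
    if errors:
        # Stay running so the pod can be inspected with kubectl exec.
        print("\n".join(errors), file=sys.stderr)
        while True:
            time.sleep(3600)
    sys.exit(0)  # success: exit so the controller relaunches the pod
```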
I find it puzzling that you don't observe the problem in your synthetic test setup. Are the "validator" pods being run one by one? I have no evidence of this, but maybe the issue is more likely to appear if many pods mount shares? At least in our case, the pods experiencing this use HPA, so they are often scaled from 3 to a few dozen and back. Is your production Spark workload similar?

Maybe you can modify your test to run as a deployment with a few dozen replicas. The ones that "succeed" and exit would get re-launched as new pods by the replicaset to keep the desired replica count. If you define the marker files as known beforehand (like a file named "a" on the volume meant to be mounted to /mnt/a/, or something like that), you could probably even skip the controller: you would be able to tell that a pod has hit the bug just by its running time. But I don't know your setup, you might still want an alert in one way or another.

If many parallel pods don't reproduce the issue, another edge case one could try is setting node affinity or anti-affinity for the parallel pods. Maybe the issue is more likely with multiple parallel pods with the same EFS mounts running on the same instance. Or maybe the other way around, if they are on different hosts. Again, I am just guessing here.
In the meantime, the reproducer we built does manage to reproduce the issue. I had the controller permanently (re-)creating 30 "validator" pods and let that run from Friday evening. When I checked today, the issue had occurred 9 times out of 145358 launched pods (each mounting the same 5 PVCs into the directories /a to /e).
As you can see, the directory /c, which should contain the marker file marker_c, instead contains marker_a.
Just as mounting fails, unmounting does, too:
This second type of error also sometimes comes with an additional message:
Another observation: of the four nodes onto which the validator pods could be scheduled, 8 of the 9 validators that ran into the problem had been scheduled on the same node:
It is also interesting that the error messages sometimes occur in the seconds before the error is reported by the validator, while sometimes the last log statement of the efs-csi-node pod happened several minutes before the incident.
In this example, there appears to be a temporal correlation between the time of the incident (12:30:15) and the last log messages of that node's efs-csi-node pod (a failed unmount at 12:30:07). You can find the last hour of logs of that efs-csi-node pod attached.
The efs-csi-node pod has several hundred listening ports, established connections and Unix sockets (see attached file). Any pointers on how to proceed from here would be welcome; this issue has been plaguing us for a while now, and since we now have three mounts instead of the previous two, it is getting more frequent.
Regards
J
efs-csi-node_netstat.txt
@jgoeres that's a great amount of detail. I hope someone knowledgeable can use your data to get to the bottom of this, seeing as it is hard to reproduce on purpose. We are also plagued by this issue. Out of curiosity, did you observe the conflicting mounts using the same port?
@jgoeres We were facing a similar kind of issue. It seems the EFS CSI driver needs to be upgraded to v1.3.7. After this change we did not notice the issue again.
We're hitting the same issue with efs-csi driver v1.3.4. Can anyone else confirm whether v1.3.7 might have fixed the problem? cc: @wongma7
We were having this issue regularly with Helm chart versions up to v2.2.4 (driver v1.3.6). After updating to chart v2.2.6 (driver v1.3.8), the reports of this issue stopped for more than a month. The behavior was the same as I noticed previously: the same EFS volume mounted on different paths, which I also confirmed in the mount point list.
My team is being impacted by the same issue, observed with versions 1.3.6 and 1.3.8. We mount three persistent volumes over a single EFS volume using dynamic provisioning. Our workload tries to read specific files from specific volumes and sometimes fails because of the mis-mount. I tried digging a little, and so far I'm thinking efs-utils, which provides the underlying mount support, is involved. There is an MR open there that tries to fix a race condition with port selection for the connection to EFS (aws/efs-utils#129). I'm not familiar with the CSI retry policy or how this driver handles mount retries, but it sounds quite related to the issue here.
Hello. I was wondering: I ran into something that sounds very similar (issue #695). Does anybody see similar symptoms in the CSI node driver and kubelet logs (some volumes are mounted multiple times)?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules.
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
I think it's caused by aws/efs-utils#125: the race can result in a timeout/failure, or in very rare cases it could, IMO, surface as efs-utils using the same stunnel port for two different volumes and mounting an unexpected one there. I have no reproducer though.
We have identified the race condition that could cause this issue. The root cause is a race when selecting the TLS port: it happens in the window between sock.close() and the stunnel connection being established. We are already working on the fix.
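To make the window concrete, the pattern looks roughly like the sketch below. This is an illustration of the pick-a-free-port approach, not the actual efs-utils code.

```python
import socket

def pick_free_tls_port():
    """Roughly how a 'find a free port' helper works: bind an ephemeral
    port, remember the number, and close the socket. Between sock.close()
    and the moment stunnel actually binds that port, another mount on the
    same node can be handed the same number -- that is the race window.
    Illustration only, not the efs-utils implementation."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))   # kernel hands out a currently free port
    port = sock.getsockname()[1]
    sock.close()                  # <-- race window starts here
    return port                   # stunnel is only launched with this port later
```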
@lemosric have you considered aws/efs-utils#129?
Hi @jsafrane,
The latest release of the efs-csi-driver ships efs-utils v1.34.4, which contains this fix. Closing.
/kind bug
Hi all,
we are using EFS CSI driver v1.3.2 on EKS v1.21.5.
We are running a big data app where significant parts of the number-crunching logic have been implemented with Spark 3.1.1. We use the Spark operator (/~https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) v1.2.1-3.0.0 (the issue described here also occurs with v1.2.3-3.1.1) to run the Spark jobs.
The Spark operator defines a CRD "SparkApplication". In our application pods, whenever we need to run a number-crunching job, we create a SparkApplication instance, which the Spark operator will then see and create a Spark driver pod for. The driver itself then creates one or more executor pods which do most of the actual work.
In our Spark applications, we require access to two different PVCs (e.g., A and B), which are both provisioned through the EFS CSI driver and which we mount into the Spark pods (say, to mount points /a and /b). To make this happen, we add the corresponding volume and volume mount definitions to the SparkApplication objects.
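For illustration, the relevant fragment of the SparkApplication spec we generate looks roughly like this (expressed as a Python dict to keep all sketches in one language; the PVC claim names are placeholders):

```python
# Sketch of the volumes / volumeMounts we add to the SparkApplication spec.
# Claim names "pvc-a" / "pvc-b" are placeholders; the mapping A -> /a,
# B -> /b matches what is described above.
spark_app_spec_fragment = {
    "volumes": [
        {"name": "vol-a", "persistentVolumeClaim": {"claimName": "pvc-a"}},
        {"name": "vol-b", "persistentVolumeClaim": {"claimName": "pvc-b"}},
    ],
    "driver": {
        "volumeMounts": [
            {"name": "vol-a", "mountPath": "/a"},
            {"name": "vol-b", "mountPath": "/b"},
        ]
    },
    "executor": {
        "volumeMounts": [
            {"name": "vol-a", "mountPath": "/a"},
            {"name": "vol-b", "mountPath": "/b"},
        ]
    },
}
```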
(As a side note, due to some technical constraints inside the Spark operator, when it creates the driver pod it cannot add these volumes and volume mounts to the driver pod spec directly. Instead, it defines a mutating webhook to intercept its own pod creation calls and adds the volumes this way.)
We run roughly 500 of these Spark jobs every night.
What we have been observing for a while now is that, on average, about one of these jobs fails each night (sometimes a few more, in some nights none). When searching for the cause of the failure, we found that the two volumes were mounted into the wrong locations, i.e., PVC A would be mounted to /b and PVC B to /a.
Our first guess was that something with the whole Spark (operator) volume handling was wrong. At first we of course blamed ourselves and added a bit of logging to our own app to see in detail what the SparkApplication objects that we create look like. However, this logging confirmed that we created the SparkApplication objects properly (i.e., A => /a, B => /b). The next thing we considered was that maybe the whole webhook magic of the Spark operator had an issue. So we added a watch for pod creation events to check whether the driver and/or executor pods are created with "flipped" volume mounts by the Spark operator. But the pod definitions were also as they should be.
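Conceptually, that check amounts to something like the following sketch (using the kubernetes Python client; the namespace, label selector and EXPECTED mapping are placeholders rather than our actual values, and the real implementation may differ in detail):

```python
"""Sketch: watch pod creation and verify that each mount path is backed
by the expected PVC claim."""
from kubernetes import client, config, watch

EXPECTED = {"/a": "pvc-a", "/b": "pvc-b"}  # mount path -> expected claim (placeholders)

def claim_by_mount_path(pod):
    # Map volume name -> PVC claim name, then mount path -> claim name.
    claims = {v.name: v.persistent_volume_claim.claim_name
              for v in (pod.spec.volumes or [])
              if v.persistent_volume_claim}
    mapping = {}
    for c in pod.spec.containers:
        for vm in (c.volume_mounts or []):
            if vm.name in claims:
                mapping[vm.mount_path] = claims[vm.name]
    return mapping

def main():
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    for event in watch.Watch().stream(v1.list_namespaced_pod,
                                      namespace="spark",
                                      label_selector="spark-role"):
        if event["type"] != "ADDED":
            continue
        pod = event["object"]
        actual = claim_by_mount_path(pod)
        for path, claim in EXPECTED.items():
            if actual.get(path) not in (None, claim):
                print(f"{pod.metadata.name}: {path} -> {actual[path]}, expected {claim}")

if __name__ == "__main__":
    main()
```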
Then we noticed that the issues started around the time when we switched from the old EFS provisioner to the EFS CSI driver.
So we started looking at the logs of the efs-csi-controller and efs-csi-node pods. The controllers looked unsuspicious (they barely log anything), but we found a lot of error messages in the efs-csi-node pods. Interestingly, these error messages tend to happen more or less all the time, not only when the pods with the "mismounted" volumes are starting. We also see them on efs-csi-node pods on nodes where the Spark jobs are not running (we have a dedicated node group exclusively for the Spark pods). Alas, we couldn't make much sense of the logs; below is a pretty representative excerpt:
Could anyone provide any indication of what could be going on here?
Is this a known issue? I searched through the current issues but couldn't find one, which comes as not too big a surprise, since having more than one mount, and having these mounted into pods that are launched many times a day, is probably a very special use case.
Any indications on how to proceed with finding the root cause? Can we, e.g., increase the log level to see each and every mount operation in more detail?
Regards
J