CSI driver causes EFS filesystems to mount to wrong mountpoint #282

Closed
jrtcppv opened this issue Nov 17, 2020 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments


jrtcppv commented Nov 17, 2020

/kind bug

We use a sharded EFS setup in which multiple EFS filesystems are mounted into one pod. Files are distributed among the shards randomly so that no single filesystem becomes an I/O bottleneck. We have found that, randomly, some pods come up with a different EFS share mounted than what the pod spec specifies. To check for this we created a file on each filesystem named after its correct mount point: the filesystem that should be mounted at /upload001 contains a file called upload001, so ls /upload001/upload001 should always succeed. What we found was that occasionally that file does not exist, and a different filesystem is indeed mounted where it shouldn't be. See the printout below:

transcoder@devastator:~/tator$ for i in {000..007}; do kubectl exec gunicorn-deployment-5bd59d6c8-8jc29 ls /upload$i/upload$i; done
/upload000/upload000
/upload001/upload001
/upload002/upload002
ls: cannot access '/upload003/upload003': No such file or directory
command terminated with exit code 2
/upload004/upload004
/upload005/upload005
/upload006/upload006
/upload007/upload007
transcoder@devastator:~/tator$ for i in {000..007}; do kubectl exec gunicorn-deployment-5bd59d6c8-gz957 ls /upload$i/upload$i; done
/upload000/upload000
/upload001/upload001
/upload002/upload002
/upload003/upload003
/upload004/upload004
/upload005/upload005
/upload006/upload006
/upload007/upload007

For the second pod, every shard lists its test file as expected; in the first pod, /upload003 does not. Note that these pods have identical specs, as they are created by the same deployment. This never happened before we switched to the CSI driver, and we only noticed it because we were getting random 404 errors in our nginx deployment.

I am not 100% sure how to reproduce this. We have seen it in multiple deployments (nginx and gunicorn), and we map about 17 filesystems to various mount points in each pod. My guess is that if you create a similar scenario, then delete pods and check them in the way outlined above, you will eventually see the mapping fail; a rough reproduction sketch follows.
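In case it helps, here is a rough sketch of how such a scenario could be stood up with static provisioning. This is not our exact manifest; the filesystem IDs, PV/PVC names, and storage class name below are placeholders.

#!/usr/bin/env bash
# Sketch: one statically provisioned PV/PVC per EFS filesystem, all mounted
# into a single Deployment. The filesystem IDs are hypothetical placeholders.
FS_IDS=(fs-00000000 fs-11111111 fs-22222222)   # one entry per shard

for i in "${!FS_IDS[@]}"; do
  idx=$(printf '%03d' "$i")
  cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: upload${idx}-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes: [ReadWriteMany]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.amazonaws.com
    volumeHandle: ${FS_IDS[$i]}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: upload${idx}-pvc
spec:
  accessModes: [ReadWriteMany]
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
EOF
done

# Each PVC is then mounted at /upload${idx} in the pod template; deleting pods
# in a loop and re-running the ls check above should eventually hit the mismatch.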

I want to be clear that this is a serious issue. One of our cronjobs periodically garbage-collects files from our EFS shards: any file that does not have a corresponding database object referencing its EFS path is deleted. If we had been unlucky and this had occurred in the pod performing that cleanup, it would have deleted every file in the improperly mounted filesystem.
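For context, the cleanup logic is roughly shaped like the following hypothetical sketch (the real job queries our database; known_paths.txt here stands in for the set of paths that have a database object):

# Hypothetical sketch of the garbage-collection logic described above.
MOUNT=/upload003                      # the shard this job is cleaning
while IFS= read -r -d '' f; do
  if ! grep -qxF "$f" known_paths.txt; then
    rm -f -- "$f"                     # delete anything the database doesn't know about
  fi
done < <(find "$MOUNT" -type f -print0)

# If the upload005 filesystem is silently mounted at /upload003, every upload005
# file fails the database check (its recorded path starts with /upload005) and
# gets deleted.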

  • Kubernetes version (use kubectl version): 1.17.9-eks-4c6976
  • Driver version: 1.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 17, 2020

wongma7 commented Nov 17, 2020

Want to make sure I understand: in the first example, instead of seeing /upload003/upload003, that Pod saw one of /upload003/upload00{0,1,2,4,5,6,7}, i.e. another shard's filesystem mounted at /upload003?

ls: cannot access '/upload003/upload003': No such file or directory
command terminated with exit code 2

Could you share logs of the CSI driver from the node that such a Pod was scheduled to? The script at /~https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting can help gather logs. If you are worried about sensitive info or the quantity of logs, feel free to send them over email or Slack.
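For example, something like the following should grab the node-side driver logs; the label and container names assume the default manifest install (an efs-csi-node daemonset in kube-system with an efs-plugin container) and may differ in your cluster:

# Find the node the affected pod ran on, then dump the driver logs from the
# efs-csi-node pod on that node.
POD=gunicorn-deployment-5bd59d6c8-8jc29
NODE=$(kubectl get pod "$POD" -o jsonpath='{.spec.nodeName}')
CSI_POD=$(kubectl get pods -n kube-system -l app=efs-csi-node \
  --field-selector spec.nodeName="$NODE" -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n kube-system "$CSI_POD" -c efs-plugin > efs-plugin.log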


jrtcppv commented Nov 17, 2020

Okay, I just sent you the logs via email.

Your description of the issue is correct: /upload003 actually contained /upload003/upload005, i.e. the upload005 filesystem was mounted at /upload003. We verified this by checking that both mount points (/upload003 and /upload005) contained the same number of files, and that a file with a UUID filename existed in both.
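The checks were along these lines (the UUID prefix below is just illustrative):

POD=gunicorn-deployment-5bd59d6c8-8jc29

# Same file count in both mount points.
kubectl exec "$POD" -- sh -c 'ls /upload003 | wc -l; ls /upload005 | wc -l'

# The same UUID-named file shows up under both paths.
kubectl exec "$POD" -- sh -c 'ls /upload003/9f8c2d4e* /upload005/9f8c2d4e*'

# /proc/mounts also shows which NFS target each mount point actually points at.
kubectl exec "$POD" -- sh -c 'grep -E "upload00[35]" /proc/mounts'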

@shankerbalan

We are seeing a similar issue as well. We do ~10 deployments daily and each pod has 2 EFS mounts. In the past month, we have seen the mount order reversed at least 3 times.

It is really hard to reproduce. We have updated the AMIs on all the EKS clusters to see if that resolves the issue, and will post logs the next time we hit it.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 8, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
