CSI driver causes EFS filesystems to mount to wrong mountpoint #282

Closed
jrtcppv opened this issue Nov 17, 2020 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments


jrtcppv commented Nov 17, 2020

/kind bug

We use a sharded EFS setup in which multiple EFS filesystems are mounted into one pod. Files are distributed among the shards randomly so that no single filesystem becomes an I/O bottleneck. We have found that, randomly, some pods come up with a different EFS share mounted than what the pod spec specifies. To check for this we created a file on each filesystem named after its correct mount point: the filesystem that should be mounted at /upload001 contains a file called upload001, so ls /upload001/upload001 should always succeed. What we found was that occasionally that file does not exist, and a different filesystem is indeed mounted where it shouldn't be. See the printout below:

transcoder@devastator:~/tator$ for i in {000..007}; do kubectl exec gunicorn-deployment-5bd59d6c8-8jc29 ls /upload$i/upload$i; done
/upload000/upload000
/upload001/upload001
/upload002/upload002
ls: cannot access '/upload003/upload003': No such file or directory
command terminated with exit code 2
/upload004/upload004
/upload005/upload005
/upload006/upload006
/upload007/upload007
transcoder@devastator:~/tator$ for i in {000..007}; do kubectl exec gunicorn-deployment-5bd59d6c8-gz957 ls /upload$i/upload$i; done
/upload000/upload000
/upload001/upload001
/upload002/upload002
/upload003/upload003
/upload004/upload004
/upload005/upload005
/upload006/upload006
/upload007/upload007

For the second pod, every shard lists its test file as expected; in the first pod, /upload003 does not. Note that these pods have identical specs, as they are created by the same deployment. This never happened before we switched to the CSI driver, and we only noticed it because we were getting random 404 errors in our nginx deployment.

I am not 100% sure how to reproduce this. We have seen it in multiple deployments (nginx and gunicorn), and we map about 17 filesystems to various mount points in each pod. My guess is that if you create a similar scenario, then delete pods and check them in the way outlined above, you will eventually see the mapping fail; a rough reproduction sketch follows.
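In case it helps, here is a rough sketch of how such a scenario could be stood up with static provisioning. This is not our exact manifest; the filesystem IDs, PV/PVC names, and storage class name below are placeholders.

#!/usr/bin/env bash
# Sketch: one statically provisioned PV/PVC per EFS filesystem, all mounted
# into a single Deployment. The filesystem IDs are hypothetical placeholders.
FS_IDS=(fs-00000000 fs-11111111 fs-22222222)   # one entry per shard

for i in "${!FS_IDS[@]}"; do
  idx=$(printf '%03d' "$i")
  cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: upload${idx}-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes: [ReadWriteMany]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.amazonaws.com
    volumeHandle: ${FS_IDS[$i]}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: upload${idx}-pvc
spec:
  accessModes: [ReadWriteMany]
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
EOF
done

# Each PVC is then mounted at /upload${idx} in the pod template; deleting pods
# in a loop and re-running the ls check above should eventually hit the mismatch.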

I want to be clear that this is a serious issue. One of our cronjobs periodically garbage-collects files from our EFS shards: any file that does not have a corresponding database object referencing its EFS path is deleted. If we had been unlucky and this had occurred in the pod performing that cleanup, it would have deleted every file in the improperly mounted filesystem.
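For context, the cleanup logic is roughly shaped like the following hypothetical sketch (the real job queries our database; known_paths.txt here stands in for the set of paths that have a database object):

# Hypothetical sketch of the garbage-collection logic described above.
MOUNT=/upload003                      # the shard this job is cleaning
while IFS= read -r -d '' f; do
  if ! grep -qxF "$f" known_paths.txt; then
    rm -f -- "$f"                     # delete anything the database doesn't know about
  fi
done < <(find "$MOUNT" -type f -print0)

# If the upload005 filesystem is silently mounted at /upload003, every upload005
# file fails the database check (its recorded path starts with /upload005) and
# gets deleted.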

  • Kubernetes version (use kubectl version): 1.17.9-eks-4c6976
  • Driver version: 1.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 17, 2020

wongma7 commented Nov 17, 2020

Want to make sure I understand: in the first example, instead of seeing /upload003/upload003, that Pod saw one of /upload003/upload00{0,1,2,4,5,6,7}, i.e. another shard's filesystem mounted at /upload003?

ls: cannot access '/upload003/upload003': No such file or directory
command terminated with exit code 2

Could you share logs of the CSI driver from the node that such a Pod was scheduled to? The script at /~https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting can help gather logs. If you are worried about sensitive info or the quantity of logs, feel free to send them over email or Slack.
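For example, something like the following should grab the node-side driver logs; the label and container names assume the default manifest install (an efs-csi-node daemonset in kube-system with an efs-plugin container) and may differ in your cluster:

# Find the node the affected pod ran on, then dump the driver logs from the
# efs-csi-node pod on that node.
POD=gunicorn-deployment-5bd59d6c8-8jc29
NODE=$(kubectl get pod "$POD" -o jsonpath='{.spec.nodeName}')
CSI_POD=$(kubectl get pods -n kube-system -l app=efs-csi-node \
  --field-selector spec.nodeName="$NODE" -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n kube-system "$CSI_POD" -c efs-plugin > efs-plugin.log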


jrtcppv commented Nov 17, 2020

Okay, I just sent you the logs via email.

Your description of the issue is correct: /upload003 actually contained /upload003/upload005, i.e. the upload005 filesystem was mounted at /upload003. We verified this by checking that both mount points (/upload003 and /upload005) contained the same number of files, and that a file with a UUID filename existed in both.
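The checks were along these lines (the UUID prefix below is just illustrative):

POD=gunicorn-deployment-5bd59d6c8-8jc29

# Same file count in both mount points.
kubectl exec "$POD" -- sh -c 'ls /upload003 | wc -l; ls /upload005 | wc -l'

# The same UUID-named file shows up under both paths.
kubectl exec "$POD" -- sh -c 'ls /upload003/9f8c2d4e* /upload005/9f8c2d4e*'

# /proc/mounts also shows which NFS target each mount point actually points at.
kubectl exec "$POD" -- sh -c 'grep -E "upload00[35]" /proc/mounts'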

@shankerbalan

We are seeing a similar issue as well. We do ~10 deployments daily and each pod has 2 EFS mounts. In the past month, we have seen the mount order reversed at least 3 times.

It is really hard to reproduce. We have updated the AMIs on all the EKS clusters to see if that resolves the issue, and will post logs the next time we hit it.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 8, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
