Fix race condition in store result map #107


Contributor

@Ezetowers Ezetowers commented Feb 12, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes a race condition in the Store struct located in the resultstore package

Which issue(s) this PR fixes:

There is no issue opened for this bug

Special notes for your reviewer:

I've been using the simulator to test different scheduler configurations starting from an existing cluster state. To do that, I mock the cluster's state in the simulator by creating nodes equivalent to those in the original cluster and forcing pods onto the same nodes. Since my clusters have a lot of nodes and pods, I wrote some scripts to load the existing cluster, and while doing so I found a race condition.

Steps to reproduce

  • Start the simulator (both the bare-metal and docker-compose versions should work)
  • Create a huge number of pods in a short period of time. This script can be used to do that (https://gist.github.com/Ezetowers/35c2166f74d234710fe34048e09a3c21), or import a previously exported simulation with a huge number of nodes
  • Repeat the previous steps until the following error appears
fatal error: concurrent map read and map write

goroutine 5255 [running]:
runtime.throw({0x2774833, 0x20ef145})
        /opt/go/src/runtime/panic.go:1198 +0x71 fp=0xc004bd0b08 sp=0xc004bd0ad8 pc=0x437731
runtime.mapaccess2_faststr(0x23ce960, 0xc00bef6e10, {0xc008ec8b30, 0xf})
        /opt/go/src/runtime/map_faststr.go:116 +0x3d4 fp=0xc004bd0b70 sp=0xc004bd0b08 pc=0x414854
github.com/kubernetes-sigs/kube-scheduler-simulator/scheduler/plugin/resultstore.(*Store).AddFilterResult(0xc0082afb90, {0xc00bef6e10, 0xc007c35800}, {0xc00bef6e00, 0x7f14c98d43d8}, {0xc00aa0f924, 0x6}, {0x27526ec, 0x12}, {0x273e564, ...})
        /home/etorres/Development/kube-scheduler-simulator/scheduler/plugin/resultstore/store.go:175 +0x13c fp=0xc004bd0bf8 sp=0xc004bd0b70 pc=0x216631c
github.com/kubernetes-sigs/kube-scheduler-simulator/scheduler/plugin.(*simulatorPlugin).Filter(0xc001e34be0, {0x2bb2f90, 0xc0063ebf80}, 0xc0069dcdb0, 0xc007c35800, 0xc0097ff440)
        /home/etorres/Development/kube-scheduler-simulator/scheduler/plugin/plugins.go:323 +0x125 fp=0xc004bd0c88 sp=0xc004bd0bf8 pc=0x21694e5
k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).runFilterPlugin(0x8edccf, {0x2bb2f90, 0xc0063ebf80}, {0x7f14c1cff378, 0xc001e34be0}, 0x0, 0xc004bd0d18, 0xc004bd0d70)
        /home/etorres/go/pkg/mod/k8s.io/kubernetes@v1.22.0/pkg/scheduler/framework/runtime/framework.go:597 +0x167 fp=0xc004bd0d10 sp=0xc004bd0c88 pc=0x2105f47
k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).RunFilterPlugins(0xc00646f340, {0x2bb2f90, 0xc0063ebf80}, 0xc00646f340, 0xc007c35800, 0x4a1b1e)
        /home/etorres/go/pkg/mod/k8s.io/kubernetes@v1.22.0/pkg/scheduler/framework/runtime/framework.go:575 +0xf6 fp=0xc004bd0e08 sp=0xc004bd0d10 pc=0x2105a36
k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).RunFilterPluginsWithNominatedPods(0x7f14f2dcfd28, {0x2bb2f90, 0xc0063ebf80}, 0xc00683de60, 0xc00506a528, 0xc0097ff440)
        /home/etorres/go/pkg/mod/k8s.io/kubernetes@v1.22.0/pkg/scheduler/framework/runtime/framework.go:683 +0x128 fp=0xc004bd0e98 sp=0xc004bd0e08 pc=0x21068e8
k8s.io/kubernetes/pkg/scheduler.(*genericScheduler).findNodesThatPassFilters.func1(0x0)
        /home/etorres/go/pkg/mod/k8s.io/kubernetes@v1.22.0/pkg/scheduler/generic_scheduler.go:296 +0xd1 fp=0xc004bd0f38 sp=0xc004bd0e98 pc=0x215d771
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
        /home/etorres/go/pkg/mod/k8s.io/client-go@v0.22.0/util/workqueue/parallelizer.go:90 +0x150 fp=0xc004bd0fe0 sp=0xc004bd0f38 pc=0x1147c70
runtime.goexit()
        /opt/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc004bd0fe8 sp=0xc004bd0fe0 pc=0x46a9c1
created by k8s.io/client-go/util/workqueue.ParallelizeUntil
        /home/etorres/go/pkg/mod/k8s.io/client-go@v0.22.0/util/workqueue/parallelizer.go:76 +0x1dc

/label tide/merge-method-squash

* Add mutexes to every read/write access the store makes
  to the results map
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. labels Feb 12, 2022
@linux-foundation-easycla

linux-foundation-easycla bot commented Feb 12, 2022

CLA Signed

The committers are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Feb 12, 2022
@k8s-ci-robot
Contributor

Welcome @Ezetowers!

It looks like this is your first PR to kubernetes-sigs/kube-scheduler-simulator 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kube-scheduler-simulator has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @Ezetowers. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 12, 2022
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 12, 2022
@Ezetowers
Contributor Author

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 12, 2022
@Ezetowers Ezetowers marked this pull request as ready for review February 12, 2022 19:00
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 12, 2022
@k8s-ci-robot k8s-ci-robot requested a review from adtac February 12, 2022 19:01
@Ezetowers Ezetowers changed the title Fix race condition if store result map Fix race condition in store result map Feb 12, 2022
@sanposhiho
Member

/ok-to-test
/assign

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 13, 2022
Member

@sanposhiho sanposhiho left a comment


Hi @Ezetowers, thanks for creating the PR.
Overall looks good. Left one small comment.

scheduler/plugin/resultstore/store.go
Co-authored-by: Kensei Nakada <handbomusic@gmail.com>
@Ezetowers
Contributor Author

Ezetowers commented Feb 13, 2022

> Hi @Ezetowers, thanks for creating the PR. Overall looks good. Left one small comment.

Hi @sanposhiho, thanks for the quick response! The comment has been added.

Member

@sanposhiho sanposhiho left a comment


Thanks 👍

/lgtm
/approve


Btw, just for your information, we have the KUBE_API_PORT and KUBE_API_HOST environment variables to fix the address of the internal kube-apiserver. (This is not documented yet.)
With them, you can communicate directly with the internal kube-apiserver via kubectl or client-go to create a lot of resources in the simulator. It may be helpful for your simulations.

https://github.com/kubernetes-sigs/kube-scheduler-simulator/blob/master/config/config.go#L79

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ezetowers, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2022
@k8s-ci-robot k8s-ci-robot merged commit 693f4a6 into kubernetes-sigs:master Feb 14, 2022
@Ezetowers Ezetowers deleted the race-condition-during-high-load-store branch September 23, 2022 23:59
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
3 participants