
Re-design equivalence class cache to two level cache #65714

Merged (2 commits) on Jul 19, 2018

Conversation

resouer
Contributor

@resouer resouer commented Jul 2, 2018

What this PR does / why we need it:

The current ecache introduces a global lock across all nodes; this patch assigns an ecache per node to eliminate that global lock. The improvements in scheduling performance and throughput are both significant.

CPU Profile Result

Machine: 32-core 60GB GCE VM

1k nodes / 10k pods bench test (the critical function is highlighted in the attached flame graphs):

  1. Current default scheduler with ecache enabled:
    (flame graph: equivalence class cache bench test 001)
  2. Current default scheduler with ecache disabled:
    (flame graph: equivalence class cache bench test 002)
  3. Current default scheduler with this patch and ecache enabled:
    (flame graph: equivalence class cache bench test 003)

Throughput Test Result

1k nodes 3k pods scheduler_perf test:

Current default scheduler, ecache is disabled:

Minimal observed throughput for 3k pod test: 200
PASS
ok      k8s.io/kubernetes/test/integration/scheduler_perf    30.091s

With this patch, ecache is enabled:

Minimal observed throughput for 3k pod test: 556
PASS
ok      k8s.io/kubernetes/test/integration/scheduler_perf    11.119s

Design and implementation:

The idea: we re-designed the ecache as a "two-level cache".

The first-level cache holds the global lock across nodes; it needs to be synchronized only when a node is added or deleted, which happens at a much lower frequency.

The second-level cache is assigned per node and its lock is restricted to that node, so there is no need to take the global lock during the whole predicate cycle, as sketched below. For more detail, please check the original discussion.
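
A minimal sketch of this layout in Go (simplified and illustrative; the actual types in this PR carry more fields, the predicate result structures, and the invalidation methods):

```go
package equivalence

import "sync"

// Cache is the first level: one entry per node. Its lock is taken only when
// nodes are added or removed, or when a per-node cache is looked up at the
// start of a scheduling cycle.
type Cache struct {
	mu          sync.RWMutex
	nodeToCache map[string]*NodeCache
}

// NodeCache is the second level: cached predicate results for a single node,
// guarded by its own lock, so predicates running against different nodes do
// not contend with each other.
type NodeCache struct {
	mu    sync.RWMutex
	cache map[string]map[uint64]bool // predicateKey -> equivalence hash -> fit
}

// NewCache returns an empty two-level cache.
func NewCache() *Cache {
	return &Cache{nodeToCache: make(map[string]*NodeCache)}
}

// lookupResult is the hot read path during predicate evaluation; it touches
// only the per-node lock.
func (n *NodeCache) lookupResult(predicateKey string, equivHash uint64) (fit, ok bool) {
	n.mu.RLock()
	defer n.mu.RUnlock()
	fit, ok = n.cache[predicateKey][equivHash]
	return fit, ok
}
```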

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #63784

Special notes for your reviewer:

Tagged as WIP to make sure this does not break existing code and tests; we can start review once CI is happy.

Release note:

Re-design equivalence class cache to two level cache

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 2, 2018
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jul 2, 2018
@resouer resouer force-pushed the fix-63784 branch 2 times, most recently from ec39e38 to 014d756 Compare July 3, 2018 07:40
@@ -51,7 +51,7 @@ type schedulerCache struct {
period time.Duration

// This mutex guards all fields within this cache struct.
mu sync.Mutex
mu sync.RWMutex
Contributor Author
@resouer resouer Jul 3, 2018

Comment for reviewer: this change is to improve cache.IsUpToDate


Did we forget to approve/merge the PR where someone else made this change? /=

Contributor Author

Nope, that previous PR only improved ecache, not cache :D

Contributor

So, the way I see it, we are making two changes as part of this PR:

  • Changing the scheduler cache's lock from Mutex to RWMutex
  • Changing the ecache to be a two-level cache.

So the flame graphs attached could be the result of both changes? Am I missing something?

Contributor Author
@resouer resouer Jul 11, 2018

The scheduler cache refactoring is just a bonus :D The performance improvement comes from the two-level cache, which is the core idea of this patch.

@resouer resouer changed the title [WIP] Re-design equivalence class cache to two level cache Re-design equivalence class cache to two level cache Jul 3, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 3, 2018
@resouer
Contributor Author

resouer commented Jul 3, 2018

cc @bsalamat @misterikkit Seems CI is happy now, please take a look :D

@misterikkit misterikkit left a comment

If I understand correctly, we want to avoid taking the top-level RLock() multiple times when running various predicates for one node. If that's the case, I think we should structure the equivalence pkg API around that (which is basically what you've done). I just think the names of things could better reflect how we expect users to use them. For example, how about this?

eCache := equivalence.NewCache()
class := equivalence.NewClass(pod)
nodeCache := eCache.ForNode(nodeName)
fit, reasons, err := nodeCache.RunPredicate(p, pod, class, etc...)

The type NodeCache would have only one exported function and no exported fields (maybe it's an interface).

AddNode is done lazily by ForNode.
RemoveNode could keep its old name of InvalidateAllPredicatesOnNode.

// TopLevelEquivCache is a thread safe nodeMap
// NOTE(harry): Theoretically sync.Map has better performance in machine with 8+ CPUs, while
// the reality is lock contention in first level cache is rare.
type TopLevelEquivCache struct {


The exported symbols from this package should be "Cache" and "NewCache". The top level struct/constructor should take those names, while we rename the old ones.


Side note: To avoid "stutter" we should not put "equiv" or "equivalence" in exported symbols of this package.

Contributor Author

And for NodeCache, we need to pass this type as a parameter in generic_scheduler.

@@ -33,6 +33,152 @@ import (
"github.com/golang/glog"
)

// nodeMap is type for a map of {nodeName: *Cache}


This comment is mostly redundant. We know it's a type, a map, and that the value is *Cache. How about something like, "nodeMap stores a *Cache for each node."

Contributor Author

Fixed, thanks!

// GetOrCreateCache returns the second level Cache for given node name.
// If the cache does not exist, create it.
// This should be called in generic scheduler before predicate phase begins for given node.
func (n *TopLevelEquivCache) GetOrCreateCache(name string) *Cache {


Does the API surface of this package need to change for this improvement? RunPredicate had the side effect of creating new cache items if needed.

I'm ok with improvements to the eCache API, but this seems like it will put more management work on the generic scheduler.

Contributor Author
@resouer resouer Jul 5, 2018

As RunPredicate is now handled only by NodeCache, the first-level cache has to be managed by the generic scheduler.

@@ -41,14 +187,14 @@ import (
// class". (Equivalence class is defined in the `Class` type.) Saved results
// will be reused until an appropriate invalidation function is called.
type Cache struct {
mu sync.RWMutex
cache nodeMap
sync.RWMutex


Why remove the 'mu' here? By embedding the type, people outside this package can start calling (*Cache).Lock(), which I don't think is intended.
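
To make the concern concrete, here is a standalone example (not the PR's code): embedding sync.RWMutex promotes Lock/Unlock into the exported type's method set, whereas an unexported mu field keeps the lock an internal detail.

```go
package main

import "sync"

// Embedded: the RWMutex's methods become part of Cache's exported method set,
// so any code holding a *Cache can lock it directly.
type Cache struct {
	sync.RWMutex
	items map[string]bool
}

// Unexported field: the lock cannot be touched from outside this package.
type hiddenLockCache struct {
	mu    sync.RWMutex
	items map[string]bool
}

func main() {
	c := &Cache{items: map[string]bool{}}
	c.Lock() // compiles because the embedded lock is promoted
	c.items["x"] = true
	c.Unlock()
}
```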

@@ -355,21 +355,27 @@ func (g *genericScheduler) findNodesThatFit(pod *v1.Pod, nodes []*v1.Node) ([]*v
// We can use the same metadata producer for all nodes.
meta := g.predicateMetaProducer(pod, g.cachedNodeInfoMap)

var equivClass *equivalence.Class
var (


A single var can be on a single line.

Contributor Author

Done! Also, refactored functions to new names :-) Please review.

@resouer resouer force-pushed the fix-63784 branch 2 times, most recently from e36cdd4 to f505798 Compare July 5, 2018 16:00
@misterikkit misterikkit left a comment

This looks great! I left mostly style/naming comments, and they don't all need to be addressed. There are a couple API/threading questions though.


// nodeMap stores a *Cache for each node.
type nodeMap map[string]*NodeCache

// Cache is a thread safe nodeMap


Because the Cache type is the main entry point of this package, I think we should keep a lot of the previous type comment with it, i.e. the fact that it saves and reuses predicate results, and the explanation of the node/predicate/class hierarchy. (Those comments were moved to the NodeCache type, but I think it's okay if we repeat ourselves a bit.)

Also, this is the second significant API change I've seen that is driven by performance needs. I think we should add to the comment on this type that users are expected to precompute the equivalence.Class value and pre-lookup the NodeCache within one scheduling cycle.

Contributor Author

Good point, I will clarify these with more comments.

// ForNode returns the existing NodeCache for given node if present. Otherwise,
// it creates the NodeCache and returns it.
// The boolean flag is true if the value was loaded, false if created.
func (n *Cache) ForNode(name string) (nodeCache *NodeCache, exists bool) {


It's weird for the receiver variable to be n. Could you change it to c?

}

// removeCachedPreds deletes cached predicates by given keys.
func (c *NodeCache) removeCachedPreds(predicateKeys sets.String) {


Should this just be called invalidatePreds?

}

// removeCachedPreds deletes cached predicates by given keys.
func (c *NodeCache) removeCachedPreds(predicateKeys sets.String) {


Probably we should move this function down to the other NodeCache receivers.

return &Cache{
cache: make(nodeMap),
// newCache returns an empty NodeCache.
func newCache() *NodeCache {


Let's call this newNodeCache to avoid confusion with NewCache

@@ -97,7 +218,7 @@ type predicateResult struct {
// run and its results cached for the next call.
//
// NOTE: RunPredicate will not update the equivalence cache if the given NodeInfo is stale.
func (c *Cache) RunPredicate(
func (c *NodeCache) RunPredicate(


Similar to above, receiver variable should probably change from c to n.

Contributor Author

Fixed, thanks for catching these nits.

@@ -126,16 +247,15 @@ func (c *Cache) RunPredicate(
}

// updateResult updates the cached result of a predicate.
func (c *Cache) updateResult(
// This function is thread safe for second level Cache, no need to sync with top level cache.


All operations on NodeCache objects are thread-safe in the context of that NodeCache object, right? It might be worth mentioning that in the type's comment.

Contributor Author

Done!

// ForNode returns the existing NodeCache for given node if present. Otherwise,
// it creates the NodeCache and returns it.
// The boolean flag is true if the value was loaded, false if created.
func (n *Cache) ForNode(name string) (nodeCache *NodeCache, exists bool) {


What should callers do differently depending on the boolean return?

Contributor Author
@resouer resouer Jul 10, 2018

I would like to expose an indication of what ForNode() did, since the call has mixed duties. This is only used by tests for now.

@resouer
Contributor Author

resouer commented Jul 10, 2018

/test pull-kubernetes-e2e-kops-aws

@misterikkit misterikkit left a comment

/lgtm

(I still don't have approver powers, btw)

}

// NewCache returns an empty Cache.
// NewCache create am empty equiv class cache.


s/create am/creates an/

c.nodeToCache[name] = newNodeCache()
nodeCache = c.nodeToCache[name]
}
return


I'm still new to golang - what's the general style for naked returns? Should we be redundant and say return nodeCache, exists here?

Contributor Author

I believe this is one of the benefits of named return values, i.e. there is no need to refactor the return line in a future change.
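
For context, a tiny generic illustration of the naked-return style being discussed (not this package's code): with named results, the bare return always yields the current values, so later edits to the body do not have to touch the return statement.

```go
package main

import "fmt"

// lookup declares named results; the bare "return" yields whatever
// value and found currently hold.
func lookup(m map[string]int, key string) (value int, found bool) {
	value, found = m[key]
	if !found {
		value = -1 // hypothetical default for a miss
	}
	return
}

func main() {
	m := map[string]int{"a": 1}
	fmt.Println(lookup(m, "a")) // 1 true
	fmt.Println(lookup(m, "b")) // -1 false
}
```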


// InvalidateCachedPredicateItemForPodAdd is a wrapper of
// InvalidateCachedPredicateItem for pod add case
// TODO: This does not belong with the equivalence cache implementatioc.


s/implementatioc/implementation/

}
}
// predKeySet returns all cached predicate keys.
func (n *NodeCache) predKeySet() sets.String {


Is this function used anywhere? Can we delete it?

Contributor Author

Oh, a legacy one.

mockCache := &syncingMockCache{
Cache: cache,
cycleStart: make(chan struct{}),
cacheInvalidated: make(chan struct{}),
}

eCache := equivalence.NewCache()
// ForNode will lazily create NodeCache for given test node.


Is this needed as part of this test?

Contributor Author

Oh, no need.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 10, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 11, 2018
// it creates the NodeCache and returns it.
// The boolean flag is true if the value was loaded, false if created.
func (c *Cache) ForNode(name string) (nodeCache *NodeCache, exists bool) {
c.mu.RLock()
Contributor

Shouldn't we do c.mu.Lock() and c.mu.Unlock(), since we are adding a new entry when one doesn't exist?
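
For reference, one way the lazy-create path can stay correct under the RWMutex is the read-then-recheck pattern; the sketch below is hypothetical (it reuses the nodeToCache field and newNodeCache helper visible in the diff) and is not necessarily how the PR resolved this:

```go
// ForNode takes only the read lock on the common hit path. Creating a missing
// entry requires the write lock, plus a re-check after acquiring it, because
// another goroutine may have inserted the entry between RUnlock and Lock.
func (c *Cache) ForNode(name string) (nodeCache *NodeCache, exists bool) {
	c.mu.RLock()
	nodeCache, exists = c.nodeToCache[name]
	c.mu.RUnlock()
	if exists {
		return nodeCache, true
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	if nodeCache, exists = c.nodeToCache[name]; exists {
		return nodeCache, true
	}
	nodeCache = newNodeCache()
	c.nodeToCache[name] = nodeCache
	return nodeCache, false
}
```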

}

// InvalidatePredicatesOnNode clears cached results for the given predicates on one node.
func (c *Cache) InvalidatePredicatesOnNode(nodeName string, predicateKeys sets.String) {
Contributor

You can probably combine this function and the one above.

Contributor Author

I will keep them separate, as separate calls are needed.

@resouer resouer force-pushed the fix-63784 branch 2 times, most recently from 1cb63e9 to 153045e Compare July 11, 2018 15:34
Contributor
@ravisantoshgudimetla ravisantoshgudimetla left a comment

@resouer Thanks. The changes LGTM. I am just curious about the RWMutex scaling issue. Can you try a couple of things:

  • Schedule pods on a bigger node where GOMAXPROCS is higher (>8).
  • Schedule more pods on more nodes, e.g. 30k pods on 5k nodes.

I want to know if we are going to hit any of the scaling issues mentioned for RWMutex.
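
A hypothetical micro-benchmark (not part of this PR) for probing that concern; running it with increasing -cpu values, e.g. go test -bench=. -cpu=4,8,16,32, shows whether read-side RWMutex contention degrades as GOMAXPROCS grows:

```go
package lockbench

import (
	"sync"
	"testing"
)

// BenchmarkRWMutexRead exercises the read path the caches rely on:
// many goroutines taking RLock/RUnlock concurrently.
func BenchmarkRWMutexRead(b *testing.B) {
	var mu sync.RWMutex
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.RLock()
			mu.RUnlock()
		}
	})
}

// BenchmarkMutex is the plain Mutex baseline for comparison.
func BenchmarkMutex(b *testing.B) {
	var mu sync.Mutex
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			mu.Unlock()
		}
	})
}
```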

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, misterikkit, ravisantoshgudimetla, resouer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 19, 2018
@k8s-ci-robot
Contributor

k8s-ci-robot commented Jul 19, 2018

@resouer: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce-device-plugin-gpu e5a7a4c link /test pull-kubernetes-e2e-gce-device-plugin-gpu

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@resouer
Contributor Author

resouer commented Jul 20, 2018

Thanks Bobby, I am keeping an eye on the test grid.

@resouer
Contributor Author

resouer commented Jul 25, 2018

cc @wojtek-t, you may be interested in this fix as well. The integration test is still happy as of now.

ref: #58222

@bsalamat
Member

@resouer I checked out master today and made sure that your changes are in, but when I run the scheduler benchmarks for 5000 nodes, scheduling 1000 pods, I still see a performance degradation when I enable the equivalence cache: 20ms/pod when ecache is enabled vs. 18ms/pod when ecache is disabled. I enable/disable eCache here. Am I missing anything in running the test?

eCache disabled:

pkg: k8s.io/kubernetes/test/integration/scheduler_perf
BenchmarkScheduling/5000Nodes/1000Pods-12         	    1000	  18598875 ns/op
PASS
ok  	k8s.io/kubernetes/test/integration/scheduler_perf	41.281s
+++ [0725 11:41:51] Cleaning up etcd
+++ [0725 11:41:51] Integration test cleanup complete

eCache enabled:

pkg: k8s.io/kubernetes/test/integration/scheduler_perf
BenchmarkScheduling/5000Nodes/1000Pods-12         	    1000	  20127793 ns/op
PASS
ok  	k8s.io/kubernetes/test/integration/scheduler_perf	43.632s

@resouer
Contributor Author

resouer commented Jul 26, 2018

The test method is right, but I just tested with HEAD@d2fc875489 and it seems to work well:

{nodes: 5000, existingPods: 0, minPods: 10000},

make test-integration WHAT=./test/integration/scheduler_perf KUBE_TEST_ARGS="-run=xxx -bench=.*BenchmarkScheduling.* -cpuprofile cpu-5000n-10000p.out"

ecache=false: 
BenchmarkScheduling/5000Nodes/0Pods-64             10000          50404244 ns/op
BenchmarkSchedulingAntiAffinity/500Nodes/5000Pods-64                 250          18695074 ns/op

ecache=true:
BenchmarkScheduling/5000Nodes/0Pods-64             10000          15933557 ns/op
BenchmarkSchedulingAntiAffinity/500Nodes/5000Pods-64                 250           7225648 ns/op

It's weird. I'm looking into it.

@bsalamat
Member

Hmmm. I don't know what is going on. I checked out the head again and used your exact configuration: {nodes: 5000, existingPods: 0, minPods: 10000}. Still the results are in favor of disabled ecache.

Just to make sure we are using similar config, I use the following command line to run the benchmarks:

make test-integration WHAT=./test/integration/scheduler_perf GOFLAGS="-v=1" KUBE_TEST_VMODULE="''" KUBE_TEST_ARGS="-run=xxx -bench BenchmarkScheduling/5000Nodes/0Pods"

Cache enabled:

goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/test/integration/scheduler_perf
BenchmarkScheduling/5000Nodes/0Pods-12         	   10000	  10227194 ns/op
PASS
ok  	k8s.io/kubernetes/test/integration/scheduler_perf	116.835s

cache disabled:

goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/test/integration/scheduler_perf
BenchmarkScheduling/5000Nodes/0Pods-12         	   10000	   8873203 ns/op
PASS
ok  	k8s.io/kubernetes/test/integration/scheduler_perf	103.355s

@resouer
Contributor Author

resouer commented Jul 27, 2018

Emm, when I rebased to HEAD@e4465b6e2f (should be the latest commit as of yesterday), I can see the problem you describe:

ecache=false: 
BenchmarkScheduling/5000Nodes/0Pods-64             10000          16414234 ns/op

ecache=true:
BenchmarkScheduling/5000Nodes/0Pods-64             10000          15934557 ns/op

Seems like some recent commit "improved" the default scheduler (or at least the bench test) somehow, which is interesting. I am taking a look at this.

[Update] After bisecting the commits, it seems related to 0bf7427 of #66061

// before  0bf7427
ecache=false: 
BenchmarkScheduling/5000Nodes/0Pods-64             10000          57223416 ns/op

// after (with)  0bf7427
ecache=false: 
BenchmarkScheduling/5000Nodes/0Pods-64             10000          16414234 ns/op

@resouer
Contributor Author

resouer commented Jul 29, 2018

@bsalamat, a quick follow-up: with the test command below, everything works as expected in my env:

export KUBE_TEST_ARGS="-run=xxx -bench=BenchmarkScheduling -cpuprofile cpu-5000n-10000p.out"
make test-integration WHAT=./test/integration/scheduler_perf

But with a test command like this, the benchmark test finishes much earlier (which never happened before #66061):

make test-integration WHAT=./test/integration/scheduler_perf KUBE_TEST_ARGS="-run=xxx -bench=BenchmarkScheduling -cpuprofile cpu-5000n-10000p.out"

The difference seems related to how env variables are passed across multiple .sh files.

[Update]: I've reported the issue and this inconsistency will be solved by #66816

@misterikkit

misterikkit commented Jul 30, 2018 via email

@resouer
Contributor Author

resouer commented Aug 1, 2018

@misterikkit @bsalamat Just filed #66862 and #66816 to partially fix this weird case.

The root cause I recently observed is: depending on whether KUBE_TEST_VMODULE="''" is set or not, the test results differ hugely, i.e. with -vmodule='' the benchmark test finishes much earlier than expected.

@bsalamat
Member

bsalamat commented Aug 2, 2018

Thanks, @resouer! As you have noticed, vmodule skews the results significantly. It must be disabled for performance benchmarking, otherwise vmodule takes most of the time. "vmodule" is disabled in our production code, but it is enabled by default when running integration tests, including our benchmarks.

I used the following command to run the benchmarks while vmodule is disabled:

KUBE_TEST_ARGS="-run=xxx -bench=BenchmarkScheduling/5000Nodes/0Pods" GOFLAGS="-v=1" KUBE_TEST_VMODULE="''" make test-integration WHAT=./test/integration/scheduler_perf

I still see a slight performance decrease when ecache is enabled. We should think about what kind of changes we should make to improve eCache's performance. CPU profiling may reveal areas of potential improvement.

Cache disabled

goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/test/integration/scheduler_perf
BenchmarkScheduling/5000Nodes/0Pods-12         	   10000	   9401233 ns/op
PASS
ok  	k8s.io/kubernetes/test/integration/scheduler_perf	108.352s

Cache enabled:

pkg: k8s.io/kubernetes/test/integration/scheduler_perf
BenchmarkScheduling/5000Nodes/0Pods-12         	   10000	  10969937 ns/op
PASS
ok  	k8s.io/kubernetes/test/integration/scheduler_perf	124.011s

@bsalamat
Member

bsalamat commented Aug 2, 2018

I have some good news. The above results are obtained with regular Scheduler benchmark. That benchmark tries scheduling a number of simple pods. The only scheduling requirement of those pods is the amount of resources available on nodes. Checking amount of resources is a simple mathematical operation. That's why we don't see performance improvement when we enable eCache. I ran the AntiAffinity benchmarks with and without eCache. eCache improves performance by over 16x!

These results indicate that eCache is beneficial only for more sophisticated predicates which are computationally expensive. Examples are inter-pod affinity and anti-affinity predicates. What I think we should do is to find a list of predicates which are more computationally expensive and use eCache only for those, instead of using the eCache for all the predicates.

Cache disabled

goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/test/integration/scheduler_perf
BenchmarkSchedulingAntiAffinity/5000Nodes/0Pods-12         	    2000	 287683998 ns/op
PASS
ok  	k8s.io/kubernetes/test/integration/scheduler_perf	589.456s

Cache enabled

goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/test/integration/scheduler_perf
BenchmarkSchedulingAntiAffinity/5000Nodes/0Pods-12         	    2000	  17425632 ns/op
PASS
ok  	k8s.io/kubernetes/test/integration/scheduler_perf	48.989s

@resouer
Contributor Author

resouer commented Aug 3, 2018

AntiAffinity benchmarks with and without eCache. eCache improves performance by over 16x!

@bsalamat, yes, I am aware that the current benchmark test we are playing with is too "simple" :D

Another thing I'd like to add: in the current logic, GenericPredicates are frequently invalidated as a whole by the new-Pod-added event. This should explain why the test results are close.

We should think about what kind of changes we should make to improve eCache's performance

Actually, after #66862 is in, the only remaining cost of the ecache is the map accesses in lookupResult(). Unfortunately, according to CPU profiling, map access in Go, while amortized O(1), is far from free (the key must be hashed), and just as you mentioned, it's slower than a simple mathematical operation.
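
As a rough illustration of that comparison (a hypothetical micro-benchmark, not code from this PR): a map lookup has to hash the key and chase memory, so it costs more than the arithmetic a simple resource-fit check performs.

```go
package ecachebench

import "testing"

var sink bool // keeps the compiler from eliminating the benchmarked work

// BenchmarkMapLookup approximates the ecache hit path: a result lookup
// keyed by an equivalence hash.
func BenchmarkMapLookup(b *testing.B) {
	results := make(map[uint64]bool, 128)
	for i := uint64(0); i < 128; i++ {
		results[i] = true
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = results[uint64(i)&127]
	}
}

// BenchmarkResourceCheck approximates the arithmetic a simple
// PodFitsResources-style predicate performs.
func BenchmarkResourceCheck(b *testing.B) {
	requested, capacity := int64(500), int64(4000)
	for i := 0; i < b.N; i++ {
		sink = requested+int64(i&7) <= capacity
	}
}
```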

What I think we should do is to find a list of predicates which are more computationally expensive and use eCache only for those, instead of using the eCache for all the predicates.

This is exactly what's on my mind. Thanks! I will continue in this direction, though instead of bypassing specific predicates, I would prefer that we measure them in the future.

I was also hoping to invalidate PodFitsResources separately, or skip it in the ecache; that would be a noticeable improvement to the ecache hit rate.

k8s-github-robot pushed a commit that referenced this pull request Aug 22, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: /~https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add misterikkit to sig-scheduling REVIEWERS.

I have met the following criteria.
- member for at least 3 months
- primary reviewer for at least 5 PRs
  - #63603
  - #63665 (and related PRs)
  - #63839
  - #65714
  - #66862
- reviewed or merged at least 20 PRs
  reviewed 13: /~https://github.com/pulls?utf8=%E2%9C%93&q=is%3Apr+archived%3Afalse+is%3Amerged+repo%3Akubernetes%2Fkubernetes+commenter%3Amisterikkit+in%3Acomment+assignee%3Amisterikkit+
  merged 22: /~https://github.com/pulls?utf8=%E2%9C%93&q=is%3Apr+author%3Amisterikkit+archived%3Afalse+is%3Amerged+repo%3Akubernetes%2Fkubernetes+


**Release note**:
```release-note
NONE
```
/cc @bsalamat