Re-design equivalence class cache to two level cache #65714
Conversation
(force-pushed from ec39e38 to 014d756)
@@ -51,7 +51,7 @@ type schedulerCache struct {
period time.Duration

// This mutex guards all fields within this cache struct.
mu sync.Mutex
mu sync.RWMutex
Comment for reviewer: this change is to improve cache.IsUpToDate
Did we forget to approve/merge the PR where someone else made this change? /=
Nope, that previous PR only improved ecache, not cache :D
So, the way I see it, we are making 2 changes as part of this PR:
- Changing scheduler cache's lock from Mutex to RWMutex
- Changing ecache to be 2 level cache.
So the flame graphs attached could be the result of both changes? Am I missing something?
The scheduler cache refactoring is just a bonus :D The performance improvement is brought by the two-level cache, which is the core idea of this patch.
cc @bsalamat @misterikkit Seems CI is happy now, please take a look :D
If I understand correctly, we want to avoid taking the top-level RLock() multiple times when running the various predicates for one node. If that's the case, I think we should structure the equivalence pkg API around that (which is basically what you've done). I just think the names of things could better reflect how we expect users to use them. For example, how about this?
eCache := equivalence.NewCache()
class := equivalence.NewClass(pod)
nodeCache := eCache.ForNode(nodeName)
fit, reasons, err := nodeCache.RunPredicate(p, pod, class, etc...)
The type NodeCache would have only one exported function and no exported fields (maybe it's an interface). AddNode is done lazily by ForNode. RemoveNode could keep its old name of InvalidateAllPredicatesOnNode.
// TopLevelEquivCache is a thread safe nodeMap
// NOTE(harry): Theoretically sync.Map has better performance in machine with 8+ CPUs, while
// the reality is lock contention in first level cache is rare.
type TopLevelEquivCache struct {
The exported symbols from this package should be "Cache" and "NewCache". The top level struct/constructor should take those names, while we rename the old ones.
Side note: To avoid "stutter" we should not put "equiv" or "equivalence" in exported symbols of this package.
And NodeCache, we need to pass this type as a parameter in generic_scheduler.
@@ -33,6 +33,152 @@ import (
"github.com/golang/glog"
)

// nodeMap is type for a map of {nodeName: *Cache}
This comment is mostly redundant. We know it's a type, a map, and that the value is *Cache. How about something like, "nodeMap stores a *Cache for each node."
Fixed, thanks!
// GetOrCreateCache returns the second level Cache for given node name.
// If the cache does not exist, create it.
// This should be called in generic scheduler before predicate phase begins for given node.
func (n *TopLevelEquivCache) GetOrCreateCache(name string) *Cache {
Does the API surface of this package need to change for this improvement? RunPredicate had the side effect of creating new cache items if needed.
I'm ok with improvements to the eCache API, but this seems like more management work will be put on generic scheduler.
As RunPredicate is now only handled by NodeCache, the first level cache has to be managed by the generic scheduler.
@@ -41,14 +187,14 @@ import (
// class". (Equivalence class is defined in the `Class` type.) Saved results
// will be reused until an appropriate invalidation function is called.
type Cache struct {
mu sync.RWMutex
cache nodeMap
sync.RWMutex
Why remove the 'mu' here? By embedding the type, people outside this package can start calling (*Cache).Lock(), which I don't think is intended.
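For illustration, here is a minimal standalone sketch (toy types, not this PR's code) of why the embedding matters: the embedded mutex's methods are promoted into the outer type's method set, so any importer of the package could lock the cache directly.

package main

import "sync"

// EmbeddedCache embeds sync.RWMutex, which promotes Lock/Unlock/RLock/RUnlock
// into EmbeddedCache's method set. Any importer of the package could call
// (*EmbeddedCache).Lock() directly.
type EmbeddedCache struct {
	sync.RWMutex
	items map[string]int
}

// NamedCache keeps the mutex in an unexported field, so locking stays an
// internal detail of the defining package.
type NamedCache struct {
	mu    sync.RWMutex
	items map[string]int
}

func main() {
	e := &EmbeddedCache{items: map[string]int{}}
	e.Lock() // promoted method: callable from outside the package
	e.items["node-1"] = 1
	e.Unlock()

	n := &NamedCache{items: map[string]int{}}
	n.mu.Lock() // only reachable where the unexported field "mu" is visible
	n.items["node-1"] = 1
	n.mu.Unlock()
}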
@@ -355,21 +355,27 @@ func (g *genericScheduler) findNodesThatFit(pod *v1.Pod, nodes []*v1.Node) ([]*v
// We can use the same metadata producer for all nodes.
meta := g.predicateMetaProducer(pod, g.cachedNodeInfoMap)

var equivClass *equivalence.Class
var (
A single var can be on a single line.
Done! Also, refactored functions to new names :-) Please review.
(force-pushed from e36cdd4 to f505798)
This looks great! I left mostly style/naming comments, and they don't all need to be addressed. There are a couple API/threading questions though.
// nodeMap stores a *Cache for each node.
type nodeMap map[string]*NodeCache

// Cache is a thread safe nodeMap
Because the Cache type is the main entry point of this package, I think we should keep a lot of the previous type comment with it, i.e. the fact that it saves and reuses predicate results, and the explanation of the node/predicate/class hierarchy. (Those comments were moved to the NodeCache type, but I think it's okay if we repeat ourselves a bit.)
Also, this is the second significant API change I've seen that is driven by performance needs. I think we should add to the comment on this type that users are expected to precompute the equivalence.Class value and pre-lookup the NodeCache within one scheduling cycle.
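For illustration, the kind of type comment being asked for might read roughly like this (a sketch, not the text that was merged):

// Cache saves and reuses the results of predicate functions. Results are
// keyed by node, then by predicate, then by the pod's equivalence class,
// and are reused until an appropriate invalidation function is called.
//
// For performance, callers are expected to compute the pod's
// equivalence.Class once per scheduling cycle and to look up each node's
// NodeCache once via ForNode, rather than going through the top-level
// Cache on every predicate call.
type Cache struct {
	// ...
}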
Good point, I will clarify these with more comments.
// ForNode returns the existing NodeCache for given node if present. Otherwise,
// it creates the NodeCache and returns it.
// The boolean flag is true if the value was loaded, false if created.
func (n *Cache) ForNode(name string) (nodeCache *NodeCache, exists bool) {
It's weird for the receiver variable to be n. Could you change it to c?
}

// removeCachedPreds deletes cached predicates by given keys.
func (c *NodeCache) removeCachedPreds(predicateKeys sets.String) {
Should this just be called invalidatePreds?
Probably we should move this function down to the other NodeCache receivers.
return &Cache{
cache: make(nodeMap),
// newCache returns an empty NodeCache.
func newCache() *NodeCache {
Let's call this newNodeCache to avoid confusion with NewCache.
@@ -97,7 +218,7 @@ type predicateResult struct {
// run and its results cached for the next call.
//
// NOTE: RunPredicate will not update the equivalence cache if the given NodeInfo is stale.
func (c *Cache) RunPredicate(
func (c *NodeCache) RunPredicate(
Similar to above, the receiver variable should probably change from c to n.
Fixed, thanks for catching these nits.
@@ -126,16 +247,15 @@ func (c *Cache) RunPredicate(
}

// updateResult updates the cached result of a predicate.
func (c *Cache) updateResult(
// This function is thread safe for second level Cache, no need to sync with top level cache.
All operations on NodeCache objects are threadsafe in the context of that NodeCache object, right? It might be worth mentioning that on the type's comment.
Done!
// ForNode returns the existing NodeCache for given node if present. Otherwise,
// it creates the NodeCache and returns it.
// The boolean flag is true if the value was loaded, false if created.
func (n *Cache) ForNode(name string) (nodeCache *NodeCache, exists bool) {
What should callers do differently depending on the boolean return?
I would like to expose an indication of the change we did in the ForNode() call, as it has mixed duties. This is only used by tests for now.
/test pull-kubernetes-e2e-kops-aws
/lgtm
(I still don't have approver powers, btw)
}

// NewCache returns an empty Cache.
// NewCache create am empty equiv class cache.
s/create am/creates an/
c.nodeToCache[name] = newNodeCache()
nodeCache = c.nodeToCache[name]
}
return
I'm still new to golang - what's the general style for naked returns? Should we be redundant and say return nodeCache, exists here?
I believe this is one of the benefits of named return values, i.e. there is no need to refactor the return line in future changes.
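A tiny standalone example of the trade-off being discussed (toy code, not this PR's): with named results, a bare return hands back whatever the named variables currently hold, whereas spelling out return nodeCache, exists is more explicit but has to be kept in sync with the signature.

package main

import "fmt"

// forNode uses named return values; the bare "return" at the end is
// equivalent to "return nodeCache, exists", so later edits to the body
// don't need to touch the return statement.
func forNode(m map[string]int, name string) (nodeCache int, exists bool) {
	nodeCache, exists = m[name]
	if !exists {
		nodeCache = 42 // pretend we lazily created an entry
		m[name] = nodeCache
	}
	return
}

func main() {
	m := map[string]int{"node-1": 1}
	fmt.Println(forNode(m, "node-1")) // 1 true
	fmt.Println(forNode(m, "node-2")) // 42 false
}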
// InvalidateCachedPredicateItemForPodAdd is a wrapper of
// InvalidateCachedPredicateItem for pod add case
// TODO: This does not belong with the equivalence cache implementatioc.
s/implementatioc/implementation/
}
}
// predKeySet returns all cached predicate keys.
func (n *NodeCache) predKeySet() sets.String {
Is this function used anywhere? Can we delete it?
Oh, a legacy one.
mockCache := &syncingMockCache{
Cache: cache,
cycleStart: make(chan struct{}),
cacheInvalidated: make(chan struct{}),
}

eCache := equivalence.NewCache()
// ForNode will lazily create NodeCache for given test node.
Is this needed as part of this test?
Oh, no need.
// it creates the NodeCache and returns it.
// The boolean flag is true if the value was loaded, false if created.
func (c *Cache) ForNode(name string) (nodeCache *NodeCache, exists bool) {
c.mu.RLock()
Shouldn't we do c.mu.Lock() and c.mu.Unlock(), since we are adding a new entry in case we don't have one?
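One common way to address this concern, sketched here with toy types rather than this PR's actual code: take the read lock for the fast path and only take the write lock when an entry has to be inserted, re-checking the map after the write lock is acquired.

package main

import "sync"

type nodeCache struct{}

type cache struct {
	mu          sync.RWMutex
	nodeToCache map[string]*nodeCache
}

// forNode uses the read lock for the common "already exists" case and only
// takes the write lock when a new entry must be added. The lookup is repeated
// under the write lock because another goroutine may have inserted the entry
// between RUnlock and Lock.
func (c *cache) forNode(name string) (*nodeCache, bool) {
	c.mu.RLock()
	nc, ok := c.nodeToCache[name]
	c.mu.RUnlock()
	if ok {
		return nc, true
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	if nc, ok = c.nodeToCache[name]; ok {
		return nc, true
	}
	nc = &nodeCache{}
	c.nodeToCache[name] = nc
	return nc, false
}

func main() {
	c := &cache{nodeToCache: make(map[string]*nodeCache)}
	_, existed := c.forNode("node-1")
	_ = existed
}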
}

// InvalidatePredicatesOnNode clears cached results for the given predicates on one node.
func (c *Cache) InvalidatePredicatesOnNode(nodeName string, predicateKeys sets.String) {
You can probably combine this function and the one above.
I will choose to keep them, as separate calls are needed.
(force-pushed from 1cb63e9 to 153045e)
@resouer Thanks. The changes LGTM. I am just curious about the RWMutex scaling issue. Can you try a couple of things:
- Schedule pods on a bigger node where GOMAXPROCS is higher (>8).
- Schedule more pods on more nodes, like 30k pods and 5k nodes.
I want to know whether we are going to hit any of the scaling issues mentioned for RWLock.
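For the GOMAXPROCS question, a quick standalone micro-benchmark (not the scheduler_perf benchmark discussed in this thread) can probe read-side scaling of sync.RWMutex across CPU counts, e.g. with go test -bench . -cpu 4,8,16,32 in a _test.go file:

package rwscale

import (
	"sync"
	"testing"
)

// BenchmarkRWMutexRead hammers the read lock from many goroutines; comparing
// results across -cpu values shows whether RLock itself becomes a bottleneck
// at higher parallelism.
func BenchmarkRWMutexRead(b *testing.B) {
	var mu sync.RWMutex
	m := map[string]int{"node-1": 1}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.RLock()
			_ = m["node-1"]
			mu.RUnlock()
		}
	})
}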
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bsalamat, misterikkit, ravisantoshgudimetla, resouer.
@resouer: The following test failed.
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.
Thanks Bobby, I am keeping an eye on the test grid.
@resouer I checked out master today and made sure that your changes are in, but when I run the scheduler benchmarks for 5000 nodes, scheduling 1000 pods, I still see a performance degradation when I enable the equivalence cache: 20ms/pod when ecache is enabled vs. 18ms/pod when ecache is disabled. I enable/disable eCache here. Am I missing anything in running the test?
eCache disabled:
eCache enabled:
The test method is right, but I just tested with HEAD@d2fc875489 and it seems to work well:
It's weird. I'm looking into it.
Hmmm. I don't know what is going on. I checked out the head again and used your exact configuration: {nodes: 5000, existingPods: 0, minPods: 10000}. Still, the results are in favor of disabled ecache. Just to make sure we are using a similar config, I use the following command line to run the benchmarks:
make test-integration WHAT=./test/integration/scheduler_perf GOFLAGS="-v=1" KUBE_TEST_VMODULE="''" KUBE_TEST_ARGS="-run=xxx -bench BenchmarkScheduling/5000Nodes/0Pods"
Cache enabled:
Cache disabled:
Emm, when I rebased to HEAD@e4465b6e2f (should be the latest commit of yesterday), I can see your problem happen:
Seems like some recent commits "improved" the default scheduler (or the bench testing, at least) somehow, which is interesting. I am taking a look at this. [Update] After bisecting the commits, it seems related to 0bf7427 of #66061.
@bsalamat, a quick follow-up: with the test command below, everything works as expected in my env:
export KUBE_TEST_ARGS="-run=xxx -bench=BenchmarkScheduling -cpuprofile cpu-5000n-10000p.out"
make test-integration WHAT=./test/integration/scheduler_perf
But with a test command like this, the benchmark test will finish much earlier (which never happened before #66061):
make test-integration WHAT=./test/integration/scheduler_perf KUBE_TEST_ARGS="-run=xxx -bench=BenchmarkScheduling -cpuprofile cpu-5000n-10000p.out"
The difference seems related to env passing across multiple .sh files. [Update]: I've reported the issue and this inconsistency will be solved by #66816.
Maybe add "-count 1" just to be paranoid?
@misterikkit @bsalamat Just fired #66862 and #66816 to partially fix this weird case. The root cause I recently observed is: depending on
Thanks, @resouer! As you have noticed, vmodule skews the results significantly. It must be disabled for performance benchmarking, otherwise vmodule takes most of the time. "vmodule" is disabled in our production code, but it is enabled by default when running integration tests, including our benchmarks. I used the following command to run the benchmarks while vmodule is disabled:
KUBE_TEST_ARGS="-run=xxx -bench=BenchmarkScheduling/5000Nodes/0Pods" GOFLAGS="-v=1" KUBE_TEST_VMODULE="''" make test-integration WHAT=./test/integration/scheduler_perf
I still see a slight performance decrease when ecache is enabled. We should think about what kind of changes we should make to improve eCache's performance. CPU profiling may reveal areas of potential improvement.
Cache disabled:
Cache enabled:
I have some good news. The above results are obtained with the regular Scheduler benchmark. That benchmark tries scheduling a number of simple pods. The only scheduling requirement of those pods is the amount of resources available on nodes. Checking the amount of resources is a simple mathematical operation. That's why we don't see a performance improvement when we enable eCache. I ran the AntiAffinity benchmarks with and without eCache. eCache improves performance by over 16x! These results indicate that eCache is beneficial only for more sophisticated predicates which are computationally expensive. Examples are the inter-pod affinity and anti-affinity predicates. What I think we should do is find a list of predicates which are more computationally expensive and use eCache only for those, instead of using the eCache for all the predicates.
Cache disabled:
Cache enabled:
@bsalamat, Yes, I am aware that the current benchmark test we are playing with is too "simple" :D And another thing I'd like to amend is, in the current logic,
Actually, after #66862 is in, the only impact of ecache is map accesses in
This is exactly what's in my mind. Thanks! Will continue in this direction, though instead of bypassing specific predicates, I would prefer that we measure them in the future. And I was also hoping to invalidate
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here (/~https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md). Add misterikkit to sig-scheduling REVIEWERS. I have met the following criteria. - member for at least 3 months - primary reviewer for at least 5 PRs - #63603 - #63665 (and related PRs) - #63839 - #65714 - #66862 - reviewed or merged at least 20 PRs reviewed 13: /~https://github.com/pulls?utf8=%E2%9C%93&q=is%3Apr+archived%3Afalse+is%3Amerged+repo%3Akubernetes%2Fkubernetes+commenter%3Amisterikkit+in%3Acomment+assignee%3Amisterikkit+ merged 22: /~https://github.com/pulls?utf8=%E2%9C%93&q=is%3Apr+author%3Amisterikkit+archived%3Afalse+is%3Amerged+repo%3Akubernetes%2Fkubernetes+ **Release note**: ```release-note NONE ``` /cc @bsalamat
What this PR does / why we need it:
The current ecache introduced a global lock across all nodes, and this patch assigns an ecache per node to eliminate that global lock. The improvements in scheduling performance and throughput are both significant.
CPU Profile Result
Machine: 32-core 60GB GCE VM
1k nodes 10k pods bench test (we've highlighted the critical function):
Throughput Test Result
1k nodes 3k pods scheduler_perf test:
Current default scheduler, ecache is disabled:
Minimal observed throughput for 3k pod test: 200 PASS ok k8s.io/kubernetes/test/integration/scheduler_perf 30.091s
With this patch, ecache is enabled:
Minimal observed throughput for 3k pod test: 556 PASS ok k8s.io/kubernetes/test/integration/scheduler_perf 11.119s
Design and implementation:
The idea is: we re-designed ecache into a "two level cache".
The first level cache holds the global lock across nodes, and sync is needed only when a node is added or deleted, which happens at much lower frequency.
The second level cache is assigned per node and its lock is restricted to the node level, so there is no need to take the global lock during the whole predicate evaluation cycle. For more detail, please check the original discussion.
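A minimal, self-contained sketch of this two-level layout (toy types and names; the real cache also keys results by equivalence-class hash, which is omitted here): the first-level lock is only taken when a node's cache is found or created, while predicate results live behind each node's own lock.

package main

import "sync"

// predicateResult is a toy stand-in for a cached predicate outcome.
type predicateResult struct {
	Fit bool
}

// NodeCache is the second level: per-node results behind a per-node lock.
type NodeCache struct {
	mu    sync.RWMutex
	cache map[string]predicateResult // predicateKey -> result
}

func newNodeCache() *NodeCache {
	return &NodeCache{cache: make(map[string]predicateResult)}
}

func (n *NodeCache) lookup(predicateKey string) (predicateResult, bool) {
	n.mu.RLock()
	defer n.mu.RUnlock()
	r, ok := n.cache[predicateKey]
	return r, ok
}

func (n *NodeCache) update(predicateKey string, r predicateResult) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.cache[predicateKey] = r
}

// Cache is the first level: its lock is only needed when a node's cache is
// created or removed, which is far less frequent than predicate lookups.
type Cache struct {
	mu          sync.RWMutex
	nodeToCache map[string]*NodeCache
}

func NewCache() *Cache {
	return &Cache{nodeToCache: make(map[string]*NodeCache)}
}

func (c *Cache) ForNode(name string) *NodeCache {
	c.mu.Lock()
	defer c.mu.Unlock()
	nc, ok := c.nodeToCache[name]
	if !ok {
		nc = newNodeCache()
		c.nodeToCache[name] = nc
	}
	return nc
}

func main() {
	eCache := NewCache()
	nodeCache := eCache.ForNode("node-1") // first-level lock, once per node per cycle
	nodeCache.update("PodFitsResources", predicateResult{Fit: true})
	if r, ok := nodeCache.lookup("PodFitsResources"); ok {
		_ = r // reuse the cached result without touching the first-level lock
	}
}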
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #63784
Special notes for your reviewer:
Tagged as WIP to make sure this does not break existing code and tests; we can start review after CI is happy.
Release note: