
simulator: synchronically replicate cluster state from a real cluster to a fake one for the scheduler testing #327

Closed
sanposhiho opened this issue Dec 10, 2023 · 18 comments · Fixed by #367
Labels
area/simulator Issues or PRs related to the simulator. kind/feature Categorizes issue or PR as related to a new feature. priority/next-release Issues or PRs related to features should be implemented in time for the next release.

Comments

@sanposhiho
Member

sanposhiho commented Dec 10, 2023

/assign
/kind feature


This issue proposes a simple new component that continuously replicates cluster state from a production cluster to a fake cluster.

Background

Testing the scheduler is a complex challenge. There are countless patterns of operations executed within a cluster, making it impractical to anticipate every scenario with a finite number of tests. More often than not, bugs are discovered only when the scheduler is deployed in an actual cluster.

Having a development or sandbox environment for testing the scheduler—or, indeed, any Kubernetes controllers—is a common practice. However, this approach falls short of capturing all the potential scenarios that might arise in a production cluster. It’s an inevitable truth that a development cluster never sees the exact same use or exhibits the same behavior as its production counterpart, with notable differences in workload sizes and scaling dynamics.

User story

We have a custom scheduler with a co-scheduling feature.
We want to test it in a cluster that receives resources similar to our production cluster's. But our production cluster is much bigger than our development cluster, so it's unrealistic to catch all bugs in the development cluster.

Resources to sync

We shouldn't simply sync everything; we have to think about what to sync and what not to.

All resources involved in scheduling should be synced.
And we should make it configurable which resources to sync, since everyone could have a custom scheduler plugin that schedules Pods based on anything (a rough configuration sketch follows the list below).

By default, we should sync:

  • Pods
  • Nodes
  • PVs
  • PVCs
  • StorageClasses

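To make that configurability concrete, here is a minimal sketch in Go of what such a configuration could look like; the type and field names are hypothetical, not the simulator's actual API:

```go
// Hypothetical configuration shape for the syncer; names are
// illustrative only, not the simulator's actual API.
package syncer

import "k8s.io/apimachinery/pkg/runtime/schema"

// SyncerConfig lists which resources the syncer replicates from the
// real cluster into the fake one.
type SyncerConfig struct {
	// Resources to sync. When empty, DefaultResources is used.
	Resources []schema.GroupVersionResource
}

// DefaultResources are the resources synced when none are configured:
// Pods, Nodes, PVs, PVCs, and StorageClasses.
var DefaultResources = []schema.GroupVersionResource{
	{Group: "", Version: "v1", Resource: "pods"},
	{Group: "", Version: "v1", Resource: "nodes"},
	{Group: "", Version: "v1", Resource: "persistentvolumes"},
	{Group: "", Version: "v1", Resource: "persistentvolumeclaims"},
	{Group: "storage.k8s.io", Version: "v1", Resource: "storageclasses"},
}
```
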
Scheduled Pods

We cannot simply sync all changes to Pods, because the real cluster has its own scheduler, and it schedules all Pods in that cluster.
If we simply synced all changes to Pods, the scheduling results would also be synced, and they could conflict with the decisions of the other scheduler running in the fake cluster.

So we don't sync any update events for Pods that are already scheduled.
Pods are synced like this:

  1. In the real cluster, Pod-a is created.
  2. In the fake cluster, Pod-a is created (synced).
  3. In the real cluster, the scheduler schedules Pod-a to Node-a. We don't copy this change to the fake cluster.
  4. In the fake cluster, a different scheduler from the one in (3) schedules Pod-a to Node-x.

This means the scheduling results may differ between the real cluster and the fake cluster, but that's OK.
Our purpose is to create a fake cluster for testing the scheduler, one that receives the same load as the production cluster.
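
A minimal sketch of that rule in Go, assuming a client-go informer on the real cluster and a client pointed at the fake one (illustrative only; `registerPodSync` and the surrounding structure are hypothetical, not the simulator's actual code):

```go
// Illustrative sketch of the Pod sync rule described above.
package syncer

import (
	"context"
	"log"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// registerPodSync mirrors Pod events from the real cluster into the fake
// cluster. Add events are always propagated; update events are dropped
// once a Pod is scheduled, so the fake cluster's scheduler can make its
// own placement decision.
func registerPodSync(podInformer cache.SharedIndexInformer, fakeClient kubernetes.Interface) {
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*v1.Pod).DeepCopy()
			// resourceVersion must be empty on objects passed to Create.
			pod.ResourceVersion = ""
			if _, err := fakeClient.CoreV1().Pods(pod.Namespace).Create(
				context.TODO(), pod, metav1.CreateOptions{}); err != nil {
				log.Printf("sync add %s/%s: %v", pod.Namespace, pod.Name, err)
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			pod := newObj.(*v1.Pod)
			if pod.Spec.NodeName != "" {
				// The Pod is scheduled in the real cluster; propagating this
				// update would overwrite the fake scheduler's decision.
				return
			}
			// ...propagate updates for still-unscheduled Pods...
		},
	})
}
```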

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 10, 2023
@utam0k
Member

utam0k commented Dec 13, 2023

I'm also interested in this feature.

@sanposhiho
Member Author

/retitle simulator: synchronically replicate cluster state from a real cluster to a fake one for the scheduler testing
/area simulator

On second thought, I'll build this into the simulator.

@k8s-ci-robot k8s-ci-robot changed the title mimicube: the tool to synchronically replicate cluster state from a real cluster to a fake one for the scheduler testing simulator: synchronically replicate cluster state from a real cluster to a fake one for the scheduler testing Jan 7, 2024
@k8s-ci-robot k8s-ci-robot added the area/simulator Issues or PRs related to the simulator. label Jan 7, 2024
@sanposhiho
Member Author

/priority next-release

@k8s-ci-robot k8s-ci-robot added the priority/next-release Issues or PRs related to features should be implemented in time for the next release. label Mar 31, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2024
@sanposhiho
Member Author

/remove-lifecycle stale

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 30, 2024
@sanposhiho
Member Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 30, 2024
@utam0k
Member

utam0k commented Jul 31, 2024

Our company is interested in giving it a try. May I ask you to assign me?

@sanposhiho
Member Author

/assign @utam0k

@saza-ku
Contributor

saza-ku commented Aug 19, 2024

Hi! I'd like to take over this issue. Could you assign me?

@saza-ku
Contributor

saza-ku commented Aug 19, 2024

First, I'll add a pseudo real cluster, needed for debugging the syncer, to docker-compose-local.yml.

Then, I'll incrementally implement the syncer, replacing #335. Is it okay to close #335?

@sanposhiho
Member Author

/assign @saza-ku

> Is it okay to close #335?

Yup, I'll close that one myself.

@sanposhiho
Member Author

/unassign

@saza-ku
Contributor

saza-ku commented Aug 20, 2024

> So we don't sync any update events for Pods that are already scheduled.

When the syncer starts, it won't sync Pods that already exist in the real cluster. But is that okay, given that they can be imported by enabling externalImportEnabled?

@saza-ku
Contributor

saza-ku commented Aug 20, 2024

Anyway, I'll proceed to implement it in the following steps.

@sanposhiho
Member Author

> When the syncer starts, it won't sync Pods that already exist in the real cluster.

It will. The syncer's event handler propagates existing Pods to the fake cluster when it starts.
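
For reference, this is standard client-go informer behavior: when an informer starts, its initial List is delivered through AddFunc, so pre-existing Pods take the same add path as newly created ones. A self-contained sketch (illustrative only; `watchPods` and `realClient` are hypothetical names):

```go
// Why existing Pods are synced at startup: the informer's initial List
// is replayed through AddFunc before any watch events arrive.
package syncer

import (
	"context"
	"log"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func watchPods(ctx context.Context, realClient kubernetes.Interface) {
	factory := informers.NewSharedInformerFactory(realClient, 0)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*v1.Pod)
			// Fires once per Pod that existed before the informer started,
			// then again for each newly created Pod.
			log.Printf("add: %s/%s", pod.Namespace, pod.Name)
		},
	})
	factory.Start(ctx.Done())
	cache.WaitForCacheSync(ctx.Done(), podInformer.HasSynced)
}
```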

@saza-ku
Contributor

saza-ku commented Aug 21, 2024

> It will. The syncer's event handler propagates existing Pods to the fake cluster when it starts.

I see. So the syncer will add all the Pods regardless of whether they are scheduled, right?

If so, it might be a problem when already-scheduled Pods (ones whose nodeName users specified manually) are added in the real cluster. For example, the specified node in the simulator cluster may have no remaining capacity while the corresponding node in the real cluster does.

But that's an edge case that should occur only occasionally. Let's discuss the problem in a follow-up issue. I'll do the first implementation the way you describe.

@sanposhiho
Member Author

We don't sync update events of scheduled Pods, but we can sync their addition events.
