Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

📖 CAEP: Add support for infrastructure cluster resources to be managed externally #4135

Merged

Conversation

enxebre
Copy link
Member

@enxebre enxebre commented Feb 3, 2021

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4095

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 3, 2021
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 3, 2021
@enxebre enxebre force-pushed the unmanaged-infrastructure branch from 9fead58 to b025329 Compare February 3, 2021 11:23
@caruccio
Copy link

caruccio commented Feb 3, 2021

Great!!! Looking forward to putting my Pentium back again in the game.

btw: s/introudce/introduce/

Copy link
Member

@randomvariable randomvariable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. Maybe a little more detail required. Just make sure if there's any new terms, refer to the glossary, and use personas already listed there as best suits.


### Security Model

When unmanaged no additional privileges for a cloud provider need to be given to CAPI other than the required to manage machines.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to have a section that deals with single controller multitenancy, and in addition modify the multitenancy contract doc in #4074 within this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this would affect single controller multitenancy? In such a model privileges are given for each infraResource so this would be transparent right? can you elaborate a bit?

docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved

**Unmanaged**

- The provider infraCluster controller will skip any infrastructure reconciliation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- The provider infraCluster controller will skip any infrastructure reconciliation.
- The provider infraCluster controller will skip any infrastructure reconciliation with the exception of read-only queries to ensure the operation of the related cloud provider integration, e.g. required tags for working of the AWS Cloud Provider..

Copy link
Member Author

@enxebre enxebre Feb 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@randomvariable I know we discuss about this but would you mind refreshing my mind where specifically you see the need to read?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC there are some requirements for the cloud-controller-manager that the AWSCluster controller verifies at the moment. I think checking that tags have been set on the VPC correctly was one of them?

I think it's reasonable that if a InfraCluster controller normally would perform validation of the environment it has created as a requirement for parts of core K8s to function, then we should also do this validation on an unmanaged cluster.

Though, we should make sure the expectations of the environment are documented well so users can ensure their infrastructure provisioning can set the right tags etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cloud-controller-manager that the AWSCluster controller verifies at the moment.

@JoelSpeed @randomvariable I see, we have a list of this somewhere atm?

docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
@randomvariable
Copy link
Member

randomvariable commented Feb 3, 2021

Another thing I think we should go into more details is why the unmanaged field is useful based on our discussions - that we need information from the infraCluster object in order to inform the infraMachine controller what to do without a lot of repetition. Given our discussions, maybe walk through the example with the CAPA controller. A flow diagram may also be useful.

@enxebre enxebre force-pushed the unmanaged-infrastructure branch from b025329 to b7677c7 Compare February 3, 2021 17:21
@enxebre
Copy link
Member Author

enxebre commented Feb 3, 2021

Another thing I think we should go into more details is why the unmanaged field is useful based on our discussions - that we need information from the infraCluster object in order to inform the infraMachine controller what to do without a lot of repetition. Given our discussions, maybe walk through the example with the CAPA controller. A flow diagram may also be useful.

@randomvariable You mean the info inferred from the infraCluster when managed as a reason to reuse the CR to support unmanaged?

docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
@enxebre enxebre force-pushed the unmanaged-infrastructure branch from b7677c7 to cca34bd Compare February 4, 2021 10:30
@enxebre
Copy link
Member Author

enxebre commented Feb 9, 2021

Open questions:
@randomvariable can you confirm which tags would you expect to be validated #4135 (comment)

@vincepri #4135 (comment) isn't this what the existing unmanaged VPC approach does /~https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/0987c8d9f16a4f891d05a622aa92b4592ad5c154/api/v1alpha3/types.go#L244 but we'd need to teach it to not reconcile sg and LBs, hence we'd use the externalManagedField for this, i.e the contract for an externalManaged AWSCluster is the vpc ID.

Could you please provide specific details of what are you missing from the proposal in its current form?

@vincepri
Copy link
Member

vincepri commented Feb 9, 2021

@enxebre I've outlined most of the thinking in #4135 (comment) — Yes there is going to be contract changes for the InfraCluster controllers, although those can be made generic based on a signal (annotation) that's common across all infrastructure providers.

@vincepri
Copy link
Member

Hi folks, a few of us met today to discuss some details of this proposal, and wanted to provide a summary and outline next step.

The proposal is focused only on getting the InfrastructureCluster bits to be externally managed, while keeping the infrastructure machine controller intact, at least for now.

A few considerations / things to decide:

  • Identify that an InfraCluster is externally managed.

    • Use an annotation
    • Use a spec field
    • Use a different CRD / Kind resource that included all fields from the current one.
  • Readiness and contracts.

    • Have external managers / controllers to respect the Cluster API contracts.
    • Have an additional annotation, or spec field that marks the resource as ready.
      • Open question: How do we deal with new contracts as they come in? (example, failure domains)

@enxebre enxebre force-pushed the unmanaged-infrastructure branch from cca34bd to 0649740 Compare February 15, 2021 15:45
@enxebre
Copy link
Member Author

enxebre commented Feb 24, 2021

@vincepri @randomvariable @CecileRobertMichon @detiber @fabriziopandini do you have any additional feedback after latest changes pushed?

Also can anyone clarify why couldn't we just reuse the existing unmanagedVPC and teach it to not reconcile the lb when the field is not set and SGs as suggested above?

docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
@vincepri
Copy link
Member

Also can anyone clarify why couldn't we just reuse the existing unmanagedVPC and teach it to not reconcile the lb when the field is not set and SGs as suggested above?

This point should probably be tackled in the load balancer proposal later, cc @randomvariable @detiber

docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
docs/proposals/20210203-unmanaged-infrastructure.md Outdated Show resolved Hide resolved
There is no way to predict what would happen in this scenario.
The InfraCluster controller would start attempting to reconcile infrastructure that it did not create, and therefore, there may be assumptions it makes that mean it cannot manage this infrastructure.

To prevent this, we will have to implement (in the InfraCluster webhook) a means to prevent users converting externally managed InfraClusters into managed InfraClusters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about adding this as a requirement for the external management system up in the document?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern with making this a requirement of the external management is that we will have to duplicate that across each external system right? If this is a core requirement of this contract and the check just needs to be, if this annotation is present, don't let it be changed, that feels like it could be part of the core validation?

I guess we would need to implement this for each provider anyway so maybe not much is lost 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I would prefer to lust this as a responsibility of the External providers, because I don't see how this can be centralised given that we are talking about InfraCluster objects and one of the of the goal of this KEP is to avoid changes to the existing infrastructure providers.
If instead you can come out with a solution that centralises this responsibility, +1 to that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about a generic validating webhook that could be implemented to look at the metadata only (therefore could accept any type), this code could be provided centrally and generic, but providers or external infrastructure providers would need to set it up to look at the correct type

Ie all they would need to do is configure the ValidatingWebhookConfiguration correctly. Do you think something like this would work/be useful?


- Write a kubectl plugin that allows a user to mark their InfraCluster resources as ready

- Add a secondary annotation to this contract that causes the provider InfraCluster controller to mark resources as ready
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that this requires all the infrastructure providers to reconcile the externally managed infra clusters, I'm +1 to remove this option from the list

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's value in keeping it so we know it's been discussed, but maybe we can expand on this later to explain why we don't want to go down this route?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok to keep to as a record, but please add a note reporting the outcome of the discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been done?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also prefer to not implement this option as it's basically additional complexity in ClusterAPI because of a (current) limitation of kubectl which will probably be resolved mid-term: [WIP] kubectl: support --subresource flag #99556

@vincepri
Copy link
Member

/kind proposal

@k8s-ci-robot k8s-ci-robot added the kind/proposal Issues or PRs related to proposals. label Mar 31, 2021
@CecileRobertMichon
Copy link
Contributor

@enxebre are we waiting for any reviewer's lgtm?

@CecileRobertMichon
Copy link
Contributor

(checked from the list of reviewers in the PR)

/assign @yastij @vincepri

@vincepri
Copy link
Member

Action item: Add some text on the decision about how do we manage or integrate with load balancer proposal

@JoelSpeed
Copy link
Contributor

I've re-reviewed the load balancer proposal today and 'm not sure there is anything worth noting in this proposal with regards to it.

My evaluation from #4135 (comment) still stands.

  • Load balancers are optional, and would take precedence if they are defined.
  • Load balancer provider reads from InfraClusters but doesn't write to them (as per the proposal today), as such, given we state the external infra must fill all the required fields, nothing to note here
  • The LB proposal doesn't mention making any changes to InfraClusters at present, nothing for us to account for here

The only thing I found, was that the Load Balancer proposal should (I don't think it does yet) specify some mechanism for signalling to InfraCluster providers that they should not be creating a control plane LB as this is handled separately. However that manifests (either by some annotation or changing the InfraCluster contract to check the Cluster resource or whatever), it will mean updating the InfraCluster contract documentation, which this proposal already notes that external infra providers must fulfil, so I don't think it is worth calling out explicitly.

@detiber I would appreciate if you could have a double check of my evaluation in case I've misinterpreted the LB proposal or missed some of the details

@vincepri Was there some specific interaction you were concerned about that I might not have had in mind while reviewing the two proposals? I can double check or provide further context if there are specific concerns, but at the moment I don't see real overlap here given the ordering and precedence set out in the LB proposal

@randomvariable
Copy link
Member

Broadly agree with the approach.

I think @JoelSpeed 's comments are correct here.

+1 on proceeding with this.

@vincepri
Copy link
Member

@vincepri Was there some specific interaction you were concerned about that I might not have had in mind while reviewing the two proposals? I can double check or provide further context if there are specific concerns, but at the moment I don't see real overlap here given the ordering and precedence set out in the LB proposal

I think we're good to proceed, the explanation above seems in line with my assumptions, although it's good to have a sounding board.

Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

/hold
1 week lazy-consensus

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 15, 2021
Copy link
Member

@yastij yastij left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@enxebre
Copy link
Member Author

enxebre commented Apr 21, 2021

Let's merge this and proceed with the implementation since lazy consensus week is over today and does not seem to be any objection.
Thanks a lot everyone for the reviews.
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 21, 2021
@k8s-ci-robot k8s-ci-robot merged commit 50f89f3 into kubernetes-sigs:master Apr 21, 2021
enxebre added a commit to enxebre/hypershift that referenced this pull request Jun 1, 2021
This PR drops the ExternalInfraCluster CRD in favour of AWSCluster.

Originally we added support for externally managed infra in CAPA via the ExternalInfraCluster CRD kubernetes-sigs/cluster-api-provider-aws#2124 and we used that commit of CAPA in hypershift.

Later on we decided to revert that approach upstream and reuse the existing ${infra}Cluster CRDs with an annotation to support externally managed infrastructure kubernetes-sigs/cluster-api#4135

This PR bring latest CAPI/CAPA with one additional patch on top
kubernetes-sigs/cluster-api#4709
kubernetes-sigs/cluster-api-provider-aws#2453
to avoid running webhooks.

As a follow up we need to rebuild the images from the main branch once those patches are merged or otherwise enable webhooks.
enxebre added a commit to enxebre/hypershift that referenced this pull request Jun 1, 2021
This PR drops the ExternalInfraCluster CRD in favour of AWSCluster.

Originally we added support for externally managed infra in CAPA via the ExternalInfraCluster CRD kubernetes-sigs/cluster-api-provider-aws#2124 and we used that commit of CAPA in hypershift.

Later on we decided to revert that approach upstream and reuse the existing ${infra}Cluster CRDs with an annotation to support externally managed infrastructure kubernetes-sigs/cluster-api#4135

This PR bring latest CAPI/CAPA with one additional patch on top
kubernetes-sigs/cluster-api#4709
kubernetes-sigs/cluster-api-provider-aws#2453
to avoid running webhooks.

As a follow up we need to rebuild the images from the main branch once those patches are merged or otherwise enable webhooks.
enxebre added a commit to enxebre/hypershift that referenced this pull request Jun 2, 2021
This PR drops the ExternalInfraCluster CRD in favour of AWSCluster.

Originally we added support for externally managed infra in CAPA via the ExternalInfraCluster CRD kubernetes-sigs/cluster-api-provider-aws#2124 and we used that commit of CAPA in hypershift.

Later on we decided to revert that approach upstream and reuse the existing ${infra}Cluster CRDs with an annotation to support externally managed infrastructure kubernetes-sigs/cluster-api#4135

This PR bring latest CAPI/CAPA.

As a follow up we need to rebuild the images and consume them from quay.io/hypershift.
enxebre added a commit to enxebre/hypershift that referenced this pull request Jun 2, 2021
This PR drops the ExternalInfraCluster CRD in favour of AWSCluster.

Originally we added support for externally managed infra in CAPA via the ExternalInfraCluster CRD kubernetes-sigs/cluster-api-provider-aws#2124 and we used that commit of CAPA in hypershift.

Later on we decided to revert that approach upstream and reuse the existing ${infra}Cluster CRDs with an annotation to support externally managed infrastructure kubernetes-sigs/cluster-api#4135

This PR bring latest CAPI/CAPA.

As a follow up we need to rebuild the images and consume them from quay.io/hypershift.
enxebre added a commit to enxebre/hypershift that referenced this pull request Jun 2, 2021
This PR drops the ExternalInfraCluster CRD in favour of AWSCluster.

Originally we added support for externally managed infra in CAPA via the ExternalInfraCluster CRD kubernetes-sigs/cluster-api-provider-aws#2124 and we used that commit of CAPA in hypershift.

Later on we decided to revert that approach upstream and reuse the existing ${infra}Cluster CRDs with an annotation to support externally managed infrastructure kubernetes-sigs/cluster-api#4135

This PR bring latest CAPI/CAPA.

As a follow up we need to rebuild the images and consume them from quay.io/hypershift.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/proposal Issues or PRs related to proposals. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Decouple Infrastructure resource from infra providers machine controller