Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for CNI 1.0.0 config #96

Closed
wants to merge 3 commits into from

Conversation

TBBle
Copy link
Contributor

@TBBle TBBle commented Oct 21, 2023

Trivially upgraded github.com/containernetworking/cni to v1.1.2 and expanded the list of known versions to all the ones it supports for auto-conversion.

Specifically, the internal CNI representation is still using CNI spec 0.4.0, changing that would have required deeper changes. As long as any future CNI library updates can still convert to/from 0.4.0, we should be able to trivially upgrade CNI support using the CNI library auto-conversion. It might make sense to move to 1.0.0 internally, but that isn't necessary now.

Tested with the NAT plugin and the following CNI config, mostly generated by nerdctl 1.6.2. (I had to add the DNS config myself, and the gateway and subnet generated by nerdctl were incorrect for the existing NAT network created by (I think) Docker; both are things I'm chasing up elsewhere)

{
  "cniVersion": "1.0.0",
  "name": "nat",
  "nerdctlID": "d919a100ce6b45524d415d52d088d5817587c6dd8c3691b03b8063c44d043523",
  "nerdctlLabels": {
    "nerdctl/default-network": "true"
  },
  "plugins": [
    {
      "type": "nat",
      "master": "Ethernet",
      "ipam": {
        "routes": [
          {
            "gateway": "172.30.224.1"
          }
        ],
        "subnet": "172.30.224.0/20",
        "type": ""
      },
      "dns": {
        "nameservers": [ "103.167.44.10" ]
      }
    }
  ]
}

@TBBle
Copy link
Contributor Author

TBBle commented Oct 22, 2023

Oops. I just noticed #90 which did update the internal structure to a later CNI version (and it looks like it was a pretty trivial change), but did not enable newer CNI versions to be parsed/output.

So I guess the best of both worlds would be to combine them both?

@TBBle
Copy link
Contributor Author

TBBle commented Oct 24, 2023

Weird. The two test suites failed on different tests, TestBridgeCmdAdd on 2019, and TestOverlayCmdAdd on 2022. The latter panicked on failure, looks like the test tried to continue with a nil pointer. I haven't looked at the tests, maybe I should. I might even be able to add some tests to exercise the newly-supported CNI config versions, perhaps.

I also see the tests dumping byte arrays in "End Unit Test for case" which https://www.browserling.com/tools/ascii-to-text demonstrates are actually ASCII or UTF-8 JSON...

@iankingori
Copy link
Contributor

Thanks @TBBle for the PR, recent commit has restored CI to match state on master. Appears we oscillate between failure on 2019 and failure on 2022. I can't repro the test failure on a local Server 2019, we should perhaps consider hosting our own runner and see whether that improves CI stability. cc: @debj1t

@aznashwan please help review and see whether we can merge this then port your fixes over.

@aznashwan
Copy link
Contributor

LGTM FWIW, thank you!

and the gateway and subnet generated by nerdctl were incorrect for the existing NAT network created by (I think) Docker

There is indeed currently no logic in this repo which compares the network config provided to the plugins with the actual properties of the HCN network if it is pre-created. (it simply looks the network up by name and trusts it matches)

IMO this should be fixed and the plugins should error on config mismatches with pre-existing networks, but that will be a separate patch for another day.

I'll be following up with a patch which'll add a workflow to run containerd+cri integration tests on top of both the NAT and Bridge/Overlay plugins, so I'd suggest we merge this now @debj1t

@TBBle
Copy link
Contributor Author

TBBle commented Nov 2, 2023

IMO this should be fixed and the plugins should error on config mismatches with pre-existing networks, but that will be a separate patch for another day.

Makes sense. I was also planning to poke at nerdctl and see if I can work out why it created a faulty CNI config; I got rabbitholed trying to make sense of the duality of HCN/HNS and forgot about that. (I _suspect without actually checking that nerdctl just output a set of hard-coded details assuming the network doesn't already exist, rather than asking HCN for details of an existing network.)

An alternative to this (which I thought I mentioned elsewhere, I think there's a discussion about CNI network name/type behaviour in another ticket) would be to actually complete the TODO in this repo and remove network-creation support, relying on the host to already have the desired network created. That would actually remove (I think...) the need to have multiple CNI network types in this codebase at all, since they all trust HCN over the user config anyway.

@aznashwan
Copy link
Contributor

I _suspect without actually checking that nerdctl just output a set of hard-coded details

You are correct, nerdctl will simply use some hardcoded defaults for the subnet unless the user explicitly passes in a different one.

assuming the network doesn't already exists

Also correct, nerdctl does zero checking/creation of OS-side resources when the user creates a network (on Windows and Linux too afaik), all a nerdctl network create does is write the CNI config JSON and call it a day.

An alternative to this (which I thought I mentioned elsewhere, I think there's a discussion about CNI network name/type behaviour in another ticket) would be to actually complete the TODO in this repo and remove network-creation support

IMHO, it should be expected, and even downright warranted of the CNI plugins to create underlying OS resources as needed (the trivial reference Linux CNI plugins do this, at least), so from my PoV the real question is what the WinCNI plugins should do in case of the HNS networks pre-existing with different properties, for which I have opened the following issue for further discussion: #98

I think there's a discussion about CNI network name/type behaviour in another ticket

I hope to contribute a patch which should alleviate the BinaryName != HcnType issue (#57) which would hopefully not break any of the usecases these plugins are already used for.

An honest thank you for all your help in all of this, @TBBle!

@debj1t looks like the updated test suite has passed (at least for 0.2.0 configs, which are the only ones we currently test) just fine on this PR so please consider merging this!

@mdebjit mdebjit requested a review from MikeZappa87 November 3, 2023 04:19
@MikeZappa87
Copy link
Contributor

  1. What is the work required to move to 1.0.0?
  2. Is the missing type value under ipam apart of the nerdctl cni configuration generation?

@TBBle
Copy link
Contributor Author

TBBle commented Nov 3, 2023

To move to using the 1.0.0 structures internally? Replace the use of "github.com/containernetworking/cni/pkg/types/040" with "github.com/containernetworking/cni/pkg/types/100" and adapt to any differences in the structures in code in this repo that interacts with them, as far as I know.

And yeah, nerdctl's code on Windows starts with a blank IPAM structure, and never populates the type field. The comparable code on Unix has different methods for different IPAM types which prepopulate the type field.

I suspect the Windows-side nerdctl CNI generation implementation has been bitrotting a little, since it still references CNI 0.2.0, while the Unix side references 1.0.1 and 1.1.0. (Edit: That reference is to 0.2.0 in this repo, not CNI. The unix side is references the containernetworking/plugins repo, which are AFAIK not kept in-sync with CNI versions. I still think it's bitrotted a little, but both platform's references are to older-than-latest releases.)

Edit: I also note that the current example nat.conf also does not specify an ipam type, and in fact excludes the whole field. Technically this example is invalid because IpamConfig's Type field is not marked omitempty, but the Go JSON deserialiser is not as strict as to reject this case. If you use Go JSON to deserialise and reserialise that example config, you'll get the same as nerdctl produces. (Because that's what it does...)

@aznashwan
Copy link
Contributor

Had already half-baked the reply to question 1 so posting it for posterity anyway:

What is the work required to move to 1.0.0?

If you meant moving the internal structures used by the plugins to 1.0.0, it's a matter of switching the containernetworking/cni/pkg/types/040 deps to types/100, which seemed as trivial as dropping some fields in my other closed PR, though I did not get to thoroughly test those changes since the testing framework was pretty faulty.

I suspect the Windows-side nerdctl CNI generation implementation has been bitrotting a little

Indeed, the nerdctl network create code is a whole other can of worms, but it should be a completely separate issue from whether the plugins can handle 1.0.0 configs or not, and should hopefully not imply express updates to the plugins to facilitate it further.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
The type is internally a byte array, which is not directly readable in
the logs as its rendered as individual byte values. Per the CNI spec
it's JSON, so almost certainly UTF-8.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
This closes a path to a nil-dereference panic when test setup fails.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
@TBBle TBBle force-pushed the cni_1.0.0-support branch from d42d304 to db9cf71 Compare November 13, 2023 11:05
// Supported CNI versions.
var VersionsSupported = []string{"0.2.0", "0.3.0"}
// Oldest Supported CNI versions
var OldestVersionSupported = "0.2.0"
Copy link
Contributor

@iankingori iankingori Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TBBle could we revert this to explicitly indicating supported versions? The current release versioning seems to correspond with greatest supported version. We can pull this change into a separate PR and discuss it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, you're requesting I open-code this list to the current return value of cniVers.VersionsStartingFrom(OldestVersionSupported)?

Copy link
Contributor

@iankingori iankingori Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the idea but seems counter productive. @debj1t would we be okay with using the cni import versioning to limit the max supported version? looking at the releases it doesn't seem to be revved that frequently

Copy link
Contributor Author

@TBBle TBBle Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I was poking around the CNI spec a little bit, and it's possible this change is faulty, as if I'm skimming it correctly, by declaring we support config files 0.4.0 or newer, we are also opening ourselves up to having CHECK called for such config files, but we currently pass nil as the check callback. A quick look at the CNI skeleton code doesn't show any checks for nil in the relevant code, so this would panic if we hit it. (CHECK was introduced in CNI spec 0.4.0 and it also changed the DEL API; 0.4.0 to 1.0.0 looks like it only changed config and structures, according to the "major changes" list)

A quick look at BuildKit (my test case) doesn't show any calls to the go-cni Check API which wraps this, so I haven't exercised this change in my limited testing. (And even briefer look in containerd's CRI implementation also shows no calls to Check and no documented use-case)

I am not totally on top of how CNI works (and particularly the interactions of library version, spec version, skeleton version, and config version), so when I was building this PR I was mostly looking at the config-file parser, but perhaps we need to (for purposes of this PR) keep the existing VersionsSupported array, and push the actual supported versions change out to require an implementation of CHECK as well. (And probably a more-careful check of changes in the spec)

If I'm right and the version bump was incorrect, my understanding is we can still use the latest library (we were already on library 0.8.1 which appears to have supported spec 0.4.0), so we only need to revert our version declaration to avoid the behaviour change. We can also separately go ahead and upgrade the internal structures to types/100 or later in the meantime, as again, we are already using version types/040 with spec 0.3.0 support. (Again, might be worth validating in the spec that the meaning of those fields hasn't changed, in case something changed that isn't caught by simple Go compiler type-checking.)

Edit: I consider this a bug in the CNI Skeleton code, it probably should have either rejected the nil function, or rejected attempts to declare spec 0.4.0 or higher as supported. I haven't checked, it might be that its code layout doesn't support this, i.e. potentially no part of the skeleton code except the CHECK call itself has all of these details. That code should probably fail rather than panicking, at least.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHECK is not implemented in containerd or cri-o. I added the check code in go-cni however discontinued it as it involves a significant amount of work to the runtime/cri-api/kubelet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe just a function that returns a "not implemented" error would be sufficient to avoid the panic, in case that rework is done in the future, or another CNI user appears.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the cni plugins callback function returns nil which emulates success? not implemented would allow us to catch unknown or future scenarios earlier though. @MikeZappa87 are we ok with adding a no-op check command callback + an issue to track the work to implement it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can proceed with returning "not implemented" for CHECK, and adding both 0.4.0 and 1.0.0 as the supported versions in the array. We can track the remaining work with the below issues based on requirement to support them (doesn't seem to be widely used at the moment):

  1. Implement CHECK; update config examples to include disableCheck field (0.4.0)
  2. Remove non-list configurations; remove version field of interfaces array (1.0.0)

@Lin-Huang-Wei
Copy link

nerdctlID

Hi @TBBle
I wonder how did you got the nerdctlID in Powershell ?

@aznashwan
Copy link
Contributor

@Lin-Huang-Wei note that the nerdctlID field is only set by nerdctl network create when it is the one creating the network, but there's nothing stopping you from defining CNI networks yourself, in which case they will not have an ID.

Example containerd setup script creating ID-less CNI network:
/~https://github.com/containerd/containerd/blob/main/script/setup/install-cni-windows#L78-L97

@donno
Copy link

donno commented Mar 15, 2024

Thanks @TBBle, I was able to use this request/branch to get networking working with containerd.

I have someone else has since submitted a pull request, #101 which has merged. I'm yet to try building that and checking that but perhaps this require is due for being closed.

@TBBle
Copy link
Contributor Author

TBBle commented Mar 15, 2024

Yeah, that makes sense, #101 appears to be the union of these changes, #90, and the resolution of the missing CHECK support discussion in this PR.

So I'll close this as redundant.

@TBBle TBBle closed this Mar 15, 2024
@TBBle TBBle deleted the cni_1.0.0-support branch March 15, 2024 11:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants