Podman netavark/aardvark-dns in container fails e2e tests #13533

Closed
cevich opened this issue Mar 16, 2022 · 34 comments
Labels: kind/bug, locked - please file new issue/PR

Comments

@cevich
Member

cevich commented Mar 16, 2022

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

When executing the podman e2e tests inside an F36 super-privileged container, there are several failures, all of which report name-resolution errors.

  • Podman run networking [It] podman Netavark network works across user ns
  • Podman run networking [It] podman run check dnsname plugin with Netavark
  • Podman network [It] podman Netavark network with multiple aliases

Steps to reproduce the issue:

  1. From a F36 host, install the latest netavark & aardvark-dns packages
  2. Build and install podman from source in /var/tmp/go/src/github.com/containers/podman
  3. Create the directory /var/tmp/tmp.KkVmGQxqxK
  4. As root, run podman run --rm --privileged --net=host --cgroupns=host -v /var/tmp/tmp.KkVmGQxqxK:/tmp:Z -v /dev/fuse:/dev/fuse -v /var/tmp/go:/var/tmp/go:Z --workdir /var/tmp/go/src/github.com/containers/podman -e CONTAINER=1 '-e NETWORK_BACKEND=netavark' '-e GOSRC=/var/tmp/go/src/github.com/containers/podman' '-e CIRRUS_USER_PERMISSION=admin' '-e CIRRUS_LAST_GREEN_BUILD_ID=5325573969936384' '-e CTR_FQIN=quay.io/libpod/fedora_podman:c5140433071243264' '-e CIRRUS_DEFAULT_BRANCH=main' '-e FEDORA_CONTAINER_FQIN=quay.io/libpod/fedora_podman:c5140433071243264' '-e PRIOR_FEDORA_CACHE_IMAGE_NAME=prior-fedora-c5140433071243264' '-e CIRRUS_REPO_ID=6707778565701632' '-e TEST_ENVIRON=container' '-e VM_IMAGE_NAME=fedora-c5140433071243264' '-e CIRRUS_REPO_FULL_NAME=containers/podman' '-e CIRRUS_REPO_OWNER=containers' '-e OCI_RUNTIME=crun' '-e SCRIPT_BASE=./contrib/cirrus' '-e GOCACHE=/var/tmp/go/cache' '-e FEDORA_CACHE_IMAGE_NAME=fedora-c5140433071243264' '-e CIRRUS_USER_COLLABORATOR=true' '-e CI_NODE_TOTAL=55' '-e CIRRUS_SHELL=/bin/bash' '-e CIRRUS_PR=13376' '-e CIRRUS_BASE_BRANCH=main' '-e FEDORA_NAME=fedora-36' '-e CIRRUS_REPO_CLONE_HOST=github.com' '-e CIRRUS_CI=true' '-e PODBIN_NAME=podman' '-e CIRRUS_BUILD_ID=5182796640550912' '-e DISTRO_NV=fedora-36' '-e GOPATH=/var/tmp/go' '-e TEST_FLAVOR=int' '-e CIRRUS_BASE_SHA=32fd5d885a648c65158a2127332f9b0a0f2d6fa0' '-e CIRRUS_REPO_NAME=podman' '-e CIRRUS_REPO_CLONE_URL=https://github.com/containers/podman.git' '-e CIRRUS_PR_DRAFT=true' '-e PRIV_NAME=root' '-e CI_NODE_INDEX=29' '-e CIRRUS_TASK_ID=5853207413915648' '-e CIRRUS_ENV=/tmp/cirrus-env-task-5853207413915648-e1fbbc6e-4064-4d94-b383-3f4f2bd2503d' '-e CIRRUS_CHANGE_IN_REPO=9bef86b31e8873249bbd250cba4141f03f83115d' '-e CIRRUS_ARCH=amd64' '-e CIRRUS_WORKING_DIR=/var/tmp/go/src/github.com/containers/podman' '-e CI=true' '-e CIRRUS_OS=linux' '-e CIRRUS_LAST_GREEN_CHANGE=a5e327941423983529b771a03691dc2fe2390e0f' '-e CIRRUS_BRANCH=pull/13376' '-e CIRRUS_TASK_NAME=int\ podman\ fedora-36\ root\ container' quay.io/libpod/fedora_podman:c5140433071243264 bash -c './contrib/cirrus/setup_environment.sh && ./contrib/cirrus/runner.sh'

Describe the results you received:

Annotated results: https://storage.googleapis.com/cirrus-ci-6707778565701632-fcae48/artifacts/containers/podman/5853207413915648/html/int-podman-fedora-36-root-container.log.html

Describe the results you expected:

All tests pass

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Build from source

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.24.2
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.0-2.fc36.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.0, commit: '
  cpus: 2
  distribution:
    distribution: fedora
    variant: cloud
    version: "36"
  eventLogger: journald
  hostname: cirrus-task-5853207413915648
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.17.0-0.rc7.116.fc36.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 1787170816
  memTotal: 4109574144
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.4.3-1.fc36.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.4.3
      commit: 61c9600d1335127eba65632731e2d72bc3f0b9e8
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-0.2.beta.0.fc36.x86_64
    version: |-
      slirp4netns version 1.2.0-beta.0
      commit: 477db14a24ff1a3de3a705e51ca2c4c1fe3dda64
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 4092850176
  swapTotal: 4109365248
  uptime: 35m 38.14s
plugins:
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  docker.io:
    Blocked: false
    Insecure: false
    Location: mirror.gcr.io
    MirrorByDigestOnly: false
    Mirrors: null
    Prefix: docker.io
  docker.io/library:
    Blocked: false
    Insecure: false
    Location: quay.io/libpod
    MirrorByDigestOnly: false
    Mirrors: null
    Prefix: docker.io/library
  localhost:5000:
    Blocked: false
    Insecure: true
    Location: localhost:5000
    MirrorByDigestOnly: false
    Mirrors: null
    Prefix: localhost:5000
  search:
  - docker.io
  - quay.io
  - registry.fedoraproject.org
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 1
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.0.0-dev
  Built: 1646947243
  BuiltTime: Thu Mar 10 15:20:43 2022
  GitCommit: 9bef86b31e8873249bbd250cba4141f03f83115d
  GoVersion: go1.18beta2
  OsArch: linux/amd64
  Version: 4.0.0-dev

Package info (e.g. output of rpm -q podman or apt list podman):

Fedora release 36 (Thirty Six)
Kernel:  5.17.0-0.rc7.116.fc36.x86_64
Cgroups:  cgroup2fs
conmon-2.1.0-2.fc36-x86_64
containers-common-1-53.fc36-noarch
container-selinux-2.180.0-1.fc36-noarch
criu-3.16.1-7.fc36-x86_64
crun-1.4.3-1.fc36-x86_64
golang-1.18~beta2-1.fc36-x86_64
libseccomp-2.5.3-2.fc36-x86_64
netavark-1.0.1-1.fc36-x86_64
package aardvark is not installed
package containernetworking-plugins is not installed
podman-4.0.2-1.fc36-x86_64
runc-1.1.0-2.fc36-x86_64
skopeo-1.5.2-2.fc36-x86_64
slirp4netns-1.2.0-0.2.beta.0.fc36-x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

Ref.: #13376

@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 16, 2022
@flouthoc
Collaborator

@cevich I played around with this and was able to figure out the cause: the problem is --net=host. It is almost certain that DNS does not work with --net=host, since containers are expected to use the host's resolv.conf. I think it can work with --net=host as well, but a lot of manual plumbing is needed.

  • Neither aardvark-dns nor dnsname/dnsmasq works when --net=host is used and podman itself runs inside a container, so this is expected.

My suggestion is to never test dns with --net=host unless --dns is set.

I tried podman-inside-podman with netavark/aardvark without --net=host and everything works fine.

Here are my steps and a working example.

[root@fedora ~]# podman --network-backend netavark run --privileged -v /dev/fuse:/dev/fuse:Z -v /var/tmp/tmp.KkVmGQxqxK:/tmp:Z -it fedora:37 bash
[root@31384c95607e /]# dnf install podman fuse-overlayfs -y
# populate /etc/containers/storage.conf so it uses fuse-overlayfs
# populate /etc/containers/containers.conf with needed config
[root@31384c95607e /]# podman network create test
[root@31384c95607e /]# podman run --network test --name hello nicolaka/netshoot:latest dig hello

; <<>> DiG 9.16.22 <<>> hello
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56771
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;hello.				IN	A

;; ANSWER SECTION:
hello.			0	IN	A	10.89.0.3

;; Query time: 0 msec
;; SERVER: 10.89.0.1#53(10.89.0.1)
;; WHEN: Mon Mar 21 12:29:36 UTC 2022
;; MSG SIZE  rcvd: 50

TLDR

  • Avoid DNS-related tests in a --net=host, podman-inside-container environment unless --dns is used. Such tests are expected to fail for both aardvark-dns/netavark and dnsname/dnsmasq.

@Luap99
Member

Luap99 commented Mar 21, 2022

@cevich I played around with this and was able to figure out the cause: the problem is --net=host. It is almost certain that DNS does not work with --net=host, since containers are expected to use the host's resolv.conf. I think it can work with --net=host as well, but a lot of manual plumbing is needed.

I don't understand this. The container should still have a valid resolv.conf. Why would running aardvark fail in this case?

@Luap99
Member

Luap99 commented Mar 21, 2022

@cevich btw your package list says package aardvark is not installed. Is aardvark installed on the system?

@flouthoc
Collaborator

@Luap99 dnsname also fails in this case. I verified both with aardvark and dnsname.

@flouthoc
Collaborator

@Luap99 It's mainly because the second container contains the resolv.conf from the host. You can try a reproducer with

podman run -it --net=host --cgroupns=host --privileged quay.io/podman/stable bash
# inside container
podman network create test
# run any container with --network test

@Luap99
Member

Luap99 commented Mar 21, 2022

The second container contains the correct entry for me:

podman run --net test alpine cat /etc/resolv.conf 
search dns.podman
nameserver 10.89.0.1

@flouthoc
Collaborator

@Luap99 Does DNS work for you in the second container?

@flouthoc
Collaborator

@Luap99 Try podman run --rm --network test --name hello nicolaka/netshoot:latest dig hello

@flouthoc
Collaborator

It should not work since dnsname/dnsmasq or aardvark basically gets spawned in hostns.

@Luap99
Member

Luap99 commented Mar 21, 2022

Yes, DNS does not work, but this is not related to /etc/resolv.conf. The hostns would be correct since we started the parent container with --net=host.

The problem is that I cannot ping/curl from within the container to the outside of the system, so the general network connectivity is broken.

@flouthoc
Collaborator

@Luap99 Yes, no address in hostns is accessible. In my reproducer all host addresses are unreachable and resolv.conf is rendered incorrectly in my second container, but that could be due to my custom config when spawning the second container.

But anyway, after playing around, I think that with both dnsname and aardvark-dns, podman in a container inside docker in a similar scenario also ends up with broken DNS, and I guess the same is expected to happen in CI.

I'll close this issue once we have a concrete statement about why this is not supposed to work. It also looks worth documenting in the docs.

@Luap99
Member

Luap99 commented Mar 21, 2022

This should work!

I disabled firewalld and it started working with dnsname. I have not yet tested with aardvark.

@flouthoc
Collaborator

I think it can work with --net=host as well, but a lot of manual plumbing is needed.

@Luap99 Yes, as I mentioned, it works after plumbing. It also works if you pass --dns pointing at a DNS server running on the host. But is it expected to work without any plumbing?

@Luap99
Member

Luap99 commented Mar 21, 2022

Yes it should work out of the box, I need to investigate why I had to disable firewalld.
I also tried it with aardvark and it works as well.

What are the contents of the host's and the containers' /etc/resolv.conf?
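
To answer this kind of question quickly, one could print just the nameserver entries seen at each layer. This is a minimal sketch: the function name `list_nameservers` is hypothetical and not part of podman or netavark.

```shell
# Print the nameserver addresses from a resolv.conf-style file,
# one per line. Defaults to /etc/resolv.conf when no path is given.
# (Hypothetical helper, for illustration only.)
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

Running `list_nameservers` on the host and in the outer container, and piping `podman run --rm --net test alpine cat /etc/resolv.conf` into `list_nameservers /dev/stdin` for the nested container, would show whether the nested container points at the network's own DNS address (10.89.0.1 in the output earlier in this thread) or at a resolver inherited from the host.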

@flouthoc
Collaborator

@Luap99 It works for aardvark-dns as well if you disable firewalld in the reproducer setup, #13533 (comment)

@flouthoc
Collaborator

In the first container it is carried over from the host, but it is fine in the second container if I do nothing. @Luap99 Did you try docker yet?

@Luap99
Member

Luap99 commented Mar 21, 2022

I don't understand what any of this has to do with docker?

I just used the quay.io/libpod/fedora_podman:c5140433071243264 image from the command and the problem is that we use systemd-run to start aardvark but there is no systemd running inside the container thus the command is failing.

So we have to fix this in netavark to make sure systemd-run is only used when systemd is running.

@cevich Can you add rm /usr/bin/systemd-run to the setup logic and try again, would be good to have this confirmed.

@flouthoc
Collaborator

@Luap99 I meant a podman-inside-docker setup, where the first container is created by docker. There are similar issues there with dnsname :\ but let's discuss that later in a separate thread to avoid confusion.

I just used the quay.io/libpod/fedora_podman:c5140433071243264 image from the command and the problem is that we use systemd-run to start aardvark but there is no systemd running inside the container thus the command is failing.

I think systemd-run is only invoked if the systemd-run binary is found (https://github.com/containers/netavark/blob/main/src/dns/aardvark.rs#L81); otherwise it calls the binary directly.

Irrespective of systemd-run I am still not sure about the rationale for

why I had to disable firewalld.

@cevich
Member Author

cevich commented Mar 21, 2022

Re: --net=host and --dns

"Because it was always done this way" 😁 Seriously, I'm absolutely not against changing the setup to suit our needs. Even in a super-privileged container, I'm sure there's a limit to how much the container can affect changes on the host, so some test skipping is also okay.

Otherwise, please don't wait on me to test changes. You're able to go directly hands-on to mess with the setup however you like. Just clone my #13376 and use hack/get_ci_vm.sh int podman fedora-36 root container to your heart's content. It drops you into a shell right after setup_environment.sh runs, but before runner.sh is called.

@flouthoc
Collaborator

I'm sure there's a limit to how much the container can affect changes on the host, so some test skipping is also okay.

@cevich Yes, I think if we are not able to make it work in the most generic way, then I see no harm in skipping, and we can have a separate task for a few tests. So this should not be a blocker.

But I think we can keep this issue open till we are sure about this.

@cevich
Member Author

cevich commented Mar 22, 2022

we can have a separate task for a few tests

In many cases we should already have separate (non-containerized) tests, so only skipping the incompatible ones is necessary.

But I think we can keep this issue open till we are sure about this.

Yep, and if you think we need to do something like add --dns or other arguments to the podman call, I'm not opposed to that either.

@Luap99
Member

Luap99 commented Mar 22, 2022

This is clearly a bug in netavark, we just should not use systemd-run when systemd is not running. Just checking if systemd-run executable is in $PATH is not good enough.
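
The distinction can be sketched in shell. This is an illustration under stated assumptions, not netavark's actual code: the function names `systemd_is_running` and `launch_mode` are hypothetical, and the check relies on the `/run/systemd/system` directory that sd_booted(3) documents as the way to detect a running systemd.

```shell
# Succeeds when systemd is the running init. systemd creates
# /run/systemd/system at boot, the check documented in sd_booted(3).
# The optional argument only exists so the check can be pointed at a
# scratch directory for testing.
systemd_is_running() {
    [ -d "${1:-/run/systemd/system}" ]
}

# Hypothetical helper: print "systemd-run" only when the binary is in
# $PATH AND systemd is actually running; otherwise print "direct",
# meaning the DNS server would be started as a plain child process.
launch_mode() {
    if command -v systemd-run >/dev/null 2>&1 && systemd_is_running "$1"; then
        echo "systemd-run"
    else
        echo "direct"
    fi
}
```

In a container image that ships the systemd package but does not boot it, `command -v systemd-run` succeeds while the directory is absent, so `launch_mode` prints `direct`. A `$PATH`-only check would pick `systemd-run` there and fail.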

@cevich
Member Author

cevich commented Mar 22, 2022

Newer containerized F36 testing results (annotated log) - Should these tests be skipped also?

@Luap99
Member

Luap99 commented Mar 22, 2022

Well, they should work too, but this is the same problem as in the other tests. Feel free to skip them until we fix it, or add the rm /usr/bin/systemd-run workaround.

@Luap99
Member

Luap99 commented Mar 24, 2022

@cevich The fix for this is in netavark v1.0.2.

@flouthoc
Collaborator

@Luap99 Does it also work without disabling firewalld on the host, or do we still have to do that in CI?

@flouthoc
Collaborator

I haven't tried the new patch yet, but I can give it a try.

@Luap99
Member

Luap99 commented Mar 24, 2022

CI runs without firewalld AFAIK. I have not looked into why it is failing with firewalld.

@cevich
Member Author

cevich commented Mar 24, 2022

I believe that's right, ya, we do not install firewalld explicitly.

@cevich
Member Author

cevich commented Mar 24, 2022

The fix for this is in netavark v1.0.2.

Thanks guys, I'll keep an eye out for that version number

@cevich
Member Author

cevich commented Apr 12, 2022

Okay, I'm now seeing 1.0.2-1 coming into new F36 VM images. I'll do a run of podman CI with these images, reverting my "Temporarily skip netavark/aardvark e2e tests" commit.
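
A guard along these lines could gate the re-enabled tests on the fixed version. It is only a sketch: `version_ge` is a hypothetical helper, and it assumes GNU `sort -V` is available, as it is on Fedora.

```shell
# version_ge A B: succeeds when version A is greater than or equal to
# version B, using the version ordering of GNU sort -V.
# (Hypothetical helper, for illustration only.)
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_ge "$(rpm -q --qf '%{VERSION}' netavark)" 1.0.2` would succeed only on images that already carry the fix.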

@cevich
Member Author

cevich commented Apr 12, 2022

I can confirm, with 1.0.2 the test-failures originally reported have been fixed. Though now we have new ones (yay!). I'll open a new issue for them.

@cevich cevich closed this as completed Apr 12, 2022
@cevich
Member Author

cevich commented Apr 20, 2022

New issue ref: #13931

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 20, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023