[k8s] The k8s integration tests are failing #33520
Comments
I have been unable to reproduce the issues locally and reverting #33415 did not help (according to the CI jobs on main that was the first commit where things started to flake). Looking at the workflow it looks like all versions are pinned so I don't think we suddenly started using some new action, kind versions, etc. |
Pinging code owners for internal/k8stest: @crobert-1. See Adding Labels via Comments if you do not have permissions to add labels yourself. |
@jinja2 @fatsheep9146 any guesses? |
Pinging code owners for receiver/k8sobjects: @dmitryax @hvaghani221 @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself. |
Pinging code owners for processor/k8sattributes: @dmitryax @rmfitzpatrick @fatsheep9146 @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself. |
Pinging code owners for receiver/k8scluster: @dmitryax @TylerHelmuth @povilasv. See Adding Labels via Comments if you do not have permissions to add labels yourself. |
Pinging code owners for receiver/kubeletstats: @dmitryax @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself. |
I had reproduced the error locally yesterday (or at least something that looked the same), but had to switch focus before I could find the root cause. Now I can't reproduce it :( One thing I did notice was in the collector logs there were errors about not being able to connect to kind-control-plane. Perhaps the e2e workflow should capture the pod logs before tearing down, to make debugging easier. |
I also could not reproduce the error seen in the GitHub Actions runs, which is really weird. But your suggestion to capture the Pod logs (for both the collector and telemetrygen) in the workflow is a good one and would help debugging. @axw |
Not sure if there is another way to get access to the Pods' logs, but I tried something hacky to capture them: #33538. |
Got some interesting "connection refused" errors: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9497224255/job/26173693278?pr=33538#step:11:225
2024-06-13T09:44:56.953Z info exporterhelper/retry_sender.go:118 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "traces", "name": "otlp", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "7.546970563s"}
2024-06-13T09:44:57.064Z info exporterhelper/retry_sender.go:118 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "logs", "name": "otlp", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "7.612004411s"}
2024-06-13T09:44:57.486Z info exporterhelper/retry_sender.go:118 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "traces", "name": "otlp", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "6.403460654s"} |
@ChrsMark |
I suspect the cause is the change in https://github.com/actions/runner-images/pull/10039/files. |
Sounds possible @fatsheep9146, I will try to upgrade docker on my machine to Update: I was able to reproduce this locally with docker
|
@ChrsMark I'm trying to update the Docker SDK version to see if that fixes the problem. |
@fatsheep9146 thanks! FYI, while debugging this I spotted that
context deadline exceeded, but the weird thing is that this error is for some reason "muted".
Hopefully the lib upgrade can solve this. |
I had a successful run at #33548. I'm going to enable the rest of the tests and check again. |
Yes, I found that updating the Docker SDK library is blocked for some reason. So I am also trying another way to get the right host endpoint. I think we can try both approaches and get more opinions from others. |
I hit an additional error at potential fix: c87a639 |
I think this may be due to the newer version of kind |
@fatsheep9146 e2e tests passed at #33548. I'm opening that one for review since it offers a fix anyway. I'll be out tomorrow (Friday), so feel free to pick up the gateway check and proceed with yours if people find that approach more suitable. I'm fine either way as long as we solve the issue :). |
**Description:**

Only return address that is not empty for `kind` network. This started affecting the e2e tests possibly because of the `ubuntu-latest`'s docker version update that is mentioned at #33520 (comment).

Relates to #33520.

/cc @fatsheep9146

Sample `kind` network:

```console
curl --unix-socket /run/docker.sock http://docker/networks/kind | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   841  100   841    0     0   821k      0 --:--:-- --:--:-- --:--:--  821k
{
  "Name": "kind",
  "Id": "801d2abe204253cbd5d1d135f111a7fb386b830382bde79a699fb4f9aaf674b1",
  "Created": "2024-06-13T15:31:57.738509232+03:00",
  "Scope": "local",
  "Driver": "bridge",
  "EnableIPv6": true,
  "IPAM": {
    "Driver": "default",
    "Options": {},
    "Config": [
      {
        "Subnet": "fc00:f853:ccd:e793::/64"
      },
      {
        "Subnet": "172.18.0.0/16",
        "Gateway": "172.18.0.1"
      }
    ]
  },
  "Internal": false,
  "Attachable": false,
  "Ingress": false,
  "ConfigFrom": {
    "Network": ""
  },
  "ConfigOnly": false,
  "Containers": {
    "db113750635782bc1bfdf31e5f62af3c63f02a9c8844f7fe9ef045b5d9b76d12": {
      "Name": "kind-control-plane",
      "EndpointID": "8b15bb391109ca1ecfbb4bf7a96060b01e3913694d34e23d67eec22684f037bb",
      "MacAddress": "02:42:ac:12:00:02",
      "IPv4Address": "172.18.0.2/16",
      "IPv6Address": "fc00:f853:ccd:e793::2/64"
    }
  },
  "Options": {
    "com.docker.network.bridge.enable_ip_masquerade": "true",
    "com.docker.network.driver.mtu": "1500"
  },
  "Labels": {}
}
```

**Link to tracking Issue:** <Issue number if applicable>

**Testing:** <Describe what testing was performed and which tests were added.>

**Documentation:** <Describe the documentation added.>

---------

Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Resolved by #33548 |
Thanks for addressing and fixing so quickly @ChrsMark and @fatsheep9146! |
Component(s)
processor/k8sattributes, receiver/k8scluster, receiver/k8sobjects, receiver/kubeletstats
Describe the issue you're reporting
The k8s integration tests have started failing. See https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/workflows/e2e-tests.yml?query=branch%3Amain.