
Ingressroute status updates can cause update loop #854

Closed
dbason opened this issue Jan 22, 2019 · 17 comments

@dbason
Contributor

dbason commented Jan 22, 2019

What steps did you take and what happened:

Added non-root IngressRoutes, then added a root IngressRoute delegating to those. This caused relatively high CPU and memory usage (150% CPU and 2 GB RAM; the container is limited to 2 GB), which seems quite high for a single IngressRoute.

What did you expect to happen:
Envoy to continue as normal.

Anything else you would like to add:
After adding the IngressRoute with delegates, the envoy container constantly logs the following message:
[2019-01-22 02:20:05.816][1][info][upstream] source/server/lds_api.cc:80] lds: add/update listener 'ingress_https'

Environment:

  • Contour version: 0.8.1
  • Kubernetes version: (use kubectl version): Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", BuildDate:"2017-11-08T18:39:33Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version: Rancher 2.1.5
  • Cloud provider or hardware configuration: VMWare virtual servers
  • OS (e.g. from /etc/os-release): CentOS 7
@stevesloka
Member

That's interesting @dbason. How many services/endpoints do you have in the cluster?

In performance testing we've encountered high memory usage, but that was in the thousands-of-services range, due to some gRPC implementation details that should be resolved in Envoy soon (or may be already). Here, though, that listener update seems to be the culprit.

Could you share the IngressRoute YAML so we can take a look?

@dbason
Contributor Author

dbason commented Jan 23, 2019

@stevesloka here it is

apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: ir-root
  namespace: zero
spec:
  virtualhost:
    fqdn: root.domain.com
    tls:
      secretName: tls-secret
  routes:
    - match: /
      services:
        - name: s0
          port: 80
    - match: /api
      delegate:
        name: ir-s1
    - match: /api/file
      delegate:
        name: ir-s2
    - match: /api/download
      delegate:
        name: ir-s2
    - match: /api/v2/download
      delegate:
        name: ir-s2
---
apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: ir-s1
  namespace: zero
spec:
  routes:
    - match: /api
      services:
        - name: s1
          port: 80
---
apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: ir-s2
  namespace: zero
spec:
  routes:
    - match: /api/file
      services:
        - name: s2
          port: 80
    - match: /api/download
      services:
        - name: s2
          port: 80
    - match: /api/v2/download
      services:
        - name: s2
          port: 80

@davecheney
Contributor

Thanks for your report. I'm sorry you're having issues. Can you please provide the logs from the envoy and contour containers? Hopefully there will be some indication of what is happening.

@davecheney davecheney added kind/bug Categorizes issue or PR as related to a bug. blocked/needs-info Categorizes the issue or PR as blocked because there is insufficient information to advance it. labels Jan 23, 2019
@dbason
Contributor Author

dbason commented Jan 23, 2019

@davecheney the only thing I see in the envoy container after adding this is the log line in my original report, repeated every second or so. I've had to pull the routes out of my cluster as they were impacting our QA environment; however, I will try to get the logs from the contour container from when I added them.

@davecheney
Contributor

Thanks. That will help me understand whether the contour container is wedging, whether we're sending constant updates to envoy, or whether envoy is wedged on a single update.

@dbason
Contributor Author

dbason commented Jan 24, 2019

Contour.xlsx

Attached is an extract of the contour logs from around the time of the issue.

Edit: The test also seemed repeatable with the YAML I included. If it helps, I can set up a test k8s cluster and try to reproduce the issue there with the same services (I may not be able to set this up until tomorrow though).

@davecheney
Contributor

davecheney commented Jan 24, 2019 via email

@davecheney
Contributor

@dbason I think I might have reproduced the fault. Can you please confirm for me which contour tag you have deployed, e.g., v0.8.1, latest, or master. If you're using master, can you please downgrade to a released version and let me know if that fixes the problem.

@davecheney davecheney self-assigned this Jan 25, 2019
@davecheney
Contributor

I've confirmed this happens with 0.8.1 as well, which is currently the :latest tag.

The bug, I believe, is that contour updates the status on an ingressroute document, which causes the ingressroute update event to fire again, forming a loop.

The errant status is, I think:

status:
  currentStatus: invalid
  description: 'tcpforward: service default/kuard: not found'

In your example, if the services are all present, that should avoid the loop, I think.

Obviously I'm going to suppress the looping; this is unhelpful.
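
To illustrate the loop in isolation, here is a toy Go simulation (stand-in types and a bounded loop; not Contour's actual code): processing an update writes status back to the object, and the watch delivers that write as the next update.

package main

import "fmt"

// Toy stand-in for the real IngressRoute API object.
type ingressRoute struct {
    spec   string
    status string
}

func main() {
    updates := make(chan ingressRoute, 1)
    updates <- ingressRoute{spec: "root with delegates"} // the user's original change

    for i := 0; i < 5; i++ { // bounded here; in the cluster nothing stops it
        ir := <-updates
        fmt.Printf("onUpdate %d: processing %q\n", i, ir.spec)

        // Validation finds a problem and writes status back to the object...
        ir.status = "invalid: tcpforward: service default/kuard: not found"

        // ...and the API server's watch delivers the patched object as a new update.
        updates <- ir
    }
}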

@davecheney davecheney added this to the 0.9.0 milestone Jan 25, 2019
@davecheney davecheney added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed blocked/needs-info Categorizes the issue or PR as blocked because there is insufficient information to advance it. labels Jan 25, 2019
@dbason
Contributor Author

dbason commented Jan 25, 2019

@davecheney we're on 0.8.1 - did you want me to downgrade to 0.8.0 and try again?

@davecheney
Contributor

davecheney commented Jan 25, 2019 via email

@davecheney davecheney changed the title Adding an IngressRoute with delegate causes high mem and cpu in envoy Ingressroute status updates can cause update loop Jan 25, 2019
@davecheney
Contributor

I've updated the ticket to reflect what I think is happening. The problem is that contour can get into a loop where writing a status update to the IngressRoute object causes another contour to see the update, which causes it to process the ingressroute, find it invalid, and update the status again.

Normally the status update should be idempotent, but when there is more than one invalid part of the ingressroute spec, the patch operation succeeds, generating a new onUpdate message, and things start again.

I'm testing a patch that selectively filters out parts of the IngressRoute object during the onUpdate calculation. That should unblock 0.9, but a more permanent solution will be needed for 0.10.
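
A minimal, self-contained sketch of that kind of filtered comparison, using pared-down stand-in types rather than Contour's real API types: zero out the fields the status writer churns (status and metadata.resourceVersion) before comparing, so a status-only patch no longer triggers another rebuild.

package main

import (
    "fmt"
    "reflect"
)

// Simplified stand-ins for the real API types; only the fields relevant to the
// comparison are modelled here.
type Status struct {
    CurrentStatus string
    Description   string
}

type IngressRoute struct {
    ResourceVersion string
    Spec            string
    Status          Status
}

// changedIgnoringStatus reports whether anything other than status and
// resourceVersion differs between the old and new objects. If it returns
// false, the update handler can skip the rebuild, breaking the loop.
func changedIgnoringStatus(oldIR, newIR IngressRoute) bool {
    oldIR.Status, newIR.Status = Status{}, Status{}
    oldIR.ResourceVersion, newIR.ResourceVersion = "", ""
    return !reflect.DeepEqual(oldIR, newIR)
}

func main() {
    a := IngressRoute{ResourceVersion: "1", Spec: "routes: ...", Status: Status{CurrentStatus: "valid"}}
    b := a
    b.ResourceVersion = "2"
    b.Status = Status{CurrentStatus: "invalid", Description: "TLS Secret [missing] not found or is malformed"}
    fmt.Println(changedIgnoringStatus(a, b)) // false: status-only update, no rebuild
}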

davecheney added a commit to davecheney/contour that referenced this issue Jan 25, 2019
Updates projectcontour#854

Depending on the order in which parts of an ingressroute document are
processed the invalid status message generated may differ over runs.
Specifically two or more invalid components of an ingressroute will
patch status in a random order, defeating patch's "is this the same"
check and causing Contour to see the update to the ingressroute's
document, thus triggering a new OnUpdate event and the cycle starts
again.

For the interim, filter out fields which are known to change -- status
and metadata.resourceVersion -- during the OnUpdate comparison.

Signed-off-by: Dave Cheney <dave@cheney.net>
@davecheney davecheney modified the milestones: 0.9.0, 0.10.0 Jan 25, 2019
@davecheney davecheney added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Feb 19, 2019
@davecheney davecheney removed this from the 0.10.0 milestone Feb 19, 2019
@davecheney davecheney removed their assignment Feb 19, 2019
@davecheney davecheney added this to the 0.12.0 milestone Mar 12, 2019
@davecheney
Contributor

Moving to 0.12 because it's blocked on #881.

@davecheney davecheney removed this from the 0.12.0 milestone Apr 9, 2019
@davecheney davecheney added this to the 1.0.0.beta.1 milestone Jun 18, 2019
@davecheney
Contributor

I think I've found a cause of this (there might be more than one). If a certificate is missing then, depending on the order in which it is processed by the dag builder, the Ingressroute may flip between valid and invalid, and those status updates cause the dag to be regenerated.

We already filter out the status field when comparing Ingressroute objects, but this looping may be caused by the kubectl.kubernetes.io/last-applied-configuration annotation.

For those interested, this is relatively easy to recreate in a cluster with some Ingressroute records; after #1608, change internal/dag/secret.go to always reject certificates, even valid ones.
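
For illustration only, a hedged sketch of what that repro tweak could look like; the function name and signature are assumptions, not the actual contents of internal/dag/secret.go:

package dag

import (
    "errors"

    v1 "k8s.io/api/core/v1"
)

// isValidTLSSecret is a hypothetical stand-in for whatever validation helper
// internal/dag/secret.go uses. Forcing it to reject everything makes every
// IngressRoute that references a TLS secret flip to invalid, reproducing the loop.
func isValidTLSSecret(secret *v1.Secret) error {
    return errors.New("rejecting all certificates to reproduce #854")
}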

@davecheney davecheney modified the milestones: 1.0.0-rc.2, 1.0.0-rc.1 Oct 2, 2019
@davecheney davecheney self-assigned this Oct 2, 2019
@davecheney
Contributor

OK, it's likely not the last-applied-configuration metadata; the Ingressroute is just flipping from valid to invalid on each update:

% diff -u a b
--- a   2019-10-08 12:13:06.000000000 +1100
+++ b   2019-10-08 12:13:12.000000000 +1100
@@ -7,10 +7,10 @@
 es/default/ingressroutes/httpbin2","uid":"5b284a81-882a-11e9-a1b8-42010a980029"},"spec":{"routes":[{"match":"/","services":[{"name":"httpbin","port":8080,"strategy":"Cookie"},{"name":"httpbin","port":8080}]}],"tcpproxy":{"services":[{"name":"httpbin","port":8080}]},"virtualhost":{
 "fqdn":"httpbin2.davecheney.com","tls":{"secretName":"missing"}}},"status":{"currentStatus":"invalid","description":"TLS Secret [missing] not found or is malformed"}}
   creationTimestamp: "2019-08-30T01:28:56Z"
-  generation: 650
+  generation: 651
   name: httpbin2
   namespace: default
-  resourceVersion: "190179947"
+  resourceVersion: "190179948"
   selfLink: /apis/contour.heptio.com/v1beta1/namespaces/default/ingressroutes/httpbin2
   uid: 88a9e755-cac5-11e9-a775-42010a980092
 spec:
@@ -29,5 +29,6 @@
     tls:
       secretName: httpbin
 status:
-  currentStatus: valid
-  description: valid IngressRoute
+  currentStatus: invalid
+  description: TLS Secret [httpbin] not found or is malformed

@davecheney
Contributor

Digging into this a bit more, I believe the root cause is multiple contours trying to update status concurrently. Moving to a model where the status writer is the leader is the solution.
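
A minimal sketch of that model using client-go's leader election; the lease name, namespace, identity source, and runStatusWriter hook are placeholders for illustration, not Contour's actual implementation:

package main

import (
    "context"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // One lease shared by all contour replicas; only the current holder writes status.
    lock := &resourcelock.LeaseLock{
        LeaseMeta:  metav1.ObjectMeta{Name: "contour-status", Namespace: "heptio-contour"},
        Client:     client.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
    }

    leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
        Lock:            lock,
        LeaseDuration:   15 * time.Second,
        RenewDeadline:   10 * time.Second,
        RetryPeriod:     2 * time.Second,
        ReleaseOnCancel: true,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                // Only the elected replica runs the status writer; the other
                // replicas keep serving xDS but leave IngressRoute status alone.
                runStatusWriter(ctx)
            },
            OnStoppedLeading: func() {
                // Stop writing status as soon as leadership is lost.
            },
        },
    })
}

// runStatusWriter is a placeholder for the single status-writing loop.
func runStatusWriter(ctx context.Context) { <-ctx.Done() }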

@davecheney davecheney modified the milestones: 1.0.0-rc.1, 1.0.0-rc.2 Oct 8, 2019
@davecheney davecheney added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Oct 11, 2019
@davecheney
Contributor

I'm going to close this issue in favour of #1425. The underlying issue is multiple contours attempting to update status. The move to a single status writer driven by leader election is, I feel, the solution.
