
Ingressroute status updates can cause update loop #854

Closed
dbason opened this issue Jan 22, 2019 · 17 comments

@dbason
Contributor

dbason commented Jan 22, 2019

What steps did you take and what happened:

Added non-root IngressRoutes, then added a root IngressRoute delegating to those. This caused relatively high CPU and memory usage (150% CPU and 2 GB RAM; the container is limited to 2 GB), which seems quite high for a single IngressRoute.

What did you expect to happen:
Envoy to continue as normal.

Anything else you would like to add:
After adding the IngressRoute with delegates, the envoy container constantly logs the following message:
[2019-01-22 02:20:05.816][1][info][upstream] source/server/lds_api.cc:80] lds: add/update listener 'ingress_https'

Environment:

  • Contour version: 0.8.1
  • Kubernetes version: (use kubectl version): Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", BuildDate:"2017-11-08T18:39:33Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version: Rancher 2.1.5
  • Cloud provider or hardware configuration: VMWare virtual servers
  • OS (e.g. from /etc/os-release): CentOS 7
@stevesloka
Member

That's interesting @dbason. How many services/endpoints do you have in the cluster?

In performance testing we've encountered high memory usage, but that was in the thousands-of-services range, due to some gRPC implementation details that should be resolved in Envoy soon (or may be already). Here, though, that listener update seems to be the culprit.

Could you share the IngressRoute YAML so we can take a look?

@dbason
Contributor Author

dbason commented Jan 23, 2019

@stevesloka here it is

apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: ir-root
  namespace: zero
spec:
  virtualhost:
    fqdn: root.domain.com
    tls:
      secretName: tls-secret
  routes:
    - match: /
      services:
        - name: s0
          port: 80
    - match: /api
      delegate:
        name: ir-s1
    - match: /api/file
      delegate:
        name: ir-s2
    - match: /api/download
      delegate:
        name: ir-s2
    - match: /api/v2/download
      delegate:
        name: ir-s2
---
apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: ir-s1
  namespace: zero
spec:
  routes:
    - match: /api
      services:
        - name: s1
          port: 80
---
apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: ir-s2
  namespace: zero
spec:
  routes:
    - match: /api/file
      services:
        - name: s2
          port: 80
    - match: /api/download
      services:
        - name: s2
          port: 80
    - match: /api/v2/download
      services:
        - name: s2
          port: 80

@davecheney
Contributor

Thanks for your report. I'm sorry you're having issues. Can you please provide the logs from the envoy and contour containers? Hopefully there will be some indication of what is happening.

@davecheney davecheney added kind/bug Categorizes issue or PR as related to a bug. blocked/needs-info Categorizes the issue or PR as blocked because there is insufficient information to advance it. labels Jan 23, 2019
@dbason
Contributor Author

dbason commented Jan 23, 2019

@davecheney the only thing I see in the envoy container after adding this is the log line in my original report, repeated every second or so. I've had to pull the routes out of my cluster as they were impacting our QA environment; however, I will try to get the logs from the contour container from when I added them.

@davecheney
Contributor

Thanks. That will help me understand whether the contour container is wedging, whether we're sending constant updates to envoy, or whether envoy is wedged on a single update.

@dbason
Contributor Author

dbason commented Jan 24, 2019

Contour.xlsx

Attached is an extract of the contour logs from around the time of the issue.

Edit: The test also seemed repeatable with the YAML I included. If it helps, I can set up a test k8s cluster and try to reproduce the issue there with the same services (I may not be able to set this up until tomorrow though).

@davecheney
Contributor

davecheney commented Jan 24, 2019 via email

@davecheney
Contributor

@dbason I think I might have reproduced the fault. Can you please confirm for me which contour tag you have deployed, e.g., v0.8.1, latest, or master. If you're using master, can you please downgrade to a released version and let me know if that fixes the problem.

@davecheney davecheney self-assigned this Jan 25, 2019
@davecheney
Contributor

I've confirmed this happens with 0.8.1 as well, which is currently the :latest tag.

The bug, I believe, is that contour updates the status on an ingressroute document, which causes the ingressroute update event to fire again, forming a loop.

The errant status is, I think:

status:
  currentStatus: invalid
  description: 'tcpforward: service default/kuard: not found'

In your example, if the services are all present, that should avoid the loop, I think.

Obviously I'm going to suppress the looping; this is unhelpful.
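
To illustrate the loop in isolation, here is a toy Go simulation (stand-in types and a bounded loop; not Contour's actual code): processing an update writes status back to the object, and the watch delivers that write as the next update.

package main

import "fmt"

// Toy stand-in for the real IngressRoute API object.
type ingressRoute struct {
    spec   string
    status string
}

func main() {
    updates := make(chan ingressRoute, 1)
    updates <- ingressRoute{spec: "root with delegates"} // the user's original change

    for i := 0; i < 5; i++ { // bounded here; in the cluster nothing stops it
        ir := <-updates
        fmt.Printf("onUpdate %d: processing %q\n", i, ir.spec)

        // Validation finds a problem and writes status back to the object...
        ir.status = "invalid: tcpforward: service default/kuard: not found"

        // ...and the API server's watch delivers the patched object as a new update.
        updates <- ir
    }
}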

@davecheney davecheney added this to the 0.9.0 milestone Jan 25, 2019
@davecheney davecheney added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed blocked/needs-info Categorizes the issue or PR as blocked because there is insufficient information to advance it. labels Jan 25, 2019
@dbason
Contributor Author

dbason commented Jan 25, 2019

@davecheney we're on 0.8.1 - did you want me to downgrade to 0.8.0 and try again?

@davecheney
Contributor

davecheney commented Jan 25, 2019 via email

@davecheney davecheney changed the title Adding an IngressRoute with delegate causes high mem and cpu in envoy Ingressroute status updates can cause update loop Jan 25, 2019
@davecheney
Contributor

I've updated the ticket to reflect what I think is happening. The problem is that contour can get into a loop where writing a status update to the IngressRoute object causes another contour to see the update, which causes it to process the ingressroute, find it invalid, and update the status again.

Normally the status update should be idempotent, but when there is more than one invalid part of the ingressroute spec, the patch operation succeeds, generating a new onUpdate message, and things start again.

I'm testing a patch that selectively filters out parts of the IngressRoute object during the onUpdate calculation. That should unblock 0.9, but a more permanent solution will be needed for 0.10.
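
A minimal, self-contained sketch of that kind of filtered comparison, using pared-down stand-in types rather than Contour's real API types: zero out the fields the status writer churns (status and metadata.resourceVersion) before comparing, so a status-only patch no longer triggers another rebuild.

package main

import (
    "fmt"
    "reflect"
)

// Simplified stand-ins for the real API types; only the fields relevant to the
// comparison are modelled here.
type Status struct {
    CurrentStatus string
    Description   string
}

type IngressRoute struct {
    ResourceVersion string
    Spec            string
    Status          Status
}

// changedIgnoringStatus reports whether anything other than status and
// resourceVersion differs between the old and new objects. If it returns
// false, the update handler can skip the rebuild, breaking the loop.
func changedIgnoringStatus(oldIR, newIR IngressRoute) bool {
    oldIR.Status, newIR.Status = Status{}, Status{}
    oldIR.ResourceVersion, newIR.ResourceVersion = "", ""
    return !reflect.DeepEqual(oldIR, newIR)
}

func main() {
    a := IngressRoute{ResourceVersion: "1", Spec: "routes: ...", Status: Status{CurrentStatus: "valid"}}
    b := a
    b.ResourceVersion = "2"
    b.Status = Status{CurrentStatus: "invalid", Description: "TLS Secret [missing] not found or is malformed"}
    fmt.Println(changedIgnoringStatus(a, b)) // false: status-only update, no rebuild
}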

davecheney added a commit to davecheney/contour that referenced this issue Jan 25, 2019
Updates projectcontour#854

Depending on the order in which parts of an ingressroute document are
processed the invalid status message generated may differ over runs.
Specifically two or more invalid components of an ingressroute will
patch status in a random order, defeating patch's "is this the same"
check and causing Contour to see the update to the ingressroute's
document, thus triggering a new OnUpdate event and the cycle starts
again.

For the interim, filter out fields which are known to change -- status
and metadata.resourceVersion -- during the OnUpdate comparison.

Signed-off-by: Dave Cheney <dave@cheney.net>
@davecheney davecheney modified the milestones: 0.9.0, 0.10.0 Jan 25, 2019
@davecheney davecheney added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Feb 19, 2019
@davecheney davecheney removed this from the 0.10.0 milestone Feb 19, 2019
@davecheney davecheney removed their assignment Feb 19, 2019
@davecheney davecheney added this to the 0.12.0 milestone Mar 12, 2019
@davecheney
Contributor

Moving to 0.12 because it's blocked on #881.

@davecheney davecheney removed this from the 0.12.0 milestone Apr 9, 2019
@davecheney davecheney added this to the 1.0.0.beta.1 milestone Jun 18, 2019
@davecheney
Contributor

I think I've found a cause of this (there might be more than one). If a certificate is missing then, depending on the order in which it is processed by the dag builder, the Ingressroute may flip between valid and invalid, and those status updates cause the dag to be regenerated.

We already filter out the status field when comparing Ingressroute objects, but this looping may be caused by the kubectl.kubernetes.io/last-applied-configuration annotation.

For those interested, this is relatively easy to recreate in a cluster with some Ingressroute records; after #1608, change internal/dag/secret.go to always reject certificates, even valid ones.
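
For illustration only, a hedged sketch of what that repro tweak could look like; the function name and signature are assumptions, not the actual contents of internal/dag/secret.go:

package dag

import (
    "errors"

    v1 "k8s.io/api/core/v1"
)

// isValidTLSSecret is a hypothetical stand-in for whatever validation helper
// internal/dag/secret.go uses. Forcing it to reject everything makes every
// IngressRoute that references a TLS secret flip to invalid, reproducing the loop.
func isValidTLSSecret(secret *v1.Secret) error {
    return errors.New("rejecting all certificates to reproduce #854")
}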

@davecheney davecheney modified the milestones: 1.0.0-rc.2, 1.0.0-rc.1 Oct 2, 2019
@davecheney davecheney self-assigned this Oct 2, 2019
@davecheney
Contributor

OK, it's likely not the last-applied-configuration metadata; the Ingressroute is just flipping from valid to invalid on each update:

% diff -u a b
--- a   2019-10-08 12:13:06.000000000 +1100
+++ b   2019-10-08 12:13:12.000000000 +1100
@@ -7,10 +7,10 @@
 es/default/ingressroutes/httpbin2","uid":"5b284a81-882a-11e9-a1b8-42010a980029"},"spec":{"routes":[{"match":"/","services":[{"name":"httpbin","port":8080,"strategy":"Cookie"},{"name":"httpbin","port":8080}]}],"tcpproxy":{"services":[{"name":"httpbin","port":8080}]},"virtualhost":{
 "fqdn":"httpbin2.davecheney.com","tls":{"secretName":"missing"}}},"status":{"currentStatus":"invalid","description":"TLS Secret [missing] not found or is malformed"}}
   creationTimestamp: "2019-08-30T01:28:56Z"
-  generation: 650
+  generation: 651
   name: httpbin2
   namespace: default
-  resourceVersion: "190179947"
+  resourceVersion: "190179948"
   selfLink: /apis/contour.heptio.com/v1beta1/namespaces/default/ingressroutes/httpbin2
   uid: 88a9e755-cac5-11e9-a775-42010a980092
 spec:
@@ -29,5 +29,6 @@
     tls:
       secretName: httpbin
 status:
-  currentStatus: valid
-  description: valid IngressRoute
+  currentStatus: invalid
+  description: TLS Secret [httpbin] not found or is malformed

@davecheney
Contributor

Digging into this a bit more, I believe the root cause is multiple contours trying to update status concurrently. Moving to a model where the status writer is the leader is the solution.
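
A minimal sketch of that model using client-go's leader election; the lease name, namespace, identity source, and runStatusWriter hook are placeholders for illustration, not Contour's actual implementation:

package main

import (
    "context"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // One lease shared by all contour replicas; only the current holder writes status.
    lock := &resourcelock.LeaseLock{
        LeaseMeta:  metav1.ObjectMeta{Name: "contour-status", Namespace: "heptio-contour"},
        Client:     client.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
    }

    leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
        Lock:            lock,
        LeaseDuration:   15 * time.Second,
        RenewDeadline:   10 * time.Second,
        RetryPeriod:     2 * time.Second,
        ReleaseOnCancel: true,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                // Only the elected replica runs the status writer; the other
                // replicas keep serving xDS but leave IngressRoute status alone.
                runStatusWriter(ctx)
            },
            OnStoppedLeading: func() {
                // Stop writing status as soon as leadership is lost.
            },
        },
    })
}

// runStatusWriter is a placeholder for the single status-writing loop.
func runStatusWriter(ctx context.Context) { <-ctx.Done() }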

@davecheney davecheney modified the milestones: 1.0.0-rc.1, 1.0.0-rc.2 Oct 8, 2019
@davecheney davecheney added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Oct 11, 2019
@davecheney
Contributor

I'm going to close this issue in favour of #1425. The underlying issue is multiple contours attempting to update status. The move to a single status writer driven by leader election is, I feel, the solution.
