Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

confirm-rollout and pre-rollout not working as expected #669

Closed
paritadhandha opened this issue Aug 13, 2020 · 4 comments · Fixed by #878
Closed

confirm-rollout and pre-rollout not working as expected #669

paritadhandha opened this issue Aug 13, 2020 · 4 comments · Fixed by #878

Comments

@paritadhandha
Copy link

Flagger version: 1.0.1/1.0.0
Kubernetes version: v1.15
Flagger-loadtester: 0.18.0
Flagger CRD version: Version: v1beta1
Config: canary with glooEnterprise (v1.4.6)

Issue:

Canary configuration:

spec:
  provider: gloo:gloo-system
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    port: 9898
    targetPort: 9898
    portDiscovery: true
  analysis:
    interval: 30s
    threshold: 3
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: failed-http-request-total
        thresholdRange:
          max: 5
        interval: 30s
    webhooks:
      - name: "ask for confirmation"
        type: confirm-rollout
        url: http://flagger-loadtester.test-canary/gate/check
      - name: "acceptance-test"
        type: pre-rollout
        url: http://flagger-loadtester.test-canary/
        timeout: 5s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
          logCmdOutput: "true"

  • In the confirm-rollout web hook, when we add gate/check, after the approval, the Canary logs show New revision detected! Restarting analysis for podinfo.test-canary. And it starts the analysis without launching the canary pods. This results in failing pre-rollout tests since there are no canary pods running. And the canary fails.

  • But if we don’t have any kind of halt on confirm-rollout web hook (open gate or gate/approve) then during the canary rollout we see New revision detected! Scaling up podinfo.test-canary which launches canary pods and later the tests in pre-rollout web hook pass.

Canary Events:

  Type     Reason  Age                From     Message
  ----     ------  ----               ----     -------
  Normal   Synced  51m                flagger  Confirm-rollout check ask for confirmation passed
  Normal   Synced  50m                flagger  New revision detected! Restarting analysis for podinfo.test-canary
  Warning  Synced  48m (x5 over 50m)  flagger  Halt podinfo.test-canary advancement pre-rollout check acceptance-test failed command curl -sd 'test' http://podinfo-canary:9898/token | grep token failed: : exit status 1
  Warning  Synced  47m                flagger  Rolling back podinfo.test-canary failed checks threshold reached 5
  Warning  Synced  47m                flagger  Canary failed! Scaling down podinfo.test-canary
  Normal   Synced  44m                flagger  New revision detected! Scaling up podinfo.test-canary
  Normal   Synced  44m                flagger  Pre-rollout check acceptance-test passed
  Normal   Synced  44m                flagger  Advance podinfo.test-canary canary weight 10
  Normal   Synced  43m                flagger  Advance podinfo.test-canary canary weight 20
  Normal   Synced  43m                flagger  Advance podinfo.test-canary canary weight 30
  Normal   Synced  42m                flagger  Advance podinfo.test-canary canary weight 40
  Warning  Synced  41m                flagger  podinfo-primary.test-canary not ready: waiting for rollout to finish: 3 of 4 updated replicas are available
  Normal   Synced  37m (x5 over 42m)  flagger  (combined from similar events): New revision detected! Scaling up podinfo.test-canary
  Normal   Synced  33m                flagger  Routing all traffic to primary
  Normal   Synced  28m (x8 over 50m)  flagger  Starting canary analysis for podinfo.test-canary
  Warning  Synced  21m                flagger  Halt podinfo.test-canary advancement waiting for approval ask for confirmation
  Normal   Synced  20m                flagger  Confirm-rollout check ask for confirmation passed
  Normal   Synced  20m                flagger  New revision detected! Restarting analysis for podinfo.test-canary
  Normal   Synced  18m (x3 over 19m)  flagger  Starting canary analysis for podinfo.test-canary
  Warning  Synced  18m (x3 over 19m)  flagger  Halt podinfo.test-canary advancement pre-rollout check acceptance-test failed command curl -sd 'test' http://podinfo-canary:9898/token | grep token failed: : exit status 1
  Warning  Synced  18m                flagger  Rolling back podinfo.test-canary failed checks threshold reached 3
  Warning  Synced  18m                flagger  Canary failed! Scaling down podinfo.test-canary

Flagger logs:

{"level":"info","ts":"2020-08-13T15:05:45.552Z","caller":"flagger/main.go:112","msg":"Starting flagger version 1.0.1 revision 744b83253a5136e7c72f2b178ca143301c82e7b4 mesh provider gloo"}
{"level":"info","ts":"2020-08-13T15:05:45.574Z","caller":"flagger/main.go:383","msg":"Connected to Kubernetes API v1.15.11-eks-065dce"}
{"level":"info","ts":"2020-08-13T15:05:45.574Z","caller":"flagger/main.go:239","msg":"Waiting for canary informer cache to sync"}
{"level":"info","ts":"2020-08-13T15:05:45.674Z","caller":"flagger/main.go:246","msg":"Waiting for metric template informer cache to sync"}
{"level":"info","ts":"2020-08-13T15:05:45.775Z","caller":"flagger/main.go:253","msg":"Waiting for alert provider informer cache to sync"}
{"level":"info","ts":"2020-08-13T15:05:45.966Z","caller":"flagger/main.go:163","msg":"Connected to metrics server <metrics_server_url>"}
{"level":"error","ts":"2020-08-13T15:05:45.966Z","caller":"flagger/main.go:326","msg":"Notifier empty Slack channel","stacktrace":"main.initNotifier\n\t/home/circleci/build/cmd/flagger/main.go:326\nmain.main\n\t/home/circleci/build/cmd/flagger/main.go:169\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
{"level":"info","ts":"2020-08-13T15:05:45.966Z","caller":"controller/controller.go:164","msg":"Starting operator"}
{"level":"info","ts":"2020-08-13T15:05:45.966Z","caller":"controller/controller.go:173","msg":"Started operator workers"}
{"level":"info","ts":"2020-08-13T15:05:45.966Z","caller":"server/server.go:29","msg":"Starting HTTP server on port 8080"}
{"level":"info","ts":"2020-08-13T15:05:45.972Z","caller":"controller/controller.go:281","msg":"Synced test-canary/podinfo"}
{"level":"info","ts":"2020-08-13T15:06:37.956Z","caller":"controller/controller.go:281","msg":"Synced test-canary/podinfo"}
{"level":"info","ts":"2020-08-13T15:06:56.013Z","caller":"controller/events.go:28","msg":"Halt podinfo.test-canary advancement waiting for approval ask for confirmation","canary":"podinfo.test-canary"}
{"level":"info","ts":"2020-08-13T15:07:26.008Z","caller":"controller/events.go:16","msg":"Confirm-rollout check ask for confirmation passed","canary":"podinfo.test-canary"}
{"level":"info","ts":"2020-08-13T15:07:56.008Z","caller":"controller/events.go:16","msg":"New revision detected! Restarting analysis for podinfo.test-canary","canary":"podinfo.test-canary"}
{"level":"info","ts":"2020-08-13T15:08:26.017Z","caller":"controller/events.go:16","msg":"Starting canary analysis for podinfo.test-canary","canary":"podinfo.test-canary"}
{"level":"info","ts":"2020-08-13T15:08:27.039Z","caller":"controller/events.go:28","msg":"Halt podinfo.test-canary advancement pre-rollout check acceptance-test failed command curl -sd 'test' http://podinfo-canary:9898/token | grep token failed: : exit status 1","canary":"podinfo.test-canary"}
{"level":"info","ts":"2020-08-13T15:08:56.014Z","caller":"controller/events.go:16","msg":"Starting canary analysis for podinfo.test-canary","canary":"podinfo.test-canary"}
{"level":"info","ts":"2020-08-13T15:08:57.023Z","caller":"controller/events.go:28","msg":"Halt podinfo.test-canary advancement pre-rollout check acceptance-test failed command curl -sd 'test' http://podinfo-canary:9898/token | grep token failed: : exit status 1","canary":"podinfo.test-canary"}
{"level":"info","ts":"2020-08-13T15:09:26.014Z","caller":"controller/events.go:16","msg":"Starting canary analysis for podinfo.test-canary","canary":"podinfo.test-canary"}
{"level":"info","ts":"2020-08-13T15:09:27.039Z","caller":"controller/events.go:28","msg":"Halt podinfo.test-canary advancement pre-rollout check acceptance-test failed command curl -sd 'test' http://podinfo-canary:9898/token | grep token failed: : exit status 1","canary":"podinfo.test-canary"}
{"level":"info","ts":"2020-08-13T15:09:56.015Z","caller":"controller/events.go:28","msg":"Rolling back podinfo.test-canary failed checks threshold reached 3","canary":"podinfo.test-canary"}
{"level":"info","ts":"2020-08-13T15:09:56.022Z","caller":"controller/events.go:28","msg":"Canary failed! Scaling down podinfo.test-canary","canary":"podinfo.test-canary"}
@paritadhandha
Copy link
Author

Just following up with this issue since there has been no response here or in the flagger slack community.

Please let me know if the issue could not be reproduced!

It might be that we have not been using the confirm-rollout webhook as expected and the gate (url) in confirm-rollout cannot be closed when a new k8 deployment is triggered and before canary pods are launched, but it can only be closed after canary pods get launched (e.g. during analysis phase).

If above is the case, it would be good to have that mentioned in the flagger documentation.

@axelbodo
Copy link

We've also faced this issue, and created a code WA for it, as Gate open will put the canary into progressing status, so CheckCanaryStatus doesn't execute the code part, where the canaryController.ScaleFromZero(canary) is executed in scheduler.go.

@paritadhandha
Copy link
Author

@axelbodo Thank you for your reply! That explains the issue. We also created work around for it.
Basically we kept the rollout gate open at the start of the canary so that the canary pods do get launched. And if we want to pause in the middle of the analysis, we would just close the rollout gate.

Was there any reason that flagger was built this way or can this be considered a bug?

@vic3lord
Copy link

We are also experiencing the same thing, check here #874

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants