cloud_controller_worker pre-backup-lock hangs when there are 10 or more cloud_controller_workers #267

Open
ohkyle opened this issue Sep 13, 2022 · 1 comment

Comments

ohkyle commented Sep 13, 2022

Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

cloud_controller_worker pre-backup-lock hangs when there are 10 or more cloud_controller_workers

Context

We ran into issues when trying to use bbr to back up our cf deployment.

We deployed cf with this ops file:

- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/broker_client_default_async_poll_interval_seconds?
  value: 10

- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/jobs?/generic?/number_of_workers?
  value: 11

- type: replace
  path: /instance_groups/name=cc-worker/vm_type
  value: medium

We observed these logs

...
[bbr] 2022/09/13 15:31:55 INFO - Finished locking cloud_controller_clock on scheduler/b468a5cd-130a-442a-bcab-bfbc40cb4ab5 for backup.
[bbr] 2022/09/13 15:32:03 INFO - Finished locking cloud_controller_ng on api/dc671209-452f-43c9-8077-3927b614ffad for backup.
[bbr] 2022/09/13 15:32:06 INFO - Finished locking cloud_controller_ng on api/5859cb84-76ed-492f-b03f-023532aa352e for backup.

On the vm we see

cc-worker/ff736dfe-8c81-4528-98a6-5698ba183123:~# monit summary
The Monit daemon 5.2.5 uptime: 2d 3h 22m

Process 'cloud_controller_worker_1' not monitored
Process 'cloud_controller_worker_2' running
Process 'cloud_controller_worker_3' running
Process 'cloud_controller_worker_4' running
Process 'cloud_controller_worker_5' running
Process 'cloud_controller_worker_6' running
Process 'cloud_controller_worker_7' running
Process 'cloud_controller_worker_8' running
Process 'cloud_controller_worker_9' running
Process 'cloud_controller_worker_10' running
Process 'cloud_controller_worker_11' running
Process 'loggregator_agent'         running
Process 'loggr-forwarder-agent'     running
Process 'loggr-syslog-agent'        running
Process 'prom_scraper'              running
Process 'metrics-discovery-registrar' running
Process 'metrics-agent'             running
Process 'bosh-dns'                  running
Process 'bosh-dns-resolvconf'       running
Process 'bosh-dns-healthcheck'      running

On the vm we also see

cc-worker/ff736dfe-8c81-4528-98a6-5698ba183123:~# /var/vcap/jobs/cloud_controller_worker/bin/bbr/pre-backup-lock
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...

Steps to Reproduce

  1. deploy cf with
- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/broker_client_default_async_poll_interval_seconds?
  value: 10

- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/jobs?/generic?/number_of_workers?
  value: 11

- type: replace
  path: /instance_groups/name=cc-worker/vm_type
  value: medium

and bbr

- type: replace
  path: /releases/-
  value:
    name: backup-and-restore-sdk
    sha1: 238c36f2229f303ebf96f6b24b29799232195e38
    url: https://bosh.io/d/github.com/cloudfoundry-incubator/backup-and-restore-sdk-release?v=1.18.52
    version: 1.18.52
- type: replace
  path: /instance_groups/-
  value:
    azs:
    - z1
    instances: 1
    jobs:
    - name: database-backup-restorer
      release: backup-and-restore-sdk
    - name: bbr-cfnetworkingdb
      properties:
        release_level_backup: true
      release: cf-networking
    - name: bbr-cloudcontrollerdb
      release: capi
    - name: bbr-routingdb
      release: routing
    - name: bbr-uaadb
      properties:
        release_level_backup: true
      release: uaa
    - name: bbr-credhubdb
      properties:
        release_level_backup: true
      release: credhub
    - name: cf-cli-6-linux
      release: cf-cli
    name: backup-restore
    networks:
    - name: default
    persistent_disk_type: 10GB
    stemcell: default
    vm_type: minimal
- type: replace
  path: /instance_groups/name=api/jobs/name=routing-api/properties/release_level_backup?
  value: true
  2. try to take a backup of cf with bbr
$ bbr deployment --deployment cf backup
  3. observe the failure
...
[bbr] 2022/09/13 15:31:55 INFO - Finished locking cloud_controller_clock on scheduler/b468a5cd-130a-442a-bcab-bfbc40cb4ab5 for backup.
[bbr] 2022/09/13 15:32:03 INFO - Finished locking cloud_controller_ng on api/dc671209-452f-43c9-8077-3927b614ffad for backup.
[bbr] 2022/09/13 15:32:06 INFO - Finished locking cloud_controller_ng on api/5859cb84-76ed-492f-b03f-023532aa352e for backup.

Expected result

We expected the command in step 2 to succeed.

Current result

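Currently the command hangs: pre-backup-lock on the cc-worker vm keeps printing "Waiting for cloud_controller_worker_1 to be unmonitored..." and never returns.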

Possible Fix

This piece of code

function wait_unmonitor_job() {
  local job_name="$1"

  while true; do
    if [[ $(sudo /var/vcap/bosh/bin/monit summary | grep ${job_name} ) =~ not[[:space:]]monitored[[:space:]]*$ ]]; then
      echo "Unmonitored ${job_name}"
      return 0
    else
      echo "Waiting for ${job_name} to be unmonitored..."
    fi

    sleep 0.1
  done
}

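seems to assume that there will always be fewer than 10 cloud_controller_workers: grep for cloud_controller_worker_1 also matches cloud_controller_worker_10 and cloud_controller_worker_11, so the text tested by the regex no longer ends in "not monitored".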

On our vm with 11 workers

cc-worker/ff736dfe-8c81-4528-98a6-5698ba183123:~# monit summary | grep cloud_controller_worker_1
Process 'cloud_controller_worker_1' not monitored
Process 'cloud_controller_worker_10' running
Process 'cloud_controller_worker_11' running
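
To make the failure easier to see without a live vm, here is a minimal sketch that feeds the grep output above into the same regex used by wait_unmonitor_job. The summary variable and the matches_original_pattern helper are made up for this illustration; only the pattern itself comes from the script. It relies on bash applying "$" to the end of the whole string rather than to the end of each line.

```bash
#!/usr/bin/env bash
# Canned text standing in for: monit summary | grep cloud_controller_worker_1
# (copied from the output above; not read from a real monit).
summary="Process 'cloud_controller_worker_1' not monitored
Process 'cloud_controller_worker_10' running
Process 'cloud_controller_worker_11' running"

# Hypothetical helper wrapping the pattern from wait_unmonitor_job.
# "$" anchors to the end of the entire string, not to each line.
matches_original_pattern() {
  [[ "$1" =~ not[[:space:]]monitored[[:space:]]*$ ]]
}

# Single-line input (fewer than 10 workers): the pattern matches.
if matches_original_pattern "Process 'cloud_controller_worker_1' not monitored"; then
  echo "single line: matched"
fi

# Multi-line input (10 or more workers): the string ends in "running",
# so the pattern never matches and the loop would keep waiting.
if matches_original_pattern "${summary}"; then
  echo "three lines: matched"
else
  echo "three lines: no match"
fi
```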

ohkyle commented Sep 26, 2022

A potential fix for this bug

.*${job_name}.*[[:space:]]not[[:space:]]monitored[[:space:]]

I have attached a patch as well (since I do not have push rights to the repo)
0001-monit_utils-Add-support-for-more-than-10-cc-workers.txt
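
For comparison, here is a rough sketch of a different approach that avoids substring matching altogether by comparing the quoted process name exactly. This is not the attached patch and not the code in capi-release; is_unmonitored is a hypothetical helper, and it assumes monit summary lines have the shape shown earlier in this issue (Process 'name' status). It reads the summary text on stdin so it can be tried without monit.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exit 0 only if the line for exactly job_name says
# "not monitored". Expects `monit summary`-style text on stdin.
is_unmonitored() {
  local job_name="$1"
  # monit prints the name in single quotes, so compare against 'name'
  # as a whole field; worker_1 then cannot match worker_10 or worker_11.
  awk -v name="'${job_name}'" '
    $1 == "Process" && $2 == name {
      if ($3 == "not" && $4 == "monitored") found = 1
    }
    END { exit(found ? 0 : 1) }
  '
}

# Canned sample for illustration only.
sample="Process 'cloud_controller_worker_1' not monitored
Process 'cloud_controller_worker_10' running
Process 'cloud_controller_worker_11' running"

if printf '%s\n' "${sample}" | is_unmonitored cloud_controller_worker_1; then
  echo "Unmonitored cloud_controller_worker_1"
fi
```

In wait_unmonitor_job the condition could then become something like: sudo /var/vcap/bosh/bin/monit summary | is_unmonitored "${job_name}". That wiring is untested here and only meant as a starting point.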

ohkyle added a commit to ohkyle/capi-release that referenced this issue Oct 10, 2022
- cloudfoundry#267 identified a
  bug where the wait_unmonitor_job function assumed there would be less
  than 10 cloud controller workers
- this fix removes that assumption and looks for the job_name in the
  regex

Authored-by: Kyle Ong <kyleo@vmware.com>
ohkyle added a commit to ohkyle/capi-release that referenced this issue Oct 11, 2022
- cloudfoundry#267 identified a
  bug where the wait_unmonitor_job function assumed there would be less
  than 10 cloud controller workers
- this fix removes the less than 10 worker assumption and looks for the job_name in the
  regex

Authored-by: Kyle Ong <kyleo@vmware.com>
ohkyle added a commit to ohkyle/capi-release that referenced this issue Oct 13, 2022
- cloudfoundry#267 identified a
  bug where the wait_unmonitor_job function assumed there would be less
  than 10 cloud controller workers
- this fix removes the less than 10 worker assumption by looking for the job_name in the
  regex

Authored-by: Kyle Ong <kyleo@vmware.com>
ohkyle added a commit to ohkyle/capi-release that referenced this issue Oct 18, 2022
- cloudfoundry#267 identified a
  bug where the wait_unmonitor_job function assumed there would be less
  than 10 cloud controller workers
- this fix removes the less than 10 worker assumption and looks for the job_name in the
  regex

Authored-by: Kyle Ong <kyleo@vmware.com>