Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Galal hussein etcd backup restore #2154

Merged
merged 57 commits into from
Aug 28, 2020

Conversation

briandowns
Copy link
Contributor

@briandowns briandowns commented Aug 22, 2020

Proposed changes

Add etcd snapshot and restoration feature, this include adding 5 new flags:

  • "etcd-snapshot-interval"
  • "etcd-snapshot-dir"
  • "etcd-snapshot-restore-path"
  • "etcd-disable-snapshots"
  • "etcd-snapshot-retention"

Types of changes

  • New feature (non-breaking change which adds functionality)

Verification

snapshot should be enabled by default and the snapshot dir defaults to /server/db/snapshots, you should verify by checking the directory for snapshots saved there.

restoration should happen the same way cluster-reset is working, when specifying the flag it will move the old data dir to /server/db/etcd-old and then attempt to restore the snapshot by creating a new data dir, and then start etcd with forcing a new cluster so that it start a 1 member cluster. Upon completion of the restore, k3s will exit.

to verify the restoration you should be able to see cluster starting with only one etcd member and the data from the specified snapshot is being restored correctly.

The pr also will add etcd snapshot retention and also a flag to disable snapshots altogether.

Testing

###1 Testing disabling snapshots

  • start k3s with --cluster-init and --etcd-disable-snapshots

you should see no snapshot has been created in /server/db/snapshots

###2 Testing snapshot interval and retention

  • start k3s with --etcd-snapshot-interval set to 5s and --etcd-snapshot-retention set to 10

you should see snapshots created every 5 seconds in /server/db/snapshots and no more than 10 snapshots in the directory

Make sure that if the --cluster-reset-restore-path flag is given that the --cluster-reset flag is present:

./k3s server --datastore-endpoint=etcd --cluster-reset-restore-path=/home/bdowns/etcd_snapshot-etcd-snapshot-1598631420
FATA[2020-08-28T19:06:27.534957919Z] Invalid flag use. --cluster-reset required with --cluster-reset-restore-path

Verify that there's a failure when an invalid path is provided:

FATA[2020-08-28T19:07:46.056232602Z] starting kubernetes: preparing server: start cluster and https: etcd: snapshot path does not exist: /home/bdowns/etcd_snapshot-etcd-snapshot-1598631420

###3 testing snapshot restore

To test snapshot restore, just use --etcd-snapshot-restore-path and point to any snapshot file that you have on the system

The cluster should restore and etcd will only have one member, also you should see /server/db/etcd-old directory has been created.

Linked Issues

rancher/rke2#45

galal-hussein and others added 9 commits August 13, 2020 01:25
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
@briandowns briandowns requested a review from a team as a code owner August 22, 2020 02:27
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
pkg/etcd/etcd.go Outdated Show resolved Hide resolved
pkg/etcd/etcd.go Outdated Show resolved Hide resolved
pkg/etcd/etcd.go Show resolved Hide resolved
pkg/cli/cmds/server.go Outdated Show resolved Hide resolved
pkg/etcd/etcd.go Outdated Show resolved Hide resolved
pkg/etcd/etcd.go Outdated Show resolved Hide resolved
pkg/etcd/etcd.go Outdated Show resolved Hide resolved
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
@briandowns briandowns self-assigned this Aug 22, 2020
@briandowns briandowns requested a review from a team August 22, 2020 22:50
pkg/cli/cmds/server.go Outdated Show resolved Hide resolved
pkg/cli/cmds/server.go Outdated Show resolved Hide resolved
pkg/cli/cmds/server.go Outdated Show resolved Hide resolved
pkg/cli/cmds/server.go Outdated Show resolved Hide resolved
pkg/cli/cmds/server.go Outdated Show resolved Hide resolved
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
@briandowns
Copy link
Contributor Author

The sub command was discussed in a conversation last week or Monday and I think it was expressed as a desire but a step after this was accomplished.

@cjellick
Copy link
Contributor

makes sense. i'll wait to see what darren says only to make sure his opinion on the UX doesnt clash with what we have here.
if it doesnt, then we'll merge this and slot the subcommands work in for later. its the one-off snapshotting that im mostly concerned about, which i dont recall talking about, but quite possible i just didnt catch it or it didnt sink in.

@cjellick
Copy link
Contributor

Ok, synced with @ibuildthecloud and here's the changes. SPOILER: he wants to do less than what i proposed.

  1. He wants restoring to be part of the --cluster-reset UX. So, --etcd-snapshot-restore-path should be --cluster-reset-restore-path and it should do nothing on its own. It should just influence where --cluster-reset loads its data from
  2. --etcd-snapshot-interval should be changed from an interval to a cron. So, --etcd-snaphot-schedule-cron. Default 12 hours.

Two things unrelated to this PR:

  1. S3 is out of 1.0. We will do S3 backups when we get a full rancher integration in 2.6. We only need local backup for GA.
  2. A command for a one-off on-demand backup is out of 1.0. Same deal: we will hit it as part of the fullfledged 2.6 integration
    cc @davidnuzik

@davidnuzik
Copy link
Contributor

To track the unrelated items we'd like to pick up later I have created the following for tracking purposes:
rancher/rke2#249 for S3 bucket
rancher/rke2#250 for on-demand backups
cc @cjellick

@briandowns
Copy link
Contributor Author

Ok, synced with @ibuildthecloud and here's the changes. SPOILER: he wants to do less than what i proposed.

  1. He wants restoring to be part of the --cluster-reset UX. So, --etcd-snapshot-restore-path should be --cluster-reset-restore-path and it should do nothing on its own. It should just influence where --cluster-reset loads its data from
  2. --etcd-snapshot-interval should be changed from an interval to a cron. So, --etcd-snaphot-schedule-cron. Default 12 hours.

Two things unrelated to this PR:

  1. S3 is out of 1.0. We will do S3 backups when we get a full rancher integration in 2.6. We only need local backup for GA.
  2. A command for a one-off on-demand backup is out of 1.0. Same deal: we will hit it as part of the fullfledged 2.6 integration
    cc @davidnuzik

For the con, do we need to support cronspec?

Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
…enerate

Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
@briandowns
Copy link
Contributor Author

@cjellick In regards to --etcd-snapshot-restore-path should be --cluster-reset-restore-path and it should do nothing on its own. It should just influence where --cluster-reset loads its data from, I'm assuming that if the --etcd-snapshot-restore-path flag is present, we're doing a restore of that on reset rather than new cluster creation?

Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Signed-off-by: Brian Downs <brian.downs@gmail.com>
Dockerfile.dapper761278592 Outdated Show resolved Hide resolved
@briandowns briandowns merged commit 866dc94 into k3s-io:master Aug 28, 2020
@liyimeng
Copy link
Contributor

So there should always need a snapshot to restore a cluster If it loss quorum?

@briandowns
Copy link
Contributor Author

So there should always need a snapshot to restore a cluster If it loss quorum?

If you issue --cluster-reset it will do a reset, however if you provide that flag along with --cluster-reset-restore-path it will do the restore with the given snapshot.

briandowns added a commit to briandowns/k3s that referenced this pull request Jan 14, 2021
* Add etcd snapshot and restore

Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>

* fix error logs

Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>

* goimports

Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>

* fix flag describtion

Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>

* Add disable snapshot and retention

Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>

* use creation time for snapshot retention

Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>

* unexport method, update var name

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* adjust snapshot flags

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update var name, string concat

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* revert previous change, create constants

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* updates

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* type assertion error checking

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* pr remediation

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* pr remediation

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* pr remediation

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* pr remediation

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* pr remediation

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* updates

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* updates

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* simplify logic, remove unneeded function

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update flags

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update flags

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* add comment

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* exit on restore completion, update flag names, move retention check

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* exit on restore completion, update flag names, move retention check

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* exit on restore completion, update flag names, move retention check

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update disable snapshots flag and field names

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* move function

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update field names

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update var and field names

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update var and field names

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update defaultSnapshotIntervalMinutes to 12 like rke

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update directory perms

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update etc-snapshot-dir usage

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update interval to 12 hours

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* fix usage typo

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* add cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* add cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* add cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update deps target to work, add build/data target for creation, and generate

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* remove dead make targets

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* error handling, cluster reset functionality

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* error handling, cluster reset functionality

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* remove intermediate dapper file

Signed-off-by: Brian Downs <brian.downs@gmail.com>

Co-authored-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants