Add etcd snapshot and restore #2118

Closed

Conversation

@galal-hussein (Contributor) commented Aug 12, 2020

Proposed changes

Add an etcd snapshot and restoration feature. This includes adding five new flags (sketched after the list):

  • "snapshot-interval"
  • "snapshot-dir"
  • "snapshot-restore-path"
  • "disable-snapshots"
  • "snapshot-retention"

Types of changes

  • New feature (non-breaking change which adds functionality)

Verification

Snapshots are enabled by default and the snapshot directory defaults to /server/db/snapshots; you can verify by checking that directory for saved snapshots.
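
For reference, a minimal sketch of taking one snapshot with the etcd v3.4 snapshot manager that k3s vendors; the endpoint, the missing TLS config, and the file-naming scheme are illustrative assumptions, not the PR's code:

```go
package snapshotutil

import (
	"context"
	"fmt"
	"path/filepath"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/snapshot"
	"go.uber.org/zap"
)

// saveSnapshot writes one timestamped snapshot file into snapshotDir.
func saveSnapshot(ctx context.Context, snapshotDir string) error {
	sManager := snapshot.NewV3(zap.NewNop())
	name := fmt.Sprintf("etcd-snapshot-%d", time.Now().Unix())
	return sManager.Save(ctx, clientv3.Config{
		// assumed local endpoint; real code would also pass etcd client TLS config
		Endpoints: []string{"https://127.0.0.1:2379"},
	}, filepath.Join(snapshotDir, name))
}
```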

Restoration works the same way as cluster-reset: when the flag is specified, k3s moves the old data dir to /server/db/etcd-old, attempts to restore the snapshot into a new data dir, and then starts etcd with force-new-cluster so that it comes up as a one-member cluster.

To verify the restoration, you should see the cluster start with only one etcd member and the data from the specified snapshot restored correctly. A sketch of this flow follows.
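
A minimal sketch of that restore flow under the same assumptions (etcd v3.4 snapshot manager; the member name, peer URL, and error handling are illustrative, not the PR's code):

```go
package snapshotutil

import (
	"os"

	"go.etcd.io/etcd/clientv3/snapshot"
	"go.uber.org/zap"
)

// restoreSnapshot moves the current data dir aside (e.g. /server/db/etcd
// -> /server/db/etcd-old) and rebuilds a fresh data dir from the snapshot.
// etcd would then be started with force-new-cluster so it comes up as a
// single-member cluster, as described above.
func restoreSnapshot(dataDir, oldDataDir, snapshotPath string) error {
	if err := os.Rename(dataDir, oldDataDir); err != nil && !os.IsNotExist(err) {
		return err
	}
	sManager := snapshot.NewV3(zap.NewNop())
	return sManager.Restore(snapshot.RestoreConfig{
		SnapshotPath:   snapshotPath,
		Name:           "default", // assumed member name
		OutputDataDir:  dataDir,
		PeerURLs:       []string{"https://127.0.0.1:2380"}, // assumed
		InitialCluster: "default=https://127.0.0.1:2380",   // assumed
	})
}
```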

The PR also adds etcd snapshot retention and a flag to disable snapshots altogether. A retention sketch follows.
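
A minimal sketch of what retention could look like, assuming it simply deletes the oldest snapshot files once the directory holds more than the configured count (the PR's exact policy may differ):

```go
package snapshotutil

import (
	"io/ioutil"
	"os"
	"path/filepath"
	"sort"
)

// pruneSnapshots keeps at most retention files in snapshotDir, removing
// the oldest (by modification time) first.
func pruneSnapshots(snapshotDir string, retention int) error {
	files, err := ioutil.ReadDir(snapshotDir)
	if err != nil {
		return err
	}
	sort.Slice(files, func(i, j int) bool {
		return files[i].ModTime().Before(files[j].ModTime())
	})
	for len(files) > retention {
		if err := os.Remove(filepath.Join(snapshotDir, files[0].Name())); err != nil {
			return err
		}
		files = files[1:]
	}
	return nil
}
```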

Testing

1. Testing disabling snapshots

  • start k3s with --cluster-init and --disable-snapshots

You should see that no snapshot has been created in /server/db/snapshots.

2. Testing snapshot interval and retention

  • start k3s with --snapshot-interval set to 5s and --snapshot-retention set to 10

You should see snapshots created every 5 seconds in /server/db/snapshots, and no more than 10 snapshots in the directory. A sketch of such a loop follows.
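
The expected behavior implies a loop along these lines, tying the two flags together; saveSnapshot and pruneSnapshots are the hypothetical helpers sketched earlier, and the scheduling details are assumptions:

```go
package snapshotutil

import (
	"context"
	"time"

	"github.com/sirupsen/logrus"
)

// snapshotLoop takes a snapshot every interval and prunes old ones until
// ctx is cancelled.
func snapshotLoop(ctx context.Context, interval time.Duration, dir string, retention int) {
	ticker := time.NewTicker(interval) // e.g. 5 * time.Second for this test
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := saveSnapshot(ctx, dir); err != nil {
				logrus.Errorf("failed to save snapshot: %v", err)
				continue
			}
			if err := pruneSnapshots(dir, retention); err != nil {
				logrus.Errorf("failed to apply snapshot retention: %v", err)
			}
		}
	}
}
```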

3. Testing snapshot restore

To test snapshot restore, use --snapshot-restore-path and point it at any snapshot file you have on the system.

The cluster should restore, etcd should have only one member, and you should see that the /server/db/etcd-old directory has been created.

Linked Issues

rancher/rke2#45

@briandowns (Contributor) left a comment

🚢 🇮🇹

@brandond (Member) left a comment

💰

@ElisaMeng

Will the snapshots be stored forever?

@ElisaMeng commented Aug 14, 2020

@galal-hussein I tried to test after the learner change was introduced. If I do a cluster reset, I get the panic below, which was not seen before. What is going wrong? Or is it simply not ready yet?

How I test:

  • deploy k3s from master using HA
  • take down masters so the cluster loses quorum
  • run with --cluster-reset, and the panic happens

This is 100% reproducible for me.


INFO[2020-08-14T15:09:17.889190241+08:00] Starting k3s v1.18.6+k3s-026584e1 (026584e1)
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-08-14 15:09:17.905293 I | embed: peerTLS: cert = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.crt, key = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.key, trusted-ca = /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt, client-cert-auth = true, crl-file =
2020-08-14 15:09:17.905760 I | embed: name = master1-5c63d33d
2020-08-14 15:09:17.905775 I | embed: force new cluster
2020-08-14 15:09:17.905779 I | embed: data dir = /var/lib/rancher/k3s/server/db/etcd
2020-08-14 15:09:17.905837 I | embed: member dir = /var/lib/rancher/k3s/server/db/etcd/member
2020-08-14 15:09:17.905843 I | embed: heartbeat = 500ms
2020-08-14 15:09:17.905881 I | embed: election = 5000ms
2020-08-14 15:09:17.905920 I | embed: snapshot count = 100000
2020-08-14 15:09:17.905987 I | embed: advertise client URLs = https://172.20.1.168:2379
2020-08-14 15:09:17.905996 I | embed: initial advertise peer URLs = https://172.20.1.168:2380
2020-08-14 15:09:17.906060 I | embed: initial cluster =
2020-08-14 15:09:17.958057 C | etcdserver: ConfChange Type should be either ConfChangeAddNode or ConfChangeRemoveNode!
panic: ConfChange Type should be either ConfChangeAddNode or ConfChangeRemoveNode!

goroutine 1 [running]:
github.com/rancher/k3s/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc00100fba0, 0x430cca6, 0x4b, 0x0, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:83 +0x135
github.com/rancher/k3s/vendor/go.etcd.io/etcd/etcdserver.getIDs(0x0, 0x0, 0xc0013aa000, 0x183e, 0x1871, 0x0, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/etcdserver/raft.go:703 +0x5ef
github.com/rancher/k3s/vendor/go.etcd.io/etcd/etcdserver.restartAsStandaloneNode(0xc00102a860, 0x13, 0x0, 0x0, 0x0, 0x0, 0xc001299500, 0x1, 0x1, 0xc001299400, ...)
	/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/etcdserver/raft.go:608 +0x246
github.com/rancher/k3s/vendor/go.etcd.io/etcd/etcdserver.NewServer(0xc00102a860, 0x13, 0x0, 0x0, 0x0, 0x0, 0xc001299500, 0x1, 0x1, 0xc001299400, ...)
	/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/etcdserver/server.go:482 +0x1060
github.com/rancher/k3s/vendor/go.etcd.io/etcd/embed.StartEtcd(0xc0002ad800, 0xc000ff6680, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/embed/etcd.go:211 +0x9e9
github.com/rancher/k3s/pkg/daemons/executor.Embedded.ETCD(0xc000e6de60, 0x19, 0xc000ab79e0, 0x2d, 0x41ce290, 0x3, 0xc000e6de20, 0x13, 0xc000ab7a40, 0x30, ...)
	/go/src/github.com/rancher/k3s/pkg/daemons/executor/etcd.go:23 +0xa9
github.com/rancher/k3s/pkg/daemons/executor.ETCD(...)
	/go/src/github.com/rancher/k3s/pkg/daemons/executor/executor.go:107
github.com/rancher/k3s/pkg/etcd.(*ETCD).cluster(0xc0005bc400, 0x4c54660, 0xc000ba90c0, 0x1, 0xc000e6de60, 0x19, 0xc000ab79e0, 0x2d, 0x41ce290, 0x3, ...)
	/go/src/github.com/rancher/k3s/pkg/etcd/etcd.go:374 +0x59b
github.com/rancher/k3s/pkg/etcd.(*ETCD).newCluster(0xc0005bc400, 0x4c54660, 0xc000ba90c0, 0x4c54601, 0xc000ba90c0, 0x0)
	/go/src/github.com/rancher/k3s/pkg/etcd/etcd.go:358 +0x241
github.com/rancher/k3s/pkg/etcd.(*ETCD).Reset(0xc0005bc400, 0x4c54660, 0xc000ba90c0, 0x0, 0xc000451ec0, 0xc00016ab60)
	/go/src/github.com/rancher/k3s/pkg/etcd/etcd.go:120 +0x88
github.com/rancher/k3s/pkg/cluster.(*Cluster).start(0xc000d0c5a0, 0x4c54660, 0xc000ba90c0, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/pkg/cluster/managed.go:49 +0x6a
github.com/rancher/k3s/pkg/cluster.(*Cluster).Start(0xc000d0c5a0, 0x4c54660, 0xc000ba90c0, 0x0, 0x0, 0x41ebcfe)
	/go/src/github.com/rancher/k3s/pkg/cluster/cluster.go:33 +0x78
github.com/rancher/k3s/pkg/daemons/control.prepare(0x4c54660, 0xc000ba90c0, 0xc0005ff908, 0xc0005b3880, 0x1a, 0x0)
	/go/src/github.com/rancher/k3s/pkg/daemons/control/server.go:358 +0x286c
github.com/rancher/k3s/pkg/daemons/control.Server(0x4c54660, 0xc000ba90c0, 0xc0005ff908, 0xc0006adb70, 0xc0006adb70)
	/go/src/github.com/rancher/k3s/pkg/daemons/control/server.go:89 +0x155
github.com/rancher/k3s/pkg/server.StartServer(0x4c54660, 0xc000ba90c0, 0xc0005ff900, 0xc000ba90c0, 0x2)
	/go/src/github.com/rancher/k3s/pkg/server/server.go:55 +0x90
github.com/rancher/k3s/pkg/cli/server.run(0xc000ba88c0, 0x744b0a0, 0x1, 0xc0005e34c0)
	/go/src/github.com/rancher/k3s/pkg/cli/server/server.go:220 +0x132f
github.com/rancher/k3s/pkg/cli/server.Run(0xc000ba88c0, 0xc000759c70, 0x0)
	/go/src/github.com/rancher/k3s/pkg/cli/server/server.go:35 +0x37
github.com/rancher/k3s/pkg/cli/cmds.InitLogging.func1(0xc000ba88c0, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/pkg/cli/cmds/log.go:73 +0xaa
github.com/rancher/k3s/vendor/github.com/rancher/spur/cli.(*Command).Run(0xc000e7d0e0, 0xc000a0d5c0, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/vendor/github.com/rancher/spur/cli/command.go:164 +0x4b9
github.com/rancher/k3s/vendor/github.com/rancher/spur/cli.(*App).RunContext(0xc000ac2000, 0x4c546a0, 0xc0000e4010, 0xc000213bf0, 0x3, 0x3, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/vendor/github.com/rancher/spur/cli/app.go:308 +0x5ed
github.com/rancher/k3s/vendor/github.com/rancher/spur/cli.(*App).Run(...)
	/go/src/github.com/rancher/k3s/vendor/github.com/rancher/spur/cli/app.go:225
main.main()
	/go/src/github.com/rancher/k3s/cmd/server/main.go:46 +0x3a6

@galal-hussein (Contributor, Author)
@ElisaMeng I will check that out, thanks a lot. If you don't mind, can you open a separate issue for the problem?

@ElisaMeng commented Aug 15, 2020

> @ElisaMeng I will check that out, thanks a lot. If you don't mind, can you open a separate issue for the problem?

@galal-hussein I will do that. I posted here because I thought this was a WIP and didn't want to sound the alarm too early. :)

See #2131

@MonzElmasry (Contributor) left a comment

LGTM

	Destination: &ServerConfig.DisableSnapshots,
},
&cli.StringFlag{
	Name: "snapshot-dir",
Member

It's too bad we didn't decide earlier whether things are -path or -dir. I wanted to complain that we're adding two paths, but one is called Path and one is called Dir, and we're already inconsistent on this for other config settings. Should the snapshot-restore-path point at a single file?

@@ -89,6 +97,25 @@ func nameFile(config *config.Control) string {
	return filepath.Join(dataDir(config), "name")
}

func snapshotDir(config *config.Control) (string, error) {
	if config.SnapshotDir == "" {
		// we have to create the snapshot dir if we are using
Member

So we create the default dir if it doesn't exist? What happens if the user specifies a nonexistent path? Do we want to create that also, or just fail when trying to save the snapshot?

Contributor

I'd prefer we fail in that case. I wouldn't want the software creating n+1 nested dir structures.

Member

Yeah, same here. Maybe check here that it's a writable directory?
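
For illustration, a sketch of where this discussion seems to land: create the default dir on demand, but fail fast on a user-specified dir that is missing or unwritable. config.Control and dataDir come from the diff above; the writability probe and everything else here are assumptions, not the PR's code:

```go
package etcd

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"

	"github.com/rancher/k3s/pkg/daemons/config"
)

// snapshotDir returns the directory snapshots are written to. The default
// dir is created if missing; a user-specified dir must already exist.
func snapshotDir(config *config.Control) (string, error) {
	if config.SnapshotDir == "" {
		defaultDir := filepath.Join(dataDir(config), "snapshots")
		if err := os.MkdirAll(defaultDir, 0700); err != nil {
			return "", err
		}
		return defaultDir, nil
	}
	info, err := os.Stat(config.SnapshotDir)
	if err != nil {
		return "", err // fail instead of creating user-specified dirs
	}
	if !info.IsDir() {
		return "", fmt.Errorf("%s is not a directory", config.SnapshotDir)
	}
	// crude writability check: create and remove a probe file
	probe, err := ioutil.TempFile(config.SnapshotDir, ".write-probe-")
	if err != nil {
		return "", fmt.Errorf("%s is not writable: %v", config.SnapshotDir, err)
	}
	probe.Close()
	return config.SnapshotDir, os.Remove(probe.Name())
}
```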

@briandowns closed this Aug 22, 2020
@davidnuzik (Contributor)

Use #2154
