Add etcd snapshot and restore #2118

Closed

Conversation

@galal-hussein (Contributor) commented Aug 12, 2020

Proposed changes

Add an etcd snapshot and restoration feature. This includes adding five new flags (sketched after the list):

  • "snapshot-interval"
  • "snapshot-dir"
  • "snapshot-restore-path"
  • "disable-snapshots"
  • "snapshot-retention"

Types of changes

  • New feature (non-breaking change which adds functionality)

Verification

Snapshots are enabled by default and the snapshot directory defaults to /server/db/snapshots; you can verify by checking that directory for saved snapshots.
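
For reference, a minimal sketch of taking one snapshot with the etcd v3.4 snapshot manager that k3s vendors; the endpoint, the missing TLS config, and the file-naming scheme are illustrative assumptions, not the PR's code:

```go
package snapshotutil

import (
	"context"
	"fmt"
	"path/filepath"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/snapshot"
	"go.uber.org/zap"
)

// saveSnapshot writes one timestamped snapshot file into snapshotDir.
func saveSnapshot(ctx context.Context, snapshotDir string) error {
	sManager := snapshot.NewV3(zap.NewNop())
	name := fmt.Sprintf("etcd-snapshot-%d", time.Now().Unix())
	return sManager.Save(ctx, clientv3.Config{
		// assumed local endpoint; real code would also pass etcd client TLS config
		Endpoints: []string{"https://127.0.0.1:2379"},
	}, filepath.Join(snapshotDir, name))
}
```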

Restoration works the same way as cluster-reset: when the flag is specified, k3s moves the old data dir to /server/db/etcd-old, attempts to restore the snapshot into a new data dir, and then starts etcd with force-new-cluster so that it comes up as a one-member cluster.

To verify the restoration, you should see the cluster start with only one etcd member and the data from the specified snapshot restored correctly. A sketch of this flow follows.
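
A minimal sketch of that restore flow under the same assumptions (etcd v3.4 snapshot manager; the member name, peer URL, and error handling are illustrative, not the PR's code):

```go
package snapshotutil

import (
	"os"

	"go.etcd.io/etcd/clientv3/snapshot"
	"go.uber.org/zap"
)

// restoreSnapshot moves the current data dir aside (e.g. /server/db/etcd
// -> /server/db/etcd-old) and rebuilds a fresh data dir from the snapshot.
// etcd would then be started with force-new-cluster so it comes up as a
// single-member cluster, as described above.
func restoreSnapshot(dataDir, oldDataDir, snapshotPath string) error {
	if err := os.Rename(dataDir, oldDataDir); err != nil && !os.IsNotExist(err) {
		return err
	}
	sManager := snapshot.NewV3(zap.NewNop())
	return sManager.Restore(snapshot.RestoreConfig{
		SnapshotPath:   snapshotPath,
		Name:           "default", // assumed member name
		OutputDataDir:  dataDir,
		PeerURLs:       []string{"https://127.0.0.1:2380"}, // assumed
		InitialCluster: "default=https://127.0.0.1:2380",   // assumed
	})
}
```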

The PR also adds etcd snapshot retention and a flag to disable snapshots altogether. A retention sketch follows.
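
A minimal sketch of what retention could look like, assuming it simply deletes the oldest snapshot files once the directory holds more than the configured count (the PR's exact policy may differ):

```go
package snapshotutil

import (
	"io/ioutil"
	"os"
	"path/filepath"
	"sort"
)

// pruneSnapshots keeps at most retention files in snapshotDir, removing
// the oldest (by modification time) first.
func pruneSnapshots(snapshotDir string, retention int) error {
	files, err := ioutil.ReadDir(snapshotDir)
	if err != nil {
		return err
	}
	sort.Slice(files, func(i, j int) bool {
		return files[i].ModTime().Before(files[j].ModTime())
	})
	for len(files) > retention {
		if err := os.Remove(filepath.Join(snapshotDir, files[0].Name())); err != nil {
			return err
		}
		files = files[1:]
	}
	return nil
}
```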

Testing

1. Testing disabling snapshots

  • start k3s with --cluster-init and --disable-snapshots

You should see that no snapshot has been created in /server/db/snapshots.

2. Testing snapshot interval and retention

  • start k3s with --snapshot-interval set to 5s and --snapshot-retention set to 10

You should see snapshots created every 5 seconds in /server/db/snapshots, and no more than 10 snapshots in the directory. A sketch of such a loop follows.
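
The expected behavior implies a loop along these lines, tying the two flags together; saveSnapshot and pruneSnapshots are the hypothetical helpers sketched earlier, and the scheduling details are assumptions:

```go
package snapshotutil

import (
	"context"
	"time"

	"github.com/sirupsen/logrus"
)

// snapshotLoop takes a snapshot every interval and prunes old ones until
// ctx is cancelled.
func snapshotLoop(ctx context.Context, interval time.Duration, dir string, retention int) {
	ticker := time.NewTicker(interval) // e.g. 5 * time.Second for this test
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := saveSnapshot(ctx, dir); err != nil {
				logrus.Errorf("failed to save snapshot: %v", err)
				continue
			}
			if err := pruneSnapshots(dir, retention); err != nil {
				logrus.Errorf("failed to apply snapshot retention: %v", err)
			}
		}
	}
}
```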

3. Testing snapshot restore

To test snapshot restore, use --snapshot-restore-path and point it at any snapshot file you have on the system.

The cluster should restore, etcd should have only one member, and you should see that the /server/db/etcd-old directory has been created.

Linked Issues

rancher/rke2#45

@briandowns (Contributor) left a comment

🚢 🇮🇹

@brandond (Member) left a comment

💰

@ElisaMeng

Will the snapshots be stored forever?

@ElisaMeng commented Aug 14, 2020

@galal-hussein I tried to test after the learner change was introduced. If I do a cluster reset, I get the panic below, which was not seen before. What is going wrong? Or is it simply not ready yet?

How I test:

  • deploy k3s from master using HA
  • take down masters so the cluster loses quorum
  • run with --cluster-reset, and the panic happens

This is 100% reproducible for me.


INFO[2020-08-14T15:09:17.889190241+08:00] Starting k3s v1.18.6+k3s-026584e1 (026584e1)
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-08-14 15:09:17.905293 I | embed: peerTLS: cert = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.crt, key = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.key, trusted-ca = /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt, client-cert-auth = true, crl-file =
2020-08-14 15:09:17.905760 I | embed: name = master1-5c63d33d
2020-08-14 15:09:17.905775 I | embed: force new cluster
2020-08-14 15:09:17.905779 I | embed: data dir = /var/lib/rancher/k3s/server/db/etcd
2020-08-14 15:09:17.905837 I | embed: member dir = /var/lib/rancher/k3s/server/db/etcd/member
2020-08-14 15:09:17.905843 I | embed: heartbeat = 500ms
2020-08-14 15:09:17.905881 I | embed: election = 5000ms
2020-08-14 15:09:17.905920 I | embed: snapshot count = 100000
2020-08-14 15:09:17.905987 I | embed: advertise client URLs = https://172.20.1.168:2379
2020-08-14 15:09:17.905996 I | embed: initial advertise peer URLs = https://172.20.1.168:2380
2020-08-14 15:09:17.906060 I | embed: initial cluster =
2020-08-14 15:09:17.958057 C | etcdserver: ConfChange Type should be either ConfChangeAddNode or ConfChangeRemoveNode!
panic: ConfChange Type should be either ConfChangeAddNode or ConfChangeRemoveNode!

goroutine 1 [running]:
github.com/rancher/k3s/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc00100fba0, 0x430cca6, 0x4b, 0x0, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:83 +0x135
github.com/rancher/k3s/vendor/go.etcd.io/etcd/etcdserver.getIDs(0x0, 0x0, 0xc0013aa000, 0x183e, 0x1871, 0x0, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/etcdserver/raft.go:703 +0x5ef
github.com/rancher/k3s/vendor/go.etcd.io/etcd/etcdserver.restartAsStandaloneNode(0xc00102a860, 0x13, 0x0, 0x0, 0x0, 0x0, 0xc001299500, 0x1, 0x1, 0xc001299400, ...)
	/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/etcdserver/raft.go:608 +0x246
github.com/rancher/k3s/vendor/go.etcd.io/etcd/etcdserver.NewServer(0xc00102a860, 0x13, 0x0, 0x0, 0x0, 0x0, 0xc001299500, 0x1, 0x1, 0xc001299400, ...)
	/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/etcdserver/server.go:482 +0x1060
github.com/rancher/k3s/vendor/go.etcd.io/etcd/embed.StartEtcd(0xc0002ad800, 0xc000ff6680, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/vendor/go.etcd.io/etcd/embed/etcd.go:211 +0x9e9
github.com/rancher/k3s/pkg/daemons/executor.Embedded.ETCD(0xc000e6de60, 0x19, 0xc000ab79e0, 0x2d, 0x41ce290, 0x3, 0xc000e6de20, 0x13, 0xc000ab7a40, 0x30, ...)
	/go/src/github.com/rancher/k3s/pkg/daemons/executor/etcd.go:23 +0xa9
github.com/rancher/k3s/pkg/daemons/executor.ETCD(...)
	/go/src/github.com/rancher/k3s/pkg/daemons/executor/executor.go:107
github.com/rancher/k3s/pkg/etcd.(*ETCD).cluster(0xc0005bc400, 0x4c54660, 0xc000ba90c0, 0x1, 0xc000e6de60, 0x19, 0xc000ab79e0, 0x2d, 0x41ce290, 0x3, ...)
	/go/src/github.com/rancher/k3s/pkg/etcd/etcd.go:374 +0x59b
github.com/rancher/k3s/pkg/etcd.(*ETCD).newCluster(0xc0005bc400, 0x4c54660, 0xc000ba90c0, 0x4c54601, 0xc000ba90c0, 0x0)
	/go/src/github.com/rancher/k3s/pkg/etcd/etcd.go:358 +0x241
github.com/rancher/k3s/pkg/etcd.(*ETCD).Reset(0xc0005bc400, 0x4c54660, 0xc000ba90c0, 0x0, 0xc000451ec0, 0xc00016ab60)
	/go/src/github.com/rancher/k3s/pkg/etcd/etcd.go:120 +0x88
github.com/rancher/k3s/pkg/cluster.(*Cluster).start(0xc000d0c5a0, 0x4c54660, 0xc000ba90c0, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/pkg/cluster/managed.go:49 +0x6a
github.com/rancher/k3s/pkg/cluster.(*Cluster).Start(0xc000d0c5a0, 0x4c54660, 0xc000ba90c0, 0x0, 0x0, 0x41ebcfe)
	/go/src/github.com/rancher/k3s/pkg/cluster/cluster.go:33 +0x78
github.com/rancher/k3s/pkg/daemons/control.prepare(0x4c54660, 0xc000ba90c0, 0xc0005ff908, 0xc0005b3880, 0x1a, 0x0)
	/go/src/github.com/rancher/k3s/pkg/daemons/control/server.go:358 +0x286c
github.com/rancher/k3s/pkg/daemons/control.Server(0x4c54660, 0xc000ba90c0, 0xc0005ff908, 0xc0006adb70, 0xc0006adb70)
	/go/src/github.com/rancher/k3s/pkg/daemons/control/server.go:89 +0x155
github.com/rancher/k3s/pkg/server.StartServer(0x4c54660, 0xc000ba90c0, 0xc0005ff900, 0xc000ba90c0, 0x2)
	/go/src/github.com/rancher/k3s/pkg/server/server.go:55 +0x90
github.com/rancher/k3s/pkg/cli/server.run(0xc000ba88c0, 0x744b0a0, 0x1, 0xc0005e34c0)
	/go/src/github.com/rancher/k3s/pkg/cli/server/server.go:220 +0x132f
github.com/rancher/k3s/pkg/cli/server.Run(0xc000ba88c0, 0xc000759c70, 0x0)
	/go/src/github.com/rancher/k3s/pkg/cli/server/server.go:35 +0x37
github.com/rancher/k3s/pkg/cli/cmds.InitLogging.func1(0xc000ba88c0, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/pkg/cli/cmds/log.go:73 +0xaa
github.com/rancher/k3s/vendor/github.com/rancher/spur/cli.(*Command).Run(0xc000e7d0e0, 0xc000a0d5c0, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/vendor/github.com/rancher/spur/cli/command.go:164 +0x4b9
github.com/rancher/k3s/vendor/github.com/rancher/spur/cli.(*App).RunContext(0xc000ac2000, 0x4c546a0, 0xc0000e4010, 0xc000213bf0, 0x3, 0x3, 0x0, 0x0)
	/go/src/github.com/rancher/k3s/vendor/github.com/rancher/spur/cli/app.go:308 +0x5ed
github.com/rancher/k3s/vendor/github.com/rancher/spur/cli.(*App).Run(...)
	/go/src/github.com/rancher/k3s/vendor/github.com/rancher/spur/cli/app.go:225
main.main()
	/go/src/github.com/rancher/k3s/cmd/server/main.go:46 +0x3a6

@galal-hussein (Contributor, Author)
@ElisaMeng I will check that out, thanks a lot. If you don't mind, can you open a separate issue for the problem?

@ElisaMeng commented Aug 15, 2020

> @ElisaMeng I will check that out, thanks a lot. If you don't mind, can you open a separate issue for the problem?

@galal-hussein I will do that. I posted here because I thought this was a WIP and didn't want to sound the alarm too early. :)

See #2131

@MonzElmasry (Contributor) left a comment

LGTM

	Destination: &ServerConfig.DisableSnapshots,
},
&cli.StringFlag{
	Name: "snapshot-dir",
Member

It's too bad we didn't decide earlier whether things are -path or -dir. I wanted to complain that we're adding two paths, but one is called Path and one is called Dir, and we're already inconsistent on this for other config settings. Should the snapshot-restore-path point at a single file?

@@ -89,6 +97,25 @@ func nameFile(config *config.Control) string {
	return filepath.Join(dataDir(config), "name")
}

func snapshotDir(config *config.Control) (string, error) {
	if config.SnapshotDir == "" {
		// we have to create the snapshot dir if we are using
Member

So we create the default dir if it doesn't exist? What happens if the user specifies a nonexistent path? Do we want to create that also, or just fail when trying to save the snapshot?

Contributor

I'd prefer we fail in that case. I wouldn't want the software creating n+1 nested dir structures.

Member

Yeah, same here. Maybe check here that it's a writable directory?
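
For illustration, a sketch of where this discussion seems to land: create the default dir on demand, but fail fast on a user-specified dir that is missing or unwritable. config.Control and dataDir come from the diff above; the writability probe and everything else here are assumptions, not the PR's code:

```go
package etcd

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"

	"github.com/rancher/k3s/pkg/daemons/config"
)

// snapshotDir returns the directory snapshots are written to. The default
// dir is created if missing; a user-specified dir must already exist.
func snapshotDir(config *config.Control) (string, error) {
	if config.SnapshotDir == "" {
		defaultDir := filepath.Join(dataDir(config), "snapshots")
		if err := os.MkdirAll(defaultDir, 0700); err != nil {
			return "", err
		}
		return defaultDir, nil
	}
	info, err := os.Stat(config.SnapshotDir)
	if err != nil {
		return "", err // fail instead of creating user-specified dirs
	}
	if !info.IsDir() {
		return "", fmt.Errorf("%s is not a directory", config.SnapshotDir)
	}
	// crude writability check: create and remove a probe file
	probe, err := ioutil.TempFile(config.SnapshotDir, ".write-probe-")
	if err != nil {
		return "", fmt.Errorf("%s is not writable: %v", config.SnapshotDir, err)
	}
	probe.Close()
	return config.SnapshotDir, os.Remove(probe.Name())
}
```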

@briandowns closed this Aug 22, 2020
@davidnuzik (Contributor)

Use #2154
