Use Rook Ceph for Jupyterhub and Conda Store drives #2541
Conversation
```diff
 DEFAULT_DO_NODE_GROUPS = {
-    "general": DigitalOceanNodeGroup(instance="g-8vcpu-32gb", min_nodes=1, max_nodes=1),
+    "general": DigitalOceanNodeGroup(instance="g-8vcpu-32gb", min_nodes=1, max_nodes=5),
```
No need to keep max_nodes as 5
One issue I'm hitting is that the Ceph cluster deployment takes a long time, but Helm thinks it's done because it only creates a CephCluster object, which is quick; the Rook operator then handles the longer task of creating a series of Rook resources in successive steps. What this means in practice is that the deploy step times out, and Ceph isn't ready for another 5-10 minutes even beyond the timeout. I'm not sure of the best way to address this. Maybe a readiness check (in Kubernetes or in Python) could be added, the timeout could be increased, or Ceph could be deployed at an earlier stage of Nebari so it has extra time to become ready.
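To make the readiness-check idea concrete, here is a minimal sketch (assuming the Kubernetes Python client and that the CephCluster lives in the `rook-ceph` namespace under the default name; both are assumptions, not what this PR does) that polls the CephCluster status until the operator reports it ready:

```python
import time
from kubernetes import client, config


def wait_for_cephcluster_ready(name="rook-ceph", namespace="rook-ceph", timeout=1200):
    """Poll the CephCluster custom resource until the Rook operator marks it Ready."""
    config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
    api = client.CustomObjectsApi()
    deadline = time.time() + timeout
    while time.time() < deadline:
        cephcluster = api.get_namespaced_custom_object(
            group="ceph.rook.io",
            version="v1",
            namespace=namespace,
            plural="cephclusters",
            name=name,
        )
        # Rook sets status.phase to "Ready" once the cluster is fully up.
        if cephcluster.get("status", {}).get("phase") == "Ready":
            return
        time.sleep(15)
    raise TimeoutError(f"CephCluster {name!r} was not Ready after {timeout} seconds")
```

A check like this could run as part of the deploy step so the stage only succeeds once Ceph is actually usable.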
Currently, there are 3 storage nodes that each require about 1 CPU and 2-6 GB of RAM. On Azure, I added 3 nodes with 2 CPUs and 8 GB of RAM for storage, adding $210/month to the operating cost of Nebari (on top of the pre-existing $280/month). We may be able to use a single node by using cluster-test.yaml from here, but we likely won't get the same performance speedup, and we would lose the redundancy benefit that a more typical multi-node Ceph deployment provides.
Because there isn't a pressing need for this, and the benefits of using Ceph seem to come only with significantly increased cost, I'm planning to put this on the back burner for the moment. Questions we need to resolve:
I think we can make Ceph the first stage after the k8s cluster (so stage 3?) and see if that is enough. As for costs, we could make single-node Ceph the default and have an HA flag in the config to move to a multi-node, HA setup. That way we still have the abstracted storage interface, but not the increased cost. @Adam-D-Lewis can you test with a single-node Ceph and see how it runs?
@dcmcand Yeah, I'll test with a single node |
This PR is ready for review. You can do a deployment by adding `storage.type = cephfs` to the Nebari config file. You shouldn't switch an existing Nebari deployment from nfs (the default) to cephfs because you'll lose all your files and conda envs. I'll write up a docs PR describing this as experimental. I think we should start using this on Quansight's deployment (after we do a backup) to get a feel for it, and eventually make it the default if all goes well.

```yaml
storage:
  type: cephfs
```
@Adam-D-Lewis did you work out the destroy bits?
No, not yet. The destroy still completes, but it takes longer at the moment, and a message appears saying stage 07 had problems destroying. However, only stages 01 and 02 are usually necessary for a successful destroy.
Update: Fixed |
I attempted to just have the operator wait before it was destroyed with a local-exec provisioner, but that didn't work. There are all kinds of rook-ceph protections stopping the filesystem from being deleted. Attempting to delete the CephCluster results in an error message from those protections. The only way I could delete the CephFileSystemSubVolumeGroup and CephFileSystem was by manually deleting the finalizers on them. Even after deleting the CephFileSystemSubVolumeGroup and CephFileSystem via the finalizers, the CephCluster still won't delete, as shown in the operator logs. Overriding the CephCluster finalizers also deleted it. So, worst case, we could override the finalizers on those 3 objects in the destroy step of Nebari stage 07 and then nebari destroy would succeed. It's a bit hacky, but it's unlikely to cause a problem since we would only do so when we are deleting the cluster anyway. My view is still that, while this is not ideal, we should merge this in, start testing rook-ceph, and fix this issue before moving Ceph from experimental to general availability.
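For reference, a rough sketch of what the finalizer-override workaround could look like from the stage 07 destroy step, using the Kubernetes Python client. The object names and namespace here are placeholders, not what this PR deploys, and this is only meant for the case where the whole cluster is being torn down anyway:

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# The three Rook objects discussed above, as (group, version, plural, name).
# The resource names and namespace are hypothetical placeholders.
rook_objects = [
    ("ceph.rook.io", "v1", "cephfilesystemsubvolumegroups", "nebari-subvolume-group"),
    ("ceph.rook.io", "v1", "cephfilesystems", "nebari-ceph-fs"),
    ("ceph.rook.io", "v1", "cephclusters", "rook-ceph"),
]

for group, version, plural, name in rook_objects:
    # Empty the finalizer list with a merge patch so deletion can proceed.
    api.patch_namespaced_custom_object(
        group=group,
        version=version,
        namespace="rook-ceph",
        plural=plural,
        name=name,
        body={"metadata": {"finalizers": []}},
    )
```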
@Adam-D-Lewis can you merge the latest develop in?
@Adam-D-Lewis I think this is great work.
Some questions though:
- Are you planning an HA, multi-node Ceph option for this in the future? Is there a follow-on ticket for that?
- Is there a ticket for moving node storage to ceph?
- It appears this is using dynamic pvc provisioning for Ceph which would allow expanding conda-store volumes in the future, correct?
- I didn't see any validation or data-loss warning for changing an existing nfs deployment to ceph or vice versa. I think we will definitely need that; is there a follow-on ticket for it?
- Let's spend some time thinking about how to migrate from nfs to ceph and back
- Let's spend some time thinking about how to use this same design pattern for nfs with dynamic provisioning, etc
Thanks for the review @dcmcand. I responded to your questions below.
This PR provides feature parity with the existing NFS storage state in Nebari. I'm happy to work on allowing additional nodes for Ceph, assuming that work remains in scope for JATIC. I have created a follow-on ticket to make the number of RookCeph backing nodes variable: #2573. I assumed we would validate that Ceph storage works on a single node and then make it the default for future deployments, and only after that point would we work on increasing the number of Rook/Ceph nodes.
I don't have a ticket for this. I'm not sure how we would do this, and I can't think of any strong benefits. Can you explain your thoughts further and/or what you're looking for here?
This does use dynamic provisioning of PVs, since PVCs are provisioned from storage classes. I believe the PVC should be able to increase in size without data loss, but I'll test and get back to you. I'm not sure how increasing the size of the disks that back the RookCeph storage would work; I'll look into that a bit more and get back to you as well. UPDATE: I tested increasing the shared directory and conda storage sizes on a GCP deployment. The PVC for the disk that backs the RookCeph storage increased without a problem (verified in the GCP Console as well), and the PVCs for the shared directory and conda env storage also increased without a problem.
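For illustration, a resize like the one tested above amounts to patching the PVC's storage request; a minimal sketch with the Kubernetes Python client follows (the PVC name, namespace, and size are made-up examples, and the backing StorageClass must allow volume expansion):

```python
from kubernetes import client, config


def expand_pvc(name, namespace, new_size):
    """Request a larger size for an existing PVC, e.g. expand_pvc("conda-store-storage", "dev", "250Gi")."""
    config.load_kube_config()
    core = client.CoreV1Api()
    # Merge-patch only the storage request; the CSI driver handles the actual resize.
    patch = {"spec": {"resources": {"requests": {"storage": new_size}}}}
    core.patch_namespaced_persistent_volume_claim(name=name, namespace=namespace, body=patch)
```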
Currently, the only warning is in the documentation PR. We don't have a way to recognize changes to the Nebari config file during a redeploy, which makes this difficult. I did open this ticket, which explains the issue further and proposes a way to add this ability. I wasn't originally planning on working on that before this PR is merged, but I'd be happy to if you think it's needed. As a workaround, we could require user confirmation on any deploy that has cephfs set as the storage type, but I think it's fair to just put the warning in the docs and list it there as an alpha feature.
I think it's fair not to support that transition, at least initially, while we prove that RookCeph is going to work well. In that case, users would only be expected to make the decision on the initial deployment and not change it afterwards. I do think we should run this on the Quansight deployment to help test it, though. In a case like that, I would tell an admin to copy all the files on the NFS drives, tar them, and copy them to cloud storage, then copy them to the new disk and untar them after we redeploy, similar to what we do in the manual backup docs. We should test this before doing it (particularly for the conda-store NFS drive), but I think it should work fine.
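As a minimal sketch of that tar-based backup/restore (the paths are illustrative assumptions; a real migration should follow the manual backup docs and copy the archive to cloud object storage between the two steps):

```python
import tarfile


def backup_nfs_drive(src_dir="/home/shared", archive="/tmp/nfs-backup.tar.gz"):
    """Tar up the contents of an NFS drive before the redeploy."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src_dir, arcname=".")


def restore_nfs_drive(archive="/tmp/nfs-backup.tar.gz", dest_dir="/home/shared"):
    """Unpack the archive onto the new Ceph-backed volume after the redeploy."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_dir)
```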
Can you elaborate here? What else are you looking for?
👍
👍
I think that is fine for an alpha feature, but given the risk of user data loss, I think this is a problem we need to solve before we go mainstream with ceph, either by figuring out how to migrate, or by blocking switching
I think that is fair now, but I would prefer to have a switching path in the future, even if it is to use backup and restore.
Basically what I talked about in the meeting: removing dedicated PVs and instead creating an NFS storage class that dynamically provisions volumes for each cloud provider. That way NFS would be expandable for all providers in the future.
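As a rough sketch of that idea, an expandable, dynamically provisioning NFS storage class could be created like this with the Kubernetes Python client. The provisioner, parameters, and class name assume the community csi-driver-nfs project and an existing NFS server; they are illustrative, not something Nebari currently ships:

```python
from kubernetes import client, config

config.load_kube_config()
storage_api = client.StorageV1Api()

nfs_class = client.V1StorageClass(
    api_version="storage.k8s.io/v1",
    kind="StorageClass",
    metadata=client.V1ObjectMeta(name="nebari-nfs-dynamic"),  # hypothetical name
    provisioner="nfs.csi.k8s.io",  # assumes the csi-driver-nfs CSI driver is installed
    parameters={"server": "nfs.example.internal", "share": "/exports"},  # placeholder NFS server
    reclaim_policy="Retain",
    allow_volume_expansion=True,
)
storage_api.create_storage_class(body=nfs_class)
```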
Here is the issue about moving other PVCs to use ceph for storage.
Yeah, I'm happy for backup and restore to be the switching plan. I'll test this out after this PR is merged, since I'll want us to test Rook Ceph on the Quansight Nebari deployment then.
I doubt it would be very hard to make that change, and I'd be happy to comment on an issue with more thoughts.
Thanks again for the review, @dcmcand. Is there anything else you want to see before this is merged?
🎉
The failing test in the Local Integration Tests has been seen in other PRs. @viniciusdc is working on a fix, and it is unrelated to this change, so I will merge this PR.
Reference Issues or PRs
Related to #2534
Deploys the rook operator and rook-cluster Helm charts, sets up StorageClasses backed by a Rook Ceph cluster, and uses those storage classes for what were previously the JupyterHub and conda-store NFS drives.
Developed on Azure at the moment.
What does this implement/fix?
Put an x in the boxes that apply
Testing
Any other comments?