Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Rook Ceph for Jupyterhub and Conda Store drives #2541

Merged
merged 73 commits into from
Aug 21, 2024
Merged

Conversation

Adam-D-Lewis
Copy link
Member

@Adam-D-Lewis Adam-D-Lewis commented Jun 25, 2024

Reference Issues or PRs

Related to #2534
Deploys rook operator and rook-cluster helm charts, sets up some StorageClasses backed by a Rook Ceph cluster and uses those storage classes for what were previously the jupyterhub and conda store NFS drives.

Developed on Azure at the moment.

What does this implement/fix?

Put a x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • Did you test the pull request locally?
  • Did you add new tests?

Any other comments?

@Adam-D-Lewis Adam-D-Lewis linked an issue Jun 25, 2024 that may be closed by this pull request
@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Jun 25, 2024

  • See TODOs in code for what is still left.
  • Also, try to reduce storage node size.
  • Make using NFS or Ceph configurable in nebari-config.yaml
  • Test migration



DEFAULT_DO_NODE_GROUPS = {
"general": DigitalOceanNodeGroup(instance="g-8vcpu-32gb", min_nodes=1, max_nodes=1),
"general": DigitalOceanNodeGroup(instance="g-8vcpu-32gb", min_nodes=1, max_nodes=5),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to keep max_nodes as 5

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Jun 28, 2024

One issue I'm hitting is the Ceph cluster deployment takes a long time but helm thinks it's done b/c it just deploys a CephCluster object which is quick and the ceph operator handles the longer task of waiting for a bunch of rook resources to be created (in successive steps).

What this means in practice currently is that the deploy step times out, and ceph isn't ready for another 5-10 minutes even beyond the timeout.

I'm not sure the best way to address this. Maybe some sort of readiness check (in k8s or in python) can be added or increase the timeout or deploy ceph on an earlier stage of Nebari to give the user extra time to get ready.

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Jun 28, 2024

Currently, there are 3 storage nodes that require about 1cpu/2-6 GB of RAM. On Azure, I added 3 2cpu/8GB Ram nodes for storage adding $210/month to the operating cost of Nebari (on top of the pre-existing $280/month). We may be able to use a single node by using cluster-test.yaml from here, but we likely won't get the same performance speedup and would not have the redundancy benefit that a more typical multi-node ceph deployment can give.

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Jun 28, 2024

B/c there isn't a pressing need for this, and the benefits of using ceph seem to only come with significant increased cost, I'm planning to put this on the backburner for the moment.

Questions we need to resolve:

  • Can you run Ceph without high-availability and disaster recovery with a single server so the cost is not increased beyond current costs? Is there a benefit in doing so? Speeds may not be faster anymore with a single node. You can consolidate block, object, and file system storage under ceph so maybe that's helpful.
  • Do we want to make Ceph configurable (support NFS and Ceph) or only allow Ceph from now on?

@dcmcand
Copy link
Contributor

dcmcand commented Jul 2, 2024

I think we can make Ceph the first stage after the k8s cluster (so stage 3?) and we can see if that is enough.

As for costs, we could make ceph single node the default and have a HA flag in the config to move to multnode, HA setup. That way we still have the abstracted storage interface, but not the increased cost.

@Adam-D-Lewis can you test with a single node ceph and see how it runs?

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Jul 2, 2024

@dcmcand Yeah, I'll test with a single node

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Jul 31, 2024

This PR is ready for review. You can do a deployment by adding storage.type = cephfs to the nebari config file. You shouldn't switch an existing nebari deployment from nfs (default) to cephfs b/c you'll lose all your files and conda envs. I'll write up a docs PR detailing this as experimental. I think we should start using this on Quansight's deployment (after we do a backup) to get a feel for it, and eventually make it the default if all goes well.

storage:
  type: cephfs

@Adam-D-Lewis Adam-D-Lewis added the project: JATIC Work item needed for the JATIC project label Aug 1, 2024
@viniciusdc
Copy link
Contributor

@Adam-D-Lewis did you workout the destroy bits?

@Adam-D-Lewis
Copy link
Member Author

@Adam-D-Lewis did you workout the destroy bits?

No, not yet. The destroy still occurs, but it does take longer at the moment, and a message saying stage 07 had problems destroying, but only stages 01 and 02 are usually necessary for a successful destroy.

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Aug 1, 2024

Update: Fixed I deployed on Azure today and timed out. Redeploying resulted in a successful deployment. I'll work on fixing that, but I'd appreciate an initial review even before then.

@Adam-D-Lewis
Copy link
Member Author

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Aug 5, 2024

I attempted to just have the operator wait before it was destroyed with a local-exec provisioner, but that didn't work. There are all kinds of rook-ceph protections stopping the filesystem from being deleted. Attempting to delete the cephcluster results in a message along the lines of
"failed to reconcile CephCluster "prod/prod". failed to clean up CephCluster "prod/prod": failed until dependent objects delted CephFileSystemSubVolumeGroup and CephFileSystem."

The only way I could delete the CephFileSystemSubVolumeGroup and CephFileSystem was through manually deleting the finalizers on them. I tried adding the rook.io/force-deletion="true" annotation, but that didn't work. I also tried setting the cephClusterSpec.cephConfig.cleanupPolicy.confirmation = "yes-really-destroy-data" in the cephcluster helm chart values, but that didn't seem to have an effect either.

Even after deleting the CephFileSystemSubVolumeGroup and CephFileSystem via the finalizers, the Cephcluster still won't delete. The following logs are in the operator.
"failed to reconcile CephCluster "prod/prod". failed to clean up CephCluster "prod/prod": failed to check if volumes exist for CephCluster in namespace "prod": waiting for csi volume attachments in cluster "prod" to be cleaned up".

Overriding the Cephcluster finalizers also deleted it. So worse case, we could override the finalizers on those 3 objects in the destroy step of Nebari stage 07 and then nebari destory would succeed, but it's a bit hacky, but it's unlikely to cause a problem since we would only do so when we are deleting the cluster anyway.

My view is still that while this is not ideal, we should still merge it in and start testing rook-ceph and fix this issue before moving ceph from expirmental to general availability.

@dcmcand
Copy link
Contributor

dcmcand commented Aug 19, 2024

@Adam-D-Lewis can you merge the lastest develop in?

Copy link
Contributor

@dcmcand dcmcand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Adam-D-Lewis I think this is great work.

Some questions though:

  1. Are you planning on a HA multiple ceph node option for this in the future? Is there a follow on ticket for that?
  2. Is there a ticket for moving node storage to ceph?
  3. It appears this is using dynamic pvc provisioning for Ceph which would allow expanding conda-store volumes in the future, correct?
  4. I didn't see any validation or data loss warning for changing an existing nfs deployment to ceph or vise versa. I think we will definitely need that, is there a follow on ticket for that?
  5. Let's spend some time thinking about how to migrate from nfs to ceph and back
  6. Let's spend some time thinking about how to use this same design pattern for nfs with dynamic provisioning, etc

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Aug 19, 2024

Thanks for the review @dcmcand. I responded to your questions below.

@Adam-D-Lewis I think this is great work.

Some questions though:

  1. Are you planning on a HA multiple ceph node option for this in the future? Is there a follow on ticket for that?

This PR provides feature parity with the existing NFS storage state in Nebari. I'm happy to work on allowing the use of additional nodes to Ceph assuming that work remains in scope for JATIC. I have created a follow on ticket to make the number of RookCeph backing nodes variable - #2573. I assumed we would validate that Ceph storage works for a single node and then make it the default for future deployments, and only after that point would we work on increasing the number of rook/ceph nodes.

  1. Is there a ticket for moving node storage to ceph?

I don't have a ticket for this. I'm not sure how we would do this, and I can't think of any strong benefits. Can you explain your thoughts further and/or what you're looking for here?

  1. It appears this is using dynamic pvc provisioning for Ceph which would allow expanding conda-store volumes in the future, correct?

This does use dynamic provisioning of PVs since PVCs are provisioned from storage classes. I believe the PVC should be able to increase in size without data loss, but I'll test and get back to you. I'm not sure how increasing the size of the disks that back the RookCeph storage would work. I'll look into that a bit more and get back to you as well.

UPDATE: I tested increasing the shared directory and the conda storage sizes on a GCP deployment. The PVC for the disk that backs the RookCeph storage was increased without a problem (verified in GCP Console as well) and the PVCs for the shared directory and conda env storage also increased without a problem.

  1. I didn't see any validation or data loss warning for changing an existing nfs deployment to ceph or vise versa. I think we will definitely need that, is there a follow on ticket for that?

Currently, the only warning is in the documentation PR. We don't have a way to recognize changes to the nebari config file during a redeploy which makes this difficult. I did open this ticket which explains the issue further and proposes a way to add this ability. I wasn't originally planning on working on that before this PR is merged, but I'd be happy to if you think it's needed. As a workaround we could require user confirmation on any deploy which has cephfs set as the storage, but I think it's fair to just put the warning in the docs as well as listing it in the docs as an alpha feature.

  1. Let's spend some time thinking about how to migrate from nfs to ceph and back

I think it's fair to not support that transition, at least initially while we prove that RookCeph is going to work well. So in that case, users would only be expected to make the decision on the initial deployment and not change it afterwards. I do think we should run this on the Quansight deployment to help test it though. In a case like that, I would tell an admin to copy all the files in the NFS drives, tar them and copy to cloud storage. Then copy them to the new disk and untar them after we redeploy similar to what we do in the manual backup docs. We should test this before doing it (particularly for the conda-store NFS drive), but I think it should work fine.

  1. Let's spend some time thinking about how to use this same design pattern for nfs with dynamic provisioning, etc

Can you elaborate here? What else are you looking for?

@Adam-D-Lewis Adam-D-Lewis requested a review from dcmcand August 19, 2024 16:45
@dcmcand
Copy link
Contributor

dcmcand commented Aug 20, 2024

  1. Are you planning on a HA multiple ceph node option for this in the future? Is there a follow on ticket for that?

This PR provides feature parity with the existing NFS storage state in Nebari. I'm happy to work on allowing the use of additional nodes to Ceph assuming that work remains in scope for JATIC. I have created a follow on ticket to make the number of RookCeph backing nodes variable - #2573. I assumed we would validate that Ceph storage works for a single node and then make it the default for future deployments, and only after that point would we work on increasing the number of rook/ceph nodes.

👍

  1. Is there a ticket for moving node storage to ceph?

I don't have a ticket for this. I'm not sure how we would do this, and I can't think of any strong benefits. Can you explain your thoughts further and/or what you're looking for here?

The benefit would be allowing the general node to spin up in a different AZ to allow multi AZ deployments. That has caused a number of folks errors when they upgraded.
I misspoke when I said node storage, because I was remembering the issue incorrectly. Basically any PVC's made by pods in the cluster would need to move to ceph. Right now we have at least one PVC against ebs which can't cross AZ's

  1. It appears this is using dynamic pvc provisioning for Ceph which would allow expanding conda-store volumes in the future, correct?

This does use dynamic provisioning of PVs since PVCs are provisioned from storage classes. I believe the PVC should be able to increase in size without data loss, but I'll test and get back to you. I'm not sure how increasing the size of the disks that back the RookCeph storage would work. I'll look into that a bit more and get back to you as well.

👍

UPDATE: I tested increasing the shared directory and the conda storage sizes on a GCP deployment. The PVC for the disk that backs the RookCeph storage was increased without a problem (verified in GCP Console as well) and the PVCs for the shared directory and conda env storage also increased without a problem.

  1. I didn't see any validation or data loss warning for changing an existing nfs deployment to ceph or vise versa. I think we will definitely need that, is there a follow on ticket for that?

Currently, the only warning is in the documentation PR. We don't have a way to recognize changes to the nebari config file during a redeploy which makes this difficult. I did open this ticket which explains the issue further and proposes a way to add this ability. I wasn't originally planning on working on that before this PR is merged, but I'd be happy to if you think it's needed. As a workaround we could require user confirmation on any deploy which has cephfs set as the storage, but I think it's fair to just put the warning in the docs as well as listing it in the docs as an alpha feature.

I think that is fine for an alpha feature, but given the risk of user data loss, I think this is a problem we need to solve before we go mainstream with ceph, either by figuring out how to migrate, or by blocking switching

  1. Let's spend some time thinking about how to migrate from nfs to ceph and back

I think it's fair to not support that transition, at least initially while we prove that RookCeph is going to work well. So in that case, users would only be expected to make the decision on the initial deployment and not change it afterwards. I do think we should run this on the Quansight deployment to help test it though. In a case like that, I would tell an admin to copy all the files in the NFS drives, tar them and copy to cloud storage. Then copy them to the new disk and untar them after we redeploy similar to what we do in the manual backup docs. We should test this before doing it (particularly for the conda-store NFS drive), but I think it should work fine.

I think that is fair now, but I would prefer to have a switching path in the future, even if it is to use backup and restore.

  1. Let's spend some time thinking about how to use this same design pattern for nfs with dynamic provisioning, etc

Can you elaborate here? What else are you looking for?

Basically what I talked about in the meeting, removing dedicated PV's and instead creating an NFS storage class that dynamically provisions for each cloud provider. That way nfs wouuld be exandable for all providers in the future

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Aug 20, 2024

  1. Is there a ticket for moving node storage to ceph?

I don't have a ticket for this. I'm not sure how we would do this, and I can't think of any strong benefits. Can you explain your thoughts further and/or what you're looking for here?

The benefit would be allowing the general node to spin up in a different AZ to allow multi AZ deployments. That has caused a number of folks errors when they upgraded. I misspoke when I said node storage, because I was remembering the issue incorrectly. Basically any PVC's made by pods in the cluster would need to move to ceph. Right now we have at least one PVC against ebs which can't cross AZ's

Here is the issue about moving other PVCs to use ceph for storage.

  1. Let's spend some time thinking about how to migrate from nfs to ceph and back

I think it's fair to not support that transition, at least initially while we prove that RookCeph is going to work well. So in that case, users would only be expected to make the decision on the initial deployment and not change it afterwards. I do think we should run this on the Quansight deployment to help test it though. In a case like that, I would tell an admin to copy all the files in the NFS drives, tar them and copy to cloud storage. Then copy them to the new disk and untar them after we redeploy similar to what we do in the manual backup docs. We should test this before doing it (particularly for the conda-store NFS drive), but I think it should work fine.

I think that is fair now, but I would prefer to have a switching path in the future, even if it is to use backup and restore.

Yeah, I'm happy for backup and restore to be the switching plan. I'll test this out after this is merged since I'll want us to test using Rook Ceph on the Quansight Nebari deployment after this PR is merged.

  1. Let's spend some time thinking about how to use this same design pattern for nfs with dynamic provisioning, etc

Can you elaborate here? What else are you looking for?

Basically what I talked about in the meeting, removing dedicated PV's and instead creating an NFS storage class that dynamically provisions for each cloud provider. That way nfs wouuld be exandable for all providers in the future

I doubt it would be very hard to make that change, and I'd be happy to comment on an issue with more thoughts.

@Adam-D-Lewis
Copy link
Member Author

Thanks again for the review, @dcmcand. Is there anything else you want to see before this is merged?

Copy link
Contributor

@dcmcand dcmcand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@Adam-D-Lewis
Copy link
Member Author

The failing test in the Local Integration Tests has been seen in other PRs. @viniciusdc is working on a fix, and it is unrelated to this PR, so I will merge this PR.

@Adam-D-Lewis Adam-D-Lewis merged commit 3695eb4 into develop Aug 21, 2024
27 of 28 checks passed
@Adam-D-Lewis Adam-D-Lewis deleted the rook branch August 21, 2024 16:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
project: JATIC Work item needed for the JATIC project
Projects
Status: Done 💪🏾
Development

Successfully merging this pull request may close these issues.

Investigate Ceph for use in Nebari
5 participants