Migrate deployment for opam.ocaml.org from EC2 to Scaleway #19

Closed
6 tasks done
dra27 opened this issue Nov 25, 2022 · 34 comments

@dra27
Member

dra27 commented Nov 25, 2022

opam-2.ocaml.org and opam-3.ocaml.org are presently running on Amazon EC2 VMs which need to be decommissioned. opam-2 is also running a manually deployed, older version of the Docker deployment from https://github.com/ocaml-opam/opam2web.

Please could we have two Scaleway VMs, which will then be plumbed into https://deploy.ci.ocaml.org/?repo=ocaml-opam/opam2web and used to replace the VMs behind opam-2 and opam-3?

  • Provision two Scaleway VMs with 2 vCPU, 4GiB memory and at least 100GiB storage (@avsm)
  • Deploy both live and live-staging branches to each of the new VMs from deploy.ci.ocaml.org (@mtelvers)
  • Copy SSL certs from running VMs to the new VMs to avoid a service gap (@avsm; @mtelvers)
  • Switch DNS for opam-2.ocaml.org and opam-3.ocaml.org over to the new machines (@avsm)
  • Deprovision EC2 VMs (@avsm)
  • DNS round-robin configuration for opam.ocaml.org and staging.opam.ocaml.org between opam-2 and opam-3 (@avsm)

At present:

  • opam.ocaml.org is a CNAME for opam-2.ocaml.org (54.146.41.74)
  • staging.opam.ocaml.org is a CNAME for opam-3.ocaml.org (54.224.129.120)
@avsm
Member

avsm commented Dec 9, 2022

opam-4.ocaml.org and opam-5.ocaml.org are both provisioned now @mtelvers, with IPv4 and IPv6 records and in two different availability zones (the Scaleway Amsterdam and Polish datacenters, both of which run on renewable energy). Your ssh keys are installed, so the machines are ready for provisioning. Note that /dev/sdb needs to be created as a data volume.

@mtelvers
Collaborator

The two machines are now configured and respond via https://opam-4.ocaml.org and https://opam-5.ocaml.org. They also respond to the round-robin DNS entries https://opam.ocamllabs.io and https://staging.opam.ocamllabs.io.

The new deployments are currently running in parallel at https://deploy.ci.ocaml.org/?repo=ocaml-opam/opam2web&

I have documented the setup in some detail at http://infra.ocaml.org/opam-ocaml-org.

These are the DNS entries I have in the ocamllabs.io domain:

opam 300 IN A 151.115.76.159
opam 300 IN A 51.158.232.133
opam 300 IN AAAA 2001:bc8:1d80:4600::1
opam 300 IN AAAA 2001:bc8:5080:8e02::1
staging.opam 300 IN A 151.115.76.159
staging.opam 300 IN A 51.158.232.133
staging.opam 300 IN AAAA 2001:bc8:1d80:4600::1
staging.opam 300 IN AAAA 2001:bc8:5080:8e02::1
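
One way to confirm that both machines are actually in the rotation is to compare what a resolver returns with the records listed above. This is only a minimal sketch, assuming `dig` is installed; `normalize`, `same_set` and `check_rr` are helper names invented for this example and are not part of the deployment:

```shell
# Hypothetical helpers for checking a round-robin DNS entry.

# normalize LIST: print the whitespace-separated entries of LIST,
# one per line, sorted and without duplicates.
normalize() {
  printf '%s\n' "$1" | tr -s '[:space:]' '\n' | sed '/^$/d' | sort -u
}

# same_set A B: succeed if both lists hold the same entries,
# ignoring order and duplicates.
same_set() {
  [ "$(normalize "$1")" = "$(normalize "$2")" ]
}

# check_rr NAME TYPE EXPECTED: query NAME with dig and compare the
# answer with the EXPECTED list of addresses.
check_rr() {
  local got
  got=$(dig +short "$1" "$2")
  if same_set "$got" "$3"; then
    echo "ok: $1 $2"
  else
    echo "MISMATCH: $1 $2 returned: $got" >&2
    return 1
  fi
}
```

For example, `check_rr opam.ocamllabs.io A "151.115.76.159 51.158.232.133"` would report a mismatch if either record were missing from the answer. The same call with `AAAA` and the two IPv6 addresses covers the other half of the list.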

@avsm
Member

avsm commented Dec 15, 2022

Outstanding writeup on the infra blog, thanks @mtelvers!

@avsm avsm assigned avsm and unassigned mtelvers Dec 15, 2022
@avsm
Member

avsm commented Dec 15, 2022

Looking at the Dockerfile build:
https://deploy.ci.ocaml.org/job/2022-12-14/215510-ocluster-build-a7d05b

@mtelvers: I think that there might be too much caching there. For example, the ocaml/opam git clone is cached and therefore won't deterministically rebuild if there is a push to that repo. The current deployment simply builds without caching to ensure all archives are fetched, and that would work here for now as well.

@kit-ty-kate @dra27 @rjbou: might you be able to test opam-4.ocaml.org and see if it is suitable to switch over from the existing opam.ocaml.org archive?

@kit-ty-kate
Member

@avsm lgtm. It works just fine for me.

@dra27
Member Author

dra27 commented Dec 16, 2022

Yup, looks very good to me, too, thank you!

Agree that we could temporarily turn off caching in Docker. As it happens, there's some work @rjbou was starting to do for the documentation part of the site, and that will be a perfect opportunity to plumb in the git sha of ocaml/opam in the same way as is done for platform-blog and opam-repository at the moment.

@dra27
Member Author

dra27 commented Dec 21, 2022

I'd misremembered: it's --pull that we can request in Docker build jobs, --no-cache isn't exposed. However, the main deployer is now hooked up to ocaml/opam as for ocaml/platform-blog and ocaml/opam-repository so it correctly rebuilds on documentation changes in opam (tested on live-staging).
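
To illustrate why plumbing in the git sha helps: Docker reuses a cached layer until something that layer depends on changes, so passing the current commit of ocaml/opam as a build argument forces a rebuild exactly when that repository moves. The sketch below shows how this could be reproduced by hand and is an assumption, not a description of what the deployer does. `head_sha` and the `OPAM_GIT_SHA` build argument are hypothetical names, and the Dockerfile would need a matching `ARG OPAM_GIT_SHA` that is used at or before the clone step.

```shell
# head_sha [REF]: read `git ls-remote` output on stdin and print the
# commit hash recorded for REF (default: HEAD).
head_sha() {
  awk -v ref="${1:-HEAD}" '$2 == ref { print $1; exit }'
}

# Example use (needs network access and Docker, so it is left commented out).
# OPAM_GIT_SHA is a hypothetical build-arg name.
#
#   sha=$(git ls-remote https://github.com/ocaml/opam | head_sha)
#   docker build --pull --build-arg OPAM_GIT_SHA="$sha" .
```

With this approach the layers before the first use of the argument stay cached, and only the clone and everything after it are rebuilt when the sha changes.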

I think this is ready for the certificate transfer?

@jpds

jpds commented Dec 30, 2022

The new servers do not appear to respond to requests with TLSv1.3, I've tested this with:

$ curl -I -v --tlsv1.3 --tls-max 1.3 https://opam-5.ocaml.org

I think that the nginx configuration requires ssl_protocols TLSv1.2 TLSv1.3; for this to work.
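
Before reloading nginx, the configuration file itself can be checked for the directive. A minimal sketch, assuming the setting lives in a single file passed as an argument; `has_tls13` is a name made up for this example, and the check is a plain text match that does not follow `include` directives:

```shell
# has_tls13 FILE: succeed if an uncommented ssl_protocols directive
# in FILE lists TLSv1.3.
has_tls13() {
  grep -Eq '^[[:space:]]*ssl_protocols[[:space:]][^;#]*TLSv1\.3' "$1"
}

# Example use; the path is an assumption about where the config lives:
#
#   has_tls13 /etc/nginx/nginx.conf && echo "TLSv1.3 is listed"
```

This only confirms that the directive is present. Whether the server actually negotiates TLSv1.3 also depends on the TLS library nginx was built against, so the curl command above remains the real test.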

Another small suggestion I would have for these boxes would be for them to use BBR as their TCP congestion algorithm:

By default, cubic is used which is a conservative algorithm sensitive to packet loss whereas BBR favours maximizing bandwidth (and ignores packet loss). This can be checked on the server doing:

$ ss --tcp -i

BBR can be set in /etc/sysctl.d/01-bbr.conf with:

net.core.default_qdisc=fq
net.ipv4.tcp_congestion_control=bbr

And then set with:

$ sudo sysctl -p /etc/sysctl.d/01-bbr.conf

The ss command above will then show new connections as having BBR in use (with stats).
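
The steps above can be wrapped in a small script. This is a sketch, assuming a Linux host with `sysctl` available and a kernel that ships the BBR module; the function names are made up for this example:

```shell
# write_bbr_conf PATH: write the two settings suggested above to PATH.
write_bbr_conf() {
  cat > "$1" <<'EOF'
net.core.default_qdisc=fq
net.ipv4.tcp_congestion_control=bbr
EOF
}

# current_cc: print the congestion control algorithm used for new sockets.
current_cc() {
  sysctl -n net.ipv4.tcp_congestion_control
}

# Typical use as root (left commented out because it changes kernel settings):
#
#   write_bbr_conf /etc/sysctl.d/01-bbr.conf
#   sysctl -p /etc/sysctl.d/01-bbr.conf
#   current_cc
```

Because the file sits in /etc/sysctl.d, the setting should also be picked up again on the next boot.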

@mtelvers
Collaborator

Thanks for both of these suggestions. Both are now implemented on the new servers.

@jpds

jpds commented Dec 30, 2022

Just spotted in the new curl output that the servers aren't doing HTTP2 - here's a guide on how to enable it on nginx: https://ubiq.co/tech-blog/how-to-enable-http2-in-nginx/

@mtelvers
Collaborator

mtelvers commented Dec 30, 2022

HTTP2 is now enabled.

@avsm
Member

avsm commented Jan 3, 2023

@mtelvers All looks good to me as well. We need to coordinate a certificate switch before doing the DNS swap, so they can listen on opam.ocaml.org as well. Remind me: which github repo is the nginx and other configurations checked into?

@mtelvers
Collaborator

mtelvers commented Jan 5, 2023

@avsm the configuration is in tarides/infrastructure/ci.ocaml.org. We need to merge the live certificates for opam.ocaml.org into the Docker volume letsencrypt on both machines and then update the nginx/opam.conf to use the new file. Once that is done, we can check that it is working before changing the DNS by adding an entry to /etc/hosts on our local machines. We should also run /etc/cron.daily/letsencrypt-renew and check syslog for any errors.

@tmcgilchrist
Collaborator

Is it possible to do this DNS switch with the current infrastructure setup? @avsm The discussion from #27 has been linked back here and looks to be developing into a larger piece of work than simply switching DNS between machines. Eventually opam2web and opam.ocaml.org will be merged into the main ocaml.org site, so we don't need a forever solution. Just something more maintainable than the current opam.ocaml.org setup.

@avsm
Member

avsm commented Feb 28, 2023

This isn't blocked on #27 - I'll take a look this week, hopefully, at the letsencrypt switch that @mtelvers described.

@avsm
Member

avsm commented Mar 5, 2023

@mtelvers I've copied the current LE certs over to opam-4 in /root/certs. That should be enough to get you bootstrapped, I hope. Let me know when you'd like to do the DNS switch next week. Some things to test:

  • the global nginx is necessary until we figure out Docker Swarm Services with IPv6 #30, but let's specifically test ipv6 access works.
  • what's doing host distro upgrades? opam-4 is already in the "reboot needed" phase of minor kernel upgrades. It should be ok to do regular upgrade/reboots now that the VMs are going round robin.

@mtelvers
Collaborator

mtelvers commented Mar 6, 2023

@avsm Thank you for the certificate file. I have created the requisite entries for certbot to think it was always installed on these machines and take over the renewal.

@kit-ty-kate @dra27 Please can you do a final check? You can target a specific instance by adding one of the following lines to your /etc/hosts file and then running opam update or visiting the URL.

151.115.76.159 opam.ocaml.org
51.158.232.133 opam.ocaml.org
2001:bc8:1d80:4600::1 opam.ocaml.org
2001:bc8:5080:8e02::1 opam.ocaml.org
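
For the web side of the check, curl's `--resolve` option gives the same effect without editing /etc/hosts. A minimal sketch; `resolve_arg` and `check_instance` are names invented for this example, and it exercises only HTTPS access, not `opam update` itself:

```shell
# resolve_arg HOST ADDR [PORT]: print the value for curl's --resolve
# option, wrapping IPv6 addresses in brackets.
resolve_arg() {
  local host=$1 addr=$2 port=${3:-443}
  case $addr in
    *:*) addr="[$addr]" ;;
  esac
  printf '%s:%s:%s\n' "$host" "$port" "$addr"
}

# check_instance ADDR: fetch the response headers for opam.ocaml.org
# from the one backend at ADDR.
check_instance() {
  curl --silent --show-error --head \
       --resolve "$(resolve_arg opam.ocaml.org "$1")" \
       https://opam.ocaml.org/
}

# Example use (needs network access):
#
#   check_instance 151.115.76.159
#   check_instance 2001:bc8:1d80:4600::1
```

Since curl still validates the certificate against opam.ocaml.org, a failure here would also point at a problem with the copied certificates.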

@kit-ty-kate
Member

Works fine for me.

@dra27
Member Author

dra27 commented Mar 6, 2023

LGTM too, thanks!

@mtelvers
Collaborator

mtelvers commented Mar 7, 2023

@avsm I have additionally copied the certificate for staging.opam.ocaml.org from opam-3 over to opam-4 and opam-5. Therefore, we can stagger the switchover of the DNS entries by doing staging before live. These are the DNS entries which are needed.

opam 300 IN A 151.115.76.159
opam 300 IN A 51.158.232.133
opam 300 IN AAAA 2001:bc8:1d80:4600::1
opam 300 IN AAAA 2001:bc8:5080:8e02::1
staging.opam 300 IN A 151.115.76.159
staging.opam 300 IN A 51.158.232.133
staging.opam 300 IN AAAA 2001:bc8:1d80:4600::1
staging.opam 300 IN AAAA 2001:bc8:5080:8e02::1

@avsm
Member

avsm commented Mar 7, 2023

Staging DNS records moved over!

@avsm
Member

avsm commented Mar 7, 2023

I've posted a notice of the move to discuss.

@avsm
Member

avsm commented Mar 7, 2023

@mtelvers I didn't see an answer to this:

what's doing host distro upgrades? opam-4 is already in the "reboot needed" phase of minor kernel upgrades. It should be ok to do regular upgrade/reboots now that the VMs are going round robin.

...in case it matters before we do the switch.

@mtelvers
Collaborator

mtelvers commented Mar 7, 2023

@avsm Sorry, yes, I did these pending updates and will add these machines to the list of machines I update each month. I had also thought that we could use OCurrent to monitor a Git repo containing an Ansible script. OCurrent could run it periodically, and also whenever we commit a change to the list of hosts.
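
As a rough illustration of what such a periodic job could check on each host, here is a sketch that reports whether a reboot is pending. It assumes the hosts are Debian or Ubuntu systems, where package upgrades create /var/run/reboot-required; that is an assumption about these machines, and `reboot_needed` and `report` are names made up for this example:

```shell
# reboot_needed [FLAG_FILE]: succeed if the reboot flag file exists.
# The default path is the one used on Debian and Ubuntu (assumed here).
reboot_needed() {
  [ -e "${1:-/var/run/reboot-required}" ]
}

# report HOST [FLAG_FILE]: print a one-line status for HOST.
report() {
  if reboot_needed "${2:-}"; then
    echo "$1: reboot required"
  else
    echo "$1: no reboot pending"
  fi
}

# Example use on the machine itself:
#
#   report "$(hostname)"
```

With the VMs behind round-robin DNS, a job like this could reboot one flagged host at a time so that the other keeps serving.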

@avsm
Member

avsm commented Mar 7, 2023

Thanks, an Ansible ocurrent runner sounds good -- I've always found it strange that it's normal practice to run Ansible from our laptops, with all the massive key exposure that implies! (but this also seems to be common practice in the Ansible community)

@avsm
Member

avsm commented Mar 9, 2023

Live DNS records for opam.ocaml.org also all moved over now.

@dra27
Member Author

dra27 commented Mar 9, 2023

Thanks, @avsm! There are some users who have ended up embedding the link to opam-3 (e.g. https://github.com/coq-community/docker-base/blob/master/base/bare/Dockerfile#L42).

Before opam-3 is decommissioned, is it worth either CNAME-ing opam-3.ocaml.org to opam.ocaml.org (is that worth doing in general for a decommissioned server name here?) or putting a date on the shutdown of opam-3.ocaml.org?

avsm added a commit to avsm/docker-base that referenced this issue Mar 9, 2023
opam-3.ocaml.org was introduced as a temporary measure in
930e5b1 (coq-community#17), but that
server will be decommissioned quite soon (ocaml/infrastructure#19)
so this will avoid build breakage for Coq.
@avsm
Member

avsm commented Mar 9, 2023

Well spotted! The CNAME won't work due to the certificates not matching, and I'd really rather not have a lot of non-canonical names for the main archive that we have to maintain forever. I've opened a proposed fix for Coq in coq-community/docker-base#23

@tmcgilchrist
Collaborator

Going through the old list of tasks here, there is one item we talked about: moving the https://hub.docker.com/r/ocurrent/opam.ocaml.org images into the official ocaml ops account. If we did this, a few other services on the ocaml.org deployer could move too.
Of the things being deployed by the public ocaml.org deployer, it might make sense to move the two ocaml.org services, watch.ocaml.org, and opam2web (aka opam.ocaml.org) all together. For the other things, like base-image-builder and docs-ci, it makes less sense to change their published image location. What do people think?

@avsm
Member

avsm commented Mar 13, 2023

Tim, I've opened up a new issue for that; just to have a hope of eventually closing this one ;-)

erikmd pushed a commit to coq-community/docker-base that referenced this issue Mar 14, 2023
opam-3.ocaml.org was introduced as a temporary measure in
930e5b1 (#17), but that
server will be decommissioned quite soon (ocaml/infrastructure#19)
so this will avoid build breakage for Coq.
@avsm
Member

avsm commented Mar 20, 2023

The Coq PR is now merged, so I'll decommission opam-3 next week after powering it down for a few days.

@avsm
Member

avsm commented May 7, 2023

The opam-2 and opam-3 EC2 VMs are now shut down.

@mtelvers
Collaborator

mtelvers commented May 7, 2023

I have merged the PR to remove opam-3 from ocurrent deployer.

@avsm
Member

avsm commented May 16, 2023

All done here now, EC2 for OCaml is decommissioned.
