SSL certificate management: from chaos on thousands of servers to a centralized

Started by Kevin56, Aug 01, 2022, 03:37 AM


Kevin56 (topic starter)

What is behind the words "the largest online school in Europe"? On the one hand, it means 2,000 lessons per hour, 15,000 teachers, 120,000 students. And for me, an infrastructure engineer, it also means 200+ servers, hundreds of services (micro and not so micro), and domain names from the 2nd to the 6th level. SSL is needed everywhere and, accordingly, so is a certificate.





For the most part, we use Let's Encrypt certificates. Their advantages are that they are free and that issuance is fully automated. On the other hand, they have a peculiarity: a short validity period of only three months. As a result, they need to be renewed frequently. We tried to automate this somehow, but there was still a lot of manual work, and something constantly broke. Two years ago, we came up with a simple and reliable method for renewing this pile of certificates and have since forgotten about the problem.

From one certificate on one server to hundreds in several data centers

Once upon a time there was only one server. Certbot lived on it and ran from cron. Then one server stopped coping with the load, so another appeared. And then more and more. Each of them had its own certificates with its own unique set of names, and renewal had to be configured everywhere. In some places, existing certificates were copied during expansion, but nobody remembered to renew them.

To obtain a Let's Encrypt certificate, you must prove ownership of the domain name specified in the certificate. This is usually done with a reverse HTTP request: the CA requests a file from the server behind that domain name.
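
For illustration, here is a minimal Python sketch of what that reverse request looks like under the ACME HTTP-01 challenge (RFC 8555). The CA fetches a well-known URL on the domain and expects the "key authorization" string in the response body. The token and thumbprint values below are made-up placeholders, and this is not code from our playbooks.

```python
def http01_url(domain: str, token: str) -> str:
    """URL the CA requests to validate `domain` (HTTP-01 challenge)."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def key_authorization(token: str, account_thumbprint: str) -> str:
    """Body the server must return: token joined to the account key thumbprint."""
    return f"{token}.{account_thumbprint}"


# Placeholder values, for illustration only.
print(http01_url("crm.skyeng.fr", "tok123"))
print(key_authorization("tok123", "thumb456"))
```

This also shows why the method needs the host to answer HTTP requests coming from the Internet.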


Here are a couple of common challenges we faced as we grew:

    Not all new servers were reachable from the outside: some were moved behind the incoming traffic balancer and were no longer accessible from the Internet. Certificates had to be copied to them manually.
    There were also servers without HTTP at all. Say, mail. Or databases. Or some LDAP. Or something else weird. Certificates had to be copied there manually too.


Some places had been using self-signed certificates for quite some time, and this seemed like a good solution where authentication is not needed, for example for internal testing. To stop the browser from constantly reporting a "suspicious site", you just add our root certificate to the list of trusted ones, and you're done. But later, difficulties arose here too.


Finding a Solution

First of all, of course, you need monitoring, so that you find out about expiring certificates not when they have already expired, but a little earlier. OK then. There is monitoring, and we now know that certificates will soon expire here and there. Now what do we do?
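
As a rough idea of what such a check can look like, here is a small Python sketch using only the standard library. It is not our actual monitoring; the 14-day threshold is an assumed value. Note that the default TLS context verifies the certificate, so an already expired or self-signed certificate raises an error instead of returning a date.

```python
import socket
import ssl
import time


def days_left(not_after: str, now: float) -> int:
    """Whole days between `now` (epoch seconds) and a certificate's notAfter string."""
    expires = ssl.cert_time_to_seconds(not_after)
    return int((expires - now) // 86400)


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect over TLS and return the peer certificate's notAfter field."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_attention(host: str, threshold_days: int = 14) -> bool:
    """True if the certificate on `host` expires within `threshold_days` (assumed value)."""
    return days_left(fetch_not_after(host), time.time()) < threshold_days
```

Calling needs_attention("skyeng.fr") from a scheduled job and alerting on True would be enough to get the "a little earlier" warning.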


How about using wildcard certificates? Let's! Let's Encrypt already issues them. True, you will have to set up confirmation of domain ownership through DNS. And our DNS lives in AWS Route53. So you have to distribute AWS access credentials to all servers. And when new servers appear, copy all this setup there too.
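
For reference, a short Python sketch of what "confirmation through DNS" means in the ACME DNS-01 challenge (RFC 8555): a TXT record named _acme-challenge.<domain> must contain the unpadded base64url SHA-256 digest of the key authorization. The key authorization string used here is a made-up placeholder, and actually writing the record to Route53 is left out.

```python
import base64
import hashlib


def dns01_record(name: str, key_authorization: str) -> tuple[str, str]:
    """Return (TXT record name, TXT value) for an ACME DNS-01 challenge."""
    # A wildcard name is validated on its base domain: *.skyeng.fr -> skyeng.fr
    base = name[2:] if name.startswith("*.") else name
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url without "=" padding, as the ACME specification requires
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"_acme-challenge.{base}", value


# Placeholder key authorization, for illustration only.
record, value = dns01_record("*.skyeng.fr", "tok123.thumb456")
print(record, value)
```

Whoever creates this record needs write access to the DNS zone, which is exactly why the AWS credentials become a concern.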

Okay, 3rd-level names are covered by a wildcard. But what about names of the 4th level and above? We have many teams developing various services. It is now customary to split the frontend and the backend. And if the frontend gets a 3rd-level name like service.skyeng.fr, then the backend is given the name api.service.skyeng.fr. Hmm, maybe they should stop doing that?
Great idea! And what about the dozens of existing ones? Maybe drive them all into one domain name with an iron fist? Let's replace all these names of different levels with URLs like skyeng.fr/service. Technically this is an option, but how long will it take? And how do we justify the need for this to the business? We have 30+ development teams; go and persuade every one of them, and it will take at least six months. We would also create a single point of failure. Whichever way you look at it, this is a controversial decision.
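
To see why a 3rd-level wildcard does not help with names like api.service.skyeng.fr, here is a simplified Python sketch of wildcard matching: the "*" stands for exactly one leftmost DNS label and never spans a dot. This is an illustration of the rule, not the code of any particular TLS library.

```python
def wildcard_covers(pattern: str, name: str) -> bool:
    """True if certificate name `pattern` matches host `name`.

    A leading "*" replaces exactly one DNS label, so the number of
    labels in the pattern and in the name must be the same.
    """
    p_labels = pattern.lower().split(".")
    n_labels = name.lower().split(".")
    if len(p_labels) != len(n_labels):
        return False
    if p_labels[0] == "*":
        return p_labels[1:] == n_labels[1:]
    return p_labels == n_labels


print(wildcard_covers("*.skyeng.fr", "service.skyeng.fr"))
print(wildcard_covers("*.skyeng.fr", "api.service.skyeng.fr"))
```

So every service with its own api.* name needs either its own wildcard (*.service.skyeng.fr) or an explicit name in some certificate.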

What other ideas are there? Maybe make one certificate that includes everything-everything-everything, and install it on all servers? This could solve our problems, but Let's Encrypt allows only 100 names per certificate, and we already have more microservices than that.

And what about the testers? So far nothing has been figured out for them, and they constantly complain. As the saying goes, everything is bullshit except the bees. The bees are bullshit too, but there are a lot of them. Every programmer or tester is given a test server; we call them testings. Testers are not bees, but there are already well over a hundred of them. And all projects are deployed to each testing. All of them. So if production needs N certificates, each testing needs the same number. So far, they are self-signed. It would be great to replace them with real ones ...


Two playbooks and one source of truth

A swan, a crayfish and a pike will never pull the cart anywhere. We need a single flight control center for the servers. In our case, this is Ansible. Certbot on every server is evil. Let all certificates be stored in one place. If someone somewhere needs a certificate, they come to this place and take the latest version off the shelf. And we make sure that the certificates in this store are always up to date.

AWS access credentials are also present only in this one place. Accordingly, we no longer have questions about setting up the AWS CLI on a new server, who has access to Route53, and the like.

All required certificates are described in one file in Ansible in YAML format:

    certificates:
      - common_name: skyeng.fr
        alt_names:
          - "*.skyeng.fr"
      - common_name: olympiad.skyeng.fr
        alt_names:
          - "*.olympiad.skyeng.fr"
          - api.content.olympiad.skyeng.fr
          - games.skyeng.fr
      - common_name: skyeng.tech
        alt_names:
          - "*.skyeng.tech"

      . . .
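
A hedged Python sketch of how such a list can be sanity-checked before anything is requested: each entry is flattened into its full name list and compared with the 100-name limit mentioned earlier. The data mirrors the first entries of the YAML above, written as Python literals to stay within the standard library; the function names are made up for this example.

```python
MAX_NAMES = 100  # Let's Encrypt limit on names per certificate

# Same structure as the YAML file, as plain Python data.
certificates = [
    {"common_name": "skyeng.fr", "alt_names": ["*.skyeng.fr"]},
    {
        "common_name": "olympiad.skyeng.fr",
        "alt_names": [
            "*.olympiad.skyeng.fr",
            "api.content.olympiad.skyeng.fr",
            "games.skyeng.fr",
        ],
    },
]


def all_names(entry: dict) -> list[str]:
    """Common name first, then alt names, with duplicates removed."""
    names = [entry["common_name"], *entry.get("alt_names", [])]
    return list(dict.fromkeys(names))


def check(entry: dict) -> list[str]:
    """Return the name list, or raise if it would exceed the per-certificate limit."""
    names = all_names(entry)
    if len(names) > MAX_NAMES:
        raise ValueError(
            f"{entry['common_name']}: {len(names)} names, limit is {MAX_NAMES}"
        )
    return names


for entry in certificates:
    print(check(entry))
```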


Regularly, one playbook is launched, which goes through this list and does its hard work - in essence, everything that certbot does:

    creates an account with Let's Encrypt Certificate Authority
    generates a private key
    generates a (not yet signed) certificate - the so-called certificate signing request
    sends a signing request
    receives a DNS challenge
    puts received records in DNS
    sends a signing request again
    and, having finally received the signed certificate, places it in the store.


The playbook runs once a day. If it was unable to renew some certificate for any reason, be it network problems or some kind of error on the Let's Encrypt side, this is not a problem. It will renew it on the next run.
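
The "not a problem, next run will fix it" logic can be sketched in Python roughly like this. It is an illustration, not our playbook: the 30-day renewal threshold is an assumed value, and issue is a hypothetical callable standing in for the whole ACME exchange.

```python
from datetime import datetime, timedelta
from typing import Callable

RENEW_BEFORE = timedelta(days=30)  # assumed threshold, not taken from the playbook


def renew_due(
    expiry: dict[str, datetime],
    issue: Callable[[str], datetime],
    now: datetime,
) -> list[str]:
    """Renew every certificate close to expiry; return the names that failed.

    `expiry` maps common name -> current notAfter and is updated in place.
    `issue` (hypothetical) performs the ACME exchange and returns the new
    expiry date, raising an exception on any error.
    """
    failed = []
    for name, not_after in list(expiry.items()):
        if not_after - now > RENEW_BEFORE:
            continue  # still fresh, nothing to do
        try:
            expiry[name] = issue(name)
        except Exception:
            # The old certificate stays in the store; the next daily run retries.
            failed.append(name)
    return failed
```

With a three-month certificate and renewal starting about a month before expiry, a daily run gets roughly thirty attempts before anything actually expires, which is why a single failed run does not matter.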

Now, when some service host needs SSL, you can go to this store and take a few files from there, which is the very simple operation that the second playbook performs. Which certificates a host needs is described in that host's parameters, in inventories/host_vars/server.yml:

    certificates:
      - common_name: skyeng.fr
        handler: reload nginx
      - common_name: crm.skyeng.fr

      . . .


If the files have changed, Ansible triggers the handler. Typically this reloads Nginx (in our case, this is the default action). In the same way, you can obtain certificates from other CAs that use the ACME protocol.
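
The "copy, and run the handler only if something changed" behaviour is what Ansible gives for free. For readers who do not use Ansible, here is a rough Python equivalent; the function names and the handler callback are made up for this sketch.

```python
from pathlib import Path
from typing import Callable


def install_if_changed(src: Path, dst: Path) -> bool:
    """Copy `src` to `dst` only when the content differs; True means "changed"."""
    data = src.read_bytes()
    if dst.exists() and dst.read_bytes() == data:
        return False
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_bytes(data)
    return True


def deploy(files: list[tuple[Path, Path]], handler: Callable[[], None]) -> bool:
    """Install all files, then call `handler` once if any of them changed."""
    # A list (not a generator) so that every file is installed before any() is evaluated.
    changed = [install_if_changed(src, dst) for src, dst in files]
    if any(changed):
        handler()  # for example, reload nginx
    return any(changed)
```

Running deploy twice in a row with unchanged files does nothing the second time, so the service is not reloaded needlessly.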


Summary

    Before: we had many different configurations. Something was constantly breaking. I often had to log in to servers and figure out what had broken there again.
    Now: we have two playbooks, and everything is recorded in one place. Everything works like clockwork. Life has become more boring.


Testing

Yes, but what about the testers with their testings? Each programmer or tester is given a personal test server, a testing. There are currently about 200 of them. They have names like test-y123.skyeng.link, where 123 is the testing number. Creation and deletion of testings is automated. One of the steps is installing an SSL certificate on the testing. The SSL certificate is pre-generated, with wildcard names:

    ssl_cert_pattern:
      - "*"
      - "*.auth"
      - "*.bill"

      . . .


There are about 30 names in total. So the finished certificate includes the names

    test-y123.skyeng.link
    *.test-y123.skyeng.link
    *.auth.test-y123.skyeng.link
    *.bill.test-y123.skyeng.link


and so on.
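
The expansion from patterns to certificate names is simple enough to show in a few lines of Python. This is only a sketch: that the bare host name is always added first is an assumption based on the example list above.

```python
def expand_names(host: str, patterns: list[str]) -> list[str]:
    """Build the certificate name list for one testing host."""
    names = [host]  # assumption: the bare host name is always included
    for pattern in patterns:
        names.append(f"{pattern}.{host}")
    return names


for name in expand_names("test-y123.skyeng.link", ["*", "*.auth", "*.bill"]):
    print(name)
```

With about 30 patterns per testing, each certificate stays well under the 100-name limit.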


When a developer or tester leaves, their testing is deleted. The certificate remains, ready to be used again. All of this is stored you know where and rolled out to hosts you know how.

berto

This is all cool, but why not publish your role/playbook and show the complete code and how it works, rather than fragments?
I think I'm not the only one who would be interested to see it.

In my opinion, obtaining LE certificates on the web server itself is a very dubious decision in terms of architectural elegance.

arthyk

Yes, the scheme is not the simplest, but the weakest link is the programmers and testers, although creating dedicated test servers seems like a natural solution to that problem. In general, it seems that the whole development, i.e. the main activity of the company, "dances" around certificates. Three months, of course, is a very short interval for doing this painstaking work manually on a regular basis. Well, since a solution has been found, maybe someone else will adopt this method. :D