SLA will save you

Started by Charlesth, Aug 30, 2022, 06:32 AM


Charlesth (Topic starter)

An SLA, or "service-level agreement", is an agreement and guarantee between the customer and the service provider about what the client will receive in terms of service. It also stipulates compensation for downtime caused by the supplier, and so on.
In essence, an SLA is a credential with which a data center or hosting provider convinces a potential client that he will be treated well in every possible way. The catch is that you can write anything in an SLA, and the events described in the document do not occur very often. An SLA is far from a reliable guide when choosing a data center, and you certainly shouldn't rely on it.

We are all used to signing contracts that impose certain obligations. The SLA is no exception, and it is usually the most out-of-touch document imaginable. Perhaps only an NDA in a jurisdiction where the concept of "trade secrets" does not really exist is more useless. The whole problem is that an SLA does not help the client choose the right supplier in any way; it only throws dust in the eyes.

What do hosting providers most often write in the public version of their SLA? The first line is usually the "reliability" of the hosting, with numbers from 98% to 99.999%. In fact, these figures are just a pretty invention of marketers. Once upon a time, when hosting was young and expensive, and specialists only dreamed of clouds (and of broadband access for everyone), the server uptime figure was extremely important. Now that all providers use more or less the same equipment, sit on the same backbone networks and offer the same service packages, the uptime figure tells you next to nothing.
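
To put those percentages in perspective, here is a minimal Python sketch that converts an advertised uptime figure into the downtime it still permits. The function name allowed_downtime_hours and the 365-day year are assumptions for illustration, not part of any provider's SLA.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime (in hours) that an uptime percentage still permits."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # Print the yearly downtime budget for the figures usually advertised.
    for pct in (98, 99, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% uptime allows {hours:.2f} h "
              f"({hours * 60:.1f} min) of downtime per year")
```

The gap between 98% and 99.999% looks large on paper (about a week versus a few minutes per year), but the number says nothing about whether a particular provider will actually stay within it.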

Is there a "correct" SLA at all?

Of course, ideal versions of an SLA do exist, but all of them are non-standard documents, written and negotiated individually between the client and the supplier. This type of SLA most often covers some kind of contract work rather than services.

What should be in a good SLA? The TL;DR: a good SLA is a document that regulates the relationship between two parties and gives one of them (the customer) maximum control over the process.
Here is how it works in the real world: there is a document that describes the overall interaction processes and regulates the relationship between the parties. It sets boundaries and rules, and in itself becomes leverage that both parties can use to the fullest. Thanks to a proper SLA, the customer can simply make the contractor work as agreed, and the contractor can fend off the "wants" of an overly active client that the contract does not justify. It looks like this: "Our SLA says this and that, so leave us alone, we are doing everything as agreed."

That is, a "correct SLA" = an "adequate contract for the provision of services", and it gives control over the situation. And this is possible only when the parties work "on an equal footing".

What they write on the site and what awaits in reality are two different things.
In general, everything we discuss below comes down to typical marketing tricks and tests of your attentiveness.

Please remember the point about data centers; we will return to it a little later. In the meantime, let's talk about ideal fault-tolerance statistics and what a person faces when his server does land in that "0.0000001% of outages".

With figures of 98% and above, any outage is an event on the verge of statistical error. The equipment and the connection either work or they don't. You can use a hosting provider with a "reliability" figure of 50% (according to its own SLA) for years without a single problem, or go down once a month for a couple of days with a provider that declares 99.99%.

When the outage does come (and, we remind you, everyone goes down someday), the client runs into an internal corporate machine called "support", and the service contract and the SLA are brought out. What does this mean in practice:

    Most likely, you won't be able to claim anything at all for the first four hours of downtime, although some hosting providers start recalculating the tariff (paying compensation) from the moment the outage begins.
    If the server is unavailable for longer, you may be able to request a tariff recalculation.
    All of this applies only if the problem is the supplier's fault.
    If the problem was caused by a third party (on the backbone), then "no one is to blame", and when it gets solved is a matter of luck.
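
The terms above can be sketched as a small calculation. Everything here is an assumption for illustration: the four-hour grace period mirrors the figure mentioned above, the pro-rata formula is just one possible reading of "tariff recalculation", and sla_credit is a hypothetical helper, not any provider's actual policy.

```python
def sla_credit(downtime_hours: float,
               monthly_fee: float,
               provider_at_fault: bool,
               grace_hours: float = 4.0,
               hours_in_month: float = 30 * 24) -> float:
    """Pro-rata refund for downtime beyond the grace period.

    Assumptions: nothing is paid for the first `grace_hours`, nothing is
    paid when the provider is not at fault, and the refund never exceeds
    the monthly fee.
    """
    if not provider_at_fault:
        return 0.0
    billable_hours = max(0.0, downtime_hours - grace_hours)
    credit = monthly_fee * billable_hours / hours_in_month
    return min(credit, monthly_fee)


if __name__ == "__main__":
    # A 10-hour outage on a plan that costs 300 per month.
    print(sla_credit(downtime_hours=10, monthly_fee=300, provider_at_fault=True))
```

Under these assumed terms, a 10-hour outage on a plan costing 300 per month yields a credit of 2.5, and the same outage blamed on a third party yields nothing.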

That said, it is important to understand that you never get access to the engineering team. Most often you are stopped at the first line of support, which corresponds with you while the real engineers try to fix the situation. Familiar scenario?

Here, many rely on the SLA, which, it seems, should protect you from such situations. In fact, companies rarely go beyond what their own document requires, and they are often able to turn the situation so as to minimize their own costs.
The primary task of the SLA is to lull the client's vigilance and convince him that even in an unforeseen situation "everything will be fine". Its second task is to spell out the main critical points and give the service provider room to maneuver, that is, the ability to attribute a failure to something for which the provider is "not responsible".

At the same time, large clients do not really care about compensation under the SLA. "SLA compensation" is a refund within the tariff in proportion to the downtime, which will never cover even 1% of the potential monetary and reputational losses. For such a client it is far more important that the problem is fixed as soon as possible than to get some kind of "tariff recalculation".

"Many data centers around the world" is a cause for concern

We put the situation where the service provider has a large number of data centers into a separate category, because in addition to the obvious communication problems described above, non-obvious problems also surface. For instance, your service provider may not have access to "its own" data centers.

In our last article, we wrote about the types of affiliate programs and mentioned the White Label model, the essence of which is reselling someone else's capacity under your own brand. The vast majority of modern hosting providers that claim to have "their own data centers" in many regions are resellers working under the White Label model. That is, physically they have nothing to do with that hypothetical data center in Switzerland, Germany or the Netherlands.

This leads to some very interesting conflicts. Your SLA with the service provider is still in force, but the provider cannot meaningfully influence the situation in the event of an accident. It is itself dependent on its own supplier: the data center from which the rack capacity was bought for resale.

Thus, if what matters to you is not only nice wording about reliability and service in the contract and SLA, but also the service provider's ability to solve problems quickly, you should work directly with the owner of the facilities. In practice, this means dealing with the data center itself.

We deliberately leave aside the case where many DCs really do belong to one company. As for resellers, when the cases described in the SLA occur, they become exactly the same hostages of the situation as you are.

This reminds us once again that an SLA is useless if you don't understand how the supplier is organized and what capacity it really has.

What is the result

Server crashes are always unpleasant and can happen to anyone, anywhere. The question is how much control over the situation you want. There are not many direct capacity providers on the market now, and as for the large players, of the dozen DCs across Europe that you can access through them, they typically own only one, somewhere in Moscow.

First of all, you need to find out whether the seller of the service is the direct owner of the facilities / data center. A lot of White Label resellers do their best to disguise their status, so you need to look for indirect signs. For instance, "their European DCs" may have specific names and logos that differ from the name of the supplier company. Or the word "partners" may pop up somewhere. Partners = White Label in 95% of cases.

Next, you need to get to know the structure of the company itself, and it is better to see the equipment in person. Among DCs, tours are nothing new, or at least tour-style articles on their own website or blog (we have written such articles once or twice), in which they describe their data center with photos and detailed descriptions.

With many data centers, you can arrange a personal visit to the office and a short tour of the DC itself. There you can assess how orderly things are, and perhaps talk to one of the engineers. Clearly, no one will give you a tour of the production floor if you need one server for 300 RUB / month, but if you need serious capacity, the sales department may well accommodate you. We, for instance, conduct such tours.

In any case, you should be guided by common sense and the needs of the business. For instance, if you need a distributed infrastructure, it will be easier and more cost-effective to use hosting providers that partner with European DCs under the White Label model. If your entire infrastructure is concentrated at one point, that is, in one data center, then it is worth spending some time looking for a supplier.

Because a typical SLA will most likely not help you. Working with the owner of the facilities rather than a reseller, however, will significantly speed up the resolution of potential problems.


SLA is always a lot of fun, especially when a critical screw-up happens and you need to actually enforce that same SLA.
One large company quotes a figure of 3 hours for critical incidents, and I was very curious how they manage to support a multilayer pie of hardware, OS, application software and their customizations.
It turned out that you had to read the fine print: this SLA covers the support response time (even if the response is "your call is very important to us, we are working on your problem"), and nothing limits the time for resolving the problem. If you dig deeper, a solution is not guaranteed at all (if you are one of a dozen clients out of hundreds of thousands around the world who hit a specific bug, no one has canceled the 85/15 rule).
PS Once, while I was communicating with support on a critical issue (SLA 3 hours), the engineer disappeared for 5 days. Meanwhile everything was on fire and the level of chaos kept rising, and then he reappeared as if nothing had happened and said that he had left for a religious holiday and forgot to warn anyone. After the "what the hell was that?" hysteria, the engineer was fired and sent to continue praying to the holy cow, the internal regulations were corrected and things are better now, but I completely stopped believing in SLA.


Rely on a hosting provider, but don't make a mistake yourself.

If site uptime is critical and the site is heavily loaded, we place two or three (or more, as far as your imagination and budget allow) physical or virtual servers in different data centers (different data center owners, different uplinks, different jurisdictions). On each server we install Linux, Nginx and BIND and, if necessary, something else (for instance, a mail server; in that case, we set different priorities in the MX records), then configure nftables and open the necessary ports (53, 80, 443, others if necessary).
We make all the name servers authoritative for the domain name; when changes are made, they are made on all name servers at once. For the A and AAAA records we specify a very low TTL (say, one minute). If one server becomes unavailable, we delete the A and AAAA records pointing to it (we do not touch the NS records or the A and AAAA records of the name servers themselves); when it becomes available again, we put the A and AAAA records back.
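
The failover logic described above can be sketched roughly as follows. This is a minimal illustration, not a complete tool: the helper names (is_reachable, records_to_publish), the choice of a TCP connect to port 443 as the health check, and the example addresses (taken from documentation ranges) are all assumptions. How the resulting record set gets pushed to each BIND name server is deliberately left out.

```python
import socket
from typing import Callable, List


def is_reachable(address: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to address:port succeeds within timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def records_to_publish(addresses: List[str],
                       check: Callable[[str], bool] = is_reachable) -> List[str]:
    """Return the mirror addresses that should stay in the A/AAAA record set."""
    healthy = [addr for addr in addresses if check(addr)]
    # Assumption: if every mirror looks down, the monitor itself may be at
    # fault, so keep the full set rather than publish an empty one.
    return healthy if healthy else list(addresses)


if __name__ == "__main__":
    # Hypothetical mirror addresses from the documentation ranges.
    mirrors = ["192.0.2.10", "198.51.100.20", "2001:db8::30"]
    print(records_to_publish(mirrors))
```

With a one-minute TTL it makes sense to run such a check about once a minute from a machine outside all of the data centers involved; the NS records and the name servers' own addresses are never touched by it.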

If uptime is important but mirrors make no sense (or the budget does not allow them), we find a VPS hosting provider that lets you keep a snapshot of a virtual server without the server itself and pay only for the space the snapshot occupies (I know one such hosting provider). We create a VPS there, install Linux and Nginx (no need for BIND), set up a temporary mirror, stop the VPS, take a snapshot and delete the VPS. Then we take cheap VPSes in two or three places, install Linux and BIND, set up name servers for the domain name (always with a low TTL), open port 53 and make these name servers authoritative. If the main server goes down, we deploy the temporary mirror from the snapshot and change the A and AAAA records; when the main server is available again, we change the A and AAAA records back and, after a while, delete the temporary mirror.

Yes, you can also use a CDN (for instance, Cloudflare or Imperva). In that case, you will not need a low TTL or your own name servers, but the CDN itself can become a point of failure.


So white was turned into black. My recommendation to all PMs and CTOs is never to sign contracts without an SLA, and when signing, to look at what kind of SLA it is: if it says 99.9%, then over what period, since a single outage of about 8 hours once a year and a minute or two of downtime every day are very different things for the end user.
Yes, an SLA is about the contract, not about the technology, and good companies have an actual service level much higher than the stated one. But if there is no SLA at all, or it is too soft, it means the provider is not sure of the stability of its services and is trying to shed responsibility in advance, which is a signal to be wary.
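
To illustrate why the measurement period matters, here is a small hedged sketch. The function meets_sla and the period lengths (24-hour day, 30-day month, 365-day year) are assumptions for illustration; a real contract defines its own accounting period.

```python
def meets_sla(downtime_hours: float,
              uptime_percent: float,
              period_hours: float) -> bool:
    """True if the downtime fits within what the SLA allows for the period."""
    allowed_hours = period_hours * (100 - uptime_percent) / 100
    return downtime_hours <= allowed_hours


# Assumed accounting periods, in hours.
PERIODS = {"day": 24, "month": 30 * 24, "year": 365 * 24}

if __name__ == "__main__":
    outage_hours = 8  # one continuous outage
    for name, hours in PERIODS.items():
        verdict = "within" if meets_sla(outage_hours, 99.9, hours) else "violates"
        print(f"8 h outage measured per {name}: {verdict} a 99.9% SLA")
```

The same 8-hour outage fits inside a 99.9% SLA measured over a year but violates one measured over a month or a day, which is exactly why the period has to be spelled out before signing.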