Aaah, the classic "you're doing _it_ wrong" argument. I can come up with dozens of different environments where it is simply not feasible to rebuild an environment within two hours.
- Any infrastructure with lots of data. Data just takes time to move; backups take time to restore.
- You're on bare metal because running node on VMs isn't fast enough.
- You're in a secure environment, where the plain old bureaucracy will get in the way of a full rebuild.
- Anytime you have to change DNS. That's going to take days to get everything failed over.
- Clients (or vendors) whitelist IPs, and you have to work through with them to fix the IPs.
- Amazon gives you the dreaded "we don't have capacity to start your requested instance; give us a few hours to spin up more capacity"
> Imagine an outage in one of the 3 datacenters you are running your infra in the same region. You need to move 1/3 of the capacity to the remaining 2 datacenters.
Oh, this is very different. If your provider loses a datacenter and your existing infrastructure can't handle it, you're already SOL - the APIs for spinning up instances and networking are going to be DDOSed to death by all of the various users.
Basic HA dictates that you provision enough spare capacity that a DC (AZ) can go down and you can still serve all of your customers.
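To put a number on "enough spare capacity", here's a back-of-the-envelope sketch of N+1 provisioning across AZs (plain Python; the 900 req/s figure is made up for illustration):

```python
def provisioned_per_az(peak_load: float, num_azs: int) -> float:
    """Capacity each AZ must hold so the remaining AZs can absorb
    the full peak load if any single AZ goes down (N+1 style)."""
    assert num_azs >= 2, "need at least two AZs to survive losing one"
    return peak_load / (num_azs - 1)

# Hypothetical: 900 req/s of peak traffic spread across 3 AZs.
# Each AZ must be sized for 450 req/s, which means you run every
# AZ at no more than 2/3 utilization in normal operation.
peak, azs = 900.0, 3
per_az = provisioned_per_az(peak, azs)
normal_utilization = (peak / azs) / per_az  # load carried vs. capacity held
print(per_az, round(normal_utilization, 3))  # 450.0 0.667
```

The general rule falls out of the math: with N AZs, steady-state utilization can't exceed (N-1)/N if you want to survive losing one.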
I mostly disagree with your points, with the exception of the last one.
I used to work on the team that runs Amazon.com. All of the systems serving the site can be rebuilt within hours, and nothing can serve the site that cannot be rebuilt within a very tight SLA. However, I understand that not all companies have this requirement. It is only relevant when site downtime hurts the company too much to be allowed.
Responding to your points:
- Lots of data -> use S3 with de-normalized data, or something similar
- Running a VM has 3% overhead in 2016, scalability is much more important than a single node performance
- High security environments are usually payment processing systems; downtime there can be tolerated a bit more, delaying transactions is ok
- Amazon uses DNS for everything, even for datacenter moves. It is usually done within 5 minutes
- This is a networking challenge, using something like EIP (where the public facing IP can be attached to different nodes) makes this a non-issue
- Amazon has an SLA, they extremely rarely have a full region outage, so you can juggle capacity around
Losing a DC out of 3 does not require work because you can't handle the load; it is required to restore the same properties (the same extra capacity, for example) as before. Spinning up instances should not DDOS anything; it is a constant load on the supporting infrastructure.
First, two important assumptions I'm making when I say this (and I feel they are reasonable assumptions). I'm not just talking about bringing a production environment back up in the same or an adjacent AZ; I'm talking about true DR, where you're moving regions. I'm also not limiting my discussion to AWS' infrastructure - not with Google, Rackspace, Cloudflare and others in the space as well.
> Lots of data -> use S3 with de-normalized data, or something similar
S3's use case does not match up with many different computing models (Hadoop clusters, database tables, state overflowing memory), and moving data within S3 between regions is painful. Also, not all cloud providers have an S3 equivalent.
> Running a VM has 3% overhead in 2016, scalability is much more important than a single node performance
Not when you have a requirement to respond to _all_ requests in under 50ms (such as with an ad broker).
> High security environments are usually payment processing systems
Or HIPAA, or government.
> delaying transactions is ok
Not really. When I worked for Amazon, they were still valuing one second of downtime at around $13k in lost sales. I can't imagine this has gone down.
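Taking that figure at face value, the arithmetic is sobering (illustrative only; the $13k/second number is from the comment above, not an official figure):

```python
# Rough arithmetic behind the "$13k per second" downtime valuation.
COST_PER_SECOND = 13_000  # USD; treat as an illustrative estimate

def outage_cost(seconds: float) -> float:
    """Lost sales attributed to an outage of the given duration."""
    return seconds * COST_PER_SECOND

print(outage_cost(60))      # one minute: 780000
print(outage_cost(5 * 60))  # five minutes: 3900000
```

At that rate, even a "short" transaction delay of a few minutes is a multi-million-dollar event, which is why "delaying transactions is ok" doesn't hold.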
> Amazon uses DNS for everything, even for datacenter moves. It is usually done within 5 minutes
Amazon also implements their own DNS servers, with some dynamic lookup logic; they are an outlier. Fighting TTLs across the world is a real problem for DR-type scenarios.
> EIP (where the public facing IP can be attached to different nodes) makes this a non-issue
EIPs are not only AWS-specific, but they cannot move across regions, and they rely on AWS' API being up. That has not historically always been the case.
> they extremely rarely have a full region outage, so you can juggle capacity around
Not always. Sometimes you can, but not always. Some good examples from the past: anytime EBS had issues in us-east-1, the AWS API would be unavailable; when an AZ in us-east-1 went down, the API was overwhelmed and unresponsive for hours afterwards.
> Spinning up instances should not DDOS anything, it is with constant load on the supporting infrastructure.
See above. There's nothing constant about the load when there is an AWS outage; everyone is scrambling to use the APIs to get their sites back up. There's even advice to not depend on ASGs for DR, for that very reason.
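For what it's worth, the standard client-side mitigation for that scramble is capped exponential backoff with jitter, so thousands of customers retrying after an outage don't hit the recovering control plane in lockstep. A minimal sketch of the pattern (generic Python, not any real SDK's retry policy):

```python
import random

def backoff_delays(max_attempts, base=1.0, cap=60.0, rng=None):
    """Capped exponential backoff with 'full jitter': retry i waits a
    random amount between 0 and min(cap, base * 2**i) seconds.
    Randomizing the wait is what keeps a thundering herd of clients
    from re-DDOSing an API that is just coming back up."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** i))
            for i in range(max_attempts)]

# Deterministic example with a seeded RNG, so the sketch is repeatable:
delays = backoff_delays(6, rng=random.Random(42))
print([round(d, 2) for d in delays])
```

Each client would sleep for `delays[i]` before retry `i`; without the jitter, every client retries at the same instants and the synchronized spikes look exactly like the DDOS described above.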
AWS is constantly getting better about this, but they are not the only VPS provider, nor are they themselves immune to outages and downtime which requires DR plans.
OP's first point is 'don't put data in docker'. Docker is not for your data. But more to the point, if you're rebuilding your environment a couple of times every day, a couple of hours of data-store downtime each time isn't going to be feasible.
> You're on bare metal because running node on VMs isn't fast enough
In such a situation, you should be able to image bare metal in well under two hours: dd a base image onto the machine, run a config manager over it, and you should be done. Small shops that rarely bring up new infra wouldn't need this, but anyone running bare metal at scale should.
> bureaucracy
Isn't part of the infra rebuild per se.
> Anytime you have to change DNS. That's going to take days
Depends on your DNS TTLs, but this is config, not infra. Even if it is infra, 48-hour DNS entries aren't a best practice anymore (and if you're on AWS, most things default to a 5-minute TTL)
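The arithmetic here is worth spelling out: the worst case is a resolver that cached the old answer just before you flipped the record, so clients can keep hitting the old IP for roughly one full TTL, plus whatever slack you allow for resolvers that ignore or clamp TTLs. A small sketch with made-up numbers:

```python
def stale_window(record_ttl, resolver_extra=0):
    """Worst-case seconds a client can keep resolving the old IP after
    a record change: one full TTL (cached just before the flip), plus
    an optional fudge factor for misbehaving resolvers (picked from
    experience, not a protocol guarantee)."""
    return record_ttl + resolver_extra

legacy = stale_window(48 * 3600)  # old-school 48h TTL
modern = stale_window(5 * 60)     # AWS-style 5-minute default
print(legacy, modern)  # 172800 300
```

That's the difference between a two-day failover tail and a five-minute one, which is why low TTLs are considered part of DR hygiene rather than a performance tweak.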
> Clients (or vendors) whitelist IPs, and you have to work through with them to fix the IPs
I'd file this under 'bureaucracy' - it's part of your config, not part of your prod infra (which the GP was talking about).
> Amazon gives you the dreaded...
Well, yes, but this is on the same order as "what if there's a power outage at the datacentre". Every single deploy plan out there has an unknown-length outage if the 'upstream' dependencies aren't working. "What if there's a hostage event at our NOC?" blah blah.
The point is that with upstream working as normal, you should be able to cover the common SPOFs and get your prod components up in a relatively short time.
> OP's first point is 'don't put data in docker'. Docker is not for your data.
I agree, but I (and the GP, from my reading) was not speaking about only Docker infrastructure.
> Isn't part of the infra rebuild per se.
I can see your point, and perhaps these points don't belong in a discussion purely about rebuilding instances. That said, I have a very hard time focusing just on the time it takes to rebuild capacity when discussing a DC going down; there are just too many other considerations that someone in Operations must weigh.
When I have my operations hat on, I consider a DC going down to be a disaster. Even if the company has followed my advice and the customers do not notice anything, we're now at a point where any other single failure will take the site down. It's imperative to get everything that went down with that DC back up, and that's going to take more than an hour or two.