(Replying to PARENT post)
(NOTE: I am speculating here; if they do have a staging system and this wasn't reproduced there, then the last sentence doesn't apply.)
(Replying to PARENT post)
Some guesses would be:
Automation/orchestration - They may have migrated to k8s (I don't believe they've actually done this yet), but either way their orchestration/automation tooling could have rolled a broken change out everywhere (rough sketch after the list).
Database/Auth - Pretty much everything in GitLab will touch the database as far as I'm aware; otherwise, how do you check whether users are authorised to take an action? You wouldn't expect this to break the static website, i.e. the sales landing pages, but those could be served from an internal CMS, or could be checking for a "guest" role session (sketch after the list).
DNS/Service Discovery - As a sibling posted, "it's always DNS". It's good practice to address services by name instead of IP address, but that means your DNS has to keep working or everything goes down. Service discovery could rely on DNS, or it could be an API call that returns hostnames or IP addresses directly (sketch after the list).
CDN - You wouldn't typically put this in front of authenticated traffic, and a CDN usually isn't much help in front of something like SSH, but a quick look at Fastly suggests they might support it. The main downside is that the CDN then sees all the user data / auth tokens.
Security Product / CA - All it takes is a requirement to encrypt internal traffic and rotate secrets, and you end up with a secret store sitting in the middle of everything (sketch after the list).
Storage Layer - I believe they were big on Ceph for a while. If everything is backed by Ceph and Ceph fails, everything goes down with it.
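To make the automation guess concrete, here's a rough sketch (hostnames and paths are made up, not how GitLab actually deploys) of the failure mode where a rollout loop has no canary or staging gate, so one bad config lands everywhere at once:

    # Hypothetical rollout loop with no canary step: a bad config
    # reaches every host before anything has a chance to catch it.
    import subprocess

    HOSTS = ["web-01", "web-02", "api-01", "git-01"]  # made-up fleet

    def rollout(config_path):
        for host in HOSTS:
            subprocess.run(["scp", config_path, f"{host}:/etc/app/config.yml"],
                           check=True)
            subprocess.run(["ssh", host, "systemctl", "reload", "app"],
                           check=True)
            # A safer version would deploy to one host, verify it's
            # healthy, and only then continue with the rest.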
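On the database guess, a minimal sketch (hypothetical schema and connection string, not GitLab's actual code) of why even "static" marketing pages can depend on the database, if every request resolves a session/role first:

    # Every request, including anonymous ones, looks up its session,
    # so a database outage 500s the landing pages too.
    import psycopg2  # assumes a Postgres-backed session store

    def resolve_role(session_token):
        conn = psycopg2.connect("dbname=app")  # hypothetical DSN
        try:
            with conn, conn.cursor() as cur:
                cur.execute("SELECT role FROM sessions WHERE token = %s",
                            (session_token,))
                row = cur.fetchone()
            return row[0] if row else "guest"  # anonymous visitors still hit the DB
        finally:
            conn.close()

    def handle_request(path, cookies):
        # If connect() fails above, this raises for *every* page.
        role = resolve_role(cookies.get("session", ""))
        return f"200 OK: {path} rendered for role={role}"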
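For the DNS guess, the sketch below (service names invented) shows what "everything resolves a name on every call" looks like; if the resolvers go down, every service-to-service call fails at the same moment:

    # Name-based service calls: resolution happens per call, so a
    # DNS outage raises socket.gaierror for every dependency at once.
    import socket

    SERVICES = {
        "auth":  ("auth.internal.example", 8443),
        "repos": ("gitaly.internal.example", 9999),
    }

    def call(service, payload):
        host, port = SERVICES[service]
        # create_connection() resolves the hostname via DNS each time;
        # a cached or static fallback would localise a resolver outage.
        with socket.create_connection((host, port), timeout=2) as sock:
            sock.sendall(payload)
            return sock.recv(4096)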
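And for the secret store / CA guess, a sketch (endpoint and paths made up) of the startup path where nothing can even get its database password or TLS key unless the secrets service is answering:

    # Central secret store consulted at startup: if it's unreachable,
    # no service can boot, connect to its DB, or present a valid cert.
    import json
    import urllib.request

    SECRETS_URL = "https://secrets.internal.example/v1"  # hypothetical

    def fetch_secret(path):
        req = urllib.request.Request(f"{SECRETS_URL}/{path}")
        with urllib.request.urlopen(req, timeout=2) as resp:  # raises if the store is down
            return json.load(resp)["value"]

    def start_service():
        db_password = fetch_secret("db/app-password")
        tls_key = fetch_secret("tls/internal-key")
        # Only now can the service open its DB connection and serve traffic.
        return db_password, tls_key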
Obviously, whatever it is, you'd expect them to split up their failover plan a bit more in the future if it turns out to be something like that, but there's usually a single point of failure somewhere.
(Replying to PARENT post)
How come all of them are down all at once?