(Replying to PARENT post)
(NOTE: I am speculating here; if they do have a staging system and this wasn't reproduced there, then the last sentence doesn't apply.)
(Replying to PARENT post)
Some guesses would be:
Automation/orchestration - They may have migrated to k8s (I don't believe they've actually done this yet), but either way their orchestration/automation tooling could have rolled a broken change out everywhere (rough sketch after the list).
Database/Auth - Pretty much everything in GitLab will touch the database as far as I'm aware; otherwise, how do you check whether users are authorised to take an action? You wouldn't expect this to break the static website, i.e. the sales landing pages, but those could be served from an internal CMS, or could be checking for a "guest" role session (sketch after the list).
DNS/Service Discovery - As a sibling posted, "it's always DNS". It's good practice to address services by name instead of IP address, but that means your DNS has to keep working or everything goes down. Service discovery could rely on DNS, or it could be an API call that returns hostnames or IP addresses directly (sketch after the list).
CDN - You wouldn't typically put this in front of authenticated traffic, and a CDN usually isn't much help in front of something like SSH, but a quick look at Fastly suggests they might support it. The main downside is that the CDN then sees all the user data / auth tokens.
Security Product / CA - All it takes is a requirement to encrypt internal traffic and rotate secrets, and you end up with a secret store sitting in the middle of everything (sketch after the list).
Storage Layer - I believe they were big on Ceph for a while. If everything is backed by Ceph and Ceph fails, everything goes down with it.
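To make the automation guess concrete, here's a rough sketch (hostnames and paths are made up, not how GitLab actually deploys) of the failure mode where a rollout loop has no canary or staging gate, so one bad config lands everywhere at once:

    # Hypothetical rollout loop with no canary step: a bad config
    # reaches every host before anything has a chance to catch it.
    import subprocess

    HOSTS = ["web-01", "web-02", "api-01", "git-01"]  # made-up fleet

    def rollout(config_path):
        for host in HOSTS:
            subprocess.run(["scp", config_path, f"{host}:/etc/app/config.yml"],
                           check=True)
            subprocess.run(["ssh", host, "systemctl", "reload", "app"],
                           check=True)
            # A safer version would deploy to one host, verify it's
            # healthy, and only then continue with the rest.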
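On the database guess, a minimal sketch (hypothetical schema and connection string, not GitLab's actual code) of why even "static" marketing pages can depend on the database, if every request resolves a session/role first:

    # Every request, including anonymous ones, looks up its session,
    # so a database outage 500s the landing pages too.
    import psycopg2  # assumes a Postgres-backed session store

    def resolve_role(session_token):
        conn = psycopg2.connect("dbname=app")  # hypothetical DSN
        try:
            with conn, conn.cursor() as cur:
                cur.execute("SELECT role FROM sessions WHERE token = %s",
                            (session_token,))
                row = cur.fetchone()
            return row[0] if row else "guest"  # anonymous visitors still hit the DB
        finally:
            conn.close()

    def handle_request(path, cookies):
        # If connect() fails above, this raises for *every* page.
        role = resolve_role(cookies.get("session", ""))
        return f"200 OK: {path} rendered for role={role}"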
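For the DNS guess, the sketch below (service names invented) shows what "everything resolves a name on every call" looks like; if the resolvers go down, every service-to-service call fails at the same moment:

    # Name-based service calls: resolution happens per call, so a
    # DNS outage raises socket.gaierror for every dependency at once.
    import socket

    SERVICES = {
        "auth":  ("auth.internal.example", 8443),
        "repos": ("gitaly.internal.example", 9999),
    }

    def call(service, payload):
        host, port = SERVICES[service]
        # create_connection() resolves the hostname via DNS each time;
        # a cached or static fallback would localise a resolver outage.
        with socket.create_connection((host, port), timeout=2) as sock:
            sock.sendall(payload)
            return sock.recv(4096)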
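And for the secret store / CA guess, a sketch (endpoint and paths made up) of the startup path where nothing can even get its database password or TLS key unless the secrets service is answering:

    # Central secret store consulted at startup: if it's unreachable,
    # no service can boot, connect to its DB, or present a valid cert.
    import json
    import urllib.request

    SECRETS_URL = "https://secrets.internal.example/v1"  # hypothetical

    def fetch_secret(path):
        req = urllib.request.Request(f"{SECRETS_URL}/{path}")
        with urllib.request.urlopen(req, timeout=2) as resp:  # raises if the store is down
            return json.load(resp)["value"]

    def start_service():
        db_password = fetch_secret("db/app-password")
        tls_key = fetch_secret("tls/internal-key")
        # Only now can the service open its DB connection and serve traffic.
        return db_password, tls_key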
Obviously, whatever it is, you'd expect them to split up their failover plan a bit more in the future if it turns out to be something like that, but there's usually a single point of failure somewhere.
(Replying to PARENT post)
How come all of them are down all at once?