(Replying to PARENT post)
As my site matures, I might move some predictable parts to containerized VMs to save on costs. I have a crawler-type service with extremely consistent traffic that probably shouldn't be on Cloud Run, but I put it there anyway because I can iterate so quickly (see revisions here: https://imgur.com/fkFtIRM). Cloud Run also has a free tier, which was enough when I was starting out and prototyping. And since my site gets very little traffic at night, the cost is not too bad at the moment. This is what my billing looks like: https://imgur.com/gIo3IGJ
If I were a big company or my site got super popular, I might do things differently. But for a side project it has made me enjoy programming more than anything else I've used, because I can focus 95% on code.
(Replying to PARENT post)
I had a worker service running on Heroku. Very CPU intensive. The traffic pattern was extremely low throughout the day, but had completely unexpected surges.
On Heroku, my choices were: pay ~$3k/month (basically paying for the peak surge all month long) or accept a lot of slow/failed responses during surges.
Moving to Cloud Run was very easy. It's just a normal dockerized 12-factor app.
Now I pay ~$50/m and it automatically scales up when I need more workers.
If you want a managed app platform, it couldn't be simpler or cheaper.
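For anyone wondering what "a normal dockerized 12-factor app" takes in practice: Cloud Run injects a PORT environment variable and expects the container to serve HTTP on it. A minimal sketch in Go (illustrative only, not my actual service):

    // Minimal sketch of the 12-factor contract Cloud Run expects: read
    // config from the environment (Cloud Run injects PORT) and serve HTTP.
    // The handler here is illustrative.
    package main

    import (
        "fmt"
        "log"
        "net/http"
        "os"
    )

    func main() {
        port := os.Getenv("PORT") // set by Cloud Run; fallback for local runs
        if port == "" {
            port = "8080"
        }
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "ok")
        })
        log.Fatal(http.ListenAndServe(":"+port, nil))
    }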
The scaling model of Cloud Run has been great. Now that they support WebSockets, I will slowly be moving all my apps to it.
My only gripe: I wish they had something for worker processes. I know that the rest of Google Cloud has solutions for it but being able to just spawn worker processes as part of the same deployment would be fantastic.
(Replying to PARENT post)
Around €1,500 will get you a second-hand Dell R720 with 190+ GB RAM and 48 cores.
It would be cheaper to handle this kind of scaling with a VM instead of functions. Any kernel with a bit of tweaking will easily handle that number of connections; you just need RAM and sufficient CPU. 16 GB of RAM and 8 vCPUs on AWS would be enough for 500k connections.
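For a sense of why that sizing works, here's a rough sketch (illustrative, not a benchmarked setup) of the goroutine-per-connection pattern such a VM would run. Each idle connection costs a few KB of goroutine stack plus socket buffers, so 500k connections fit in 16 GB, assuming the kernel's open-file limits (ulimit -n, fs.nr_open) are raised above 500,000:

    package main

    import (
        "log"
        "net"
    )

    func handle(c net.Conn) {
        defer c.Close()
        buf := make([]byte, 1024)
        for {
            n, err := c.Read(buf)
            if err != nil {
                return
            }
            // Echo back; a real service would do protocol work here.
            if _, err := c.Write(buf[:n]); err != nil {
                return
            }
        }
    }

    func main() {
        ln, err := net.Listen("tcp", ":8080")
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Println("accept:", err)
                continue
            }
            go handle(conn) // one cheap goroutine per connection
        }
    }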
That said, I wouldn't sleep well if a service built this way became popular and had to run at ever-increasing capacity for a week, because my team would need that long to move back to VMs.
(Replying to PARENT post)
WebSockets are a handy thing to have access to, but the app described is a pretty pathological case for the Cloud Run pricing model. I think people knocking the article on pricing alone are missing the point.
(Replying to PARENT post)
Given that, WebSockets/persistent connections seem like a weird use case.
(Replying to PARENT post)
Why do we now need thousands of instances to achieve the same?
(Replying to PARENT post)
The second problem is that the demo has really just kicked your scaling problem over to Redis. The demo isn't doing anything actually interesting; Redis is doing all the work. The reality is that one of the best parts of Cloud Run over something like serverless functions is that you can have state in your server. You wouldn't need Redis at all if you were doing the same demo in Kubernetes without Cloud Run, since it's pretty easy to use channels in Go to do the same thing (see the sketch below). However, to really make that work, you need some way to at the very least route traffic consistently, so that people watching the same resource get routed to the same server, or else some form of cross-node communication.
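To make the channels point concrete, here's a minimal in-process pub/sub hub sketch (illustrative names, not from the article's demo; single process only, which is exactly why the consistent routing mentioned above is needed):

    package main

    import (
        "fmt"
        "sync"
    )

    type Hub struct {
        mu   sync.RWMutex
        subs map[string][]chan string // topic -> subscriber channels
    }

    func NewHub() *Hub {
        return &Hub{subs: make(map[string][]chan string)}
    }

    // Subscribe returns a channel that receives every message for the topic.
    func (h *Hub) Subscribe(topic string) <-chan string {
        ch := make(chan string, 16)
        h.mu.Lock()
        h.subs[topic] = append(h.subs[topic], ch)
        h.mu.Unlock()
        return ch
    }

    // Publish fans a message out to all subscribers of the topic.
    func (h *Hub) Publish(topic, msg string) {
        h.mu.RLock()
        defer h.mu.RUnlock()
        for _, ch := range h.subs[topic] {
            select {
            case ch <- msg:
            default: // drop rather than block on a slow subscriber
            }
        }
    }

    func main() {
        h := NewHub()
        sub := h.Subscribe("room-1")
        h.Publish("room-1", "hello")
        fmt.Println(<-sub) // hello
    }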
So this is a great start, but there's a little way to go before many people would be able to switch over to something like this.
(Replying to PARENT post)
1) Egress is hand-waved, when in reality the free GBs would basically be exhausted within a month on pings/pongs and reconnection requests alone, i.e. with the service doing nothing at all. That many people sending single emojis as messages for just one day (to play along with the "short marketing event" premise) already matches the stated monthly napkin estimate.
2) Isn't there a good chance Memorystore wouldn't be able to handle this under even fairly charitable load interpretations at that many CCUs? If 1‰ of those 250k CCUs send a message every second, that's 250 publishes per second, and with each message fanned out to every subscribing instance, Redis ends up pushing on the order of 250k messages per second as well.
Again though, I'm aware it's only a hypothetical scenario demonstration, and it really does look cute in how simple it is to deploy. Fun stuff.
(Replying to PARENT post)
> Any Cloud Run service, by default, can scale up to 1,000 instances. (However, by opening a support ticket, you can get this number elevated.) This means we can support 250,000 clients simultaneously without having to worry about infrastructure and scaling!
Aren't these statements at odds with one another?
(Replying to PARENT post)
I remember that in 2012, the Rizon IRC network maxed out at 80,000 concurrent users on an AMD Bulldozer CPU.
(Replying to PARENT post)
API Gateway supports an unlimited number of connections (you do need to ask for an increase beyond the default rate of 500 new connections per second) and costs only around $0.25 per million connection-minutes + $1 per million messages.
So just holding 250,000 connections open around the clock would cost about $2,700 per month (250,000 connections × ~43,200 minutes in a month ≈ 10,800 million connection-minutes, at $0.25 per million), and it scales up and down as you please.
(Replying to PARENT post)
1. 250 connections per container, and
2. 1,000 containers.
However, the 250 concurrency limit does not refer to connections; it refers to requests. 250 concurrent requests can actually represent thousands of clients, depending on their think time.
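To put rough, illustrative numbers on that: by Little's law, concurrency ≈ request rate × request duration. If each client sends one 100 ms request every 5 seconds, a single client occupies 0.1/5 = 0.02 of a concurrency slot, so 250 concurrent requests covers roughly 250 / 0.02 = 12,500 clients per container.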
(Replying to PARENT post)
If you are a GCP or AWS or bare-metal expert who can set this kind of thing up in their sleep, that's great, but the majority of people can really benefit from a PaaS like GCR.
Because Cloud Run uses vanilla Docker containers, once you have validated the idea you can move to GKE or VMs or a server under your desk or whatever. And if it never takes off, that's fine too because you didn't spend a ton of time investing in making it work.
Ahmet, if you are reading this (hi): it would be REALLY cool to see something like using GKE for the base load and dynamically bursting to GCR to fill in the gaps. Not sure if this is possible with GCLB today, but it would be super cool.
(Disclaimer: I used to work on this team at Google.)