(Replying to PARENT post)

> Context-switching between the kernel and userspace is expensive in terms of CPU cycles.

> OS threads have a large pre-allocated stack, which increases per-thread memory overhead.

It feels like it would be totally possible to improve OSes to remove this limitation. Is anyone actually working on that?

For example, I don't see why you couldn't have growable stacks for threads. Or have first-class hardware support for context switching. (Yes, that would take a long time to arrive.)
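Just to sketch what I mean by growable stacks (a toy, Linux-specific illustration with error handling omitted - not how an OS would actually implement it): reserve a big range of address space up front and commit pages from a fault handler as you touch deeper into it.

    /* Toy "growable" region: reserve with PROT_NONE, commit pages on
       first touch from a SIGSEGV handler. Note mprotect() is not
       formally async-signal-safe; this works on Linux in practice. */
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define RESERVE (64UL * 1024 * 1024)   /* 64 MiB of address space */
    static char *base;

    static void on_fault(int sig, siginfo_t *si, void *ctx) {
        uintptr_t a = (uintptr_t)si->si_addr;
        if (a < (uintptr_t)base || a >= (uintptr_t)base + RESERVE)
            _exit(1);                      /* a real crash, not growth */
        /* Commit the faulting page; the faulting write is retried. */
        mprotect((void *)(a & ~4095UL), 4096, PROT_READ | PROT_WRITE);
    }

    int main(void) {
        /* Reserve address space without committing physical memory. */
        base = mmap(NULL, RESERVE, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

        struct sigaction sa = {0};
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        base[RESERVE / 2] = 42;            /* faults once, then succeeds */
        printf("grew on demand: %d\n", base[RESERVE / 2]);
        return 0;
    }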

👤IshKebab 🕑2y 🔼0 🗨️0

(Replying to PARENT post)

I'm pretty sure that modern CPUs do generally have shadow registers to make context switching faster. However, you also have to consider cache contention, the fact that an indefinite amount of kernel code runs during the context switch, that the kernel may service other processes along the way, and the effects all of this has on the instruction pipeline. Not to mention bugs in the CPU (Meltdown-style issues) that might require flushing the pipeline or caches to maintain complete process separation.

So you can keep dedicating more and more silicon to redundant components to get closer to physically representing each thread, or you can code more efficiently.

E.g. even if you did nothing but loop over a do-nothing system call, it would still need two separate executable pages in the cache (one user, one kernel) instead of just one.

Not only that, but the kernel often acts as a mediator for the hardware, which can mean synchronizing between cores - and that brings its own obstacles (falling back to slower shared cache, waiting on other cores, etc.).
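If you want to put a rough number on the bare crossing cost, here's a toy benchmark (a sketch, Linux-specific; absolute numbers vary a lot with the CPU and with mitigations like KPTI):

    /* Time a tight loop of raw "do-nothing" syscalls. Using syscall()
       directly avoids any libc-side caching of getpid(). */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        const long N = 1000000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            syscall(SYS_getpid);           /* user -> kernel -> user */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per syscall round trip\n", ns / N);
        return 0;
    }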

👤spease 🕑2y 🔼0 🗨️0

(Replying to PARENT post)

One problem is that a thread's stack lives in the same virtual address space as everything else, so "growing" it in place could collide with other allocations belonging to the rest of the program. That's why there are guard pages around the stack (also worth knowing: the growth direction depends on the architecture - on x86 the stack grows down). And that's why the post points out that growable stacks have to be moveable.

The main solution is to just reserve a ton of virtual address space but avoid committing physical memory to it until the process actually writes to it, which is exactly what OS threads do. They reserve a large stack up front, but it's more or less free until a thread actually touches it. However, you may not see that memory released back to the OS until the process exits.
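You can watch this reserve-vs-commit behavior directly (a quick Linux-only sketch; /proc/self/statm reports sizes in pages, and error handling is omitted):

    /* Reserve 1 GiB of anonymous memory and watch resident set size:
       it only grows for the pages actually written. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static long resident_kib(void) {
        long size, resident;
        FILE *f = fopen("/proc/self/statm", "r");
        fscanf(f, "%ld %ld", &size, &resident);
        fclose(f);
        return resident * (sysconf(_SC_PAGESIZE) / 1024);
    }

    int main(void) {
        printf("before mmap:    %ld KiB resident\n", resident_kib());

        /* 1 GiB of address space, no physical memory committed yet */
        char *p = mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

        printf("after mmap:     %ld KiB resident\n", resident_kib());

        for (long i = 0; i < (1L << 20); i += 4096)
            p[i] = 1;                      /* touch 1 MiB of it */

        printf("after touching: %ld KiB resident\n", resident_kib());
        return 0;
    }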

👤duped 🕑2y 🔼0 🗨️0

(Replying to PARENT post)

> It feels like it would be totally possible to improve OSes to remove this limitation. Is anyone actually working on that?

This is just my guess, but I assume they don't use growable stacks for the same reason Rust doesn't. C doesn't have a garbage collector (or anything else that tracks pointers into the stack, so the stack can't be relocated), and a segmented stack would incur unacceptable performance penalties for many workloads. Getting around that would require a ton of work - though maybe hardware support, like automatically following the stack pointer so a segmented stack gets pulled into cache the way a contiguous one does, could help.

👤zeroCalories 🕑2y 🔼0 🗨️0

(Replying to PARENT post)

I kinda recall some comments by pcwalton about APIs that would ameliorate context-switching costs at the kernel level:

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

👤jrvidal 🕑2y 🔼0 🗨️0

(Replying to PARENT post)

io_uring is an interesting effort to drastically reduce context-switch overhead by simply not context switching: submissions and completions flow through ring buffers shared between userspace and the kernel, so many operations can be batched into a single syscall (or, in polling mode, none at all).
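For anyone who hasn't played with it, the minimal shape looks something like this (a sketch using liburing; link with -luring, error handling omitted):

    /* Queue a no-op request by writing to rings shared with the kernel;
       one submit syscall flushes all queued operations, and with
       IORING_SETUP_SQPOLL even the submit syscall can go away. */
    #include <liburing.h>
    #include <stdio.h>

    int main(void) {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);   /* 8-entry submission queue */

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_nop(sqe);             /* fill an entry: no syscall */
        io_uring_submit(&ring);             /* one syscall, N queued ops */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);     /* reap completion from ring */
        printf("nop completed: res=%d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }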

👤smw 🕑2y 🔼0 🗨️0