(Replying to PARENT post)
So you can keep dedicating more and more silicon to redundant components to get closer to physically representing each thread, or you can code more efficiently.
E.g. even if you did nothing but loop over a do-nothing system call, you'd still need two separate executable pages in the cache (one for your user-space loop, one for the kernel's handler) instead of just one.
Not only that, but the kernel is often acting as a mediator for the hardware, which can mean synchronizing between cores, and that brings its own costs (falling back to slower shared cache, waiting on other cores, etc.)
(Replying to PARENT post)
The main solution is to reserve a large amount of virtual address space but avoid committing physical memory until the process actually writes to it, which is exactly what OS threads do: each thread reserves a big stack up front, but it's more or less free until the thread actually touches it. However, you may not see that memory released back to the OS until the process exits.
(Replying to PARENT post)
This is just my guess, but I assume they don't use growable stacks for the same reason Rust doesn't. C doesn't have a garbage collector, and a segmented (fragmented) stack would incur unacceptable performance penalties for many workloads. Getting around that would require a ton of work, though hardware support, e.g. automatically prefetching along the stack pointer so a split stack stays in cache like a contiguous one, might make it feasible.
(Replying to PARENT post)
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
(Replying to PARENT post)
It feels like it would be totally possible to improve OSes to remove this limitation. Is anyone actually working on that?
For example, I don't see why you couldn't have growable stacks for threads. Or first-class hardware support for context switching. (Yes, that would take a long time to arrive.)