Ask HN:

"Why did stack-based CPUs lose out?"

Back in the 80s, they seemed a viable alternative for the future of computing systems. The RTX2010 [0] found wide adoption in the spaceflight industry.

Since then, however, they seem to have completely petered out.

Is there any specific reason?

[0] https://en.wikipedia.org/wiki/RTX2010

👤optimalsolver🕑3y🔼122🗨️86

(Replying to PARENT post)

Stack-based CPUs didn't lose out to accumulator-based CPUs; the x87 for example was stack-based (and was originally intended to support the hardware stack transparently overflowing to RAM -- but they didn't write the software to support that before the chip shipped).

But both of them lost out to many-general-purpose-registers based CPUs, because that's what you need in order to write code which has many instructions in flight at once (for superscalar, pipelined, or both). If you look at the FPU on the Pentium you see the nightmare you run into otherwise -- optimized Pentium x87 code is half FXCH (swap the top of stack with something else) instructions because the FPU was pipelined and so you had to keep shuffling your stack in order to avoid pipeline stalls. The Pentium FPU has a hefty register-renaming unit in order to make this work; if it had been designed with general purpose registers rather than as a stack, that would have been avoided.
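
To make the shuffling concrete, here's a toy sketch (Python rather than x87; the op names and the two expressions are simplified stand-ins, not real instructions or anyone's actual code). Interleaving two independent computations so that no arithmetic op immediately consumes the result of the one issued just before it means peppering the code with exchanges - exactly the role FXCH played:

  # Toy stack FPU: every result lands on top of the stack, so interleaving
  # two independent computations requires explicit exchanges (the FXCH role).
  stack = []

  def push(x): stack.append(x)
  def mul():   b, a = stack.pop(), stack.pop(); stack.append(a * b)
  def add():   b, a = stack.pop(), stack.pop(); stack.append(a + b)
  def swap(i): stack[-1], stack[-1 - i] = stack[-1 - i], stack[-1]  # FXCH-like

  a, b, c, d = 2.0, 3.0, 5.0, 7.0
  p, q, r, s = 11.0, 13.0, 17.0, 19.0

  # Goal: e = a*b + c*d and f = p*q + r*s, interleaved so that no arithmetic
  # op consumes the result of the op issued immediately before it (which on a
  # pipelined FPU would mean a stall). Stack comments read bottom -> top.
  push(a); push(b); push(p); push(q)
  mul()        # p*q
  swap(2)      # bring a back to the top           stack: p*q, b, a
  mul()        # a*b -- independent of p*q, can overlap it in the pipeline
  push(c); push(d)
  mul()        # c*d
  push(r); push(s)
  mul()        # r*s
  swap(2)      # bring a*b to the top              stack: p*q, r*s, c*d, a*b
  add()        # e = a*b + c*d
  swap(2)      # bring p*q to the top              stack: e, r*s, p*q
  add()        # f = p*q + r*s

  e, f = stack[0], stack[1]
  assert (e, f) == (a*b + c*d, p*q + r*s)
  print(e, f)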

👤cperciva🕑2y🔼0🗨️0

(Replying to PARENT post)

Not an EE or chip designer but

1. Stack-based arithmetic requires a stack (duh), whereas registers are more general. Put another way, a stack is specifically for arithmetic, whereas registers can be used however you like.

2. More importantly, I suspect a stack implementation puts extra pressure on the hardware locations used for the stack (edit: NOT the register set, sorry) - read item 1 and item 2, then push the result back onto where item 2 was. With registers you can put the result into any other register, which allows other instructions to execute rather than stalling while waiting for the result to appear on the stack.

3. A stack is a bottleneck. With registers you can have multiple ALUs, which allows multiple instructions to execute out of order; try doing that with a stack (see 2 again). A register-machine sketch of this is below.
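
Here's a toy register-machine sketch (Python; the register names and the little op helper are made up for illustration). Each result can go to any register, so the independent operations never fight over a single top-of-stack location:

  # Toy register machine: a dict stands in for the register file.
  # Each result is written to a register of our choosing, so independent
  # operations never have to fight over a single "top of stack" location.
  regs = {}

  def op(dst, f, src1, src2):
      regs[dst] = f(regs[src1], regs[src2])

  regs.update(r1=2, r2=3, r3=5, r4=7)

  # (r1*r2) and (r3*r4) have no dependency on each other; an out-of-order
  # core with two ALUs could execute them in the same cycle.
  op("r5", lambda x, y: x * y, "r1", "r2")
  op("r6", lambda x, y: x * y, "r3", "r4")
  op("r7", lambda x, y: x + y, "r5", "r6")   # only this one must wait

  assert regs["r7"] == 2*3 + 5*7
  print(regs["r7"])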

happy to be corrected

👤_a_a_a_🕑2y🔼0🗨️0

(Replying to PARENT post)

"My history with Forth & stack machines" at https://yosefk.com/blog/my-history-with-forth-stack-machines... hits HN every year or more - https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu... .

The most relevant parts are:

> We thought about making our own CPU. The person with the overall responsibility for the hardware design gently told me that I was out of my mind. CPUs have register files and pipeline and pipeline stalls and dependency detection to avoid those stalls and it's too complicated.

> And then I asked, how about a stack machine? No register file. Just a 3-stage pipeline โ€“ fetch, decode, execute. No problem with register dependencies, always pop inputs from the top of the stack, push the result.

> He said it sounded easy enough alright, we could do that. "It's just like my RPN calculator. How would you program it?" "In Forth!" ...

> It was among the most painful programming experiences in my life. ...

> Speaking of which. I started to miss a C compiler. I downloaded LLVM. (7000 files plus a huge precompiled gcc binary. 1 hour to build from scratch. So?) I wrote a working back-end for the stack machine within a week. Generating horrible code. Someone else wrote an optimizing back-end in about two months.

> After a while, the optimizing back-end's code wasn't any worse than my hand-coded Forth.
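
For a flavour of why a naive stack-machine back-end is so quick to write (a toy sketch, not yosefk's code and not LLVM's API; the AST format is made up for illustration): code generation is just a post-order walk of the expression tree, emitting the operands and then the operator.

  # Naive stack-machine code generation: emit code for the operands, then the
  # operator. A post-order tree walk is the entire back-end.
  def gen(node, out):
      if isinstance(node, (int, float)):     # leaf: a literal
          out.append(("push", node))
      else:                                  # interior node: (op, left, right)
          op, left, right = node
          gen(left, out)
          gen(right, out)
          out.append((op,))
      return out

  # (2 + 3) * 4 as a nested-tuple expression tree
  tree = ("mul", ("add", 2, 3), 4)
  print(gen(tree, []))
  # [('push', 2), ('push', 3), ('add',), ('push', 4), ('mul',)]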

👤eesmith🕑2y🔼0🗨️0

(Replying to PARENT post)

Since then... was a decade ago (not 30 years)

https://twitter.com/philae2014/status/427842417920712704

https://www.cpushack.com/2014/11/12/here-comes-philae-powere...

> Why was the RTX2010 chosen? Simply put the RTX2010 is the lowest power budget processor available that is radiation hardened, and powerful enough to handle the complex landing procedure. Philae runs on batteries for the first phase of its mission (later it will switch to solar/back up batteries) so the power budget is critical. The RTX2010 is a Forth based stack processor which allows for very efficient coding, again useful for a low power budget.

> Eight of the instruments are also powered by a RTX2010s, making 10 total (running at between 8-10MHz). The lander also includes an Analog Devices ADSP-21020 and a pair of 80C3x microcontrollers as well as multiple FPGAs.

However... more recently:

Insight - https://www.jpl.nasa.gov/news/press_kits/insight/launch/miss...

> The computer's core is a radiation-hardened central processor with PowerPC 750 architecture called RAD 750. This processor operates at 115.5 megahertz speed, compared with 20 megahertz speed of the RAD6000 processor used on Mars Phoenix.

https://www.engadget.com/nasa-perseverance-powerpc-750-17151...

> When NASA's Perseverance guided itself to the surface of Mars on February 18th, it did so with the help of the same processor that powered the 1998 iMac G3. You know, the colorful and bulbous all-in-one desktop that saved Apple's business and helped it become the company it is today. According to the New Scientist (via Gizmodo), NASA's latest rover features an offshoot of the PowerPC 750 CPU. That processor is about 23 years old in 2021.

The answer appears to be "more powerful processors have since been designed that are radiation hardened."

That said... while the hardware isn't as common today, a lot of virtual machines are based on a stack architecture - from the JVM to WebAssembly to the Ethereum EVM.

👤shagie🕑3y🔼0🗨️0

(Replying to PARENT post)

Because unless you want to write everything in Forth, register-based architectures are faster and easier to generate optimized code for.
👤bitwize🕑3y🔼0🗨️0

(Replying to PARENT post)

I think Burroughs mainframes also used stack architectures. I suspect the architecture wasn't entirely responsible for Burroughs' demise, but they seem to be an outlier.

I imagine one potential problem with stack architectures is running out of stack memory and having to spill to main memory. (Presumably using main memory for the stack would be too slow.) This seems unlikely for a single thread but I could imagine it being a big issue with thousands of threads. Context switches where you have to save the whole stack to memory are going to be more like cache misses or page faults.

Another potential issue might be dealing with data that isn't on the top of the stack. A solution could be to add a deep stack addressing mode but then you're basically mixing in a register machine and the simplicity and orthogonality might be lost.
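
A toy sketch of what a deep stack addressing mode might look like (Python; pick is modelled loosely on Forth's PICK, and the values are arbitrary). Once you can read at an arbitrary depth, the stack starts to behave like a small register file:

  # A stack machine with a "deep addressing" op: pick(n) copies the value n
  # slots below the top onto the top of the stack (pick(0) duplicates the top).
  stack = []

  def push(x): stack.append(x)
  def add():   b, a = stack.pop(), stack.pop(); stack.append(a + b)
  def pick(n): stack.append(stack[-1 - n])    # read at an arbitrary depth

  push(10)       # a value we'll want again later
  push(1)
  push(2)
  add()          # 1 + 2 = 3; the 10 is now buried one slot down
  pick(1)        # reach past the top instead of shuffling: copies the 10
  add()          # 3 + 10 = 13

  assert stack == [10, 13]
  print(stack)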

The interesting thing to me is that several stack-oriented interpreters, compilers, and runtimes (including Forth, Pascal, Smalltalk, Java, and WebAssembly systems) have been developed, so a stack-based CPU might be a good match for those systems.

👤musicale🕑2y🔼0🗨️0

(Replying to PARENT post)

Almost all problems have data dependencies as graphs rather than stacks. e.g.

  Sandwich = egg + bacon + bread
All of egg, bacon, and bread are almost always independent, and can be computed independently. A graph (i.e. a register architecture) enables this independence to be represented.

So register architectures are more complicated, but usefully so.
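
A toy sketch of that point (Python; the op list and the single-step "cycles" are made up for illustration): scheduling by dependency level puts all three ingredient steps in the same cycle, something a pure stack ordering can't express.

  # Each op names the value it produces and the values it consumes; scheduling
  # by dependency level shows which ops could run in parallel.
  ops = [
      ("fry_egg",     "egg",      []),
      ("grill_bacon", "bacon",    []),
      ("slice_bread", "bread",    []),
      ("assemble",    "sandwich", ["egg", "bacon", "bread"]),
  ]

  level = {}        # value name -> earliest cycle it's ready
  schedule = {}     # cycle -> list of op names
  for name, produces, consumes in ops:
      cycle = 1 + max((level[v] for v in consumes), default=0)
      level[produces] = cycle
      schedule.setdefault(cycle, []).append(name)

  for cycle in sorted(schedule):
      print(cycle, schedule[cycle])
  # 1 ['fry_egg', 'grill_bacon', 'slice_bread']   <- independent, same cycle
  # 2 ['assemble']                                <- the only true dependency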

👤tsegratis🕑2y🔼0🗨️0

(Replying to PARENT post)

This is really an economics question - registers are faster than a stack, for a similar price.

In the '80s, transistors on VLSI were expensive, and in rad-hardened designs they were prohibitively expensive; a stack gives you the same Turing completeness as a many-register instruction set with far fewer of them.

So for a while stack architectures did well in microcomputers, then their niche shrank to narrow, specialized uses, and once many registers became affordable for everyone, stack architectures became history.

👤simne🕑2y🔼0🗨️0

(Replying to PARENT post)

Building a solid computer stack (hardware through software) is an exercise in mechanical sympathy, understanding the traits of the underlying components which enable the practitioner to get as much out of it as possible. The software should do the thing that the software is good at, and the hardware should do the things that the hardware is good at, and neither exists in a vacuum, both should be aware of the constraints of the other.

For one example, about 25 years ago Intel made a processor called Itanium which used 'VLIW' (very long instruction word) instructions as opposed to x86. In this case, a lot of the complexity of operating a processor is shifted into the compiler, since individual CPU instructions may perform complex operations. In a modern processor, instructions are 'simple', and the processor is responsible for finding the data dependencies that allow multiple instructions to be executed in parallel (this is called 'superscalar' processing). With Itanium, the compiler author can encode such individual operations into more complex instructions which can be executed in parallel, all at once. So, where in x86 I might have 3 different instructions (add r1, r2; sub r3, r4; mul r5, r6), with Itanium I might be able to combine all of those into a single instruction at compile time. By doing this, in theory we can remove a lot of the complex superscalar logic from the processor, devote more of the chip to execution units, and achieve more overall performance.

One problem here is that the compiler may not actually know what parallelism is possible. For example, the latency of a given instruction may vary from chip to chip, and the number of execution units for a single operation may vary. Our code is now very highly optimised for a particular CPU: we may not see natural speedups as newer processors are made, and newer processors might not even support the code we've written without a translation layer. Taken to extremes, we lose the universality of software (the ability to change the hardware without rebuilding the software). VLIW chips are viable, but they're typically used in more isolated use cases (like DSPs) where it's reasonable to recompile the software for every chip (and where perhaps there is lots of parallelism in the problem space).

Stack machines essentially have similar issues. High performance processors tease out the data dependencies between instructions in order to perform superscalar processing, which is key to good performance in a modern processor. Stack machines make it very difficult to see those data dependencies because, in principle, all data depends on all other data, so a post-processing step is required (which is considerably more expensive than the processor equivalent of converting to SSA form). Modern register machines rely heavily on register renaming to ensure that superscalar processing and pipelining work well with each other; maintaining similar performance with a stack machine would lead to similar solutions, so switching from a register machine to a stack machine would very likely end up looking like a translation layer on the chip, from the stack machine code back into a register-like internal language.
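
A toy sketch of what such a translation layer does (Python; the bytecode format and virtual-register names are illustrative, not the JVM's or any real chip's): abstractly interpret the stack, giving each pushed value a fresh virtual register, and register-style three-address code with explicit dependencies falls out.

  from itertools import count

  # Translate stack bytecode into register-style three-address code by
  # abstract interpretation: the simulated stack holds virtual register
  # names instead of runtime values.
  def stack_to_registers(bytecode):
      vstack, out, fresh = [], [], count()
      for instr in bytecode:
          if instr[0] == "push":
              reg = f"v{next(fresh)}"
              out.append((reg, "const", instr[1]))
          else:                               # binary op: pop two, push one
              rhs, lhs = vstack.pop(), vstack.pop()
              reg = f"v{next(fresh)}"
              out.append((reg, instr[0], lhs, rhs))
          vstack.append(reg)
      return out

  # Stack code for (2 + 3) * 4
  code = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
  for line in stack_to_registers(code):
      print(line)
  # ('v0', 'const', 2)
  # ('v1', 'const', 3)
  # ('v2', 'add', 'v0', 'v1')
  # ('v3', 'const', 4)
  # ('v4', 'mul', 'v2', 'v3')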

Stacks do exist in a few places these days. The JVM clearly does it - Java is 'register oriented' for the programmer, the JVM instruction set is stack oriented, and the runtime converts it back into register-oriented machine code. Another example is the ARM Jazelle extensions, which allow some ARM processors to directly execute Java bytecode. Here, the bytecode is converted into machine code as an extra stage in the processor pipeline, adding overhead.

In summary:

- There isn't much advantage to stack machines; it's relatively straightforward to translate between the two.

- Modern processors are architecturally closer to register machines, and a lot of the techniques used to make them fast would have to be replicated on a stack machine.

- Because any decisions at this level of the stack are effectively invisible to the user in modern tech, it's beneficial to go for the simplest solution to any given problem, which here is to use a register machine at the hardware level.

👤CHY872🕑3y🔼0🗨️0

(Replying to PARENT post)

I think it's a memory bandwidth thing - a shift in the ratio between on-chip and off-chip access penalties. One could imagine, though, an implementation with a lot of on-chip top-N-of-stack storage and renaming, but I'm not aware of one.
👤pinewurst🕑3y🔼0🗨️0

(Replying to PARENT post)

Modern machines are optimized to run C, which is in fact essentially "stack based", but registers "cache" parts of the stack (often never touching the stack at all in leaf procedures).

There were at least two well-known machines that tried to mix the stack and registers - SPARC (with its register windows) and my personal favorite, the Am29000.

Intel was so dominant in manufacturing for so long that the objectively kind of crappy x86 ISA just didn't matter for price/performance (and of course was a plus for running legacy software).

👤jawilson🕑2y🔼0🗨️0

(Replying to PARENT post)

Mainly familiarity, I'd say. Realise that as early as ENIAC we were using accumulator-based ALU architectures, whereas the first Forth chips only started coming out in the eighties. It's hard to overcome that kind of momentum, even if the alternatives have some advantages.

Vaguely related: https://users.ece.cmu.edu/~koopman/stack_computers/index.htm...

👤kdklol🕑3y🔼0🗨️0

(Replying to PARENT post)

40 years ago memory was faster than the CPU, so using the stack to hold operands incurred no speed penalty.

More important now is that registers are global resources, and global resources are a nightmare in multi-threaded code. Stack machines don't have global resources (except a stack pointer and ALU which are easy to manage).

But then CPUs and registers got fast and register machines began to predominate. Now you have to deal with the register allocation nightmare, because registers are essentially global variables.

Java takes the approach of presenting a stack machine with no registers and then letting the interpreter figure out register allocations at run time (or nearly so).

Life would be simpler if we had 1000 simple cores, each with its own small private set of registers, and they communicated by message passing - i.e., the Actor model in hardware.
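
A toy software sketch of that idea (Python threads standing in for the cores; everything here is illustrative): each "core" keeps its state private, and the only interaction is through message queues, so there is no shared register file or other global resource to coordinate.

  import queue, threading

  # Each "core" owns its state privately; the only communication is messages.
  def adder_core(inbox, outbox):
      total = 0                   # private state: no other core can see it
      while True:
          msg = inbox.get()
          if msg is None:         # shutdown message
              outbox.put(total)
              return
          total += msg

  inbox, outbox = queue.Queue(), queue.Queue()
  threading.Thread(target=adder_core, args=(inbox, outbox), daemon=True).start()

  for n in (1, 2, 3, 4):
      inbox.put(n)                # send messages instead of sharing memory
  inbox.put(None)
  print(outbox.get())             # -> 10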

👤dreamcompiler🕑2y🔼0🗨️0

(Replying to PARENT post)

Much as I love Forth and stack-based architectures, I have to admit that they are a very niche technology. To understand why they never really took off you have to put a pile of computers beside a pile of cars.

The internal combustion engine is terrible. Noisy, inefficient, difficult to make, and somehow everything else we've tried is worse. More-or-less every car you've ever driven has been powered by a four-stroke four-cylinder engine. It might have had overhead valves or an overhead cam, it might have had a carburettor or electronic fuel injection, but it's still the same basic technology going back over 100 years. Why?

Because it sucks, but everything else sucks worse. Two-strokes are simple but fiddly to get running over a wide range of power settings, Wankels are light and simple and don't vibrate but are incredibly fiddly to make.

Likewise in computing, it turns out that you can do all sorts of clever tricks: hardware support for object-orientated memory, like the Linn Rekursiv (the last known example of which is on the bottom of the Forth and Clyde Canal, in the north of Glasgow, which ought to tell you how successful a concept it was); hardware stack-based processors, like the RTX2010 and its predecessors, or the HP3000 minicomputer; or CISC processors that took the first C to strange new places, like the DEC VAX architecture that was planned to have single machine-code instructions for solving polynomials and multiplying matrices. However, you almost never use these, and - if you actually look at what the compiler emits - you find you spend all your time shuffling values between registers and firing them into the ALU.

And that's where RISC comes in.

You need to move things from memory to registers (masses of registers) and back, or fire them into the ALU and get a result back quickly. Combine all these with conditionals based on the value of the status register, and you've got a very powerful machine instruction set that lends itself well to general-purpose computing. Sure, you need to write quite a lot of code to do your matrix multiply now, but that's okay because it's written in lots of small efficient steps, running on a small efficient hardware platform.

So why would you implement a stack machine these days, and how? Well, the obvious use for something like Forth is in bringing up a machine from bare metal, where with a couple of kB of assembly code (just enough to set up the hardware you want to use, and implement a couple of dozen primitive words) and a couple of kB of Forth code (Forth is mostly written in Forth) you can have a working environment that you can actually write code in and use to control your new computer. Some CPUs are particularly well-suited to this, having enough memory index registers (or, like the 6809 and friends, two stack pointers) to implement the parameter and return stacks, and if you've got registers left over you can do things like keeping the top of stack in a register, saving a memory access or two. The trick with "Forth written in Forth" is that once you've got the inner interpreter running, all you need to do to write words in Forth from your assembler is define each word's header and then a list of addresses of the words you want to call.
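
A rough sketch of that trick (Python, with functions standing in for the handful of primitives you'd write in assembly; the word names are illustrative): once the inner interpreter can walk a list of word references, every new word is just another list.

  # Primitives: the few words you'd write in assembly.
  stack = []
  def lit(x): return lambda: stack.append(x)        # push a literal
  def dup():  stack.append(stack[-1])
  def mul():  b, a = stack.pop(), stack.pop(); stack.append(a * b)
  def dot():  print(stack.pop())

  # The inner interpreter: a colon definition is just a list of references
  # to other words; executing it means walking the list.
  def execute(word):
      if callable(word):
          word()
      else:                     # a colon definition: a list of words
          for w in word:
              execute(w)

  # : SQUARE  DUP * ;     and     : SHOW-SQUARE  SQUARE . ;
  SQUARE      = [dup, mul]
  SHOW_SQUARE = [SQUARE, dot]

  execute([lit(7), SHOW_SQUARE])   # prints 49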

If you wanted to implement a stack machine in hardware, like if you were heavily into homebrew computing or paleocomputing, you could probably do it in a few dozen common discrete logic chips, a couple of 74LS181 ALUs (you can still buy these!) and a couple of EPROMs for the microcode.

👤Gordonjcp🕑2y🔼0🗨️0