Interpreter type speed comparison


Content

Types of interpreters

There are countless types of interpreters, but in this article I will focus in three: AST walking interpreters, and 2 kinds of bytecode interpreters (stack-based and register-based).

AST walking interpreters

These are one of the simplest kind, they are probably the first thing that comes to mind when writing an interpreter. After parsing the code into an Abstract Syntax Tree (a tree representation, where each node represents a statement or expression), the tree is evaluated directly, in a recursive manner. The main advantage is its simplicity, clarity, and ease of debugging. These make AST walkers a valuable learning tool, and sometimes the best choice if performance is not a consideration.

However, performance with this kind of interpreters tends to be bad. Lots of allocations are needed to build the AST (which could potentially be prevented using an arena-like structure, but that adds some complexity to the code), and evaluation is slow (moving through a tree-like structure is not very cache friendly)

Another problem is that it offers less flexibility than a VM-based approach. For example, the interpreted language uses the native language's stack for its own function calls (and also for evaluating expressions). This limits the amount of recursion and the complexity of the expressions that can be evaluated. It also makes it extremely challenging, if not impossible, to implement some things that can be implemented with a VM with an explicit instruction pointer. For example, implementing Lua coroutines or proper tail calls would be quite challenging in an AST walker

Modern languages rarely use AST walking interpreters, though very early versions of Ruby used one, before switching to a bytecode VM.

Stack-based bytecode interpreter

In this kind of interpreter, code is compiled to an array of instructions (generally, each instruction is at most a couple of bytes long, for instance in my case they're 4 bytes long), and then the instructions are executed sequentially (think of a while loop with a switch inside, though some interpreters use more advanced instruction dispatch methods).

A stack holds local variables and temporary values. All instructions take its operands from the top of the stack, and push any results on the top of the stack. This tends to require lots of instructions to pop the operands/push the results, which has an overhead.

These interpreters usually perform better than the previous ones, since this representation of the code is more cache-friendly, and also recursion is avoided. Additionally, having a VM with an explicit instruction pointer, makes it easier to implement some features. Furthermore, I didn't find implementing a stack-based interpreter that much harder than the AST walker, so it seems to be a reasonable midpoint between performance and ease of implementation.

Many languages use stack-based interpreters, including Python and Java

Register-based bytecode interpreter

These are very similar to the previous ones, except that each instruction has encoded in it the locations of their operands, and where to store its result. This avoids a lot of the overhead of all the stack-managing instructions.

The implementation of the VM is similar to stack-based VMs, but writing the compiler was quite a bit trickier, since you have to keep track at compile time of where each value will end up at runtime. I especially found short-circuit logic operators (and/or) hard to implement.

Though they are less common than stack-based interpreters, register interpreters are used, and Lua itself is an example of a register-based interpreter.

Code comparison

A good way to compare these approaches is to compare their internal representation of the code. Below is the representation of the following code, for each of the implementation.

The code:
1
local i = 0
2
while i < 5000000 do
3
    i = i + 1
4
end
5
print(i)
6
return i
The representations:
AST Stack 1 Registers1
[
    Local(
        [ Identifier(2) ],
        [ NumberLiteral(0.0) ],
    ),
    While(
        Lt(
            (
                IdentifierExpression( Identifier(2) ),
                NumberLiteral(5000000.0),
            ),
        ),
        [
            Assign(
                [
                    IdentifierExpression( Identifier(2) ),
                ],
                [
                    Plus(
                        (
                            IdentifierExpression(
                                Identifier(2),
                            ),
                            NumberLiteral(1.0),
                        ),
                    ),
                ],
            ),
        ],
    ),
    Call(
        IdentifierExpression( Identifier(6) ),
        [
            IdentifierExpression( Identifier(2) ),
        ],
    ),
    Return(
        Some(
            IdentifierExpression( Identifier(2) ),
        ),
    ),
]
1 Number(NumberHandle(0)) ; 0
2 GetLocal(LocalHandle(0))
| Number(NumberHandle(1)) ; 5000000
| Lt
| JmpIfFalsePop(6)
3 GetLocal(LocalHandle(0))
| Number(NumberHandle(2)) ; 1
| Add
| SetLocal(LocalHandle(0))
2 Loop(8)
5 GetGlobal(StringHandle(0)) ; "print"
| GetLocal(LocalHandle(0))
| Call(1)
| Pop
6 GetLocal(LocalHandle(0))
| Return
| Pop
| ReturnNil
1  Number { dst: 0, src: NumberHandle(0) } ; 0
2  Number { dst: 1, src: NumberHandle(1) } ; 5000000
|  Lt(JmpArgs { lhs: 0, rhs: 1 })
|  Jmp(4)
3  Number { dst: 1, src: NumberHandle(2) } ; 1
|  Add(BinArgs { dst: 0, lhs: 0, rhs: 1 })
2  Jmp(-5)
5  GetGlobal { dst: 1, src: StringHandle(0) } ; "print"
|  Mv(UnArgs { dst: 2, src: 0 })
|  Call { base: 1, nargs: 1 }
6  Return { src: 0 }
|  ReturnNil

An interesting observation is that Lt works differently between the stack and register implementations. In the stack-based implementation, it pushes the boolean result to the stack, which is used and popped by JmpIfFalsePop . Meanwhile, in the register-based implementation, it just skips the next instruction if the condition is true. If it's false, then the Jmp will get executed, and it'll exit the loop. Also, all jumps are relative.

The language

The language we'll use for the comparison is a subset of Lua 2 that I implemented in Rust, called Rua 3 It's not feature complete yet, but it is complete enough to get an idea of the performance characteristics of the different implementations.

I implemented it in 3 different ways: an AST walking interpreter4 a stack-based bytecode interpreter5 and a register-based bytecode interpreter6. I won't go into much detail on each implementation, but you're free to give them a look.

One thing I consider important to mention is that, since the implementation is written in mostly safe Rust, reference counting pointers (Rc) are used, on top of real cycle-detecting GC. This brings an additional cost to copying values around, since the refcount has to be incremented (and later decremented when the reference is destroyed). I expect this cost to be especially significant in the case of the stack-based VM, since this architecture needs to copy around values very often, since it can only operate on the values on the top.

It's hard to compare how "optimized" each implementation is, given how different they are, but similar amount of effort has been given to optimize each implementation. For example, I've been careful not to perform superfluous heap allocations and cloning. On the other hand, more complex optimizations have not been implemented. This includes constant folding/propagation, function inlining, loop unrolling, special instructions for variable-constant operations, etc.

There are a couple of "big" differences though. The first one is how local variables are stored. While in both bytecode implementations they are stored directly in the stack and accessed by index, in the AST walking implementation they are stored in a FxHashMap<u32, Value> (where the u32 represents the identifier). This is because this implementation doesn't have a stack like the others do, but it makes accessing locals a bit slower. The second one is that both bytecode implementations have a real cycle-detecting garbage collector (on top of reference counting), whereas the AST walker only uses reference counting, which heavily influences benchmarks that creates lots of objects. To solve this disparity, I'll also benchmark those with GC disabled (by making the Vm::should_gc function return always false, in case you want to reproduce the results).

Though it would certainly be nice to compare fully optimized implementations, I don't have the time to dedicate to optimizing three implementations of the same interpreter. However, I will continue optimizing the register-based implementation, and may write a followup later measuring the improvements.

Benchmarks

Writing a benchmark that accurately depicts the performance of a program is always tricky. But in the case of a general purpose language interpreter, I found it to be a particular kind of impossibility. The space of possible inputs is so big and diverse, and real world use cases vary so wildly, there's no way to capture all of that in a single number.

Since any attempt I could've made would be flawed, I took the simple route and borrowed some benchmarks from Programming-Language-Benchmarks. I used binarytrees and nbody (with slight modifications since my interpreters are not 100% Lua compatible). I also added the simple loop code from before. Since the instructions are extremely simple, it may help to give an idea of the actual interpreter overhead.

Times were measured using hyperfine, with the options --warmup 3 and --min-runs 10 . Note that the results include startup, parsing and execution time, though in these benchmarks both startup and parsing times are negligible.

As for hardware, benchmarks were ran on a laptop with an Intel i5-5200U, 16GB of DDR3 RAM (1333MHz), running Arch7 Linux (kernel version 6.6.8-arch1-1)

Results

Benchmark AST Stack Registers Stack (Rc only) Registers (Rc only)
binarytrees 7.703s ± 0.077s 5.611s ± 0.045s 4.675s ± 0.027s 4.960s ± 0.032s 4.102s ± 0.040s
nbody 5.605s ± 0.101s 3.257s ± 0.033s 1.530s ± 0.024s 3.275s ± 0.048s 1.510s ± 0.011s
loop 978.7ms ± 28.4ms 696.7ms ± 6.5ms 233.2ms ± 3.4ms 693.0ms ± 4.5ms 234.8ms ± 4.9ms

As expected, the register VM is faster than the stack VM, which is faster than the AST walking interpreter. The performance difference was most noticeable in the loop benchmark (the register interpreter was 4.20 times faster than the tree walker), and least noticeable in the binarytrees benchmark (the register interpreter, even without GC, was only 1.88 times faster than the tree walker).

This makes sense, since in loop all operations are very cheap, so the interpreter overhead becomes more relevant. Using a profiler, I can see that in binarytrees, a very significant amount of time is spent reserving space for tables, and performing key lookups, which is equally costly in either interpreter.

Conclusion

There is very much a trade-off between speed and complexity here. In fact, if all you want is speed, you should be implementing a JIT interpreter, with all the complexity that brings. Therefore, which interpreter is the best very much depends on the situation.

However, for anything other than learning purposes, I probably wouldn't choose an AST walking interpreter. Implementing a stack-based interpreter is not that much more complex, and it provides a nice boost in speed, plus the already mentioned extra flexibility for implementing some more advanced features.

Finally, it's worth noting that many other things will influence an interpreter speed, so you shouldn't take the interpreter's kind as the be-all and end-all of performance. In no particular order, here's a few factors that can greatly affect an interpreter's performance:

  • What language it's implemented on
  • General compiler optimizations (constant propagation, function inlining, etc)
  • The bytecode instruction set
  • How the instruction dispatch is implemented (while/switch, direct threading, etc)
  • How values are represented (and what kinds of values the language provides)
  • Kind of garbage collector. Does it use reference counting?

  • 1 The numbers to the left are the line number each instruction corresponds to
  • 3 I know it's a bad name, I'll change it at some point
  • 7 btw