Add symbol caches to opcodes that do dynamic resolution of names #28

Merged 24 commits into master on Dec 24, 2024

Conversation

@bamless (Owner) commented on Dec 22, 2024

High level overview and preliminary tests

This technique is usually known as inline caching.

Basically, we optimistically cache the result of a method/field/global-variable name resolution inside the runtime state of the VM, in the hope that types won't have changed by the next execution of the opcode. When that holds, we can return the cached result directly (the value itself in the case of methods, the field offset inside the ObjInstance for fields) instead of redoing name resolution, which involves a hashtable lookup in the J* VM.
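As a minimal sketch of the idea (hypothetical, simplified types, not the actual J* structures; `slow_resolve` stands in for the full hashtable-based name resolution):

```c
#include <stddef.h>

// Hypothetical minimal types for illustration only.
typedef struct Class { const char* name; int field_offset; } Class;
typedef struct Instance { Class* cls; int fields[4]; } Instance;

// Per-opcode cache: the last class seen and the offset resolved for it.
typedef struct Cache { Class* key; int offset; } Cache;

// Stand-in for the hashtable lookup performed on a cache miss.
static int slow_resolve(const Class* cls) {
    return cls->field_offset;
}

int get_field(Instance* inst, Cache* cache) {
    if (cache->key == inst->cls) {           // hit: same class as last time
        return inst->fields[cache->offset];  // skip name resolution entirely
    }
    int off = slow_resolve(inst->cls);       // miss: do the full lookup...
    cache->key = inst->cls;                  // ...and remember it for next time
    cache->offset = off;
    return inst->fields[off];
}
```

The guard on `cache->key` is what keeps the cache correct when the same opcode later sees a value of a different class.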

Final Results

Achieved a ~30% speedup on code that heavily relies on method, field, or global-variable lookups. A speedup is also achieved across the board, as most code uses at least global variables (i.e. module imports or module calls) fairly extensively.

Heavy method calls:

Summary
  ~/Workspace/c/jstar/build/Release/bin/jstar  method_call.jsr ran
    1.03 ± 0.05 times faster than lua method_call.lua
    1.24 ± 0.05 times faster than jstar method_call.jsr
    1.36 ± 0.06 times faster than ruby method_call.rb
    1.36 ± 0.06 times faster than python method_call.py

Heavy use of fields (and some method calls):

Summary
  ~/Workspace/c/jstar/build/Release/bin/jstar  binary_trees.jsr ran
    1.32 ± 0.04 times faster than python binary_trees_class.py
    1.35 ± 0.04 times faster than jstar binary_trees.jsr

Heavy use of global vars:

Summary
  ~/Workspace/c/jstar/build/Release/bin/jstar  fib.jsr ran
    1.26 ± 0.05 times faster than jstar fib.jsr
    1.70 ± 0.06 times faster than python fib.py

Moderate use of global vars:

Summary
  ~/Workspace/c/jstar/build/Release/bin/jstar  for_gen.jsr ran
    1.13 ± 0.03 times faster than jstar for_gen.jsr
    1.61 ± 0.08 times faster than python for_gen.py

(Almost) no use of global vars:

Summary
  ~/Workspace/c/jstar/build/Release/bin/jstar  for.jsr ran
    1.02 ± 0.02 times faster than jstar for.jsr
    3.06 ± 0.07 times faster than python for.py

All benchmarks above are microbenchmarks. Here we also present a benchmark on a more realistic project: https://github.com/bamless/pulsar (arguments to pulsar are shortened with ... as there are lots of them)

Summary
  ~/Workspace/c/jstar/build/Release/bin/jstar ./pulsar.jsr pulsar/*jsr ... 2>/dev/null ran
    1.16 ± 0.03 times faster than jstar ./pulsar.jsr pulsar/*jsr ... 2>/dev/null

Implementation overview

The inline caching implementation presented in this PR differs slightly from classical inline cache implementations found in other VMs (as far as I can tell).
In certain aspects, it resembles the name resolution performed by the HotSpot JVM when resolving a field or class name for the first time. Unlike Java, which can check the type of a value statically at compile time, our implementation is guarded by checks on a key (typically the last type, i.e. ObjClass, the opcode has seen) to prevent erroneously resolving a name from another class when the type of a variable changes for the same opcode.

Implementation in detail

One key aspect in which this implementation differs from classical approaches is that caches are not actually stored inline in the bytecode. Instead, a new array of Symbols has been added to the runtime representation of compiled code. These symbols function as a proxy to a constant String in the base case, for example when first trying to resolve a name. In this case, everything works as before, but with an extra level of indirection:

  • We get the symbol
  • We get the string representing the name from the constant pool using the symbol
  • We resolve the name following the semantics of J* on fields/methods/global values.

Then, an extra and crucial step is added:

  • Register the resolved value, along with its type (field, method, bound method, global) and class inside the symbol.

The next time that opcode is executed, we first check the symbol for a cached value, and if the types match we directly return the resolved value without doing a full name resolution.
To give a better idea, this is the new code.h:

typedef struct Symbol {
    enum { SYMBOL_METHOD, SYMBOL_BOUND_METHOD, SYMBOL_FIELD, SYMBOL_GLOBAL } type;
    uint16_t constant;
    Obj* key;
    union {
        Value method;
        int offset;
    } as;
} Symbol;

typedef struct Code {
    // ... same as before
    int symbolCapacity, symbolCount;
    Symbol* symbols;
} Code;
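To illustrate how such symbols might be appended when the compiler emits a name-resolving opcode (this part isn't shown in the PR; the `Symbol` here is simplified without the value union, and `addSymbol` with its doubling growth strategy is an assumption):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct Obj Obj; // opaque; stands in for the real J* object header

// Simplified Symbol: just the cache key and the constant-pool index of the name.
typedef struct Symbol {
    enum { SYMBOL_METHOD, SYMBOL_BOUND_METHOD, SYMBOL_FIELD, SYMBOL_GLOBAL } type;
    uint16_t constant;
    Obj* key;
} Symbol;

typedef struct Code {
    int symbolCapacity, symbolCount;
    Symbol* symbols;
} Code;

// Append a fresh, unresolved symbol referring to constant-pool entry `constant`
// and return its index; the opcode stores this index to find its cache slot.
int addSymbol(Code* c, uint16_t constant) {
    if (c->symbolCount + 1 > c->symbolCapacity) {
        c->symbolCapacity = c->symbolCapacity ? c->symbolCapacity * 2 : 8;
        c->symbols = realloc(c->symbols, c->symbolCapacity * sizeof(Symbol));
    }
    c->symbols[c->symbolCount] =
        (Symbol){.type = SYMBOL_FIELD, .constant = constant, .key = NULL};
    return c->symbolCount++;
}
```

A NULL `key` marks the symbol as unresolved, so the first execution of the opcode always falls through to full name resolution.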

This is the new OP_GET_FIELD implementation:

    TARGET(OP_GET_FIELD): {
        Symbol* sym = GET_SYMBOL();
        ObjString* name = AS_STRING(fn->code.consts.arr[sym->constant]);
        if(!getValueField(vm, name, sym)) {
            UNWIND_STACK();
        }
        DISPATCH();
    }

Note the extra indirection to get the field's name from the symbol, and the fact that the symbol is forwarded to getValueField.

bool getValueField(JStarVM* vm, ObjString* name, Symbol* sym) {
    Value val = peek(vm);
    if(IS_OBJ(val)) {
        switch(AS_OBJ(val)->type) {
        case OBJ_INST: {
            ObjInstance* inst = AS_INSTANCE(val);
            ObjClass* cls = inst->base.cls;

            Value field;
            if(getCached(vm, (Obj*)cls, (Obj*)inst, sym, &field)) {
                pop(vm);
                push(vm, field);
                return true;
            }

            // ...full name resolution is performed here

getCached performs the magic: if it finds that the key of the cache satisfies the preconditions, it returns the value directly:

static bool getCachedProperty(JStarVM* vm, Obj* key, Obj* val, const Symbol* sym, Value* out) {
    if(!isSymbolCached(key, sym)) return false;
    switch(sym->type) {
    case SYMBOL_METHOD:
        *out = sym->as.method;
        return true;
    case SYMBOL_BOUND_METHOD:
        *out = OBJ_VAL(newBoundMethod(vm, OBJ_VAL(val), AS_OBJ(sym->as.method)));
        return true;
    case SYMBOL_FIELD:
        return getFieldOffset((ObjInstance*)val, sym->as.offset, out);
    case SYMBOL_GLOBAL:
        return getGlobalOffset((ObjModule*)key, sym->as.offset, out);
    }
    return false;
}
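The guard itself isn't shown in the excerpt. A plausible sketch of `isSymbolCached`, under the assumption that it simply compares the cached key against the class/module currently being resolved against:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Obj { int type; } Obj; // simplified object header

typedef struct Symbol {
    Obj* key; // last ObjClass/ObjModule this opcode resolved against
} Symbol;

// Plausible guard (the real body is not shown in the PR excerpt): a hit
// requires the symbol to be populated and the key to be unchanged.
static bool isSymbolCached(const Obj* key, const Symbol* sym) {
    return sym->key != NULL && sym->key == key;
}
```

This is the check that makes the cache "optimistic": a class change at the same opcode fails the guard and falls back to full resolution.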

On cache hits, this gives us a massive boost in performance, coming from not having to do one (and possibly multiple) hashtable lookups.

Implications of the new implementation

One pretty big change coming with this PR is that ObjInstance and ObjModule now have a different struct layout. These two objects now store their values in a plain array, indexed by a HashTable<String, int> mapping names to offsets in this array. This means that on a cache miss we pay the cost of an extra indirection to look up a field (we first look up the offset using the name, and then we index into the array with it). This could impact performance when we have lots of cache misses. In practice, though, it doesn't seem to cost us much, and the performance gains on cache hits vastly justify a couple of extra memory reads in the worst case.
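To illustrate the two lookup paths this layout creates (hypothetical, simplified types; a linear scan stands in for the real hashtable):

```c
#include <string.h>

#define MAX_FIELDS 8

// Stand-in for the HashTable<String, int> name -> offset index
// (a linear scan, purely for illustration).
typedef struct { const char* names[MAX_FIELDS]; int count; } NameTable;

typedef struct {
    const NameTable* fieldIndex; // per-class name -> offset mapping
    int values[MAX_FIELDS];      // flat field storage, indexed by offset
} Instance;

// Cache-miss path: two steps, name -> offset, then offset -> value.
static int resolveOffset(const NameTable* t, const char* name) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0) return i;
    return -1;
}

// Cache-hit path: the offset stored in the symbol indexes the array directly,
// skipping the name lookup entirely.
static int getByOffset(const Instance* inst, int offset) {
    return inst->values[offset];
}
```

On a miss we pay for both steps; on a hit only `getByOffset` runs, which is the extra-indirection trade-off described above.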

Also, the object layout change is not a problem for binary compatibility of the library: J* has never exposed internal types to embedding applications, relying instead fully on the stack-based protocol for embedding. This means that everything should behave exactly as before.

The only thing that will be done is extending the J* embedding interface with new functions that mirror jsrGetField, jsrInvoke, etc... and take a symbol as input, in order to give the embedder the option to cache lookups from the extension side.

Progress

  • Add symbols to runtime repr of compiled code
  • Change compiler to emit symbols
  • Change de/serializers to write and read symbols
  • Cache method lookups
  • Cache field lookups
  • Cache bound method lookups
  • Cache global variable lookups
  • Replace the name-lookup hashtables with a specialized int version. Better yet, rewrite the hashtable implementation to be generic using macros.
  • Add new API methods to permit caching from the embedding side
  • Review and cleanup of implementation
  • Testing

Further work

Once this is fully implemented, it would probably be worthwhile to try implementing quickening. The two techniques play well together and would probably result in a further speed-up.
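For a rough idea of what quickening would mean here (the opcode names and rewrite point are purely hypothetical, not part of this PR): after the first successful resolution, the generic opcode in the bytecode stream is overwritten in place with a specialized variant that assumes a populated cache slot:

```c
#include <stdint.h>

// Hypothetical opcodes: a generic field get and its quickened variant.
enum { OP_GET_FIELD = 1, OP_GET_FIELD_CACHED = 2 };

// Rewrite the instruction in place once a cache slot has been populated,
// so later executions dispatch straight to the fast path.
static void quicken(uint8_t* ip) {
    if (*ip == OP_GET_FIELD) *ip = OP_GET_FIELD_CACHED;
}
```

This pairs naturally with the symbol caches: the quickened opcode can assume the guard-check fast path and only deoptimize back to the generic opcode on a key mismatch.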

@bamless bamless marked this pull request as ready for review December 24, 2024 13:34
@bamless bamless merged commit 2668c3d into master Dec 24, 2024
3 checks passed
@bamless bamless deleted the inline-caching branch December 24, 2024 13:41