Add symbol caches to opcodes that do dynamic resolution of names #28

Merged 24 commits into master on Dec 24, 2024

Conversation

@bamless (Owner) commented on Dec 22, 2024

High level overview and preliminary tests

This technique is usually known as inline caching.

Basically, we optimistically cache the result of a method/field/global-variable name resolution inside the runtime state of the VM, in the hope that types won't have changed by the next execution of the opcode. When that holds, we can return the cached result directly (the value itself in the case of methods, the field offset inside the ObjInstance for fields) instead of redoing name resolution, which involves a hashtable lookup in the J* VM.
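As a minimal sketch of the idea (hypothetical, simplified types, not the actual J* structures; `slow_resolve` stands in for the full hashtable-based name resolution):

```c
#include <stddef.h>

// Hypothetical minimal types for illustration only.
typedef struct Class { const char* name; int field_offset; } Class;
typedef struct Instance { Class* cls; int fields[4]; } Instance;

// Per-opcode cache: the last class seen and the offset resolved for it.
typedef struct Cache { Class* key; int offset; } Cache;

// Stand-in for the hashtable lookup performed on a cache miss.
static int slow_resolve(const Class* cls) {
    return cls->field_offset;
}

int get_field(Instance* inst, Cache* cache) {
    if (cache->key == inst->cls) {           // hit: same class as last time
        return inst->fields[cache->offset];  // skip name resolution entirely
    }
    int off = slow_resolve(inst->cls);       // miss: do the full lookup...
    cache->key = inst->cls;                  // ...and remember it for next time
    cache->offset = off;
    return inst->fields[off];
}
```

The guard on `cache->key` is what keeps the cache correct when the same opcode later sees a value of a different class.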

Final Results

Achieved a ~30% speedup on code that heavily relies on method, field, or global-variable lookups. A speedup is also achieved across the board, as most code uses at least global variables (i.e. module imports or module calls) fairly extensively.

Heavy method calls:

Summary
  ~/Workspace/c/jstar/build/Release/bin/jstar  method_call.jsr ran
    1.03 ± 0.05 times faster than lua method_call.lua
    1.24 ± 0.05 times faster than jstar method_call.jsr
    1.36 ± 0.06 times faster than ruby method_call.rb
    1.36 ± 0.06 times faster than python method_call.py

Heavy use of fields (and some method calls):

Summary
  ~/Workspace/c/jstar/build/Release/bin/jstar  binary_trees.jsr ran
    1.32 ± 0.04 times faster than python binary_trees_class.py
    1.35 ± 0.04 times faster than jstar binary_trees.jsr

Heavy use of global vars:

Summary
  ~/Workspace/c/jstar/build/Release/bin/jstar  fib.jsr ran
    1.26 ± 0.05 times faster than jstar fib.jsr
    1.70 ± 0.06 times faster than python fib.py

Moderate use of global vars:

Summary
  ~/Workspace/c/jstar/build/Release/bin/jstar  for_gen.jsr ran
    1.13 ± 0.03 times faster than jstar for_gen.jsr
    1.61 ± 0.08 times faster than python for_gen.py

(Almost) no use of global vars:

Summary
  ~/Workspace/c/jstar/build/Release/bin/jstar  for.jsr ran
    1.02 ± 0.02 times faster than jstar for.jsr
    3.06 ± 0.07 times faster than python for.py

All benchmarks above are microbenchmarks. Here we also present a benchmark on a more realistic project: https://github.com/bamless/pulsar (arguments to pulsar are shortened with ... as there are lots of them)

Summary
  ~/Workspace/c/jstar/build/Release/bin/jstar ./pulsar.jsr pulsar/*jsr ... 2>/dev/null ran
    1.16 ± 0.03 times faster than jstar ./pulsar.jsr pulsar/*jsr ... 2>/dev/null

Implementation overview

The inline caching implementation presented in this PR differs slightly from classical inline cache implementations found in other VMs (as far as I can tell).
In certain aspects, it resembles the name resolution performed by the HotSpot JVM when resolving a field or class name for the first time. Unlike Java, which can check the type of a value statically at compile time, our implementation is guarded by checks on a key (typically the last type, i.e. ObjClass, the opcode has seen) to prevent erroneously resolving a name from another class when the type of a variable changes for the same opcode.

Implementation in detail

One key aspect in which this implementation differs from classical approaches is that caches are not actually stored inline in the bytecode. Instead, a new array of Symbols has been added to the runtime representation of compiled code. These symbols function as a proxy to a constant String in the base case, for example when first trying to resolve a name. In this case, everything works as before, but with an extra level of indirection:

  • We get the symbol
  • We get the string representing the name from the constant pool using the symbol
  • We resolve the name following the semantics of J* on fields/methods/global values.

Then, an extra and crucial step is added:

  • Register the resolved value, along with its type (field, method, bound method, global) and class inside the symbol.

The next time that opcode is executed, we first check the symbol for a cached value, and if the types match we directly return the resolved value without doing a full name resolution.
To give a better idea, this is the new code.h:

typedef struct Symbol {
    enum { SYMBOL_METHOD, SYMBOL_BOUND_METHOD, SYMBOL_FIELD, SYMBOL_GLOBAL } type;
    uint16_t constant;
    Obj* key;
    union {
        Value method;
        int offset;
    } as;
} Symbol;

typedef struct Code {
    // ... same as before
    int symbolCapacity, symbolCount;
    Symbol* symbols;
} Code;
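To illustrate how such symbols might be appended when the compiler emits a name-resolving opcode (this part isn't shown in the PR; the `Symbol` here is simplified without the value union, and `addSymbol` with its doubling growth strategy is an assumption):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct Obj Obj; // opaque; stands in for the real J* object header

// Simplified Symbol: just the cache key and the constant-pool index of the name.
typedef struct Symbol {
    enum { SYMBOL_METHOD, SYMBOL_BOUND_METHOD, SYMBOL_FIELD, SYMBOL_GLOBAL } type;
    uint16_t constant;
    Obj* key;
} Symbol;

typedef struct Code {
    int symbolCapacity, symbolCount;
    Symbol* symbols;
} Code;

// Append a fresh, unresolved symbol referring to constant-pool entry `constant`
// and return its index; the opcode stores this index to find its cache slot.
int addSymbol(Code* c, uint16_t constant) {
    if (c->symbolCount + 1 > c->symbolCapacity) {
        c->symbolCapacity = c->symbolCapacity ? c->symbolCapacity * 2 : 8;
        c->symbols = realloc(c->symbols, c->symbolCapacity * sizeof(Symbol));
    }
    c->symbols[c->symbolCount] =
        (Symbol){.type = SYMBOL_FIELD, .constant = constant, .key = NULL};
    return c->symbolCount++;
}
```

A NULL `key` marks the symbol as unresolved, so the first execution of the opcode always falls through to full name resolution.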

This is the new OP_GET_FIELD implementation:

    TARGET(OP_GET_FIELD): {
        Symbol* sym = GET_SYMBOL();
        ObjString* name = AS_STRING(fn->code.consts.arr[sym->constant]);
        if(!getValueField(vm, name, sym)) {
            UNWIND_STACK();
        }
        DISPATCH();
    }

Note the extra indirection to get the field's name from the symbol, and the fact that the symbol is forwarded to getValueField.

bool getValueField(JStarVM* vm, ObjString* name, Symbol* sym) {
    Value val = peek(vm);
    if(IS_OBJ(val)) {
        switch(AS_OBJ(val)->type) {
        case OBJ_INST: {
            ObjInstance* inst = AS_INSTANCE(val);
            ObjClass* cls = inst->base.cls;

            Value field;
            if(getCached(vm, (Obj*)cls, (Obj*)inst, sym, &field)) {
                pop(vm);
                push(vm, field);
                return true;
            }

            // ...full name resolution is performed here

getCached performs the magic: if it finds that the key of the cache satisfies the preconditions, it returns the value directly:

static bool getCachedProperty(JStarVM* vm, Obj* key, Obj* val, const Symbol* sym, Value* out) {
    if(!isSymbolCached(key, sym)) return false;
    switch(sym->type) {
    case SYMBOL_METHOD:
        *out = sym->as.method;
        return true;
    case SYMBOL_BOUND_METHOD:
        *out = OBJ_VAL(newBoundMethod(vm, OBJ_VAL(val), AS_OBJ(sym->as.method)));
        return true;
    case SYMBOL_FIELD:
        return getFieldOffset((ObjInstance*)val, sym->as.offset, out);
    case SYMBOL_GLOBAL:
        return getGlobalOffset((ObjModule*)key, sym->as.offset, out);
    }
    return false;
}
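The guard itself isn't shown in the excerpt. A plausible sketch of `isSymbolCached`, under the assumption that it simply compares the cached key against the class/module currently being resolved against:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Obj { int type; } Obj; // simplified object header

typedef struct Symbol {
    Obj* key; // last ObjClass/ObjModule this opcode resolved against
} Symbol;

// Plausible guard (the real body is not shown in the PR excerpt): a hit
// requires the symbol to be populated and the key to be unchanged.
static bool isSymbolCached(const Obj* key, const Symbol* sym) {
    return sym->key != NULL && sym->key == key;
}
```

This is the check that makes the cache "optimistic": a class change at the same opcode fails the guard and falls back to full resolution.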

On cache hits, this gives us a massive boost in performance, coming from not having to do one (and possibly multiple) hashtable lookups.

Implications of the new implementation

One pretty big change coming with this PR is that ObjInstance and ObjModule now have a different struct layout. These two objects now store their values in a plain array, indexed by a HashTable<String, int> mapping names to offsets in this array. This means that on a cache miss we pay the cost of an extra indirection to look up a field (we first look up the offset using the name, and then we index into the array with it). This could impact performance when we have lots of cache misses. In practice, though, it doesn't seem to cost us much, and the performance gains on cache hits vastly justify a couple of extra memory reads in the worst case.
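To illustrate the two lookup paths this layout creates (hypothetical, simplified types; a linear scan stands in for the real hashtable):

```c
#include <string.h>

#define MAX_FIELDS 8

// Stand-in for the HashTable<String, int> name -> offset index
// (a linear scan, purely for illustration).
typedef struct { const char* names[MAX_FIELDS]; int count; } NameTable;

typedef struct {
    const NameTable* fieldIndex; // per-class name -> offset mapping
    int values[MAX_FIELDS];      // flat field storage, indexed by offset
} Instance;

// Cache-miss path: two steps, name -> offset, then offset -> value.
static int resolveOffset(const NameTable* t, const char* name) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0) return i;
    return -1;
}

// Cache-hit path: the offset stored in the symbol indexes the array directly,
// skipping the name lookup entirely.
static int getByOffset(const Instance* inst, int offset) {
    return inst->values[offset];
}
```

On a miss we pay for both steps; on a hit only `getByOffset` runs, which is the extra-indirection trade-off described above.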

Also, the object layout change is not a problem for binary compatibility of the library: J* has never exposed internal types to embedding applications, relying instead fully on the stack-based protocol for embedding. This means that everything should behave exactly as before.

The only thing that will be done is extending the J* embedding interface with new functions that mirror jsrGetField, jsrInvoke, etc... and take a symbol as input, in order to give the embedder the option to cache lookups from the extension side.

Progress

  • Add symbols to runtime repr of compiled code
  • Change compiler to emit symbols
  • Change de/serializers to write and read symbols
  • Cache method lookups
  • Cache field lookups
  • Cache bound method lookups
  • Cache global variable lookups
  • Replace the name-lookup hashtables with a specialized int version. Better yet, rewrite the hashtable implementation to be generic using macros.
  • Add new API methods to permit caching from the embedding side
  • Review and cleanup of implementation
  • Testing

Further work

Once this is fully implemented, it would probably be worthwhile to try implementing quickening. The two techniques play well together and would probably result in a further speed-up.
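For a rough idea of what quickening would mean here (the opcode names and rewrite point are purely hypothetical, not part of this PR): after the first successful resolution, the generic opcode in the bytecode stream is overwritten in place with a specialized variant that assumes a populated cache slot:

```c
#include <stdint.h>

// Hypothetical opcodes: a generic field get and its quickened variant.
enum { OP_GET_FIELD = 1, OP_GET_FIELD_CACHED = 2 };

// Rewrite the instruction in place once a cache slot has been populated,
// so later executions dispatch straight to the fast path.
static void quicken(uint8_t* ip) {
    if (*ip == OP_GET_FIELD) *ip = OP_GET_FIELD_CACHED;
}
```

This pairs naturally with the symbol caches: the quickened opcode can assume the guard-check fast path and only deoptimize back to the generic opcode on a key mismatch.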

@bamless bamless marked this pull request as ready for review December 24, 2024 13:34
@bamless bamless merged commit 2668c3d into master Dec 24, 2024
3 checks passed
@bamless bamless deleted the inline-caching branch December 24, 2024 13:41