Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generics experimentation (squashed) #4

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

arnaud-lb
Copy link
Owner

@arnaud-lb arnaud-lb commented Jan 5, 2024

This is nikic#3, rebased from 7.x to master.

Can be built with --disable-all --enable-opcache --disable-opcache-jit.

I didn't review the rebase yet, and may have introduced errors, but Zend/tests pass.

TODO:

  • Support instantiation of generic classes with unbound types (e.g. new Foo<A,B>())
  • Minimal Opcache support (no optimization regressions)
  • Check bounds of type arguments before instantiation
  • Finish implementation of instanceof_function for generic classes
  • Fix property type checks
  • Declaration-site variance
  • Disallow use of type parameters in static context
  • Handling of cache_slot in zend_check_type_slow and type checking optimizations
  • Support for referencing generic parameters in function bodies (e.g. new T())
  • Support for self, static, parent
  • Internal API improvements
  • ext/ support
  • Type inference experiments
  • get_class() ?
  • Reflection
  • Dynamic instantiation
  • Rationalize internal APIs
  • JIT support
  • Generic functions (maybe separately)
  • Generic arrays (separately)
  • TODO comments

zend_object now has a zend_class_reference member instead of a zend_class_entry one. Existing code accessing the ce member was updated to use OBJ_CE() instead, for now.

The ZEND_NEW OP1 is now a PNR (zend_packed_name_reference) literal, to encode generic type arguments in addition to the class name.

Instantiating generic classes requires a zend_class_reference, which are allocated on the fly and cached in an hashtable (EG(generic_class_entries)) to reduce overhead and to simplify memory management. The lookup key is stored in PNRs during compilation (but will need to be computed at runtime for dynamic instantiations).

The overhead on ZEND_NEW is light, especially after the first instantiation, and negligible for non-generic instantiations. Here are the possible pathes, from the happiest to the less happy ones:

  • Use the class reference cached in the runtime cache
  • Use the class reference cached in CE_CACHE
  • Lookup in either EG(generic_class_entries) or EG(class_entries) depending on the presence of type args.
  • If no class is found in generic_class_entries we need to lookup the base class and allocate a zend_class_reference.

Type argument checks are done when allocating a zend_class_reference.

OP1+1 remains a zend_string representation of the lower case class name. We still need this in both non-generic and generic instantiations: In the former case the PNR doesn't have a key because it's just a packed zend_string, and in the latter case we need both the generic key (e.g. foo<bar>) and the base key (e.g. foo) for the unhappy path.

PNR literals have a new zval type (IS_PNR), similarly to IS_ALIAS_PTR, allowing op array consumers (e.g. Opcache) to be aware of them. Currently they are always allocated during compilation, on the arena, which simplifies memory management (but still requires some special cases in Opcache). In order to support dynamic instantiations, and to remove Opcache special cases, we need to eventually refcount or intern PNR literals.

Micro benchmark shows a small (1-2%) regression in this branch when using a simple generic class compared to the base branch when using a non-generic fully typed class. The overhead can be fully attributed to type checking (zend_check_type), and can be reversed with some type checking optimizations (which also benefits non-generic code).

On complexity:

  • Code iterating over types or manipulating types is inherently rendered a bit more complex due to this feature
  • If we update internal APIs such that they take zend_class_references consistently, generics should be transparent to code using objects, at least for the higher level use-cases (this experimentation does not attempt to improve APIs yet)

@arnaud-lb arnaud-lb force-pushed the generics-2-rebased-squashed branch from 1e0c564 to 4ff8ada Compare January 22, 2024 16:16
@arnaud-lb
Copy link
Owner Author

arnaud-lb commented Feb 15, 2024

Updated the branch with partial support for declaration-site variance, bounds, and type inference. Opcache is probably broken again.

The branch can be built without opcache. When testing, the result of inference can be observed with var_dump(), e.g.:

class C<T> { public function __construct(T $t) {} }
var_dump(new C(1)); // object(C<int>)

My goal with type inference was to find blockers and to measure the impact on performance. I didn't find blockers, but performance will be an issue and a considerable amount of work is still necessary to make it acceptable.

Here are some details on semantics and implementation:

As a reminder, type inference allows to omit type arguments in new expressions and function calls (when we introduce generic functions), so that we can write new C(1) instead of new C<int>(1), or array_map($f,$a) instead of array_map<int|string,Foo,Bar>($f,$a).

The basic principle is that we can determine types by matching the type of constructor parameters with constructor arguments.

The semantics can be described as follows:

  • If a constructor parameter's type is a type variable (e.g. T), the type variable is instantiated to the type of the argument.
  • If it's a generic type (e.g. C<T>), and the argument has the same base class or is a subclass, we match type arguments pairs
  • If it's a union, we match every member against the argument.

This is applied to all constructor parameters. Whenever a type variable is matched multiple times, its type is the union of all its instances.

Maybe this is better described as pseudo code:

// M: map of inferred types
// P: a parameter type
// A: an argument type
infer(M, P, A):
    if P is a type variable:
        M(P) = union(M(P), A)
    else if P is a generic class:
        if class(P) == class(A):
            for i in 0..|type_params(P)|:
                infer(type_params(P,i), type_args(A,i))
        else if parent(A):
            infer(P, parent(A))
    else if P is a union:
        for each elem in P:
            infer(elem, A)

M={}
for i in 0..|constructor_params(C)|:
    infer(M, constructor_params(C,i), typeof(constructor_args(i)))

Examples:

class A<T> { function __construct(T $t) {} }
new A(1); // A<int>

class B<T> { function __construct(A<T> $a) {} }
new B(new A(1)); // B<int>

class C<T> { function __construct(T $a, T $b) {} }
new C(new A(1), new A('string'); // C<int|string>

class D<T> { function __construct(T|null $a) {} }
new D($v);

The last case would result in D<typeof($v)>, but it would be useful to adjust the semantics so that infer(T|null, U) = {T: U\null}.

The end result is a zend_class_reference that can be cached in the runtime cache.

Now the non-trivial part is to find the type of the constructor arguments. Using their runtime type would be too specific. To illustrate why, consider the following snippet:

class C<T> {
    public function __construct(T $arg) {}
}
class A {}
class B extends A {}
function test(A $a): C<A> {
    return new C($a);
}

The variable $a can have any sub type of A in this function, so using its runtime type could result in types more specific that A. Given that T is invariant, the type of new C($a) would not be compatible with the return type of the function. For example, in test(new B), new C($a) would be a C<B> which is incompatible with C<A>.

That's why we need to infer types statically, ideally at compile time.

We do so with a forward data flow analysis reusing the same infrastructure as zend_inference.c. Unlike zend_inference.c however, the analysis doesn't have to be conservative, and can be designed to be unsound if it improves usability. For example we ignore references. We could also consider that arithmetic operations never promote to float, but currently these operation are always inferred to int|float unless an operand overloads it.

Since we don't have access to all function signatures and classes at compile time, we have to represent some types in an abstract way, with symbols. For example the type of f(1) is represented as fcall<f,int> (the arguments here are useful in case the function is generic).

Symbolic types can be resolved at runtime when needed after functions and classes have been loaded, at a fraction of what is would cost to run the entire analysis at runtime. This operation can be cached (in runtime cache) for the duration of the request.

These can represent an infinity of types and prevent the analysis from reaching a fixpoint. E.g. the variable $a in the following code will successively have the types fcall<f,int>, fcall<f,fcall<f,int>|int>, and so on:

$a = 1;
while ($i--) { $a = f($a); }

We handle these by using a variation of the algorithm decribed in: Weiss G, Schonberg E. Typefinding recursive structures: A data-flow analysis in the presence of infinite type sets. New York University. Courant Institute of Mathematical Sciences. Computer Science Department; 1986.

This can be summarized as follows: Operands that depend on the operation that uses them are considered recursive, and are represented by a placeholder. We use SCCs to find such operands.

After a fixpoint is reached we can resolve the placeholders. Based on the assumption that operations are indempotent (type wise), we can eliminate recursivity: fcall<f,$v1> with $v1=fcall<f,$v1>|int is equivalent to fcall<f,fcall<int>|int>. This is not entirely true, for example f<T>(T):T[] is not indempotent, but such cases are undecidable anyway, and are not useful in practice.

Now for some issues:

Symbolic types are contagious: If an operation has a symbolic operand, its type must also be represented as a symbolic type. For example, the operation f() + 1 is represented as binop<ADD,fcall<f>,int>. This can lead to large types.

I anticipated that symbolic type expressions could often be folded, but it's not the case for two reasons: First, operations can be overloaded so we can not assume basic properties about them. And second, functions may be generic so their return type depend on arguments and we have to represent them. We only need to resolve them when the function happens to be actually generic, though, so most fcall<..> are short-circuiting during resolution.

An other issue is that merging union types when resolving symbols is O(mn), or worse when intersections are involved.

There are potential mitigations to these, such as inline caches, or generalizing the link/inheritance cache to types, or eagerly loading functions/class, but these will take time to implement.

Additionally, type checking is also affected: Type variables imply that some assumptions we make in zend_type_check are not valid anymore (e.g. a type list can effectively contain simple types, and an intersection may contain unions or other intersections). Also, while checking a union type against a value is O(n), checking it against an other union type (when checking Foo<A|B> against a value of type Foo<...>) is O(mn). Again, maybe this can be mitigated with caches.

In theory more types should result in fewer type checks at runtime, which would mitigate that, but we are not there yet (one reason is we can not rely on external functions/classes at compile time).

So overall performance may be an issue, and there is a lot of work remaining.

This branch implements what is described above, but not the potential performance mitigations, and is in a PoC state. Many things will not work. Generic class properties are broken (type checking is not fully implemented). Memory management needs to be reworked (everything is allocated on the arena). Opcache is not supported.

@dtdesign
Copy link

class C<T> { function __construct(T $a, T $b) {} }
new C(new A(1), new A('string'); // C<int|string>

@arnaud-lb Can you please check this particular example again? From my understanding, inference should only work if there is no ambiguity and by assigning new A(1) to $a the type T should be narrowed to be int. Assigning new A('string') to $b should not be possible, unless the type union is explicitly defined:

class C<T> { function __construct(T $a, T $b) {} }
new C<int|string>(new A(1), new A('string'); // C<int|string>

@arnaud-lb
Copy link
Owner Author

We can experiment with different behaviors before we decide which one is the most useful and practicable. Could you elaborate on why this would be a better behavior?

TypeScript and PHPStan behave like in the original example, and from experience this is useful. From a performance perspective, this may not be the best choice however.

@dtdesign
Copy link

TypeScript […] behave like in the original example, and from experience this is useful.

TypeScript is a bit different, because in that example the type is inferred to string. If you however pass in primitives, it will rejected it entirely.

https://www.typescriptlang.org/play?#code/MYGwhgzhAECCA8AVAfNA3tYB7AdhALgE4Cuw+WhAFGAFzSICU6AvtMwFCiQwDCSqGbHiKlyVWvQA00AEZ1GLNu3Y4ApgHdoPSms2xKARgbTdcSgCIChAJY4A5uYZMA9M-rRrMWwDNVhVQAm0JDQ8Fa2dsjKptoG0pZEEY7QrtAAooSEFNKIHl44vv5BIfC2+MhAA

From my point of view, static typing is a way of avoiding errors by being explicit about the types. Inferring T is fine when all uses of T are of the same type, but once you mix types (that possibly cannot be replaced with a superset, like string) there is a significant chance that this was unintentional.

In your example it is clear that the different types are intended, but variables often originate from somewhere else so it may not always be obvious. Erring on asking the user to be explicit when the type cannot be inferred without doubt may make more sense in my opinion.

arnaud-lb pushed a commit that referenced this pull request Oct 14, 2024
even without sanitizers, it is reproducible but with the following

```
<?php
$g = gmp_init(256);
var_dump(gmp_pow($g, PHP_INT_MAX));
```

we get this

```
AddressSanitizer:DEADLYSIGNAL
=================================================================
==286922==ERROR: AddressSanitizer: FPE on unknown address 0x03e8000460ca (pc 0x7faf6c69de5c bp 0x400000000000004 sp 0x7ffe9843c740 T0)
    #0 0x7faf6c69de5c in __pthread_kill_implementation nptl/pthread_kill.c:44
    #1 0x7faf6c649c81 in __GI_raise ../sysdeps/posix/raise.c:26
    #2 0x7faf6db9386c in __gmp_exception (/lib/x86_64-linux-gnu/libgmp.so.10+0xd86c) (BuildId: 1af68a49fe041a5bb48a2915c3d47541f713bb38)
    #3 0x7faf6db938d3 in __gmp_overflow_in_mpz (/lib/x86_64-linux-gnu/libgmp.so.10+0xd8d3) (BuildId: 1af68a49fe041a5bb48a2915c3d47541f713bb38)
    #4 0x7faf6dbac95c in __gmpz_realloc (/lib/x86_64-linux-gnu/libgmp.so.10+0x2695c) (BuildId: 1af68a49fe041a5bb48a2915c3d47541f713bb38)
    #5 0x7faf6dba9038 in __gmpz_n_pow_ui (/lib/x86_64-linux-gnu/libgmp.so.10+0x23038) (BuildId: 1af68a49fe041a5bb48a2915c3d47541f713bb38)
    #6 0x5565ae1ccd9f in zif_gmp_pow /home/dcarlier/Contribs/php-src/ext/gmp/gmp.c:1286
    #7 0x5565aee96ea9 in ZEND_DO_ICALL_SPEC_RETVAL_USED_HANDLER /home/dcarlier/Contribs/php-src/Zend/zend_vm_execute.h:1312
    #8 0x5565af144320 in execute_ex /home/dcarlier/Contribs/php-src/Zend/zend_vm_execute.h:56075
    #9 0x5565af160f07 in zend_execute /home/dcarlier/Contribs/php-src/Zend/zend_vm_execute.h:60439
    #10 0x5565aed6fafe in zend_execute_scripts /home/dcarlier/Contribs/php-src/Zend/zend.c:1842
    php#11 0x5565aeae70a8 in php_execute_script /home/dcarlier/Contribs/php-src/main/main.c:2578
    php#12 0x5565af532f4e in do_cli /home/dcarlier/Contribs/php-src/sapi/cli/php_cli.c:964
    php#13 0x5565af535877 in main /home/dcarlier/Contribs/php-src/sapi/cli/php_cli.c:1334
    php#14 0x7faf6c633d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    php#15 0x7faf6c633e24 in __libc_start_main_impl ../csu/libc-start.c:360
    php#16 0x5565adc04040 in _start (/home/dcarlier/Contribs/php-src/sapi/cli/php+0x2604040) (BuildId: 949049955bdf8b7197390b1978a1dfc3ef6fdf38)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: FPE nptl/pthread_kill.c:44 in __pthread_kill_implementation
==286922==ABORTING
```
arnaud-lb pushed a commit that referenced this pull request Oct 29, 2024
even without sanitizers, it is reproducible but with the following

```
<?php
$g = gmp_init(256);
var_dump(gmp_pow($g, PHP_INT_MAX));
```

we get this

```
AddressSanitizer:DEADLYSIGNAL
=================================================================
==286922==ERROR: AddressSanitizer: FPE on unknown address 0x03e8000460ca (pc 0x7faf6c69de5c bp 0x400000000000004 sp 0x7ffe9843c740 T0)
    #0 0x7faf6c69de5c in __pthread_kill_implementation nptl/pthread_kill.c:44
    #1 0x7faf6c649c81 in __GI_raise ../sysdeps/posix/raise.c:26
    #2 0x7faf6db9386c in __gmp_exception (/lib/x86_64-linux-gnu/libgmp.so.10+0xd86c) (BuildId: 1af68a49fe041a5bb48a2915c3d47541f713bb38)
    #3 0x7faf6db938d3 in __gmp_overflow_in_mpz (/lib/x86_64-linux-gnu/libgmp.so.10+0xd8d3) (BuildId: 1af68a49fe041a5bb48a2915c3d47541f713bb38)
    #4 0x7faf6dbac95c in __gmpz_realloc (/lib/x86_64-linux-gnu/libgmp.so.10+0x2695c) (BuildId: 1af68a49fe041a5bb48a2915c3d47541f713bb38)
    #5 0x7faf6dba9038 in __gmpz_n_pow_ui (/lib/x86_64-linux-gnu/libgmp.so.10+0x23038) (BuildId: 1af68a49fe041a5bb48a2915c3d47541f713bb38)
    #6 0x5565ae1ccd9f in zif_gmp_pow /home/dcarlier/Contribs/php-src/ext/gmp/gmp.c:1286
    #7 0x5565aee96ea9 in ZEND_DO_ICALL_SPEC_RETVAL_USED_HANDLER /home/dcarlier/Contribs/php-src/Zend/zend_vm_execute.h:1312
    #8 0x5565af144320 in execute_ex /home/dcarlier/Contribs/php-src/Zend/zend_vm_execute.h:56075
    #9 0x5565af160f07 in zend_execute /home/dcarlier/Contribs/php-src/Zend/zend_vm_execute.h:60439
    #10 0x5565aed6fafe in zend_execute_scripts /home/dcarlier/Contribs/php-src/Zend/zend.c:1842
    php#11 0x5565aeae70a8 in php_execute_script /home/dcarlier/Contribs/php-src/main/main.c:2578
    php#12 0x5565af532f4e in do_cli /home/dcarlier/Contribs/php-src/sapi/cli/php_cli.c:964
    php#13 0x5565af535877 in main /home/dcarlier/Contribs/php-src/sapi/cli/php_cli.c:1334
    php#14 0x7faf6c633d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    php#15 0x7faf6c633e24 in __libc_start_main_impl ../csu/libc-start.c:360
    php#16 0x5565adc04040 in _start (/home/dcarlier/Contribs/php-src/sapi/cli/php+0x2604040) (BuildId: 949049955bdf8b7197390b1978a1dfc3ef6fdf38)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: FPE nptl/pthread_kill.c:44 in __pthread_kill_implementation
==286922==ABORTING
```

close phpGH-16384
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment