Supplementary materials to the article presented at the 32nd Workshop on Languages and Compilers for Parallel Computing (LCPC 2019):
https://dx.doi.org/10.1007/978-3-030-72789-5_6
Contact: kiepas.patryk at gmail.com
Using the model presented in the paper, we make several predictions about the execution of MATLAB R2018b codes. The prediction is automatic and uses our Java 1.8 implementation of the model, located in the execution_model directory. You can build the project yourself (using Maven) or use the prebuilt binary execution_model-1.0.jar. Then, run the model in one of two ways:
- ./predict_expr_cli.sh "sqrt(A1(1:N)) + abs(A2(1:N).*A3)" — predicts the execution of a given MATLAB expression
- ./predict_examples.sh — runs predictions for the examples from the Example predictions section below
The model works in four steps:
- Represent the MATLAB expression as an abstract syntax tree (AST)
- Create an initial instruction tree (blue circle — instruction block, red square — array copy, white square — array reference)
- Mark instruction blocks for merging
- Merge the blocks! The resulting tree is our execution prediction.
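Steps 3 and 4 can be sketched with a toy merging rule (a simplification for illustration only; the actual implementation lives in the execution_model project and also accounts for array references and explicit copies): adjacent combinable instructions merge into one block, while a non-combinable instruction keeps a whole block to itself.

```python
# Minimal sketch of steps 3-4 (marking and merging). Assumed simplification:
# adjacent blocks merge while every instruction in them is combinable;
# a non-combinable instruction keeps a whole block for itself. The real
# model (see execution_model) also accounts for array references/copies.

COMBINABLE = {'round', 'plus', 'times', 'fix', 'sin', 'floor'}
# e.g. 'sum', 'sqrt', 'power' behave as non-combinable in the predictions

def merge_blocks(instrs):
    """instrs: instruction names in evaluation order.
    Returns the predicted list of instruction blocks."""
    blocks = []
    for ins in instrs:
        if (ins in COMBINABLE and blocks
                and all(i in COMBINABLE for i in blocks[-1])):
            blocks[-1].append(ins)    # merge into the open combinable block
        else:
            blocks.append([ins])      # start a fresh block
    return blocks

# mit_3: floor(A) + sin(fix(B .* C)) -- everything combinable, one block:
print(merge_blocks(['floor', 'fix', 'times', 'sin', 'plus']))
# mit_1: sum(round(A)) -- sum is non-combinable and gets its own block:
print(merge_blocks(['round', 'sum']))
```

This toy rule reproduces the one-block prediction for mit_3 and the two-block prediction for mit_1 from the table below, but not the cases where array copies split blocks.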
The model predicts execution for codes that use the following arithmetic operations and built-in functions:
abs, acos, acosh, asin, asinh, atan, atan2, atanh, ceil, colon, cos, cosh, ctranspose, cumprod, cumsum, det, diff, eig, exp, expm1, fft, fix, fliplr, floor, gamma, ifft, imag, ldivide, log, log10, log1p, log2, max, mean, min, minus, mod, mtimes, nextpow2, norm, ones, plus, pow2, power, prod, rand, randn, rdivide, real, rem, round, sign, sin, sinh, sqrt, sum, tan, tanh, times, transpose, uminus, uplus, zeros
The table below presents the predicted execution of several MATLAB codes. First, look at the legend for the execution prediction column in the table:
- [] — an instruction block (has at least one instruction)
- []{A, B, C} — references to arrays used in the instruction block
- !A, !B, !C — explicit copies of arrays (array slicing)
- ✔ — combinable instruction (which can coexist with other instructions in the same block)
- ✘ — non-combinable/single instruction (such an instruction requires the whole block for itself)
id | MATLAB code | Execution prediction |
---|---|---|
mit_1 | sum(round(A)) | [round ✔]{A} ⟹ [sum ✘] |
mit_2 | floor(A) + sqrt(fix(B .* C)) | [floor ✔]{A} ⟹ [fix ✔, times ✔]{B, C} ⟹ [sqrt ✘] ⟹ [plus ✔] |
mit_2_reversed | sqrt(fix(B .* C)) + floor(A) | [fix ✔, times ✔]{B, C} ⟹ [sqrt ✘] ⟹ [plus ✔, floor ✔]{A} |
mit_3 | floor(A) + sin(fix(B .* C)) | [plus ✔, sin ✔, fix ✔, times ✔, floor ✔]{A, B, C} |
mit_3_reversed | sin(fix(B .* C)) + floor(A) | [plus ✔, floor ✔, sin ✔, fix ✔, times ✔]{A, B, C} |
mit_4 | exp((A.^D + B.^E) ./ (C.^F)) | [power ✘]{A, D} ⟹ [power ✘]{B, E} ⟹ [plus ✔] ⟹ [power ✘]{C, F} ⟹ [exp ✔, rdivide ✔] |
mit_5 | A(1:LEN_1D) .* atan2(B(1:LEN_1D), C) | !A ⟹ !B ⟹ [times ✔, atan2 ✔]{C} |
mit_5_reversed | atan2(B(1:LEN_1D), C) .* A(1:LEN_1D) | !B ⟹ [atan2 ✔]{C} ⟹ !A ⟹ [times ✔] |
mit_6 | log(A) + B + C(1:LEN_1D) | [log ✘]{A} ⟹ [plus ✔]{B} ⟹ !C ⟹ [plus ✔] |
mit_6_reversed | log(A) + C(1:LEN_1D) + B | [log ✘]{A} ⟹ !C ⟹ [plus ✔ x 2]{B} |
mit_7 | fix(A(1:LEN_1D)) + (B(1:LEN_1D) .* C(1:LEN_1D)) | !A ⟹ [fix ✔] ⟹ !B ⟹ !C ⟹ [plus ✔, times ✔] |
mit_7_reversed | (B(1:LEN_1D) .* C(1:LEN_1D)) + fix(A(1:LEN_1D)) | !B ⟹ !C ⟹ [times ✔] ⟹ !A ⟹ [plus ✔, fix ✔] |
mit_8 | A(1:LEN_1D) + (B(1:LEN_1D) + C(1:LEN_1D)) | !A ⟹ !B ⟹ !C ⟹ [plus ✔ x 2] |
mit_8_reversed | (B(1:LEN_1D) + C(1:LEN_1D)) + A(1:LEN_1D) | !B ⟹ !C ⟹ [plus ✔] ⟹ !A ⟹ [plus ✔] |
By reversing some of the above expressions, we affect how they are compiled by the Just-In-Time (JIT) compiler, thus changing their execution. The model allows us to make a conscious decision about which part of an expression to transform, and how, in order to increase code performance.
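The effect of reversal can be sketched with a toy left-to-right scheduler (a hypothetical simplification, not the paper's actual algorithm): an explicit slice copy flushes any open instruction block, so placing all copies before the arithmetic lets the instructions merge into a single block.

```python
# Toy left-to-right block builder: a slice copy closes the currently open
# instruction block; consecutive combinable instructions merge into one
# block. A simplified illustration, not the paper's actual algorithm.

def schedule(ops):
    """ops: sequence of ('copy', name) or ('instr', name) events in
    left-to-right evaluation order. Returns the predicted step list."""
    steps, block = [], []
    for kind, name in ops:
        if kind == 'copy':
            if block:                 # a copy interrupts the open block
                steps.append(block)
                block = []
            steps.append('!' + name)
        else:                         # combinable instruction: keep merging
            block.append(name)
    if block:
        steps.append(block)
    return steps

# mit_8: A(1:LEN_1D) + (B(1:LEN_1D) + C(1:LEN_1D)) -- all copies happen
# first, so both plus instructions merge into a single block.
mit_8 = [('copy', 'A'), ('copy', 'B'), ('copy', 'C'),
         ('instr', 'plus'), ('instr', 'plus')]

# mit_8_reversed: (B(1:LEN_1D) + C(1:LEN_1D)) + A(1:LEN_1D) -- the copy of
# A arrives after the first plus and splits the work into two blocks.
mit_8_rev = [('copy', 'B'), ('copy', 'C'), ('instr', 'plus'),
             ('copy', 'A'), ('instr', 'plus')]

print(schedule(mit_8))      # ['!A', '!B', '!C', ['plus', 'plus']]
print(schedule(mit_8_rev))  # ['!B', '!C', ['plus'], '!A', ['plus']]
```

Both outputs match the mit_8 and mit_8_reversed rows of the table above: the reversed variant needs two instruction blocks instead of one.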
The Just-In-Time (JIT) compiler in MATLAB (and in other environments too) generates machine code for an input code in chunks. We call these chunks instruction blocks, as they contain instructions ready to be translated from high-level MATLAB code into low-level machine code. The remaining question is: when is an input code divided into chunks (instruction blocks)?
To answer this question, we have built every possible performance event profile (PEP) of the expression sqrt(A(1:LEN_1D)) + B .* C(1:LEN_1D). Values in the profiles are sampled every 100000 retired instructions (INST_RETIRED:ANY_P).
The above plot shows three performance event profiles for this expression. One of them, L2_RQSTS:ALL_CODE_RD, is a good candidate for detection because it marks only the transition points between two instruction blocks and therefore allows for easy detection. The other two performance events (MEM_LOAD_RETIRED:L2_HIT, SW_PREFETCH_ACCESS:PREFETCHW) either show no change between blocks or show a change in the value level, which usually indicates that the blocks perform vastly different computations.
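Transition detection on such a profile can be sketched as simple spike detection (the sample values below are made up for illustration; real profiles come from the hardware counters sampled as described above):

```python
# Sketch of transition-point detection: an instruction-block boundary shows
# up as an isolated spike in the L2_RQSTS:ALL_CODE_RD profile. The sample
# values below are synthetic, for illustration only.

def find_transitions(profile, factor=5.0):
    """Return sample indices whose value exceeds `factor` times the
    median of the profile -- crude spike detection."""
    ordered = sorted(profile)
    median = ordered[len(ordered) // 2]
    return [i for i, v in enumerate(profile) if v > factor * max(median, 1)]

# Synthetic profile: low, steady code-read traffic inside each block,
# with a spike where execution crosses into the next instruction block.
profile = [3, 4, 2, 3, 95, 3, 2, 4, 3, 110, 4, 3]
print(find_transitions(profile))  # -> [4, 9]
```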
Plot performance_event_profiles.pdf presents the results for all performance events; some of them indicate a clear division between two instruction blocks, visible as spikes aligned with the red dashed line. To re-generate the plot, execute the plot_result.R script.
Read more about the results here.