Skip to content

List append benchmark #857

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
certik opened this issue Jul 31, 2022 · 15 comments
Open

List append benchmark #857

certik opened this issue Jul 31, 2022 · 15 comments

Comments

@certik
Copy link
Contributor

certik commented Jul 31, 2022

Here is a simple benchmark for appending to a list in Python:

from ltypes import i32

def test_list():
    a: list[i32] = [0, 1, 2, 3, 4]
    n: i32 = 100000000
    i: i32
    for i in range(n):
        a.append(i + 5)
    print(a[n])

test_list()

and C++:

#include <vector>
#include <iostream>

int main() {
    std::vector<int32_t> a = {0, 1, 2, 3, 4};
    int32_t n = 100000000;
    for (int32_t i = 0; i < n; i++) {
        a.push_back(i + 5);
    }
    std::cout << a[n] << std::endl;
    return 0;
}

Results on Apple M1 Max (I ran each benchmark many times, took the lowest numbers):

$ clang++ -std=c++17 -O3 a.cpp
$ time ./a.out
100000000
./a.out  0.07s user 0.09s system 94% cpu 0.170 total
$ g++ -O3 -march=native  a.cpp
$ time ./a.out
100000000
./a.out  0.06s user 0.09s system 93% cpu 0.164 total
$ lpython --fast a.py
$ time ./a.out
100000000
./a.out  0.05s user 0.03s system 93% cpu 0.089 total
$ time PYTHONPATH=../src/runtime/ltypes python a.py                 
100000000
PYTHONPATH=../src/runtime/ltypes python a.py  4.21s user 0.51s system 99% cpu 4.730 total
Compiler Time Relative
LPython 0.089 1.0
g++ 0.164 1.84
Clang++ 0.170 1.91
Python 4.730 53.15

Versions:

$ clang++ --version
Apple clang version 13.0.0 (clang-1300.0.29.30)
Target: arm64-apple-darwin21.3.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
$ g++ --version              
g++ (Spack GCC) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lpython --version
LPython version: 0.3.0-350-g6d29003ea
Platform: macOS ARM
Default target: arm64-apple-darwin21.3.0
$ python --version
Python 3.10.2

Thanks @czgdp1807 for implementing lists in our LLVM backend (#835)! This is just a first implementation, but I already like the results. :)

@certik
Copy link
Contributor Author

certik commented Jul 31, 2022

I think we currently do not deallocate at the end, so one can also benchmark just the append cycle:

#include <vector>
#include <iostream>
#include <chrono>

int main() {
    std::vector<int32_t> a = {0, 1, 2, 3, 4};
    int32_t n = 100000000;
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int32_t i = 0; i < n; i++) {
        a.push_back(i + 5);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << a[n] << std::endl;
    std::cout << "Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << std::endl;
    return 0;
}

and timing:

$ clang++ -std=c++17 -Ofast a.cpp
$ time ./a.out
100000000
Time: 153
./a.out  0.07s user 0.10s system 93% cpu 0.177 total

So I think our benchmark above is probably quite solid, LPython seems faster on my computer.

@certik
Copy link
Contributor Author

certik commented Aug 1, 2022

I also tried Ubuntu 18.04 on Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz:

Compiler Time [s] Relative
LPython 0.3.0-350-g6d29003ea --fast 0.202 1.0
LPython 0.3.0-350-g6d29003ea 0.456 2.26
g++ 7.5.0 -O3 -march=native -funroll-loops 0.600 2.97
g++ 7.5.0 2.453 12.14
g++ 12.1.0 -O3 -march=native -funroll-loops 0.654 3.24
g++ 12.1.0 5.026 24.88
Clang++ 14.0.6 -O3 -march=native -funroll-loops 0.683 3.38
Python 3.10.2 7.443 36.85

@czgdp1807
Copy link
Collaborator

Let's try without fast mode? May be that will show if the speed up is due to our implementation or LLVM's optimisation algorithms.

@certik
Copy link
Contributor Author

certik commented Aug 1, 2022

I added results without --fast for most compilers. The main speedup seems to be from how we (you!) implemented lists.

@czgdp1807
Copy link
Collaborator

Great. Makes sense. If we are beating other compilers without fast mode then we did the right thing for lists.

@czgdp1807
Copy link
Collaborator

czgdp1807 commented Aug 1, 2022

I also tried Ubuntu 18.04.6 on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz:

Compiler Time [s] Relative
LPython (list05) 0.318 1.67
LPython 6d29003 0.318 1.67
clang++ 6.0.0-1ubuntu2 1.755 9.23
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 1.800 9.47
LPython (list05) --fast 0.183 0.96
LPython 6d29003 --fast 0.190 1.0
clang++ 6.0.0-1ubuntu2 -O3 -march=native -funroll-loops 0.523 2.75
g++ 7.5.0 -O3 -march=native -funroll-loops 0.513 2.7
Python 3.10.5 6.737 35.45

From Apple M1 Macbook Pro macOS Monterey 12.5

Compiler Time [s] Relative
LPython (list_ret) 0.297 2.63
LPython (list05) 0.302 2.67
LPython main 0.302 2.67
LPython (list04) 0.358 3.16
clang++ 11.1.0 arm64-apple-darwin21.6.0 3.021 26.73
LPython (list_ret) -- fast 0.106 0.94
LPython (list05) --fast 0.113 1.0
LPython main --fast 0.113 1.0
clang++ 11.1.0 arm64-apple-darwin21.6.0 -O3 -march=native -funroll-loops 0.183 1.62
Python 3.10.5 5.374 47.56
codon 0.15.5 (codon build -release -exe and time ./executable) 0.531 4.7

First 3 are normal mode compilation, second 3 are optimizations enabled and the last one is Python. So, all in all top is lpython --fast, second is lpython and then the rest.

@Shaikh-Ubaid
Copy link
Collaborator

Shaikh-Ubaid commented Aug 1, 2022

Please, could someone possibly share how we are computing the Relative value? Like Relative with respect to?

@czgdp1807
Copy link
Collaborator

Dividing with the smallest time from across all the results you have computed on your machine.

@namannimmo10
Copy link
Collaborator

Benchmark on Apple M1 Air 2020 (Monterey):

Compiler Time [s] Relative
lpython --fast 0.074 1.0
lpython 0.284 3.837
g++ 2.786 37.648
clang++ 2.764 37.351
python 3.10.4 5.230 70.67

@Shaikh-Ubaid
Copy link
Collaborator

Shaikh-Ubaid commented Aug 1, 2022

From the output of time a.out, which parameter from real, user, system do we need to consider/note?

@czgdp1807
Copy link
Collaborator

real parameter.

@Thirumalai-Shaktivel
Copy link
Collaborator

Result on Intel® Core™ i5-8250U CPU @ 1.60GHz × 8 (OS: Ubuntu 22.04 LTS)

Compiler Time [s] Relative
lpython --fast 0.184 1.0
lpython 0.356 1.934
g++ -O3 -march=native 0.574 3.119
clang++ -O3 -march=native 0.578 3.141
g++ 4.335 23.559
clang++ 2.079 11.298
python 3.10.4 11.506 62.53

@Smit-create
Copy link
Collaborator

Smit-create commented Aug 1, 2022

Machine: Mac Air M1(2020)

Compiler Time [s] Relative Version
lpython --fast 0.08 1.0 0.3.0-233-g9ae282a29-dirty
g++ 0.08 1.0 11.3.0
clang++ 0.08 1.0 12.0.5
lpython 0.26 3.25 0.3.0-233-g9ae282a29-dirty
python 7.83 97.87 3.10.2

@Shaikh-Ubaid
Copy link
Collaborator

Shaikh-Ubaid commented Aug 1, 2022

Result on AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx @1.600 GHz, Ubuntu 20.04.4 LTS

Compiler Time [s] Relative
lpython --fast 0.194 1.00
lpython 0.427 2.20
g++ -O3 -march=native 0.481 2.48
clang++ -O3 -march=native 0.485 2.50
g++ 2.0403 10.52
clang++ 2.13 10.98
python 3.9.7 8.975 46.26
LPython version: 0.3.0-350-g6d29003ea
Platform: Linux
Default target: x86_64-unknown-linux-gnu

g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


clang version 10.0.0-4ubuntu1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Python 3.9.7

@certik
Copy link
Contributor Author

certik commented Aug 2, 2022

Numba benchmark:

from timeit import default_timer as clock
from numba import njit

@njit(nogil=True, cache=True)
def test_list():
    a = [0, 1, 2, 3, 4]
    n = 100000000
    for i in range(n):
        a.append(i + 5)
    print(a[n])

test_list()

test_list()

test_list()

t1 = clock()
test_list()
t2 = clock()

print(t2-t1)

On my computer:

$ python b.py
100000000
100000000
100000000
100000000
0.2843555830186233

So it takes 0.28s.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants