Use for comprehension to speed up python code #191

Open

mgzuber

@mgzuber mgzuber commented Dec 5, 2024

Hi,

I noticed the Python implementation could use some simple improvements to speed up execution. For loops in Python are slow, but list comprehensions help a lot. On my machine, this implementation speeds things up by about 2x.
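For anyone who wants to check the speedup locally, here is a minimal timing sketch. The function names, the reduced loop sizes, and the inputs `u = 7` and `r = 3` are assumptions made for illustration and are not taken from the benchmark; the measured ratio will vary with machine and Python version.

```python
import timeit


def loop_version(u, r, outer=100, inner=1_000):
    """Nested for loops in the style of the original benchmark (sizes reduced)."""
    a = [0] * outer
    for i in range(outer):
        for j in range(inner):
            a[i] += j % u
        a[i] += r
    return a


def comprehension_version(u, r, outer=100, inner=1_000):
    """The one-line rewrite proposed in this PR (sizes reduced)."""
    return [sum([j % u for j in range(inner)]) + r for _ in range(outer)]


if __name__ == "__main__":
    u, r = 7, 3  # arbitrary illustrative inputs, u must be non-zero
    t_loop = timeit.timeit(lambda: loop_version(u, r), number=20)
    t_comp = timeit.timeit(lambda: comprehension_version(u, r), number=20)
    print(f"loops: {t_loop:.3f}s  comprehension: {t_comp:.3f}s")
```

Both functions return the same list, so only the elapsed time differs between the two runs.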

@bddicken
Owner

bddicken commented Dec 6, 2024

Thanks!

Can you (or someone) provide a deeper explanation of this change?

@mmurrian

mmurrian commented Dec 6, 2024

Maybe it would be helpful to see a literal C translation of the given Python algorithm? (...or maybe not at all)

The operational change is this: a = [sum([j % u for j in range(100_000)]) + r for _ in range(10_000)]

And breaking it down from inner-most scope outwards...

[j % u for j in range(100_000)] has this literal C equivalent:

int jmodu[100000];
for (int j = 0; j < 100000; j++) {
  jmodu[j] = j % u;
}

And sum(...) is this:

int sum = 0;
for (int j = 0; j < 100000; j++) {
  sum += jmodu[j];
}

Finally, a = [sum(...) + r for _ in range(10_000)] is this:

int a[10000];
for (int i = 0; i < 10000; i++) {
  a[i] = sum + r;
}

I will say that it appears Python recomputes sum 10,000 times, instead of only once. So, there is at least no unfair advantage hiding there.

Altogether, the literal C translation might look like this:

int a[10000] = {0};
for (int i = 0; i < 10000; i++) {
  int jmodu[100000];
  for (int j = 0; j < 100000; j++) {
    jmodu[j] = j % u;
  }
  for (int j = 0; j < 100000; j++) {
    a[i] += jmodu[j];
  }
  a[i] += r;
}
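The remark that the inner sum is recomputed on every outer iteration can be illustrated in Python. Since neither `u` nor `r` changes between iterations, the sum could in principle be computed once and reused. The sketch below only checks that both forms agree; the names `recomputed` and `hoisted` and the reduced sizes are hypothetical. It is not a suggested change, since hoisting would remove the very work the benchmark is meant to time.

```python
def recomputed(u, r, outer=50, inner=1_000):
    """Recompute the inner sum for every outer iteration, as the PR does."""
    return [sum([j % u for j in range(inner)]) + r for _ in range(outer)]


def hoisted(u, r, outer=50, inner=1_000):
    """Compute the inner sum once and reuse it (u must be non-zero)."""
    inner_sum = sum(j % u for j in range(inner))
    return [inner_sum + r] * outer


if __name__ == "__main__":
    # Both approaches build the same list.
    print(recomputed(7, 3) == hoisted(7, 3))
```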

@poudro

poudro commented Dec 7, 2024

Python optimises list comprehensions compared to for loops; here is a video that describes this: https://www.youtube.com/watch?v=U88M8YbAzQk

Using the dis (disassembler) module, we can compare the bytecode between

Your code:

    a = [0] * 10000  # Array of 10k elements initialized to 0
    for i in range(10000):  # 10k outer loop iterations
        for j in range(100000):  # 100k inner loop iterations, per outer loop iteration
            a[i] += j%u  # Simple sum
        a[i] += r  # Add a random value to each element in array

which is

        >>  136 FOR_ITER                47 (to 234)
            140 STORE_FAST               3 (i)

 10         142 LOAD_GLOBAL             11 (NULL + range)
            152 LOAD_CONST               4 (100000)
            154 CALL                     1
            162 GET_ITER
        >>  164 FOR_ITER                18 (to 204)
            168 STORE_FAST               4 (j)

 11         170 LOAD_FAST                2 (a)
            172 LOAD_FAST                3 (i)
            174 COPY                     2
            176 COPY                     2
            178 BINARY_SUBSCR
            182 LOAD_FAST                4 (j)
            184 LOAD_FAST                0 (u)
            186 BINARY_OP                6 (%)
            190 BINARY_OP               13 (+=)
            194 SWAP                     3
            196 SWAP                     2
            198 STORE_SUBSCR
            202 JUMP_BACKWARD           20 (to 164)

 10     >>  204 END_FOR

 12         206 LOAD_FAST                2 (a)
            208 LOAD_FAST                3 (i)
            210 COPY                     2
            212 COPY                     2
            214 BINARY_SUBSCR
            218 LOAD_FAST                1 (r)
            220 BINARY_OP               13 (+=)
            224 SWAP                     3
            226 SWAP                     2
            228 STORE_SUBSCR
            232 JUMP_BACKWARD           49 (to 136)

  9     >>  234 END_FOR

and the list comprehension (with a generator expression inside sum) that has the same output

a = [sum(j%u for j in range(100000)) + r for i in range(10000)]

which is

        >>  134 FOR_ITER                34 (to 206)
            138 STORE_FAST               1 (i)
            140 LOAD_GLOBAL             13 (NULL + sum)
            150 LOAD_CLOSURE             3 (u)
            152 BUILD_TUPLE              1
            154 LOAD_CONST               4 (<code object <genexpr> at 0x1021ccc60, file "languages/loops/py/code.py", line 8>)
            156 MAKE_FUNCTION            8 (closure)
            158 LOAD_GLOBAL             11 (NULL + range)
            168 LOAD_CONST               5 (100000)
            170 CALL                     1
            178 GET_ITER
            180 CALL                     0
            188 CALL                     1
            196 LOAD_FAST                0 (r)
            198 BINARY_OP                0 (+)
            202 LIST_APPEND              2
            204 JUMP_BACKWARD           36 (to 134)
        >>  206 END_FOR

with the call to inner loop

Disassembly of <code object <genexpr> at 0x1021ccc60, file "languages/loops/py/code.py", line 8>:
              0 COPY_FREE_VARS           1

  8           2 RETURN_GENERATOR
              4 POP_TOP
              6 RESUME                   0
              8 LOAD_FAST                0 (.0)
        >>   10 FOR_ITER                 9 (to 32)
             14 STORE_FAST               1 (j)
             16 LOAD_FAST                1 (j)
             18 LOAD_DEREF               2 (u)
             20 BINARY_OP                6 (%)
             24 YIELD_VALUE              1
             26 RESUME                   1
             28 POP_TOP
             30 JUMP_BACKWARD           11 (to 10)
        >>   32 END_FOR
             34 RETURN_CONST             0 (None)
        >>   36 CALL_INTRINSIC_1         3 (INTRINSIC_STOPITERATION_ERROR)
             38 RERAISE                  1

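The comprehension version has no per-iteration BINARY_SUBSCR/STORE_SUBSCR on a[i], and the accumulation happens inside the built-in sum. The extra optimisation makes the comprehension about twice as fast. Most Python practitioners know this optimisation well, and so prefer to write comprehensions whenever possible.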
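Listings like the ones above can be reproduced with the standard `dis` module. The sketch below is a reduced version under stated assumptions: the function names, the smaller loop sizes, and the `opnames` helper are hypothetical, and the exact opcodes and offsets depend on the CPython version, so the output will not match the listings byte for byte.

```python
import dis


def loop_version(u, r):
    """Nested for loops (sizes reduced for illustration)."""
    a = [0] * 10
    for i in range(10):
        for j in range(100):
            a[i] += j % u
        a[i] += r
    return a


def comprehension_version(u, r):
    """List comprehension with a generator expression passed to sum."""
    return [sum(j % u for j in range(100)) + r for i in range(10)]


def opnames(func):
    """Return the opcode names of func's top-level code object, in order."""
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    dis.dis(loop_version)
    # dis.dis also descends into nested code objects such as <genexpr>.
    dis.dis(comprehension_version)
```

Comparing `opnames(loop_version)` with `opnames(comprehension_version)` is a quick way to see which instructions run on every inner iteration in each version.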

Obviously it's still a long way away from C/Rust 😅

@mgzuber
Author

mgzuber commented Dec 7, 2024

Thank you @mmurrian and @poudro for those wonderful explanations! I'll add my bit.

A Python list comprehension of the form:

my_list = [item for item in other_list]

has the same effect as

my_list = []
for item in other_list:
    my_list.append(item)

So starting from the inner list:

[j % u for j in range(100_000)]

has the same effect as

my_list = []
for j in range(100_000):
    my_list.append(j % u)

Taking the sum of this list is therefore the same as:

val = 0
for j in range(100_000):
    val += j % u

This is the same as the inner loop in the original code, with val playing the role of a[i]. Adding r to the sum is therefore the same as a[i] += r. This constructs each element of the array, so the outer list comprehension replaces the outer loop. We don't actually need the loop variable i, so it is replaced with _ to give:

a = [sum([j % u for j in range(100_000)]) + r for _ in range(10_000)]
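The chain of equivalences above can be checked step by step. This is a small sketch with reduced sizes; the inputs `u = 7` and `r = 3` and the names `n_inner`, `n_outer`, `appended`, and `a_comp` are assumptions chosen for illustration.

```python
u, r = 7, 3                    # arbitrary illustrative inputs
n_inner, n_outer = 1_000, 20   # reduced from 100_000 and 10_000

# Step 1: the inner comprehension matches the explicit append loop.
appended = []
for j in range(n_inner):
    appended.append(j % u)
assert appended == [j % u for j in range(n_inner)]

# Step 2: summing that list matches accumulating into val directly.
val = 0
for j in range(n_inner):
    val += j % u
assert val == sum(appended)

# Step 3: every element of the final list is that sum plus r.
a_comp = [sum([j % u for j in range(n_inner)]) + r for _ in range(n_outer)]
assert a_comp == [val + r] * n_outer
```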

As explained by @poudro, most Python practitioners know that list comprehensions are faster than Python for loops, and so use them whenever possible. In this case, it leads to less code (only 1 line!) and faster performance.

@axman6

axman6 commented Dec 8, 2024

It's fun seeing how much this looks like the (un-needlessly obfuscated) Haskell implementation.

@artemisart

You can use sum(...) instead of sum([...]) to avoid allocating the list.
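A rough illustration of the difference, assuming `u = 7` and the inner loop size from the benchmark (the variable names are hypothetical): the list form materialises every value before summing, while the generator form hands values to `sum` one at a time. Note that `sys.getsizeof` reports only the container itself, not the integers it refers to.

```python
import sys

u = 7          # arbitrary illustrative input
n = 100_000    # inner loop size used in the benchmark

as_list = [j % u for j in range(n)]   # materialises all n values up front
as_gen = (j % u for j in range(n))    # produces values lazily, one at a time

size_list = sys.getsizeof(as_list)    # grows with n
size_gen = sys.getsizeof(as_gen)      # small and independent of n

total_from_list = sum(as_list)
total_from_gen = sum(as_gen)          # consumes the generator

if __name__ == "__main__":
    print(size_list, size_gen, total_from_list == total_from_gen)
```

This saves memory; whether it also runs faster is worth measuring, since the result may differ between Python versions.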
