Nightly bigint conversion performance is different than I reported #2146

Closed · ronawho opened this issue Feb 15, 2023 · 5 comments

ronawho (Contributor) commented Feb 15, 2023

Nightly 16-node-cs-hdr bigint conversion performance is ~8 and ~9 GiB/s, but I reported ~66 and ~113 GiB/s in #2140.

That was for a larger problem size, but building as myself with today's master I still see better performance for the default problem size:

(master) $ ./benchmarks/run_benchmarks.py bigint_conversion -nl 16

Client Version: v0.0.9-2019-10-21+2774.g1b827c0b.dirty
array size = 100,000,000
number of trials =  6
>>> arkouda uint arrays from bigint array
numLocales = 16, N = 1,600,000,000
bigint_from_uint_arrays Average time = 0.1865 sec
bigint_from_uint_arrays Average rate = 127.8415 GiB/sec

>>> arkouda bigint array to uint arrays
bigint_to_uint_arrays Average time = 0.2782 sec
bigint_to_uint_arrays Average rate = 85.7139 GiB/sec
ronawho (Contributor, Author) commented Feb 15, 2023

I'm using the gnu backend instead of llvm, but I'd be surprised if that had such a large impact. I'm also building with a minimal set of modules (a build sketch follows the list), though it's not clear why that would matter:

BigIntMsg
RandMsg
ReductionMsg
OperatorMsg
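
For reference, a rough sketch of building with only those modules, assuming Arkouda's ServerModules.cfg / ARKOUDA_CONFIG_FILE mechanism (the minimal.cfg file name here is hypothetical):

$ cat > minimal.cfg <<EOF        # list one server module per line
BigIntMsg
RandMsg
ReductionMsg
OperatorMsg
EOF
$ ARKOUDA_CONFIG_FILE=$PWD/minimal.cfg make   # build arkouda_server with just these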

@bmcdonald3 do you have some time to see if you can reproduce the better numbers I'm seeing and chase down the difference between that and nightly?

bmcdonald3 (Contributor) commented

Sure, can take a look today.

I have some speculations, but I will refrain from voicing them until I have proof since I have been fighting with the CS all morning.

bmcdonald3 (Contributor) commented

Hmm, ya, I am seeing numbers closer to nightly with both the LLVM and C backends, though the C backend numbers are slightly better. This may be a stupid question, but are you doing CPU specialization? (And do you usually when running on the CS? I've got my fingers crossed that this is where my problems are coming from.)

LLVM:

array size = 100,000,000
number of trials =  6
>>> arkouda uint arrays from bigint array
numLocales = 16, N = 1,600,000,000
bigint_from_uint_arrays Average time = 2.6833 sec
bigint_from_uint_arrays Average rate = 8.8853 GiB/sec

>>> arkouda bigint array to uint arrays
bigint_to_uint_arrays Average time = 3.2483 sec
bigint_to_uint_arrays Average rate = 7.3398 GiB/sec

C Backend:

array size = 100,000,000
number of trials =  6
>>> arkouda uint arrays from bigint array
numLocales = 16, N = 1,600,000,000
bigint_from_uint_arrays Average time = 2.1787 sec
bigint_from_uint_arrays Average rate = 10.9431 GiB/sec

>>> arkouda bigint array to uint arrays
bigint_to_uint_arrays Average time = 2.6266 sec
bigint_to_uint_arrays Average rate = 9.0771 GiB/sec
printchplenv output:

LLVM env:

$ printchplenv --anonymize
CHPL_TARGET_PLATFORM: cray-cs
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: none *
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet
  CHPL_COMM_SUBSTRATE: ibv
  CHPL_GASNET_SEGMENT: large
CHPL_TASKS: qthreads
CHPL_LAUNCHER: slurm-gasnetrun_ibv
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: system *
CHPL_AUX_FILESYS: none

C-backend env:

$ printchplenv --anonymize
CHPL_TARGET_PLATFORM: cray-cs
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: none *
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet
  CHPL_COMM_SUBSTRATE: ibv
  CHPL_GASNET_SEGMENT: large
CHPL_TASKS: qthreads
CHPL_LAUNCHER: slurm-gasnetrun_ibv
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: none *
CHPL_AUX_FILESYS: none

ronawho (Contributor, Author) commented Feb 15, 2023

No, not doing CPU specialization.
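
For context, CPU specialization here means setting Chapel's CHPL_TARGET_CPU before building; a rough sketch (native specializes for the machine doing the build):

$ export CHPL_TARGET_CPU=native
$ (cd $CHPL_HOME && make)   # rebuild the Chapel runtime for the new target CPU
$ make                      # then rebuild arkouda_server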

Ah, it turns out I had a patch for parallel deinit applied (for chapel-lang/chapel#15215 / #2088 (comment)), and that was the source of my better performance. Without it I see performance more in line with you and nightly:

array size = 100,000,000
number of trials =  6
>>> arkouda uint arrays from bigint array
numLocales = 16, N = 1,600,000,000
bigint_from_uint_arrays Average time = 2.1902 sec
bigint_from_uint_arrays Average rate = 10.8856 GiB/sec

>>> arkouda bigint array to uint arrays
bigint_to_uint_arrays Average time = 2.6303 sec
bigint_to_uint_arrays Average rate = 9.0644 GiB/sec

Given how big the performance benefit is from the parallel deinit, we'll definitely want to make sure that gets into our next release.
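
For anyone reproducing the comparison, a rough sketch of A/B-testing a compiler patch like this (the parallel-deinit.patch file name is hypothetical; see chapel-lang/chapel#15215 for the actual change):

$ cd $CHPL_HOME
$ git apply parallel-deinit.patch    # patched compiler/runtime
$ make
$ (cd $ARKOUDA_HOME && make && ./benchmarks/run_benchmarks.py bigint_conversion -nl 16)
$ git apply -R parallel-deinit.patch # revert to baseline, rebuild, re-run
$ make
$ (cd $ARKOUDA_HOME && make && ./benchmarks/run_benchmarks.py bigint_conversion -nl 16)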

ronawho (Contributor, Author) commented Feb 15, 2023

Closing; we're tracking this internally.

ronawho closed this as completed Feb 15, 2023