Nightly bigint conversion performance is different than I reported #2146

Closed · ronawho opened this issue Feb 15, 2023 · 5 comments

ronawho (Contributor) commented Feb 15, 2023

Nightly 16-node-cs-hdr bigint conversion performance is ~8 and ~9 GiB/s, but I reported ~66 and ~113 GiB/s in #2140.

That was for a larger problem size, but building as myself with today's master I still see better performance for the default problem size:

(master) $ ./benchmarks/run_benchmarks.py bigint_conversion -nl 16

Client Version: v0.0.9-2019-10-21+2774.g1b827c0b.dirty
array size = 100,000,000
number of trials =  6
>>> arkouda uint arrays from bigint array
numLocales = 16, N = 1,600,000,000
bigint_from_uint_arrays Average time = 0.1865 sec
bigint_from_uint_arrays Average rate = 127.8415 GiB/sec

>>> arkouda bigint array to uint arrays
bigint_to_uint_arrays Average time = 0.2782 sec
bigint_to_uint_arrays Average rate = 85.7139 GiB/sec
ronawho (Contributor, Author) commented Feb 15, 2023

I'm using the gnu backend instead of llvm, but I'd be surprised if that had such a large impact. I'm also building with a minimal set of modules (a build sketch follows the list), though it's not clear why that would matter:

BigIntMsg
RandMsg
ReductionMsg
OperatorMsg
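
For reference, a rough sketch of building with only those modules, assuming Arkouda's ServerModules.cfg / ARKOUDA_CONFIG_FILE mechanism (the minimal.cfg file name here is hypothetical):

$ cat > minimal.cfg <<EOF        # list one server module per line
BigIntMsg
RandMsg
ReductionMsg
OperatorMsg
EOF
$ ARKOUDA_CONFIG_FILE=$PWD/minimal.cfg make   # build arkouda_server with just these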

@bmcdonald3 do you have some time to see if you can reproduce the better numbers I'm seeing and chase down the difference between that and nightly?

bmcdonald3 (Contributor) commented

Sure, can take a look today.

I have some speculations, but I will refrain from voicing them until I have proof since I have been fighting with the CS all morning.

bmcdonald3 (Contributor) commented

Hmm, ya, I am seeing numbers closer to nightly with both the LLVM and C backends, though the C backend numbers are slightly better. This may be a stupid question, but are you doing CPU specialization? (And do you usually when running on the CS? I've got my fingers crossed that this is where my problems are coming from.)

LLVM:

array size = 100,000,000
number of trials =  6
>>> arkouda uint arrays from bigint array
numLocales = 16, N = 1,600,000,000
bigint_from_uint_arrays Average time = 2.6833 sec
bigint_from_uint_arrays Average rate = 8.8853 GiB/sec

>>> arkouda bigint array to uint arrays
bigint_to_uint_arrays Average time = 3.2483 sec
bigint_to_uint_arrays Average rate = 7.3398 GiB/sec

C Backend:

array size = 100,000,000
number of trials =  6
>>> arkouda uint arrays from bigint array
numLocales = 16, N = 1,600,000,000
bigint_from_uint_arrays Average time = 2.1787 sec
bigint_from_uint_arrays Average rate = 10.9431 GiB/sec

>>> arkouda bigint array to uint arrays
bigint_to_uint_arrays Average time = 2.6266 sec
bigint_to_uint_arrays Average rate = 9.0771 GiB/sec
printchplenv output:

LLVM env:

$ printchplenv --anonymize
CHPL_TARGET_PLATFORM: cray-cs
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: none *
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet
  CHPL_COMM_SUBSTRATE: ibv
  CHPL_GASNET_SEGMENT: large
CHPL_TASKS: qthreads
CHPL_LAUNCHER: slurm-gasnetrun_ibv
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: system *
CHPL_AUX_FILESYS: none

C-backend env:

$ printchplenv --anonymize
CHPL_TARGET_PLATFORM: cray-cs
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: none *
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet
  CHPL_COMM_SUBSTRATE: ibv
  CHPL_GASNET_SEGMENT: large
CHPL_TASKS: qthreads
CHPL_LAUNCHER: slurm-gasnetrun_ibv
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: none *
CHPL_AUX_FILESYS: none

ronawho (Contributor, Author) commented Feb 15, 2023

No, not doing CPU specialization.
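
For context, CPU specialization here means setting Chapel's CHPL_TARGET_CPU before building; a rough sketch (native specializes for the machine doing the build):

$ export CHPL_TARGET_CPU=native
$ (cd $CHPL_HOME && make)   # rebuild the Chapel runtime for the new target CPU
$ make                      # then rebuild arkouda_server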

Ah, it turns out I had a patch for parallel deinit applied (for chapel-lang/chapel#15215 / #2088 (comment)), and that was the source of my better performance. Without it I see performance more in line with you and nightly:

array size = 100,000,000
number of trials =  6
>>> arkouda uint arrays from bigint array
numLocales = 16, N = 1,600,000,000
bigint_from_uint_arrays Average time = 2.1902 sec
bigint_from_uint_arrays Average rate = 10.8856 GiB/sec

>>> arkouda bigint array to uint arrays
bigint_to_uint_arrays Average time = 2.6303 sec
bigint_to_uint_arrays Average rate = 9.0644 GiB/sec

Given how big the performance benefit is from the parallel deinit, we'll definitely want to make sure that gets into our next release.
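
For anyone reproducing the comparison, a rough sketch of A/B-testing a compiler patch like this (the parallel-deinit.patch file name is hypothetical; see chapel-lang/chapel#15215 for the actual change):

$ cd $CHPL_HOME
$ git apply parallel-deinit.patch    # patched compiler/runtime
$ make
$ (cd $ARKOUDA_HOME && make && ./benchmarks/run_benchmarks.py bigint_conversion -nl 16)
$ git apply -R parallel-deinit.patch # revert to baseline, rebuild, re-run
$ make
$ (cd $ARKOUDA_HOME && make && ./benchmarks/run_benchmarks.py bigint_conversion -nl 16)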

ronawho (Contributor, Author) commented Feb 15, 2023

Closing; we're tracking this internally.

ronawho closed this as completed Feb 15, 2023