
use FNV-1a for kernel indexing instead of md5 #44

Closed
wants to merge 14 commits

Conversation

yan-ming (Contributor)

This will improve HCC runtime performance when multiple kernels are used in a program.

make test:

********************
Testing Time: 633.70s
********************
Failing Tests (7):
    CPPAMP :: Unit/AmpShortVectors/hc_short_vector_device.cpp
    CPPAMP :: Unit/HC/memcpy_symbol1.cpp
    CPPAMP :: Unit/HC/memcpy_symbol3.cpp
    CPPAMP :: Unit/HC/wg_size.cpp
    CPPAMP :: Unit/HSAIL/shfl_xor.cpp
    CPPAMP :: Unit/SharedLibrary/shared_library2.cpp
    CPPAMP :: Unit/SharedLibrary/shared_library3.cpp

  Expected Passes    : 661
  Expected Failures  : 25
  Unsupported Tests  : 10
  Unexpected Failures: 7
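
For context, FNV-1a is a simple XOR-and-multiply hash that is much cheaper to compute than MD5 on short in-memory buffers, which is why it helps when the runtime indexes many kernels. A minimal sketch of the 64-bit variant, using the standard offset basis and prime (illustrative, not the exact code in this patch):

#include <cstddef>
#include <cstdint>

// 64-bit FNV-1a over a byte buffer.
static inline uint64_t fnv1a(const void *source, size_t size) {
  const unsigned char *bytes = static_cast<const unsigned char *>(source);
  uint64_t hash = 0xcbf29ce484222325ULL;  // FNV offset basis
  for (size_t i = 0; i < size; ++i) {
    hash ^= bytes[i];                     // XOR the byte in first ("1a" order)
    hash *= 0x100000001b3ULL;             // then multiply by the FNV prime
  }
  return hash;
}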

Andres-design16 and others added 9 commits April 22, 2016 11:33
Instead of hardcoding the HSA_AMDGPU_GPU_TARGET at compile time,
autodetect it at runtime from the KFD topology.

Change-Id: I00af68084869ab4d439e70cf8816c1c8868f224d
[CMake] Autodetect HSA_AMDGPU_GPU_TARGET
Use new workitem intrinsics + range metadata, correct
some attributes on functions, and canonicalize.

Correct range metadata to be maximum theoretical workgroup size.

Change-Id: I9dedbe2dd62753858ccd0eb7841e228873a2c031
this will improve hcc runtime performance when multiple kernels are used
in a program
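
As a side note on the autodetection commit above: a rough sketch of reading the KFD topology at runtime. The sysfs path matches what ROCm exposes, but the property name parsed here (gfx_target_version) is an assumption; the actual patch may key off a different field or do the detection from CMake instead.

#include <fstream>
#include <string>

// Hypothetical sketch: scan GPU node 0's properties for a target field.
static std::string detectGpuTarget() {
  std::ifstream props("/sys/class/kfd/kfd/topology/nodes/0/properties");
  std::string key, value;
  while (props >> key >> value) {
    if (key == "gfx_target_version")  // assumed property name
      return value;
  }
  return "";  // not found; caller falls back to a build-time default
}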
yan-ming (Contributor Author)

Hey @whchung,

Could you please review this PR?

In my experiments with this patch on the gpu_deconvolution work, the total execution time was reduced from 3m10s to 1m59s, which is a significant improvement.

whchung (Collaborator) commented Apr 29, 2016

@yan-ming thanks for this contribution. Could you alter this PR so it targets the "develop" branch? We are adopting a new branching strategy.


Review thread on the following hunk:

const char *str = static_cast<const char *>(source);

// 104 is the proper size from Jack's research
whchung (Collaborator):

It's 140, not 104

yan-ming (Contributor Author):

Oops, I just quoted that directly from your mail. Let me fix it.

whchung (Collaborator):

Sorry, 104 should be the correct number.

whchung (Collaborator):

On second thought, could you make it a macro? The rationale is that a new code object format is being developed, so this value may have to change soon.

yan-ming (Contributor Author) commented Apr 29, 2016:

Could you give me the desired name for the macro? Or should I just use FNV1A_CUTOFF_SIZE?

whchung (Collaborator):

I'm still not very comfortable with merely passing the size of the header to the hash algorithm. Take BrigModuleHeader as an example: originally it should contain the hash of the BRIG module itself, but it's all filled with zeros in the current implementation. Also, in the LC backend we are moving toward a completely ELF-compatible format, so it's likely we'll run into identical ELF headers for similar kernels.

Could you help study the impact on performance if we don't use this 104-byte heuristic?

whchung (Collaborator):

For reference, here are the sources where I got this 104 magic number. It's the larger value of these two kinds of headers.

yan-ming (Contributor Author):

So the gpu_deconvolution work is actually using the LC backend (due to hcFFT). If I use the original size in the FNV loop, the execution time is about 2m37s.

Here's the comparison for each condition:

  • original: 3m10s
  • FNV hash: 2m37s
  • FNV hash + 104 cutoff: 1m59s

Personally, I would prefer the FNV hash with a fixed-size cutoff for the best performance.

whchung (Collaborator):

Thanks for providing the values. To boost performance, let's make a macro to hold this magic number. How about calling it "KERNEL_CODE_OBJECT_HEADER_SIZE"? And add a comment so we know how it's derived (the larger of HSA's BrigModuleHeader and Elf64_Ehdr).

whchung (Collaborator):

@yan-ming what you proposed (FNV1A_CUTOFF_SIZE) may be a better name after all.
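
Putting the thread's conclusion together, a hypothetical sketch of the macro and cutoff (not the actual patch; this assumes the "104 cutoff" means hashing at most the first 104 bytes of the code object, and reuses the fnv1a sketch from earlier):

#include <cstddef>
#include <cstdint>

// Assumed name per the thread. 104 is the larger of
// sizeof(BrigModuleHeader) and sizeof(Elf64_Ehdr); revisit when the
// code object format changes.
#define FNV1A_CUTOFF_SIZE ((size_t)104)

// Index a kernel by hashing a bounded prefix of its code object.
static inline uint64_t kernelIndexHash(const void *source, size_t size) {
  size_t n = size < FNV1A_CUTOFF_SIZE ? size : FNV1A_CUTOFF_SIZE;
  return fnv1a(source, n);  // fnv1a as sketched after the test results above
}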

yan-ming (Contributor Author)

Hi @whchung,

Sure, let me change the PR destination once I make sure everything is okay.

yan-ming closed this Apr 29, 2016
yan-ming (Contributor Author)

@whchung it seems GitHub doesn't allow me to change the PR destination branch directly; please see PR #45. Sorry for the inconvenience.

whchung (Collaborator) commented Apr 29, 2016

@yan-ming it's alright. GitHub's UI is less convenient than Bitbucket's in this regard.
