
Create a pool wrapper of DynamicSparseNumberArray #29

Closed
wants to merge 5 commits

Conversation

@lindsayad (Member) commented Jul 8, 2022

One of our tasks for NEAMS this year is to test pool-based AD

@lindsayad (Member Author)

Running:

lindad@pop-os:~/projects/moose/modules/tensor_mechanics/test/tests/ad_isotropic_elasticity_tensor(devel)$ ../../../tensor_mechanics-oprof -i 2D-axisymmetric_rz_test.i Mesh/uniform_refine=6

I get these profiles (click in to see them closely). The timings are:

  • PoolDynamicSparseNumberArray: 22 seconds
  • DynamicSparseNumberArray: 10 seconds (it seems like this has gotten a lot better?!)
  • SemiDynamicSparseNumberArray (our currently used container): 7 seconds

PoolDynamicSparseNumberArray

[profiling screenshot: testing-ad-pool]

DynamicSparseNumberArray

[profiling screenshot: dsna]

SemiDynamicSparseNumberArray

[profiling screenshot: classic]

@lindsayad (Member Author)

It looks like smart-pointer reference counting is the primary expense. Using this becomes trickier if we are running with multithreading; just making the pool a thread_local global isn't safe enough.
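For reference, a minimal sketch of the two pooling strategies in play; the names and structure here are illustrative, not the actual PR code:

```cpp
#include <memory>
#include <stack>
#include <vector>

// Illustrative stand-in for the pooled container's backing storage.
struct DSNAStorage
{
  std::vector<double> data;
  std::vector<unsigned int> indices;
};

// Strategy 1: hand out shared_ptr handles from a shared pool. Every copy
// and destruction of a handle bumps the atomic reference count, which is
// the "smart pointer counting" cost showing up in the profile.
struct RefCountedPool
{
  std::shared_ptr<DSNAStorage> acquire()
  {
    if (_free.empty())
      return std::make_shared<DSNAStorage>();
    auto storage = _free.top();
    _free.pop();
    return storage;
  }

  void release(std::shared_ptr<DSNAStorage> storage) { _free.push(std::move(storage)); }

  std::stack<std::shared_ptr<DSNAStorage>> _free;
};

// Strategy 2: a thread_local pool avoids both locking and atomic
// reference counting, but (as noted above) is not obviously safe under
// every threading model.
thread_local std::stack<DSNAStorage> tl_pool;
```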

@lindsayad (Member Author)

Hmm, actually, running the same test with the same refinement as above with --n-threads=4, I get segmentation faults with an OpenMP thread model and a thread_local pool, but a clean run with a pthread thread model. For the latter I ran with -fsanitize=thread and saw no errors. So maybe OpenMP doesn't work with thread_local?

@lindsayad (Member Author)

Anyway, maybe @permcody, @friedmud, @roystgnr, and I should meet and talk about whether we want to try to drill down on this PR more. I think I'd like to merge something, but what that is is up for discussion. (And then what the end result is will determine whether we push this downstream at all. As it stands now, we would not push this downstream.)

@lindsayad (Member Author)

@roystgnr what are your thoughts on this PR?

@roystgnr (Member)

For merging? IMHO adding this would be something in between a red herring and an outright trap for users; if we're 120% slower than non-pool DSNA then there's nothing to justify the extra complication.

For improvement? I'd try, instead of having a PDSNA that pulls from a pool of DSNAs and accesses the memory via pointers, a PDSNA that pulls from a pool of data vectors and a pool of index vectors and accesses the memory by std::move'ing it directly into (and back out of, when returning it to the pool) its own _data and _indices. I'm not confident that this would be faster than pure DSNA either, but it would at least get rid of the extra indirection layer that the PDSNA here requires.
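A rough sketch of that move-based layout (hypothetical names and types, not actual MetaPhysicL code; the real _data/_indices types would come from the DSNA template parameters):

```cpp
#include <stack>
#include <utility>
#include <vector>

// Hypothetical pool of raw vectors: acquire() hands a (possibly
// pre-grown) vector out by value via move, release() returns its
// capacity to the pool for reuse.
template <typename T>
struct VectorPool
{
  std::vector<T> acquire()
  {
    if (_free.empty())
      return {};
    std::vector<T> v = std::move(_free.top());
    _free.pop();
    v.clear(); // keep the capacity, drop the old contents
    return v;
  }

  void release(std::vector<T> && v) { _free.push(std::move(v)); }

  std::stack<std::vector<T>> _free;
};

// Sketch of a PDSNA that owns its storage directly: no pointer
// indirection, just vectors moved out of the pools on construction and
// moved back on destruction.
struct PoolBackedDSNA
{
  PoolBackedDSNA(VectorPool<double> & dpool, VectorPool<unsigned int> & ipool)
    : _dpool(dpool), _ipool(ipool), _data(_dpool.acquire()), _indices(_ipool.acquire())
  {
  }

  ~PoolBackedDSNA()
  {
    _dpool.release(std::move(_data));
    _ipool.release(std::move(_indices));
  }

  VectorPool<double> & _dpool;
  VectorPool<unsigned int> & _ipool;
  std::vector<double> _data;
  std::vector<unsigned int> _indices;
};
```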

@lindsayad lindsayad marked this pull request as draft July 19, 2022 16:45
@lindsayad (Member Author)

Looks like it's slower in the naive implementation I just tried (13 seconds):

[profiling screenshot: dsna-with-stack]

@lindsayad (Member Author)

With a vector backing the stack:

[profiling screenshot: dsna-with-vector-stack]

@lindsayad (Member Author)

Yeah, I think resize is a small enough contributor to the overall simulation time at this point that spending more development time trying to abate its effect seems like time poorly spent. Time would probably be better spent figuring out whether there are any possible algorithmic improvements in sparsity_union. I'm leaning towards closing this.

@roystgnr (Member)

11.88-12.02% in resize() and dependencies isn't trivial. We could probably cut that in half if we put our data and indices into a single vector<pair> so every two resizes becomes one. Not sure if that would be worth the pain, though.
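A sketch of that single-allocation layout (illustrative only, not the actual DSNA internals):

```cpp
#include <utility>
#include <vector>

// Instead of parallel _data and _indices vectors (two heap allocations
// per growth), store index/value pairs together so one resize covers
// both. The trade-off is that iterating over just the indices loses
// contiguity.
struct PairedSparseStorage
{
  std::vector<std::pair<unsigned int, double>> entries;

  void resize(std::size_t n) { entries.resize(n); }
};
```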

Or ... perhaps when we first find ourselves having to resize() above zero, we could first reserve() more than the std::vector default, to reduce the odds of having to hit the heap again later? If we were to bump the initial reserved size up high enough for most MOOSE users to never go beyond it, our speed would probably be within spitting distance of non-heap-allocated SDSNA, but then the exceptional MOOSE apps would just run a little slower when they went beyond the default max, rather than seeing an error and having to reconfigure and recompile.
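Roughly what the reserve-on-first-growth idea would look like (INITIAL_RESERVE is a made-up knob here; the point is to pick it large enough that typical MOOSE runs never reallocate again):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical default capacity for the first growth past zero.
constexpr std::size_t INITIAL_RESERVE = 64;

template <typename T>
void grow(std::vector<T> & v, std::size_t new_size)
{
  // The first time we grow past zero, reserve more than the std::vector
  // default so later resizes are less likely to hit the heap again.
  if (v.capacity() == 0 && new_size > 0)
    v.reserve(std::max(new_size, INITIAL_RESERVE));
  v.resize(new_size);
}
```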

The 15.48-15.84% in sparsity_union does look serious. I'd love to see line-by-line profiling of that. I wish there were some way to at least make the if(!unseen_indices) return; path faster, but I don't see one ... and us spending non-trivial time in resize() probably indicates that the early-return case isn't ubiquitous enough to focus on anyway.
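For context, a highly simplified sketch of the kind of sparsity-union logic being discussed, including the early-return path; this is not the actual MetaPhysicL implementation, just an illustration of where the cheap and expensive paths sit:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Merge another sparsity pattern's indices into ours (both sorted,
// duplicate-free). If nothing new would be added we return before doing
// any resizing; the profiling above suggests that cheap path is not hit
// often enough to be the thing to optimize.
void sparsity_union_sketch(std::vector<unsigned int> & ours,
                           const std::vector<unsigned int> & theirs)
{
  std::size_t unseen_indices = 0;
  auto it = ours.begin();
  for (const unsigned int idx : theirs)
  {
    while (it != ours.end() && *it < idx)
      ++it;
    if (it == ours.end() || *it != idx)
      ++unseen_indices;
  }

  if (!unseen_indices)
    return;

  // Otherwise grow and merge; this is where the resize/allocation cost lives.
  std::vector<unsigned int> merged;
  merged.reserve(ours.size() + unseen_indices);
  std::set_union(ours.begin(), ours.end(), theirs.begin(), theirs.end(),
                 std::back_inserter(merged));
  ours = std::move(merged);
}
```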

> I'm leaning towards closing this

Sad, but "we tried X and it didn't work" is a useful test outcome too.

@lindsayad (Member Author)

So I did some work on reserving up front for what the maximum sparsity pattern size might be in roystgnr#1 (comment), and at that time it was slower than vanilla DynamicSparseNumberArray by a significant margin.
