Fixing up of SlurpIn #168
Here is at least an outline of the specific problem, as well as the proposed fix.

The goal of SlurpIn is to vgather data from various tracks or hits into a Matriplex. The vgather intrinsic takes a single base address and a vector of offsets to read (say) 16 floats from 16 different addresses. You can then advance the base pointer and vgather the next 16 floats, then the next 16, etc., storing them sequentially. The resulting data array is then ordered correctly for a Matriplex. However, these direct vgathers only work for byte offsets below 2 GB. (In the current code it is a problem even if intrinsics are not used, because the byte offsets are stored as a vector of ints.)

Proposed fix: rather than directly vgathering data from objects stored at arbitrary memory addresses, do this in two stages. First, copy strips of memory from the desired objects sequentially (such copy operations are vectorizable) and store them into a single flat array big enough to hold (say) 16 such strips. Second, do vgathers as before, but now from the flat array; this flat array is always limited in size, so there is no risk of excessively large offsets from the base address.

This method was in fact implemented in mtorture a long time ago (in a more limited way), and the favorable performance results from it were presented at CHEP2015. The very simple CopyInContig/Plexify code still exists in the plexify branch of mtorture. For mkFit, some "PackerHelper" class will likely need to be developed to generalize the process of assembling the strips into the flat array, so that Matriplex itself does not need to worry about whether the strips of data come from Track objects, Hit objects, or whatever. Some sort of Plexify method still exists in Matriplex(Sym).h, and this may be able to perform the final step of turning the flat array into a Matriplex (using intrinsics or not, depending on ifdef's).
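The two-stage idea above can be sketched roughly as follows. This is a minimal illustration, not the actual mkFit code: the function name, the strip size, and the plain-loop "gather" (standing in for the vgather intrinsic) are all illustrative assumptions.

```cpp
#include <cassert>
#include <cstring>

// Sketch of the proposed two-stage pack (illustrative names, not mkFit API).
// Stage 1 copies a fixed-size "strip" of floats out of each source object into
// a small flat buffer with contiguous, vectorizable copies. Stage 2 reorders
// the flat buffer into Matriplex (structure-of-arrays) layout, where element j
// of every strip is stored contiguously across the N strips. All stage-2
// offsets stay within the small flat buffer, so they can never exceed 2 GB.
constexpr int N = 16;     // vector width: number of strips packed at once
constexpr int STRIP = 6;  // floats per strip (e.g. track parameters)

void pack_two_stage(const float* const src[N], float* out)
{
  float flat[N * STRIP];  // stage-1 buffer, always small
  for (int i = 0; i < N; ++i)
    std::memcpy(&flat[i * STRIP], src[i], STRIP * sizeof(float));
  for (int j = 0; j < STRIP; ++j)    // stage-2 "gather" from the flat array
    for (int i = 0; i < N; ++i)      // (a vgather with tiny, safe offsets)
      out[j * N + i] = flat[i * STRIP + j];
}
```

In the real implementation the stage-2 loop would be the existing vgather path; the point of the sketch is only that its base pointer and offsets are now confined to the flat buffer.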
I started a branch with this: And the only commit: Please take a look, esp. the Cornellians ... Then I have to redo this for Hits, and Steve needs to provide code for the XyzzPackerPlexify versions. I plan to template those on Matriplex::value_type and use that as the type of m_base, with sizeof(T) as the scale for vgather. But I need to change this in Matriplex too, and I have AVX2 changes pending there.
I have updated the packer branch with a more general implementation of the SlurpIn packer. Note that I only use this in MkFinder::BkFitInputTracks() ... as a demo of how it would look. If I start changing this everywhere, I'll get conflicts when merging with the AVX2 branch. The best way to view the whole thing is now:
OK, new version, with changes to all usages of SlurpIn in MkFinder/Fitter. I had to move to a new branch, packer-1, as I rebased onto devel after the AVX2 merge. Compare with devel: The branch:
Hi @osschar, this mostly looks pretty good, I think. Two specific comments and one more general one. In MkFitter.cc line 209,
On the general design, the one thing I don't really like is having
and usage (using the aforementioned line 209) would look like
OK, I pushed the rebased branch to cerati/packer. Note that I also somewhat fixed the backward fit by checking the offset between track vectors and splitting them up. I say somewhat, as I would expect a ptr-diff limit of 0x7fffffff, or at least 0x1fffffff, to work ... instead things only start to work when the diff is limited to 0x1fffff. Go figure. With this change I can now run, with ROOT validation, over all 5000 events in the high statistics sample:
@osschar I've been staring at your new classes in MatriplexPackers.h for quite a while and I think I now have a pretty clear idea of what I want to do in order to implement my alternate packing method. But I have a question about it as well... I'll start by sketching what you've done so far, and then tell you what I'm thinking of adding...
That's a long-winded intro, but anyway, here's my question for you. Where and how do I define the temp array? The problem is, its size will depend partly on the size of T, which can change. But I don't want to keep creating and destroying such an array every time I go to Pack a Matriplex of a given size. I think it would be better to put some kind of singleton in MatriplexPackers.h (I guess as part of my new base class template), making it large enough to hold the fArray member of any Matriplex I am going to Pack at any point in the code. Would this be a good approach? Or do you think it's too hacky?
Yes, you got it right :) In the constructor we pass in T& as the "base" argument. You can use this to extract the size of the target matriplexes (something like base.errMatrix().GetNElements()) ... that's for ErrPar. We can also add a second parameter for the "payload size" to the constructor that gets ignored in the SlurpIn implementation (as this information is then implicit in the Matriplexes passed into Pack).
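One way the persistent temp array discussed above could look is sketched below. This is a hypothetical illustration, not mkFit code: the class and function names are invented, and it uses a function-local static in a class template (one buffer per element type T) with grow-only resizing, so the buffer is not recreated on every Pack call.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the shared scratch-array idea (illustrative names).
// Each element type T gets its own persistent buffer; the buffer only ever
// grows, so it ends up large enough for the largest Matriplex Pack'ed so far
// without being reallocated on every call.
template <typename T>
struct PackerScratch {
  static std::vector<T>& get(std::size_t n_elems) {
    static std::vector<T> buf;                       // persists across calls
    if (buf.size() < n_elems) buf.resize(n_elems);   // grow only, never shrink
    return buf;
  }
};
```

In a multithreaded context like mkFit this would presumably need to be `thread_local` rather than a plain static, so that concurrent Pack calls do not share one buffer.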
OK, I'll go ahead and code it up and we should be able to converge from there :-)
As reported in various threads over the last weeks, we have seen crashes within the simtrack validation during standard building. The crash would occur with many threads and MEIF, falling down at some "fixed" number of events.
The cause has been tracked down to an abuse of SlurpIn within the BackwardFit leading to undefined behavior. Namely, when computing offsets to read in the tracks to fit (which come from a vector of vectors), there is the possibility that the offsets become too large. Computing offsets across different allocations brings SlurpIn crashing down.
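The failure mode described above can be made concrete with a small guard function. This is an illustrative sketch, not the actual fix: the function name is invented, and it only captures the constraint that a gather consumes 32-bit signed byte offsets from a single base pointer.

```cpp
#include <climits>
#include <cstddef>

// Illustrative guard (not the mkFit code): a vgather takes 32-bit signed
// offsets relative to one base pointer. An offset computed between elements
// of two different allocations is already undefined behavior, and even within
// one allocation an offset that does not fit in an int must take a fallback
// copy path instead of the gather.
inline bool offset_fits_gather(std::ptrdiff_t byte_off)
{
  return byte_off >= 0 && byte_off <= INT_MAX;
}
```

Tracks coming from a vector of vectors live in separate allocations, so nothing bounds the pointer differences between them; this is exactly the case the guard (or the two-stage packer) has to handle.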
We have been burned by this before, so this is a call to fix it up for good. We have both a short-term and a long-term plan.
Short term
Since we know this crash is triggered by large amounts of memory in play, we can simply avoid triggering the crash by limiting the number of events processed. Given we have a few open PRs (and more to come), the recommendation for now is to simply change the number of events to process in the validation from 500 to 100.
Long term
@osschar has volunteered to give this a try by designing interfaces for a helper packer for the Matriplexes, to avoid a total rewrite of the track/hit-to-Matriplex I/O and related routines.
@srlantz and @dan131riley have also agreed to think on this.