Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing up of SlurpIn #168

Open
kmcdermo opened this issue Oct 3, 2018 · 9 comments
Open

Fixing up of SlurpIn #168

kmcdermo opened this issue Oct 3, 2018 · 9 comments
Assignees

Comments

@kmcdermo
Copy link
Collaborator

kmcdermo commented Oct 3, 2018

As reported in various threads over the last weeks, we have seen crashes within the simtrack validation during standard building. The crash would occur with many threads and MEIF, falling down at some "fixed" number of events.

The cause has been tracked down to an abuse of SlurpIn within the BackwardFit leading to undefined behavior. Namely, when computing offsets to read in the tracks to fit (which come from vector of vectors), there is the possibility the offsets become too large. Computing offsets from different allocations brings SlurpIn crashing down.

We have been burned by this before, so this is a call to fix this up for good. So we have a short term and long term plan.

Short term

Since we know this crash is triggered by large amounts of memory in play, we can simply avoid triggering the crash by limiting the number of events processed. Given we have a few open PRs (and more to come), the recommendation for now is to simply change the number of events to process in the validation from 500 to 100.

diff --git a/val_scripts/validation-cmssw-benchmarks.sh b/val_scripts/validation-cmssw-benchmarks.sh
index 595f388..42b2c97 100755
--- a/val_scripts/validation-cmssw-benchmarks.sh
+++ b/val_scripts/validation-cmssw-benchmarks.sh
@@ -17,7 +17,7 @@ source xeon_scripts/init-env.sh
 dir=/data2/slava77/samples/2017/pass-c93773a/initialStep
 subdir=PU70HS/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFullINPUT+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU
 file=memoryFile.fv3.clean.writeAll.CCC1620.recT.082418-25daeda.bin
-nevents=500
+nevents=100
 
 ## Common executable setup
 maxth=64

Long Term

@osschar has volunteered to give this a try by designing some interfaces for some helper packer for the Matriplexes to avoid total rewrites of I/O tracks/hits to matriplexes and related routines.

@srlantz and @dan131riley have also agreed to think on this.

@srlantz
Copy link
Collaborator

srlantz commented Oct 3, 2018

Here is at least an outline of the specific problem, as well as the proposed fix. The goal of SlurpIn is to vgather data from various tracks or hits into a Matriplex. The vgather intrinsic function takes a single base address and a vector of offsets to read (say) 16 floats from 16 different addresses. You can then advance the base pointer and vgather the next 16 floats, then the next 16, etc., storing them sequentially. The resulting data array will then be ordered correctly for a Matriplex. However, these direct vgathers only work for byte offsets <2GB. (In the current code, it's a problem even if intrinsics are not used, because the byte offsets are stored as a vector of ints.)

Proposed fix: rather than directly vgathering data from objects stored at arbitrary memory addresses, do this in two stages. First, copy strips of memory from the desired objects sequentially (such copy operations are vectorizable), and store them into a single flat array big enough to hold (say) 16 such strips. Second, do vgathers as before, but now from the flat array; this flat array will always be limited in size, so there is no risk of having excessively large offsets from the base address. This method was in fact implemented in mtorture a long time ago (in a more limited way), and the favorable performance results from it were presented at CHEP2015. The very simple CopyInContig/Plexify code still exists in the plexify branch of mtorture.

For mkFit, it is likely that some "PackerHelper" class needs to be developed to generalize the process of assembling the strips into the flat array, so that Matriplex itself does not need to worry about whether the strips of data come from Track objects or Hit objects or whatever. Some sort of Plexify method still exists in Matriplex(Sym).h, and this may be able to perform the final step of turning the flat array into a Matriplex (using intrinsics or not, depending on ifdef's).

@osschar
Copy link
Collaborator

osschar commented Oct 8, 2018

I started a branch with this:
https://github.com/osschar/mictest/tree/packer

And the only commit:
https://github.com/osschar/mictest/commit/e4e7af3cdb5a6720ce7227f65415d3d5b31cda41

Please take a look, esp. the Cornellians ... then I have to redo this for Hits ... and Steve needs to provide code for the XyzzPackerPlexify versions.

I plan to template those to Matriplex::value_type ... and use that as the type of m_base and sizeof(T) as scale for vgather. But I need to change this in Matriplex, too, and have AVX2 changes pending there.

@osschar
Copy link
Collaborator

osschar commented Oct 10, 2018

I have updated the packer branch with a more general implementation of the SlurpIn packer.

Note that I only use this in MkFinder::BkFitInputTracks() ... as a demo how it would look. If I start changing this everywhere I'll get conflicts when merging with the AVX2 branch.

The best way to view the whole thing is now:
https://github.com/cerati/mictest/compare/devel...osschar:packer?expand=1

@osschar
Copy link
Collaborator

osschar commented Oct 11, 2018

OK, new version, with changes to all usages of SlurpIn in MkFinder/Fitter. I had to move to a new branch packer-1 as I did rebase onto devel after AVX2 merge.

Compare with devel:
https://github.com/cerati/mictest/compare/devel...osschar:packer-1?expand=1

The branch:
https://github.com/osschar/mictest/tree/packer-1

@dan131riley
Copy link
Collaborator

Hi @osschar

This mostly looks pretty good, I think. Two specific comments and one more general one.

In MkFitter.cc line 209, mhp.Reset(); should be mhp.ResetBase(), since when hi in the outer loop changes, layerHits[hi][hidx] is getting from a different array.

AddInputAt() filling with zeroes means it will copy whatever is at the base address--is that the intended behavior?

On the general design, the one thing I don't really like is having AddInput implicitly set m_base if it is zero. I'd rather see the base address set in the constructor, and get rid of ResetBase(). Then the constructors would look like

   MatriplexPackerSlurpIn(const D* d) :
      m_base (d),
      m_pos  (0)
   {}

   MatriplexErrParPackerSlurpIn(const T& t) :
      MatriplexPackerSlurpIn<D>(t.errArray()),
      m_off_param(t.posArray() - this->m_base)
   {}

and usage (using the aforementioned line 209) would look like

  for (int hi = 0; hi < Nhits; ++hi)
  {
    MatriplexHitPacker mhp(layerHits[hi][0]);

    for (int i = beg; i < end; ++i)
    {
      const int   hidx = tracks[i].getHitIdx(hi);
      const Hit  &hit  = layerHits[hi][hidx];

      HoTArr[hi](i - beg, 0, 0) = tracks[i].getHitOnTrack(hi);
      mhp.AddInput(hit);
    }

    mhp.Pack(msErr[hi], msPar[hi]);
  }

@osschar
Copy link
Collaborator

osschar commented Oct 12, 2018

OK, I pushed rebased branch to cerati/packer

Note that I also somewhat fixed the backward fit by checking the offset between track vectors and splicing them up. I say somewhat as I would expect the ptr-diff of 0x7fffffff or at least 0x1fffffff would work ... instead things start to work when diff is limited to 0x1fffff. Go figure. With this change I can now run, with root validation, over all 5000 event in the hih statistics sample:

./mkFit --cmssw-n2seeds --num-thr 64 --num-thr-ev 32 --input-file /data2/slava77/samples/2017/pass-c93773a/initialStep/PU70HS/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFullINPUT+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU/memoryFile.fv3.clean.writeAll.CCC1620.recT.082418-25daeda.bin --sim-val --try-to-save-sim-info --backward-fit --backward-fit-pca --build-std

@srlantz
Copy link
Collaborator

srlantz commented Oct 16, 2018

@osschar I've been staring at your new classes in MatriplexPackers.h for quite a while and I think I now have a pretty clear idea of what I want to do in order to implement my alternate packing method. But I have a question about it as well... I'll start by sketching what you've done so far, and then tell you what I'm thinking of adding...

  • Currently your AddInput method in your MatriplexPackerSlurpIn class template just manipulates addresses. It computes an offset from the current base (note, overflow is possible), and appends that offset to the current NN-vector of offsets. When SlurpIn is called (via Pack), it loads the current NN-vector of offsets into a register and vgathers from the corresponding addresses in a loop, incrementing the base address until the Matriplex reaches the correct size.
  • The alternate method I have in mind would define a couple of new class templates, say MatriplexPackerChainInAndPlexify and MatriplexErrParPackerChainInAndPlexify, which would re-implement all the same interfaces as your classes but give them different actions. In particular, AddInput would actually go ahead and copy data from the supplied address into a temp array, advancing the pointer to the next input slot in the temp array each time it is called. At a later point, the Pack method would be handed the address of the filled temp array so it could do its vgathers just from that array (where the offsets are all guaranteed to be small).
  • I thought at first I would also need a corresponding special method in Matriplex, like the old intrinsics-based Plexify. But I suppose this really just amounts to a special case of SlurpIn--so it might not even need its own code in Matriplex[Sym].h.
  • Conveniently, at the bottom of MatriplexPackers.h you have already laid out several "using" constructs that ought to make it easy for us to flip from one packing method to the other, via suitable ifdef's.

That's a long-winded intro, but anyway, here's my question for you. Where and how do I define the temp array? The problem is, its size will depend partly on the size of T, which can change. But I don't want to keep creating and destroying such an array every time I go to Pack a Matriplex of a given size. I think it would be better to put some kind of singleton in MatriplexPackers.h (I guess as part of my new base class template), making it large enough to hold the fArray member of any Matriplex I am going to Pack at any point in the code. Would this be a good approach? Or do you think it's too hacky?

@osschar
Copy link
Collaborator

osschar commented Oct 16, 2018

Yes, you got it right :)

In constructor we pass in T& as the "base" argument. You can use this to extract the size of the target matriplexes (something like base.errMatrix().GetNElements()) ...that's for ErrPar. We can also add a second parameter for the "payload size" to the constructor that gets ignored in the SlurpIn implementation (as this information is then implicit in Matriplexes passed into Pack).

@srlantz
Copy link
Collaborator

srlantz commented Oct 16, 2018

OK, I'll go ahead and code it up and we should be able to converge from there :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants