
Vectorized Defragmentation #439

Draft · wants to merge 9 commits into base: main

Conversation

dipkakwani (Contributor)

The existing defragmentation algorithm in SlottedArray scans all the slots sequentially and defragments the data structure by moving non-deleted slots together. This PR rewrites the algorithm with a new vectorized implementation.
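The batched approach under review can be summarized with a scalar sketch. This is illustrative Python, not the actual C# implementation; `SLOTS_PER_VECTOR` and the `DELETED` marker are hypothetical stand-ins for the real slot layout and preamble check.

```python
# Scalar sketch of the batched defragmentation idea (illustrative, not Paprika's API).
SLOTS_PER_VECTOR = 16  # a 256-bit vector holds 16 ushort preambles (assumption)
DELETED = object()     # stand-in marker for a deleted slot

def defragment(slots):
    """Compact `slots` in place, keeping live entries in order."""
    write_to = 0
    i = 0
    while i < len(slots):
        batch = slots[i:i + SLOTS_PER_VECTOR]
        if DELETED not in batch:
            # No deleted slots in this batch: copy the whole batch at once
            # (the vectorized version does this with a single wide load/store).
            slots[write_to:write_to + len(batch)] = batch
            write_to += len(batch)
        else:
            # Some slots are deleted: fall back to per-slot copying.
            for slot in batch:
                if slot is not DELETED:
                    slots[write_to] = slot
                    write_to += 1
        i += SLOTS_PER_VECTOR
    del slots[write_to:]
    return slots
```

This mirrors the two paths discussed in the review below: a bulk copy when a whole vector's worth of slots is live, and a per-slot fallback otherwise.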

@Scooletz (Contributor) left a comment:

I think the main issue is the masking of slots, which should not be done. I've provided comments.


// Ignore all the even entries from the vector as the preamble data is contained within
// the first 2 bytes (ushort) out of the 4 bytes slot.
slotData = Avx2.BlendVariable(slotData, Vector256<ushort>.Zero, Vector256.Create((ushort)0b_1010_1010_1010_1010));
Scooletz (Contributor):

Why is this needed? What does it do? Isn't it a bug? There's a BitwiseAnd underneath, so it should already remove what is not needed. The comment states that the slot is 4 bytes long, but after #376 this is no longer the case, in the sense that the preambles and hashes are grouped into vector sizes; see: https://github.com/NethermindEth/Paprika/blob/main/docs/design.md#slottedarray-layout

dipkakwani (Contributor, Author):

Ah, I earlier assumed it to be 2 bytes and recently added this mask. I think I got misled by the total-size parameter within Slot:

public const int TotalSize = Size + sizeof(ushort);

Just realized the comment mentions that it's the total size of hash and slot together; I missed that completely. Maybe I should rename the parameter to TotalSizeWithHash for better clarity? With the current names, TotalSize and Size look exactly alike 😅


var preambleData = Vector256.BitwiseAnd(slotData, preambleMask);

if (Vector256.EqualsAny(preambleData, deletedMask))
Scooletz (Contributor):

I'd benchmark it. There have been cases where XORing and asserting that the result equals zero was a bit faster.

if (Vector256.EqualsAny(preambleData, deletedMask))
{
// Some slots are deleted in this batch, process individually
for (var j = 0; j < SlotsPerVector; j++)
Scooletz (Contributor):

This loop could be replaced. What we could do is extract the most significant bits after the equality check and move in larger chunks, like we do in TryFind. Also, it copies slot by slot with no consideration for contiguous deleted chunks?
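The suggestion above (movemask-style bit extraction plus run-based moves) can be sketched in scalar form. In the vectorized code the bitmask would come from something like `ExtractMostSignificantBits` after the equality compare; here `live_mask` is just an integer whose bit j marks slot j as live. The function name and signature are hypothetical.

```python
# Hypothetical sketch: copy maximal contiguous runs of live slots from one
# batch, driven by a bitmask instead of a per-slot loop.
def copy_live_runs(slots, live_mask, batch_start, write_to, out):
    """Copy live slots of a batch in contiguous runs; returns new write cursor."""
    j = 0
    n = len(slots) - batch_start
    while j < n:
        if not (live_mask >> j) & 1:
            j += 1  # skip a deleted slot
            continue
        # Find the end of this contiguous run of live slots.
        run_start = j
        while j < n and (live_mask >> j) & 1:
            j += 1
        run = slots[batch_start + run_start:batch_start + j]
        out[write_to:write_to + len(run)] = run  # one bulk move per run
        write_to += len(run)
    return write_to
```

Compared to the per-slot loop, this issues one move per run of live slots, so a batch with a single deletion costs two moves rather than fifteen.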

// No deleted slots in this batch, copy the whole batch
if (i != writeTo)
{
CopyBatch(i, writeTo);
Scooletz (Contributor):

Ah, so we do a batch copy, but only if we find the whole batch valid. If there's a single delete, we'll go one by one then?

}

// Helper function to copy a batch of slots
private void CopyBatch(int readFrom, int writeTo)
Scooletz (Contributor):

This looks like a long method. I'd benchmark it heavily.

@Scooletz (Contributor)

One more general comment. Have you considered, @dipkakwani, splitting the search for deletions from the actual compaction of the array? Something that would mimic TryGetImpl: scan through the whole map, construct a bit set of deletes, and then act upon it. We do have an upper boundary on the number of entries (~1024 max for a whole page, which gives 128 ulongs), so maybe search first, then move?

@dipkakwani (Contributor, Author)

dipkakwani commented Nov 21, 2024

> One more general comment. Have you considered @dipkakwani splitting the search for deletions from the actual compaction of the array? Something that would mimic TryGetImpl to scan through the whole map and constructing a bit set of deletes and then act upon it? We do have an upper boundary on the number of entries (~1024 max for a whole page which gives 128 ulongs) so maybe search first, then move?

@Scooletz Yes, that is an interesting point; I did think in this direction, but I wasn't sure if there is any advantage, since eventually the elements have to be moved during defragmentation. The move process, once the deleted mask is formed, would require moving all the elements between two deleted items in one shot; the number of these elements could be smaller than, equal to, or larger than the vector size. I am not sure if there is any advantage to this move operation vs directly reading and moving vectors whenever possible.

@Scooletz (Contributor)

> @Scooletz Yes that is an interesting point, I did think in this direction - but I wasn't sure if there is any advantage since eventually the elements have to be moved during defragmentation. The process to move once the deleted mask is formed would require moving all the elements between two deleted items in one shot; the number of these elements could be smaller, equal or longer than the vector size. I am not sure if there is any advantage with this move operation vs directly reading and moving vectors whenever possible.

Eventually you need to move elements, but if you had a set that marks entries that are alive with 1s and dead with 0s, you could potentially move in as big chunks as possible for both parts: moving the Slots and moving the actual payloads. This could greatly simplify the moving and also allow moving much bigger chunks in one go (fewer memcopy calls). There would be some overhead in consuming this big bit map though. 🤔
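The two-phase idea being discussed (scan first, then move) can be sketched in scalar Python. Everything here is illustrative: the function names, the `is_deleted` predicate, and the entry representation are assumptions, not Paprika's actual types; the source only fixes the bound of ~1024 entries (128 ulongs of bitmap).

```python
# Sketch of the two-phase approach: phase 1 builds a bit set marking live (1)
# vs deleted (0) entries; phase 2 compacts by moving maximal contiguous runs,
# which can span batch boundaries and so minimizes the number of bulk moves.
def build_alive_bitmap(entries, is_deleted):
    """Phase 1: one bit per entry, set if the entry is alive."""
    bitmap = 0
    for idx, e in enumerate(entries):
        if not is_deleted(e):
            bitmap |= 1 << idx
    return bitmap

def compact_with_bitmap(entries, bitmap):
    """Phase 2: move each maximal run of live entries in a single bulk copy."""
    write_to = 0
    i = 0
    n = len(entries)
    while i < n:
        if not (bitmap >> i) & 1:
            i += 1  # skip dead entries
            continue
        start = i
        while i < n and (bitmap >> i) & 1:
            i += 1
        chunk = entries[start:i]                          # one run of live entries,
        entries[write_to:write_to + len(chunk)] = chunk   # moved in one go
        write_to += len(chunk)
    del entries[write_to:]
    return entries
```

The trade-off matches the comment above: runs are not capped at the vector size, so fewer (and larger) memcopy-style moves are issued, at the cost of first producing and then walking the bitmap.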
