
Vectorized Defragmentation #439

Draft · wants to merge 9 commits into base: main

Conversation

dipkakwani (Contributor)

The existing defragmentation algorithm in SlottedArray scans all the slots sequentially and defragments the data structure by moving non-deleted slots together. This PR rewrites the algorithm with a new vectorized implementation.
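The batched approach under review can be summarized with a scalar sketch. This is illustrative Python, not the actual C# implementation; `SLOTS_PER_VECTOR` and the `DELETED` marker are hypothetical stand-ins for the real slot layout and preamble check.

```python
# Scalar sketch of the batched defragmentation idea (illustrative, not Paprika's API).
SLOTS_PER_VECTOR = 16  # a 256-bit vector holds 16 ushort preambles (assumption)
DELETED = object()     # stand-in marker for a deleted slot

def defragment(slots):
    """Compact `slots` in place, keeping live entries in order."""
    write_to = 0
    i = 0
    while i < len(slots):
        batch = slots[i:i + SLOTS_PER_VECTOR]
        if DELETED not in batch:
            # No deleted slots in this batch: copy the whole batch at once
            # (the vectorized version does this with a single wide load/store).
            slots[write_to:write_to + len(batch)] = batch
            write_to += len(batch)
        else:
            # Some slots are deleted: fall back to per-slot copying.
            for slot in batch:
                if slot is not DELETED:
                    slots[write_to] = slot
                    write_to += 1
        i += SLOTS_PER_VECTOR
    del slots[write_to:]
    return slots
```

This mirrors the two paths discussed in the review below: a bulk copy when a whole vector's worth of slots is live, and a per-slot fallback otherwise.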

@Scooletz (Contributor) left a comment:

I think the main issue is the masking of slots, which should not be done. I've provided comments.


// Ignore all the even entries from the vector as the preamble data is contained within
// the first 2 bytes (ushort) out of the 4 bytes slot.
slotData = Avx2.BlendVariable(slotData, Vector256<ushort>.Zero, Vector256.Create((ushort)0b_1010_1010_1010_1010));
Scooletz (Contributor):

Why is this needed? What does it do? Isn't it a bug? There's a BitwiseAnd underneath, so it should already remove what is not needed. The comment states that the slot is 4 bytes long, but after #376 this is no longer the case, in the sense that the preambles and hashes are grouped into vector sizes; see: https://github.com/NethermindEth/Paprika/blob/main/docs/design.md#slottedarray-layout

dipkakwani (Contributor, Author):

Ah, I earlier assumed it to be 2 bytes and recently added this mask. I think I got misled by the total-size parameter within Slot:

public const int TotalSize = Size + sizeof(ushort);

Just realized the comment mentions that it's the total size of hash and slot together; I missed that completely. Maybe I should rename the parameter to TotalSizeWithHash for better clarity? With the current names, TotalSize and Size look exactly alike 😅


var preambleData = Vector256.BitwiseAnd(slotData, preambleMask);

if (Vector256.EqualsAny(preambleData, deletedMask))
Scooletz (Contributor):

I'd benchmark it. There have been cases where XORing and asserting that the result equals zero was a bit faster.

if (Vector256.EqualsAny(preambleData, deletedMask))
{
// Some slots are deleted in this batch, process individually
for (var j = 0; j < SlotsPerVector; j++)
Scooletz (Contributor):

This loop could be replaced. What we could do is extract the most significant bits after the equality check and move in larger chunks, like we do in TryFind. Also, it copies slot by slot with no consideration for contiguous deleted chunks?
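The suggestion above (movemask-style bit extraction plus run-based moves) can be sketched in scalar form. In the vectorized code the bitmask would come from something like `ExtractMostSignificantBits` after the equality compare; here `live_mask` is just an integer whose bit j marks slot j as live. The function name and signature are hypothetical.

```python
# Hypothetical sketch: copy maximal contiguous runs of live slots from one
# batch, driven by a bitmask instead of a per-slot loop.
def copy_live_runs(slots, live_mask, batch_start, write_to, out):
    """Copy live slots of a batch in contiguous runs; returns new write cursor."""
    j = 0
    n = len(slots) - batch_start
    while j < n:
        if not (live_mask >> j) & 1:
            j += 1  # skip a deleted slot
            continue
        # Find the end of this contiguous run of live slots.
        run_start = j
        while j < n and (live_mask >> j) & 1:
            j += 1
        run = slots[batch_start + run_start:batch_start + j]
        out[write_to:write_to + len(run)] = run  # one bulk move per run
        write_to += len(run)
    return write_to
```

Compared to the per-slot loop, this issues one move per run of live slots, so a batch with a single deletion costs two moves rather than fifteen.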

// No deleted slots in this batch, copy the whole batch
if (i != writeTo)
{
CopyBatch(i, writeTo);
Scooletz (Contributor):

Ah, so we do a batch copy, but only if we find the whole batch valid. If there's a single delete, we'll go one by one then?

}

// Helper function to copy a batch of slots
private void CopyBatch(int readFrom, int writeTo)
Scooletz (Contributor):

This looks like a long method. I'd benchmark it heavily.

@Scooletz (Contributor)

One more general comment. Have you considered, @dipkakwani, splitting the search for deletions from the actual compaction of the array? Something that would mimic TryGetImpl: scan through the whole map, construct a bit set of deletes, and then act upon it. We do have an upper boundary on the number of entries (~1024 max for a whole page, which gives 128 ulongs), so maybe search first, then move?

@dipkakwani (Contributor, Author)

dipkakwani commented Nov 21, 2024

> One more general comment. Have you considered @dipkakwani splitting the search for deletions from the actual compaction of the array? Something that would mimic TryGetImpl to scan through the whole map and constructing a bit set of deletes and then act upon it? We do have an upper boundary on the number of entries (~1024 max for a whole page which gives 128 ulongs) so maybe search first, then move?

@Scooletz Yes, that is an interesting point; I did think in this direction, but I wasn't sure if there is any advantage, since eventually the elements have to be moved during defragmentation. The move process, once the deleted mask is formed, would require moving all the elements between two deleted items in one shot; the number of these elements could be smaller than, equal to, or larger than the vector size. I am not sure if there is any advantage to this move operation vs directly reading and moving vectors whenever possible.

@Scooletz (Contributor)

> @Scooletz Yes that is an interesting point, I did think in this direction - but I wasn't sure if there is any advantage since eventually the elements have to be moved during defragmentation. The process to move once the deleted mask is formed would require moving all the elements between two deleted items in one shot; the number of these elements could be smaller, equal or longer than the vector size. I am not sure if there is any advantage with this move operation vs directly reading and moving vectors whenever possible.

Eventually you need to move elements, but if you had a set that marks entries that are alive with 1s and dead with 0s, you could potentially move in as big chunks as possible for both parts: moving the Slots and moving the actual payloads. This could greatly simplify the moving and also allow moving much bigger chunks in one go (fewer memcopy calls). There would be some overhead in consuming this big bit map though. 🤔
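The two-phase idea being discussed (scan first, then move) can be sketched in scalar Python. Everything here is illustrative: the function names, the `is_deleted` predicate, and the entry representation are assumptions, not Paprika's actual types; the source only fixes the bound of ~1024 entries (128 ulongs of bitmap).

```python
# Sketch of the two-phase approach: phase 1 builds a bit set marking live (1)
# vs deleted (0) entries; phase 2 compacts by moving maximal contiguous runs,
# which can span batch boundaries and so minimizes the number of bulk moves.
def build_alive_bitmap(entries, is_deleted):
    """Phase 1: one bit per entry, set if the entry is alive."""
    bitmap = 0
    for idx, e in enumerate(entries):
        if not is_deleted(e):
            bitmap |= 1 << idx
    return bitmap

def compact_with_bitmap(entries, bitmap):
    """Phase 2: move each maximal run of live entries in a single bulk copy."""
    write_to = 0
    i = 0
    n = len(entries)
    while i < n:
        if not (bitmap >> i) & 1:
            i += 1  # skip dead entries
            continue
        start = i
        while i < n and (bitmap >> i) & 1:
            i += 1
        chunk = entries[start:i]                          # one run of live entries,
        entries[write_to:write_to + len(chunk)] = chunk   # moved in one go
        write_to += len(chunk)
    del entries[write_to:]
    return entries
```

The trade-off matches the comment above: runs are not capped at the vector size, so fewer (and larger) memcopy-style moves are issued, at the cost of first producing and then walking the bitmap.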
