Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: ParquetSerializer can sometimes fail when populating _typeToAssembler cache in parallel #411

Closed
Mitkee opened this issue Oct 2, 2023 · 1 comment · Fixed by #420

Comments

@Mitkee
Copy link

Mitkee commented Oct 2, 2023

Library Version

4.16.3

OS

Windows

OS Architecture

64 bit

How to reproduce?

Unfortunately, no clear steps to reproduce, but seen this happening occasionally in production:


System.InvalidOperationException:
   at System.ThrowHelper.ThrowInvalidOperationException_ConcurrentOperationsNotSupported (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Collections.Generic.Dictionary`2.FindValue (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Collections.Generic.Dictionary`2.TryGetValue (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Parquet.Serialization.ParquetSerializer.GetAssembler (Parquet, Version=4.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926)
   at Parquet.Serialization.ParquetSerializer+<DeserializeAsync>d__6`1.MoveNext (Parquet, Version=4.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926)

Since the _typeToAssembler cache is not concurrent, triggering multiple operations populating this cache in parallel is the cause.

Failing test

Despite effort, I've not been able to reproduce this in a testcase :(
@Mitkee
Copy link
Author

Mitkee commented Oct 2, 2023

As a workaround:

public static class ParquetCachePopulator
{
    public static async Task PopulateParquetTypeCache<T>() where T: new()
    {
        var row = new List<T>() { new T() };
        using var stream = new MemoryStream();
        await ParquetSerializer.SerializeAsync(row, stream);
        stream.Position = 0;
        await ParquetSerializer.DeserializeAsync<T>(stream);
    }
}

..and then running

//Pre-load parquet types:
await ParquetCachePopulator.PopulateParquetTypeCache<StoredContract>();
await ParquetCachePopulator.PopulateParquetTypeCache<StoredPublicTrade>();
await ParquetCachePopulator.PopulateParquetTypeCache<FullOrderEventParquetModel>();
await ParquetCachePopulator.PopulateParquetTypeCache<OrderEventParquetModelForFtpExport>();
await ParquetCachePopulator.PopulateParquetTypeCache<PrivateTradeEventToOrderMap>();
await ParquetCachePopulator.PopulateParquetTypeCache<TradeEventParquetModel>();

pre-startup might very well do the trick. I'll give that a spin and will report back in a few days to see if it stays stable. Now I'm not against pre-loading known types, could even consider making that mandatory to avoid the issue? 🤔 Alternatively, fix the concurrency problem in the type cache, which will most likely come with some minor performance hit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant