
Really poor performance with streaming parser #79

Open
bryanmatteson opened this issue Jul 29, 2020 · 4 comments
bryanmatteson commented Jul 29, 2020

Parsing an HL7 file with really long lines results in unusable performance. In most of the ORU_R01 messages I deal with, we have some OBX segments with embedded PDF files (base64 encoded). This results in lines in the PDF being millions of bytes long. So the problem is that in StreamTextCursor.ParseText it does a full parse after every time it loads a new chunk because it reaches the end of the data before it reaches the end of the line.
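To see why this hurts so much, here is a language-agnostic sketch (Python, not Machete's actual code; `naive_scan_cost` and `resumed_scan_cost` are illustrative names): if the cursor restarts its line scan after every chunk refill, the total bytes examined grow with the square of the number of chunks in the line.

```python
CHUNK = 4096

def naive_scan_cost(line_length, chunk=CHUNK):
    """Bytes examined if the parser restarts the line scan after each refill."""
    cost = 0
    loaded = 0
    while loaded < line_length:
        loaded = min(loaded + chunk, line_length)
        cost += loaded          # re-scans everything loaded so far
    return cost

def resumed_scan_cost(line_length, chunk=CHUNK):
    """Bytes examined if the scan resumes where it left off."""
    return line_length          # each byte is looked at exactly once

# A single 4 MB base64-encoded PDF line: restarting the scan costs
# hundreds of times more byte inspections than resuming it.
line = 4 * 1024 * 1024
assert naive_scan_cost(line) > 500 * resumed_scan_cost(line)
```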

Here is an HL7 attachment that can help illustrate the issue. It's parsed like this:

string filePath = ...;
using var stream = File.OpenRead(filePath);
StreamText text = await new StreamTextReader(stream).Text;
var parse = await parser.ParseStream(text, new TextSpan(0, text.Length));

HR7Message.txt

ahives (Contributor) commented Jul 29, 2020

Can you please add your query?

bryanmatteson (Author) commented Jul 29, 2020

The issue is actually separate from any query. The problem is in parser.ParseStream. Here's a full main function that can reproduce the issue with the file I attached.

private static async Task Main(string[] args) {
    var filePath = "HR7Message.txt";

    ISchema<HL7V26Entity> schema = Schema.Factory.CreateHL7<HL7V26Entity>(cfg => cfg.AddFromNamespaceContaining<HL7Version26>());
    IEntityParser<HL7V26Entity> parser = Parser.Factory.CreateHL7(schema);

    using var stream = File.OpenRead(filePath);
    StreamText text = await new StreamTextReader(stream).Text;
    var parse = await parser.ParseStream(text, new TextSpan(0, text.Length)); // <-- problem is here
}

dominiqueplante pushed a commit to dominiqueplante/Machete that referenced this issue Jul 30, 2020
dominiqueplante commented Jul 30, 2020

Hi @bryanmatteson

I created a branch with your message and added a performance benchmark.

See this branch.

I am running a 2.2 GHz i7 2016 MacBook Pro; here are the performance benchmarks I got:

// * Detailed results *
StreamingParserBenchmarks.StreamingParserBenchmark: Job-KKEHVM(Runtime=Core, InvocationCount=5, LaunchCount=1, RunStrategy=Throughput, TargetCount=5, UnrollFactor=1, WarmupCount=5)
Runtime = .NET Core 2.1.13 (CoreCLR 4.6.28008.01, CoreFX 4.6.28008.01), 64bit RyuJIT; GC = Concurrent Workstation
Mean = 284.0624 ms, StdErr = 0.6272 ms (0.22%); N = 5, StdDev = 1.4024 ms
Min = 282.4000 ms, Q1 = 282.9314 ms, Median = 284.0312 ms, Q3 = 285.2089 ms, Max = 286.2371 ms
IQR = 2.2775 ms, LowerFence = 279.5152 ms, UpperFence = 288.6251 ms
ConfidenceInterval = [278.6631 ms; 289.4617 ms] (CI 99.9%), Margin = 5.3993 ms (1.90% of Mean)
Skewness = 0.4, Kurtosis = 1.56, MValue = 2
-------------------- Histogram --------------------
[281.682 ms ; 286.955 ms) | @@@@@
---------------------------------------------------

Total time: 00:00:26 (26.77 sec)

// * Summary *

BenchmarkDotNet=v0.10.14, OS=macOS 10.14.2 (18C54) [Darwin 18.2.0]
Intel Core i7-4770HQ CPU 2.20GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.201
  [Host]     : .NET Core 2.1.13 (CoreCLR 4.6.28008.01, CoreFX 4.6.28008.01), 64bit RyuJIT
  Job-KKEHVM : .NET Core 2.1.13 (CoreCLR 4.6.28008.01, CoreFX 4.6.28008.01), 64bit RyuJIT

Runtime=Core  InvocationCount=5  LaunchCount=1  
RunStrategy=Throughput  TargetCount=5  UnrollFactor=1  
WarmupCount=5  

                   Method |     Mean |    Error |   StdDev |      Min |      Max |     Gen 0 |    Gen 1 |    Gen 2 | Allocated |
------------------------- |---------:|---------:|---------:|---------:|---------:|----------:|---------:|---------:|----------:|
 StreamingParserBenchmark | 284.1 ms | 5.399 ms | 1.402 ms | 282.4 ms | 286.2 ms | 1200.0000 | 800.0000 | 400.0000 |  15.46 MB |

// * Legends *
  Mean      : Arithmetic mean of all measurements
  Error     : Half of 99.9% confidence interval
  StdDev    : Standard deviation of all measurements
  Min       : Minimum
  Max       : Maximum
  Gen 0     : GC Generation 0 collects per 1k Operations
  Gen 1     : GC Generation 1 collects per 1k Operations
  Gen 2     : GC Generation 2 collects per 1k Operations
  Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
  1 ms      : 1 Millisecond (0.001 sec)

// * Diagnostic Output - MemoryDiagnoser *


// ***** BenchmarkRunner: End *****
// * Artifacts cleanup *

bryanmatteson (Author) commented

Right, this is the workaround. In your benchmark, you load the entire stream into memory first, then use a StringReader and finally a TextReaderStreamTextReader. That works because the parser never needs to load another chunk from the stream: everything was already loaded into memory by your call to File.ReadAllText(_largeFilePath).

If you change the benchmark to use the StreamTextReader directly, like this:

[Benchmark]
public async Task StreamingParserBenchmarkLargeFileDemo() {
    Console.WriteLine("Starting streaming parser benchmark run");

    using (var stream = File.OpenRead(_largeFilePath)) {
        //Console.WriteLine("About to parse stream");
        StreamText text = await new StreamTextReader(stream).Text;
        ParseResult<HL7Entity> result = await _hl7Parser.ParseStream(text, new TextSpan(0, text.Length));
        ...

then I think you'll run into the bug I'm seeing. So while not using the StreamTextReader is a potential solution, there still exists a bug in the stream parsing logic for really long lines.
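One possible fix direction, sketched in Python (hypothetical; `find_line_end` and the chunk sizes here are illustrative, not Machete's API): the cursor can remember how far it has already searched for the line terminator, so each refilled chunk is scanned once instead of the whole line being re-scanned from the start.

```python
def find_line_end(chunks):
    """Scan for '\\n' across chunks, resuming where the last scan stopped."""
    text = ""
    searched = 0                        # bytes already checked for '\n'
    for chunk in chunks:                # each iteration models one refill
        text += chunk
        idx = text.find("\n", searched) # resume instead of restarting at 0
        if idx != -1:
            return idx
        searched = len(text)            # everything so far has been checked
    return -1

# A long OBX line spanning several chunks: each chunk is scanned exactly once.
chunks = ["A" * 4096, "B" * 4096, "C" * 10 + "\n" + "rest"]
assert find_line_end(chunks) == 4096 + 4096 + 10
```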
