
Really poor performance with streaming parser #79

Open
bryanmatteson opened this issue Jul 29, 2020 · 4 comments
bryanmatteson commented Jul 29, 2020

Parsing an HL7 file with really long lines results in unusable performance. In most of the ORU_R01 messages I deal with, we have some OBX segments with embedded PDF files (base64 encoded). This results in lines in the PDF being millions of bytes long. So the problem is that in StreamTextCursor.ParseText it does a full parse after every time it loads a new chunk because it reaches the end of the data before it reaches the end of the line.
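To see why this hurts so much, here is a language-agnostic sketch (Python, not Machete's actual code; `naive_scan_cost` and `resumed_scan_cost` are illustrative names): if the cursor restarts its line scan after every chunk refill, the total bytes examined grow with the square of the number of chunks in the line.

```python
CHUNK = 4096

def naive_scan_cost(line_length, chunk=CHUNK):
    """Bytes examined if the parser restarts the line scan after each refill."""
    cost = 0
    loaded = 0
    while loaded < line_length:
        loaded = min(loaded + chunk, line_length)
        cost += loaded          # re-scans everything loaded so far
    return cost

def resumed_scan_cost(line_length, chunk=CHUNK):
    """Bytes examined if the scan resumes where it left off."""
    return line_length          # each byte is looked at exactly once

# A single 4 MB base64-encoded PDF line: restarting the scan costs
# hundreds of times more byte inspections than resuming it.
line = 4 * 1024 * 1024
assert naive_scan_cost(line) > 500 * resumed_scan_cost(line)
```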

Here is an HL7 attachment that can help illustrate the issue. It's parsed like this:

string filePath = ...;
using var stream = File.OpenRead(filePath);
StreamText text = await new StreamTextReader(stream).Text;
var parse = await parser.ParseStream(text, new TextSpan(0, text.Length));

HR7Message.txt

ahives (Contributor) commented Jul 29, 2020

Can you please add your query?

bryanmatteson (Author) commented Jul 29, 2020

The issue is actually separate from any query. The problem is in parser.ParseStream. Here's a full main function that can reproduce the issue with the file I attached.

private static async Task Main(string[] args) {
    var filePath = "HR7Message.txt";

    ISchema<HL7V26Entity> schema = Schema.Factory.CreateHL7<HL7V26Entity>(cfg => cfg.AddFromNamespaceContaining<HL7Version26>());
    IEntityParser<HL7V26Entity> parser = Parser.Factory.CreateHL7(schema);

    using var stream = File.OpenRead(filePath);
    StreamText text = await new StreamTextReader(stream).Text;
    var parse = await parser.ParseStream(text, new TextSpan(0, text.Length)); // <-- problem is here
}

dominiqueplante pushed a commit to dominiqueplante/Machete that referenced this issue Jul 30, 2020
dominiqueplante commented Jul 30, 2020

Hi @bryanmatteson

I created a branch with your message and added a performance benchmark.

See this branch.

I am running a 2.2 GHz i7 2016 MacBook Pro; here are the performance benchmarks I got:

// * Detailed results *
StreamingParserBenchmarks.StreamingParserBenchmark: Job-KKEHVM(Runtime=Core, InvocationCount=5, LaunchCount=1, RunStrategy=Throughput, TargetCount=5, UnrollFactor=1, WarmupCount=5)
Runtime = .NET Core 2.1.13 (CoreCLR 4.6.28008.01, CoreFX 4.6.28008.01), 64bit RyuJIT; GC = Concurrent Workstation
Mean = 284.0624 ms, StdErr = 0.6272 ms (0.22%); N = 5, StdDev = 1.4024 ms
Min = 282.4000 ms, Q1 = 282.9314 ms, Median = 284.0312 ms, Q3 = 285.2089 ms, Max = 286.2371 ms
IQR = 2.2775 ms, LowerFence = 279.5152 ms, UpperFence = 288.6251 ms
ConfidenceInterval = [278.6631 ms; 289.4617 ms] (CI 99.9%), Margin = 5.3993 ms (1.90% of Mean)
Skewness = 0.4, Kurtosis = 1.56, MValue = 2
-------------------- Histogram --------------------
[281.682 ms ; 286.955 ms) | @@@@@
---------------------------------------------------

Total time: 00:00:26 (26.77 sec)

// * Summary *

BenchmarkDotNet=v0.10.14, OS=macOS 10.14.2 (18C54) [Darwin 18.2.0]
Intel Core i7-4770HQ CPU 2.20GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.201
  [Host]     : .NET Core 2.1.13 (CoreCLR 4.6.28008.01, CoreFX 4.6.28008.01), 64bit RyuJIT
  Job-KKEHVM : .NET Core 2.1.13 (CoreCLR 4.6.28008.01, CoreFX 4.6.28008.01), 64bit RyuJIT

Runtime=Core  InvocationCount=5  LaunchCount=1  
RunStrategy=Throughput  TargetCount=5  UnrollFactor=1  
WarmupCount=5  

                   Method |     Mean |    Error |   StdDev |      Min |      Max |     Gen 0 |    Gen 1 |    Gen 2 | Allocated |
------------------------- |---------:|---------:|---------:|---------:|---------:|----------:|---------:|---------:|----------:|
 StreamingParserBenchmark | 284.1 ms | 5.399 ms | 1.402 ms | 282.4 ms | 286.2 ms | 1200.0000 | 800.0000 | 400.0000 |  15.46 MB |

// * Legends *
  Mean      : Arithmetic mean of all measurements
  Error     : Half of 99.9% confidence interval
  StdDev    : Standard deviation of all measurements
  Min       : Minimum
  Max       : Maximum
  Gen 0     : GC Generation 0 collects per 1k Operations
  Gen 1     : GC Generation 1 collects per 1k Operations
  Gen 2     : GC Generation 2 collects per 1k Operations
  Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
  1 ms      : 1 Millisecond (0.001 sec)

// * Diagnostic Output - MemoryDiagnoser *


// ***** BenchmarkRunner: End *****
// * Artifacts cleanup *

bryanmatteson (Author) commented

Right, this is the workaround. In your benchmark, you load the entire stream into memory first, then use a StringReader and finally a TextReaderStreamTextReader. That works because the parser never needs to load another chunk from the stream: everything was already loaded into memory by your call to File.ReadAllText(_largeFilePath).

If you change the benchmark to use the StreamTextReader directly, like this:

[Benchmark]
public async Task StreamingParserBenchmarkLargeFileDemo() {
    Console.WriteLine("Starting streaming parser benchmark run");

    using (var stream = File.OpenRead(_largeFilePath)) {
        //Console.WriteLine("About to parse stream");
        StreamText text = await new StreamTextReader(stream).Text;
        ParseResult<HL7Entity> result = await _hl7Parser.ParseStream(text, new TextSpan(0, text.Length));
        ...

then I think you'll run into the bug I'm seeing. So while not using the StreamTextReader is a potential solution, there still exists a bug in the stream parsing logic for really long lines.
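One possible fix direction, sketched in Python (hypothetical; `find_line_end` and the chunk sizes here are illustrative, not Machete's API): the cursor can remember how far it has already searched for the line terminator, so each refilled chunk is scanned once instead of the whole line being re-scanned from the start.

```python
def find_line_end(chunks):
    """Scan for '\\n' across chunks, resuming where the last scan stopped."""
    text = ""
    searched = 0                        # bytes already checked for '\n'
    for chunk in chunks:                # each iteration models one refill
        text += chunk
        idx = text.find("\n", searched) # resume instead of restarting at 0
        if idx != -1:
            return idx
        searched = len(text)            # everything so far has been checked
    return -1

# A long OBX line spanning several chunks: each chunk is scanned exactly once.
chunks = ["A" * 4096, "B" * 4096, "C" * 10 + "\n" + "rest"]
assert find_line_end(chunks) == 4096 + 4096 + 10
```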
