Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Performance] Improve performance for date field parsing #8361

Closed
mgodwan opened this issue Jun 29, 2023 · 8 comments · Fixed by #11465 · May be fixed by #11675
Closed

[Performance] Improve performance for date field parsing #8361

mgodwan opened this issue Jun 29, 2023 · 8 comments · Fixed by #11465 · May be fixed by #11675
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing v2.12.0 Issues and PRs related to version 2.12.0 v3.0.0 Issues and PRs related to version 3.0.0

Comments

@mgodwan
Copy link
Member

mgodwan commented Jun 29, 2023

Is your feature request related to a problem? Please describe.
OpenSearch relies on java.time.format.DateTimeFormatter for parsing various date time formats. This provides flexibility to support a plethora of use cases with multiple date-time formats.
For most of the use cases, users generally have common date-time formats for which JDK formatters can be slow. We can utilize the knowledge of underlying format and provide better latencies/throughput for document/query parsing with code customized for the underlying format. A lot of logging libraries (e.g. log4j) provide common formats for datetime which we can see to support to start with.

Describe the solution you'd like
Faster parsing alternatives for known formats, which can yield better times for overall indexing.

Describe alternatives you've considered

A barebones POC code for format yyyy-MM-dd HH:mm:ss: mgodwan@4345a75

The JMH micro-benchmarks run with this format using JDK Pattern formatter vs custom formatter show following results:

Benchmark Mode Cnt Score Error Units
DocumentParsingBenchmark.baseline avgt 9 381.958 +/-11.090 ns/op
DocumentParsingBenchmark.candidate avgt 9 21.262 +/-2.543 ns/op
@mgodwan mgodwan added enhancement Enhancement or improvement to existing feature or request untriaged labels Jun 29, 2023
@mgodwan
Copy link
Member Author

mgodwan commented Jun 29, 2023

I'm working on running the same with nyc_taxis to see how these microbenchmark results translate to actual workloads.

@Rishikesh1159 Rishikesh1159 added the Indexing Indexing, Bulk Indexing and anything related to indexing label Jul 5, 2023
@CaptainDredge
Copy link
Contributor

Working on creating a new valid datetime format for opensearch datetime field mapping which will internally use fast datetime parser implementation

@mgodwan
Copy link
Member Author

mgodwan commented Nov 7, 2023

@CaptainDredge Could you share the performance measurements on this?

@CaptainDredge
Copy link
Contributor

Updating results for nyc_taxis and so dataset for the new fast parser implementation

Server configuration:

Server: 1 node r6g.2xlarge, 32G Heap, 64G RAM with 3 master node r6g.2xlarge, 32G Heap
Client: c6g.8xlarge, 16 client instances

Command:

opensearch-benchmark execute_test --pipeline=benchmark-only --target-hosts <HOST> --workload nyc_taxis --offline --results-format csv --results-file ~/nyc_default_format_2xlarge_1 --client-options timeout:60,use_ssl:false,verify_certs:false

NYC_TAXIS
Throughput improvement: 3-4%
Average Latency Improvement: 2-3%

nyc_dataset_benchmark

CPU Profile for default date format:
nyc_cpu_profile_Default

CPU profile for new fastparser implementation
nyc_cpu_profile_new_format

Macro CPU trend (2% improvement):
left: baseline(default), right: candidate
cpu_comparison_benchmark_nyc

JVM trend
jvm comparison benchmark_nyc

SO Dataset
Throughput improvement: 9-10%
Average Latency Improvement: 8%

so_dataset_benchmarks

CPU Profile:
Screenshot 2023-11-10 at 3 50 41 PM

Macro CPU trend
cpu_comparison_benchmark_so

@CaptainDredge
Copy link
Contributor

Documenting possible approaches for integrating new fast parser implementation with existing code abstractions

Overview:
First let's understand the functionality of few existing classes involved in datetime field parsing:

DateFieldMapper: This class is used to build field mapping for date type field. It uses forPattern function of DateFormatter interface to create a date formatter which is used to parse the date field

DateFormatter : This is an interface used for all types of DateFormatters although in its current state its tightly linked with its only implementation JavaDateFormatter

JavaDateFormatter: its a concrete implementation of DateFormatter responsible of intitailisation of fomatters and printer and abstract all the logic for combining individual DateTimeFormatter formatters along with the logic of order of parsers application and handle zone, locale etc.

DateTimeFormatter: This is an internal java.time package class which can't be extended and provides the functionality for parsing datetime strings along with handling all the cases for chronology, zone, locale etc.

DateFormatters: This class defines all the valid parsers and Formatters in opensearch. It uses mainly DateTimeFormatterBuilder class to intialize formatters depending upon which date format it supports

Problem:
DateFormatter interface is coupled with JavaDateFormatter class which in turn is coupled with DateTimeFormatter for the parsers.

Possible design approaches:

  1. If we want to add our own implementation of parsing logic then either we need to create our own FastParseFormatter class by extending JavaDateFormatter and override the combine and parsing logic in a complex way to support both existing JavaDateTimeFormatters as well as new FastDateFormatters which contains the new fast parsing logic. Also, it needs to adhere to user provided order of parser application on the date field. This will not be a clean implementation as well as require some good logic to maintain ordering
  2. Create a separate FastDateFormatter adhering to DateFormatter interface which will support only the new fast date format and cannot be combined with existing java datetime formats. Thus, this is the least preffered approach.
  3. Create a composition abstraction say CustomDateTimeFormatter over DateTimeFormatter class and override relevant functions like parse for each concrete implementation. Initialise the new fast parsers in DateFormatters class in a way similar to existing parsers. This will allow us to easily fallback to DateTimeFormatter functionality which our custom implementation cannot support. It will also ensure changes as close to low level parsing logic as possible. Also, it ensures we won't need to change higher level abstractions like JavaDateFormatter much except depending upon CustomDateTimeFormatter instead of DateTimeFormatter class

cc: @mgodwan

@backslasht
Copy link
Contributor

Thanks @CaptainDredge for the performance numbers, they look promising. I see nyc_taxis is having 3-4% improvement where as so have 9-10% improvement in throughput. Is the improvement based on the number of timestamp fields in the document?

On the approaches, I like option 3 as it provides easy fall back option without much changes.

@CaptainDredge
Copy link
Contributor

CaptainDredge commented Dec 5, 2023

Microbenchmark comparison with other implementations mentioned in #11177 for rfc3339 parser implementation in PR #11465

Benchmark                                                             (dateString)  Mode  Cnt     Score    Error  Units
DocumentParsingBenchmark.benchITUParser                   2023-01-01T23:38:34.000Z  avgt   30    52.980 ±  0.281  ns/op
DocumentParsingBenchmark.benchITUParser                   1970-01-01T00:16:12.675Z  avgt   30    52.476 ±  0.118  ns/op
DocumentParsingBenchmark.benchITUParser                   5050-01-01T12:02:01.123Z  avgt   30    52.559 ±  0.195  ns/op
DocumentParsingBenchmark.benchRFC3339Parser               2023-01-01T23:38:34.000Z  avgt   30    66.321 ±  0.424  ns/op
DocumentParsingBenchmark.benchRFC3339Parser               1970-01-01T00:16:12.675Z  avgt   30    68.199 ±  1.341  ns/op
DocumentParsingBenchmark.benchRFC3339Parser               5050-01-01T12:02:01.123Z  avgt   30    66.213 ±  0.311  ns/op
DocumentParsingBenchmark.charDateFormatter                2023-01-01T23:38:34.000Z  avgt   30    57.363 ±  0.173  ns/op
DocumentParsingBenchmark.charDateFormatter                1970-01-01T00:16:12.675Z  avgt   30    57.319 ±  0.152  ns/op
DocumentParsingBenchmark.charDateFormatter                5050-01-01T12:02:01.123Z  avgt   30    57.316 ±  0.091  ns/op
DocumentParsingBenchmark.isoOffsetDateFormatter           2023-01-01T23:38:34.000Z  avgt   30    54.976 ±  0.179  ns/op
DocumentParsingBenchmark.isoOffsetDateFormatter           1970-01-01T00:16:12.675Z  avgt   30    55.540 ±  0.582  ns/op
DocumentParsingBenchmark.isoOffsetDateFormatter           5050-01-01T12:02:01.123Z  avgt   30    59.384 ±  4.382  ns/op
DocumentParsingBenchmark.strictDateOptionalTimeFormatter  2023-01-01T23:38:34.000Z  avgt   30  2732.462 ± 10.043  ns/op
DocumentParsingBenchmark.strictDateOptionalTimeFormatter  1970-01-01T00:16:12.675Z  avgt   30  2794.871 ±  8.151  ns/op
DocumentParsingBenchmark.strictDateOptionalTimeFormatter  5050-01-01T12:02:01.123Z  avgt   30  2779.538 ± 13.909  ns/op

Legend:
benchITUParser -> ITU library for datetime parsing
benchRFC3339Parser -> Implementation in #11465
charDateFormatter -> Implementation in #11177 Repo
isoOffsetDateFormatter -> FastDateFormatter POCed at the start of this issue
strictDateOptionalTimeFormatter -> OS strict_date_optional_time formatter

Overall a small overhead in #11465 implementation is due to usage of java.text.ParsePosition abstraction instead of raw numerical offsets which was required to conform with doParse function of JavaDateTimeFormatter

cc: @mgodwan , @backslasht

@reta reta added v3.0.0 Issues and PRs related to version 3.0.0 v2.12.0 Issues and PRs related to version 2.12.0 labels Jan 11, 2024
@kiranprakash154
Copy link
Contributor

Hi, are we on track for this to be released in 2.12 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing v2.12.0 Issues and PRs related to version 2.12.0 v3.0.0 Issues and PRs related to version 3.0.0
Projects
None yet
8 participants
@reta @backslasht @kiranprakash154 @CaptainDredge @anasalkouz @Rishikesh1159 @mgodwan and others