[Performance] Improve performance for date field parsing #8361

mgodwan · 2023-06-29T18:32:50Z

Is your feature request related to a problem? Please describe.
OpenSearch relies on java.time.format.DateTimeFormatter for parsing various date time formats. This provides flexibility to support a plethora of use cases with multiple date-time formats.
For most of the use cases, users generally have common date-time formats for which JDK formatters can be slow. We can utilize the knowledge of underlying format and provide better latencies/throughput for document/query parsing with code customized for the underlying format. A lot of logging libraries (e.g. log4j) provide common formats for datetime which we can see to support to start with.

Describe the solution you'd like
Faster parsing alternatives for known formats, which can yield better times for overall indexing.

Describe alternatives you've considered

A barebones POC code for format yyyy-MM-dd HH:mm:ss: mgodwan@4345a75

The JMH micro-benchmarks run with this format using JDK Pattern formatter vs custom formatter show following results:

Benchmark	Mode	Cnt	Score	Error	Units
DocumentParsingBenchmark.baseline	avgt	9	381.958	+/-11.090	ns/op
DocumentParsingBenchmark.candidate	avgt	9	21.262	+/-2.543	ns/op

The text was updated successfully, but these errors were encountered:

mgodwan · 2023-06-29T18:33:27Z

I'm working on running the same with nyc_taxis to see how these microbenchmark results translate to actual workloads.

CaptainDredge · 2023-10-09T03:14:56Z

Working on creating a new valid datetime format for opensearch datetime field mapping which will internally use fast datetime parser implementation

mgodwan · 2023-11-07T09:12:51Z

@CaptainDredge Could you share the performance measurements on this?

CaptainDredge · 2023-11-10T10:29:06Z

Updating results for nyc_taxis and so dataset for the new fast parser implementation

Server configuration:

Server: 1 node r6g.2xlarge, 32G Heap, 64G RAM with 3 master node r6g.2xlarge, 32G Heap
Client: c6g.8xlarge, 16 client instances

Command:

opensearch-benchmark execute_test --pipeline=benchmark-only --target-hosts <HOST> --workload nyc_taxis --offline --results-format csv --results-file ~/nyc_default_format_2xlarge_1 --client-options timeout:60,use_ssl:false,verify_certs:false

NYC_TAXIS
Throughput improvement: 3-4%
Average Latency Improvement: 2-3%

CPU Profile for default date format:

CPU profile for new fastparser implementation

Macro CPU trend (2% improvement):
left: baseline(default), right: candidate

JVM trend

SO Dataset
Throughput improvement: 9-10%
Average Latency Improvement: 8%

CPU Profile:

Macro CPU trend

CaptainDredge · 2023-11-16T12:35:25Z

Documenting possible approaches for integrating new fast parser implementation with existing code abstractions

Overview:
First let's understand the functionality of few existing classes involved in datetime field parsing:

DateFieldMapper: This class is used to build field mapping for date type field. It uses forPattern function of DateFormatter interface to create a date formatter which is used to parse the date field

DateFormatter : This is an interface used for all types of DateFormatters although in its current state its tightly linked with its only implementation JavaDateFormatter

JavaDateFormatter: its a concrete implementation of DateFormatter responsible of intitailisation of fomatters and printer and abstract all the logic for combining individual DateTimeFormatter formatters along with the logic of order of parsers application and handle zone, locale etc.

DateTimeFormatter: This is an internal java.time package class which can't be extended and provides the functionality for parsing datetime strings along with handling all the cases for chronology, zone, locale etc.

DateFormatters: This class defines all the valid parsers and Formatters in opensearch. It uses mainly DateTimeFormatterBuilder class to intialize formatters depending upon which date format it supports

Problem:
DateFormatter interface is coupled with JavaDateFormatter class which in turn is coupled with DateTimeFormatter for the parsers.

Possible design approaches:

If we want to add our own implementation of parsing logic then either we need to create our own FastParseFormatter class by extending JavaDateFormatter and override the combine and parsing logic in a complex way to support both existing JavaDateTimeFormatters as well as new FastDateFormatters which contains the new fast parsing logic. Also, it needs to adhere to user provided order of parser application on the date field. This will not be a clean implementation as well as require some good logic to maintain ordering
Create a separate FastDateFormatter adhering to DateFormatter interface which will support only the new fast date format and cannot be combined with existing java datetime formats. Thus, this is the least preffered approach.
Create a composition abstraction say CustomDateTimeFormatter over DateTimeFormatter class and override relevant functions like parse for each concrete implementation. Initialise the new fast parsers in DateFormatters class in a way similar to existing parsers. This will allow us to easily fallback to DateTimeFormatter functionality which our custom implementation cannot support. It will also ensure changes as close to low level parsing logic as possible. Also, it ensures we won't need to change higher level abstractions like JavaDateFormatter much except depending upon CustomDateTimeFormatter instead of DateTimeFormatter class

cc: @mgodwan

backslasht · 2023-11-20T04:54:08Z

Thanks @CaptainDredge for the performance numbers, they look promising. I see nyc_taxis is having 3-4% improvement where as so have 9-10% improvement in throughput. Is the improvement based on the number of timestamp fields in the document?

On the approaches, I like option 3 as it provides easy fall back option without much changes.

CaptainDredge · 2023-12-05T04:58:50Z

Microbenchmark comparison with other implementations mentioned in #11177 for rfc3339 parser implementation in PR #11465

Benchmark                                                             (dateString)  Mode  Cnt     Score    Error  Units
DocumentParsingBenchmark.benchITUParser                   2023-01-01T23:38:34.000Z  avgt   30    52.980 ±  0.281  ns/op
DocumentParsingBenchmark.benchITUParser                   1970-01-01T00:16:12.675Z  avgt   30    52.476 ±  0.118  ns/op
DocumentParsingBenchmark.benchITUParser                   5050-01-01T12:02:01.123Z  avgt   30    52.559 ±  0.195  ns/op
DocumentParsingBenchmark.benchRFC3339Parser               2023-01-01T23:38:34.000Z  avgt   30    66.321 ±  0.424  ns/op
DocumentParsingBenchmark.benchRFC3339Parser               1970-01-01T00:16:12.675Z  avgt   30    68.199 ±  1.341  ns/op
DocumentParsingBenchmark.benchRFC3339Parser               5050-01-01T12:02:01.123Z  avgt   30    66.213 ±  0.311  ns/op
DocumentParsingBenchmark.charDateFormatter                2023-01-01T23:38:34.000Z  avgt   30    57.363 ±  0.173  ns/op
DocumentParsingBenchmark.charDateFormatter                1970-01-01T00:16:12.675Z  avgt   30    57.319 ±  0.152  ns/op
DocumentParsingBenchmark.charDateFormatter                5050-01-01T12:02:01.123Z  avgt   30    57.316 ±  0.091  ns/op
DocumentParsingBenchmark.isoOffsetDateFormatter           2023-01-01T23:38:34.000Z  avgt   30    54.976 ±  0.179  ns/op
DocumentParsingBenchmark.isoOffsetDateFormatter           1970-01-01T00:16:12.675Z  avgt   30    55.540 ±  0.582  ns/op
DocumentParsingBenchmark.isoOffsetDateFormatter           5050-01-01T12:02:01.123Z  avgt   30    59.384 ±  4.382  ns/op
DocumentParsingBenchmark.strictDateOptionalTimeFormatter  2023-01-01T23:38:34.000Z  avgt   30  2732.462 ± 10.043  ns/op
DocumentParsingBenchmark.strictDateOptionalTimeFormatter  1970-01-01T00:16:12.675Z  avgt   30  2794.871 ±  8.151  ns/op
DocumentParsingBenchmark.strictDateOptionalTimeFormatter  5050-01-01T12:02:01.123Z  avgt   30  2779.538 ± 13.909  ns/op

Legend:
benchITUParser -> ITU library for datetime parsing
benchRFC3339Parser -> Implementation in #11465
charDateFormatter -> Implementation in #11177 Repo
isoOffsetDateFormatter -> FastDateFormatter POCed at the start of this issue
strictDateOptionalTimeFormatter -> OS strict_date_optional_time formatter

Overall a small overhead in #11465 implementation is due to usage of java.text.ParsePosition abstraction instead of raw numerical offsets which was required to conform with doParse function of JavaDateTimeFormatter

cc: @mgodwan , @backslasht

kiranprakash154 · 2024-01-19T00:24:03Z

Hi, are we on track for this to be released in 2.12 ?

mgodwan added enhancement Enhancement or improvement to existing feature or request untriaged labels Jun 29, 2023

shwetathareja assigned mgodwan Jul 3, 2023

Rishikesh1159 added the Indexing Indexing, Bulk Indexing and anything related to indexing label Jul 5, 2023

anasalkouz removed the untriaged label Jul 11, 2023

mgodwan assigned CaptainDredge Oct 16, 2023

mgodwan mentioned this issue Nov 13, 2023

[BUG] Probably performance slowdown for indexing due to date string parsing #11177

Closed

CaptainDredge mentioned this issue Dec 5, 2023

Added New Fast DateTime parser implementation for RFC3339 Datetime format #11465

Merged

8 tasks

CaptainDredge mentioned this issue Dec 27, 2023

Added fast Local datetime parser for local date time #11675

Open

8 tasks

reta added v3.0.0 Issues and PRs related to version 3.0.0 v2.12.0 Issues and PRs related to version 2.12.0 labels Jan 11, 2024

reta closed this as completed in #11465 Jan 24, 2024

CaptainDredge mentioned this issue Jan 31, 2024

[Backport 2.x] Added New Fast DateTime parser implementation for RFC3339 Datetime format #12107

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Performance] Improve performance for date field parsing #8361

[Performance] Improve performance for date field parsing #8361

mgodwan commented Jun 29, 2023

mgodwan commented Jun 29, 2023

CaptainDredge commented Oct 9, 2023

mgodwan commented Nov 7, 2023

CaptainDredge commented Nov 10, 2023

CaptainDredge commented Nov 16, 2023

backslasht commented Nov 20, 2023

CaptainDredge commented Dec 5, 2023 •

edited

Loading

kiranprakash154 commented Jan 19, 2024

[Performance] Improve performance for date field parsing #8361

[Performance] Improve performance for date field parsing #8361

Comments

mgodwan commented Jun 29, 2023

mgodwan commented Jun 29, 2023

CaptainDredge commented Oct 9, 2023

mgodwan commented Nov 7, 2023

CaptainDredge commented Nov 10, 2023

CaptainDredge commented Nov 16, 2023

backslasht commented Nov 20, 2023

CaptainDredge commented Dec 5, 2023 • edited Loading

kiranprakash154 commented Jan 19, 2024

CaptainDredge commented Dec 5, 2023 •

edited

Loading