Editorial: Simplify parsing UTC offset strings #1917

ptomato · 2021-11-13T00:35:04Z

I noticed that in a few places we parse a UTC offset string into a nanosecond offset, only to turn it back into an offset string later. This can be simplified with an editorial change.

While I was working on this, I changed the UTC offset parsing algorithms to use ParseText like @gibson042 did in #1907.

codecov · 2021-11-13T00:36:04Z

Codecov Report

Merging #1917 (58e30a6) into main (a16fcfa) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #1917   +/-   ##
=======================================
  Coverage   94.91%   94.91%           
=======================================
  Files          19       19           
  Lines       10990    10990           
  Branches     1596     1596           
=======================================
  Hits        10431    10431           
  Misses        541      541           
  Partials       18       18

Flag	Coverage Δ
test262	`79.42% <ø> (ø)`
tests	`89.65% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a16fcfa...58e30a6. Read the comment docs.

gibson042

These changes appear to be improvements, but I have several followup questions.

gibson042 · 2021-12-07T01:33:18Z

spec/abstractops.html

        1. Throw a *RangeError* exception.
-      1. Let _z_, _sign_, _hours_, _minutes_, _seconds_, _fraction_ and _name_ be the parts of _isoString_ produced respectively by the |UTCDesignator|, |TimeZoneUTCOffsetSign|, |TimeZoneUTCOffsetHour|, |TimeZoneUTCOffsetMinute|, |TimeZoneUTCOffsetSecond|, |TimeZoneUTCOffsetFractionalPart|, and |TimeZoneIANAName| productions, or *undefined* if not present.
-      1. If _z_ is not *undefined*, then
+      1. Let each of _z_, _offsetString_, and _name_ be the source text matched by the |UTCDesignator|, |TimeZoneNumericUTCOffset|, and |TimeZoneIANAName| Parse Node enclosed by _parseResult_, or an empty sequence of code points if not present.


Suggested change

1. Let each of _z_, _offsetString_, and _name_ be the source text matched by the |UTCDesignator|, |TimeZoneNumericUTCOffset|, and |TimeZoneIANAName| Parse Node enclosed by _parseResult_, or an empty sequence of code points if not present.

1. Let each of _z_, _offsetString_, and _name_ be the source text matched by the respective |UTCDesignator|, |TimeZoneNumericUTCOffset|, and |TimeZoneIANAName| Parse Node enclosed by _parseResult_, or an empty sequence of code points if not present.

Also, I finally dug into this part of the grammar and found some potential issues:

"Z" and "z" are valid expansions of TimeZoneIANAName (relevant for e.g. Temporal.TimeZone.from).

"Etc/GMT+hour" strings are not matched by TimeZoneIANAName while "Etc/GMT-hour" strings are, despite both being in the IANA Time Zone Database.

TimeZoneBracketedAnnotation matches "Etc/GMT±0digit" strings, which are not in the IANA Time Zone Database (the current range is actually asymmetric—Etc/GMT+1 to Etc/GMT+12 and Etc/GMT-1 to Etc/GMT-14 plus Etc/GMT±0, but I don't see a problem with requiring Temporal implementations to support everything up to Etc/GMT±23).

The result of the preceding two points is ambiguous parsing for "Etc/GMT±0digit" strings from TimeZoneBracketedName. Their special-case treatment should probably be either removed or moved into TimeZoneIANAName with its existing TimeZoneIANANameTail production prohibited to match e.g. TimeZoneGMTName.

There also seems to be ambiguous parsing for "…±offset[±offset]" strings from TimeZone (matched by both TimeZoneOffsetRequired and TimeZoneNameRequired), although I'm not 100% certain that's a problem.

The whole tree is so deep and some productions are so long that it's quite difficult to comprehend. I think it could be improved by condensing productions, in some cases by parameterization like replacing TimeZoneNumericUTCOffset with TimeZoneUTCOffset[+onlyNumeric] and ASCIISign with Sign[+onlyASCII] and in others with early errors like "if the source text matched by {TimeZoneIANANameComponent,CalendarNameComponent,FractionalPart} contains more than <N> code points".

What is the reason for distinguishing TimeZoneNumericUTCOffset and TimeZoneUTCOffsetName and their subproductions, all of which seem to match exactly the same input?

Relatedly, how does this code handle an expansion of TimeZone that includes both TimeZoneNumericUTCOffset and TimeZoneUTCOffsetName (the former through TimeZoneUTCOffset and the latter through TimeZoneBracketedAnnotation)? It seems to discard the TimeZoneUTCOffsetName data.

Thanks for the detailed review, I'm just going through to make sure that these things are captured in issues. I think they are out of scope for this PR.

"Z" and "z" are valid expansions of TimeZoneIANAName (relevant for e.g. Temporal.TimeZone.from).

I don't think this is a problem — Temporal.TimeZone.from('Z') will fail because there's no such IANA named time zone, just like Temporal.TimeZone.from('Y'). Ditto for Temporal.ZonedDateTime.from('2022-01-11T19:43Z[Z]').

"Etc/GMT+hour" strings are not matched by TimeZoneIANAName while "Etc/GMT-hour" strings are, despite both being in the IANA Time Zone Database.

TimeZoneBracketedAnnotation matches "Etc/GMT±0_digit_" strings, which are not in the IANA Time Zone Database (the current range is actually asymmetric—Etc/GMT+1 to Etc/GMT+12 and Etc/GMT-1 to Etc/GMT-14 plus Etc/GMT±0, but I don't see a problem with requiring Temporal implementations to support everything up to Etc/GMT±23).

The result of the preceding two points is ambiguous parsing for "Etc/GMT±0_digit_" strings from TimeZoneBracketedName. Their special-case treatment should probably be either removed or moved into TimeZoneIANAName with its existing TimeZoneIANANameTail production prohibited to match e.g. TimeZoneGMTName.

The above points are covered in #1993, I think.

There also seems to be ambiguous parsing for "…±offset[±offset]" strings from TimeZone (matched by both TimeZoneOffsetRequired and TimeZoneNameRequired), although I'm not 100% certain that's a problem.

I don't really think this is a problem either, but for the sake of caution we could fix it by rewriting TimeZone:

TimeZone :

TimeZoneUTCOffset TimeZoneBracketedAnnotation(opt)

TimeZoneBracketedAnnotation

The whole tree is so deep and some productions are so long that it's quite difficult to comprehend. I think it could be improved by condensing productions, in some cases by parameterization like replacing TimeZoneNumericUTCOffset with TimeZoneUTCOffset[+onlyNumeric] and ASCIISign with Sign[+onlyASCII] and in others with early errors like "if the source text matched by {TimeZoneIANANameComponent,CalendarNameComponent,FractionalPart} contains more than code points".

I think this is covered in #1984.

What is the reason for distinguishing TimeZoneNumericUTCOffset and TimeZoneUTCOffsetName and their subproductions, all of which seem to match exactly the same input?

This is to avoid ambiguity in strings like 1976-11-18T15:23:30+01:00[+01:00] where TimeZoneNumericUTCOffset matches the first +01:00 and TimeZoneUTCOffsetName matches the second +01:00. Otherwise the code using these productions in ParseZonedDateTimeString can't refer unambiguously to the correct parse node.

Relatedly, how does this code handle an expansion of TimeZone that includes both TimeZoneNumericUTCOffset and TimeZoneUTCOffsetName (the former through TimeZoneUTCOffset and the latter through TimeZoneBracketedAnnotation)? It seems to discard the TimeZoneUTCOffsetName data.

It depends on which from() function is parsing the string. Off the top of my head — in the case of Instant, the bracketed annotation is discarded. In the case of ZonedDateTime and TimeZone, the numeric offset is discarded (in ZonedDateTime, only after verifying that they match). In other cases, both are discarded.

Otherwise the code using these productions in ParseZonedDateTimeString can't refer unambiguously to the correct parse node.

That should be possible without introducing distinct nonterminals, see ECMA-262 The if Statement for an example ("the first Statement" vs. "the second Statement").

Including if either of them are optional? (so "first" and "second" are not always the case)

Yes, cf. https://tc39.es/ecma262/multipage/ecmascript-language-statements-and-declarations.html#sec-runtime-semantics-forloopevaluation

I've captured this as #2006 to address later. Mind if I go on and merge this?

This fixes one instance of _timeZoneResult_.[[Offset]] which was not a slot on the Record returned by that operation. It should be [[OffsetString]]. Use the name "offset string" more consistently when referring to offset strings, rather than "offset", to avoid confusion with the offset option or offset in nanoseconds.

There is no need to parse the offset string and format it again, only to pass it to CreateTemporalTimeZone which parses it again. Also use the ParseText parsing machinery instead of the handwavy "satisfies the syntax of" language.

…meZoneString We don't need to canonicalize the offset string in this operation, because all callers pass the resulting offset string to ParseTimeZoneOffsetString. Also use the ParseText parsing machinery instead of the handwavy "satisfies the syntax of" language.

I don't think this is a problem, but this removes the possible ambiguity where both TimeZoneOffsetRequired and TimeZoneNameRequired could match the same string.

…rithmetic These operations cannot fail because they values they use as input already come from valid durations (sometimes negated).

tc39/proposal-temporal#1917 tc39/proposal-temporal#1953 Bug: v8:11544 Change-Id: I667980e312248ccbaf826d4e3104fb1ddabef890 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/3721464 Commit-Queue: Frank Tang <ftang@chromium.org> Reviewed-by: Shu-yu Guo <syg@chromium.org> Cr-Commit-Position: refs/heads/main@{#81453}

ptomato added the spec-text Specification text involved label Nov 13, 2021

ptomato requested review from Ms2ger, gibson042 and sarahghp November 13, 2021 00:35

gibson042 reviewed Dec 7, 2021

View reviewed changes

ptomato force-pushed the editorial branch from 8d61602 to c0a3824 Compare January 11, 2022 20:07

ptomato requested a review from gibson042 January 11, 2022 20:07

ptomato mentioned this pull request Jan 18, 2022

Normative: Fix sign in PlainYearMonth.subtract #2002

Merged

ptomato added 5 commits January 18, 2022 14:15

Editorial: Avoid ambiguity in TimeZone production

49b4459

I don't think this is a problem, but this removes the possible ambiguity where both TimeZoneOffsetRequired and TimeZoneNameRequired could match the same string.

Editorial: Make CreateTemporalDuration infallible in PlainYearMonth a…

58e30a6

…rithmetic These operations cannot fail because they values they use as input already come from valid durations (sometimes negated).

ptomato force-pushed the editorial branch from 8d1685d to 58e30a6 Compare January 18, 2022 22:20

Ms2ger approved these changes Jan 19, 2022

View reviewed changes

ptomato merged commit 05793a8 into main Jan 19, 2022

ptomato deleted the editorial branch January 19, 2022 21:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Editorial: Simplify parsing UTC offset strings #1917

Editorial: Simplify parsing UTC offset strings #1917

ptomato commented Nov 13, 2021

codecov bot commented Nov 13, 2021 •

edited

Loading

gibson042 left a comment

gibson042 Dec 7, 2021

ptomato Jan 11, 2022

gibson042 Jan 12, 2022 •

edited

Loading

ptomato Jan 12, 2022

gibson042 Jan 18, 2022

ptomato Jan 18, 2022

	1. Let each of _z_, _offsetString_, and _name_ be the source text matched by the \|UTCDesignator\|, \|TimeZoneNumericUTCOffset\|, and \|TimeZoneIANAName\| Parse Node enclosed by _parseResult_, or an empty sequence of code points if not present.
	1. Let each of _z_, _offsetString_, and _name_ be the source text matched by the respective \|UTCDesignator\|, \|TimeZoneNumericUTCOffset\|, and \|TimeZoneIANAName\| Parse Node enclosed by _parseResult_, or an empty sequence of code points if not present.

Editorial: Simplify parsing UTC offset strings #1917

Editorial: Simplify parsing UTC offset strings #1917

Conversation

ptomato commented Nov 13, 2021

codecov bot commented Nov 13, 2021 • edited Loading

Codecov Report

gibson042 left a comment

Choose a reason for hiding this comment

gibson042 Dec 7, 2021

Choose a reason for hiding this comment

ptomato Jan 11, 2022

Choose a reason for hiding this comment

gibson042 Jan 12, 2022 • edited Loading

Choose a reason for hiding this comment

ptomato Jan 12, 2022

Choose a reason for hiding this comment

gibson042 Jan 18, 2022

Choose a reason for hiding this comment

ptomato Jan 18, 2022

Choose a reason for hiding this comment

codecov bot commented Nov 13, 2021 •

edited

Loading

gibson042 Jan 12, 2022 •

edited

Loading