Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Support for fractional seconds in strptime() #20146

Open
asfimport opened this issue Mar 9, 2022 · 6 comments
Open

[C++] Support for fractional seconds in strptime() #20146

asfimport opened this issue Mar 9, 2022 · 6 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Mar 9, 2022

Currently, we can't parse "our own" string representation of a timestamp array with the timestamp parser strptime:

import datetime
import pyarrow as pa
import pyarrow.compute as pc

>>> pa.array([datetime.datetime(2022, 3, 5, 9)])
<pyarrow.lib.TimestampArray object at 0x7f00c1d53dc0>
[
  2022-03-05 09:00:00.000000
]

# trying to parse the above representation as string
>>> pc.strptime(["2022-03-05 09:00:00.000000"], format="%Y-%m-%d %H:%M:%S", unit="us")
...
ArrowInvalid: Failed to parse string: '2022-03-05 09:00:00.000000' as a scalar of type timestamp[us]

The reason for this is the fractional second part, so the following works:

>>> pc.strptime(["2022-03-05 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="us")
<pyarrow.lib.TimestampArray object at 0x7f00c1d6f940>
[
  2022-03-05 09:00:00.000000
]

Now, I think the reason that this fails is because strptime only supports parsing seconds as an integer (https://man7.org/linux/man-pages/man3/strptime.3.html).

But, it creates a strange situation where the timestamp parser cannot parse the representation we use for timestamps.

In addition, for CSV we have a custom ISO parser (used by default), so when parsing the strings while reading a CSV file, the same string with fractional seconds does work:

s = b"""a
2022-03-05 09:00:00.000000"""

from pyarrow import csv

>>> csv.read_csv(io.BytesIO(s))
pyarrow.Table
a: timestamp[ns]
----
a: [[2022-03-05 09:00:00.000000000]]

I realize that you can use the generic "cast" for doing this string parsing:

>>> pc.cast(["2022-03-05 09:00:00.000000"], pa.timestamp("us"))
<pyarrow.lib.TimestampArray object at 0x7f00c1d53d60>
[
  2022-03-05 09:00:00.000000
]

But this was not the first way I thought about (I think it is quite typical to first think of strptime, and it is confusing that that doesn't work; the error message is also not helpful)
cc @pitrou @rok

Reporter: Joris Van den Bossche / @jorisvandenbossche
Watchers: Rok Mihevc / @rok

Related issues:

Note: This issue was originally created as ARROW-15883. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
I agree this is not desired behaviour. Do we have a convention for %S?
I'm working on two other strptime issues right now (ARROW-15665, ARROW-14477) and can also check this while I'm "there".

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
The main problem is that there is no clear "standard" for strptime. Python for example has the "%f" field for fractional seconds(https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes). The R / lubridate parser seems to use "%OS" (https://rdrr.io/r/base/strptime.html, https://lubridate.tidyverse.org/reference/parse_date_time.html). But the C strptime does not support either of those (https://man7.org/linux/man-pages/man3/strptime.3.html; it uses "%OS", but apparently for something different?)

So as long as we use existing strptime implementation, I don't it is possible or easy to "fix" this.

One option I was thinking about is when the users specifies an ISO-like format string, that we could use our own fast ISO parser, instead of using strptime. But that would of course also make the support for fractional seconds dependent on the exact format you specified.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
Looking at date.h, it seems they also have a parser, and do support decimal seconds: https://howardhinnant.github.io/date/date.html#from_stream_formatting (and using "%NS" with N the width to parse), and this is (I suppose) consistent with what will be available in C++ 20 (https://en.cppreference.com/w/cpp/chrono/parse)

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
Following date.h seems like a good idea. +1 for %NS.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
I just closed two other (older) issues as duplicate of this one, but just noting here that both of them tried to use %f for fractional seconds. For example:

>>> pc.strptime("2015-01-09 00:00:00.000", format="%Y-%m-%d %H:%M:%S.%f", unit="ns")
...
ArrowInvalid: Failed to parse string: '2015-01-09 00:00:00.000' as a scalar of type timestamp[ns]

Given that we are probably not going for "%f" (based on the discussion above), it might still be useful to give a more informative error message, for example explicitly stating that "%f" is not supported.

@asfimport
Copy link
Collaborator Author

Carl Boettiger / @cboettig:
I agree about the lack of a standard, but note that arrow itself is inconsistent about conversions between dates and strings.  e.g. in https://issues.apache.org/jira/browse/ARROW-17905?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel I show an example where arrow coerces a timestamp to a string, choosing to keep OS level precision (us), which then encounters the problem shown here, where we don't have the syntax to parse it correctly.  (While I think my issue is essentially a duplicate of the python issue, it's a bit more opaque from the R side where arrow supports lubridate::as_datetime(), but the actual lubridate::as_datetime() is not impacted by this issue). 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants