Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-40891: [JS] Store Dates as TimestampMillisecond #40892

Merged
merged 4 commits into from
Apr 3, 2024

Conversation

domoritz
Copy link
Member

@domoritz domoritz commented Mar 29, 2024

Fixes #40891

Tested with

const date = new Date("2023-03-29T12:34:56Z");
console.log("original", date)

console.log("=> vec")
const vec = arrow.vectorFromArray([date])
console.log(vec.toArray())
console.log(vec.toJSON())
console.log(vec.type)
console.log(vec.get(0))

console.log("=> vec2")
const vec2 = arrow.vectorFromArray([date], new arrow.DateMillisecond)
console.log(vec2.toArray())
console.log(vec.toJSON())
console.log(vec2.type)
console.log(vec2.get(0))

console.log("=> table")
const table = arrow.tableFromJSON([{ date }])
console.log(table.toArray())
console.log(table.schema.fields[0].type)
console.log(table.getChildAt(0)?.get(0))

console.log("=> table2")
const table2 = arrow.tableFromIPC(arrow.tableToIPC(table));
console.log(table2.toArray())
console.log(table2.schema.fields[0].type)
console.log(table2.getChildAt(0)?.get(0))

console.log("=> table3")
const table3 = new arrow.Table({ dates: vec2 })
console.log(table3.toArray())
console.log(table3.schema.fields[0].type)
console.log(table3.getChildAt(0)?.get(0))
=> table
[
  {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
TimestampMillisecond {
  typeId: 10,
  unit: 1,
  timezone: undefined,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Timestamp",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
=> table2
[
  {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
Timestamp_ {
  typeId: 10,
  unit: 1,
  timezone: null,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
=> table3
[
  {"dates": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
DateMillisecond {
  typeId: 8,
  unit: 1,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Date",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z

js/src/type.ts Outdated
@@ -406,7 +406,7 @@ type Timestamps = Type.Timestamp | Type.TimestampSecond | Type.TimestampMillisec
/** @ignore */
interface Timestamp_<T extends Timestamps = Timestamps> extends DataType<T> {
TArray: Int32Array;
TValue: number;
TValue: number | Date;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't the TValue type be overridden just for the TimestampMillisecond class below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted to number again.

Copy link
Contributor

@trxcllnt trxcllnt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to change the type from number to Date? IIRC it's on users to convert numbers to the corresponding timezone, so returning number | Date makes the type-safe way to do that more difficult for users.

If we always return numbers, users can convert to Dates (if necessary) themselves. IIRC constructing date instances is relatively expensive esp. if using custom locales, so users should opt-in to that behavior if they need it?

@domoritz
Copy link
Member Author

domoritz commented Apr 1, 2024

We do construct dates for DateMillisecond already so I figured it would make sense to do the same for TimestampMillisecond.

IIRC constructing date instances is relatively expensive esp. if using custom locales, so users should opt-in to that behavior if they need it?

I did not measure it but wit surprises me a bit since I expected it to just be a lightweight wrapper around timestamps. I'd be okay with just retuning numbers to be fast but also to introduce an iterator for vectors that returns convenient types in JS that may be slower but easier to work with (in a follow up PR).

Creating a million Dates takes about 5ms in bun on my machine (M1 Mac). I node, 40ms!

Getting 1M Dates from a TimestampMillisecond vector instead of timestamps takes 26 instead of 15ms (in bun on my machine). In node 54ms instead of 14ms.

So yeah, much slower in node.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 2, 2024
@domoritz domoritz added awaiting committer review Awaiting committer review and removed awaiting changes Awaiting changes labels Apr 2, 2024
@domoritz domoritz added this to the 16.0.0 milestone Apr 3, 2024
@domoritz domoritz merged commit 2caec86 into apache:main Apr 3, 2024
10 checks passed
@domoritz domoritz deleted the dom/timestamps branch April 3, 2024 22:32
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 2caec86.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

domoritz added a commit that referenced this pull request Apr 16, 2024
Merge after #40892.

This pull request also changes Dates to return timestamps instead of
Date instances (similar to Timestamps and for the same reason.

* GitHub Issue: #40959
domoritz added a commit to domoritz/arrow that referenced this pull request Apr 17, 2024
Merge after apache#40892.

This pull request also changes Dates to return timestamps instead of
Date instances (similar to Timestamps and for the same reason.

* GitHub Issue: apache#40959
raulcd pushed a commit that referenced this pull request Apr 29, 2024
Merge after #40892.

This pull request also changes Dates to return timestamps instead of
Date instances (similar to Timestamps and for the same reason.

* GitHub Issue: #40959
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
Fixes apache#40891

Tested with 

```ts
const date = new Date("2023-03-29T12:34:56Z");
console.log("original", date)

console.log("=> vec")
const vec = arrow.vectorFromArray([date])
console.log(vec.toArray())
console.log(vec.toJSON())
console.log(vec.type)
console.log(vec.get(0))

console.log("=> vec2")
const vec2 = arrow.vectorFromArray([date], new arrow.DateMillisecond)
console.log(vec2.toArray())
console.log(vec.toJSON())
console.log(vec2.type)
console.log(vec2.get(0))

console.log("=> table")
const table = arrow.tableFromJSON([{ date }])
console.log(table.toArray())
console.log(table.schema.fields[0].type)
console.log(table.getChildAt(0)?.get(0))

console.log("=> table2")
const table2 = arrow.tableFromIPC(arrow.tableToIPC(table));
console.log(table2.toArray())
console.log(table2.schema.fields[0].type)
console.log(table2.getChildAt(0)?.get(0))

console.log("=> table3")
const table3 = new arrow.Table({ dates: vec2 })
console.log(table3.toArray())
console.log(table3.schema.fields[0].type)
console.log(table3.getChildAt(0)?.get(0))
```

```
=> table
[
  {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
TimestampMillisecond {
  typeId: 10,
  unit: 1,
  timezone: undefined,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Timestamp",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
=> table2
[
  {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
Timestamp_ {
  typeId: 10,
  unit: 1,
  timezone: null,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
=> table3
[
  {"dates": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
DateMillisecond {
  typeId: 8,
  unit: 1,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Date",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
```
* GitHub Issue: apache#40891
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
Merge after apache#40892.

This pull request also changes Dates to return timestamps instead of
Date instances (similar to Timestamps and for the same reason.

* GitHub Issue: apache#40959
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
Fixes apache#40891

Tested with 

```ts
const date = new Date("2023-03-29T12:34:56Z");
console.log("original", date)

console.log("=> vec")
const vec = arrow.vectorFromArray([date])
console.log(vec.toArray())
console.log(vec.toJSON())
console.log(vec.type)
console.log(vec.get(0))

console.log("=> vec2")
const vec2 = arrow.vectorFromArray([date], new arrow.DateMillisecond)
console.log(vec2.toArray())
console.log(vec.toJSON())
console.log(vec2.type)
console.log(vec2.get(0))

console.log("=> table")
const table = arrow.tableFromJSON([{ date }])
console.log(table.toArray())
console.log(table.schema.fields[0].type)
console.log(table.getChildAt(0)?.get(0))

console.log("=> table2")
const table2 = arrow.tableFromIPC(arrow.tableToIPC(table));
console.log(table2.toArray())
console.log(table2.schema.fields[0].type)
console.log(table2.getChildAt(0)?.get(0))

console.log("=> table3")
const table3 = new arrow.Table({ dates: vec2 })
console.log(table3.toArray())
console.log(table3.schema.fields[0].type)
console.log(table3.getChildAt(0)?.get(0))
```

```
=> table
[
  {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
TimestampMillisecond {
  typeId: 10,
  unit: 1,
  timezone: undefined,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Timestamp",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
=> table2
[
  {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
Timestamp_ {
  typeId: 10,
  unit: 1,
  timezone: null,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
=> table3
[
  {"dates": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
DateMillisecond {
  typeId: 8,
  unit: 1,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Date",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
```
* GitHub Issue: apache#40891
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
Merge after apache#40892.

This pull request also changes Dates to return timestamps instead of
Date instances (similar to Timestamps and for the same reason.

* GitHub Issue: apache#40959
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
Fixes apache#40891

Tested with 

```ts
const date = new Date("2023-03-29T12:34:56Z");
console.log("original", date)

console.log("=> vec")
const vec = arrow.vectorFromArray([date])
console.log(vec.toArray())
console.log(vec.toJSON())
console.log(vec.type)
console.log(vec.get(0))

console.log("=> vec2")
const vec2 = arrow.vectorFromArray([date], new arrow.DateMillisecond)
console.log(vec2.toArray())
console.log(vec.toJSON())
console.log(vec2.type)
console.log(vec2.get(0))

console.log("=> table")
const table = arrow.tableFromJSON([{ date }])
console.log(table.toArray())
console.log(table.schema.fields[0].type)
console.log(table.getChildAt(0)?.get(0))

console.log("=> table2")
const table2 = arrow.tableFromIPC(arrow.tableToIPC(table));
console.log(table2.toArray())
console.log(table2.schema.fields[0].type)
console.log(table2.getChildAt(0)?.get(0))

console.log("=> table3")
const table3 = new arrow.Table({ dates: vec2 })
console.log(table3.toArray())
console.log(table3.schema.fields[0].type)
console.log(table3.getChildAt(0)?.get(0))
```

```
=> table
[
  {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
TimestampMillisecond {
  typeId: 10,
  unit: 1,
  timezone: undefined,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Timestamp",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
=> table2
[
  {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
Timestamp_ {
  typeId: 10,
  unit: 1,
  timezone: null,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
=> table3
[
  {"dates": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
DateMillisecond {
  typeId: 8,
  unit: 1,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Date",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
```
* GitHub Issue: apache#40891
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
Merge after apache#40892.

This pull request also changes Dates to return timestamps instead of
Date instances (similar to Timestamps and for the same reason.

* GitHub Issue: apache#40959
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
Fixes apache#40891

Tested with 

```ts
const date = new Date("2023-03-29T12:34:56Z");
console.log("original", date)

console.log("=> vec")
const vec = arrow.vectorFromArray([date])
console.log(vec.toArray())
console.log(vec.toJSON())
console.log(vec.type)
console.log(vec.get(0))

console.log("=> vec2")
const vec2 = arrow.vectorFromArray([date], new arrow.DateMillisecond)
console.log(vec2.toArray())
console.log(vec.toJSON())
console.log(vec2.type)
console.log(vec2.get(0))

console.log("=> table")
const table = arrow.tableFromJSON([{ date }])
console.log(table.toArray())
console.log(table.schema.fields[0].type)
console.log(table.getChildAt(0)?.get(0))

console.log("=> table2")
const table2 = arrow.tableFromIPC(arrow.tableToIPC(table));
console.log(table2.toArray())
console.log(table2.schema.fields[0].type)
console.log(table2.getChildAt(0)?.get(0))

console.log("=> table3")
const table3 = new arrow.Table({ dates: vec2 })
console.log(table3.toArray())
console.log(table3.schema.fields[0].type)
console.log(table3.getChildAt(0)?.get(0))
```

```
=> table
[
  {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
TimestampMillisecond {
  typeId: 10,
  unit: 1,
  timezone: undefined,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Timestamp",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
=> table2
[
  {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
Timestamp_ {
  typeId: 10,
  unit: 1,
  timezone: null,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
=> table3
[
  {"dates": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
DateMillisecond {
  typeId: 8,
  unit: 1,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Date",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
```
* GitHub Issue: apache#40891
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
Merge after apache#40892.

This pull request also changes Dates to return timestamps instead of
Date instances (similar to Timestamps and for the same reason.

* GitHub Issue: apache#40959
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
Fixes apache#40891

Tested with 

```ts
const date = new Date("2023-03-29T12:34:56Z");
console.log("original", date)

console.log("=> vec")
const vec = arrow.vectorFromArray([date])
console.log(vec.toArray())
console.log(vec.toJSON())
console.log(vec.type)
console.log(vec.get(0))

console.log("=> vec2")
const vec2 = arrow.vectorFromArray([date], new arrow.DateMillisecond)
console.log(vec2.toArray())
console.log(vec.toJSON())
console.log(vec2.type)
console.log(vec2.get(0))

console.log("=> table")
const table = arrow.tableFromJSON([{ date }])
console.log(table.toArray())
console.log(table.schema.fields[0].type)
console.log(table.getChildAt(0)?.get(0))

console.log("=> table2")
const table2 = arrow.tableFromIPC(arrow.tableToIPC(table));
console.log(table2.toArray())
console.log(table2.schema.fields[0].type)
console.log(table2.getChildAt(0)?.get(0))

console.log("=> table3")
const table3 = new arrow.Table({ dates: vec2 })
console.log(table3.toArray())
console.log(table3.schema.fields[0].type)
console.log(table3.getChildAt(0)?.get(0))
```

```
=> table
[
  {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
TimestampMillisecond {
  typeId: 10,
  unit: 1,
  timezone: undefined,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Timestamp",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
=> table2
[
  {"date": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
Timestamp_ {
  typeId: 10,
  unit: 1,
  timezone: null,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
=> table3
[
  {"dates": Wed Mar 29 2023 08:34:56 GMT-0400 (Eastern Daylight Time)}
]
DateMillisecond {
  typeId: 8,
  unit: 1,
  toString: [Function: toString],
  ArrayType: [class Int32Array],
  [Symbol(Symbol.toStringTag)]: "Date",
  children: null,
  OffsetArrayType: [class Int32Array],
}
2023-03-29T12:34:56.000Z
```
* GitHub Issue: apache#40891
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
Merge after apache#40892.

This pull request also changes Dates to return timestamps instead of
Date instances (similar to Timestamps and for the same reason.

* GitHub Issue: apache#40959
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[JS] Dates should be stored in TimestampMillisecond
2 participants