Separate lazy and eager #13

Merged: 121 commits, Oct 5, 2023
Commits (121)
56d9372
wip
MarcoGorelli Aug 30, 2023
0069a60
wip, kinda working!!!
MarcoGorelli Aug 30, 2023
95ee9cd
kinda working???
MarcoGorelli Aug 30, 2023
58ba2ff
getting there!
MarcoGorelli Aug 30, 2023
f13cbb7
and / or test working
MarcoGorelli Aug 30, 2023
a137245
164 passing...wow. getting there
MarcoGorelli Aug 30, 2023
6525f09
got dataframe tests passing!!!
MarcoGorelli Aug 30, 2023
ffb70e0
wip
MarcoGorelli Aug 30, 2023
61c465d
wip
MarcoGorelli Aug 30, 2023
fd81692
getting closer...but not there yet
MarcoGorelli Aug 30, 2023
884a7af
wip
MarcoGorelli Aug 30, 2023
c233934
sort out sorted indices
MarcoGorelli Aug 31, 2023
fda1ccd
statistics test
MarcoGorelli Aug 31, 2023
a6b6b2f
toplevel any_rowwise and all_rowwise
MarcoGorelli Aug 31, 2023
8011978
getting there?
MarcoGorelli Aug 31, 2023
8bf3f6b
getting there!
MarcoGorelli Aug 31, 2023
a4c189e
clean up a bit
MarcoGorelli Aug 31, 2023
9feee3f
this is wonderful!
MarcoGorelli Aug 31, 2023
8b4fa31
clean up
MarcoGorelli Aug 31, 2023
c2c5de1
wip sorting out typing
MarcoGorelli Aug 31, 2023
0796ee4
simplify
MarcoGorelli Aug 31, 2023
304b906
simplify
MarcoGorelli Aug 31, 2023
0acf111
sort out dataframe tests
MarcoGorelli Aug 31, 2023
f4bfa72
namespace.sorted_indices
MarcoGorelli Aug 31, 2023
709c2cb
rename
MarcoGorelli Aug 31, 2023
29348ad
getting there
MarcoGorelli Aug 31, 2023
540c0f0
restore get_rows test
MarcoGorelli Aug 31, 2023
33c4ac2
some docs
MarcoGorelli Aug 31, 2023
5846e04
remove dead code
MarcoGorelli Aug 31, 2023
eafab2b
correct sort!
MarcoGorelli Sep 4, 2023
11a23f5
restore polarscolumn
MarcoGorelli Sep 6, 2023
ac84f10
Merge branch 'main' into remove-column-add-expression
MarcoGorelli Sep 7, 2023
baab078
wip
MarcoGorelli Sep 7, 2023
caef9ad
fixup column.column test
MarcoGorelli Sep 7, 2023
d8a2da7
add eagerframes
MarcoGorelli Sep 7, 2023
43988b9
tests passing again
MarcoGorelli Sep 7, 2023
cdbe7c3
remove to_array_object from lazy
MarcoGorelli Sep 7, 2023
9f5f0d4
clean up a bit
MarcoGorelli Sep 7, 2023
bd60221
restore more stuff
MarcoGorelli Sep 7, 2023
dc7da51
restore expression.sorted_indices
MarcoGorelli Sep 7, 2023
16fb172
restore more tests
MarcoGorelli Sep 7, 2023
a55abb2
coverage
MarcoGorelli Sep 7, 2023
e55862f
just write column operations in terms of expressions, easy!
MarcoGorelli Sep 7, 2023
57241ff
wip try rewriting more, maybe revert this
MarcoGorelli Sep 7, 2023
043f3f3
Revert "wip try rewriting more, maybe revert this"
MarcoGorelli Sep 8, 2023
21c20b7
simplify
MarcoGorelli Sep 8, 2023
1ef1e2d
reuse as much as possible
MarcoGorelli Sep 8, 2023
c93215f
restore is_dtype test
MarcoGorelli Sep 8, 2023
34d9216
restore get_rows
MarcoGorelli Sep 8, 2023
a9f32ad
further fixup names
MarcoGorelli Sep 8, 2023
421660e
start adding root_names and output_name to pandas
MarcoGorelli Sep 9, 2023
eeafb8e
extra assert for output names
MarcoGorelli Sep 9, 2023
28656ce
add todo note about merging root names
MarcoGorelli Sep 9, 2023
1294696
combine root names
MarcoGorelli Sep 9, 2023
b48ee7e
test expression reductions
MarcoGorelli Sep 9, 2023
c495825
notimplementederror on unique_indices
MarcoGorelli Sep 9, 2023
baeed42
validate col name earlier
MarcoGorelli Sep 9, 2023
c0babc9
test and fix column filter and column __gt__
MarcoGorelli Sep 9, 2023
d6e85cc
increase coverage
MarcoGorelli Sep 9, 2023
3357ce1
keep increasing coverage
MarcoGorelli Sep 9, 2023
009a8fa
coverage
MarcoGorelli Sep 9, 2023
1397a16
come on...
MarcoGorelli Sep 9, 2023
e91de78
more coverage, almost 90...
MarcoGorelli Sep 9, 2023
7e5839d
89.57
MarcoGorelli Sep 9, 2023
14b0cde
89.57
MarcoGorelli Sep 9, 2023
231ccae
89.57
MarcoGorelli Sep 9, 2023
bf5ef25
lazy groupby
MarcoGorelli Sep 9, 2023
aa0f8e5
more coverage ftw
MarcoGorelli Sep 9, 2023
e002166
almost 96
MarcoGorelli Sep 10, 2023
d75f09b
100% on the pandas side!
MarcoGorelli Sep 10, 2023
3f70afb
maybe_lazify -> relax
MarcoGorelli Sep 10, 2023
fdf7bee
support multiple expressions in select
MarcoGorelli Sep 10, 2023
9694938
improve some types
MarcoGorelli Sep 10, 2023
82f065c
wip (tests failing)
MarcoGorelli Sep 10, 2023
a6df346
start updating sigs
MarcoGorelli Sep 11, 2023
541cc81
wip
MarcoGorelli Sep 11, 2023
1ba24e8
get tests passing again!
MarcoGorelli Sep 11, 2023
01eb9f8
handle inserting multiple columns
MarcoGorelli Sep 12, 2023
b9b1fdc
handle inserting multiple columns
MarcoGorelli Sep 12, 2023
8740ff5
get column rename
MarcoGorelli Sep 12, 2023
c6e919f
start to sort out broadcasting
MarcoGorelli Sep 13, 2023
2d096fc
pandas expr reductions
MarcoGorelli Sep 13, 2023
6e849e4
test broadcasting
MarcoGorelli Sep 13, 2023
4e34f4a
broadcast in insert_columns
MarcoGorelli Sep 13, 2023
ea5c8af
insert broadcasting
MarcoGorelli Sep 13, 2023
eb74f33
broadcasting can work everywhere!
MarcoGorelli Sep 13, 2023
eff4d00
add missing things
MarcoGorelli Sep 14, 2023
4057939
len test
MarcoGorelli Sep 14, 2023
0b686d2
expression slice rows
MarcoGorelli Sep 14, 2023
ee6574d
clean up
MarcoGorelli Sep 14, 2023
b81400a
simplify
MarcoGorelli Sep 14, 2023
1ed112b
start fixing up
MarcoGorelli Oct 3, 2023
c19269c
assign update
MarcoGorelli Oct 3, 2023
48e4d1f
update -> assign
MarcoGorelli Oct 3, 2023
487dad4
very important group_by renaming
MarcoGorelli Oct 3, 2023
2f2a88a
get column by name
MarcoGorelli Oct 3, 2023
9b4a8ad
tests passing again :tada:
MarcoGorelli Oct 3, 2023
dec2a87
ok, tests passing once again
MarcoGorelli Oct 3, 2023
0aac9b4
permissivecolumn.len
MarcoGorelli Oct 3, 2023
2fbf610
permissivecolumn.len
MarcoGorelli Oct 3, 2023
cefae93
add pandas date
MarcoGorelli Oct 3, 2023
c91cc18
polars date type
MarcoGorelli Oct 3, 2023
76de547
add datetime dtypes!
MarcoGorelli Oct 3, 2023
90fdda0
add duration dtype too!
MarcoGorelli Oct 3, 2023
b6dca0f
coverage
MarcoGorelli Oct 5, 2023
c057318
coverage...
MarcoGorelli Oct 5, 2023
7170408
any all expr
MarcoGorelli Oct 5, 2023
ee367b6
coverage
MarcoGorelli Oct 5, 2023
c96d9a7
oh yeah
MarcoGorelli Oct 5, 2023
0f62633
comparisons tests
MarcoGorelli Oct 5, 2023
986201d
over 97
MarcoGorelli Oct 5, 2023
430be61
closer to 98
MarcoGorelli Oct 5, 2023
4ce6d3f
98.3
MarcoGorelli Oct 5, 2023
64c50a7
come on come on come on
MarcoGorelli Oct 5, 2023
9281711
come on come on come on
MarcoGorelli Oct 5, 2023
c3cd4c7
ok, 100, did it
MarcoGorelli Oct 5, 2023
76284fa
wip
MarcoGorelli Oct 5, 2023
73bff7f
remove 3.8;
MarcoGorelli Oct 5, 2023
bb56439
remove 3.8;
MarcoGorelli Oct 5, 2023
bb7d8fc
coverage
MarcoGorelli Oct 5, 2023
801cbf1
coverage
MarcoGorelli Oct 5, 2023
2 changes: 1 addition & 1 deletion .github/workflows/tox.yml
@@ -9,7 +9,7 @@ jobs:
tox:
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11"]
os: [windows-latest, ubuntu-latest]

runs-on: ${{ matrix.os }}
223 changes: 169 additions & 54 deletions dataframe_api_compat/pandas_standard/__init__.py
@@ -1,6 +1,8 @@
from __future__ import annotations

import re
from typing import Any
from typing import Literal
from typing import TYPE_CHECKING

import pandas as pd
@@ -10,12 +12,24 @@
from dataframe_api_compat.pandas_standard.pandas_standard import PandasColumn
from dataframe_api_compat.pandas_standard.pandas_standard import PandasDataFrame
from dataframe_api_compat.pandas_standard.pandas_standard import PandasGroupBy
from dataframe_api_compat.pandas_standard.pandas_standard import PandasPermissiveColumn
from dataframe_api_compat.pandas_standard.pandas_standard import PandasPermissiveFrame

if TYPE_CHECKING:
from collections.abc import Sequence
from dataframe_api._types import DType


def col(name: str) -> PandasColumn:
return PandasColumn(
root_names=[name], output_name=name, base_call=lambda df: df.loc[:, name]
)


Column = PandasColumn
PermissiveColumn = PandasPermissiveColumn
DataFrame = PandasDataFrame
PermissiveFrame = PandasPermissiveFrame
GroupBy = PandasGroupBy
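
This hunk is the core of the lazy side: col builds a column expression that records which column it needs (root_names / output_name) plus a deferred callable, without touching any data. A minimal, construction-only sketch; how the expression is later consumed (e.g. by select/filter, per the commit history) is an assumption, not something shown in this hunk:

    import dataframe_api_compat.pandas_standard as ns

    # A lazy expression: no data is read here. It stores root_names=["a"],
    # output_name="a" and a deferred callable equivalent to df -> df.loc[:, "a"].
    expr = ns.col("a")

    # Evaluation is deferred until the expression is handed to a standard-compliant
    # DataFrame (e.g. via select/filter), which is when the stored callable runs.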


@@ -67,35 +81,82 @@ class String:
...


DTYPE_MAP = {
"int64": Int64(),
"Int64": Int64(),
"int32": Int32(),
"Int32": Int32(),
"int16": Int16(),
"Int16": Int16(),
"int8": Int8(),
"Int8": Int8(),
"uint64": UInt64(),
"UInt64": UInt64(),
"uint32": UInt32(),
"UInt32": UInt32(),
"uint16": UInt16(),
"UInt16": UInt16(),
"uint8": UInt8(),
"UInt8": UInt8(),
"float64": Float64(),
"Float64": Float64(),
"float32": Float32(),
"Float32": Float32(),
"bool": Bool(),
"boolean": Bool(),
"object": String(),
"string": String(),
}


def map_standard_dtype_to_pandas_dtype(dtype: Any) -> Any:
class Date:
...


class Datetime:
def __init__(self, time_unit, time_zone=None):
self.time_unit = time_unit
# todo validate time zone
self.time_zone = time_zone


class Duration:
def __init__(self, time_unit):
self.time_unit = time_unit


def map_pandas_dtype_to_standard_dtype(dtype: Any) -> DType:
if dtype == "int64":
return Int64()
if dtype == "Int64":
return Int64()
if dtype == "int32":
return Int32()
if dtype == "Int32":
return Int32()
if dtype == "int16":
return Int16()
if dtype == "Int16":
return Int16()
if dtype == "int8":
return Int8()
if dtype == "Int8":
return Int8()
if dtype == "uint64":
return UInt64()
if dtype == "UInt64":
return UInt64()
if dtype == "uint32":
return UInt32()
if dtype == "UInt32":
return UInt32()
if dtype == "uint16":
return UInt16()
if dtype == "UInt16":
return UInt16()
if dtype == "uint8":
return UInt8()
if dtype == "UInt8":
return UInt8()
if dtype == "float64":
return Float64()
if dtype == "Float64":
return Float64()
if dtype == "float32":
return Float32()
if dtype == "Float32":
return Float32()
if dtype == "bool":
# 'boolean' not yet covered, as the default dtype in pandas is still 'bool'
return Bool()
if dtype == "object":
return String()
if dtype == "string":
return String()
if dtype == "datetime64[s]":
return Date()
if dtype.startswith("datetime64["):
time_unit = re.search(r"datetime64\[(\w{1,2})", dtype).group(1)
return Datetime(time_unit)
if dtype.startswith("timedelta64["):
time_unit = re.search(r"timedelta64\[(\w{1,2})", dtype).group(1)
return Duration(time_unit)
raise AssertionError(f"Unsupported dtype! {dtype}")


def map_standard_dtype_to_pandas_dtype(dtype: DType) -> Any:
if isinstance(dtype, Int64):
return "int64"
if isinstance(dtype, Int32):
@@ -120,9 +181,26 @@ def map_standard_dtype_to_pandas_dtype(dtype: Any) -> Any:
return "bool"
if isinstance(dtype, String):
return "object"
if isinstance(dtype, Datetime):
if dtype.time_zone is not None: # pragma: no cover (todo)
return f"datetime64[{dtype.time_unit}, {dtype.time_zone}]"
return f"datetime64[{dtype.time_unit}]"
if isinstance(dtype, Duration):
return f"timedelta64[{dtype.time_unit}]"
raise AssertionError(f"Unknown dtype: {dtype}")


def convert_to_standard_compliant_column(
ser: pd.Series, api_version: str | None = None
) -> PandasDataFrame:
if api_version is None: # pragma: no cover
api_version = LATEST_API_VERSION
if ser.name is not None and not isinstance(ser.name, str):
raise ValueError(f"Expected column with string name, got: {ser.name}")
name = ser.name or ""
return PandasPermissiveColumn(ser.rename(name), api_version=api_version)


def convert_to_standard_compliant_dataframe(
df: pd.DataFrame, api_version: str | None = None
) -> PandasDataFrame:
@@ -131,13 +209,6 @@ def convert_to_standard_compliant_dataframe(
return PandasDataFrame(df, api_version=api_version)
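
The reworked converters above are the entry points from plain pandas objects: a Series becomes an eager PandasPermissiveColumn, a DataFrame becomes the lazy PandasDataFrame. A usage sketch, assuming the elided default handling fills in LATEST_API_VERSION when api_version is None:

    import pandas as pd
    import dataframe_api_compat.pandas_standard as ns

    ser = pd.Series([1, 2, 3], name="a")
    eager_col = ns.convert_to_standard_compliant_column(ser)              # PandasPermissiveColumn
    lazy_df = ns.convert_to_standard_compliant_dataframe(ser.to_frame())  # PandasDataFrame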


def convert_to_standard_compliant_column(
df: pd.Series[Any],
api_version: str | None = None,
) -> PandasColumn[Any]:
return PandasColumn(df, api_version=api_version or LATEST_API_VERSION)


def concat(dataframes: Sequence[PandasDataFrame]) -> PandasDataFrame:
dtypes = dataframes[0].dataframe.dtypes
dfs = []
@@ -164,16 +235,30 @@ def concat(dataframes: Sequence[PandasDataFrame]) -> PandasDataFrame:

def column_from_sequence(
sequence: Sequence[Any], *, dtype: Any, name: str, api_version: str | None = None
) -> PandasColumn[Any]:
) -> PandasPermissiveColumn[Any]:
ser = pd.Series(sequence, dtype=map_standard_dtype_to_pandas_dtype(dtype), name=name)
return PandasColumn(ser, api_version=LATEST_API_VERSION)
return PandasPermissiveColumn(ser, api_version=api_version or LATEST_API_VERSION)


def dataframe_from_dict(
data: dict[str, PandasPermissiveColumn[Any]], api_version: str | None = None
) -> PandasDataFrame:
for _, col in data.items():
if not isinstance(col, PandasPermissiveColumn): # pragma: no cover
raise TypeError(f"Expected PandasPermissiveColumn, got {type(col)}")
return PandasDataFrame(
pd.DataFrame(
{label: column.column.rename(label) for label, column in data.items()}
),
api_version=api_version or LATEST_API_VERSION,
)


def column_from_1d_array(
data: Any, *, dtype: Any, name: str | None = None, api_version: str | None = None
) -> PandasColumn[Any]: # pragma: no cover
) -> PandasPermissiveColumn[Any]: # pragma: no cover
ser = pd.Series(data, dtype=map_standard_dtype_to_pandas_dtype(dtype), name=name)
return PandasColumn(ser, api_version=api_version or LATEST_API_VERSION)
return PandasPermissiveColumn(ser, api_version=api_version or LATEST_API_VERSION)
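
On the eager side, columns are built as PandasPermissiveColumns and assembled into a frame with dataframe_from_dict. A short sketch using only the constructors shown in this hunk, assuming the dtype classes (Int64, String) are the module-level ones defined earlier in this file:

    import dataframe_api_compat.pandas_standard as ns

    age = ns.column_from_sequence([30, 40, 50], dtype=ns.Int64(), name="age")
    label = ns.column_from_sequence(["x", "y", "z"], dtype=ns.String(), name="label")

    # dataframe_from_dict requires PandasPermissiveColumn values and renames each
    # underlying Series to its dict key.
    df = ns.dataframe_from_dict({"age": age, "label": label})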


def dataframe_from_2d_array(
@@ -189,20 +274,6 @@ def dataframe_from_2d_array(
return PandasDataFrame(df, api_version=api_version or LATEST_API_VERSION)


def dataframe_from_dict(
data: dict[str, PandasColumn[Any]], api_version: str | None = None
) -> PandasDataFrame:
for _, col in data.items():
if not isinstance(col, PandasColumn): # pragma: no cover
raise TypeError(f"Expected PandasColumn, got {type(col)}")
return PandasDataFrame(
pd.DataFrame(
{label: column.column.rename(label) for label, column in data.items()}
),
api_version=api_version or LATEST_API_VERSION,
)


def is_null(value: Any) -> bool:
return value is null

@@ -223,3 +294,47 @@ def is_dtype(dtype: Any, kind: str | tuple[str, ...]) -> bool:
if _kind == "string":
dtypes.add(String)
return isinstance(dtype, tuple(dtypes))


def any_rowwise(*columns: str, skip_nulls: bool = True) -> PandasColumn:
# todo: accept expressions
def func(df):
return df.loc[:, list(columns) or df.columns.tolist()].any(axis=1)

return PandasColumn(root_names=list(columns), output_name="any", base_call=func)


def all_rowwise(*columns: str, skip_nulls: bool = True) -> PandasColumn:
def func(df: pd.DataFrame) -> pd.Series:
return df.loc[:, list(columns) or df.columns.tolist()].all(axis=1)

return PandasColumn(root_names=list(columns), output_name="all", base_call=func)


def sorted_indices(
*keys: str,
ascending: Sequence[bool] | bool = True,
nulls_position: Literal["first", "last"] = "last",
) -> Column:
def func(df: pd.DataFrame) -> pd.Series:
if ascending:
return (
df.loc[:, list(keys)]
.sort_values(list(keys))
.index.to_series()
.reset_index(drop=True)
)
return (
df.loc[:, list(keys)]
.sort_values(list(keys))
.index.to_series()[::-1]
.reset_index(drop=True)
)

return PandasColumn(root_names=list(keys), output_name="indices", base_call=func)


def unique_indices(
keys: str | list[str] | None = None, *, skip_nulls: bool = True
) -> Column:
raise NotImplementedError("namespace.unique_indices not implemented for pandas yet")
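
The row-wise helpers above are also lazy: each returns a PandasColumn expression whose inner func does the pandas work only when the expression is evaluated. What those inner functions compute, illustrated on a plain pandas frame (illustration only, outside the expression machinery):

    import pandas as pd

    df = pd.DataFrame({"a": [True, False, False], "b": [True, True, False]})

    # any_rowwise("a", "b"): row-wise OR across the named columns
    # (falling back to all columns when none are given).
    columns = ("a", "b")
    print(df.loc[:, list(columns) or df.columns.tolist()].any(axis=1).tolist())
    # [True, True, False]

    # sorted_indices("a"): the original row positions, ordered so that taking rows
    # in this order yields the frame sorted by the key(s).
    keys = ["a"]
    indices = df.loc[:, keys].sort_values(keys).index.to_series().reset_index(drop=True)
    print(indices.tolist())  # [1, 2, 0]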