Should there be `namespace.col` for filtering?

## TL;DR

Add `namespace.col`, to allow for lazy columns / lazy column reductions to work seamlessly in lazy implementations

Then:
- use `col = df.get_column_by_name('a')` if you want an eager column 'a', on which you can call reductions and comparisons (like `col.mean()`,  `col > 0`). Not necessarily available for all implementations (e.g. Ibis)
- use `namespace.col('a')` if you want a (possibly lazy) expression which you intend to use for filtering, like `df.get_rows_by_mask((namespace.col('a') - namespace.col('a').mean()) > 0)`

## Longer proposal

I'm a bit concerned about some operations just not being available in lazy cases in general, such as
```python
mask = df2.get_column_by_name('a') > 0
df1.get_rows_by_mask(mask)
```

~Polars~ Ibis at least doesn't allow this, filtering by a mask is only allowed if the mask is a column which comes from the same dataframe that's being filtered on - so
```python
mask = df1.get_column_by_name('a') > 0
df1.get_rows_by_mask(mask)
```
would be allowed instead. ~Such comparisons are intentionally not implemented in polars, see https://github.com/pola-rs/polars/issues/10274 for a related issue (`__eq__` between two different dataframes)~

I'd like to suggest, therefore, that we introduce `namespace.col`, so that the above could become
```python
mask = namespace.col('a') > 0
df1.get_rows_by_mask(mask)
```

And then there doesn't need to be any documentation like "this won't work for some implementations if the mask was created from a different dataframe"

This would also feel more familiar to developers:
- `pyspark` already has this:
  ```python
  from pyspark.sql.functions import col
  df.filter(col("state") == "OH")
  ```
- `polars` already has this (same as above, but `from polars import col`)
- `pandas` _may_ add it (see EuroScipy 2023 lightning talk by Joris)
- ibis has `_`
- siuba also has `_`

EDIT: have tried clarifying the issue. Have also replaced Polars with Ibis in the example as the Consortium seems more interested in that

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Should there be `namespace.col` for filtering? #229

TL;DR

Longer proposal

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Should there be namespace.col for filtering? #229

Description

TL;DR

Longer proposal

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Should there be `namespace.col` for filtering? #229