Improve support for missing values for column api #101

Merged 34 commits on Apr 14, 2023

@ezmiller ezmiller commented Mar 12, 2023

Goal

Add basic support for missing values to the column API, paralleling that in TC's main dataset API.

Solution

This PR adds the following fns:

  • missing - returns the indexes of the missing values in a column
  • is-missing? - returns true or false depending on whether the value at the specified index position is missing
  • count-missing
  • replace-missing
  • drop-missing

The only genuinely new function here is count-missing, which is just a convenience. replace-missing and drop-missing simply reuse tech.v3.dataset's functions: they pack the column into a dataset, call those fns, and then extract the column.
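
As a rough illustration of how these fns might be used, here is a minimal sketch. It assumes the column API is required from `tablecloth.column.api` under the alias `tcc`, that `nil` entries are treated as missing when a column is built, and that `replace-missing` accepts a `:value` strategy as the dataset-level fn does. The helper `drop-missing-sketch` is hypothetical: it only mimics the pack/call/extract approach described above and is not the code in this PR.

```clojure
(require '[tablecloth.api :as tc]
         '[tablecloth.column.api :as tcc])

;; Assumption: nil entries become missing values when the column is built.
(def col (tcc/column [1 nil 3 nil 5]))

(tcc/missing col)                   ; indexes of the missing values
(tcc/is-missing? col 1)             ; is the value at index 1 missing?
(tcc/count-missing col)             ; how many values are missing
(tcc/drop-missing col)              ; column with the missing values removed
(tcc/replace-missing col :value 0)  ; column with the missing values filled in

;; Hypothetical helper showing the pack/call/extract idea:
(defn drop-missing-sketch [col]
  (-> (tc/dataset {:col col})  ; pack the column into a one-column dataset
      (tc/drop-missing)        ; reuse the dataset-level fn
      :col))                   ; extract the column again
```

Because each fn takes the column as its first argument, calls like these should also compose with `->`.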

@ezmiller ezmiller merged commit b9c5019 into ethan/column-api-dev-branch-1 Apr 14, 2023
@ezmiller ezmiller deleted the ethan/missing-support-columns branch April 14, 2023 13:45
ezmiller added a commit that referenced this pull request Apr 13, 2024
* Add namespace stub

* Add super naive column fn

* Add some simple column fns

* Add typeof function for column

* Save work on column exploration doc

* Upgrade to latest clay version

* Save scratch work in column.clj

* Polishing up existing column fns

* added some docstrings
* re-organized a little

* Move column ns into own domain tablecloth.column.api

* Add tests for `tablecloth.column.api/column`

* Add tests for `zeros` and `ones`

* Use api template to write public api

* Write tests against `tablecloth.column.api.column` ns

* Add column exploration html

* Add `typeof?` function to check datatype of column els

* Use buffer when creating zeros & ones columns

* Use `dtype` alias in ns

* Add comment to code snippet generating column api

* Fix comment syntax

* Use `tech.v3.datatype/const-reader` for `zeros` and `ones` function

* Update type interface to use type hierarchy in tablecloth.api.util (#76)

* Add ->general-types function

* Add a general type :logical

* Use type hierarchy in tablecloth.api.utils for `typeof` functions

* Add column dev branch to pr workflow

* Add tests for typeof

* Fix tests for typeof

* Return the concrete type from `typeof`

* Simplify `concrete-types` fn

* Optimize ->general-types by using static lookup

* Adjust fns listing types

* We decided that the default meaning of type points to the "concrete"
type, and not the general type.
* So `types` now returns the set of concrete types and `general-types`
returns the general types.

* Revert "Adjust fns listing types"

This reverts commit d93e34f.

* Fix `typeof` test to test for concrete types

* Reorganize `typeof?` tests

* Reword docstring for `typeof?` slightly

* Update column api template and add missing `typeof?`

* Add comment to `general-types-lookup`

* Improve `->general-types` docstring

* Add `general-types` fn that returns sets of general types

* Adjust util `types` fn to return concrete types

* Lift `tech.v3.datatype.functional` operations (#90)

* Add ->general-types function

* Add a general type :logical

* Use type hierarchy in tablecloth.api.utils for `typeof` functions

* Add column dev branch to pr workflow

* Add tests for typeof

* Fix tests for typeof

* Return the concrete type from `typeof`

* Simplify `concrete-types` fn

* Optimize ->general-types by using static lookup

* Adjust fns listing types

* We decided that the default meaning of type points to the "concrete"
type, and not the general type.
* So `types` now returns the set of concrete types and `general-types`
returns the general types.

* Revert "Adjust fns listing types"

This reverts commit d93e34f.

* Fix `typeof` test to test for concrete types

* Reorganize `typeof?` tests

* Reword docstring for `typeof?` slightly

* Update column api template and add missing `typeof?`

* Add comment to `general-types-lookup`

* Improve `->general-types` docstring

* Add `general-types` fn that returns sets of general types

* Adjust util `types` fn to return concrete types

* Save changes to column api.clj

* Save ongoing experiments with lifting

* Save ongoing work on lifting

* Adjust lift-ops-1 to handle any number of args with rest arg

* Working `rearrange-args` fn

* Save work actually writing lifted fns

* Saving first attempt to write operators

* Add `percentiles` test

* Adjust `rearrange-args` to take new-args in option map

* Unify two lift functions

* Add in docstrings when present

* Move lift utils into utils ns

* Rename lifting namespaces

* Lift some more fns

* Make exclusions for ns header helper an arg

* Add new operators and tests

* Add ops with lhs rhs arg pattern

* Lift '*

* Add require to operators ns for utils

* Update test to make it more complete

* Lift `equals

* Make test more accurate

* Reorganize tests

* Fix grammar

* Lift 'shift

* Uncomment 'or test

* Lift 'normalize op

* Lift 'magnitude

* Lifting bit manipulation ops

* lift ieee-remainder

* Lifting more functions

* Add excludes

* Lift a bunch of new functions

* Alphabetize some lists

* More alphabetization

* Clean up

* Instead of using `col` as arg, conform to using `x` and `y`

* Temporarily disable failing test fix in 7.000-beta23

* Disable the correct test

* Just some minor cleanup in op tests

* Some more cleanup/reorg in op tests

* Update generated operators namespace with switch from col -> x etc

* Lift 'descriptive-statistics

* Fix messed up test layout

* Lift 'quartiles

* Lift 'fill-range and a bunch of reduce operations

* Lift 'mean-fast 'sum-fast 'magnitude-squared

* Lift correlation fns

kendalls, pearsons, and spearmans

* Lift cumulative ops

* cleanup

* Bring column exploration doc up-to-date (#95)

* Upgrade to latest clay version

* Show using tablecloth.column.api.operators ns

* Cleanup whitespace

* Add method for subsetting (#96)

* Export tech.ml.dataset `select` fn for column api

* Update docstring exported to api

* Update column-exploration with basic illustration of select

* Add `slice`

* clean up tests a bit

* Improve `slice` docstring slightly

* Export `slice` to column api

* Add stuff about `slice` to column exploration doc

* Move accessing & subsetting section above basic ops

* Update column_exploration.html

* Update comment block

* Add iteration support by wrapping tech.v3.dataset.column/column-map (#97)

* Export tech.ml.dataset `select` fn for column api

* Update docstring exported to api

* Update column-exploration with basic illustration of select

* Add `slice`

* clean up tests a bit

* Improve `slice` docstring slightly

* Export `slice` to column api

* Add stuff about `slice` to column exploration doc

* Move accessing & subsetting section above basic ops

* Update column_exploration.html

* Update comment block

* Add column-map wrapper over tech.v3.dataset.column/column-mapping

* Accepts columns in the first position to support use with pipes
* If `col` is a vector of columns, then map-fn is run on all

* Fix arg name

* Clean up

* Add iteration to column exploration and reorganize

* Add column-map to column api_template

* Add example of using column-map with multiple columns

* Update column_exploration html doc

* Update column_exploration html doc

* Add sorting support for column (#99)

* Add rough version of `sort-column` with some tests

* Add basic docstring

* Add support for `:asc` and `:desc` to sort-column

* Add note to handle missing values

* Make slight improvement to sort-column docstring

* Improve support for missing values for column api (#101)

* Export tech.ml.dataset `select` fn for column api

* Update docstring exported to api

* Update column-exploration with basic illustration of select

* Add `slice`

* clean up tests a bit

* Improve `slice` docstring slightly

* Export `slice` to column api

* Add stuff about `slice` to column exploration doc

* Move accessing & subsetting section above basic ops

* Update column_exploration.html

* Update comment block

* Add column-map wrapper over tech.v3.dataset.column/column-mapping

* Accepts columns in the first position to support use with pipes
* If `col` is a vector of columns, then map-fn is run on all

* Fix arg name

* Clean up

* Add iteration to column exploration and reorganize

* Add column-map to column api_template

* Add example of using column-map with multiple columns

* Update column_exploration html doc

* Update column_exploration html doc

* Export tech.v3.dataset.column's missing fns

* Remove `set-missing`

I think this may be more of an internal fn

* Add `count-missing` function

* Add test for `sort-column` for missing values

* Activate test that will now pass due to tmd upgrade

* Add sort-column to api-template

* Add sort-column section to column_exploration doc

* Add more missing apidoc

* move fns to their own namespace to mirror main tc api
* add `drop-missing` and `replace-missing`

* Add details about missing api to column exploration

* Add an example of using count to column exploration

* Add a few simple tests for missing ns

* Fix docstrings

* Add proof of concept

* Consolidate tablecloth.column.api/operators args (#106)

* Consolidate ops args to x y z

* Fix lift op for comparison ops

* Update lift-op fn to handle multiple ar lookups

The case that required this was the comparison ops. We
want (> x y z) from (> lhs rhs) (> lhs mid rhs). We
can't universally map y to rhs because it would be
wrong for the 3-arity option.

* Lift column ops to the dataset level (#107)

* Readme: Replace `lein test` with `lein midje`

* Add proof of concept for lifting

* Clean up

* Fix magnitude arguments

* Fix typo breaking lift operation for `magnitude

* Save prototype working example that handles optional arguments

* Clean up

* Reorganize codegen utilities

* moved hopefully common utilities up into 'tablecloth.utils.codegen
* retooled those helpers in that ns to be a bit more accessible (WIP)

* Clean up

* Clean up

* Rejigger codegen for column ops to take just fn-sym arglists

* Try lifting all column ops to ds (no tests yet)

* Exclude ops that do not potentially return column

* Do not lift options that do not return columns

* Add docstrings for some codegen

Also regenerated operators to make sure tests pass.

* Add docstring to ds col ops

* version bump and small fix

* Modify ds-level lift op to also return fn that returns column

This is a breaking change for the column api lifting until I adapt
the lift-op to the changes made in the codegen where the argument
is supplied in data rather than within a fn.

* example added for replace-missing

* Add tests for ops that take inf number of cols

* Add tests for ops returning ds taking max of three cols

* Add tests for ops returning ds and taking two columns max

* Test for ops returning ds and max of one column

* Add more functions to test for ops taking one col

* Clean up

* Lifted ops taking one column and returning a scalar

* Lift functions taking two columns and returning a scalar

* Clean up

* Clean up

* bump to 7.000-beta-50

* fixes #108

* hashing in joins enabled for every case

* 7.000-beta-51

* Clean up

* Lift functions taking 1 col and returning scalar

* Adjust column api lift ops to new declarative syntax

* Adjust lift plan for tablecloth.column.api for tmd v7

* Remove mention of tech.ml.datatype

* Add missing word

* Bump tmd version to 7.006 for fix to fns that were erroring

fns are: quartiles-1, quartiles-3 and median

* Fixing more tests

* Comment some code to keep around for a spell

* Remove special lift op for 'round

Its arguments were fixed.

* Cleanup

* 7.007

---------

Co-authored-by: Teodor Heggelund <git@teod.eu>
Co-authored-by: genmeblog <38646601+genmeblog@users.noreply.github.com>
Co-authored-by: GenerateMe <generateme.blog@gmail.com>
Co-authored-by: adham-omran <git@adham-omran.com>

* Ethan/lift scalar ops to ds as aggregators (#118)

* Fix indentation

* Save rough working example

Not fully tested

* Fix tests for new aggregator form of ops that return scalar

* Add `column` API documentation (#120)

* Add a sample notebook file

* Save draft work on column api doc

* Add doc entry for tcc/select boolean select

This appears to be broken now, but it shouldn't be.

* Export column api operators in column api ns

* Add in some documentation of operations

* Hide namespace expression from generated doc

* Fix circular dependency

* Update generated docs

* Update text in column operations section

* More updates to the docs

* Remove "Functionality" header in TOC

This way Dataset is an entry, and I can add Column after that.

* Add Column API documentation

* Add an indication of column op signature to docs

* Export lifted column operators in dataset api template

* Add documentation for column operations on datasets

* Some minor changes

* Rename the two headers for Dataset and Column, adding API onto the
end.
* A few small fixes.

* Remove the `Functions` section

This is essentially replaced by the Column API that lifts these
functions into Tablecloth

* Try to remove cyclical dependency

* Revert "Try to remove cyclical dependency"

This reverts commit fcb16c4.

* Fix circular dependency

* Actually fix cyclical dependency

* Undo added line

* Try deploying a documentation preview

* Add preview-branch to docs preview action

Default was gh-pages, we use master.

* Try adding umbrella-dir setting

* Try removing docs folder in umbrella-dir

* Remove old pr docs preview workflow

* Regenerated docs after merge from master

* Add section about column missing values to docs

* Regenerated docs after merge from master

* Remove draft notebook

* Remove temporary trigger for dev branch since it was target of prs

---------

Co-authored-by: Teodor Heggelund <git@teod.eu>
Co-authored-by: genmeblog <38646601+genmeblog@users.noreply.github.com>
Co-authored-by: GenerateMe <generateme.blog@gmail.com>
Co-authored-by: adham-omran <git@adham-omran.com>