Add column API mvp #100

Merged: 57 commits merged into master from ethan/column-api-dev-branch-1 on Apr 13, 2024

@ezmiller ezmiller commented Mar 10, 2023

Goal

This PR adds a new column API to tablecloth.

Overview

This PR adds a new column API to tablecloth. It also lifts a new set of functions into the existing dataset API, making it possible to run the new column operations on columns in a dataset. The new column API rests on a few core concepts:

  1. There is now such a thing as a column alongside the dataset in Tablecloth. This column is just the tech.ml.dataset Column that constitutes a dataset, but the new Column API makes it a basic primitive in Tablecloth.

  2. While there are some special functions within this new API that help with questions about the identity of a column and the typing of its elements, the bulk of the new code here is generated code that wraps operation functions already present in dtype-next's tech.v3.datatype.functional namespace. The lifted functions have two main characteristics: 1) they always take a column as the first argument, just like the functions in Tablecloth's dataset API, and 2) they always return a new column.

  3. In addition to adding the new column API, this PR also adds a new set of operators to Tablecloth's Dataset API: tablecloth.api.operators. These functions allow the column operations to be easily performed on datasets. These may end up being the most commonly used parts of this PR simply because they are convenient and add expressiveness to dataset manipulations. Here's a screenshot from @kiramclean's excellent 2023 Clojure Conj talk on "Clojure for Data Science in the Real world" that sums it up well:

[image: screenshot from the talk]
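
To make the three concepts above concrete, here is a minimal sketch of how the column API and the lifted dataset operators might be used together. It assumes a tablecloth release that includes this PR is on the classpath; the aliases `tc` and `tcc` and the column names `:a`, `:b`, and `:a+b` are chosen for illustration only.

```clojure
;; Assumption: a tablecloth version containing the column API is available.
(require '[tablecloth.api :as tc]
         '[tablecloth.column.api :as tcc])

;; 1. A column is a primitive in its own right.
(def a (tcc/column [1 2 3]))
(def b (tcc/column [10 20 30]))

(tcc/column? a)           ; identity check: is this a column?
(tcc/typeof a)            ; concrete element type of the column
(tcc/typeof? a :integer)  ; check the elements against a general type

;; 2. Lifted operations take a column first and return a new column.
(def a-plus-b (tcc/+ a b))

;; 3. The same operation lifted onto the dataset API: the result of adding
;;    the selected columns is stored in a target column of a new dataset.
(def ds (tc/dataset {:a [1 2 3] :b [10 20 30]}))
(def result (tc/+ ds :a+b [:a :b]))
```

The dataset form takes the dataset first, then the target column name, then a selector for the input columns, so it threads naturally with `->` alongside the rest of the dataset API.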

Details of the implementation

There are a great many lines of code in this PR, but the bulk of the changes are actually produced by code generation tools that were built to "lift" these functions. The utilities for this process are located in src/tablecloth/utils/codegen.clj. For each of the APIs where we do lifting, the utilities are used in src/tablecloth/api/lift_operators.clj and src/tablecloth/column/api/lift_operators.clj to generate the two operators namespaces. Those two lift namespaces contain the code that describes which functions are generated.

Right now, to regenerate these namespaces, we need to manually run chunks of code that are commented out at the bottom of those two lift namespaces. Going forward, we should consider automating this process in GitHub Actions.
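
For readers unfamiliar with the approach, the idea of lifting through code generation can be sketched in plain Clojure. This is not the code in src/tablecloth/utils/codegen.clj; `elementwise-add`, `lift-form`, and `add` are hypothetical names, and plain vectors stand in for columns.

```clojure
(require '[clojure.pprint :as pp])

;; Hypothetical stand-in for an element-wise function such as those in
;; tech.v3.datatype.functional; here it works on plain vectors.
(defn elementwise-add [x y]
  (mapv + x y))

(defn lift-form
  "Returns, as data, a defn form named `new-name` that forwards to
  `target-sym`, taking the column-like value `x` as its first argument."
  [new-name target-sym]
  `(defn ~new-name
     [~'x ~'& ~'args]
     (apply ~target-sym ~'x ~'args)))

;; Build the wrapper as a data structure rather than writing it by hand.
(def generated (lift-form 'add `elementwise-add))

;; The form can be rendered as source text, which is what a generator
;; would write into a generated namespace file...
(def generated-source (with-out-str (pp/pprint generated)))

;; ...or evaluated directly to define the wrapper in the current namespace.
(eval generated)

(add [1 2 3] [10 20 30]) ;; => [11 22 33]
```

Because the wrapper is ordinary data until it is printed or evaluated, the generator can rearrange arguments or attach docstrings before emitting it, which is the kind of work the lift namespaces describe.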

Open Questions

  • Will the code generation pathway become burdensome over time, or is this a good pattern? Is there any real alternative for this stack?

* added some docstrings
* re-organized a little
* Add ->general-types function

* Add a general type :logical

* Use type hierarchy in tablecloth.api.utils for `typeof` functions

* Add column dev branch to pr workflow

* Add tests for typeof

* Fix tests for typeof

* Return the concrete type from `typeof`

* Simplify `concrete-types` fn

* Optimize ->general-types by using static lookup

* Adjust fns listing types

* We decided that the default meaning of type points to the "concrete"
type, and not the general type.
* So `types` now returns the set of concrete types and `general-types`
returns the general types.

* Revert "Adjust fns listing types"

This reverts commit d93e34f.

* Fix `typeof` test to test for concrete types

* Reorganize `typeof?` tests

* Reword docstring for `typeof?` slightly

* Update column api template and add missing `typeof?`

* Add comment to `general-types-lookup`

* Improve `->general-types` docstring

* Add `general-types` fn that returns sets of general types

* Adjust util `types` fn to return concrete types

* Save changes to column api.clj

* Save ongoing experiments with lifting

* Save ongoing work on lifting

* Adjust lift-ops-1 to handle any number of args with rest arg

* Working `rearrange-args` fn

* Save work actually writing lifted fns

* Saving first attempt to write operators

* Add `percentiles` test

* Adjust `rearrange-args to take new-args in option map

* Unify two lift functions

* Add in docstrings when present

* Move lift utils into utils ns

* Rename lifting namespaces

* Lift some more fns

* Make exclusions for ns header helper an arg

* Add new operators and tests

* Add ops with lhs rhs arg pattern

* Lift '*

* Add require to operators ns for utils

* Update test to make it more complete

* Lift `equals

* Make test more accurate

* Reorganize tests

* Fix grammar

* Lift 'shift

* Uncomment 'or test

* Lift 'normalize op

* Lift 'magnitude

* Lifting bit manipulation ops

* lift ieee-remainder

* Lifting more functions

* Add excludes

* Lift a bunch of new functions

* Alphabetize some lists

* More alphabetization

* Clean up

* Instead of using `col` as arg, conform to using `x` and `y`

* Temporarily disable failing test fix in 7.000-beta23

* Disable the correct test

* Just some minor cleanup in op tests

* Some more cleanup/reorg in op tests

* Update generated operators namespace with switch from col -> x etc

* Lift 'descriptive-statistics

* Fix messed up test layout

* Lift 'quartiles

* Lift 'fill-range and a bunch of reduce operations

* Lift 'mean-fast 'sum-fast 'magnitude-squared

* Lift correlation fns

kendalls, pearsons, and spearmans

* Lift cumulative ops

* cleanup
* Upgrade to latest clay version

* Show using tablecloth.column.api.operators ns

* Cleanup whitespace
* Fix indentation

* Save rough working example

Not fully tested

* Fix tests for new aggregator form of ops that return scalar
* Add a sample notebook file

* Save draft work on column api doc

* Add doc entry for tcc/select boolean select

This appears to be broken now, but it shouldn't be.

* Export column api operators in column api ns

* Add in some documentation of operations

* Hide namespace expression from generated doc

* Fix circular dependency

* Update generated docs

* Update text in column operations section

* More updates to the docs

* Remove "Functionality" header in TOC

This way Dataset is an entry, and I can add Column after that.

* Add Column API documentation

* Add an indication of column op signature to docs

* Export lifted column operators in dataset api template

* Add documentation for column operations on datasets

* Some minor changes

* Rename the two headers for Dataset and Column, adding API onto the
end.
* A few small fixes.

* Remove the `Functions` section

This is essentially replaced by the Column API that lifts these
functions into Tablecloth

* Try to remove cyclical dependency

* Revert "Try to remove cyclical dependency"

This reverts commit fcb16c4.

* Fix circular dependency

* Actually fix cyclical dependency

* Undo added line

@@ -4,6 +4,7 @@ on:
   pull_request:
     branches:
       - master
+      - ethan/column-api-dev-branch-1

Need to remove before we merge this branch.

@ezmiller

closing to reopen for testing doc preview

@ezmiller ezmiller closed this Feb 24, 2024
@ezmiller ezmiller reopened this Feb 24, 2024

github-actions bot commented Feb 24, 2024

PR Preview Action v1.4.7
🚀 Deployed preview to https://scicloj.github.io/tablecloth/pr-preview/pr-100/
on branch gh-pages at 2024-04-06 18:35 UTC

@ezmiller ezmiller closed this Feb 24, 2024
@ezmiller ezmiller reopened this Mar 22, 2024
@ezmiller ezmiller merged commit 9d72a88 into master Apr 13, 2024
2 checks passed
@ezmiller ezmiller deleted the ethan/column-api-dev-branch-1 branch May 19, 2024 00:27