Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: Empty type #1227

Merged
merged 36 commits into from
Feb 20, 2024
Merged

Bugfix: Empty type #1227

merged 36 commits into from
Feb 20, 2024

Conversation

vasil-pashov
Copy link
Collaborator

@vasil-pashov vasil-pashov commented Jan 12, 2024

Reference Issues/PRs

Closes #1107

What does this implement or fix?

Fixes how empty type interacts with other types. In general we should be able to:

  • Append other types to columns which were initially empty
  • Append empty columns to columns of any other type
  • Update values with empty type preserving the type of the column

Changes:

  • Each type handler now has a function to report the byte size of its elements.
  • Each type handler now has a function to default initialize some memory
  • Empty type handler now backfills the "empty" elements.
    • integer types -> 0 (Not perfect but in future we're planning to add default value argument)
    • float types -> NaN
    • string types -> None
    • bool -> False
    • Nullable boolean -> None
    • Date -> NaT
  • The function which does default initialization up to now was used only in dynamic schema. Now the empty handler calls it as well to do the backfill. The function first checks the PoD types and if any of them matches uses the basic default initialization, otherwise it checks if there is a type handler and if so uses its default_initialize functionality
  • Refactor how updating works. Column::truncate is used instead of copying segment rows one by one. This should improve the performance of update.
  • Add a new python fixture lmdb_version_store_static_and_dynamic to cover all combinations {V1, V2} ecoding x {Static, Dynamic} schema
  • Empty typed columns are now reported as dense columns. They don't have a sparse map and both physical and logical rows are left uninitialized (value -1)

DISCUSS

  • Should we add an option to support updating non-empty stuff with empty (Conclusion reached in a slack thread: yes, we should allow to update with None as long as the type of the output is the same as the type of the column.)
  • What should be the output for the following example (a column of none vs a column of 0) (Conclusion reached in a slack thread. The result should be [0,0], i.e. the output should have the same type as the type of the column.)
lib.write("sym", pd.DataFrame({"col": [1,2,3]}))
lib.append("sym", pd.DataFrame({"col": [None, None]}))
lib.read("sym", row_range=[3:5]).data
  • Do we need hypothesis testing random appends of empty and non-empty stuff to the same column?
    Dev TODO
  • Verify the following throws
lib.write("sym", pd.DataFrame({"col": [None, None]}))
lib.append("sym", pd.DataFrame({"col": [1, 2, 3]}))
lib.append("sym", pd.DataFrame({"col": ["some", "string"]}))
  • Appending to empty for dynamic schema
  • Fix appending empty to other types with static schema
  • Fix appending empty to other types with dynamic schema
  • Create a single function to handle backfilling of data and use it both in the empty handler and in reduce_and_fix
  • Change the name of PYBOOL type
  • Fix update e.g.
lmdb_version_store_v2.write('test', pd.DataFrame([None, None], index=pd.date_range(periods=2, end=dt.now(), freq='1T')))
lmdb_version_store_v2.update('test', pd.DataFrame([None, None], index=pd.date_range(periods=2, end=dt.now(), freq='1T')))
  • Add tests for starting with empty column list e.g. pd.DataFrame({"col": []}). Potentially mark it as xfail and fix with a later PR.
  • Add tests for update when the update range is not entirely contained in the dataframe range index

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@vasil-pashov vasil-pashov force-pushed the dev/vasil.pashov/fix_empty_type branch 2 times, most recently from 7efb716 to adf99ee Compare January 12, 2024 14:03
@vasil-pashov vasil-pashov changed the title Dev/vasil.pashov/fix empty type Bugfix: Empty type Jan 14, 2024
Vasil Pashov added 4 commits January 15, 2024 17:03
- Remove PYBOOL64
- Fix issues regarding appending empty to strings and strings to empty
  with dynamic schema.
- Add tests for appending empty to all other types and all other types
  to empty
@vasil-pashov vasil-pashov force-pushed the dev/vasil.pashov/fix_empty_type branch from adf99ee to 8eb7253 Compare January 22, 2024 16:33
@vasil-pashov vasil-pashov marked this pull request as ready for review February 1, 2024 15:56
@vasil-pashov vasil-pashov force-pushed the dev/vasil.pashov/fix_empty_type branch from 762662a to ff4bcdd Compare February 5, 2024 14:59
@alexowens90
Copy link
Collaborator

Can you add a test for #1056 as well? xfail it if these changes don't fix it, but good to understand the current behaviour

@vasil-pashov vasil-pashov force-pushed the dev/vasil.pashov/fix_empty_type branch from 5c36769 to c2ef852 Compare February 9, 2024 14:47
@DrNickClarke
Copy link
Contributor

DrNickClarke commented Feb 15, 2024

  • Please run the new empty missing tests before merging this code
  • Please test that defrag works on a column containing mixed int + missing and bool + missing (ie that is does not compress the missing data and values into the same segment, which I understand would cause an issue)

@alexowens90
Copy link
Collaborator

  • Please run the new empty missing tests before merging this code
  • Please test that defrag works on a column containing mixed int + missing and bool + missing (ie that is does not compress the missing data and values into the same segment, which I understand would cause an issue)

I hadn't considered the impact on defrag before. We should merge this and open a new ticket to deal with it though, it will be a mostly orthogonal change.

Vasil Pashov added 3 commits February 16, 2024 18:36
@vasil-pashov
Copy link
Collaborator Author

vasil-pashov commented Feb 16, 2024

Related epic tracking all remaining things left to do #1339

@vasil-pashov vasil-pashov merged commit 6ed0148 into master Feb 20, 2024
112 checks passed
@vasil-pashov vasil-pashov deleted the dev/vasil.pashov/fix_empty_type branch February 20, 2024 10:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Appending empty type to non-empty column and vice versa is not working
4 participants