Appending empty type to non-empty column and vice versa is not working #1107
This needs to include truly empty dataframes (with zero rows), especially the case where the initial write is empty. This is a requirement for a client.
What needs to be done is to add a check just before the call to `trivially_compatible_types` where the error is thrown (which protects the call to `decode_and_expand`), so that if the column has a null type we just skip that column. Then in the operator of `ReduceColumnTask`, where it currently calls `default_initialize_rows` when the column is entirely missing, it also needs to do that when the column has a none type. It needs to be this way because this is the point where we know the overall column type, so we can use the correct backfill value (i.e. floats should be backfilled with NaN, etc.).
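A hedged Python sketch of the control flow described above; all names are simplified stand-ins for the C++ internals (`trivially_compatible_types`, `ReduceColumnTask`, `default_initialize_rows`), not the real implementation:

```python
import math

# Stand-in for the null/none column type discussed above (hypothetical name).
EMPTY = "empty"

# Simplified backfill values per overall column type, as described in the PR.
BACKFILL = {"int64": 0, "float64": math.nan, "str": None, "bool": False}

def reduce_column(stored_type, stored_values, overall_type, total_rows):
    # Column entirely missing OR stored with the empty type: default-initialize
    # every row using the overall column type's backfill value.
    if stored_type is None or stored_type == EMPTY:
        return [BACKFILL[overall_type]] * total_rows
    # Otherwise the stored type must match the overall type, else error out.
    if stored_type != overall_type:
        raise TypeError(f"incompatible types: {stored_type} vs {overall_type}")
    return stored_values

print(reduce_column(EMPTY, [None, None], "int64", 2))  # [0, 0]
```

The key point is that the empty-type check happens at the reduce step, where the overall column type is known, so the correct backfill value can be chosen.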
I have now made a long list of issues related to empty and missing data. All of these need to be tackled as part of this issue. The list and an explanation are here
#### Reference Issues/PRs

Closes #1107

#### What does this implement or fix?

Fixes how the empty type interacts with other types. In general we should be able to:

* Append other types to columns which were initially empty
* Append empty columns to columns of any other type
* Update values with the empty type, preserving the type of the column

### Changes:

* Each type handler now has a function to report the byte size of its elements.
* Each type handler now has a function to default-initialize some memory.
* The empty type handler now backfills the "empty" elements:
  * integer types -> 0 (not perfect, but in future we're planning to add a default-value argument)
  * float types -> NaN
  * string types -> None
  * bool -> False
  * nullable boolean -> None
  * date -> NaT
* The function which does default initialization was up to now used only in dynamic schema. Now the empty handler calls it as well to do the backfill. The function first checks the POD types and, if any of them matches, uses the basic default initialization; otherwise it checks if there is a type handler and, if so, uses its default_initialize functionality.
* Refactor how updating works. `Column::truncate` is used instead of copying segment rows one by one. This should improve the performance of update.
* Add a new Python fixture `lmdb_version_store_static_and_dynamic` to cover all combinations of {V1, V2} encoding x {static, dynamic} schema.
* Empty-typed columns are now reported as dense columns. They don't have a sparse map, and both physical and logical rows are left uninitialized (value `-1`).

**DISCUSS**

- [x] Should we add an option to support updating non-empty stuff with empty? (Conclusion reached in a Slack thread: yes, we should allow updating with None as long as the type of the output is the same as the type of the column.)
- [x] What should be the output for the following example (a column of None vs a column of 0)? (Conclusion reached in a Slack thread: the result should be [0, 0], i.e. the output should have the same type as the type of the column.)

```python
lib.write("sym", pd.DataFrame({"col": [1, 2, 3]}))
lib.append("sym", pd.DataFrame({"col": [None, None]}))
lib.read("sym", row_range=(3, 5)).data
```

- [ ] Do we need hypothesis testing of random appends of empty and non-empty stuff to the same column?

**Dev TODO**

- [x] Verify the following throws:

```python
lib.write("sym", pd.DataFrame({"col": [None, None]}))
lib.append("sym", pd.DataFrame({"col": [1, 2, 3]}))
lib.append("sym", pd.DataFrame({"col": ["some", "string"]}))
```

- [x] Appending to empty for dynamic schema
- [x] Fix appending empty to other types with static schema
- [x] Fix appending empty to other types with dynamic schema
- [x] Create a single function to handle backfilling of data and use it both in the empty handler and in `reduce_and_fix`
- [x] Change the name of the PYBOOL type
- [x] Fix update, e.g.

```python
lmdb_version_store_v2.write('test', pd.DataFrame([None, None], index=pd.date_range(periods=2, end=dt.now(), freq='1T')))
lmdb_version_store_v2.update('test', pd.DataFrame([None, None], index=pd.date_range(periods=2, end=dt.now(), freq='1T')))
```

- [ ] Add tests for starting with an empty column list, e.g. `pd.DataFrame({"col": []})`. Potentially mark it as xfail and fix with a later PR.
- [ ] Add tests for update when the update range is not entirely contained in the dataframe range index.

#### Any other comments?

#### Checklist

<details>
  <summary>Checklist for code changes...</summary>

 - [ ] Have you updated the relevant docstrings, documentation and copyright notice?
 - [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)?
 - [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)?
 - [ ] Are API changes highlighted in the PR description?
 - [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?
</details>

Co-authored-by: Vasil Pashov <vasil.pashov@man.com>
Describe the bug
When trying to append an empty segment to a column that already has data (or trying to append data to an empty segment), it fails due to a type mismatch. Two things need to be done, detailed in the comments above.
Steps/Code to Reproduce
Expected Results
No error is thrown.
`{"col1": [0, 0, 1, 2, 3]}`
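The expected result can be simulated in plain Python/pandas (an illustrative sketch of the integer backfill rule, not ArcticDB code; `backfill_empty` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

def backfill_empty(values, target_dtype):
    # Hypothetical helper mirroring the backfill rules from the fix:
    # int -> 0, float -> NaN, string/object -> None, bool -> False, date -> NaT.
    defaults = {
        "int64": 0,
        "float64": np.nan,
        "object": None,
        "bool": False,
        "datetime64[ns]": pd.NaT,
    }
    fill = defaults[str(np.dtype(target_dtype))]
    return [fill if v is None else v for v in values]

# The initial write was empty-typed ([None, None]); after appending the int
# data [1, 2, 3], the column resolves to int64 and the empty rows become 0.
combined = backfill_empty([None, None], "int64") + [1, 2, 3]
print({"col1": combined})  # {'col1': [0, 0, 1, 2, 3]}
```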
OS, Python Version and ArcticDB Version
All
Backend storage used
No response
Additional Context
No response