Appending empty type to non-empty column and vice versa is not working #1107

vasil-pashov · 2023-11-27T12:24:10Z

Describe the bug

When trying to append an empty segment to a column that already has data (or trying to append data to an empty segment) it fails due to type mismatch. Two things need to be done:

Make it possible to append empty segments to non-empty columns (and vice versa)
Perform filling of data. If appending empty column to column of int values fill the "empty" ones with zero, if appending to floats fill with NaN if appending to strings fill with None.

Steps/Code to Reproduce

from arcticdb import Arctic
from pandas.testing import assert_frame_equal

ac = Arctic("lmdb://test")
lib = ac.get_library("test", create_if_missing=True)
df_empty = pd.DataFrame({"col1": [None, None]})
lib.write("test_can_append_to_empty_column", df_empty)
df_non_empty = pd.DataFrame({"col1": [1, 2, 3]})
lib .append("test_can_append_to_empty_column", df_non_empty)
expected_result = pd.concat([df_empty, df_non_empty], ignore_index=True)
assert_frame_equal(lib .read("test_can_append_to_empty_column").data, expected_result)

Expected Results

No error is thrown.

{"col1": [0,0,1,2,3]}

OS, Python Version and ArcticDB Version

All

Backend storage used

No response

Additional Context

No response

DrNickClarke · 2023-12-11T13:39:28Z

This needs to include truly empty dataframes (with zero records). Especially the case where the initial write is empty. This is a requirement for a client.

willdealtry · 2023-12-14T15:47:18Z

What needs to be done is to add a check just before the call to trivially_compatible_types where the error is thrown (which is protecting the call to decode_and_expand, so that if the column has a null type we just skip that column.

Then in the operator or ReduceColumnTask where it currently calls default_initialize_rows when the column is entirely missing, it also needs to do that when the column has a none type.

It needs to be this way because this is the point where we know what the overall column type is, so that we can use the correct backfill value (i.e. floats should be backfilled with NaN etc)

DrNickClarke · 2023-12-21T10:58:57Z

I have now made a long list of issues related to empty and missing data. All of these need to be tackled as part of this issue. The list and an explanation is here
https://github.com/man-group/arcticdb-bloomberg/blob/main/notes/testing/missing_empty_data_tests.ipynb

#### Reference Issues/PRs Closes #1107 #### What does this implement or fix? Fixes how empty type interacts with other types. In general we should be able to: * Append other types to columns which were initially empty * Append empty columns to columns of any other type * Update values with empty type preserving the type of the column ### Changes: * Each type handler now has a function to report the byte size of its elements. * Each type handler now has a function to default initialize some memory * Empty type handler now backfills the "empty" elements. * integer types -> 0 (Not perfect but in future we're planning to add default value argument) * float types -> NaN * string types -> None * bool -> False * Nullable boolean -> None * Date -> NaT * The function which does default initialization up to now was used only in dynamic schema. Now the empty handler calls it as well to do the backfill. The function first checks the PoD types and if any of them matches uses the basic default initialization, otherwise it checks if there is a type handler and if so uses its default_initialize functionality * Refactor how updating works. `Column::truncate` is used instead of copying segment rows one by one. This should improve the performance of update. * Add a new python fixture `lmdb_version_store_static_and_dynamic` to cover all combinations {V1, V2} ecoding x {Static, Dynamic} schema * Empty typed columns are now reported as dense columns. They don't have a sparse map and both physical and logical rows are left uninitialized (value `-1`) **DISCUSS** - [x] Should we add an option to support updating non-empty stuff with empty (Conclusion reached in a slack thread: yes, we should allow to update with None as long as the type of the output is the same as the type of the column.) - [x] What should be the output for the following example (a column of none vs a column of 0) (Conclusion reached in a slack thread. The result should be [0,0], i.e. the output should have the same type as the type of the column.) ```python lib.write("sym", pd.DataFrame({"col": [1,2,3]})) lib.append("sym", pd.DataFrame({"col": [None, None]})) lib.read("sym", row_range=[3:5]).data ``` - [ ] Do we need hypothesis testing random appends of empty and non-empty stuff to the same column? **Dev TODO** - [x] Verify the following throws ```python lib.write("sym", pd.DataFrame({"col": [None, None]})) lib.append("sym", pd.DataFrame({"col": [1, 2, 3]})) lib.append("sym", pd.DataFrame({"col": ["some", "string"]})) ``` - [x] Appending to empty for dynamic schema - [x] Fix appending empty to other types with static schema - [x] Fix appending empty to other types with dynamic schema - [x] Create a single function to handle backfilling of data and use it both in the empty handler and in reduce_and_fix - [x] Change the name of PYBOOL type - [x] Fix update e.g. ```python lmdb_version_store_v2.write('test', pd.DataFrame([None, None], index=pd.date_range(periods=2, end=dt.now(), freq='1T'))) lmdb_version_store_v2.update('test', pd.DataFrame([None, None], index=pd.date_range(periods=2, end=dt.now(), freq='1T'))) ``` - [ ] Add tests for starting with empty column list e.g. `pd.DataFrame({"col": []})`. Potentially mark it as xfail and fix with a later PR. - [ ] Add tests for update when the update range is not entirely contained in the dataframe range index #### Any other comments? #### Checklist <details> <summary> Checklist for code changes... </summary> - [ ] Have you updated the relevant docstrings, documentation and copyright notice? - [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)? - [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)? - [ ] Are API changes highlighted in the PR description? - [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes? </details>  --------- Co-authored-by: Vasil Pashov <vasil.pashov@man.com>

vasil-pashov added the bug Something isn't working label Nov 27, 2023

vasil-pashov self-assigned this Nov 27, 2023

vasil-pashov changed the title ~~Appending empty type to non-empty column and vice versa is not working =~~ Appending empty type to non-empty column and vice versa is not working Nov 30, 2023

vasil-pashov mentioned this issue Feb 2, 2024

Bugfix: Empty type #1227

Merged

17 tasks

vasil-pashov closed this as completed in #1227 Feb 20, 2024

alexowens90 added this to the 4.4.0 milestone Mar 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Appending empty type to non-empty column and vice versa is not working #1107

Appending empty type to non-empty column and vice versa is not working #1107

vasil-pashov commented Nov 27, 2023

DrNickClarke commented Dec 11, 2023

willdealtry commented Dec 14, 2023

DrNickClarke commented Dec 21, 2023 •

edited

Loading

Appending empty type to non-empty column and vice versa is not working #1107

Appending empty type to non-empty column and vice versa is not working #1107

Comments

vasil-pashov commented Nov 27, 2023

Describe the bug

Steps/Code to Reproduce

Expected Results

OS, Python Version and ArcticDB Version

Backend storage used

Additional Context

DrNickClarke commented Dec 11, 2023

willdealtry commented Dec 14, 2023

DrNickClarke commented Dec 21, 2023 • edited Loading

DrNickClarke commented Dec 21, 2023 •

edited

Loading