Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Appending empty type to non-empty column and vice versa is not working #1107

Closed
vasil-pashov opened this issue Nov 27, 2023 · 3 comments · Fixed by #1227
Closed

Appending empty type to non-empty column and vice versa is not working #1107

vasil-pashov opened this issue Nov 27, 2023 · 3 comments · Fixed by #1227
Assignees
Labels
bug Something isn't working
Milestone

Comments

@vasil-pashov
Copy link
Collaborator

Describe the bug

When trying to append an empty segment to a column that already has data (or trying to append data to an empty segment) it fails due to type mismatch. Two things need to be done:

  1. Make it possible to append empty segments to non-empty columns (and vice versa)
  2. Perform filling of data. If appending empty column to column of int values fill the "empty" ones with zero, if appending to floats fill with NaN if appending to strings fill with None.

Steps/Code to Reproduce

from arcticdb import Arctic
from pandas.testing import assert_frame_equal

ac = Arctic("lmdb://test")
lib = ac.get_library("test", create_if_missing=True)
df_empty = pd.DataFrame({"col1": [None, None]})
lib.write("test_can_append_to_empty_column", df_empty)
df_non_empty = pd.DataFrame({"col1": [1, 2, 3]})
lib .append("test_can_append_to_empty_column", df_non_empty)
expected_result = pd.concat([df_empty, df_non_empty], ignore_index=True)
assert_frame_equal(lib .read("test_can_append_to_empty_column").data, expected_result)

Expected Results

No error is thrown.

{"col1": [0,0,1,2,3]}

OS, Python Version and ArcticDB Version

All

Backend storage used

No response

Additional Context

No response

@vasil-pashov vasil-pashov added the bug Something isn't working label Nov 27, 2023
@vasil-pashov vasil-pashov self-assigned this Nov 27, 2023
@vasil-pashov vasil-pashov changed the title Appending empty type to non-empty column and vice versa is not working = Appending empty type to non-empty column and vice versa is not working Nov 30, 2023
@DrNickClarke
Copy link
Contributor

This needs to include truly empty dataframes (with zero records). Especially the case where the initial write is empty. This is a requirement for a client.

@willdealtry
Copy link
Collaborator

What needs to be done is to add a check just before the call to trivially_compatible_types where the error is thrown (which is protecting the call to decode_and_expand, so that if the column has a null type we just skip that column.

Then in the operator or ReduceColumnTask where it currently calls default_initialize_rows when the column is entirely missing, it also needs to do that when the column has a none type.

It needs to be this way because this is the point where we know what the overall column type is, so that we can use the correct backfill value (i.e. floats should be backfilled with NaN etc)

@DrNickClarke
Copy link
Contributor

DrNickClarke commented Dec 21, 2023

I have now made a long list of issues related to empty and missing data. All of these need to be tackled as part of this issue. The list and an explanation is here
https://github.com/man-group/arcticdb-bloomberg/blob/main/notes/testing/missing_empty_data_tests.ipynb

@vasil-pashov vasil-pashov mentioned this issue Feb 2, 2024
17 tasks
vasil-pashov added a commit that referenced this issue Feb 20, 2024
#### Reference Issues/PRs
Closes #1107

#### What does this implement or fix?
Fixes how empty type interacts with other types. In general we should be
able to:
* Append other types to columns which were initially empty
* Append empty columns to columns of any other type
* Update values with empty type preserving the type of the column

### Changes:
* Each type handler now has a function to report the byte size of its
elements.
* Each type handler now has a function to default initialize some memory
* Empty type handler now backfills the "empty" elements.
* integer types -> 0 (Not perfect but in future we're planning to add
default value argument)
  * float types -> NaN
  * string types -> None
  * bool -> False
  * Nullable boolean -> None
  * Date -> NaT
* The function which does default initialization up to now was used only
in dynamic schema. Now the empty handler calls it as well to do the
backfill. The function first checks the PoD types and if any of them
matches uses the basic default initialization, otherwise it checks if
there is a type handler and if so uses its default_initialize
functionality
* Refactor how updating works. `Column::truncate` is used instead of
copying segment rows one by one. This should improve the performance of
update.
* Add a new python fixture `lmdb_version_store_static_and_dynamic` to
cover all combinations {V1, V2} ecoding x {Static, Dynamic} schema
* Empty typed columns are now reported as dense columns. They don't have
a sparse map and both physical and logical rows are left uninitialized
(value `-1`)


**DISCUSS**
- [x] Should we add an option to support updating non-empty stuff with
empty (Conclusion reached in a slack thread: yes, we should allow to
update with None as long as the type of the output is the same as the
type of the column.)
- [x] What should be the output for the following example (a column of
none vs a column of 0) (Conclusion reached in a slack thread. The result
should be [0,0], i.e. the output should have the same type as the type
of the column.)
```python
lib.write("sym", pd.DataFrame({"col": [1,2,3]}))
lib.append("sym", pd.DataFrame({"col": [None, None]}))
lib.read("sym", row_range=[3:5]).data
```
- [ ] Do we need hypothesis testing random appends of empty and
non-empty stuff to the same column?
**Dev TODO**
- [x] Verify the following throws
```python
lib.write("sym", pd.DataFrame({"col": [None, None]}))
lib.append("sym", pd.DataFrame({"col": [1, 2, 3]}))
lib.append("sym", pd.DataFrame({"col": ["some", "string"]}))
```
- [x] Appending to empty for dynamic schema
- [x] Fix appending empty to other types with static schema
- [x] Fix appending empty to other types with dynamic schema
- [x] Create a single function to handle backfilling of data and use it
both in the empty handler and in reduce_and_fix
- [x] Change the name of PYBOOL type
- [x] Fix update e.g.
```python
lmdb_version_store_v2.write('test', pd.DataFrame([None, None], index=pd.date_range(periods=2, end=dt.now(), freq='1T')))
lmdb_version_store_v2.update('test', pd.DataFrame([None, None], index=pd.date_range(periods=2, end=dt.now(), freq='1T')))
```
- [ ] Add tests for starting with empty column list e.g.
`pd.DataFrame({"col": []})`. Potentially mark it as xfail and fix with a
later PR.
- [ ] Add tests for update when the update range is not entirely
contained in the dataframe range index

#### Any other comments?

#### Checklist

<details>
  <summary>
   Checklist for code changes...
  </summary>
 
- [ ] Have you updated the relevant docstrings, documentation and
copyright notice?
- [ ] Is this contribution tested against [all ArcticDB's
features](../docs/mkdocs/docs/technical/contributing.md)?
- [ ] Do all exceptions introduced raise appropriate [error
messages](https://docs.arcticdb.io/error_messages/)?
 - [ ] Are API changes highlighted in the PR description?
- [ ] Is the PR labelled as enhancement or bug so it appears in
autogenerated release notes?
</details>

<!--
Thanks for contributing a Pull Request to ArcticDB! Please ensure you
have taken a look at:
- ArcticDB's Code of Conduct:
https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md
- ArcticDB's Contribution Licensing:
https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing
-->

---------

Co-authored-by: Vasil Pashov <vasil.pashov@man.com>
@alexowens90 alexowens90 added this to the 4.4.0 milestone Mar 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants