Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic schema can append rowrange to datetime index leading to datacorruption #1507

Closed
vasil-pashov opened this issue Apr 18, 2024 · 0 comments · Fixed by #1529
Closed

Dynamic schema can append rowrange to datetime index leading to datacorruption #1507

vasil-pashov opened this issue Apr 18, 2024 · 0 comments · Fixed by #1529
Assignees
Labels
bug Something isn't working

Comments

@vasil-pashov
Copy link
Collaborator

Describe the bug

When dynamic schema is enabled and the symbol contains empty dataframe it is possible to append a rowrange index to the symbol. This should be forbidden since the indexes are not matching. The append changes the index type to be rowrange, but still it is possible to append datetime to that as well leading to a data corruption.

Steps/Code to Reproduce

import arcticdb
import numpy as np
import pandas as pd

# 4.4.0
print(arcticdb.__version__)

ac = arcticdb.Arctic("lmdb://test")
opts = arcticdb.LibraryOptions(dynamic_schema=True)
lib = ac.get_library("test", create_if_missing=True, library_options=opts)

lib.write("test_dyn", pd.DataFrame({"a": [], dtype="float64"}))
assert lib.read("test_dyn").data.index.equals(pd.DatetimeIndex([]))

# This should not be allowed
lib.append("test_dyn", pd.DataFrame({"a": [1.0]}, pd.RangeIndex(start=0, stop=1, step=1)))
# The index appears to be changed
assert lib.read("test_dyn").data.index.equals(pd.RangeIndex(start=0, stop=1, step=1))
# The output looks ok even though the index is messed up
#    a
# 0  1
print(lib.read("test_dyn").data)


# Even though the index appears to be changed the following will throw:
# lib.append("test_dyn", pd.DataFrame({"a": [1.0]}, pd.RangeIndex(start=1, stop=2, step=1)))

# Appending datetime to this is allowed only if validate index is false
lib.append("test_dyn", pd.DatetimeIndex(["01/01/2024"]), validate_index=False)

# The data is corrupted
#                         index
# 1                         NaT
# 682406825570164596 2024-01-01

Expected Results

There are two cases:

  1. The empty index feature is enabled (Implement empty index for 0-rowed columns #1429, Feature flag empty_index #1475)
    Anything could be appended to symbol containing an empty dataframe. The type of the index is determined at the time of first append (or update). After the type of the index is determined index mixing indexes should not be allowed.

  2. The empty index feature is disabled
    The index is determined at write time. In case of Pandas 2 it is DatetimeIndex([]) for all empty DataFrames (in case of Pandas 1 it is not so consistent refer to test_empty_column_type.py for the expected behavior). Appending to a symbol containing empty dataframe is allowed only if the indexes are matching.

OS, Python Version and ArcticDB Version

Python: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
OS: Windows-10-10.0.22631-SP0
ArcticDB: 4.4.0
Numpy: 1.26.3
Pandas: 2.1.4

Backend storage used

No response

Additional Context

No response

@vasil-pashov vasil-pashov added the bug Something isn't working label Apr 18, 2024
@vasil-pashov vasil-pashov self-assigned this Apr 18, 2024
alexowens90 added a commit that referenced this issue Apr 26, 2024
Closes #1507 
Closes #1509

While fixing, noticed that the added test
`test_append_range_index_from_zero` would also not have passed, so fixed
this too.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants