Improve df to tables writer #709
Conversation
I think the tests are also checking that the types remain unchanged after writing & reading, right? I would like to have a specific one, but it's ok as is.
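A round-trip type check of the kind mentioned could be sketched like this (plain pandas/NumPy, no PyTables involved; the helper name and sample dataframe are illustrative):

```python
import numpy as np
import pandas as pd

def roundtrip_types_preserved(df: pd.DataFrame) -> bool:
    # Convert to a structured record array (as the writer does) and back,
    # then verify every column keeps its dtype.
    arr = df.to_records(index=False)
    df_back = pd.DataFrame.from_records(arr)
    return all(df[c].dtype == df_back[c].dtype for c in df.columns)

df = pd.DataFrame({"event" : np.array([1, 2  ], dtype=np.int32  ),
                   "energy": np.array([0.5, 1.5], dtype=np.float64)})
assert roundtrip_types_preserved(df)
```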
invisible_cities/io/dst_io.py
Outdated
data_types = [(col, arr[col].dtype if arr[col].dtype != 'O' else
                    f'S{str_col_length}') for col in arr.dtype.names]
It's a bit hard to read. Can we define a function (even inline) to do that?
How about now?
I just noticed that the type check was set to False in the tests; I changed that.
invisible_cities/io/dst_io.py
Outdated
def _make_tabledef(column_types : pd.Series, str_col_length : int=32) -> dict:
def _make_tabledef(df : pd.DataFrame, str_col_length : int=32) -> dict:
    column_types = df.dtypes
    tabledef = {}
    for indx, colname in enumerate(column_types.index):
        coltype = column_types[colname].name
        if coltype == 'object':
            if df[colname].str.len().max() > str_col_length:
                warnings.warn(f'dataframe contains strings longer than allowed', UserWarning)
            tabledef[colname] = tb.StringCol(str_col_length, pos=indx)
        else:
            tabledef[colname] = tb.Col.from_type(coltype, pos=indx)
    return tabledef
Actually I don't think _make_tabledef cares about the length of the object. I would revert this back to the original version and perform the check in store_pandas_as_tables. What do you think?
Agree. I added a check in store_pandas_as_tables directly. I also modified it to cast the numpy record array to the table type if the table already existed. However, I feel this triple pass for string searching and type casting could be optimized further...
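The cast to an existing table's type mentioned above can be done in one pass over the whole record array; a minimal sketch (the dtypes here are hypothetical stand-ins for an existing table and fresh dataframe records):

```python
import numpy as np

# 'table_dtype' stands for the dtype of an already existing table,
# 'arr' for freshly converted dataframe records with narrower types.
table_dtype = np.dtype([("event", "i8"), ("energy", "f8")])
arr = np.array([(1, 0.5)], dtype=[("event", "i4"), ("energy", "f4")])

# Cast the whole record array to the table's dtype in a single operation.
arr_cast = arr.astype(table_dtype)
assert arr_cast.dtype == table_dtype
```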
invisible_cities/io/dst_io.py
Outdated
arr = df.to_records(index=False)
# hack to transform object numpy dtype to string
def _cast_type(dtype : np.dtype):
I understand where the underscore comes from, but since it is inside a function it doesn't need to have special syntax :)
invisible_cities/io/dst_io_test.py
Outdated
table_name = 'table_name_3'
with tb.open_file(filename, 'w') as h5out:
    with pytest.warns(UserWarning, match='dataframe contains strings longer than allowed'):
        store_pandas_as_tables(h5out, df, group_name, table_name)
I suggest that here you set the string column length explicitly in the call to store_pandas_as_tables, to make the test more obvious.
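The point about an explicit limit can be illustrated without PyTables; the helper below is a hypothetical stand-in for the length check inside store_pandas_as_tables, and the small explicit limit makes the warning's trigger obvious:

```python
import warnings

def check_string_length(s, str_col_length=32):
    # Hypothetical stand-in for the length check inside store_pandas_as_tables.
    if len(s) > str_col_length:
        warnings.warn('dataframe contains strings longer than allowed', UserWarning)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_string_length("x" * 10, str_col_length=5)  # explicit, small limit
assert len(caught) == 1
assert issubclass(caught[0].category, UserWarning)
```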
invisible_cities/io/dst_io_test.py
Outdated
assert_dataframes_close(df_read, pd.concat([df1, df2]).reset_index(drop=True), False, rtol=1e-5)
assert_dataframes_equal(df_read, pd.concat([df1, df2]).reset_index(drop=True))
These dataframes contain floats, so assert_dataframes_close is more appropriate.
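The reason is the usual float-comparison pitfall; a minimal NumPy illustration of exact vs. tolerance-based comparison (the same idea assert_dataframes_close relies on):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([1.000001, 1.999999])

# Exact equality fails for values that differ only at rounding level ...
assert not np.array_equal(a, b)
# ... while a relative-tolerance comparison accepts them.
assert np.allclose(a, b, rtol=1e-5)
```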
invisible_cities/io/dst_io.py
Outdated
else:
    if arr.dtype[colname] != data_types[colname]:
        warnings.warn(f'dataframe numeric types not consistent with the table existing ones', UserWarning)
I think this part belongs to the case in which there is already a stored table. It doesn't make sense to check it in both cases.
invisible_cities/io/dst_io.py
Outdated
for colname in df.columns:
    if (df[colname].dtype.name == 'object'):
Maybe this could be written as

for colname, col in filter(lambda item: item[1].dtype.name == "object", df.items()):
    if col.str.len().max() > data_types[colname].itemsize:
        warnings.warn(...)

or

for colname, col in df.items():
    if col.dtype.name == "object" and col.str.len().max() > data_types[colname].itemsize:
        warnings.warn(...)
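The df.items() idiom used in both variants can be checked with a tiny example (the sample dataframe is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2], "s": ["short", "a rather longer string"]})

# df.items() yields (column name, Series) pairs; keep only object columns.
object_cols = [name for name, col in df.items() if col.dtype.name == "object"]
assert object_cols == ["s"]
assert df["s"].str.len().max() == 22
```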
Right... Let me try to rewrite the whole function in a more sensible way...
invisible_cities/io/dst_io.py
Outdated
if len(arr) == 0:
    warnings.warn(f'dataframe is empty', UserWarning)
else:
    _can_cast(arr, data_types)
I suggest a better name like _check_castability or something like that. The use of "can" suggests that it returns a boolean value.
invisible_cities/io/dst_io.py
Outdated
warnings.warn(f'dataframe numeric types not consistent with the table existing ones', UserWarning)

def store_pandas_as_tables(h5out : tb.file.File, df : pd.DataFrame, group_name : str, table_name : str, compression : str='ZLIB4', descriptive_string : [str]="", str_col_length : int=32) -> None:
I know it was already like that, but can we take the opportunity to split this signature over several lines for readability?
invisible_cities/io/dst_io.py
Outdated
    warnings.warn(f'dataframe contains strings longer than allowed', UserWarning)
elif arr_types[name] != table_types[name]:
    warnings.warn(f'dataframe numeric types not consistent with the table existing ones', UserWarning)
Shouldn't these be actual errors rather than just warnings?
I am not sure. We probably do want an error if you try to store a string in a numeric column, but we do not want to stop the process if someone tries to put an Int32 value into an Int64 column... And writing out all the combinations that make sense seems tedious, so I opted to leave it to the user's responsibility to check the warning.
Mmm, if you think about the cities, that shouldn't happen, as the types should stay consistent through the iterations. If you think about other usages, I don't see that it is a big deal for the user to deal with slightly different types.
With the first warning the code will probably just chop part of the string, and with the second it would either crash or store useless data. And, realistically, warnings are not usually checked...
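The compromise eventually adopted uses NumPy's same_kind casting rule, which permits widening or same-kind conversions (e.g. Int32 into Int64) while rejecting cross-kind ones; a quick sketch of its behaviour:

```python
import numpy as np

# 'same_kind' casting allows widening or same-kind conversions but
# rejects cross-kind ones, matching the trade-off discussed above.
assert np.can_cast(np.dtype("int32"),   np.dtype("int64"),   casting="same_kind")
assert np.can_cast(np.dtype("float64"), np.dtype("float32"), casting="same_kind")
assert not np.can_cast(np.dtype("float64"), np.dtype("int64"), casting="same_kind")
```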
I think I addressed all previous comments. @gonzaponte, if you are happy with the code, shall I do the history cleaning?
After these cosmetic changes we are good to go.
invisible_cities/io/dst_io.py
Outdated
def _check_castability(arr : np.ndarray, table_types : np.dtype):
    arr_types = arr.dtype
    if set(arr_types.names) != set(table_types.names):
        raise TableMismatch(f'dataframe differs from already existing table structure')
    for name in arr_types.names:
        if arr_types[name].name == 'object':
            max_str_length = max(map(len, arr[name]))
            if max_str_length > table_types[name].itemsize:
                warnings.warn(f'dataframe contains strings longer than allowed', UserWarning)
        elif not np.can_cast(arr_types[name], table_types[name], casting='same_kind'):
            raise TableMismatch(f'dataframe numeric types not consistent with the table existing ones')
Helps readability.
invisible_cities/io/dst_io_test.py
Outdated
@given(df=dataframe)
def test_store_pandas_as_tables_raises_warning_inconsistent_types(config_tmpdir, df):
def test_store_pandas_as_tables_raises_warning_inconsistent_types(config_tmpdir, df):
def test_store_pandas_as_tables_raises_TableMismatch_inconsistent_types(config_tmpdir, df):
Force-pushed from 920472a to 1f95085
This PR makes the function store_pandas_as_tables public, as it should be. It also improves the performance of the writer and adds type checks for robustness. A bunch of tests have been added, which is always good.
Good job!
The function is used not only in this module, so it shouldn't be private.
The writing is done by appending a numpy record array to the pytable. It also adds checks of whether the type conversion is possible.
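The record-array append described here can be sketched with plain pandas/NumPy (the final PyTables call is shown only as a comment, since it needs an open file; the sample dataframe is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"event": [1, 2], "energy": [0.5, 1.5]})

# One structured record per dataframe row; with PyTables the write would
# then simply be table.append(arr) (not executed here).
arr = df.to_records(index=False)
assert arr.dtype.names == ("event", "energy")
assert len(arr) == 2
```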
Also removes False for type checking in assert_dataframes_close.
Add tests that check the warning is raised if the string is too long, and a test that checks a TableMismatch error is raised if the numeric types are different.
Handle the case where the new dataframe columns are not in the same order as the already existing table.
Force-pushed from 1f95085 to bf6085c
This PR removes the underscore in front of the store_pandas_as_tables writer, since the method is used outside the module. It also modifies the method to improve speed when writing large dataframes; no tests are needed because it is just an internal change.