
Improve df to tables writer #709

Merged: 6 commits merged into next-exp:master from improve_df_to_tables_writer on Mar 19, 2020

Conversation

@mmkekic (Collaborator) commented Mar 12, 2020:

This PR removes the underscore in front of the store_pandas_as_tables writer, since the method is used outside the module. It also modifies the method to improve speed when writing large dataframes; no new tests are needed because it is just an internal change.
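The core of the speed improvement is writing the whole dataframe as one numpy record array instead of row by row. A minimal, self-contained sketch of that pattern (file, group, and variable names here are illustrative, not the module's actual API):

    import numpy  as np
    import pandas as pd
    import tables as tb

    # Build a small dataframe and convert it to a contiguous record array.
    df  = pd.DataFrame({'x': np.arange(3), 'y': np.linspace(0., 1., 3)})
    arr = df.to_records(index=False)

    with tb.open_file('example.h5', 'w') as h5out:
        group = h5out.create_group(h5out.root, 'data')
        table = h5out.create_table(group, 'table', obj=arr)  # create and fill
        table.append(arr)   # appending further rows is a single bulk call
        table.flush()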

@mmkekic requested a review from gonzaponte on March 12, 2020 18:07
@gonzaponte (Collaborator) left a comment:

I think the tests are also checking that the types remain unchanged after writing & reading, right? I would like to have a specific one, but it's ok as is.

Comment on lines 64 to 65

    data_types = [(col, arr[col].dtype if arr[col].dtype != 'O' else \
                        f'S{str_col_length}') for col in arr.dtype.names]
@gonzaponte (Collaborator):

It's a bit hard to read. Can we define a function (even inline) to do that?
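One way such a helper might look (a sketch; _column_dtype is a hypothetical name):

    # Map object columns to fixed-width strings; keep other dtypes as-is.
    def _column_dtype(arr, col, str_col_length):
        dtype = arr[col].dtype
        return dtype if dtype != 'O' else f'S{str_col_length}'

    data_types = [(col, _column_dtype(arr, col, str_col_length))
                  for col in arr.dtype.names]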

@mmkekic (Collaborator, Author):

How about now?

@mmkekic (Collaborator, Author) commented Mar 13, 2020:

> I think the tests are also checking that the types remain unchanged after writing & reading, right? I would like to have a specific one, but it's ok as is.

I just noticed that the type check was set to False in the tests; I changed that.

Comment on lines 32 to 40

    -def _make_tabledef(column_types : pd.Series, str_col_length : int=32) -> dict:
    +def _make_tabledef(df : pd.DataFrame, str_col_length : int=32) -> dict:
         column_types = df.dtypes
         tabledef = {}
         for indx, colname in enumerate(column_types.index):
             coltype = column_types[colname].name
             if coltype == 'object':
                 if df[colname].str.len().max() > str_col_length:
                     warnings.warn(f'dataframe contains strings longer than allowed', UserWarning)
                 tabledef[colname] = tb.StringCol(str_col_length, pos=indx)
             else:
                 tabledef[colname] = tb.Col.from_type(coltype, pos=indx)
         return tabledef
@gonzaponte (Collaborator):

Actually I don't think _make_tabledef cares about the length of the object. I would revert this back to the original version and perform the check in store_pandas_as_tables. What do you think?
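For reference, the reverted version would look roughly like the left-hand side of the diff above, with the length check removed (a reconstruction from the snippet, not the final committed code):

    def _make_tabledef(column_types : pd.Series, str_col_length : int=32) -> dict:
        # Build a pytables description from pandas dtypes; object columns
        # become fixed-width strings, everything else maps directly.
        tabledef = {}
        for indx, colname in enumerate(column_types.index):
            coltype = column_types[colname].name
            if coltype == 'object':
                tabledef[colname] = tb.StringCol(str_col_length, pos=indx)
            else:
                tabledef[colname] = tb.Col.from_type(coltype, pos=indx)
        return tabledef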

@mmkekic (Collaborator, Author):

Agree. I added a check in store_pandas_as_tables directly. I also modified it to cast the numpy record array to the table type when the table already exists. However, I feel like this 3x search for strings and type casting could be better optimized...
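A rough sketch of that cast (hypothetical variable names; it assumes the dataframe columns come in the same order as the existing table's columns):

    # Coerce the record array to the existing table's compound dtype
    # before appending, so the stored types stay consistent.
    table = h5out.get_node(group, table_name)
    arr   = df.to_records(index=False).astype(table.dtype)
    table.append(arr)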


    arr = df.to_records(index=False)
    # hack to transform object numpy dtype to string
    def _cast_type(dtype : np.dtype):
@gonzaponte (Collaborator):

I understand where the underscore comes from, but since it is inside a function it doesn't need to have special syntax :)

    table_name = 'table_name_3'
    with tb.open_file(filename, 'w') as h5out:
        with pytest.warns(UserWarning, match='dataframe contains strings longer than allowed'):
            store_pandas_as_tables(h5out, df, group_name, table_name)
@gonzaponte (Collaborator):

I suggest that here you explicitly set the string column length in the call to store_pandas_as_tables, to make the test more obvious.
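Concretely, that amounts to something like the following, where str_col_length=4 is an arbitrary illustrative limit:

    table_name = 'table_name_3'
    with tb.open_file(filename, 'w') as h5out:
        with pytest.warns(UserWarning, match='dataframe contains strings longer than allowed'):
            # The limit being exceeded is now visible in the test itself.
            store_pandas_as_tables(h5out, df, group_name, table_name, str_col_length=4)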

    -assert_dataframes_close(df_read, pd.concat([df1, df2]).reset_index(drop=True), False, rtol=1e-5)
    +assert_dataframes_equal(df_read, pd.concat([df1, df2]).reset_index(drop=True))
@gonzaponte (Collaborator):

These dataframes contain floats; assert_dataframes_close is more appropriate.
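A tiny standalone illustration of why exact equality is fragile for floats after a dtype round trip (values chosen purely for illustration):

    import numpy as np

    a = np.float32(0.1)                 # value after a float32 round trip
    b = 0.1                             # original float64 value
    assert a != b                       # exact comparison fails
    assert np.isclose(a, b, rtol=1e-5)  # tolerance-based comparison passes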

Comment on lines 75 to 77

    else:
        if arr.dtype[colname] != data_types[colname]:
            warnings.warn(f'dataframe numeric types not consistent with the table existing ones', UserWarning)
@gonzaponte (Collaborator):

I think this part belongs to the case in which there is already a table object stored. It doesn't make sense to check it for both cases
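A sketch of that branching (it borrows names from later in this PR and assumes the group already exists):

    try:
        # Existing table: validate the incoming types against it.
        table = h5out.get_node('/' + group_name, table_name)
        _check_castability(arr, table.dtype)
    except tb.NoSuchNodeError:
        # New table: nothing to validate against, just build the description.
        tabledef = _make_tabledef(df.dtypes, str_col_length)
        table    = h5out.create_table('/' + group_name, table_name, tabledef)
    table.append(arr)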

Comment on lines 71 to 72

    for colname in df.columns:
        if (df[colname].dtype.name == 'object'):
@gonzaponte (Collaborator):

Maybe this could be written as

    for colname, col in filter(lambda pair: pair[1].dtype.name == "object", df.items()):
        if col.str.len().max() > data_types[colname].itemsize:
            warnings.warn(...)

or

    for colname, col in df.items():
        if col.dtype.name == "object" and col.str.len().max() > data_types[colname].itemsize:
            warnings.warn(...)

@mmkekic (Collaborator, Author):

Right... Let me try to rewrite the whole function in a more sensible way...

    if len(arr) == 0:
        warnings.warn(f'dataframe is empty', UserWarning)
    else:
        _can_cast(arr, data_types)
@gonzaponte (Collaborator):

I suggest a better name like _check_castability or something like that. The use of "can" suggests that it returns a boolean value.

        warnings.warn(f'dataframe numeric types not consistent with the table existing ones', UserWarning)


    def store_pandas_as_tables(h5out : tb.file.File, df : pd.DataFrame, group_name : str, table_name : str, compression : str='ZLIB4', descriptive_string : [str]="", str_col_length : int=32) -> None:
@gonzaponte (Collaborator):

I know it was already like that, but can we take the opportunity to split this across several lines for readability?
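For instance, one parameter per line (a sketch, writing the odd [str] annotation as plain str):

    def store_pandas_as_tables(h5out              : tb.file.File,
                               df                 : pd.DataFrame,
                               group_name         : str,
                               table_name         : str,
                               compression        : str = 'ZLIB4',
                               descriptive_string : str = "",
                               str_col_length     : int = 32) -> None:
        ...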

Comment on lines 50 to 52

        warnings.warn(f'dataframe contains strings longer than allowed', UserWarning)
    elif arr_types[name] != table_types[name]:
        warnings.warn(f'dataframe numeric types not consistent with the table existing ones', UserWarning)
@gonzaponte (Collaborator):

Shouldn't these be actual errors rather than just warnings?

@mmkekic (Collaborator, Author):

I am not sure. We probably do want an error if someone tries to store a string in a numeric column, but we do not want to stop the process if someone tries to put an Int32 value into an Int64 column... And writing out all the combinations that make sense seems a bit tedious, so I opted to leave it to the user's responsibility to check the warning.
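This middle ground is what numpy's can_cast with casting='same_kind' captures, which is what the final version of the code ends up using (a small self-contained illustration):

    import numpy as np

    # Widening int32 -> int64 is a benign, same-kind cast.
    assert np.can_cast(np.int32, np.int64, casting='same_kind')
    # A string into a numeric column is rejected.
    assert not np.can_cast(np.dtype('S8'), np.int64, casting='same_kind')
    # float -> int changes kind, so it is rejected too.
    assert not np.can_cast(np.float64, np.int64, casting='same_kind')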

@gonzaponte (Collaborator):

Mmm, if you think about the cities, that shouldn't happen, as the types should be consistent across iterations. If you think about other usages, I don't see that it is a big deal for the user to deal with slightly different types.

For the first warning the code will probably just chop off part of the string, and for the second one it would either crash or store useless data. And, being realistic, warnings are not usually checked...

@mmkekic (Collaborator, Author) commented Mar 18, 2020:

I think I addressed all previous comments. @gonzaponte, if you are happy with the code, shall I clean up the history?

@gonzaponte (Collaborator) left a comment:

After these cosmetic changes we are good to go.

Comment on lines 42 to 55

    def _check_castability(arr : np.ndarray, table_types : np.dtype):
        arr_types = arr.dtype
        if set(arr_types.names) != set(table_types.names):
            raise TableMismatch(f'dataframe differs from already existing table structure')
        for name in arr_types.names:
            if arr_types[name].name == 'object':
                max_str_length = max(map(len, arr[name]))
                if max_str_length > table_types[name].itemsize:
                    warnings.warn(f'dataframe contains strings longer than allowed', UserWarning)
            elif not np.can_cast(arr_types[name], table_types[name], casting='same_kind'):
                raise TableMismatch(f'dataframe numeric types not consistent with the table existing ones')

@gonzaponte (Collaborator):

Suggested change: realign the whitespace in the block above.

Helps readability.



    @given(df=dataframe)
    def test_store_pandas_as_tables_raises_warning_inconsistent_types(config_tmpdir, df):

@gonzaponte (Collaborator):

Suggested change

    -def test_store_pandas_as_tables_raises_warning_inconsistent_types(config_tmpdir, df):
    +def test_store_pandas_as_tables_raises_TableMismatch_inconsistent_types(config_tmpdir, df):
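For illustration, the renamed test might look roughly like this; it is a sketch that assumes the dataframe strategy yields numeric-only columns and reuses the fixture names visible in this PR:

    @given(df=dataframe)
    def test_store_pandas_as_tables_raises_TableMismatch_inconsistent_types(config_tmpdir, df):
        filename = os.path.join(config_tmpdir, 'test_mismatch.h5')
        with tb.open_file(filename, 'w') as h5out:
            # The first write creates a table with integer columns ...
            store_pandas_as_tables(h5out, df.astype(int), 'group', 'table')
            # ... appending floats is not a 'same_kind' cast, so it must raise.
            with pytest.raises(TableMismatch):
                store_pandas_as_tables(h5out, df.astype(float), 'group', 'table')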

@mmkekic force-pushed the improve_df_to_tables_writer branch from 920472a to 1f95085 on March 19, 2020 10:56
@gonzaponte (Collaborator) left a comment:

This PR makes the function store_pandas_as_tables public, as it should be. It also improves the performance of the writer and adds type checks for robustness. A bunch of tests have been added, which is always good.

Good job!

mmkekic added 6 commits on March 19, 2020 16:16:

The function is not used only in this module, so it shouldn't be private.

The writing is done by appending a numpy record array to the pytable. It also adds checks of whether the type conversion is possible. Also removes False for type checking in assert_dataframes_close.

Add tests that check the warning is raised if the string is too long, and a test that checks a TableMismatch error is raised if the numeric types are different.

In case the new dataframe columns do not have the same order as the columns of the already existing table.
@mmkekic force-pushed the improve_df_to_tables_writer branch from 1f95085 to bf6085c on March 19, 2020 15:21
@bpalmeiro merged commit aac89ff into next-exp:master on Mar 19, 2020
@mmkekic deleted the improve_df_to_tables_writer branch on February 22, 2021