
Improve df to tables writer #709

Merged: 6 commits merged into next-exp:master from improve_df_to_tables_writer on Mar 19, 2020

Conversation

@mmkekic (Collaborator) commented Mar 12, 2020:

This PR removes the underscore in front of the store_pandas_as_tables writer, since the method is used outside the module. It also modifies the method to improve speed when writing large dataframes; no new tests are needed because it is just an internal change.
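The core of the speed improvement is writing the whole dataframe as one numpy record array instead of row by row. A minimal, self-contained sketch of that pattern (file, group, and variable names here are illustrative, not the module's actual API):

    import numpy  as np
    import pandas as pd
    import tables as tb

    # Build a small dataframe and convert it to a contiguous record array.
    df  = pd.DataFrame({'x': np.arange(3), 'y': np.linspace(0., 1., 3)})
    arr = df.to_records(index=False)

    with tb.open_file('example.h5', 'w') as h5out:
        group = h5out.create_group(h5out.root, 'data')
        table = h5out.create_table(group, 'table', obj=arr)  # create and fill
        table.append(arr)   # appending further rows is a single bulk call
        table.flush()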

@mmkekic requested a review from gonzaponte on March 12, 2020 18:07
@gonzaponte (Collaborator) left a comment:

I think the tests are also checking that the types remain unchanged after writing & reading, right? I would like to have a specific one, but it's ok as is.

Comment on lines 64 to 65

    data_types = [(col, arr[col].dtype if arr[col].dtype != 'O' else \
                        f'S{str_col_length}') for col in arr.dtype.names]
@gonzaponte (Collaborator):

It's a bit hard to read. Can we define a function (even inline) to do that?
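One way such a helper might look (a sketch; _column_dtype is a hypothetical name):

    # Map object columns to fixed-width strings; keep other dtypes as-is.
    def _column_dtype(arr, col, str_col_length):
        dtype = arr[col].dtype
        return dtype if dtype != 'O' else f'S{str_col_length}'

    data_types = [(col, _column_dtype(arr, col, str_col_length))
                  for col in arr.dtype.names]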

@mmkekic (Collaborator, Author):

How about now?

@mmkekic (Collaborator, Author) commented Mar 13, 2020:

> I think the tests are also checking that the types remain unchanged after writing & reading, right? I would like to have a specific one, but it's ok as is.

I just noticed that the type check was set to False in the tests; I changed that.

Comment on lines 32 to 40

    -def _make_tabledef(column_types : pd.Series, str_col_length : int=32) -> dict:
    +def _make_tabledef(df : pd.DataFrame, str_col_length : int=32) -> dict:
         column_types = df.dtypes
         tabledef = {}
         for indx, colname in enumerate(column_types.index):
             coltype = column_types[colname].name
             if coltype == 'object':
                 if df[colname].str.len().max() > str_col_length:
                     warnings.warn(f'dataframe contains strings longer than allowed', UserWarning)
                 tabledef[colname] = tb.StringCol(str_col_length, pos=indx)
             else:
                 tabledef[colname] = tb.Col.from_type(coltype, pos=indx)
         return tabledef
@gonzaponte (Collaborator):

Actually I don't think _make_tabledef cares about the length of the object. I would revert this back to the original version and perform the check in store_pandas_as_tables. What do you think?
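For reference, the reverted version would look roughly like the left-hand side of the diff above, with the length check removed (a reconstruction from the snippet, not the final committed code):

    def _make_tabledef(column_types : pd.Series, str_col_length : int=32) -> dict:
        # Build a pytables description from pandas dtypes; object columns
        # become fixed-width strings, everything else maps directly.
        tabledef = {}
        for indx, colname in enumerate(column_types.index):
            coltype = column_types[colname].name
            if coltype == 'object':
                tabledef[colname] = tb.StringCol(str_col_length, pos=indx)
            else:
                tabledef[colname] = tb.Col.from_type(coltype, pos=indx)
        return tabledef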

@mmkekic (Collaborator, Author):

Agree. I added a check in store_pandas_as_tables directly. I also modified it to cast the numpy record array to the table type when the table already exists. However, I feel like this 3x search for strings and type casting could be better optimized...
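A rough sketch of that cast (hypothetical variable names; it assumes the dataframe columns come in the same order as the existing table's columns):

    # Coerce the record array to the existing table's compound dtype
    # before appending, so the stored types stay consistent.
    table = h5out.get_node(group, table_name)
    arr   = df.to_records(index=False).astype(table.dtype)
    table.append(arr)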


    arr = df.to_records(index=False)
    # hack to transform object numpy dtype to string
    def _cast_type(dtype : np.dtype):
@gonzaponte (Collaborator):

I understand where the underscore comes from, but since it is inside a function it doesn't need to have special syntax :)

    table_name = 'table_name_3'
    with tb.open_file(filename, 'w') as h5out:
        with pytest.warns(UserWarning, match='dataframe contains strings longer than allowed'):
            store_pandas_as_tables(h5out, df, group_name, table_name)
@gonzaponte (Collaborator):

I suggest that here you explicitly set the string column length in the call to store_pandas_as_tables, to make the test more obvious.
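Concretely, that amounts to something like the following, where str_col_length=4 is an arbitrary illustrative limit:

    table_name = 'table_name_3'
    with tb.open_file(filename, 'w') as h5out:
        with pytest.warns(UserWarning, match='dataframe contains strings longer than allowed'):
            # The limit being exceeded is now visible in the test itself.
            store_pandas_as_tables(h5out, df, group_name, table_name, str_col_length=4)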

    -assert_dataframes_close(df_read, pd.concat([df1, df2]).reset_index(drop=True), False, rtol=1e-5)
    +assert_dataframes_equal(df_read, pd.concat([df1, df2]).reset_index(drop=True))
@gonzaponte (Collaborator):

These dataframes contain floats; assert_dataframes_close is more appropriate.
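A tiny standalone illustration of why exact equality is fragile for floats after a dtype round trip (values chosen purely for illustration):

    import numpy as np

    a = np.float32(0.1)                 # value after a float32 round trip
    b = 0.1                             # original float64 value
    assert a != b                       # exact comparison fails
    assert np.isclose(a, b, rtol=1e-5)  # tolerance-based comparison passes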

Comment on lines 75 to 77

    else:
        if arr.dtype[colname] != data_types[colname]:
            warnings.warn(f'dataframe numeric types not consistent with the table existing ones', UserWarning)
@gonzaponte (Collaborator):

I think this part belongs to the case in which there is already a table object stored. It doesn't make sense to check it for both cases
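A sketch of that branching (it borrows names from later in this PR and assumes the group already exists):

    try:
        # Existing table: validate the incoming types against it.
        table = h5out.get_node('/' + group_name, table_name)
        _check_castability(arr, table.dtype)
    except tb.NoSuchNodeError:
        # New table: nothing to validate against, just build the description.
        tabledef = _make_tabledef(df.dtypes, str_col_length)
        table    = h5out.create_table('/' + group_name, table_name, tabledef)
    table.append(arr)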

Comment on lines 71 to 72

    for colname in df.columns:
        if (df[colname].dtype.name == 'object'):
@gonzaponte (Collaborator):

Maybe this could be written as

    for colname, col in filter(lambda pair: pair[1].dtype.name == "object", df.items()):
        if col.str.len().max() > data_types[colname].itemsize:
            warnings.warn(...)

or

    for colname, col in df.items():
        if col.dtype.name == "object" and col.str.len().max() > data_types[colname].itemsize:
            warnings.warn(...)

@mmkekic (Collaborator, Author):

Right... Let me try to rewrite the whole function in a more sensible way...

    if len(arr) == 0:
        warnings.warn(f'dataframe is empty', UserWarning)
    else:
        _can_cast(arr, data_types)
@gonzaponte (Collaborator):

I suggest a better name like _check_castability or something like that. The use of "can" suggests that it returns a boolean value.

        warnings.warn(f'dataframe numeric types not consistent with the table existing ones', UserWarning)


    def store_pandas_as_tables(h5out : tb.file.File, df : pd.DataFrame, group_name : str, table_name : str, compression : str='ZLIB4', descriptive_string : [str]="", str_col_length : int=32) -> None:
@gonzaponte (Collaborator):

I know it was already like that, but can we take the opportunity to split this across several lines for readability?
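For instance, one parameter per line (a sketch, writing the odd [str] annotation as plain str):

    def store_pandas_as_tables(h5out              : tb.file.File,
                               df                 : pd.DataFrame,
                               group_name         : str,
                               table_name         : str,
                               compression        : str = 'ZLIB4',
                               descriptive_string : str = "",
                               str_col_length     : int = 32) -> None:
        ...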

Comment on lines 50 to 52

        warnings.warn(f'dataframe contains strings longer than allowed', UserWarning)
    elif arr_types[name] != table_types[name]:
        warnings.warn(f'dataframe numeric types not consistent with the table existing ones', UserWarning)
@gonzaponte (Collaborator):

Shouldn't these be actual errors rather than just warnings?

@mmkekic (Collaborator, Author):

I am not sure. We probably do want an error if someone tries to store a string in a numeric column, but we do not want to stop the process if someone tries to put an Int32 value into an Int64 column... And writing out all the combinations that make sense seems a bit tedious, so I opted to leave it to the user's responsibility to check the warning.
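This middle ground is what numpy's can_cast with casting='same_kind' captures, which is what the final version of the code ends up using (a small self-contained illustration):

    import numpy as np

    # Widening int32 -> int64 is a benign, same-kind cast.
    assert np.can_cast(np.int32, np.int64, casting='same_kind')
    # A string into a numeric column is rejected.
    assert not np.can_cast(np.dtype('S8'), np.int64, casting='same_kind')
    # float -> int changes kind, so it is rejected too.
    assert not np.can_cast(np.float64, np.int64, casting='same_kind')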

@gonzaponte (Collaborator):

Mmm, if you think about the cities, that shouldn't happen, as the types should be consistent across iterations. If you think about other usages, I don't see that it is a big deal for the user to deal with slightly different types.

For the first warning the code will probably just chop off part of the string, and for the second one it would either crash or store useless data. And, being realistic, warnings are not usually checked...

@mmkekic (Collaborator, Author) commented Mar 18, 2020:

I think I addressed all previous comments. @gonzaponte, if you are happy with the code, shall I clean up the history?

@gonzaponte (Collaborator) left a comment:

After these cosmetic changes we are good to go.

Comment on lines 42 to 55

    def _check_castability(arr : np.ndarray, table_types : np.dtype):
        arr_types = arr.dtype
        if set(arr_types.names) != set(table_types.names):
            raise TableMismatch(f'dataframe differs from already existing table structure')
        for name in arr_types.names:
            if arr_types[name].name == 'object':
                max_str_length = max(map(len, arr[name]))
                if max_str_length > table_types[name].itemsize:
                    warnings.warn(f'dataframe contains strings longer than allowed', UserWarning)
            elif not np.can_cast(arr_types[name], table_types[name], casting='same_kind'):
                raise TableMismatch(f'dataframe numeric types not consistent with the table existing ones')

@gonzaponte (Collaborator):

Suggested change: realign the whitespace in the block above.

Helps readability.



    @given(df=dataframe)
    def test_store_pandas_as_tables_raises_warning_inconsistent_types(config_tmpdir, df):

@gonzaponte (Collaborator):

Suggested change

    -def test_store_pandas_as_tables_raises_warning_inconsistent_types(config_tmpdir, df):
    +def test_store_pandas_as_tables_raises_TableMismatch_inconsistent_types(config_tmpdir, df):
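For illustration, the renamed test might look roughly like this; it is a sketch that assumes the dataframe strategy yields numeric-only columns and reuses the fixture names visible in this PR:

    @given(df=dataframe)
    def test_store_pandas_as_tables_raises_TableMismatch_inconsistent_types(config_tmpdir, df):
        filename = os.path.join(config_tmpdir, 'test_mismatch.h5')
        with tb.open_file(filename, 'w') as h5out:
            # The first write creates a table with integer columns ...
            store_pandas_as_tables(h5out, df.astype(int), 'group', 'table')
            # ... appending floats is not a 'same_kind' cast, so it must raise.
            with pytest.raises(TableMismatch):
                store_pandas_as_tables(h5out, df.astype(float), 'group', 'table')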

@mmkekic force-pushed the improve_df_to_tables_writer branch from 920472a to 1f95085 on March 19, 2020 10:56
@gonzaponte (Collaborator) left a comment:

This PR makes the function store_pandas_as_tables public, as it should be. It also improves the performance of the writer and adds type checks for robustness. A bunch of tests have been added, which is always good.

Good job!

mmkekic added 6 commits on March 19, 2020 16:16:

The function is not used only in this module, so it shouldn't be private.

The writing is done by appending a numpy record array to the pytable. It also adds checks of whether the type conversion is possible. Also removes False for type checking in assert_dataframes_close.

Add tests that check the warning is raised if the string is too long, and a test that checks a TableMismatch error is raised if the numeric types are different.

In case the new dataframe columns do not have the same order as the columns of the already existing table.
@mmkekic force-pushed the improve_df_to_tables_writer branch from 1f95085 to bf6085c on March 19, 2020 15:21
@bpalmeiro merged commit aac89ff into next-exp:master on Mar 19, 2020
@mmkekic deleted the improve_df_to_tables_writer branch on February 22, 2021