Hi Donald, thank you for making this package available! I've found it useful.
I think there is an issue with the method below:
    def df_chunks(self, df):
        chunks = list()
        n_chunks = len(df) // self.batch_size + 1
        for i in range(n_chunks):
            chunks.append(df.loc[i*self.batch_size:(i+1)*self.batch_size,:])
        return chunks
According to https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html, DataFrame slices made with .loc include both the start and the stop label, so each slice holds batch_size + 1 rows, and the first row of each slice duplicates the last row of the previous one. For example, with batch_size = 2 and n_chunks = 2, the slices are [0 : 2, : ] and [2 : 4, : ]. They should be [0 : 1, : ] and [2 : 3, : ].
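The overlap can be reproduced outside the package. This is only a minimal sketch: the five-row frame and the standalone batch_size variable are illustrative assumptions, and the frame uses the default integer index (0, 1, 2, ...).

```python
import pandas as pd

# Toy frame with the default RangeIndex 0..4 (illustrative data only).
df = pd.DataFrame({"x": range(5)})
batch_size = 2

# Slices built the same way as in the method above: .loc includes the stop label.
first = df.loc[0 * batch_size:1 * batch_size, :]
second = df.loc[1 * batch_size:2 * batch_size, :]

print(len(first))             # 3 rows instead of the intended 2
print(first.index.tolist())   # [0, 1, 2]
print(second.index.tolist())  # [2, 3, 4], so row 2 appears in both slices
```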
Furthermore, n_chunks is one too many whenever len(df) is an exact multiple of batch_size, which is always the case for batch_size = 1. For example, if len(df) = 2000 and batch_size = 1, your code yields n_chunks = 2001 when it should be 2000.
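The arithmetic can be checked in isolation. In this sketch the two helper names are hypothetical and not part of the package; the second uses integer ceiling division as one possible way to get the intended count.

```python
def n_chunks_original(n_rows, batch_size):
    # Formula used by the method in question.
    return n_rows // batch_size + 1


def n_chunks_ceil(n_rows, batch_size):
    # Ceiling division on integers, without going through floats.
    return -(-n_rows // batch_size)


print(n_chunks_original(2000, 1), n_chunks_ceil(2000, 1))      # 2001 2000
print(n_chunks_original(2000, 500), n_chunks_ceil(2000, 500))  # 5 4
print(n_chunks_original(2001, 500), n_chunks_ceil(2001, 500))  # 5 5
```

The two formulas agree whenever len(df) is not a multiple of batch_size, and differ by one when it is.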
I think I corrected the issues above with the following changes:
    def df_chunks(self, df):
        chunks = list()
        for i in range(0, len(df), self.batch_size):
            chunks.append(df.loc[i : i + (self.batch_size - 1), : ])
        return chunks
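The above does not use n_chunks. Note that the stop label of the last slice can exceed the last index of df, but this is fine because pandas only slices up to the end of the frame. This assumes df has the default integer index (0, 1, 2, ...), as the original method also does.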
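If the frame might not carry the default integer index, a position-based variant is one alternative. This is a sketch only, not the package's code: df_chunks_iloc is a hypothetical standalone function that takes batch_size as an argument instead of reading self.batch_size, and it swaps .loc for .iloc, which slices by position and excludes the stop.

```python
import pandas as pd


def df_chunks_iloc(df, batch_size):
    """Split df into consecutive chunks of at most batch_size rows."""
    # .iloc is position-based and excludes the stop position,
    # so no "- 1" adjustment is needed and index labels do not matter.
    return [df.iloc[i:i + batch_size] for i in range(0, len(df), batch_size)]


# Illustrative frame with non-default labels.
df = pd.DataFrame({"x": range(5)}, index=[10, 20, 30, 40, 50])
chunks = df_chunks_iloc(df, 2)
print([len(c) for c in chunks])  # [2, 2, 1]
```

Concatenating the chunks gives back the original rows in order, with no duplicates.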
Let me know if you agree with my findings and proposed changes. Thank you again!