Incorrect batches/chunks of DataFrames #1

yxietruth · 2017-08-24T02:44:28Z

Hi Donald, thank you for making this package available! I've found it useful.

I think there is an issue with the method below:

def df_chunks(self, df):
    chunks = list()
    n_chunks = len(df) // self.batch_size + 1
    for i in range(n_chunks):
        chunks.append(df.loc[i*self.batch_size:(i+1)*self.batch_size,:])
    return chunks

According to https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html, label slices with .loc include both the start and the stop label. As a result, each chunk in your loop begins with the same row that ended the previous chunk, and every chunk holds one row more than intended. For example, with batch_size = 2 and n_chunks = 2, the slices are [0 : 2, : ] and [2 : 4, : ], when they should be [0 : 1, : ] and [2 : 3, : ].
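To make the overlap concrete, here is a small sketch that reproduces the original slicing outside the class. The five-row frame and the column name x are made up for illustration, and it assumes the default integer index (labels 0 to 4).

```python
import pandas as pd

# Hypothetical 5-row frame with the default RangeIndex (labels 0..4).
df = pd.DataFrame({"x": [10, 11, 12, 13, 14]})
batch_size = 2

# Same slicing as the original method, for i = 0 and i = 1.
# .loc is label-based, so both the start and the stop label are included.
first = df.loc[0 * batch_size:1 * batch_size, :]
second = df.loc[1 * batch_size:2 * batch_size, :]

print(list(first.index))   # [0, 1, 2]
print(list(second.index))  # [2, 3, 4]
```

Row 2 shows up in both chunks, and each chunk has three rows instead of two.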

Furthermore, n_chunks is one too many whenever len(df) is an exact multiple of batch_size; batch_size = 1 is the simplest case. For example, if len(df) = 2000 and batch_size = 1, your code gives n_chunks = 2001 when it should be 2000.
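A quick way to check the count is to compare the original formula with a ceiling division. This is only a sketch; the two helper names are hypothetical and not part of the package.

```python
def n_chunks_original(n_rows, batch_size):
    # Formula used in the original method.
    return n_rows // batch_size + 1


def n_chunks_ceil(n_rows, batch_size):
    # Integer ceiling division: rounds up only when there is a remainder.
    return -(-n_rows // batch_size)


cases = [(2000, 1), (2000, 500), (2001, 500)]
results = [(n, b, n_chunks_original(n, b), n_chunks_ceil(n, b)) for n, b in cases]
for row in results:
    print(row)
# (2000, 1, 2001, 2000)
# (2000, 500, 5, 4)
# (2001, 500, 5, 5)
```

The two formulas agree only when len(df) is not a multiple of batch_size.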

I think I corrected the issues above with the following changes:

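def df_chunks(self, df):
    chunks = list()
    # Step through the rows in strides of batch_size.
    for i in range(0, len(df), self.batch_size):
        # .loc includes the stop label, so subtract 1 to get batch_size rows.
        chunks.append(df.loc[i : i + (self.batch_size - 1), : ])
    return chunks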

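The above does not use n_chunks. Note that the stop label can exceed the last label in df, but this is fine because a .loc slice simply stops at the last existing row. Like the original, this relies on df having the default integer index (0 to len(df) - 1), since .loc selects by label rather than by position.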
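As a sanity check on the proposed change, here is a standalone sketch that takes batch_size as an argument instead of reading it from self (my simplification for illustration). It also includes a position-based variant using .iloc, which is not what I proposed above but would avoid depending on the index labels; both function names are hypothetical.

```python
import pandas as pd


def df_chunks_loc(df, batch_size):
    # Label-based version of the proposed fix; assumes a default RangeIndex.
    return [df.loc[i:i + (batch_size - 1), :]
            for i in range(0, len(df), batch_size)]


def df_chunks_iloc(df, batch_size):
    # Position-based alternative: .iloc excludes the stop position,
    # so no "- 1" is needed and the index labels do not matter.
    return [df.iloc[i:i + batch_size]
            for i in range(0, len(df), batch_size)]


df = pd.DataFrame({"x": range(5)})
sizes_loc = [len(c) for c in df_chunks_loc(df, 2)]
sizes_iloc = [len(c) for c in df_chunks_iloc(df, 2)]
print(sizes_loc)   # [2, 2, 1]
print(sizes_iloc)  # [2, 2, 1]
```

Concatenating the chunks from either function gives back the original five rows with no duplicates.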

Let me know if you agree with my findings and proposed changes. Thank you again!

justmaxfield added a commit to justmaxfield/sfdc-bulk that referenced this issue Mar 26, 2018

Update api.py

f04e6a1

change df_chunks as per donaldrauscher#1
