Incorrect batches/chunks of DataFrames #1

Open
yxietruth opened this issue Aug 24, 2017 · 0 comments

Hi Donald, thank you for making this package available! I've found it useful.

I think there is an issue with the method below:

def df_chunks(self, df):
        chunks = list()
        n_chunks = len(df) // self.batch_size + 1
        for i in range(n_chunks):
            chunks.append(df.loc[i*self.batch_size:(i+1)*self.batch_size,:])
        return chunks

According to https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html, label-based DataFrame slices using .loc include both the start and the stop label. With the default integer index, your iteration therefore produces slices where each slice's start duplicates the previous slice's stop, and every slice holds one more row than intended. For example, if batch_size = 2 and n_chunks = 2, the slices are [0 : 2, : ] and [2 : 4, : ]. They should be [0 : 1, : ] and [2 : 3, : ].
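
To make the overlap concrete, here is a minimal sketch. The five-row frame and the column name x are made up for illustration, and it assumes the default RangeIndex, where labels and positions coincide:

```python
import pandas as pd

# Hypothetical 5-row frame with the default RangeIndex (labels 0..4).
df = pd.DataFrame({"x": range(5)})
batch_size = 2

# Same slice bounds as the original df_chunks for i = 0 and i = 1.
# Label-based .loc slicing includes both endpoints.
first = df.loc[0 * batch_size:1 * batch_size, :]
second = df.loc[1 * batch_size:2 * batch_size, :]

print(list(first.index))   # [0, 1, 2]
print(list(second.index))  # [2, 3, 4]
```

Row 2 lands in both chunks, and each chunk has three rows instead of two.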

Furthermore, n_chunks is one too large whenever len(df) is an exact multiple of batch_size, which is always the case for batch_size = 1. For example, if len(df) = 2000 and batch_size = 1, your code gives n_chunks = 2001 when it should be 2000.
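
If n_chunks were kept, one way to compute it would be ceiling division. The sketch below compares the two formulas; the helper names are hypothetical and not part of the package:

```python
def n_chunks_original(n_rows, batch_size):
    # Formula from the current df_chunks.
    return n_rows // batch_size + 1


def n_chunks_ceil(n_rows, batch_size):
    # Ceiling division using integer arithmetic only.
    return -(-n_rows // batch_size)


print(n_chunks_original(2000, 1), n_chunks_ceil(2000, 1))      # 2001 2000
print(n_chunks_original(2000, 200), n_chunks_ceil(2000, 200))  # 11 10
print(n_chunks_original(2001, 200), n_chunks_ceil(2001, 200))  # 11 11
```

The two formulas agree only when len(df) is not a multiple of batch_size.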

I think I corrected the issues above with the following changes:

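def df_chunks(self, df):
        chunks = list()
        # Step through the start labels 0, batch_size, 2*batch_size, ...
        for i in range(0, len(df), self.batch_size):
            # .loc includes the stop label, so subtract 1 to get batch_size rows
            chunks.append(df.loc[i : i + (self.batch_size - 1), : ])
        return chunks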

The above does not use n_chunks. Note that the stop label of the last slice can exceed the last label in df, but this is fine: with the default (sorted) integer index, pandas simply returns the rows up to the end of the DataFrame.
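
As a quick check, here is a sketch of the proposed fix written as a standalone function, next to a positional variant using .iloc. The free-function form, the function names, and the sample frames are assumptions for illustration. The .iloc variant is only a possible alternative: it is half-open like ordinary Python slicing and does not depend on the index labels.

```python
import pandas as pd


def df_chunks_loc(df, batch_size):
    # Proposed fix: label-based, inclusive stop, so subtract 1.
    return [df.loc[i:i + batch_size - 1, :]
            for i in range(0, len(df), batch_size)]


def df_chunks_iloc(df, batch_size):
    # Positional alternative: half-open slices, independent of labels.
    return [df.iloc[i:i + batch_size]
            for i in range(0, len(df), batch_size)]


df = pd.DataFrame({"x": range(5)})  # default RangeIndex 0..4
sizes_loc = [len(c) for c in df_chunks_loc(df, 2)]
sizes_iloc = [len(c) for c in df_chunks_iloc(df, 2)]
print(sizes_loc, sizes_iloc)  # [2, 2, 1] [2, 2, 1]

# A filtered frame keeps its old labels (2, 3, 4), so label-based
# chunking no longer lines up with row positions.
filtered = df[df["x"] >= 2]
filtered_loc = [len(c) for c in df_chunks_loc(filtered, 2)]
filtered_iloc = [len(c) for c in df_chunks_iloc(filtered, 2)]
print(filtered_loc, filtered_iloc)  # [0, 2] [2, 1]
```

With the default index both versions give the same chunks, and the last chunk is simply shorter. If callers may pass a frame whose index is not 0..len(df)-1, the .loc version can drop rows, so either resetting the index first or slicing by position would be safer.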

Let me know if you agree with my findings and proposed changes. Thank you again!

justmaxfield added a commit to justmaxfield/sfdc-bulk that referenced this issue Mar 26, 2018
change df_chunks as per donaldrauscher#1