Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1756: [Python] Fix large file read/write error #1276

Closed

Conversation

Licht-T
Copy link
Contributor

@Licht-T Licht-T commented Nov 1, 2017

This is the part of ARROW-1756.

@wesm wesm changed the title ARROW-1756: [Python]Fix large file read/write error on Mac ARROW-1756: [Python] Fix large file read/write error on Mac Nov 2, 2017
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this issue limited to Mac? I thought the bug reporter was on Linux, but either way this is a good thing to fix

@@ -246,31 +246,55 @@ static inline Status FileRead(int fd, uint8_t* buffer, int64_t nbytes,
}
*bytes_read = static_cast<int64_t>(_read(fd, buffer, static_cast<uint32_t>(nbytes)));
#else
*bytes_read = static_cast<int64_t>(read(fd, buffer, static_cast<size_t>(nbytes)));
int64_t n = 0;
int64_t N = nbytes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since N is never changed, just use nbytes? Maybe make it const int64_t nbytes in the function arguments for good measure


*bytes_read = 0;

while (ret != -1 && n < N) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use *bytes_read instead of n?

*bytes_read = 0;

while (ret != -1 && n < N) {
int64_t i = std::min(static_cast<int64_t>(INT32_MAX), N - n);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a more descriptive variable name than i

while (ret != -1 && n < N) {
int64_t i = std::min(static_cast<int64_t>(INT32_MAX), N - n);
ret = static_cast<int>(write(fd, buffer + n, static_cast<size_t>(i)));
n += i;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comments as above.


@pytest.mark.slow
def test_large_dataframe(self):
df = pd.DataFrame(np.random.randint(0, 100, size=(400000000, 1)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any data will do here, so you could do np.arange here to make this faster rather than generating so much random data

@pytest.mark.slow
def test_large_dataframe(self):
df = pd.DataFrame(np.random.randint(0, 100, size=(400000000, 1)),
columns=list('A'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

['A']

@wesm
Copy link
Member

wesm commented Nov 2, 2017

Thanks for looking at this

@Licht-T
Copy link
Contributor Author

Licht-T commented Nov 3, 2017

@wesm I checked on Linux, but this only fixes the problem on Mac.
I am also trying to find the cause of the Linux issue.

@Licht-T
Copy link
Contributor Author

Licht-T commented Nov 3, 2017

@wesm I found the reason why.
I'll fix.

http://man7.org/linux/man-pages/man2/write.2.html

On Linux, write() (and similar system calls) will transfer at most 0x7ffff000 (2,147,479,552) bytes

@wesm
Copy link
Member

wesm commented Nov 3, 2017

OK, thanks. I'll wait on you to address the code comments and then we can merge this

@Licht-T Licht-T force-pushed the fix-large-file-read-write-error branch from 3b5782a to 81c1972 Compare November 4, 2017 11:49
@Licht-T Licht-T changed the title ARROW-1756: [Python] Fix large file read/write error on Mac ARROW-1756: [Python] Fix large file read/write error Nov 4, 2017
@Licht-T
Copy link
Contributor Author

Licht-T commented Nov 4, 2017

@wesm Fixed. Now it works fine on Linux!

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks for taking a deep dive into those internals.

Will merge once the build is green.

@Licht-T
Copy link
Contributor Author

Licht-T commented Nov 4, 2017

@xhochy I want to solve the Windows issue as same.
Should I make a JIRA issue?

@xhochy
Copy link
Member

xhochy commented Nov 4, 2017

@Licht-T yes, please open a JIRA.

I would merge this PR once Travis is happy but sadly C++ builds keep timeouting on Travis. I will re-trigger later again and merge on success.

#define ARROW_IO_MAX_CONUT INT32_MAX
#else
// see notes on Linux read/write manpage
#define ARROW_IO_MAX_CONUT 0x7ffff000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typos in these defines, I will fix

*bytes_read = 0;

while (*bytes_read != -1 && *bytes_read < nbytes) {
int64_t i = std::min(static_cast<int64_t>(ARROW_IO_MAX_CONUT), nbytes - *bytes_read);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use descriptive variable names. I will fix

Change-Id: I14967fb34936b61bc1fd636b702888e95a332765
@wesm
Copy link
Member

wesm commented Nov 4, 2017

This is infinite-looping on Linux, debugging

Change-Id: I00ee07fdc0d25f399610db49b2adfa65088e9d63
@wesm
Copy link
Member

wesm commented Nov 4, 2017

+1

@wesm wesm closed this in 62190d7 Nov 4, 2017
@wesm
Copy link
Member

wesm commented Nov 4, 2017

thanks @Licht-T!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants