Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in buffer handling in readlines() function for \r\n new line type #22430

Closed
pszufe opened this issue Jun 18, 2017 · 10 comments
Closed

Bug in buffer handling in readlines() function for \r\n new line type #22430

pszufe opened this issue Jun 18, 2017 · 10 comments
Labels
bug Indicates an unexpected problem or unintended behavior io Involving the I/O subsystem: libuv, read, write, etc.
Milestone

Comments

@pszufe
Copy link

pszufe commented Jun 18, 2017

There some buffering error in how windows new line characters are being handled by the readlines() function.

Please consider the following code:

stream = open("foo.txt", "w")
write(stream, "a")
for i = 1:2^17
  write(stream, "ab\r\n")
end
close(stream)

lineNo = 0
for line in readlines("foo.txt")
   lineNo += 1;
   if '\r' in line
       println("line ",lineNo," contains \\r character")
   end
end

And the output:

line 32768 contains \r character
line 65536 contains \r character
line 98304 contains \r character
line 131072 contains \r character

Seems like improper handling of the last character in the buffer.

Tested on Julia 0.6.0-rc3 and 0.7.0-DEV.636

@bkamins
Copy link
Member

bkamins commented Jun 18, 2017

Here is a minimal working example catching the problem (and it can be simply changed to a test set covering all possibilities of end line types and positions):

fname = "readlines_test_file_0988429387.txt"
for dochomp in [true, false], endline in ["\n", "\r\n"], i in -5:5
    ref = ("1"^(2^17-i))*endline
    open(fname, "w") do io
        write(io, ref)
    end
    x = readlines(fname, chomp = dochomp)
    if dochomp
        ref = chomp(ref)
    end
    if ref != x[1]
        println("\ni: $i, dochomp: $dochomp, endline: ", collect(endline))
        println("  ", collect(x[1][end-4:end]))
    end
end

It is probably caused by the code starting in this line:

julia/src/sys.c

Line 257 in 96b28d6

char *pd = (char*)memchr(s->buf+s->bpos, delim, (size_t)(s->size - s->bpos));

when \r is a last character in a buffer at position IOS_BUFSIZE.

@StefanKarpinski StefanKarpinski added bug Indicates an unexpected problem or unintended behavior io Involving the I/O subsystem: libuv, read, write, etc. labels Jun 18, 2017
@StefanKarpinski StefanKarpinski added this to the 0.6.x milestone Jun 18, 2017
@nalimilan
Copy link
Member

@bkamins Do you feel like making a PR?

Cf. #19944 and #20203. Cc: @mpastell

@bkamins
Copy link
Member

bkamins commented Jun 19, 2017

@nalimilan Unfortunately no as I have no experience with modifying C sources of Julia.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jun 19, 2017

Unfortunately no as I have no experience with modifying C sources of Julia.

There's a first time for everything 😀

@bkamins
Copy link
Member

bkamins commented Jun 19, 2017

I can tell you what I guess is wrong and how it could be fixed, but I do not feel qualified enough to make a PR.

What is probably wrong:
If \r is last character in a buffer then in line
https://github.com/JuliaLang/julia/pull/20203/files#diff-6272fc3febeea948f41d4a1261fef524L815
pd is NULL
and then write in line
https://github.com/JuliaLang/julia/pull/20203/files#diff-6272fc3febeea948f41d4a1261fef524L817
happens.
Then next chunk is read in while loop and it contains only \n, which gets chomped, but \r is already written and in line
https://github.com/JuliaLang/julia/pull/20203/files#diff-6272fc3febeea948f41d4a1261fef524R842
ntowrite is 1.

The simplest fix would be to after this line (but I am not sure how to safely do it without risk of breaking something else):
https://github.com/JuliaLang/julia/pull/20203/files#diff-6272fc3febeea948f41d4a1261fef524R828
if written is 0 and
nchomp is 1 and
to buffer is not empty and
last byte in to buffer is \r then
last byte in to buffer should be removed.

@bkamins
Copy link
Member

bkamins commented Jun 24, 2017

If I understand it correctly ios_trunc can fail (with return code 1), so I would propose:

  1. to remove chomp option from ios_copyuntil
  2. change in line

    julia/src/sys.c

    Line 278 in 96b28d6

    size_t n = ios_copyuntil(&dest, s, delim, chomp);

from:

size_t n = ios_copyuntil(&dest, s, delim, chomp);

to (assuming chomp is removed from ios_copyuntil):

size_t n = ios_copyuntil(&dest, s, delim);
if (n > 0 && dest->buf[n-1] == '\n') {
    n--;
    if (n > 1 && dest->buf[n-2] == '\r') {
        n--;
    }
    assert(ios_trunc(dest, n) == 0); // it should always be possible to truncate dest
}

Unfortunately I am unable to test it (I have failed to successfully build Julia head under Win and under Ubuntu). @pszufe Could you try checking this proposal?

@nalimilan
Copy link
Member

@bkamins If you think that fix would work, you can make a pull request without building Julia locally, adding new test that would previously fail, and see whether the build and tests pass on Travis and AppVeyor. In general that's a risky strategy (the build may fail just because you forgot to add a parenthesis), but for localized fixes like this it could be OK.

Are you sure your proposal won't hurt performance, though? Not saying it will, but that's something to consider.

@bkamins
Copy link
Member

bkamins commented Jun 24, 2017

I was working it out in my head so I do not want to make a push (yesterday I made a small proposal #22496 without properly running all checks and I had to correct it). I have talked with @pszufe and he told me that he would help to setup a proper environment and test the correctness and performance.

I believe that performance should not be degraded as there are only two assignments to bpos and size in ios_trunc, and the benefit is that it is a simple solution. Correcting ios_copyuntil would either require ios_trunc anyway (and ios_copyuntil can be potentially used in different contexts without guarantee that to->bm == bm_mem so ios_trunc could fail) or much more messy checking if last byte written in one chunk is not \r and delaying adding or not adding it to a buffer until the next chunk is read.

@bkamins
Copy link
Member

bkamins commented Jul 2, 2017

Does #22621 seem acceptable?

@ViralBShah
Copy link
Member

Closed in #22621

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior io Involving the I/O subsystem: libuv, read, write, etc.
Projects
None yet
Development

No branches or pull requests

5 participants