Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make getindex for String check if indices are valid #22572

Merged
merged 4 commits into from
Sep 20, 2017

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jun 27, 2017

Make getindex for String check if indices are valid. See #22548 for discussion.

Benchmark code (new_getindex is the proposed implementation):

using BenchmarkTools

x = "a∀"
x[1:2]
new_getindex(x, 1:2)
for i in 1:2, j in i:2
    println("\nnew_getindex(x,$i:$j)")
    display(@benchmark new_getindex($x, $i:$j))
    println("\nx[$i:$j]")
    display(@benchmark $x[$i:$j])
end

produces:

new_getindex(x,1:1)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     24.726 ns (0.00% GC)
  median time:      26.126 ns (0.00% GC)
  mean time:        30.946 ns (8.76% GC)
  maximum time:     2.997 μs (96.86% GC)
  --------------
  samples:          10000
  evals/sample:     1000

x[1:1]
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     24.726 ns (0.00% GC)
  median time:      26.592 ns (0.00% GC)
  mean time:        30.430 ns (8.19% GC)
  maximum time:     1.952 μs (95.27% GC)
  --------------
  samples:          10000
  evals/sample:     1000

new_getindex(x,1:2)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     27.058 ns (0.00% GC)
  median time:      28.459 ns (0.00% GC)
  mean time:        33.833 ns (7.45% GC)
  maximum time:     2.164 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

x[1:2]
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     27.058 ns (0.00% GC)
  median time:      28.925 ns (0.00% GC)
  mean time:        33.668 ns (7.92% GC)
  maximum time:     2.864 μs (96.87% GC)
  --------------
  samples:          10000
  evals/sample:     1000

new_getindex(x,2:2)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     26.592 ns (0.00% GC)
  median time:      28.925 ns (0.00% GC)
  mean time:        33.125 ns (7.68% GC)
  maximum time:     2.035 μs (94.68% GC)
  --------------
  samples:          10000
  evals/sample:     1000

x[2:2]
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     25.660 ns (0.00% GC)
  median time:      27.992 ns (0.00% GC)
  mean time:        32.443 ns (7.73% GC)
  maximum time:     1.980 μs (95.83% GC)
  --------------
  samples:          10000
  evals/sample:     1000

@bkamins
Copy link
Member Author

bkamins commented Jun 27, 2017

Actually some library functions used indexing at invalid points. I will work on fixing it.

@kshyatt kshyatt added the strings "Strings!" label Jun 27, 2017
@bkamins
Copy link
Member Author

bkamins commented Jun 27, 2017

I hope I have fixed all positions in code where 'getindex' used non-strict bounds.
The only place I did not touch is REPL implementation. It uses getindex with wrong bounds in potentially many places, eg. in LineEdit lines 225, 408, 551; REPL.jl lines 284, 856 (maybe some more places I did not identify).

@nalimilan As REPL is a very sensitive part of Julia to mess with do you know whom to ask for advice how to proceed (which in the end might be to abandon this change)?

@nalimilan
Copy link
Member

Yeah, fixing the REPL code is going to be tedious. But don't worry too much, if you break it and the tests don't catch it, people will report it soon enough. :-) Anyway, it would be nice to clean this code from all undue assumptions about string indexing.

I guess for cases which are really too obscure, you could simply call previndex with the index which was previously used, to ensure you get a valid index. Then once everything works, we can see if we can find a cleaner solution, if needed by asking the people who wrote that code (according to git blame).

@inbounds si = codeunit(s, i)
if is_valid_continuation(si)
throw(UnicodeError(UTF_ERR_INVALID_INDEX, i, si))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better continue throwing a UnicodeError rather than a BoundsError, as the former gives more details about the problematic code point. This can be helpful if you import invalid Unicode data and want to understand what's the problem.

@bkamins
Copy link
Member Author

bkamins commented Jun 27, 2017

I have done blame and that is why I refrained from starting the fixes - the code was written by core Julia developers who probably understand the whole ecosystem way beyond my level.

prevind is what I actually first thought to use but actually one has to use something like (but optimized for speed):
isvalid(x, i) ? i : prevind(x, i)

Additionally my thinking was that if we have problems in REPL it is possible that many packages were developed also using this assumption (which is actually sometimes convenient - especially when you use byte buffers - you pass i and do not have to care if it is valid as it will be moved to a correct index), and then, a more reasonable approach communitywise would be to actually leave getindex as is, and fix other methods to be consistent with current functionality.

In summary - it is a breaking change and it might break a lot of code based on previous behavior. That is why I started having second thoughts.

@nalimilan
Copy link
Member

I have done blame and that is why I refrained from starting the fixes - the code was written by core Julia developers who probably understand the whole ecosystem way beyond my level.

It'd hard to contribute to a project if you feel intimidated when you see that the code has been written by core developers. :-)

I can't guarantee that the PR will be merged, so if you're afraid you could waste your time don't do it, but in general the best way to find out whether a change would be too breaking is to have it work in Base (which is admittedly the most complex and ancient Julia codebase around). Then if it looks acceptable it may be merged. If the only places to fix are the ones you listed, I would say that's pretty reasonable.

The ability to write i-1 sounds practical at first, but I wouldn't consider it as a good feature (and I think @stevengj agrees on that point) since it contributes to the habit of writing i+1 for the next index, which is incorrect. Then people don't necessarily tests their code on non-ASCII inputs, and it breaks in production. Imposing consistently strict rules is clearer and more robust. Even if it's a bit painful at first, overall it helps improving the quality of the codebase. (BTW, if you're not aware of it, see the related discussion at #9297.)

@ararslan ararslan changed the title Closes #22548 Make getindex for String check if indices are valid Jun 28, 2017
@nalimilan
Copy link
Member

Great, looks like you managed to pass the bootstrap phase! The remaining failures should be easier to fix.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, modulo a few comments. The PR is surprisingly small, which is a good sign that the stricter indexing should shouldn't break to much code.

Please add a mention of this breaking change in NEWS.md.

@@ -222,7 +222,7 @@ function refresh_multi_line(termbuf::TerminalBuffer, terminal::UnixTerminal, buf
# in this case, we haven't yet written the cursor position
line_pos -= slength # '\n' gets an extra pos
if line_pos < 0 || !moreinput
num_chars = (line_pos >= 0 ? llength : strwidth(l[1:(line_pos + slength)]))
num_chars = (line_pos >= 0 ? llength : strwidth(l[1:prevind(l, line_pos + slength+1)]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, should add spaces around +.

@@ -853,7 +853,7 @@ function setup_interface(
end
# Check if input line starts with "julia> ", remove it if we are in prompt paste mode
jl_prompt_len = 7
if (firstline || isprompt_paste) && (oldpos + jl_prompt_len <= sizeof(input) && input[oldpos:oldpos+jl_prompt_len-1] == JULIA_PROMPT)
if (firstline || isprompt_paste) && (oldpos + jl_prompt_len <= sizeof(input) && input[oldpos:prevind(input, oldpos+jl_prompt_len)] == JULIA_PROMPT)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A cleaner way of doing this check, which wouldn't involve playing with indices, would be startswith(SubString(input, oldpos), JULIA_PROMPT).

l = sizeof(s)
if i < 1 || i > l
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a styling issue, but I don't see the point of changing these from if to &&, especially since you add a new if block for UnicodeError.

@nalimilan
Copy link
Member

@StefanKarpinski @stevengj Good to go? If so, we should probably run CI again in case changes since the PR was opened broke the tests.

@nalimilan nalimilan mentioned this pull request Sep 7, 2017
Closes JuliaLang#22548

fixes a bug with use of prevind in dates/io.jl
Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased to pickup any local changes and re-run CI.

@nalimilan
Copy link
Member

Looks like a new failure related to this PR has been introduced.

@@ -534,10 +536,11 @@ function completions(string, pos)
# <Mod>/src/<Mod>.jl
# <Mod>.jl/src/<Mod>.jl
if isfile(joinpath(dir, pname))
endswith(pname, ".jl") && push!(suggestions, pname[1:end-3])
endswith(pname, ".jl") && push!(suggestions,
pname[1:prevind(pname, end-2)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have been OK, since we know the three last characters are ASCII. -2 has no reason to be more correct than -3, right? Same below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not:

julia> x="α.jl"
"α.jl"

julia> x[end-3]
ERROR: UnicodeError: invalid character index
Stacktrace:
 [1] slow_utf8_next(::Ptr{UInt8}, ::UInt8, ::Int64, ::Int64) at .\strings\string.jl:172
 [2] next at .\strings\string.jl:204 [inlined]
 [3] getindex(::String, ::Int64) at .\strings\basic.jl:32

I can use end-2 exactly because I know that last three characters are ASCII.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, carry on, I need to learn how to count up to three. :-)

@nalimilan
Copy link
Member

Unfortunately, there's yet another test failure.

@bkamins
Copy link
Member Author

bkamins commented Sep 19, 2017

@nalimilan Sorry for that. Unfortunately the errors pop up at random places (and even different CI builds print different stack traces for errors). I will try to be yet more careful and make the next commit pass :).

Actually the pain with fixing this PR made me submit #23765 as strict getindex is very tricky (see my comment last comment to your code review). E.g. I would fully remove the kind code like pname[1:end-3] with chop(pname, 3) as proposed there. Otherwise people will always have problem with proper indexing as it is not very intuitive.

Of course I still believe we need this PR, because otherwise bugs will silently go through.

@@ -11,7 +11,11 @@ function completes_global(x, name)
end

function appendmacro!(syms, macros, needle, endchar)
append!(syms, s[2:end-sizeof(needle)]*endchar for s in filter(x -> endswith(x, needle), macros))
r = Regex("^.(.*)$needle\$")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps Regex("^.(.*)\\Q$needle\\E\$") since you don't want special characters to have special interpretation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure :). I will fix after I am sure that CI goes through correcty (as it seems now will go through without errors).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI looks fine now, failures are unrelated.

@bkamins
Copy link
Member Author

bkamins commented Sep 19, 2017

Unfortunately wrapping regexp with \Q and \E fails for \E in the needle, so I had to go back to the original solution as we do not have regex escape in Julia as of now (I have added a comment to #6124).

When I look how s[2:end-sizeof(needle)] differs from properly calculated bounds in such a simple scenario I would strongly advocate for some form of #23765 to make life easier.

@nalimilan
Copy link
Member

Let's merge?

@KristofferC KristofferC merged commit 0058a4f into JuliaLang:master Sep 20, 2017
@KristofferC
Copy link
Member

Great PR. This will probably find a lot of invalid use cases in user code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants