Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

char_range() function for indexing into source strings #457

Closed
davidanthoff opened this issue Jul 11, 2024 · 12 comments · Fixed by #486
Closed

char_range() function for indexing into source strings #457

davidanthoff opened this issue Jul 11, 2024 · 12 comments · Fixed by #486

Comments

@davidanthoff
Copy link
Contributor

We have a crash from the VS Code extension that seems to originate from JuliaSyntax. Repo steps are:

  1. pkg> dev Sunny
  2. Run
using JuliaSyntax

src = read(joinpath(homedir(), ".julia/dev/Sunny/test/test_tensors.jl"), String)

tree = parseall(SyntaxNode, src, filename="foo.jl")

i = JuliaSyntax.last_byte(tree[2][end][end])

src[i]

That crashes with

ERROR: StringIndexError: invalid index [2142], valid nearby indices [2140]=>'′', [2143]=>'\r'

I assume that last_byte should always return a valid index, right? So this seems like a bug.

The rest of this issue is some speculation from my end, might all be wrong :)

I started looking a bit around in the JuliaSyntax code, and there seems to be a fair bit of code that assumes 1-byte characters, which strikes me as incorrect? Or maybe this is carefully only done when there is a guarantee that only 1-byte codepoints can appear? An example is

span(t::FullToken) = 1 + last_byte(t) - first_byte(t)
Doesn't that (incorrectly) assume that the character at index position last_byte is only one code unit? Generally if I search the repo for - 1 or + 1 I see a fair bit of code where I would have assumed that a prevind or nextind would be needed?

@KristofferC
Copy link
Member

KristofferC commented Jul 11, 2024

Why would indexing into a string with the last_byte make sense? Seems the same conceptually as doing "α"[sizeof("α")].

@davidanthoff
Copy link
Contributor Author

Yeah, I also realized, I should probably be using range instead of using first_byte and last_byte. At least for range I assume it should return valid indices, right? My PR would fix that, the repo for the problem with range is:

using JuliaSyntax

src = read(joinpath(homedir(), ".julia/dev/Sunny/test/test_tensors.jl"), String)

tree = parseall(SyntaxNode, src, filename="foo.jl")

code_range = range(tree[2][end][end])

src[code_range]

which also errors.

@DaniGlez
Copy link

If anyone else is facing this issue, a way to bypass it until there's a permanent fix is to find your lines ending in non-ASCII Unicode characters and make that no longer the case.

@MilesCranmer
Copy link
Member

MilesCranmer commented Jul 15, 2024

Thanks @DaniGlez!

Here's a regexp (VSCode flavour) for finding non-ASCII characters that end lines:

[^\x00-\x7f]$

@c42f
Copy link
Member

c42f commented Jul 16, 2024

there seems to be a fair bit of code that assumes 1-byte characters

JuliaSyntax works on byte indices internally and it's (usually) quite careful to convert to valid character indices in the occasional cases where this is required.

So unicode generally works and there's a pile of tests for this in test/source_files.jl

However there must be a bug in an edge case for non-ascii at line end

@c42f
Copy link
Member

c42f commented Jul 16, 2024

Nice properties of byte ranges are that
(a) the green tree exactly covers the source buffer with unit ranges, with no gaps
(b) the ranges (in principle) require no further interpreting of unicode chars to pull out a string (not yet exploited in practice! as this bug shows!)

@davidanthoff
Copy link
Contributor Author

I think all I'm looking for is some function that returns a range that has valid string indices and can be used directly with the underlying string. Does that exist? If I understand @c42f's comment in my PR correct, then Base.range (to be renamed to byte_range) is also not supposed to do that, so is there any other way?

As a user of this, I would also say that at the level of SyntaxNodes, users are probably thinking in terms of string indices, not byte indices. I understand that byte stuff makes sense at the green tree level, but isn't that an abstraction leakage if I have to deal with this stuff when dealing with SyntaxNodes? I think it would be nice if there was a (documented) API at the SyntaxNode level that just operates on string index level.

For now I'm going to hack around all of this downstream, we can't have this bug crash in the wild, but it would be much nicer if there was some "official" way to get these indices.

@c42f
Copy link
Member

c42f commented Jul 17, 2024

I think all I'm looking for is some function that returns a range that has valid string indices and can be used directly with the underlying string. Does that exist?

Not exactly. Here's what exists so far:

  • JuliaSyntax uses SyntaxNode ranges in conjunction with SourceFile which does the index conversion (as well as other useful stuff like line and column number computation).
  • If you just want the source text covered by the node you can use sourcetext(node).

If I understand @c42f's comment in my PR correct, then Base.range (to be renamed to byte_range) is also not supposed to do that

Correct. By design, JuliaSyntax works in terms of byte ranges and only does the conversion to character indices when necessary.

For now I'm going to hack around all of this downstream

Ok. If you can point me at the place you're using this downstream, I'm sure we can work out the right design to unhack things in the future :-)

@davidanthoff
Copy link
Contributor Author

My hack is https://github.com/julia-vscode/TestItemDetection.jl/blob/e19b96801a10718ca39b2c15cad18d0aae123993/src/packagedef.jl#L2. Essentially I just want that :)

In terms of design, I generally just guess that "end-users" of this package that use SyntaxNode want to deal with string indices, not byte indices. From the consumer's point of view, the whole byte index stuff strikes me as an implementation detail that I shouldn't really be exposed to in terms of the API I deal with :)

I saw the SourceFile type. We have a very similar thing in JuliaWorkspaces, and it is a bit different now than I originally described. Would be nice if we could unify all of that at some point to use the same types.

@c42f
Copy link
Member

c42f commented Jul 18, 2024

Essentially I just want that :)

Right - this is pretty much what SourceFile does here:

j = prevind(source.code, last(rng) + 1 - source.byte_offset)

Except that SourceFile also supports a byte offset (for when we are logically only parsing a piece of the text, not the whole thing). By the way, SourceFile has prevind, etc, supported.

I feel the core issue here is the question "what's a good representation of source code?" SourceFile isn't the whole answer, and neither is a string (probably). Essentially we're back to wanting a proper answer to #190 ?

@c42f
Copy link
Member

c42f commented Jul 18, 2024

For now I guess a quick fix would be to add a char_range() function which would do what your our_range() does? It's not wonderful (not a solution to #190), but it would be an improvement.

For GreenNode, char_range would need additional file::SourceFile and position::Int arguments. Or some such.

@davidanthoff
Copy link
Contributor Author

For now I guess a quick fix would be to add a char_range() function

Yes, I think that would be great.

For GreenNode, char_range would need additional file::SourceFile and position::Int arguments.

I think it would also be reasonable to say that the "char"-based interface is only available at the SyntaxNode level, if someone is digging down to the green node level, they just have to deal with this themselves? That is the more low level interface in any case.

@c42f c42f mentioned this issue Jul 31, 2024
@c42f c42f changed the title Incorrect unicode handling char_range() function for indexing into source strings Jul 31, 2024
@c42f c42f closed this as completed in #486 Aug 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants