Generalizing SourceFile - adding line/column offsets etc #190

timholy · 2023-02-10T13:36:27Z

JuliaDebug/Cthulhu.jl#345 uses JuliaSyntax to represent source code with expression boundaries mapped to character positions within the source text. In that application, CodeTracking is used to extract source-text for specific methods which may be partway into a file.

Question: for "snippets" like these, should the starting line number lineno be added as a new field to SourceFile? Or does SourceFile need to represent "a whole file"? If so, then adding it as a field seems incorrect, since of course a whole file will start at line 1. The other issue is whether functions like source_line should return the offset line (the line number in the enclosing file) or the line index within the snippet; currently the two are the same. It's for reasons like these that I didn't want to add it without thinking about the consequences.

c42f · 2023-02-11T04:46:59Z

The main purpose of SourceFile is to map between a lines-and-columns based view of the text and byte offsets. So it would probably be fine to have a line offset and allow it to represent part of a file.
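As a rough illustration of that mapping with a line offset, here is a minimal sketch. The `SnippetSource` type and the helper names `line_index`, `file_line`, `column` and `byte_index` are hypothetical, made up for this example; they are not JuliaSyntax's actual `SourceFile` API. Columns are counted in bytes to keep things simple.

```julia
# Hypothetical type for illustration only; not JuliaSyntax's SourceFile.
struct SnippetSource
    code::String
    first_line::Int            # line number of the snippet's first line in the enclosing file
    line_starts::Vector{Int}   # byte index at which each line of `code` starts
end

# Build the line-start table by scanning for newlines once.
function SnippetSource(code::AbstractString; first_line::Integer=1)
    starts = [1]
    for i in eachindex(code)
        code[i] == '\n' && push!(starts, i + 1)
    end
    return SnippetSource(String(code), first_line, starts)
end

# 1-based line index within the snippet itself.
line_index(src::SnippetSource, byte::Integer) = searchsortedlast(src.line_starts, byte)

# "Offset line": the line number in the enclosing file.
file_line(src::SnippetSource, byte::Integer) = line_index(src, byte) + src.first_line - 1

# 1-based column, counted in bytes (an assumption of this sketch).
column(src::SnippetSource, byte::Integer) = byte - src.line_starts[line_index(src, byte)] + 1

# Inverse mapping: (file line, column) back to a byte index in the snippet.
byte_index(src::SnippetSource, line::Integer, col::Integer) =
    src.line_starts[line - src.first_line + 1] + col - 1

src = SnippetSource("f(x) = 1\ng(x) = 2\n"; first_line=42)
line_index(src, 12)     # line within the snippet
file_line(src, 12)      # line within the enclosing file
byte_index(src, 43, 3)  # round trip back to the byte index
```

With a whole file, `first_line` is 1 and `line_index` and `file_line` coincide, which is the current behaviour described in the question above.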

Relatedly, we might need SourceFile to represent part of a file to support incremental reparsing with the language server. I haven't tried to get that working yet, but it might tie in here at some point.

davidanthoff · 2023-02-13T20:39:26Z

Couple of points from the LS side of things:

* We sometimes pass code snippets with an associated line/column offset around in the language server. So at some level we also need a type that contains code plus offset info for lines/columns, but we would need the ability to offset both the line and the column, not just the line.
* It seems to me that a type that includes this kind of offset info shouldn't be called SourceFile, simply because it isn't representing an entire file. Maybe SourceSnippet would be better? Or SourceText, like Roslyn?
* We do have a use case for the line/column to index conversions in the LS, but the current SourceFile wouldn't work for us. We need URIs instead of paths, we need additional versioning info in the file, and we will (hopefully soon) have entirely different structures for notebooks. So I'm wondering whether a structure like the following would allow more code reuse:

struct Position
  line::Int
  column::Int
end

struct SourceSnippet
  code::String
  first_position::Position
  line_starts::Vector{Int}
end

# LS would actually not use this
struct SourceFile
  code::SourceSnippet
  filename::Union{Nothing,String}
end

# LS would use this instead
struct TextDocument
  code::SourceSnippet
  uri::URI
  version::Int
end

struct NotebookCell
  code::SourceSnippet
  # other fields...
end

I hope this doesn't derail the original issue too much. I guess my main point is that just adding a first line and nothing about the column seems weird to me :)
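To make the column point concrete, here is a minimal sketch built on the `Position` and `SourceSnippet` structs proposed above. The two-argument constructor and the `document_position` function are hypothetical names added for illustration, and columns are counted in bytes. The thing it tries to show is that a column offset only affects positions on the snippet's first line, while the line offset applies everywhere.

```julia
struct Position
    line::Int
    column::Int
end

struct SourceSnippet
    code::String
    first_position::Position
    line_starts::Vector{Int}
end

# Hypothetical convenience constructor: compute line starts from the text.
function SourceSnippet(code::AbstractString, first_position::Position)
    starts = [1]
    for i in eachindex(code)
        code[i] == '\n' && push!(starts, i + 1)
    end
    return SourceSnippet(String(code), first_position, starts)
end

# Hypothetical helper: map a byte index in the snippet to a Position
# in the enclosing document.
function document_position(s::SourceSnippet, byte::Integer)
    local_line = searchsortedlast(s.line_starts, byte)
    local_col = byte - s.line_starts[local_line] + 1
    line = s.first_position.line + local_line - 1
    # The column offset only applies on the snippet's first line;
    # later lines start at column 1 of the enclosing document.
    col = local_line == 1 ? s.first_position.column + local_col - 1 : local_col
    return Position(line, col)
end

# A snippet that starts at line 3, column 5 of some document.
s = SourceSnippet("x + y\nz", Position(3, 5))
document_position(s, 3)  # on the first line: both offsets apply
document_position(s, 7)  # on the second line: only the line offset applies
```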

c42f · 2023-02-14T03:17:19Z

Thanks David, this is very useful context. I haven't been able to look at LS integration at all yet, or at what would be required for incremental parsing.

> it seems to me that a type that includes this kind of offset info shouldn't be called SourceFile simply because it isn't representing an entire file

Agreed. SourceText seems good to me. It's shorter than SourceSnippet and I already use the term text in many places.

For now I'm happy with merging #191 to get what Tim needs, but we can go with something more complete in the future.

Related to source abstractions, the internal JuliaSyntax.ParseStream doesn't deal natively with SourceFile because it's trying not to require that the code be copied into a String before we can do anything with it (the C code might pass us a plain old buffer, for example). But maybe this is a bit of a pointless optimization and we could streamline the internals too. (Maybe also improve the Core._parse hook API in Base to make extra copying less common.)
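For what it's worth, line bookkeeping itself doesn't need a String. A minimal sketch, assuming the input is any `AbstractVector{UInt8}` with 1-based indexing (the `line_starts` helper is a hypothetical name, not part of JuliaSyntax):

```julia
# Hypothetical helper: find the byte index where each line starts,
# working directly on a byte buffer rather than a String.
function line_starts(bytes::AbstractVector{UInt8})
    starts = [1]
    for (i, b) in pairs(bytes)
        b == UInt8('\n') && push!(starts, i + 1)
    end
    return starts
end

buf = UInt8[0x61, 0x0a, 0x62]      # the bytes of "a\nb"
line_starts(buf)

# `codeunits` wraps an existing String as a byte vector without copying,
# so the same function serves both kinds of input.
line_starts(codeunits("a\nb\n"))
```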

c42f · 2023-02-14T03:17:56Z

> we can go with something more complete in the future

Let's keep this issue open to discuss that.

When parsing source code fragments incrementally with `Meta.parse(str, index)` or `parsestmt(str, index)`, we must avoid scanning the rest of `str` for line numbers, for efficiency. In this mode, the user is expected to provide `first_line` to "manually" specify which line number we're counting from. Admittedly this is a bit clunky and should be integrated better with SourceFile (which should also be renamed; see issue #190), but for now it seems to be the most consistent way to approach things here. As part of the refactoring here, switch over to using `Vector{UInt8}` for literal parsing, which makes parsing to `ParseStream` and `GreenNode` around 10% faster.
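A minimal sketch of that bookkeeping from the caller's side, using only Base's `Meta.parse(str, index)`. The `parse_with_lines` function and its `first_line` keyword are hypothetical names that mirror the keyword described above; this is not how JuliaSyntax implements it. The idea is to count newlines only in the bytes each call consumed, never in the remainder of `str`.

```julia
# Hypothetical helper: parse `str` one top-level statement at a time and
# record the line each statement starts on, counting from `first_line`.
function parse_with_lines(str::AbstractString; first_line::Int=1)
    results = Tuple{Int,Any}[]
    line = first_line
    pos = firstindex(str)
    while true
        # Skip leading whitespace, counting the newlines we pass over.
        while pos <= ncodeunits(str) && isspace(str[pos])
            str[pos] == '\n' && (line += 1)
            pos = nextind(str, pos)
        end
        pos > ncodeunits(str) && break
        ex, next = Meta.parse(str, pos)
        push!(results, (line, ex))
        # Count newlines only inside the range that was just consumed.
        line += count(==('\n'), SubString(str, pos, prevind(str, next)))
        pos = next
    end
    return results
end

# A fragment that starts at line 10 of some larger file.
parse_with_lines("x = 1\ny = 2\n"; first_line=10)
```

Each statement costs work proportional to its own length, so the total stays linear in the size of the fragment.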

timholy added 5 commits to timholy/JuliaSyntax.jl that referenced this issue Feb 11, 2023

Add lineno to SourceFile (Closes JuliaLang#190)

996095a, b75e9c2, cd07641, ad88376, 5364a23

timholy mentioned this issue Feb 11, 2023

Add lineno to SourceFile #191

Merged

c42f changed the title ~~Add lineno to SourceFile?~~ Generalizing SourceFile - adding line/column offsets etc Feb 14, 2023

c42f mentioned this issue Jun 16, 2023

Fix line numbers with source fragments #310

Merged

c42f mentioned this issue Aug 3, 2023

Make SourceFile.code infer concretely #337

Merged

c42f mentioned this issue Jul 18, 2024

char_range() function for indexing into source strings #457

Closed
