Generic input for easier long-term support of languages #1734

Open
wadoon opened this issue Apr 26, 2024 · 3 comments
Labels
enhancement Issue/PR that involves features, improvements and other changes language PR / Issue deals (partly) with new and/or existing languages for JPlag major Major issue/feature/contribution/change

Comments


wadoon commented Apr 26, 2024

I just stumbled across JPlag.

It is a bit of a pity that some languages are in a legacy state. I want to suggest that a common input format for token streams (or ASTs) might be useful for supporting various languages. For example, writing a parser for Python is hard, but Python ships a reusable parser that works and creates ASTs. It is easy to get the token stream from it, store it as JSON, S-expressions, etc., and load this list of tokens into JPlag.

If you had a generic input model in which you can declare your tokens (or AST), you could use an existing parser and write a small adapter for the translation.
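
To make the adapter idea concrete, here is a minimal sketch that uses Python's standard `tokenize` module to turn a source string into a JSON list of tokens. The JSON layout (`type`, `line`, `column`, `length`) and the function name `tokens_as_json` are assumptions made up for illustration; JPlag does not define such an input format.

```python
import io
import json
import tokenize

# Token kinds dropped from the output: comments, line breaks, end marker.
SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE, tokenize.ENDMARKER}


def tokens_as_json(source: str) -> str:
    """Serialize the lexical tokens of Python source as a JSON list.

    The record layout is hypothetical, not a format JPlag understands.
    """
    records = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        records.append({
            # exact_type distinguishes operators, e.g. EQUAL instead of OP.
            "type": tokenize.tok_name[tok.exact_type],
            "line": tok.start[0],      # 1-based line number
            "column": tok.start[1],    # 0-based column offset
            "length": len(tok.string),
        })
    return json.dumps(records)


if __name__ == "__main__":
    print(tokens_as_json("x = 1\n"))
```

A purely lexical stream like this is the simplest thing an adapter could emit; a plagiarism detector would more likely want structural tokens derived from the AST.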

This might get interesting if you look at tree-sitter. It is a parser framework with several hundred languages and is widely used for syntax highlighting, etc. Tree-sitter provides a uniform AST representation (S-expressions). Supporting this or a similar format could boost the reach tremendously.

Greetings from down the floor, Alexander

Member

tsaglam commented May 2, 2024

The term "legacy" is probably not clear enough. It just means that the language module is mostly still in the state it had in the legacy version of JPlag (v2.x.x and earlier). Regarding a generic language module that supports a common input format: there was an implementation of this in the fork by @CodeGra-de. However, it might not be there anymore.

We have thought about integrating tree-sitter before. However, we would probably prefer direct integration via Java bindings over a more decoupled approach via a generic language module. Tree-sitter would give us more means to parse up-to-date language versions, but it would not make language modules obsolete, because a carefully designed tokenization strategy is crucial for good detection quality: some parse tree nodes are extracted as tokens and some are not, and some nodes map to the same token type while others do not.
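
The point about tokenization strategy can be illustrated with a small sketch on top of Python's standard `ast` module. The mapping table and the token names (`LOOP`, `ASSIGN`, ...) are invented for this example and are not JPlag's actual token set; the sketch only shows the two design decisions mentioned above: several node types collapse into one token type, and node types missing from the table (names, constants, operators) produce no token at all.

```python
import ast

# Hypothetical mapping, NOT JPlag's real token set.
# Several node types share one token type; unlisted node types are ignored.
NODE_TO_TOKEN = {
    ast.FunctionDef: "FUNCTION_DEF",
    ast.AsyncFunctionDef: "FUNCTION_DEF",
    ast.For: "LOOP",
    ast.While: "LOOP",
    ast.If: "IF",
    ast.Assign: "ASSIGN",
    ast.AugAssign: "ASSIGN",
    ast.Call: "CALL",
    ast.Return: "RETURN",
}


def extract_tokens(source: str) -> list[tuple[str, int]]:
    """Return (token_type, line) pairs for the selected AST nodes."""
    found = []
    for node in ast.walk(ast.parse(source)):
        token_type = NODE_TO_TOKEN.get(type(node))
        if token_type is not None:
            found.append((node.lineno, node.col_offset, token_type))
    # ast.walk makes no promise about source order, so sort by position.
    found.sort(key=lambda item: (item[0], item[1]))
    return [(token_type, line) for line, _, token_type in found]


if __name__ == "__main__":
    code = "for i in range(3):\n    total = i\n"
    print(extract_tokens(code))
```

With a table like this, a `for` loop rewritten as a `while` loop still yields the same `LOOP` token, which is the kind of normalization a generic parser output would not provide on its own.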

The quality and currency of some of the ANTLR grammars are pain points (almost all of them come from https://github.com/antlr/grammars-v4/), so I would not rule out a tree-sitter integration.

@tsaglam tsaglam added enhancement Issue/PR that involves features, improvements and other changes major Major issue/feature/contribution/change language PR / Issue deals (partly) with new and/or existing languages for JPlag labels May 2, 2024
Author

wadoon commented May 2, 2024

The goal is not a tree-sitter integration. The goal is rather to allow a generic program that can be triggered from JPlag and returns a JSON (or whatever format) object describing the list of tokens. For example:

jplag --language rust --use-preprocessor './rustAST2json {file}' *.rs

The given program ./rustAST2json reads the given file and returns the token information via stdout.
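
The calling side of such a protocol could look roughly like the following sketch. Everything here is hypothetical: the `--use-preprocessor` option is only a proposal in this thread, and the function name `run_preprocessor`, the `{file}` substitution, and the "JSON list on stdout" convention are assumptions chosen to match the example command above. The sketch also assumes a POSIX-style shell syntax for the command template.

```python
import json
import shlex
import subprocess
import sys


def run_preprocessor(command_template: str, path: str) -> list:
    """Run an external tokenizer for one file and parse its stdout as JSON.

    Hypothetical protocol: the template contains the placeholder {file},
    and the program prints a JSON list of token records to stdout.
    """
    # Quote the path so that file names with spaces survive the split below.
    command = command_template.replace("{file}", shlex.quote(path))
    result = subprocess.run(
        shlex.split(command),
        capture_output=True,
        text=True,
        check=True,  # a non-zero exit code raises CalledProcessError
    )
    tokens = json.loads(result.stdout)
    if not isinstance(tokens, list):
        raise ValueError("preprocessor must print a JSON list of tokens")
    return tokens


if __name__ == "__main__":
    # Stand-in for ./rustAST2json: a one-line program that echoes its argument.
    script = "import sys, json; print(json.dumps([{'file': sys.argv[1]}]))"
    template = f"{shlex.quote(sys.executable)} -c {shlex.quote(script)} {{file}}"
    print(run_preprocessor(template, "main.rs"))
```

Passing the command as an argument list rather than through a shell avoids shell injection via file names, at the cost of not supporting pipes or redirection in the template.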

Member

tsaglam commented May 3, 2024

This is roughly the generic language module that CodeGra-de made, but we do not currently plan to implement such a feature. Direct integration of tree-sitter, however, is something we may consider in the future.
