
Enry changes the input byte slice #196

Closed
vmarkovtsev opened this issue Jan 29, 2019 · 4 comments

@vmarkovtsev (Collaborator)
enry.GetLanguage changes the contents of its second argument, the input byte slice. How to reproduce:

reproduce.go

package main

import (
    "io/ioutil"
    "os"

    "gopkg.in/src-d/enry.v1"
)

func main() {
    // Read the file from stdin, detect its language, then write the
    // same buffer back out so any mutation shows up in the diff below.
    data, _ := ioutil.ReadAll(os.Stdin)
    language := enry.GetLanguage("ParabolicPointer.cs", data)
    os.Stdout.Write(data)
    println(language)
}
cat ParabolicPointer.cs | go run reproduce.go > new.cs
diff ParabolicPointer.cs new.cs

You should see:

50c50
<         for(int i=0; i<points; i++)
---
>         for(int i=0; i<points;>i++)

The ParabolicPointer.cs used above is attached: ParabolicPointer.zip

This is the root cause of src-d/hercules#178

vmarkovtsev added a commit to vmarkovtsev/hercules that referenced this issue Jan 29, 2019
Copy the input byte slice to defend from src-d/enry#196

Signed-off-by: Vadim Markovtsev <vadim@sourced.tech>
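
As a stopgap on the caller's side, the gist of that commit is to hand enry a copy of the buffer rather than the buffer itself. A minimal sketch, reusing the repro above (the stdin plumbing and file name are only illustrative, not the actual hercules change):

package main

import (
    "io/ioutil"
    "os"

    "gopkg.in/src-d/enry.v1"
)

func main() {
    data, _ := ioutil.ReadAll(os.Stdin)

    // Give enry a private copy so the original buffer stays intact
    // even if the tokenizer mutates its input.
    buf := make([]byte, len(data))
    copy(buf, data)
    language := enry.GetLanguage("ParabolicPointer.cs", buf)

    os.Stdout.Write(data) // data is no longer corrupted
    println(language)
}
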
@creachadair (Contributor)

$ go run repro.go < ParabolicPointer.cs 
C#
before eb7239ae3fc314c45ac58bf03abbd19aff2a0243
 after 37a2b661c1e43cb2ed34b51c924e424278233fb0

I haven't yet figured out which strategy is the culprit, but I'm working on narrowing it down.

@bzz (Contributor) commented Jan 29, 2019

The final decision is made by the DefaultClassifier, which seems to be the cause of this.

@creachadair (Contributor)

Specifically, the problem happens in the guts of the tokenizer package. It takes slices into the original input and then makes replacements in them. Since those slices are views into the caller's buffer, this winds up changing the caller's content.

The tokens should probably have been strings rather than byte slices, since mutating shared slices is never really safe. For now, though, I think we can make a reasonable fix by having the tokenizer operate on a copy of the input.
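
To illustrate the aliasing with a hypothetical snippet (this is not enry's tokenizer code, just the same pattern in miniature): a sub-slice shares its backing array with the input, so writing a replacement through it rewrites the caller's bytes, while copying first leaves them alone.

package main

import "fmt"

func main() {
    input := []byte("for(int i=0; i<points; i++)")

    // A sub-slice is only a view; it shares input's backing array.
    view := input[13:22] // "i<points;"

    // Writing a replacement through the view corrupts the caller's buffer.
    view[len(view)-1] = '>'
    fmt.Println(string(input)) // "for(int i=0; i<points> i++)"

    // Copying the bytes first keeps the original intact.
    fresh := []byte("for(int i=0; i<points; i++)")
    tok := append([]byte(nil), fresh[13:22]...)
    tok[len(tok)-1] = '>'
    fmt.Println(string(fresh)) // unchanged
}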

@creachadair (Contributor)

I've started a fairly straightforward test+fix in #197.
