Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent tokenization from modifying its input. #197

Merged
merged 2 commits into from
Jan 29, 2019
Merged

Prevent tokenization from modifying its input. #197

merged 2 commits into from
Jan 29, 2019

Conversation

creachadair
Copy link
Contributor

@creachadair creachadair commented Jan 29, 2019

Fixes #196. Several steps of tokenization edit the input for the advantage of steps downstream.
However, these edits were on slices from the original input, so they changed the content seen
by the caller. Copy the input before slicing to avert that, and add a test to ensure it.

Signed-off-by: M. J. Fromberger michael.j.fromberger@gmail.com

At present this test fails, since the tokenizer replaces text in shared slices
of the input. A subsequent commit will fix that.

Signed-off-by: M. J. Fromberger <michael.j.fromberger@gmail.com>
Addresses #196. Several of the tokenizer's processing steps wind up editing the
source, and we don't want those changes to be observed by the caller, which may
use the source for other purposes afterward.

Signed-off-by: M. J. Fromberger <michael.j.fromberger@gmail.com>
@bzz bzz self-requested a review January 29, 2019 18:20
Copy link
Contributor

@bzz bzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏 performance implication should be minimal, as classifer that uses tokenization is a last-resort strategy

@creachadair creachadair changed the title Add a test that tokenization does not modify the input. Prevent tokenization from modifying its input. Jan 29, 2019
@creachadair creachadair merged commit 260dcfe into src-d:master Jan 29, 2019
@creachadair creachadair deleted the muckthebits branch January 29, 2019 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants