I've got the following problem: the current model, grobid-material.zip, is used to parse strings of material names, more or less full of garbage, and manages to get something useful out of them :-)
In the following example:
`(Mo -x 1 T x ) 3 Sb 7 with x 0.1`
it extracts three entities: `(Mo -x 1 T x ) 3 Sb 7` as `<formula>`, `x` as `<variable>`, and `0.1` as `<value>`.
The current master implementation outputs `0.` as the last entity's text.
I've started digging and:
a) reconstructed the entity text from the original text rather than from the tokens; the current approach concatenates the tokens with spaces in between, which are not always needed.
b) fixed a problem when the entity lies at the end of the sequence: somehow the last tag is ignored and the start/end offsets are missing one element.
c) followed @oterrier's suggestion in #103 and updated the joblib import 😂
d) a858404 is an old modification I committed months ago but never pushed; it removes another warning when the sequence is truncated.
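To illustrate points a) and b), here is a minimal, self-contained sketch of the idea. It is not the code from this PR: `tokenize_with_offsets` and `extract_entities` are hypothetical helpers, the regex tokenizer is only a stand-in for the real one, and the `B-`/`I-`/`O` label scheme is an assumption. It slices the entity text out of the original string using character offsets, and flushes the entity that is still open when the sequence ends.

```python
import re


def tokenize_with_offsets(text):
    # Hypothetical stand-in tokenizer: groups word characters and splits
    # every punctuation character into its own token, keeping offsets.
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def extract_entities(text, tokens, labels):
    # Assumes one BIO-style label per token ("B-x", "I-x" or "O").
    spans = []
    current = None  # [label, start_offset, end_offset] of the open entity
    for (_, start, end), label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = [label[2:], start, end]
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[2] = end  # extend the open entity to this token
        else:
            # "O" or an inconsistent "I-" tag closes the open entity.
            if current:
                spans.append(current)
            current = None
    if current:
        # Flush the entity that reaches the end of the sequence (point b).
        spans.append(current)
    # Slice the original text instead of joining tokens (point a).
    return [(label, text[s:e], s, e) for label, s, e in spans]


text = "(Mo -x 1 T x ) 3 Sb 7 with x 0.1"
tokens = tokenize_with_offsets(text)
labels = (["B-formula"] + ["I-formula"] * 10
          + ["O", "B-variable", "B-value", "I-value", "I-value"])
entities = extract_entities(text, tokens, labels)

# For comparison: joining the last three tokens with spaces.
joined_value = " ".join(tok for tok, _, _ in tokens[-3:])
```

With this stand-in tokenizer the last entity comes back as `0.1`, whereas joining its tokens with spaces gives `0 . 1`.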
This might need some additional tests. To run my test case, here is the modification in `grobidTagger.py` (in combination with the `material` model):