This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

structured-prediction-constituency-parser adds extra spaces #5673

Open
10 tasks done
LawlAoux opened this issue Jun 20, 2022 · 5 comments

@LawlAoux

Checklist

  • I have verified that the issue exists against the main branch of AllenNLP.
  • I have read the relevant section in the contribution guide on reporting bugs.
  • I have checked the issues list for similar or identical bug reports.
  • I have checked the pull requests list for existing proposed fixes.
  • I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the main branch.
  • I have included in the "Description" section below a traceback from any exceptions related to this bug.
  • I have included in the "Related issues or possible duplicates" section below all related issues and possible duplicate issues (If there are none, check this box anyway).
  • I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
  • I have included in the "Environment" section below the output of pip freeze.
  • I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

The constituency parser adds whitespace before and after punctuation characters such as ".", "?", "-", and ",".
For example, for the sentence "Hi there, I'm LawlAoux.", the output for the root is "Hi there , I 'm LawlAoux ." (full tree in details).

{'word': "Hi there , I 'm LawlAoux .", 'nodeType': 'S', 'attributes': ['S'], 'link': 'S', 'children': [{'word': 'Hi there', 'nodeType': 'INTJ', 'attributes': ['INTJ'], 'link': 'INTJ', 'children': [{'word': 'Hi', 'nodeType': 'UH', 'attributes': ['UH'], 'link': 'UH'}, {'word': 'there', 'nodeType': 'ADVP', 'attributes': ['ADVP'], 'link': 'ADVP', 'children': [{'word': 'there', 'nodeType': 'RB', 'attributes': ['RB'], 'link': 'RB'}]}]}, {'word': ',', 'nodeType': ',', 'attributes': [','], 'link': ','}, {'word': 'I', 'nodeType': 'NP', 'attributes': ['NP'], 'link': 'NP', 'children': [{'word': 'I', 'nodeType': 'PRP', 'attributes': ['PRP'], 'link': 'PRP'}]}, {'word': "'m LawlAoux", 'nodeType': 'VP', 'attributes': ['VP'], 'link': 'VP', 'children': [{'word': "'m", 'nodeType': 'VBP', 'attributes': ['VBP'], 'link': 'VBP'}, {'word': 'LawlAoux', 'nodeType': 'NP', 'attributes': ['NP'], 'link': 'NP', 'children': [{'word': 'LawlAoux', 'nodeType': 'NNP', 'attributes': ['NNP'], 'link': 'NNP'}]}]}, {'word': '.', 'nodeType': '.', 'attributes': ['.'], 'link': '.'}]}
As you can see, it adds spaces for the entire sentence for the root of the tree.

Related issues or possible duplicates

  • None

Environment

OS: OS X

Python version: 3.9.7

Output of pip freeze:

absl-py==1.1.0
agenda @ git+https://github.com/hyroai/agenda.git@1321e49ec433d62901e1e00dfd546c00d43db544
aiohttp==3.8.1
aioredis==2.0.1
aiosignal==1.2.0
allennlp==2.9.3
allennlp-models==2.9.3
analysis @ git+https://gitlab.com/airbud/analysis.git@d3829b9896c9a8af6a498a859695b3a00fa3556d
anyio==3.6.1
appdirs==1.4.4
appnope==0.1.3
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.2
asgiref==3.5.2
asttokens==2.0.5
astunparse==1.6.3
async-cache==1.1.1
async-generator==1.10
async-lru==1.0.3
async-timeout==4.0.2
asyncio==3.4.3
attrs==21.4.0
azure-common==1.1.28
azure-storage-blob==2.1.0
azure-storage-common==2.1.0
Babel==2.10.2
backcall==0.2.0
base58==2.1.1
beautifulsoup4==4.11.1
bleach==5.0.0
blis==0.7.7
boto3==1.24.8
botocore==1.27.8
breadability==0.1.20
cached-path==1.1.3
cachetools==4.2.4
catalogue==2.0.7
certifi==2022.5.18.1
cffi==1.15.0
cfgv==3.3.1
chardet==4.0.0
charset-normalizer==2.0.12
click==7.1.2
-e git+https://github.com/hyroai/cloud-utils.git@df2c01628b807064cda2a408b897cc822fb8558c#egg=cloud_utils
commonmark==0.9.1
computation-graph==38
conllu==4.4.1
coverage==6.4.1
cryptography==37.0.2
cycler==0.11.0
cymem==2.0.6
dataclass-type-validator==0.1.2
dataclasses==0.6
dataclasses-json==0.5.5
datasets==2.2.2
dateparser==1.1.1
ddsketch==2.0.3
ddtrace==1.1.4
debugpy==1.6.0
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.13
dill==0.3.4
distlib==0.3.4
dnspython==2.2.1
docker-pycreds==0.4.0
docopt==0.6.2
docutils==0.18.1
ecdsa==0.17.0
emoji==1.7.0
en-core-web-lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.2.0/en_core_web_lg-3.2.0-py3-none-any.whl
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl
entrypoints==0.4
et-xmlfile==1.1.0
execnet==1.9.0
executing==0.8.3
fairscale==0.4.6
falcon==3.1.0
fastapi==0.78.0
fastjsonschema==2.15.3
filelock==3.6.0
findimports==2.2.0
flaky==3.7.0
flatbuffers==1.12
fonttools==4.33.3
frozenlist==1.3.0
fsspec==2022.5.0
ftfy==6.1.1
fuzzyset2==0.1.1
gamla==132
gast==0.4.0
Geohash @ https://github.com/uriva/geohash/tarball/master
geotext==0.4.0
gitdb==4.0.9
GitPython==3.1.27
google-api-core==1.31.6
google-api-python-client==2.50.0
google-auth==1.35.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
google-cloud-core==2.3.1
google-cloud-dlp==1.0.0
google-cloud-kms==1.4.0
google-cloud-storage==2.4.0
google-crc32c==1.3.0
google-pasta==0.2.0
google-resumable-media==2.3.3
googleapis-common-protos==1.56.2
grpc-google-iam-v1==0.12.4
grpcio==1.46.3
gspread==5.4.0
gunicorn==20.1.0
h11==0.13.0
h5py==3.7.0
heapq-max==0.21
html2text==2020.1.16
httpcore==0.13.2
httplib2==0.20.4
httptools==0.2.0
httpx==0.18.1
httpx-auth==0.10.0
huggingface-hub==0.7.0
humanize==4.1.0
identify==2.5.1
idna==3.3
immutables==0.18
importlib-metadata==4.11.4
inflect==5.6.0
iniconfig==1.1.1
ipykernel==6.14.0
ipython==8.4.0
ipython-genutils==0.2.0
ipywidgets==7.7.0
iso4217parse==0.5.1
jedi==0.18.1
Jinja2==3.1.2
jmespath==1.0.0
joblib==1.1.0
json5==0.9.8
jsonnet==0.18.0
jsonschema==4.6.0
jupyter-client==7.3.4
jupyter-core==4.10.0
jupyter-server==1.17.1
jupyterlab==3.4.3
jupyterlab-pygments==0.2.2
jupyterlab-server==2.14.0
jupyterlab-widgets==1.1.0
keras==2.9.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.3
-e git+https://github.com/hyroai/knowledge-graph.git@a3030683577ed4a3828ea72ff71842dab50c1b1c#egg=knowledge_graph
kubernetes==23.6.0
langcodes==3.3.0
libclang==14.0.1
lmdb==1.3.0
lxml==4.9.0
marisa-trie==0.7.7
Markdown==3.3.7
markdownify==0.11.2
MarkupSafe==2.1.1
marshmallow==3.16.0
marshmallow-enum==1.5.1
matplotlib==3.5.2
matplotlib-inline==0.1.3
mistune==0.8.4
mona-sdk==0.0.24
more-itertools==8.13.0
motor==2.5.1
mpu==0.23.1
multidict==6.0.2
multiprocess==0.70.12.2
murmurhash==1.0.7
mypy-extensions==0.4.3
names-dataset==3.1.0
nbclassic==0.3.7
nbclient==0.6.4
nbconvert==6.5.0
nbformat==5.4.0
nest-asyncio==1.5.5
nltk==3.7
-e git+ssh://git@gitlab.com/airbud/nlu-lib.git@e9c3577d3761cc645706d670c85c6727abc4f3cf#egg=nlu
nodeenv==1.6.0
nostril @ https://github.com/casics/nostril/tarball/master
notebook==6.4.12
notebook-shim==0.1.0
number-parser==0.2.1
numpy==1.22.4
oauth2client==4.1.3
oauthlib==3.2.0
openpyxl==3.0.10
opt-einsum==3.3.0
outcome==1.1.0
overrides==6.1.0
packaging==21.3
pandas==1.4.2
pandocfilters==1.5.0
parsec==3.13
parso==0.8.3
pathtools==0.1.2
pathy==0.6.1
pexpect==4.8.0
phonenumbers==8.12.24
pickleshare==0.7.5
Pillow==9.1.1
plac==1.3.5
platformdirs==2.5.2
plotly==5.8.2
pluggy==1.0.0
pre-commit==2.19.0
preshed==3.0.6
presidio-analyzer==2.2.28
prometheus-async==22.2.0
prometheus-client==0.14.1
promise==2.3
prompt-toolkit==3.0.29
protobuf==3.19.4
psutil==5.9.1
ptyprocess==0.7.0
pure-eval==0.2.2
py==1.11.0
py-rouge==1.1
pyap==0.3.1
pyarrow==8.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycld2==0.41
pycountry==22.3.5
pycparser==2.21
pydantic==1.7.4
pyee==8.2.2
Pygments==2.12.0
pyhumps==3.7.1
PyJWT==1.7.1
pymongo==3.12.3
pyOpenSSL==22.0.0
pyparsing==3.0.9
pyppeteer==1.0.2
pyrsistent==0.18.1
PySocks==1.7.1
pytest==7.1.2
pytest-asyncio==0.18.3
pytest-forked==1.4.0
pytest-instafail==0.4.2
pytest-mock==3.7.0
pytest-repeat==0.9.1
pytest-sugar==0.9.4
pytest-test-groups==1.0.3
pytest-timeout==2.1.0
pytest-tmnet @ https://testmon.org/static/c28870f08/pytest-tmnet-1.3.2.tar.gz
pytest-xdist==2.5.0
python-dateutil==2.8.2
python-dotenv==0.20.0
python-jose==3.3.0
python-Levenshtein==0.12.2
python-stdnum==1.17
pytimeparse==1.1.8
pytz==2022.1
pytz-deprecation-shim==0.1.0.post0
PyYAML==5.4.1
pyzmq==23.1.0
redis==4.3.3
regex==2022.6.2
requests==2.28.0
requests-file==1.5.1
requests-mock==1.9.3
requests-oauthlib==1.3.1
responses==0.18.0
retrying==1.3.3
rfc3986==1.5.0
rich==12.4.4
rsa==4.8
s3transfer==0.6.0
sacremoses==0.0.53
scikit-learn==1.1.1
scipy==1.8.1
selenium==4.2.0
Send2Trash==1.8.0
sentencepiece==0.1.96
sentry-sdk==1.5.12
setproctitle==1.2.3
shortuuid==1.0.9
six==1.16.0
sklearn==0.0
smart-open==5.2.1
smmap==5.0.0
sniffio==1.2.0
sortedcontainers==2.4.0
soupsieve==2.3.2.post1
spacy==3.2.4
spacy-legacy==3.0.9
spacy-loggers==1.0.2
srsly==2.4.3
stack-data==0.2.0
starlette==0.19.1
stringcase==1.2.0
sumy==0.10.0
tabulate==0.8.9
tenacity==8.0.1
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorboardX==2.5.1
tensorflow==2.9.1
tensorflow-estimator==2.9.0
tensorflow-hub==0.12.0
tensorflow-io-gcs-filesystem==0.26.0
termcolor==1.1.0
terminado==0.15.0
thinc==8.0.17
threadpoolctl==3.1.0
tinycss2==1.1.1
tldextract==3.1.0
tokenizers==0.12.1
toml==0.10.2
tomli==2.0.1
toolz==0.11.2
toposort==1.7
torch==1.11.0
torchvision==0.12.0
tornado==6.1
tqdm==4.64.0
traitlets==5.2.2.post1
transformers==4.18.0
trio==0.21.0
trio-websocket==0.9.2
typeguard==2.13.3
typer==0.4.1
typing-inspect==0.7.1
typing-utils==0.1.0
typing_extensions==4.2.0
tzdata==2022.1
tzlocal==4.2
undetected-chromedriver==3.1.5.post4
Unidecode==1.3.4
uritemplate==4.1.1
urllib3==1.26.9
uvicorn==0.15.0
uvloop==0.16.0
virtualenv==20.14.1
wandb==0.12.18
wasabi==0.9.1
watchgod==0.8.2
wcwidth==0.2.5
webdriver-manager==3.7.0
webencodings==0.5.1
websocket-client==1.3.2
websockets==10.3
Werkzeug==2.1.2
widgetsnbextension==3.6.0
word2number==1.1
wptools @ https://github.com/hyroai/wptools/tarball/master
wrapt==1.14.1
wsproto==1.1.0
xmltodict==0.13.0
xxhash==3.0.0
yappi==1.3.5
yarl==1.7.2
yattag==1.14.0
zipp==3.8.0

Steps to reproduce

Example source:

from allennlp_models import pretrained

_PARSER = pretrained.load_predictor("structured-prediction-constituency-parser")
_PARSER("Hi there, I'm LawlAoux.")["hierplane_tree"]["root"]

@LawlAoux LawlAoux added the bug label Jun 20, 2022

epwalsh commented Jun 29, 2022

Thanks for the bug report @LawlAoux! I have to admit I don't have a lot of context here, and this model was implemented before my time. Does this bug break something in your own code? I'm just trying to get a sense of how big of an issue this is.


LawlAoux commented Jul 3, 2022

@epwalsh Thanks for your reply! It does break something in our code. For example, when the input contains "ob-gyn", which is a common specialty, the output becomes "ob - gyn", and the parse tree is simply wrong. In addition, in the example I provided in the bug report you can see that it adds extra spaces wherever there is a special character, which is not desirable behaviour for something like a chatbot. (You could argue that we can just remove these spaces, but sometimes we want to retain the original spacing, which would be stripped if we did that, e.g. in TTS.)
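To make the symptom concrete, here is a minimal illustration (not AllenNLP code, just plain Python): joining word-level tokens with spaces, which is effectively what the predictor does, mangles hyphenated words.

```python
# A word-level tokenizer splits "ob-gyn" into three tokens.
tokens = ["ob", "-", "gyn"]

# Rejoining them with spaces loses the original spacing.
naive_join = " ".join(tokens)
print(naive_join)  # -> "ob - gyn"
```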


epwalsh commented Jul 7, 2022

Hmm I see, thanks. I just took a deeper look at this. Here's what I found:

The root of the issue is that the predictor uses this spaCy tokenizer, which discards spaces, unlike more "modern" tokenizers such as GPT-2's BPE tokenizer. This line then naively joins all of the tokens with spaces to reconstruct each span.

Now, the spaCy Token instances themselves contain enough information to recreate the exact input span, since they contain the start and end index of the token within the original string. The problem is that we don't actually pass Token objects through the model - we just pass the string forms of the tokens - so we only get strings back out. If we were able to get the Token instances back out (in the Tree object, specifically) instead of strings, then we'd be able to properly reconstruct each span on this line.
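A self-contained sketch of why the offsets are enough. The Token dataclass below is a stand-in, not spaCy's actual class (spaCy's real Token exposes the same information via token.text and token.idx):

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    idx: int  # start offset of the token in the original string

sentence = "Hi there, I'm LawlAoux."
# A word-level tokenization of the sentence, with start offsets recorded.
tokens = [Token("Hi", 0), Token("there", 3), Token(",", 8), Token("I", 10),
          Token("'m", 11), Token("LawlAoux", 14), Token(".", 22)]

# Naive reconstruction: join the string forms with spaces (what the predictor does now).
naive = " ".join(t.text for t in tokens)

# Offset-based reconstruction: slice the original string between the first
# token's start and the last token's end, preserving the original spacing.
def span_text(sentence, tokens):
    start = tokens[0].idx
    end = tokens[-1].idx + len(tokens[-1].text)
    return sentence[start:end]

print(naive)                        # -> "Hi there , I 'm LawlAoux ."
print(span_text(sentence, tokens))  # -> "Hi there, I'm LawlAoux."
```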

I think this change is doable. We'd have to modify code in several places:

  • The Penn Tree Bank dataset reader would have to accept Union[List[str], List[Token]] instead of just List[str] in the text_to_instance() method.
  • Similarly, the construct_trees() method on the model needs to accept List[List[Token]] instead of List[List[str]] and return a Tree object that contains the Token instances instead of strings.
  • In the predictor we need to pass the Token objects themselves instead of pulling out the strings on this line.
  • And finally, in the _build_hierplane_tree method of the predictor, we need to properly reconstruct each span using the additional information in each token.
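The last step above can be sketched as follows. This is hypothetical, simplified code, not the AllenNLP implementation: the tree is modeled as nested (label, children) pairs, and Token is a stand-in for spaCy's Token (.text, .idx). The point is that once the leaves are Token objects, any node's "word" can be sliced directly out of the original sentence:

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    idx: int  # start offset in the original sentence

sentence = "Hi there, I'm LawlAoux."
hi, there, comma = Token("Hi", 0), Token("there", 3), Token(",", 8)
i, m, name, dot = Token("I", 10), Token("'m", 11), Token("LawlAoux", 14), Token(".", 22)

# A constituency tree as (label, children) pairs with Token leaves.
tree = ("S", [("INTJ", [hi, there]), comma, ("NP", [i]),
              ("VP", [m, ("NP", [name])]), dot])

def leaves(node):
    """Collect the Token leaves of a node in order."""
    if isinstance(node, Token):
        return [node]
    _, children = node
    return [tok for child in children for tok in leaves(child)]

def node_word(node, sentence):
    """Reconstruct a node's surface text by slicing the original sentence."""
    toks = leaves(node)
    return sentence[toks[0].idx : toks[-1].idx + len(toks[-1].text)]

print(node_word(tree, sentence))                      # -> "Hi there, I'm LawlAoux."
print(node_word(("VP", [m, ("NP", [name])]), sentence))  # -> "'m LawlAoux"
```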

I might be missing some details, but I think that's the gist of it. I'm putting the "Contributions Welcome" label on this because I probably won't have time to tackle this anytime soon, but I'm happy to review a PR and help where I can.

@borosilicate

@epwalsh and @LawlAoux Hello! If you still need someone to complete this, I would be willing to work on it.


epwalsh commented Sep 13, 2022

Hi @borosilicate, yes, please go ahead with a PR when you have a chance.
