[Errno 32] Broken pipe for Parser in parallel execution on OSX #47

Closed
mladvladimir opened this issue Apr 8, 2018 · 13 comments
Labels
bug Something isn't working


@mladvladimir

mladvladimir commented Apr 8, 2018

Hi,

In fonduer-tutorials, after running cell:

corpus_parser = OmniParser(structural=True, lingual=True, visual=True, pdf_path=pdf_path)
%time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

whenever PARALLEL is smaller than max_docs, I get:

Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
  File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 398, in _send_bytes
    self._send(buf)
  File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Otherwise (with PARALLEL greater than or equal to max_docs), the result is empty tables in PostgreSQL.
When parallelization is turned off, it works.
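
For readers unfamiliar with this error class: the traceback above comes from the queue's feeder thread (`_feed`), which writes serialized items into an OS pipe. A `BrokenPipeError` at that point generally means the read end of the pipe was already closed, for example because the process on the other side had exited. The sketch below reproduces only that low-level condition with a plain pipe on a POSIX system. The function name is hypothetical, and it says nothing about why the reader went away in Fonduer's case.

```python
import errno
import os


def write_after_reader_closed():
    """Write to a pipe whose read end is already closed; return the errno raised."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # simulate the receiving side going away
    try:
        os.write(write_fd, b"payload")
    except BrokenPipeError as exc:
        # CPython ignores SIGPIPE by default, so EPIPE surfaces as an exception.
        return exc.errno
    finally:
        os.close(write_fd)
    return None


# Compare against the symbolic constant rather than a hard-coded number.
print(write_after_reader_closed() == errno.EPIPE)
```

If this is what is happening, the more useful question is usually what caused the worker process to exit, rather than the write that failed.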

Best regards

@senwu
Collaborator

senwu commented Apr 11, 2018

Hi @mladvladimir, thanks for reporting this issue. Unfortunately, we can't reproduce it on our side. Can you give us more information about your environment and setup?

@lukehsiao lukehsiao added the needs-info Needs more information to replicate label Apr 11, 2018
@mladvladimir
Author

Hi @senwu,
I installed everything according to the docs except PostgreSQL; I used the Postgres.app installer instead of brew.

Machine is:
Model Name: MacBook Pro
Processor Name: Intel Core i7
Processor Speed: 2 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Memory: 8 GB

Regards

@lukehsiao
Contributor

Hi @mladvladimir

Can you try a fresh install using brew for postgres and using a virtualenv rather than conda?

@lukehsiao
Contributor

@mladvladimir Also, did you make sure to both create the db used in the tutorial, and download the tutorial data before running the tutorial?

@mladvladimir
Author

@lukehsiao

I tried again.
A new version of PostgreSQL is installed with brew (PostgreSQL 10.3);
the virtualenv is created without Anaconda;
the database is created and the data files are downloaded according to the tutorial;
the storage schema was initialized successfully (tables are created).
HTMLPreprocessor is defined without problems.
It consistently fails at the corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL) step.
Even stranger, all other steps that use UDFRunner with parallelism (CandidateExtractor, BatchFeatureAnnotator) run without problems.
I also partially ran max_storage_temp_tutorial in an Ubuntu 18.04 Docker container. It succeeds with (structural=True, lingual=True, visual=False), but the error remains if all three parameters are True.
Is there any particular version of python3 that you recommend?

By the way, shouldn't it be parallelism instead of parallel in test_parser.py?

corpus_parser.apply(doc_preprocessor, parallel=PARALLEL)

This makes that test run in single-threaded mode.
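
A minimal sketch of why such a typo can go unnoticed, assuming the runner's apply() accepts **kwargs (as the udf.py signature in the traceback later in this thread suggests). The `apply` function below is a hypothetical stand-in, not Fonduer's code, and its fallback to single mode is an assumption made for illustration.

```python
def apply(doc_loader, parallelism=None, **kwargs):
    """Hypothetical stand-in for a runner's apply(); not Fonduer's implementation."""
    # A misspelled keyword such as parallel=4 raises no error; it lands in kwargs,
    # so parallelism keeps its default and the single-process path is taken.
    mode = "single" if parallelism is None or parallelism < 2 else "parallel"
    return mode, kwargs


docs = ["a.html", "b.html"]
print(apply(docs, parallel=4))     # ('single', {'parallel': 4})
print(apply(docs, parallelism=4))  # ('parallel', {})
```

Because nothing fails, a test written this way can pass while never exercising the parallel code path.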

Generally speaking, the whole project is precious :)

Regards
Vladimir

@lukehsiao
Contributor

Unfortunately, I'm still not quite sure what is wrong. I haven't personally tested on Ubuntu 18.04 either; we have mostly been using 16.04, as you can see in our e2e Travis test and in the tutorials. If you have any more error logs or information that might help us recreate the issue, we can keep looking at it.

Is there any particular version of python3 that you recommend?

We have tested with Python 3.5 and 3.6.

By the way, shouldn't it be parallelism instead of parallel in test_parser.py?

Nice catch! I've opened a new PR (#48) to fix that.

@lukehsiao
Contributor

We've been able to reproduce this and are looking into it.

@lukehsiao lukehsiao added bug Something isn't working and removed needs-info Needs more information to replicate labels May 9, 2018
@lukehsiao lukehsiao added this to the v0.1.8 milestone May 9, 2018
@mladvladimir
Author

Thanks for the info.
After some observation, it seems the spaCy pipeline call is what has trouble with parallelism:

for proc in self.pipeline:
    proc(doc)

Maybe I'm wrong.

@lukehsiao
Contributor

lukehsiao commented May 10, 2018

This may be related: explosion/spaCy#1572

Update: It does appear to be spaCy related. In particular, parallelism works fine for other processes (e.g. candidate extraction), just not for the parser.

@lukehsiao lukehsiao changed the title from "[Errno 32] Broken pipe for OmniParser in parallel execution" to "[Errno 32] Broken pipe for OmniParser in parallel execution on OSX" May 10, 2018
@lukehsiao lukehsiao removed this from the v0.1.8 milestone May 31, 2018
@lukehsiao lukehsiao changed the title from "[Errno 32] Broken pipe for OmniParser in parallel execution on OSX" to "[Errno 32] Broken pipe for Parser in parallel execution on OSX" Jul 23, 2018
@sivasrc

sivasrc commented Oct 31, 2018

Hi @senwu,

When I try to execute the given sample
%time corpus_parser.apply(doc_preprocessor,parallelism=PARALLEL), I get the error below. I'm not sure why; any help would be appreciated.

[INFO] fonduer.utils.udf - Running UDF...
0% 0/78 [00:00<?, ?it/s]
---------------------------------------------------------------------------
SystemExit                                Traceback (most recent call last)
<timed eval> in <module>

~/.local/lib/python3.6/site-packages/fonduer/parser/parser.py in apply(self, doc_loader, pdf_path, clear, parallelism, progress_bar)
    108             clear=clear,
    109             parallelism=parallelism,
--> 110             progress_bar=progress_bar,
    111         )
    112 

~/.local/lib/python3.6/site-packages/fonduer/utils/udf.py in apply(self, doc_loader, clear, parallelism, progress_bar, **kwargs)
     70             self._apply_st(doc_loader, clear=clear, **kwargs)
     71         else:
---> 72             self._apply_mt(doc_loader, parallelism, clear=clear, **kwargs)
     73 
     74         # Close progress bar

~/.local/lib/python3.6/site-packages/fonduer/utils/udf.py in _apply_mt(self, doc_loader, parallelism, **kwargs)
    128                 out_queue=out_queue,
    129                 worker_id=i,
--> 130                 **self.udf_init_kwargs
    131             )
    132             udf.apply_kwargs = kwargs

~/.local/lib/python3.6/site-packages/fonduer/parser/parser.py in __init__(self, structural, blacklist, flatten, lingual, strip, replacements, tabular, visual, pdf_path, language, **kwargs)
    169         if self.lingual_parser.has_tokenizer_support():
    170             self.tokenize_and_split_sentences = self.lingual_parser.split_sentences
--> 171             self.lingual_parser.load_lang_model()
    172         else:
    173             self.tokenize_and_split_sentences = SimpleTokenizer().parse

~/.local/lib/python3.6/site-packages/fonduer/parser/spacy_parser.py in load_lang_model(self)
    125         if self.lang in self.languages:
    126             if not Spacy.model_installed(self.lang):
--> 127                 download(self.lang)
    128             model = spacy.load(self.lang)
    129         elif self.lang in self.alpha_languages:

~/.local/lib/python3.6/site-packages/spacy/cli/download.py in download(model, direct, *pip_args)
     36                             .format(m=model_name, v=version), pip_args)
     37         if dl != 0:  # if download subprocess doesn't return 0, exit
---> 38             sys.exit(dl)
     39         try:
     40             # Get package path here because link uses

SystemExit: 1

@senwu
Collaborator

senwu commented Oct 31, 2018

Hi @sivasrc, it seems your issue is related to downloading the spaCy language model. I suggest you try again or download the model manually (reference: https://spacy.io/usage/models).
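
One way to avoid hitting the download inside worker start-up is to check for the model before calling apply(). The sketch below only tests whether a Python package of a given name is importable. The package name `en_core_web_sm` is an assumption (the traceback shows only `self.lang`), and this check is not the same as Fonduer's own `Spacy.model_installed`, which may also rely on spaCy shortcut links.

```python
import importlib.util


def model_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# Assumed model package name, chosen for illustration only.
MODEL = "en_core_web_sm"
if not model_installed(MODEL):
    # Installing the model once, outside the parallel run, surfaces any
    # network or permission error directly instead of as SystemExit.
    print(f"Model not found; try: python -m spacy download {MODEL}")
```

Running the download command by hand also shows pip's own error output, which the SystemExit: 1 in the traceback hides.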

@senwu
Collaborator

senwu commented Nov 5, 2018

This issue has been fixed by spaCy, and we test it in #176.

@senwu senwu closed this as completed Nov 5, 2018
@sivasrc

sivasrc commented Dec 4, 2018 via email

stackoverflowed pushed a commit to stackoverflowed/multimodal that referenced this issue Dec 4, 2021
Enclose by double quotes otherwise version not specified and stdout written to a file "=0.5.0"