pipeline breaks with longer html #93

Open
fsasaki opened this issue Aug 26, 2016 · 6 comments

fsasaki commented Aug 26, 2016

See
http://api.freme-project.eu/current/pipelining/chain/37
and the two attached requests. The only difference between them is the length of the files processed. The longer file is less than 100K - is this still an issue?

request-with-short-html-doc.txt
request-with-long-html-doc.txt

@jnehring

The first pipeline step detects 1052 named entities; the second creates 1052 SPARQL queries and sends them to DBpedia. This takes a long time, and there is a timeout of 60 seconds configured.

I transferred the pipeline to freme-dev and changed the timeouts to 600 seconds in three places there:

  • apache mod_proxy
  • timeout of the rest controller in application.properties
  • timeout of requests in pipelines in the source code of PipelineService.java

Now the requests fail after 10 minutes. I am not sure how to deal with this. These timeouts make sense, but on the other hand it should still be possible to process large files.

The problematic service here is e-Link with the slow DBpedia endpoint. A client-side solution that I have not explored yet is to download the entities via freme-ner and then send them in smaller batches to e-Link. A server-side solution would be to set the timeouts to 1 hour, or to load DBpedia into our own triple store and hope that this improves response times.
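A minimal sketch of the client-side batching idea, in Python. The names `chunks`, `enrich_in_batches`, `send_batch` and the batch size of 50 are hypothetical; `send_batch` stands in for whatever call actually posts a batch of entities to e-Link, which this thread does not show.

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunks(items: List[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enrich_in_batches(
    entities: List[T],
    send_batch: Callable[[List[T]], List[R]],  # hypothetical: one e-Link request per batch
    batch_size: int = 50,                      # assumed value, tune against the timeout
) -> List[R]:
    """Send the entities in small batches and concatenate the per-batch results."""
    results: List[R] = []
    for batch in chunks(entities, batch_size):
        results.extend(send_batch(batch))
    return results
```

With a batch size of 50, the 1052 entities from the long document would be split into 22 requests, each of which only has to finish within the per-request timeout.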

@jnehring jnehring assigned fsasaki and m1ci and unassigned fsasaki Aug 29, 2016

jnehring commented Sep 5, 2016

In the last developers' call @m1ci said he will check whether the implementation of e-Link can be sped up somehow. Possibilities to explore, from what I recall of the discussion:

  1. reduce the number of sparql queries, by fetching information about multiple links in one go
  2. implement caching / avoid redundant calls about the same link
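Both ideas can be sketched in Python under some assumptions. The query shape below (a SPARQL 1.1 `VALUES` clause over a generic `?entity ?p ?o` pattern) is only an illustration, not the actual e-Link template, and `build_values_query`, `fetch_with_cache` and `fetch_many` are hypothetical names. `fetch_many` stands in for the code that runs one query against the endpoint and returns a result per URI.

```python
from typing import Callable, Dict, List

UNSAFE_IRI_CHARS = '<>"{}|\\^`'


def build_values_query(uris: List[str]) -> str:
    """Build one SPARQL query that covers several entities via a VALUES clause."""
    if not uris:
        raise ValueError("at least one URI is required")
    for uri in uris:
        # Refuse characters that are not allowed inside <...> in SPARQL.
        if any(ch in UNSAFE_IRI_CHARS or ch.isspace() for ch in uri):
            raise ValueError(f"not a safe IRI: {uri!r}")
    values = " ".join(f"<{uri}>" for uri in uris)
    return (
        "SELECT ?entity ?p ?o WHERE { "
        f"VALUES ?entity {{ {values} }} "
        "?entity ?p ?o }"
    )


def fetch_with_cache(
    uris: List[str],
    cache: Dict[str, object],
    fetch_many: Callable[[List[str]], Dict[str, object]],  # hypothetical endpoint call
) -> Dict[str, object]:
    """Fetch only the URIs that are not cached yet, in a single call."""
    missing = [u for u in dict.fromkeys(uris) if u not in cache]
    if missing:
        cache.update(fetch_many(missing))
    return {u: cache[u] for u in uris if u in cache}
```

A very long `VALUES` list can itself become slow or exceed request size limits, so in practice this would probably be combined with a cap on the number of URIs per query.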

@jnehring jnehring removed their assignment Sep 14, 2016
@jnehring

any update here?


jnehring commented Dec 5, 2016

Pipeline 37 does not exist anymore. But one can reproduce the problem using this curl request.

The problem still occurs.

@jnehring jnehring added bug and removed question labels Dec 5, 2016
m1ci added a commit to freme-project/e-services that referenced this issue Dec 6, 2016

m1ci commented Dec 6, 2016

I just made an optimization update to e-Link so that enrichment is performed only on unique entities. In other words, if there are multiple occurrences of the same entity, the enrichment will be performed only once. @jnehring can you please test now?
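As an illustration only (the actual change lives in the referenced e-services commit and is not reproduced in this thread), enriching each unique entity once and reusing the result for every occurrence could look roughly like this in Python. `enrich_unique`, `enrich` and the `(text, uri)` mention shape are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def enrich_unique(
    mentions: List[Tuple[str, str]],   # assumed shape: (surface text, entity URI)
    enrich: Callable[[str], object],   # hypothetical: one endpoint lookup per URI
) -> List[Tuple[str, str, object]]:
    """Call `enrich` once per distinct URI and attach the result to every mention."""
    enriched: Dict[str, object] = {}
    for _, uri in mentions:
        if uri not in enriched:
            enriched[uri] = enrich(uri)
    return [(text, uri, enriched[uri]) for text, uri in mentions]
```

The saving depends on how repetitive the document is: if most of the detected entities are distinct, the number of queries barely changes.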

@jnehring jnehring self-assigned this Dec 6, 2016

jnehring commented Dec 7, 2016

I had issues executing the curl request I posted earlier. Therefore I created the pipeline with id 56 on freme-dev, which executes freme-ner first and then e-Link.

It still fails on the long document:

curl -X POST -H "Content-Type: text/html" -d '@long-document.html' "http://api-dev.freme-project.eu/current/pipelining/chain/56"

@m1ci could you process the long document successfully?

Your update makes sense, even if it cannot process the long document. Since the HTTP requests time out after a while, there has to be a maximum length of text / maximum number of entities that the service can process.
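A back-of-the-envelope sketch of that limit in Python, assuming the queries run one after another and that the endpoint dominates the request time (neither is confirmed in this thread). `max_entities` and the 0.5 s per-query figure are hypothetical.

```python
def max_entities(timeout_s: float, seconds_per_query: float) -> int:
    """Upper bound on unique entities that fit into one request, if queries run sequentially."""
    if seconds_per_query <= 0:
        raise ValueError("seconds_per_query must be positive")
    return int(timeout_s // seconds_per_query)


# Under the same assumption, 1052 queries failing to finish within 600 s means
# the average query took longer than this many seconds:
lower_bound_per_query = 600 / 1052

# Example with an assumed average of 0.5 s per query and the original 60 s timeout:
limit_at_60s = max_entities(60, 0.5)
```

Measuring the real average query time against the DBpedia endpoint would turn this into a concrete limit that the service could enforce or document.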
