Publishing grobid-core to Maven central #59

Closed
sujen1412 opened this issue Jul 29, 2015 · 19 comments

Comments

@sujen1412
Contributor

I am trying to integrate Grobid into Apache Tika for metadata extraction. It would be nice to have grobid-core published to Maven Central, which would make adding the dependency in pom.xml easier.
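
For illustration, a consuming project's pom.xml would then only need a dependency entry along these lines. This is a minimal sketch: the groupId `org.grobid` and the `grobid.version` property are assumptions, not coordinates confirmed in this thread.

```xml
<!-- Sketch of a pom.xml dependency on grobid-core.
     groupId and the grobid.version property are assumed for illustration. -->
<dependency>
  <groupId>org.grobid</groupId>
  <artifactId>grobid-core</artifactId>
  <version>${grobid.version}</version>
</dependency>
```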

@kermitt2
Owner

Hello!

Thanks for the issue!

These are the problems I can remember so far about that:

  1. In addition to the grobid-core.jar file, Grobid needs the resources located under grobid-home to run, and these include several large ML models. I think this is too large to upload to a Maven repo when producing a release.

I don't know if there is a mechanism in Maven to download large resource files that are not artifacts in a Maven repository. If there is, we could think about hosting the resource files on Amazon S3, for example.

We can greatly reduce the size of grobid-home by not including the CRF++ models and relying only on Wapiti. CRF++ is still included and can optionally be used, but it does not offer any advantage over Wapiti. On the other hand, we might use another CRF library in the future, so the size of the resource files could also increase.

  2. Grobid has the following dependencies on libraries that are also not in Maven Central:
  • wapiti, based on a fork where we included JNI and fixed a couple of things, https://github.com/kermitt2/Wapiti
  • CRF++, which could be removed as mentioned above
  • the ImageIO plugin for supporting PPM format (this is used to convert PPM images extracted from the PDF into PNG images),
  • wipo-analysers, which are custom analysers based on Lucene for supporting CJK languages (a contribution of WIPO to Grobid).

For these dependencies, Grobid currently uses a local file-based repo.
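
A local file-based repository of this kind is usually declared in pom.xml roughly as follows. This is a sketch under assumptions: the repository id and the `lib` directory are hypothetical and are not copied from Grobid's actual build file.

```xml
<!-- Sketch of a file-based Maven repository bundled with the project.
     The id and the lib/ path are hypothetical placeholders. -->
<repositories>
  <repository>
    <id>local-third-party-repo</id>
    <name>Third-party jars shipped with the project</name>
    <url>file://${project.basedir}/lib</url>
  </repository>
</repositories>
```

Maven resolves artifacts from such a directory only if the jars are laid out in the standard repository structure (groupId/artifactId/version), which is also why downstream projects cannot resolve them unless they add the same repository.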

@sujen1412
Contributor Author

Hi @kermitt2, thank you for the quick reply and the context. Some comments from me below.

In addition to the grobid-core.jar file, Grobid needs the resources located under grobid-home to run, and these include several large ML models. I think this is too large to upload to a Maven repo when producing a release.
I don't know if there is a mechanism in Maven to download large resource files that are not artifacts in a Maven repository. If there is, we could think about hosting the resource files on Amazon S3, for example.

I don't think this is an issue for pushing Maven artifacts. I say this because

  • we can just host/make available the large artifacts somewhere else.
  • we require the user to include these in the resources path when running their code. This removes the requirement for Grobid to worry about this
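
As one possible shape for the first option, a build step could fetch the resources from an external host. The sketch below assumes the third-party download-maven-plugin. The URL, execution id, output directory, and plugin version are illustrative placeholders, not anything agreed in this thread.

```xml
<!-- Sketch: download and unpack externally hosted resources during the build.
     URL, execution id, output directory, and version are hypothetical. -->
<plugin>
  <groupId>com.googlecode.maven-download-plugin</groupId>
  <artifactId>download-maven-plugin</artifactId>
  <version>1.3.0</version>
  <executions>
    <execution>
      <id>fetch-grobid-home</id>
      <phase>process-resources</phase>
      <goals>
        <goal>wget</goal>
      </goals>
      <configuration>
        <url>https://example.org/grobid-home.zip</url>
        <unpack>true</unpack>
        <outputDirectory>${project.build.directory}/grobid-home</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```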

wapiti, based on a fork where we included JNI and fixed a couple of things, https://github.com/kermitt2/Wapiti

How is this licensed? I see it is copyrighted, but is it licensed? If the license is permissive enough, we could perhaps make this available somewhere as well.

the ImageIO plugin for supporting PPM format (this is used to convert PPM images extracted from the PDF into PNG images),

Do you have a link to the above project please?

wipo-analysers, which are custom analysers based on Lucene for supporting CJK languages (a contribution of WIPO to Grobid).

Same as above. Do you have a link to this component(s) so I can check it out and see what the status of the error is?

Thank you very much in advance.

@kermitt2
Owner

Hi Sujen!

Normally, all these packages have a licence compatible with Grobid's licence, Apache 2, so they could be included directly with Grobid.

Now the more difficult part: wipo-analysers has no project page. I've packaged everything as a Maven project, so I could create a GitHub repo for it. I will double-check with the WIPO people, but as it is Apache 2, that should be no problem.

Thanks a lot for your effort!

@chrismattmann
Contributor

Thanks @kermitt2 and @sujen1412. Right now we should be able to publish Wapiti, ImageIO, and the analyzers to the Central repo under the Grobid group, indicating that they are used in conjunction with Grobid. If at some point those developer communities would like to take over management and ownership of publishing their artifacts to Central, they can, and we can update Grobid to use them at that time. For now, though, it's safe to publish the rest of the jars.

GREAT library @kermitt2 ! We used it in DARPA Memex with great success.

kermitt2 added a commit that referenced this issue Aug 7, 2015
Fix for Grobid #59 - Publishing to Maven Central
@chrismattmann
Contributor

OK I got it working! https://issues.apache.org/jira/browse/TIKA-1699
It's going to be committed now, great work @sujen1412. Great library @kermitt2. Love it. Thanks and you're now integrated into Apache Tika!

@chrismattmann
Contributor

@sujen1412 this issue can be closed.

@chrismattmann
Contributor

@sujen1412 I am seeing build errors on Jenkins for Tika: https://builds.apache.org/job/tika-trunk-jdk1.7/822/ It seems that the other jars aren't in Central like we talked about. Can you please take care of that?

@kermitt2
Owner

Many thanks @sujen1412 and @chrismattmann for your efforts to integrate GROBID in Apache Tika!

And thank you Chris for your nice words. It's really a pleasure to see the library used and considered useful!

@chrismattmann
Contributor

Hi @sujen1412, can you re-open this? We need either those jars published to Central or another mechanism to integrate into Tika. One thing I was thinking of was simply connecting to the GROBID server. See the discussion at http://issues.apache.org/jira/browse/TIKA-1699

@chrismattmann
Contributor

OK I filed an issue to upload the Wapiti jar fork:
https://issues.sonatype.org/browse/OSSRH-17124

@chrismattmann
Contributor

OK here is the issue for EUGFC ImageIO plugin: https://issues.sonatype.org/browse/OSSRH-17126

@chrismattmann
Contributor

Here's the one for Language Detection: https://issues.sonatype.org/browse/OSSRH-17127

@chrismattmann
Contributor

For Chasen CRFPP: https://issues.sonatype.org/browse/OSSRH-17128

@chrismattmann
Contributor

Here's the WIPO analysers: https://issues.sonatype.org/browse/OSSRH-17129 That should be all of them.

@rawsh

rawsh commented Sep 21, 2017

@kermitt2 I tried to use Grobid from Maven Central, but the build fails:

[ivy:resolve] 		:: com.cybozu#language-detection;09-13-2011: not found
[ivy:resolve] 		:: eugfc#imageio-pnm;1.0: not found
[ivy:resolve] 		:: org.wipo.analysers#wipo-analysers;0.0.1: not found

Do you know what might be going on? Can I add these locally?

@kermitt2
Owner

Hello, these libraries ship locally with Grobid (under grobid/lib or grobid-core/lib), so they are not loaded from Maven Central when Grobid builds. You can look at grobid-core/pom.xml to see how the local repository is defined so that these dependencies resolve without trouble.

@rawsh

rawsh commented Sep 29, 2017

@kermitt2 Thanks, I fixed these issues by including the jars and changing the language-detection dependency to
com.cybozu.labs#langdetect;1.1-20120112
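
In an Ivy-based build, which the `[ivy:resolve]` output above suggests, that coordinate would correspond to an ivy.xml entry roughly like the one below. This is a sketch: the surrounding `dependencies` element is assumed, and only the org, name, and revision come from the comment above.

```xml
<!-- Sketch of the ivy.xml entry matching com.cybozu.labs#langdetect;1.1-20120112.
     The enclosing file layout is assumed. -->
<dependencies>
  <dependency org="com.cybozu.labs" name="langdetect" rev="1.1-20120112"/>
</dependencies>
```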

@lfoppiano
Collaborator

I think that, with the current deployment on Bintray, we can close this issue, can't we?

https://bintray.com/rookies/maven/grobid

de-code pushed a commit to elifesciences/grobid that referenced this issue Nov 29, 2019
Fix for Grobid kermitt2#59 - Publishing to Maven Central

Former-commit-id: 999c43a