Skip to content
This repository has been archived by the owner on Feb 19, 2020. It is now read-only.

Packaging refactor #38

Open
briantopping opened this issue Jul 25, 2015 · 22 comments
Open

Packaging refactor #38

briantopping opened this issue Jul 25, 2015 · 22 comments

Comments

@briantopping
Copy link

Hiyas, I'd like to create a PR for packaging and wanted to see if it would be accepted before spending time on it. I'm still stuck on e0238ce given epic-parser-en-span_2.11/2015.2.19 being incompatible with anything newer, so I could just as easily make the changes I need locally and remain forked.

What I'm after at the minimum is to create two modules, one for "core" and one for "tools". Goal here is to get Tika out of the core dependencies, used only in epic.preprocess.TextExtractor, which is really a command-line tool. I don't know how many other tools there are like this or what other effects it might have on the dependency closure, but I think it will be significant.

The reason I am even doing that is Tika depends on Apache POI and a kitchen sink of other detritus. POI has a split-package problem. Once everything is cleaned up, I should at least be able to make the core module into an OSGi bundle.

@dlwh
Copy link
Owner

dlwh commented Jul 25, 2015

That'd be great!
On Jul 25, 2015 11:12 AM, "Brian Topping" notifications@github.com wrote:

Hiyas, I'd like to create a PR for packaging and wanted to see if it would
be accepted before spending time on it. I'm still working from e0238ce
e0238ce
given a lack of epic-parser-en-span_2.11 being regenerated since then, so I
could just as easily make the changes I need locally.

What I'm after at the minimum is to create two modules, one for "core" and
one for "tools". Goal here is to get Tika out of the core dependencies,
used only in epic.preprocess.TextExtractor, which is really a command-line
tool. I don't know how many other tools there are like this or what other
effects it might have on the dependency closure, but I think it will be
significant.

The reason I am even doing that is Tika depends on Apache POI and a
kitchen sink of other detritus. POI has a split-package
http://wiki.osgi.org/wiki/Split_Packages problem. Once everything is
cleaned up, I should at least be able to make the core module into an OSGi
bundle.


Reply to this email directly or view it on GitHub
#38.

@briantopping
Copy link
Author

Awesome! Any chance I could twist your arm to schedule a rebuild of the parsers?

@dlwh
Copy link
Owner

dlwh commented Jul 25, 2015

The current parser (2015.2.19) works against master.
On Jul 25, 2015 12:18 PM, "Brian Topping" notifications@github.com wrote:

Awesome! Any chance I could twist your arm to schedule a rebuild of the
parsers?


Reply to this email directly or view it on GitHub
#38 (comment).

@briantopping
Copy link
Author

Ohhh, interesting, ok I'll try it again, thanks

@dlwh
Copy link
Owner

dlwh commented Jul 25, 2015

I'd prefer anything with "Train" or "Pipeline" be left in core, along with
anything in corpora. Basically I consider "training" code to be core.

Actually, ideally, I'd rather set it up in such a way that all of the main
classes are in core, and Tika is an optional dependency that enables
content extraction from things other than plain text.

That seems like the best of all worlds?

On Sat, Jul 25, 2015 at 1:11 PM, Brian Topping notifications@github.com
wrote:

I think these are all most of the main methods. Are there any in this
list that shouldn't move to 'epic-tools'?

epic.corpora  (2 usages found)
    CONLLSequenceReader.scala  (1 usage found)
        (70: 3) def main(args: Array[String]) {
    MascUtil.scala  (1 usage found)
        (31: 3) def main(args: Array[String]) {
epic.features  (1 usage found)
    HackyHeadFinderTest.scala  (1 usage found)
        (15: 3) def main(args: Array[String]) {
epic.parser  (2 usages found)
    ParserPipeline.scala  (1 usage found)
        (102: 3) def main(args: Array[String]) {
    ParserTester.scala  (1 usage found)
        (45: 3) def main(args: Array[String]) {
epic.parser.kbest  (1 usage found)
    KBestParseTreebank.scala  (1 usage found)
        (34: 3) def main(args: Array[String]) = {
epic.parser.models  (1 usage found)
    ParserTrainer.scala  (1 usage found)
        (174: 3) def main(args: Array[String]):Unit = {
epic.parser.projections  (2 usages found)
    ConstraintAnchoring.scala  (1 usage found)
        (376: 3) def main(args: Array[String]) {
    OracleParser.scala  (1 usage found)
        (185: 3) def main(args: Array[String]) {
epic.preprocess  (5 usages found)
    MLSentenceSegmenter.scala  (1 usage found)
        (371: 3) def main(args: Array[String]):Unit = {
    SentenceSegmenter.scala  (1 usage found)
        (24: 3) def main(args: Array[String]):Unit = {
    StreamSentenceSegmenter.scala  (1 usage found)
        (53: 3) def main(args: Array[String]) {
    Textify.scala  (1 usage found)
        (12: 3) def main(args: Array[String]): Unit= {
    TreebankTokenizer.scala  (1 usage found)
        (62: 3) def main(args: Array[String]) = {
epic.sentiment  (2 usages found)
    SentimentEvaluator.scala  (1 usage found)
        (129: 3) def main(args: Array[String]) {
    SentimentTreebankPipeline.scala  (1 usage found)
        (38: 3) def main(args: Array[String]):Unit = {
epic.sequences  (4 usages found)
    SemiNERPipeline.scala  (2 usages found)
        (34: 3) def main(args: Array[String]) {
        (124: 3) def main(args: Array[String]) {
    TrainPosTagger.scala  (2 usages found)
        (17: 3) def main(args: Array[String]) {
        (42: 3) def main(args: Array[String]) {
epic.trees  (1 usage found)
    TraceToSlashCategoryConverter.scala  (1 usage found)
        (62: 3) def main(args: Array[String]):Unit = {
epic.trees.util  (1 usage found)
    FilterTreesByLength.scala  (1 usage found)
        (17: 3) def main(args: Array[String]) = {
epic.util  (1 usage found)
    ProcessTextMain.scala  (1 usage found)
        (29: 3) def main(args: Array[String]) = {


Reply to this email directly or view it on GitHub
#38 (comment).

@briantopping
Copy link
Author

Ok, I'm pretty far along with this already, so let me take a look at it. I think you'd be surprised that the workflow is not disrupted by these kinds of changes, if you are importing via sbt, you'll get both projects as a transitive. If you are working in the IDE, you can still just right-click on the main method and run it, same results. Both projects will load into the IDE transparently. Maybe you could try it and give specific feedback?

@dlwh
Copy link
Owner

dlwh commented Jul 25, 2015

I mostly run as part of an assembly jar. So as long as there's a
straightforward way to build an assembly that will be ok.

In addition, the train and pipeline main methods being in the core jar is
important. Everything else is negotiable.

Thanks!

-- David

On Sat, Jul 25, 2015 at 2:22 PM, Brian Topping notifications@github.com
wrote:

Ok, I'm pretty far along with this already, so let me take a look at it. I
think you'd be surprised that the workflow is not disrupted by these kinds
of changes, if you are importing via sbt, you'll get both projects as a
transitive. If you are working in the IDE, you can still just right-click
on the main method and run it, same results. Both projects will load into
the IDE transparently. Maybe you could try it and give specific feedback?


Reply to this email directly or view it on GitHub
#38 (comment).

@briantopping
Copy link
Author

Ok thanks for the feedback.

Part of decomposing modules into a DAG of modules means users who are trying to work with the code can get a faster map of what they are doing and where to focus their energy (it's a lot easier to dismiss whole swaths of code inaccessible to a given module). That said, it's never a good idea to create multiple modules from code that is always used together. Developers have to make arbitrary choices where code should go and it becomes a PITA.

I don't imagine generating models in production but building them in CI and deploying them wholesale in an upgrade process. Are you training in production? I'm starting to believe "tools" should be called "training", since that's what it seems most of the command line tools are oriented toward. But if all the training functionality needs to be a part of core... ?

@briantopping
Copy link
Author

Okay, been nonstop on this, was useful to learn the code a bit more. Generated three separate trees, the last was as you suggested just kind of punted on the refactor. For the second one, modules were broken out as preprocess->client->core. Core started with everything in the existing monolith with elements gradually lifted into higher modules with the intent of being able to abstract out code that isn't necessary to understand the lower levels.

There's some issues that are common for a code base like this, for instance JavaSentenceSegmenter being used by SlabTest, TreebankTokenizer being used by Tree. Both of these are more arguably more client UI oriented than core functionality, but the dependencies in tests and core code mean they can't be lifted. Runtime injection helps a lot here as I see you found with the ServiceLoader, but that becomes very clumsy very quickly without a global DI framework or use of implicits (both of which bring their own headaches). I've checked both of them in on my repo for reference if you are curious.

The other option is API changes, but that's usually not well received without discussion, if even then.

In any event the PR that's generated has the build bits that will make it easier to start extracting modules when there's a better idea where to go with this.

@dlwh
Copy link
Owner

dlwh commented Jul 28, 2015

awesome thanks! I'll look now.

I'd be curious to hear your recommendations about API refactoring.

It's important to me that "core" always have the training code and everything needed for it. It helps researchers (esp. my labmates) get up and running quickly. And, especially because it doesn't add too much dependency weight, there's relatively little harm.

Tika is a beast and it definitely needs to split out as much as possible.

@briantopping
Copy link
Author

Hey sorry for the delay to get back to you. This week has been a blur.

Totally understand what you're saying about the dependencies. Just curious though, is your crew using a build tool that deals with transitive dependencies well?

If so, there shouldn't ever be a problem that a user needs to include a dependency in their build except one they have direct client linkages to. The build tool will work out the classpath such that the indirect dependencies are included automatically.

In exchange for needing to know which library to link to initially, there's some positive tradeoffs:

  • The user isn't overwhelmed by a huge API, just the stuff they need to work with for the task at hand. This pays dividends because it's much easier to read the APIs for the more focused library and have a better chance that it was worth reading (instead of finding everything under the sun and giving up).
  • Releases are more focused. For instance, if the Parser is a separate module, the models can be bound to the parser. The build tool figures out the best closure for the parser to be both compatible and latest release, not require the user to sort that out.
  • More focused releases can lead to better testing. There's just less to test.
  • Iterations on modules show people with version numbers where the action in the code is.

As an occasional project owner, the reason I like modules is it keeps developers from making weird dependencies that are hard to unwind. It might be very useful that a debugging library knows how to print a tree (even though the tree printer imports a quarter of the universe), but it's far better if that's left to the deployer. It's the same problem as Tika dependencies in microcosm.

I totally appreciate such a refactor is much easier to say "yes" to after it's done, but I can almost assure you that you'll choke on your lunch if you saw one of these PRs. It's just really a leap of faith to approve it and a bigger deal for everyone to update their builds. But if you plan on continuing to improve the library (and I see that with the recent Neural commits), the sooner you start on this path, the better.

What can I do to help?

@dlwh
Copy link
Owner

dlwh commented Jul 30, 2015

hey, first, publish-signed went away and it doesn't ask for my p/w for my key anymore. what can i do?

@briantopping
Copy link
Author

Oof, just got this. let me get on it. How much time do I have?

On Jul 29, 2015, at 8:57 PM, David Hall notifications@github.com wrote:

hey, first, publish-signed went away and it doesn't ask for my p/w for my key anymore. what can i do?


Reply to this email directly or view it on GitHub #38 (comment).

@dlwh
Copy link
Owner

dlwh commented Jul 30, 2015

I mean, you replied in 10 min, that's pretty good. :)

I just figured it out

On Wed, Jul 29, 2015 at 8:09 PM, Brian Topping notifications@github.com
wrote:

Oof, just got this. let me get on it. How much time do I have?

On Jul 29, 2015, at 8:57 PM, David Hall notifications@github.com
wrote:

hey, first, publish-signed went away and it doesn't ask for my p/w for
my key anymore. what can i do?


Reply to this email directly or view it on GitHub <
https://github.com/dlwh/epic/issues/38#issuecomment-126167551>.


Reply to this email directly or view it on GitHub
#38 (comment).

@dlwh
Copy link
Owner

dlwh commented Jul 30, 2015

(I just added the sbt-pgp plugin again)

On Wed, Jul 29, 2015 at 8:10 PM, David Hall david.lw.hall@gmail.com wrote:

I mean, you replied in 10 min, that's pretty good. :)

I just figured it out

On Wed, Jul 29, 2015 at 8:09 PM, Brian Topping notifications@github.com
wrote:

Oof, just got this. let me get on it. How much time do I have?

On Jul 29, 2015, at 8:57 PM, David Hall notifications@github.com
wrote:

hey, first, publish-signed went away and it doesn't ask for my p/w for
my key anymore. what can i do?


Reply to this email directly or view it on GitHub <
https://github.com/dlwh/epic/issues/38#issuecomment-126167551>.


Reply to this email directly or view it on GitHub
#38 (comment).

@briantopping
Copy link
Author

briantopping commented Jul 30, 2015 via email

@dlwh
Copy link
Owner

dlwh commented Jul 30, 2015

i'm one of those pesky android people :)

On Wed, Jul 29, 2015 at 8:12 PM, Brian Topping notifications@github.com
wrote:

Whew! I was away from the computer for a while.

In the future, if you have an iPhone, you can reach me via messages on my
email address...


Reply to this email directly or view it on GitHub
#38 (comment).

@briantopping
Copy link
Author

briantopping commented Jul 30, 2015 via email

@briantopping
Copy link
Author

Should the epic-parser-en-span_2.11.2015.7.29-SNAPSHOT work with master branch? I'm getting the following exception:

  Cause: java.io.InvalidClassException: epic.util.NotProvided$; local class incompatible: stream classdesc serialVersionUID = -649101350749082174, local class serialVersionUID = -4684693011277973677
  at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:621)
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

@briantopping
Copy link
Author

Meh, my bad. sbt publish-m2 didn't do the right thing the first time I ran it. All clear. Thanks for publishing the model snapshots!!

@dlwh
Copy link
Owner

dlwh commented Jul 31, 2015

you're welcome!

@dlwh
Copy link
Owner

dlwh commented Jul 31, 2015

On Wed, Jul 29, 2015 at 12:09 PM, Brian Topping notifications@github.com
wrote:

Hey sorry for the delay to get back to you. This week has been a blur.

Totally understand what you're saying about the dependencies. Just curious
though, is your crew using a build tool that deals with transitive
dependencies well?

If so, there shouldn't ever be a problem that a user needs to include a
dependency in their build except one they have direct client linkages to.
The build tool will work out the classpath such that the indirect
dependencies are included automatically.

Except they need to develop Epic, and so it's useful if everything they
could need is in one project. Ideally one module. Research code bases (and
epic is one) benefit from being able to quickly iterate, even if it
degrades overall software engineering quality some.

In exchange for needing to know which library to link to initially,
there's some positive tradeoffs:

  • The user isn't overwhelmed by a huge API, just the stuff they need
    to work with for the task at hand. This pays dividends because it's much
    easier to read the APIs for the more focused library and have a better
    chance that it was worth reading (instead of finding everything under the
    sun and giving up).

I'm not entirely unsympathetic to this, but isn't this more an argument
for putting pure client stuff in a leaf project and everything needed for
training and decoding (and researching) in the core upstream library?

  • Releases are more focused. For instance, if the Parser is a separate
    module, the models can be bound to the parser. The build tool figures out
    the best closure for the parser to be both compatible and latest release,
    not require the user to sort that out.

It's already the case that the parser model files depend on a particular
version of epic? Sometimes (as with 2015.2.19 for a while) it works for a
bit longer, but...

  • More focused releases can lead to better testing. There's just less
    to test.

I don't have much to say on this one.

  • Iterations on modules show people with version numbers where the
    action in the code is.

As an occasional project owner, the reason I like modules is it keeps
developers from making weird dependencies that are hard to unwind. It might
be very useful that a debugging library knows how to print a tree (even
though the tree printer imports a quarter of the universe), but it's far
better if that's left to the deployer. It's the same problem as Tika
dependencies in microcosm.

Splitting out Tika was clearly a good idea and I had kind of wanted to do
it anyway.

I go back and forth between splitting and merging. If you split too much
it's hard to decide where things go and which modules are needed to just
"parse stuff", or whatever. Breeze's ancestors used to be heavily split up,
and it was annoying to deal with. (Where'd I put that file? Ah crap I need
to move this test harness code into main because this other module needs to
use it in its tests.) I now mostly segment things based on dependencies
needed, which gives breeze-core, breeze-viz, and breeze-natives. The tika
split works nicely with that (especially since epic doesn't have all that
many dependencies besides Tika and whatever Breeze brings in.)

I totally appreciate such a refactor is much easier to say "yes" to after
it's done, but I can almost assure you that you'll choke on your lunch if
you saw one of these PRs. It's just really a leap of faith to approve it
and a bigger deal for everyone to update their builds. But if you plan on
continuing to improve the library (and I see that with the recent Neural
commits), the sooner you start on this path, the better.

Anyway, I'm actually totally happy with the preprocess -> client -> core
version of the pipeline, except that the corpus utils (MascUtil) need to be
in core. (Clients don't need to deal with corpora like MASC directly...)

Clients can know that they just need to look at the classes in either
preprocess or client. The researchers only have to look at one library.
That seems ideal to me.

-- David

What can I do to help?


Reply to this email directly or view it on GitHub
#38 (comment).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants