JsoupGenerator -> DefaultLuceneDocumentGenerator in doc #1355

Merged 1 commit on Aug 18, 2020

Contributor

@adamyy adamyy commented Aug 18, 2020

It seems JsoupGenerator was removed in commit 9a28a0 (04/07/2020), but some references to it remain:

docs/document-matchzoo.md:13: -generator JsoupGenerator -threads 16 -input /path/to/robust04 \
docs/regressions-log.md:25:Previously, Core17 used the `NewYorkTimesCollection` and was indexed with `JsoupGenerator` as the generator, which assumes that the input is HTML (or XML) and removes tags.
docs/runbook-ecir2019-ccrf.md:68: -generator JsoupGenerator -threads 16 -input /path/to/robust04 \
docs/runbook-ecir2019-ccrf.md:73: -generator JsoupGenerator -threads 16 -input /path/to/robust05 \
docs/runbook-ecir2019-ccrf.md:78: -generator JsoupGenerator -threads 16 -input /path/to/core17 \
docs/runbook-trec2018-anserini.md:117: -generator JsoupGenerator -threads 44 -input /path/to/cw12 -index lucene-index.cw12.pos+docvectors+rawdocs \
docs/runbook-trec2018-anserini.md:135: -generator JsoupGenerator -threads 44 -storePositions -storeDocvectors -storeRawDocs -optimize \
docs/runbook-trec2018-anserini.md:145: -generator JsoupGenerator -threads 44 -input /path/to/cw12 -index \
docs/runbook-trec2018-anserini.md:164: -generator JsoupGenerator -threads 44 -storePositions -storeDocvectors -storeRawDocs \
docs/runbook-trec2018-anserini.md:174: -generator JsoupGenerator -threads 44 -input /path/to/cw12 -index \
docs/runbook-trec2018-anserini.md:185: -generator JsoupGenerator -threads 8 -uniqueDocid -storePositions -storeDocvectors \
docs/runbook-trec2018-anserini.md:195:    -generator JsoupGenerator -threads 8 -uniqueDocid -storePositions -storeDocvectors \
docs/runbook-trec2018-anserini.md:209:-generator JsoupGenerator -threads 16 -input enwiki-20180620-pages-articles.xml.bz2 -index \
docs/runbook-trec2018-h2oloo.md:20: -generator JsoupGenerator -threads 16 -input /path/to/robust04 \
docs/runbook-trec2018-h2oloo.md:29: -generator JsoupGenerator -threads 16 -input /path/to/robust05 \
docs/runbook-trec2018-h2oloo.md:38: -generator JsoupGenerator -threads 16 -input /path/to/core17 \
src/main/python/paragraph_indexing/README.md:48: -input /path/to/disk45/ -generator JsoupGenerator \
src/main/python/passage_retrieval/example/robust04.md:10:--generator JsoupGenerator \
src/main/python/passage_retrieval/example/robust04.md:20:--generator JsoupGenerator \
src/main/python/passage_retrieval/example/robust04.md:31:--generator JsoupGenerator \
src/main/python/passage_retrieval/example/robust04.md:39:nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection -generator JsoupGenerator
src/main/resources/fine_tuning/collections.yaml:16:      generator: JsoupGenerator
src/main/resources/fine_tuning/collections.yaml:49:      generator: JsoupGenerator

Most of these are in documentation rather than code. The exception is collections.yaml, which seems to be used by fine_tuning/run_batch.py.

@lintool should these be updated accordingly? I fixed the one in solrini.md because I ran into it.
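
For anyone who wants to re-check the listing above after further cleanups, here is a minimal Python sketch of a recursive search that prints matches in the same `path:line:text` shape. It is only an illustration: `find_references` is a hypothetical helper name (not part of Anserini), and the suffix filter is an assumption about which file types are worth scanning.

```python
from pathlib import Path
from typing import Iterator, Tuple


def find_references(
    root: str,
    needle: str,
    suffixes: Tuple[str, ...] = (".md", ".yaml", ".py"),  # assumed file types
) -> Iterator[Tuple[str, int, str]]:
    """Yield (relative_path, line_number, line) for each line containing `needle`."""
    root_path = Path(root)
    for path in sorted(root_path.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield path.relative_to(root_path).as_posix(), lineno, line


# Example usage from the repository root (hypothetical invocation):
#   for rel, lineno, line in find_references(".", "JsoupGenerator"):
#       print(f"{rel}:{lineno}:{line}")
```

Plain substring matching is used on purpose, so mentions in prose and in YAML values are both reported, as in the listing above.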

codecov bot commented Aug 18, 2020

Codecov Report

Merging #1355 into master will not change coverage.
The diff coverage is n/a.


@@            Coverage Diff            @@
##             master    #1355   +/-   ##
=========================================
  Coverage     51.87%   51.87%           
  Complexity      810      810           
=========================================
  Files           154      154           
  Lines          8627     8627           
  Branches       1224     1224           
=========================================
  Hits           4475     4475           
  Misses         3781     3781           
  Partials        371      371           

Member

lintool commented Aug 18, 2020

Hi @adamyy, thanks for noting! I generally don't update non-regression, non-running code, because otherwise it'd be impossible to verify correctness. I'll merge this PR, but can you file a new issue and assign it to me? Just drop in #1355 (comment)

@lintool lintool self-requested a review August 18, 2020 13:40
@lintool lintool merged commit 04ef3bb into castorini:master Aug 18, 2020