Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support KILT for Pyserini's h/d/search #405

Merged
merged 52 commits into from
Apr 29, 2021

Conversation

yuxuan-ji
Copy link
Contributor

@yuxuan-ji yuxuan-ji commented Mar 8, 2021

Changes to the pyserini h/d/search scripts:

  • Adds non-required flag --topics-format which defaults to default. Other option is kilt
  • Adds non-required flag --tokenizer to dsearch. This is useful when the model used does not have a tokenizer specified, such as KILT's (it uses the bert-base-uncased tokenizer).
  • Adds non-required flag --output-format which defaults to trec. Other options are msmarco and kilt.
  • Deprecated the --msmarco flag, as it is now specified through the above.

Adds the KILT evaluation script to pyserini.eval

Introduces the following abstractions:

  • There is now a QueryIterator class for each topic format. It does the job of doing any pre/post processing required when loading/iterating through the dataset. Notably, KILT does some post-processing on the queries, and needs to be loaded from a file currently.
  • There is now a OutputWriter class for TREC, MSMARCO, and KILT's formats.

Added integration tests are in:
yuxuan-ji#1

Successfully ran integration tests for (can run more if needed, they take quite a while though):

  • DPR curated, nq (hsearch + dsearch)
  • TCT-Colbert, msmarco doc (hsearch)

@yuxuan-ji yuxuan-ji marked this pull request as draft March 8, 2021 01:37
@ronakice ronakice self-requested a review March 8, 2021 02:44
pyserini/query_iterator.py Outdated Show resolved Hide resolved
pyserini/output_writer.py Outdated Show resolved Hide resolved
@lintool
Copy link
Member

lintool commented Apr 1, 2021

hey @ronakice can you coordinate with @yuxuan-ji to see where this is going? should be be part of next release? https://github.com/castorini/pyserini/projects/1

@lintool
Copy link
Member

lintool commented Apr 1, 2021

We'll probably need integration tests, like https://github.com/castorini/pyserini/tree/master/integrations

To make sure things don't break moving forward...

@yuxuan-ji yuxuan-ji marked this pull request as ready for review April 22, 2021 01:29
Comment on lines +27 to +33
@unique
class TopicsFormat(Enum):
DEFAULT = 'default'
KILT = 'kilt'


class QueryIterator(ABC):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diff too big, but here's the queryiterator class

@@ -63,7 +63,7 @@ def _get_predictions_thread(arguments):
doc_scores = []

if use_bigrams:
tokens = filter(lambda word: word not in STOPWORDS, word_tokenize(query))
tokens = filter(lambda word: word.lower() not in STOPWORDS, word_tokenize(query))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix to previous script

# to convert KILT's dpr_multi_set_f_bert.0 model into a PyTorch checkpoint

if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Convert KILT-dpr corpus into the index & docid file read by pyserini')
Copy link
Contributor Author

@yuxuan-ji yuxuan-ji Apr 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

script to convert the KILT's pickled encoded wiki into a faiss index

@@ -26,7 +26,7 @@
doc["id"] = raw["_id"]
doc["contents"] = "".join(raw["text"])
if args.bigrams:
tokens = filter(lambda word: word not in STOPWORDS, word_tokenize(doc["contents"]))
tokens = filter(lambda word: word.lower() not in STOPWORDS, word_tokenize(doc["contents"]))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix to previous script

@@ -42,7 +42,7 @@
doc["id"] = f"{raw['_id']}-{i}"
p = texts[i]
if args.bigrams:
tokens = filter(lambda word: word not in STOPWORDS, word_tokenize(p))
tokens = filter(lambda word: word.lower() not in STOPWORDS, word_tokenize(p))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix to previous script



if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Compute embeddings for KILT topics')
Copy link
Contributor Author

@yuxuan-ji yuxuan-ji Apr 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

script to encode KILT topics w/ dpr

Copy link

@TambourineMan42 TambourineMan42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, most of this looks good, didn't see anything glaringly wrong, @MXueguang do you have any suggestions?
Yikes, commented from the wrong account hahaha!

Copy link
Member

@ronakice ronakice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay switched back to my other system! An LGTM from the right account! 😄

@MXueguang
Copy link
Member

lgtm. but we'd better run the entire integration tests on tuna/orca before we merge it

@lintool
Copy link
Member

lintool commented Apr 27, 2021

Agreed. @yuxuan-ji can you fix conflicts? I'll run tests, and then I can merge.

@MXueguang
Copy link
Member

just fyi @lintool, the integration tests for this PR is here yuxuan-ji#1
@yuxuan-ji since we review the above changes, maybe merge the update for tests into this branch too?
(then we can run test on the HEAD of this branch)

@lintool
Copy link
Member

lintool commented Apr 27, 2021

Sure!

@lintool
Copy link
Member

lintool commented Apr 27, 2021

I'm currently verifying #513 right now. Will queue this up next.

@yuxuan-ji
Copy link
Contributor Author

@MXueguang @lintool merged the integration tests into this PR, so tests can be ran - as a result it does make this PR quite bloated unfortunately

one thing: I have an anserini index that needs to be pushed to https://git.uwaterloo.ca/jimmylin/anserini-indexes for the kilt test to work, it's currently sitting on tuna, what would be the best way to upload it?

@lintool
Copy link
Member

lintool commented Apr 28, 2021

Hi all, I've started running integration tests on my iMac Pro. Please no more pushes to this branch. Will report back when done.

@lintool
Copy link
Member

lintool commented Apr 29, 2021

Regressions just completed:

======================================================================
ERROR: test_kilt_search (test_kilt.TestSearchIntegration)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jimmylin/workspace/pyserini/integrations/test_kilt.py", line 63, in test_kilt_search
    self.assertAlmostEqual(score, 0.3821, delta=0.0001)
  File "/anaconda3/envs/python36/lib/python3.6/unittest/case.py", line 861, in assertAlmostEqual
    if abs(first - second) <= delta:
TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'

----------------------------------------------------------------------
Ran 104 tests in 47456.780s

FAILED (errors=1)

The failure is due to the fact that the Anserini index hasn't been installed yet. I will merge, then circle back to fix.

@lintool lintool merged commit ecfed61 into castorini:master Apr 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants