Support KILT for Pyserini's h/d/search #405

yuxuan-ji · 2021-03-08T01:37:37Z

Changes to the pyserini h/d/search scripts:

Adds an optional flag --topics-format, which defaults to default. The other option is kilt.
Adds an optional flag --tokenizer to dsearch. This is useful when the model does not specify a tokenizer, as with KILT's model (it uses the bert-base-uncased tokenizer).
Adds an optional flag --output-format, which defaults to trec. The other options are msmarco and kilt.
Deprecates the --msmarco flag, since the output format is now specified through --output-format.
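
For readers who want to see how the flags above fit together, here is a minimal argparse sketch. It is not the code from this PR: the function names build_parser and resolve_output_format are hypothetical, and mapping the deprecated --msmarco flag onto the msmarco output format with a warning is an assumption about how the deprecation could be handled.

```python
import argparse
import warnings


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the flags described in this PR."""
    parser = argparse.ArgumentParser(description='Sketch of the search flags')
    parser.add_argument('--topics-format', default='default',
                        choices=['default', 'kilt'],
                        help='Format of the topics.')
    parser.add_argument('--output-format', default='trec',
                        choices=['trec', 'msmarco', 'kilt'],
                        help='Format of the run file.')
    parser.add_argument('--tokenizer', default=None,
                        help='Tokenizer name, for models that do not specify one.')
    parser.add_argument('--msmarco', action='store_true',
                        help='Deprecated: use --output-format msmarco.')
    return parser


def resolve_output_format(args: argparse.Namespace) -> str:
    """Assumed handling: the old flag still works but emits a warning."""
    if args.msmarco:
        warnings.warn('--msmarco is deprecated; use --output-format msmarco',
                      DeprecationWarning)
        return 'msmarco'
    return args.output_format
```

With no arguments, the sketch yields topics_format 'default' and output_format 'trec', matching the defaults listed above.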

Adds the KILT evaluation script to pyserini.eval

Introduces the following abstractions:

There is now a QueryIterator class for each topic format. It handles any pre- or post-processing required when loading and iterating through the dataset. Notably, KILT needs some post-processing on the queries, and its topics currently have to be loaded from a file.
There is now an OutputWriter class for each of the TREC, MSMARCO, and KILT formats.
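
A rough sketch of what these two abstractions could look like follows. It is illustrative only and does not reproduce pyserini/query_iterator.py or pyserini/output_writer.py: the subclass names, the get_query_iterator factory, the constructor signatures, and the whitespace-stripping used as the KILT post-processing step are all assumptions. Only the TREC and MS MARCO run-file layouts are shown; the KILT writer is omitted.

```python
from abc import ABC, abstractmethod
from enum import Enum, unique
from typing import Dict, Iterator, List, TextIO, Tuple


@unique
class TopicsFormat(Enum):
    DEFAULT = 'default'
    KILT = 'kilt'


class QueryIterator(ABC):
    """Iterates over (topic_id, query) pairs for one topic format."""

    def __init__(self, topics: Dict[str, str]):
        self.topics = topics

    @abstractmethod
    def get_query(self, topic_id: str) -> str:
        """Return the query text after any format-specific processing."""

    def __iter__(self) -> Iterator[Tuple[str, str]]:
        for topic_id in self.topics:
            yield topic_id, self.get_query(topic_id)


class DefaultQueryIterator(QueryIterator):
    def get_query(self, topic_id: str) -> str:
        return self.topics[topic_id]


class KiltQueryIterator(QueryIterator):
    def get_query(self, topic_id: str) -> str:
        # Placeholder post-processing (assumption): strip whitespace.
        return self.topics[topic_id].strip()


def get_query_iterator(topics: Dict[str, str], fmt: TopicsFormat) -> QueryIterator:
    """Hypothetical factory that picks the iterator for a topic format."""
    mapping = {TopicsFormat.DEFAULT: DefaultQueryIterator,
               TopicsFormat.KILT: KiltQueryIterator}
    return mapping[fmt](topics)


class OutputWriter(ABC):
    """Writes ranked hits for one topic to a text stream."""

    def __init__(self, stream: TextIO, tag: str = 'sketch'):
        self.stream = stream
        self.tag = tag

    @abstractmethod
    def write(self, topic_id: str, hits: List[Tuple[str, float]]) -> None:
        """Write hits, given as (docid, score) pairs in rank order."""


class TrecWriter(OutputWriter):
    def write(self, topic_id: str, hits: List[Tuple[str, float]]) -> None:
        # TREC run line: qid Q0 docid rank score tag
        for rank, (docid, score) in enumerate(hits, start=1):
            self.stream.write(f'{topic_id} Q0 {docid} {rank} {score:.6f} {self.tag}\n')


class MsMarcoWriter(OutputWriter):
    def write(self, topic_id: str, hits: List[Tuple[str, float]]) -> None:
        # MS MARCO run line: qid <tab> docid <tab> rank
        for rank, (docid, _score) in enumerate(hits, start=1):
            self.stream.write(f'{topic_id}\t{docid}\t{rank}\n')
```

The point of the split is that the search loop only sees (topic_id, query) pairs and a write() call, so adding a new topic or output format means adding one subclass rather than another flag-specific branch.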

Added integration tests are in:
yuxuan-ji#1

Successfully ran integration tests for (can run more if needed, they take quite a while though):

DPR curated, nq (hsearch + dsearch)
TCT-Colbert, msmarco doc (hsearch)

pyserini/query_iterator.py

pyserini/output_writer.py

lintool · 2021-04-01T18:41:29Z

hey @ronakice can you coordinate with @yuxuan-ji to see where this is going? should it be part of the next release? https://github.com/castorini/pyserini/projects/1

lintool · 2021-04-01T18:44:05Z

We'll probably need integration tests, like https://github.com/castorini/pyserini/tree/master/integrations

To make sure things don't break moving forward...

yuxuan-ji · 2021-04-27T02:36:57Z

pyserini/query_iterator.py

+@unique
+class TopicsFormat(Enum):
+    DEFAULT = 'default'
+    KILT = 'kilt'
+
+
+class QueryIterator(ABC):


diff too big, but here's the queryiterator class

yuxuan-ji · 2021-04-27T02:38:05Z

scripts/kilt/anserini_retriever.py

@@ -63,7 +63,7 @@ def _get_predictions_thread(arguments):
        doc_scores = []

        if use_bigrams:
-            tokens = filter(lambda word: word not in STOPWORDS, word_tokenize(query))
+            tokens = filter(lambda word: word.lower() not in STOPWORDS, word_tokenize(query))


fix to previous script
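
To illustrate why the .lower() call matters, here is a small self-contained sketch. It assumes the stopword list contains only lowercase entries; the three-word STOPWORDS set and the use of str.split in place of word_tokenize are stand-ins chosen to keep the example dependency-free, not what the script uses.

```python
# Hypothetical stand-in for the script's stopword list (assumed lowercase).
STOPWORDS = {'the', 'of', 'is'}


def filter_case_sensitive(tokens):
    """Old behaviour: capitalized stopwords such as 'The' slip through."""
    return [t for t in tokens if t not in STOPWORDS]


def filter_case_insensitive(tokens):
    """Fixed behaviour: compare the lowercased token against the list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


tokens = 'The capital of France is Paris'.split()
print(filter_case_sensitive(tokens))    # ['The', 'capital', 'France', 'Paris']
print(filter_case_insensitive(tokens))  # ['capital', 'France', 'Paris']
```

Note that the surviving tokens keep their original casing; only the membership test is case-insensitive.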

yuxuan-ji · 2021-04-27T02:38:33Z

scripts/kilt/convert_kilt_dpr_to_pyserini_format.py

+# to convert KILT's dpr_multi_set_f_bert.0 model into a PyTorch checkpoint
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Convert KILT-dpr corpus into the index & docid file read by pyserini')


script to convert the KILT's pickled encoded wiki into a faiss index

yuxuan-ji · 2021-04-27T02:40:12Z

scripts/kilt/convert_kilt_to_document_jsonl.py

@@ -26,7 +26,7 @@
            doc["id"] = raw["_id"]
            doc["contents"] = "".join(raw["text"])
            if args.bigrams:
-                tokens = filter(lambda word: word not in STOPWORDS, word_tokenize(doc["contents"]))
+                tokens = filter(lambda word: word.lower() not in STOPWORDS, word_tokenize(doc["contents"]))


fix to previous script

yuxuan-ji · 2021-04-27T02:40:16Z

scripts/kilt/convert_kilt_to_passage_jsonl.py

@@ -42,7 +42,7 @@
                doc["id"] = f"{raw['_id']}-{i}"
                p = texts[i]
                if args.bigrams:
-                    tokens = filter(lambda word: word not in STOPWORDS, word_tokenize(p))
+                    tokens = filter(lambda word: word.lower() not in STOPWORDS, word_tokenize(p))


fix to previous script

yuxuan-ji · 2021-04-27T02:40:29Z

scripts/kilt/encode_kilt_topics.py

+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Compute embeddings for KILT topics')


script to encode KILT topics w/ dpr

TambourineMan42

Okay, most of this looks good, didn't see anything glaringly wrong, @MXueguang do you have any suggestions?
Yikes, commented from the wrong account hahaha!

ronakice

Okay switched back to my other system! An LGTM from the right account! 😄

MXueguang · 2021-04-27T20:42:34Z

lgtm. but we'd better run the entire integration tests on tuna/orca before we merge it

lintool · 2021-04-27T21:17:07Z

Agreed. @yuxuan-ji can you fix conflicts? I'll run tests, and then I can merge.

MXueguang · 2021-04-27T22:35:08Z

just fyi @lintool, the integration tests for this PR are here: yuxuan-ji#1
@yuxuan-ji since we've reviewed the above changes, maybe merge the update for tests into this branch too?
(then we can run test on the HEAD of this branch)

lintool · 2021-04-27T22:36:03Z

Sure!

lintool · 2021-04-27T22:36:55Z

I'm verifying #513 right now. Will queue this up next.

yuxuan-ji · 2021-04-27T22:58:10Z

@MXueguang @lintool merged the integration tests into this PR, so tests can be run. As a result, it does make this PR quite bloated unfortunately

one thing: I have an anserini index that needs to be pushed to https://git.uwaterloo.ca/jimmylin/anserini-indexes for the kilt test to work, it's currently sitting on tuna, what would be the best way to upload it?

lintool · 2021-04-28T13:13:45Z

Hi all, I've started running integration tests on my iMac Pro. Please no more pushes to this branch. Will report back when done.

lintool · 2021-04-29T02:26:23Z

Regressions just completed:

======================================================================
ERROR: test_kilt_search (test_kilt.TestSearchIntegration)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jimmylin/workspace/pyserini/integrations/test_kilt.py", line 63, in test_kilt_search
    self.assertAlmostEqual(score, 0.3821, delta=0.0001)
  File "/anaconda3/envs/python36/lib/python3.6/unittest/case.py", line 861, in assertAlmostEqual
    if abs(first - second) <= delta:
TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'

----------------------------------------------------------------------
Ran 104 tests in 47456.780s

FAILED (errors=1)

The failure is due to the fact that the Anserini index hasn't been installed yet. I will merge, then circle back to fix.
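
The TypeError in the traceback comes from assertAlmostEqual subtracting a float from a score of None. As a sketch of how a missing score could surface as a clearer failure, consider the helper below. The function names, the 'metric value' output layout, and the error message are assumptions; how test_kilt.py actually obtains its score is not shown in this thread.

```python
import re
from typing import Optional


def parse_score(stdout: str, metric: str) -> Optional[float]:
    """Return the value printed after `metric`, or None if it is absent.

    Assumes output of the form '<metric> <number>' (hypothetical layout).
    """
    match = re.search(rf'{re.escape(metric)}\s+([0-9.]+)', stdout)
    return float(match.group(1)) if match else None


def check_score(score: Optional[float], expected: float, delta: float = 0.0001) -> None:
    """Fail with a readable message instead of a TypeError when score is None."""
    if score is None:
        raise AssertionError('no score was parsed; is the index installed?')
    if abs(score - expected) > delta:
        raise AssertionError(f'{score} differs from {expected} by more than {delta}')


# A parsed score passes the check; a missing one fails with a clear message.
check_score(parse_score('score 0.3821', 'score'), 0.3821)
```

In a unittest-based test, calling assertIsNotNone(score) before assertAlmostEqual would have a similar effect.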

yuxuan-ji added 2 commits March 7, 2021 20:28

Support KILT for pyserini h/d/search

0c981e8

Lazy import

fe47a9b

yuxuan-ji marked this pull request as draft March 8, 2021 01:37

ronakice self-requested a review March 8, 2021 02:44

yuxuan-ji force-pushed the support-kilt branch from 77f0442 to 63fd94a Compare March 8, 2021 05:05

Support KILT output

829bc9b

yuxuan-ji force-pushed the support-kilt branch from 63fd94a to 829bc9b Compare March 8, 2021 05:14

yuxuan-ji commented Mar 8, 2021

pyserini/query_iterator.py (outdated)

yuxuan-ji commented Mar 8, 2021

pyserini/output_writer.py (outdated)

yuxuan-ji added 2 commits March 14, 2021 16:21

Delete write_result

bdd9973

Create dir

5414bca

yuxuan-ji force-pushed the support-kilt branch from 135e95d to bdcc956 Compare March 21, 2021 03:52

Convert kilt dpr corpus

dcb528c

yuxuan-ji force-pushed the support-kilt branch from bdcc956 to dcb528c Compare March 21, 2021 03:54

yuxuan-ji added 5 commits March 21, 2021 13:16

Rename param for clarity and only passage delim when max_passage is on

e9cafdb

Only need one kilt format for now

32fc284

Support explicit tokenizer

5fb2ddb

Update script with index writer

ef0f3df

Add some comments

35e7f86

yuxuan-ji force-pushed the support-kilt branch from 3824dc9 to 35e7f86 Compare March 21, 2021 23:34

yuxuan-ji added 3 commits March 21, 2021 19:50

Raise exception if file not fully iterate through

3ceb6c0

Add script to precompute embeddings

f1ae599

Load precompute embeddings from file instead of hardcoded dir filename

507c948

Merge & fix conflicts

9a6ce9f

yuxuan-ji marked this pull request as ready for review April 22, 2021 01:29

yuxuan-ji added 2 commits April 21, 2021 21:48

Add and fix inegration tests for KILT support

11d8228

Add test and prebuilt index

29f0dc6

yuxuan-ji added 4 commits April 26, 2021 19:40

Merge branch 'support-kilt' into kilt-integration-tests

5c8b3e7

Add triviaQA

de2e221

Merge branch 'support-kilt' into kilt-integration-tests

fe44775

Typo

be574e8

yuxuan-ji commented Apr 27, 2021

TambourineMan42 approved these changes Apr 27, 2021

ronakice approved these changes Apr 27, 2021

Merge branch 'master' into support-kilt

1bfbe3c

yuxuan-ji added 4 commits April 27, 2021 18:41

Use tsvinttopicreader

c62ed7c

Merge branch 'support-kilt' of github.com:yuxuan-ji/pyserini into support-kilt

da472a5

Merge pull request #1 from yuxuan-ji/kilt-integration-tests

8011295

Kilt integration tests

Fix unittest

83a11d2

yuxuan-ji added 4 commits April 27, 2021 22:14

Script fix

ec2baf3

Script fix

5240b14

Add script to convert 100w tsv into jsonl

bebbc8a

Merge branch 'support-kilt' of https://github.com/yuxuan-ji/pyserini into support-kilt

51457f1

MXueguang mentioned this pull request Apr 28, 2021

Inconsistency in loading topics #516

Closed

lintool merged commit ecfed61 into castorini:master Apr 29, 2021
