Support KILT for Pyserini's h/d/search #405
Conversation
hey @ronakice can you coordinate with @yuxuan-ji to see where this is going? should it be part of the next release? https://github.com/castorini/pyserini/projects/1
We'll probably need integration tests, like https://github.com/castorini/pyserini/tree/master/integrations to make sure things don't break moving forward...
from abc import ABC
from enum import Enum, unique

@unique
class TopicsFormat(Enum):
    DEFAULT = 'default'
    KILT = 'kilt'


class QueryIterator(ABC):
diff too big, but here's the QueryIterator class
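To make the abstraction concrete, here is a hedged sketch of what a `QueryIterator` hierarchy along these lines might look like; the class and method names (and the KILT marker stripping) are illustrative, not the exact PR code.

```python
from abc import ABC, abstractmethod

class QueryIterator(ABC):
    def __init__(self, topics):
        self.topics = topics  # dict: topic id -> topic fields

    @abstractmethod
    def get_query(self, id_):
        """Return the (possibly post-processed) query text for one topic."""

    def __iter__(self):
        # yield (topic id, query) pairs in a deterministic order
        for id_ in sorted(self.topics):
            yield id_, self.get_query(id_)

class DefaultQueryIterator(QueryIterator):
    def get_query(self, id_):
        return self.topics[id_]['title']

class KiltQueryIterator(QueryIterator):
    # KILT queries need some post-processing; the marker stripping below
    # is a stand-in for whatever the real iterator does.
    def get_query(self, id_):
        query = self.topics[id_]['input']
        return query.replace('[START_ENT]', '').replace('[END_ENT]', '').strip()

topics = {'q1': {'title': 'who founded Rome'}}
print(list(DefaultQueryIterator(topics)))  # [('q1', 'who founded Rome')]
```

Each topic format then gets its own subclass, and the search scripts only ever see `(id, query)` pairs.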
@@ -63,7 +63,7 @@ def _get_predictions_thread(arguments):
     doc_scores = []
     if use_bigrams:
-        tokens = filter(lambda word: word not in STOPWORDS, word_tokenize(query))
+        tokens = filter(lambda word: word.lower() not in STOPWORDS, word_tokenize(query))
fix to previous script
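A minimal, dependency-free illustration of why the `.lower()` fix matters: with a case-sensitive check, capitalized stopwords like "The" slip through the filter. The toy stopword set and whitespace tokenizer below stand in for the NLTK `STOPWORDS` and `word_tokenize` used in the actual scripts.

```python
STOPWORDS = {'the', 'of', 'a'}  # toy set; the scripts use a fuller list

def tokenize(text):
    return text.split()  # stand-in for nltk's word_tokenize

query = "The history of Rome"

# case-sensitive check (the bug): 'The' is kept
before = [w for w in tokenize(query) if w not in STOPWORDS]
# lowercased check (the fix): 'The' is dropped
after = [w for w in tokenize(query) if w.lower() not in STOPWORDS]

print(before)  # ['The', 'history', 'Rome']
print(after)   # ['history', 'Rome']
```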
# to convert KILT's dpr_multi_set_f_bert.0 model into a PyTorch checkpoint

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert KILT-dpr corpus into the index & docid file read by pyserini')
script to convert the KILT's pickled encoded wiki into a faiss index
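For intuition, here is what the resulting index does at query time: score every document embedding against the query by inner product and rank best-first. This is a numpy stand-in for a flat faiss inner-product index, with made-up two-dimensional vectors; the real script builds the index from KILT's pickled wiki embeddings.

```python
import numpy as np

# toy document embeddings and ids (the real ones come from DPR)
doc_embs = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.7, 0.7]])
docids = ['doc_a', 'doc_b', 'doc_c']

query = np.array([1.0, 0.2])
scores = doc_embs @ query     # inner-product score per document
ranked = np.argsort(-scores)  # indices sorted best-first
print(docids[ranked[0]])      # doc_a
```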
@@ -26,7 +26,7 @@
     doc["id"] = raw["_id"]
     doc["contents"] = "".join(raw["text"])
     if args.bigrams:
-        tokens = filter(lambda word: word not in STOPWORDS, word_tokenize(doc["contents"]))
+        tokens = filter(lambda word: word.lower() not in STOPWORDS, word_tokenize(doc["contents"]))
fix to previous script
@@ -42,7 +42,7 @@
     doc["id"] = f"{raw['_id']}-{i}"
     p = texts[i]
     if args.bigrams:
-        tokens = filter(lambda word: word not in STOPWORDS, word_tokenize(p))
+        tokens = filter(lambda word: word.lower() not in STOPWORDS, word_tokenize(p))
fix to previous script
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Compute embeddings for KILT topics')
script to encode KILT topics w/ dpr
Okay, most of this looks good, didn't see anything glaringly wrong, @MXueguang do you have any suggestions?
Yikes, commented from the wrong account hahaha!
Okay switched back to my other system! An LGTM from the right account! 😄
lgtm. but we'd better run the entire integration tests on tuna/orca before we merge it
Agreed. @yuxuan-ji can you fix conflicts? I'll run tests, and then I can merge.
just fyi @lintool, the integration tests for this PR are here: yuxuan-ji#1
Sure!
I'm currently verifying #513 right now. Will queue this up next.
@MXueguang @lintool I merged the integration tests into this PR so the tests can be run; as a result it does make this PR quite bloated, unfortunately. One thing: I have an Anserini index that needs to be pushed to https://git.uwaterloo.ca/jimmylin/anserini-indexes for the KILT test to work. It's currently sitting on tuna; what would be the best way to upload it?
Hi all, I've started running integration tests on my iMac Pro. Please no more pushes to this branch. Will report back when done. |
Regressions just completed:
The failure is due to the fact that the Anserini index hasn't been installed yet. I will merge, then circle back to fix. |
Changes to the pyserini h/d/search scripts:
- Adds `--topics-format`, which defaults to `default`; the other option is `kilt`.
- Adds `--tokenizer` to dsearch. This is useful when the model used does not have a tokenizer specified, such as KILT's (it uses the bert-base-uncased tokenizer).
- Adds `--output-format`, which defaults to `trec`; the other options are `msmarco` and `kilt`.
- Removes the `--msmarco` flag, as it is now specified through the above.

Adds the KILT evaluation script to `pyserini.eval`.

Introduces the following abstractions:
- A `QueryIterator` class for each topic format. It handles any pre/post-processing required when loading and iterating through the dataset. Notably, KILT does some post-processing on the queries, and currently needs to be loaded from a file.
- An `OutputWriter` class for the TREC, MSMARCO, and KILT formats.

Added integration tests are in yuxuan-ji#1.

Successfully ran integration tests for (can run more if needed, they take quite a while though):
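To illustrate the `OutputWriter` abstraction described above, here is a hedged sketch of writers for the three output formats. The class names, the `write` signature, and the exact KILT JSON shape are illustrative assumptions, not the PR's actual code; only the general per-format layouts (TREC run lines, MSMARCO tab-separated lines, KILT JSON with provenance) are standard.

```python
import json

class OutputWriter:
    def write(self, qid, hits):
        """hits: list of (docid, score) pairs, best-first."""
        raise NotImplementedError

class TrecWriter(OutputWriter):
    def write(self, qid, hits):
        # standard TREC run format: qid Q0 docid rank score tag
        return [f'{qid} Q0 {docid} {rank} {score:.6f} pyserini'
                for rank, (docid, score) in enumerate(hits, start=1)]

class MsMarcoWriter(OutputWriter):
    def write(self, qid, hits):
        # MSMARCO format: qid<TAB>docid<TAB>rank
        return [f'{qid}\t{docid}\t{rank}'
                for rank, (docid, _) in enumerate(hits, start=1)]

class KiltWriter(OutputWriter):
    def write(self, qid, hits):
        # KILT expects JSON records with provenance entries (shape assumed here)
        provenance = [{'wikipedia_id': docid} for docid, _ in hits]
        return [json.dumps({'id': qid, 'output': [{'provenance': provenance}]})]

hits = [('doc1', 12.5), ('doc2', 11.0)]
print(TrecWriter().write('q1', hits)[0])  # q1 Q0 doc1 1 12.500000 pyserini
```

With this in place, `--output-format` just selects which writer instance the search script hands its hits to.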