Data Loader for Domain Specific Datasets #41
Hi @michelecatani, Not all the loaders need to be the same, and they don't have to come exclusively from ir_datasets. It's easier when the information is available in ir_datasets, but it's fine if it's not. I'd really appreciate it if you could add other loaders too. I've noticed in some papers that backtranslation works especially well in specific domains such as legal and medical. If you could create data loaders for datasets in these domains, that would be fantastic.
Hi all, I have completed the nfCorpus loader, and it outputs the indexed files to whatever folder you pass to it. I have tested it with the beir/nfcorpus/test dataset. This dataset is very small, and I believe it is akin to the toy dataset you mentioned, @hosseinfani. I am aiming to build two more loaders for the next meeting, and I'll look to merge them all before then as well. Before I merge the code, I would like to run it by either @hosseinfani or @yogeswarl, as @DelaramRajaei mentioned that the previous loaders were built by the two of you. I had to make some changes to the code locally to get it to run, and ideally I'd like to revert those changes before I merge so that it matches the other loaders. Are either/both of you available to meet virtually Tuesday or Thursday afternoon, sometime after 1 PM?
Hi @michelecatani, Could you please send a merge request to RePair, or point me to your version of the code/repo so I can take a look? Please also provide me with a detailed guide on how and where I can fetch the dataset and run it for the test. The easiest thing to do is to push your code to your fork of RePair and send me the GitHub link. I will run the test and then ask you to send a merge request to RePair. If you are stuck anywhere, please feel free to reach out to me or Dr. @hosseinfani. Thanks
Sounds good @yogeswarl, I've put through the pull request; you can see it here: #43 You can also pull my fork from here: https://github.com/michelecatani/RePair You don't need to get the dataset from anywhere; it will be loaded from ir_datasets. However, you will need to instantiate an nfCorpus object. This is how I did it in Colab:
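(The original Colab snippet was not preserved in this thread; below is a minimal sketch of what the instantiation could look like, following the constructor call that appears later for main.py. The datapath value is hypothetical.)

```python
# Sketch only: assumes the RePair repo is on the Python path and the nfCorpus
# loader takes the same constructor arguments as the other dal loaders.
import param
from dal.nfCorpus import nfCorpus

domain = 'nfCorpus'
datapath = '../output/nfCorpus'  # hypothetical output folder
ds = nfCorpus(param.settings[domain], datapath, param.settings['ncore'])
```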
Hello @michelecatani, There are a few changes we need to make to successfully integrate the nfCorpus loader. Since I was unable to push changes to your repo, I will paste the code here.

param.py (just after aol):

```python
'nfCorpus': {
    'index': '../data/raw/nfCorpus/lucene-index/',
    'index_item': ['title'],
    'pairing': [None, 'docs', 'query'],
},
```

main.py:

```python
if domain == 'nfCorpus':
    from dal.nfCorpus import nfCorpus
    ds = nfCorpus(param.settings[domain], datapath, param.settings['ncore'])
```

nfcorpus.py:

```python
from ds import Dataset
# change to
from dal.ds import Dataset
```

Let me know if you encounter any error. I would be happy to help.
Hi @yogeswarl I have implemented your suggested changes, thank you for those! The updates are in this PR: #44 You'll notice there are a few commits; some of the output files somehow got deleted, so I had to reintroduce them. However, the final PR only contains changes to main.py, param.py, and the addition of nfCorpus.py. Before the next meeting on Friday I will have two more loaders ready. I will provide them to you ready to test, and upon your confirmation of success I can merge those too!
Did you get a chance to run the nfCorpus loader in the main pipeline?
Hi @yogeswarl no, I have not had a chance to do so yet. I will do so tomorrow and let you know how that goes.
Hi @yogeswarl, I tested it and it worked with the main pipeline. This is the output: (screenshot)
@yogeswarl Actually, there seems to be an error below, but I don't think it applies to my changes. It references a stats.py file and msmarco.passage, which I have not touched... if I have misinterpreted, let me know.
@yogeswarl I have pushed a second loader for a TREC dataset with COVID data. I have tested it with the main pipeline and it works and builds the index; however, I get the same error I posted above. The new loader is in this fork too: https://github.com/michelecatani/RePair
Great job @michelecatani. Let me have a look at the code. Don't worry about that error; it is irrelevant. We just need to comment out a line, and I will give rewriting it to you as a task later on. Unfortunately, I no longer have write access to RePair. Dr. @hosseinfani, could you please merge this pull request? I have tested the code.
Hello @michelecatani, There is one thing I would like you to do before Dr. @hosseinfani can merge this code. As you can see in these screenshots (not reproduced here), both of these loaders share the same functionality as the existing ones, so please factor the duplicated logic into a common loader class. Do let me know if you are stuck somewhere. Here are the test cases: (attachments not preserved)
Hey @yogeswarl sounds good. I have started implementing the common loader; I will let you know if I face any roadblocks. @DelaramRajaei I was unable to find any legal datasets on ir_datasets (with the exception of gov and gov2, but they are behind a paywall and aren't strictly "legal"). I will pivot to building a loader for the CLEF-IP track, which assumes the files are already located on the local filesystem, although I will not have that ready by tomorrow.
@michelecatani
@yogeswarl Hello, I have pushed some changes for the common loader class. Right now I have simply added some abstract methods to be implemented by the subclasses, while build_index is explicitly defined in the parent class. I should be able to convert the pair and create_jsonl functions into something abstract as well; I just haven't gotten around to it yet. I have tested the changes in the main pipeline and all three loaders still work. Let me know what you think of these changes and my plans to flesh them out further! @DelaramRajaei upon looking into the CLEF-IP dataset, there is no text value for the queries. Each entry there links to some sort of tfile, but I can't seem to discern what the queries are supposed to be. I was wondering if you could take a look and let me know if I'm missing something, but as of right now I don't think this dataset will work. Here is the link for the dataset: http://www.ifs.tuwien.ac.at/~clef-ip/download/2013/index.shtml I downloaded the test set and the qrels.
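(For reference, here is a minimal sketch of the common loader structure described above. The method names build_index, pair, and create_jsonl come from this thread, but the class body and signatures are assumptions for illustration, not the actual RePair code.)

```python
# Sketch only: an abstract base loader along the lines discussed above.
# Signatures are assumed; the real dal.ds.Dataset may differ.
from abc import ABC, abstractmethod

class Dataset(ABC):
    def __init__(self, settings, datapath, ncore):
        self.settings, self.datapath, self.ncore = settings, datapath, ncore

    def build_index(self, index_dir):
        # shared indexing logic lives once here in the parent class
        ...

    @abstractmethod
    def create_jsonl(self, output_dir):
        # dataset-specific: serialize docs/queries into jsonl files for indexing
        ...

    @abstractmethod
    def pair(self, query_file, output_file):
        # dataset-specific: pair queries with their relevant documents
        ...
```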
Hey @michelecatani, I've examined the documents, and you're correct. The previous versions only included the 'tfile' tag, which is only beneficial for classification tasks. However, in the 2013 version, they introduced another tag called 'claim.' More detailed information can be found in this link. Here is another explanation of what a passage is:
By downloading the topics from this link, as mentioned in the first link, you will find references to XML files named "claims" within the topics.txt file.
Every XML file contains the necessary passage. In this context, we can utilize either the "description" or "abstract" tag as the query. Additionally, it's important to note that the CLEF-IP dataset contains documents in three languages, and our focus is specifically on the English ones. Please let me know if you have any other questions.
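(To make the extraction concrete, here is a rough sketch of pulling an English query out of one of those claims XML files. The tag and attribute names below are assumptions about the patent XML layout, not verified against the actual CLEF-IP 2013 files.)

```python
# Sketch only: read an "abstract" (or "description") element in English from a
# CLEF-IP claims XML file and use its text as the query. Tag/attribute names are assumed.
import xml.etree.ElementTree as ET

def extract_query(xml_path, tag='abstract', lang='EN'):
    root = ET.parse(xml_path).getroot()
    for elem in root.iter(tag):
        # the language attribute name and values are an assumption
        if elem.attrib.get('lang', 'EN').upper() == lang:
            return ' '.join(elem.itertext()).strip()
    return None
```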
Hey @michelecatani, Did you get a chance to read the document and modify the data loader?
Hi @DelaramRajaei, I've read your explanation. Thank you, that makes sense. I'm planning to work on it tonight, and I'll work on it tomorrow before our meeting too. I'll reach out to you if I have any questions.
Hi all, I have built 2 of the 3 indices and am working on the third. Dr. @hosseinfani mentioned there was a link to submit the indices to; could somebody please provide me with that link? @yogeswarl @DelaramRajaei
@michelecatani, you must be part of the query refinement channel; you can upload the indices to the files section there.
Hello @yogeswarl @DelaramRajaei I have posted the indices and raw data for each of the three datasets to the Teams file share. I have also pushed all my changes to my active pull request. @DelaramRajaei I noticed you made a few pushes to the main branch; did that include Zahra's code? If so, I can merge my changes... I believe Dr. @hosseinfani wanted me to merge last. @DelaramRajaei Regarding the CLEF-IP indices and work, I constructed the indices from the topics file listed at the link you provided me, using the training set. It seemed suspiciously small to me, so I downloaded the documents folder just above the topics, but ran into some trouble with the file extraction. I figured they were 7zip archives given that "7z" is in the file name, but I was unable to extract any files from them using extraction software. By the time we filter out all the non-English documents from the training set, we are left with only about 20 documents for the CLEF-IP index. Let me know what you want my next steps to be.
Hey @michelecatani. Thank you for the update. The modifications on the main branch represent the changes I implemented while merging the ReQue and RePair projects. Please review the code, and if there are any conflicts, kindly resolve them. Dr. Fani accepted the pull request after examining my work. I have provided a detailed explanation of the changes in this issue. Regarding the issue you explained above, it appears that the majority of the documents are in languages other than English, and the pipeline currently supports only English queries. While backtranslation allows us to change the source language to other languages using a universal machine-translation model, I am uncertain about the compatibility with the other refiners. For now, please gather the English documents. Additionally, I have a side task for you. To obtain dense indices, we require the raw corpus of our datasets. Could you investigate how we can acquire the corpus for the following datasets: robust04, gov2, clueweb09b, antique, and dbpedia?
For instance, clueweb09b is available for purchase, and for robust04, you need to contact them via email. Could you please explore how we can obtain the other datasets?
Hi @DelaramRajaei I will merge the code today. I'll ping you on this issue when the pull request has been updated. As for the other datasets you've mentioned, it appears that antique and dbpedia can simply be downloaded from ir_datasets like the rest of them, using a bit of Python code (see the sketch below) and then iterating through the queries/docs/qrels to save them, which is pretty much what our loaders already do. For Gov2, however, there is a process to follow. As per the ir_datasets website, the documents are provided on hard drives; somebody would need to fill out all of the forms indicated here: http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html After that's complete, they should mail out a copy of the hard drives.
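(The Python snippet referenced above was not preserved in this thread; a minimal sketch of the kind of ir_datasets usage being described, with example dataset IDs, would be:)

```python
# Sketch: load a collection via ir_datasets and iterate its queries, docs, and qrels.
import ir_datasets

dataset = ir_datasets.load('antique/train')  # dbpedia-entity subsets work the same way

for query in dataset.queries_iter():
    print(query.query_id, query.text)

for doc in dataset.docs_iter():
    pass  # e.g., write doc.doc_id and doc.text out to the raw corpus files

for qrel in dataset.qrels_iter():
    pass  # qrel.query_id, qrel.doc_id, qrel.relevance
```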
Hi @DelaramRajaei this is my pull request: #48 I have tested my parts of the code in the main pipeline and everything seems to work; the docs get saved and the indices get built. I had to make some small changes to the code I pushed for it to work, because I tested my code from main.py: I had to change a lot of the references (i.e., remove the "src." prefix from "src.{reference}" imports in quite a few files). I did not include these changes in the pull request because I am not sure where you run the code from when you test.
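(For illustration only, using dal.ds as an example affected module, the import change described above is of this form:)

```python
# before (reference prefixed with src.)
from src.dal.ds import Dataset
# after (prefix removed, when running main.py from inside the src folder)
from dal.ds import Dataset
```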
Hey @michelecatani,
Hello @hosseinfani @DelaramRajaei
It appears that the CLEF-IP track is not accessible through ir_datasets. It's possible to still build a loader, but it would need to be markedly different from the other loaders in the repo. Would you like me to explore this option?
Alternatively, I could start building loaders for other datasets that are present in ir_datasets, if we want to stay within that. One option would be to start building a loader for nfCorpus (medical) while I look for a legal dataset within ir_datasets.