Data Loader for Domain Specific Datasets #41

Open
michelecatani opened this issue Oct 29, 2023 · 26 comments
Labels: Dataset (data loaders, datasets), enhancement (new feature or request)

@michelecatani (Contributor) commented Oct 29, 2023

Hello @hosseinfani @DelaramRajaei

It appears that the CLEF-IP track is not accessible through ir_datasets. It's still possible to build a loader, but it would need to be markedly different from the other loaders in the repo. Would you like me to explore this option?

Alternatively, I could start building loaders for other datasets that are present in ir_datasets, if we want to stay within that. One option would be to start with nfCorpus (medical) while I look for a legal dataset within ir_datasets.

DelaramRajaei added the enhancement (new feature or request) label on Oct 30, 2023
@DelaramRajaei (Member)

Hi @michelecatani ,

Not all the loaders need to be the same, and they don't have to come exclusively from ir_datasets. It's easier when the data is available in ir_datasets, but it's fine if it's not.

I'd really appreciate it if you could add other loaders too. I've noticed in some papers on backtranslation that it works especially well in specific domains such as legal and medical. If you could create data loaders for datasets in these domains, that would be fantastic.

@michelecatani (Contributor, Author)

Hi all, I have completed the nfCorpus loader, and it outputs the index to whatever folder you pass to it. I have tested it with the beir/nfcorpus/test dataset. This dataset is very small, and I believe it is akin to the toy dataset you mentioned, @hosseinfani. I am aiming to build two more loaders for the next meeting, and I'll look to merge them all before then as well.

Before I merge the code, I would like to run it by either @hosseinfani or @yogeswarl, as @DelaramRajaei mentioned that the previous loaders were built by the two of you. I had to make some changes to the code locally to get it to run, and ideally I'd like to revert those changes before I merge so that it matches the other loaders. Is either of you available to meet virtually Tuesday or Thursday afternoon, sometime after 1 PM?

@yogeswarl (Member) commented Nov 5, 2023

Hi @michelecatani, could you please send a merge request to RePair, or point me to your version of the code/repo, so I can take a look?

Also, please provide me with a detailed guide on how and where I can fetch the dataset and run it for testing.

The easiest thing to do is to push your code to your fork of RePair and send me the GitHub link. I will run the tests and then ask you to send a merge request to RePair.

If you are stuck anywhere, please feel free to reach out to me or Dr. @hosseinfani.

Thanks

@michelecatani (Contributor, Author) commented Nov 5, 2023

Sounds good @yogeswarl, I've put through the pull request; you can see it here: #43

You can also pull my fork from here: https://github.com/michelecatani/RePair

You don't need to download the dataset from anywhere; it will be loaded from ir_datasets. However, you will need to instantiate an nfCorpus object. This is how I did it in Colab:

[screenshot: instantiating the nfCorpus loader in Colab]
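
(For reference, a minimal sketch of roughly what that cell did; it is not the exact screenshot, and the constructor arguments are assumed to mirror the main.py call quoted later in this thread.)

    # Rough sketch of the Colab cell (assumed, not the exact screenshot): the
    # constructor signature follows the main.py snippet quoted later in this thread.
    from dal.nfCorpus import nfCorpus

    settings = {'index': '../data/raw/nfCorpus/lucene-index/',
                'index_item': ['title'],
                'pairing': [None, 'docs', 'query']}

    # The datapath and ncore values here are just example placeholders; the docs
    # themselves are fetched from ir_datasets (beir/nfcorpus/test), not from disk.
    ds = nfCorpus(settings, '../data/raw/nfCorpus/', 2)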

@yogeswarl (Member)

Hello @michelecatani ,
I have tested out the code.

There are a few changes we need to make to successfully integrate nfCorpus into the main RePair pipeline.

Since I was unable to push changes to your repo, I will paste the code here. Please add it to the respective files.

param.py

# just after aol
'nfCorpus': {
    'index': '../data/raw/nfCorpus/lucene-index/',
    'index_item': ['title'],
    'pairing': [None, 'docs', 'query'],
},

main.py
Line 26:

if domain == 'nfCorpus':
    from dal.nfCorpus import nfCorpus
    ds = nfCorpus(param.settings[domain], datapath, param.settings['ncore'])

nfcorpus.py
Line 8

from ds import Dataset 
# change to 
from dal.ds import Dataset

Let me know if you encounter any errors. I would be happy to help.

@michelecatani (Contributor, Author)

Hi @yogeswarl I have implemented your suggested changes, thank you for those!

The updates are in this PR: #44

You'll notice there are a few commits. Some of the output files somehow got deleted along the way, so I had to reintroduce them. However, the final PR only contains changes to main.py, param.py, and the addition of nfCorpus.py.

Before the next meeting on Friday I will have two more loaders ready. I will provide them to you ready to test and upon your confirmation of success I can merge those too!

@yogeswarl (Member)

Did you get a chance to run the nfCorpus loader in the main pipeline?

@michelecatani (Contributor, Author)

Hi @yogeswarl, no, I have not had a chance to do so yet. I will do so tomorrow and let you know how it goes.

@michelecatani (Contributor, Author)

Hi @yogeswarl, I tested it and it works with the main pipeline. This is the output:

[screenshot: main pipeline output for nfCorpus]

@michelecatani (Contributor, Author) commented Nov 9, 2023

@yogeswarl Actually, there seems to be an error below, but I don't think it applies to my changes. It references a stats.py file and msmarco.passage, neither of which I have touched... if I have misinterpreted, let me know.
[screenshot: stats.py / msmarco.passage error]

@michelecatani (Contributor, Author)

@yogeswarl I have pushed a second loader, for a TREC dataset with COVID data. I have tested it with the main pipeline, and it works and builds the index; however, I get the same error I posted above.

The new loader is in this fork too: https://github.com/michelecatani/RePair

@yogeswarl (Member)

Great job @michelecatani. Let me have a look at the code. Don't worry about that error; it is unrelated to your changes. We just need to comment out a line, and I will give that to you as a task to rewrite later on.

Unfortunately, I no longer have write access to RePair. Dr. @hosseinfani, could you please merge this pull request? I have tested the code.

@yogeswarl (Member)

Hello @michelecatani, there is one thing I would like you to do before Dr. @hosseinfani can merge this code.

As you can see in these screenshots:

trec-covid
[screenshot: trec-covid loader, 2023-11-08 9:48 PM]

nfcorpus
[screenshot: nfcorpus loader, 2023-11-08 9:49 PM]

Both of these loaders have the same functionality as the aol-ia loader, because they all come from ir_datasets. Could you write a common loader to which you pass only the corpus name? That way we can avoid creating essentially the same file for different loaders.

Do let me know if you are stuck somewhere.

Here are the test cases: aol-ia, nfCorpus, trec-covid.
When passed each of these datasets, the loader should create the JSON files and build an index in the respective folder under raw/loader-name.

@michelecatani (Contributor, Author)

Hey @yogeswarl, sounds good. I have started implementing the common loader; I will let you know if I hit any roadblocks.

@DelaramRajaei I was unable to find any legal datasets on ir_datasets (with the exception of gov and gov2, but they are behind a paywall and aren't strictly "legal"). I will pivot to building a loader for the CLEF-IP track that assumes the files are already located on the local filesystem, although I will not have that ready by tomorrow.

@DelaramRajaei (Member)

@michelecatani
Thank you for the update. That's ok, let me know when you have done it.

@michelecatani (Contributor, Author)

@yogeswarl Hello, I have pushed some changes for the common loader class. Right now I have simply added some abstract methods to be implemented by the subclasses, while build_index is explicitly defined in the parent class. I should be able to convert the pair and create_jsonl functions to something abstract as well; I just haven't gotten around to it yet. I have tested the changes in the main pipeline and all three loaders still work. Let me know what you think of these changes and my plans to flesh them out further! (A rough sketch of the structure is below.)
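
For illustration only, a minimal sketch of the shared-loader structure described above; the class and method names are hypothetical, not the actual RePair code, and the Pyserini indexing command is an assumption about how the Lucene index gets built.

    # Hypothetical sketch of a common ir_datasets-based loader (not the RePair code):
    # subclasses only supply the corpus name and the dataset-specific methods, while
    # build_index is shared in the parent and shells out to Pyserini's Lucene indexer.
    import os, subprocess
    from abc import ABC, abstractmethod
    import ir_datasets

    class IrDatasetLoader(ABC):
        def __init__(self, corpus_name, datapath):
            self.dataset = ir_datasets.load(corpus_name)  # e.g. 'beir/nfcorpus/test'
            self.datapath = datapath

        @abstractmethod
        def create_jsonl(self):
            ...  # dataset-specific: dump docs as JSONL files under self.datapath

        @abstractmethod
        def pair(self):
            ...  # dataset-specific: pair queries with their relevant docs

        def build_index(self, index_dir):
            # shared across all ir_datasets-based loaders; flags may differ per Pyserini version
            os.makedirs(index_dir, exist_ok=True)
            subprocess.run(['python', '-m', 'pyserini.index.lucene',
                            '--collection', 'JsonCollection',
                            '--input', self.datapath,
                            '--index', index_dir,
                            '--generator', 'DefaultLuceneDocumentGenerator',
                            '--threads', '4', '--storeRaw'], check=True)

A concrete subclass such as an nfCorpus loader would then just pass its ir_datasets corpus name and implement create_jsonl and pair.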

@DelaramRajaei upon looking into the CLEF-IP dataset, there is no text value for the queries.
[screenshot: CLEF-IP query entries with no text field]

Each entry there links to some sort of tfile, but I can't seem to discern what the queries are supposed to be. I was wondering if you could take a look and let me know if I'm missing something, but as of right now I don't think this dataset will work. Here is the link for the dataset: http://www.ifs.tuwien.ac.at/~clef-ip/download/2013/index.shtml

I downloaded the test set and the qrels.

@DelaramRajaei
Copy link
Member

DelaramRajaei commented Nov 14, 2023

Hey @michelecatani ,

I've examined the documents, and you're correct. The previous versions only included the 'tfile' tag, which is only useful for classification tasks. However, in the 2013 version, they introduced another tag called 'claim.' More detailed information can be found at this link.

Here is another explanation of what a 'passage' is:

What is a 'passage'?
A 'passage' is any child element of the abstract, description, or claims. They could be 'p' elements, but also other elements such as headings. We are aware that headings are not particularly informative, but we could not exclude them a priori if the portions of patent text indicated as relevant in the search reports covered them.
So a passage relevant to a given set of claims is one or more child elements of the abstract, description, or claims tags. When the whole abstract, description, or claims are considered relevant, we chose to list all children of the corresponding XML elements. Participants should do the same.

By downloading the topics from this link, as mentioned in the first link, you will find references to XML files, given as "claims" entries, within the topics.txt file:

<claims>xpaths_to_claims</claims>

Every XML file contains the necessary passage. In this context, we can utilize either the "description" or "abstract" tag as the query. Additionally, it's important to note that the CLEF-IP dataset contains documents in three languages, and our focus is specifically on the English ones.
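
Purely as an illustration of the kind of extraction this implies, here is a hypothetical sketch; the tag names ('abstract', 'description', 'claims') come from the description above, but the 'lang' attribute and overall file layout are assumptions, not verified against the actual CLEF-IP files.

    # Hypothetical sketch: collect English passage text from a CLEF-IP patent XML.
    # Tag names follow the explanation above; the 'lang' attribute is an assumption.
    import xml.etree.ElementTree as ET

    def extract_passages(xml_path):
        root = ET.parse(xml_path).getroot()
        passages = []
        for section in ('abstract', 'description', 'claims'):
            for elem in root.iter(section):
                if elem.get('lang', 'en').lower() != 'en':  # English documents only
                    continue
                for child in elem:  # each child element counts as a 'passage'
                    text = ' '.join(child.itertext()).strip()
                    if text:
                        passages.append(text)
        return passages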

Please let me know if you have any other questions.

@DelaramRajaei (Member)

Hey @michelecatani,

Did you get a chance to read the document and modify the data loader?

@michelecatani (Contributor, Author)

Hi @DelaramRajaei , I've read your explanation. Thank you, that makes sense.

I'm coming around to working on that tonight, and I'll work on it tomorrow before our meeting too. I'll reach out to you if I have any questions.

@michelecatani (Contributor, Author)

Hi all, I have built 2 of the 3 indices and am working on the third. Dr. @hosseinfani mentioned there was a link to submit the indices to; could somebody please provide me with that link? @yogeswarl @DelaramRajaei

@yogeswarl (Member)

@michelecatani, you must be part of the query refinement channel. In the Files section, under Datasets & Indices -> Corpora, you should store all the relevant indices and the respective raw format from which they were built.

@michelecatani (Contributor, Author)

Hello @yogeswarl @DelaramRajaei

I have posted the indices and raw data for each of the three datasets to the Teams file share. I have also pushed all my changes to my active pull request. @DelaramRajaei I noticed you made a few pushes to the main branch; did that include Zahra's code? If so, I can merge my changes... I believe Dr. @hosseinfani wanted me to merge last.

@DelaramRajaei Regarding the CLEF-IP indices and work, I constructed the indices from the topics file listed at the link you provided me. I used the training set. It seemed suspiciously small to me, so I downloaded the documents folder just above the topics, but ran into some trouble with the file extraction. I figured they were 7zip archives, given that "7z" is in the file names, but I was unable to extract any files from them using extraction software.

By the time we filter out all the non-English documents in the training set, we are left with only about 20 documents for the CLEF-IP index. Let me know what you want my next steps to be.

@DelaramRajaei (Member)

Hey @michelecatani.

Thank you for the update. The modifications on the main branch represent the changes I implemented while merging the ReQue and RePair projects. Please review the code, and if there are any conflicts, kindly resolve them. Dr. Fani accepted the pull request after examining my work. I have provided a detailed explanation of the changes in this issue.

Regarding the issue you explained above, it appears that the majority of the documents are in languages other than English. Currently, the pipeline supports only English queries. While backtranslation allows us to change the source language to others using a universal machine translation model, I am uncertain about compatibility with the other refiners. For now, please gather only the English documents.

Additionally, I have a side task for you. To obtain dense indices, we require the raw corpus of our datasets. Could you investigate how we can acquire the corpus for the following datasets:

  1. dbpedia
  2. gov2
  3. antique

For instance, clueweb09b is available for purchase, and for robust04, you need to contact them via email. Could you please explore how we can obtain the other datasets?

@michelecatani (Contributor, Author) commented Nov 23, 2023

Hi @DelaramRajaei I will merge the code today. I'll ping you on this issue when the pull request has been updated.

As for the other datasets you mentioned, it appears that antique and dbpedia can simply be downloaded from ir_datasets like the rest of them. One can use this bit of Python code:

[screenshot: Python snippet loading a dataset with ir_datasets]

Then you can iterate through the queries/docs/qrels to download them, which is pretty much what our loaders already do.
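
For reference, a minimal sketch of that pattern, not the exact snippet from the screenshot; 'antique/test' and 'dbpedia-entity/dev' are the ir_datasets identifiers I believe apply here, so treat them as assumptions.

    # Minimal sketch (not the exact Colab snippet): load a corpus from ir_datasets
    # and walk its docs, queries, and qrels, as our loaders already do.
    import ir_datasets

    for name in ('antique/test', 'dbpedia-entity/dev'):
        ds = ir_datasets.load(name)
        for doc in ds.docs_iter():        # raw corpus: doc.doc_id, doc.text
            pass                          # e.g. dump to JSONL for indexing
        for query in ds.queries_iter():   # topics: query.query_id, query.text
            pass
        for qrel in ds.qrels_iter():      # judgments: qrel.query_id, qrel.doc_id, qrel.relevance
            pass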

For gov2, however, there is a process you need to follow.

As per the ir_datasets website:

[screenshot: ir_datasets access instructions for gov2]

They are provided as hard drives. Somebody would need to fill out all of the forms indicated here: http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html

[screenshot: data access request forms]

After that's complete, they should mail out a copy of the hard drives.

@michelecatani (Contributor, Author) commented Nov 23, 2023

Hi @DelaramRajaei this is my pull request: #48

I have tested my parts of the code in the main pipeline and everything seems to work; the docs get saved and the indices get built.

I had to make some small changes to the pushed code for it to work, because I tested my code from main.py. I had to change a lot of the references (i.e., remove the "src." prefix from "src.{reference}" imports in quite a few files). I did not include these changes in the pull request because I am not sure where you run the code from when you test.

@DelaramRajaei (Member)

Hey @michelecatani,
Thank you for the update and the information on the datasets.
