Kaggle dataset #49

Closed
DiTo97 opened this issue May 30, 2024 · 6 comments · Fixed by #50

Comments

@DiTo97

DiTo97 commented May 30, 2024

I extracted the manually curated search results as a Kaggle dataset citing @heinrichreimer as the author; is that okay?

@janheinrichmerker
Contributor

Of course, that's fine as you are giving attribution to the GitHub repository.
It would be very kind of you to also cite my co-authors: Sebastian Schmidt, Maik Fröbe, Lukas Gienapp, Harrisen Scells, Benno Stein, Matthias Hagen, and Martin Potthast.
And the paper DOI would be: https://doi.org/10.1145/3539618.3591890
Thanks for putting the data on Kaggle!
(By the way, clever idea to extract the examples of the unit tests 😄 Did you know that we also have some more example data in the repo? https://github.com/webis-de/archive-query-log/tree/main/data/examples)

@DiTo97
Author

DiTo97 commented May 30, 2024

@heinrichreimer, added the other authors!

AFAIK, I cannot attach the existing DOI to the dataset (Kaggle only offers to generate a new DOI for the dataset itself), so I just added it as a link to the paper in the dataset description.

Didn't know you had more examples in the repository! At first glance, they seem to use a different data format/schema from the manual annotations; I might add those to the dataset as well if standardizing the schema is straightforward.

@janheinrichmerker
Contributor

Alright, thanks for the clarification about the DOI 👍
The conversion should be straightforward as our examples also use a very similar JSON format.
I think it would also be cool to add the link to the Kaggle dataset to the GitHub readme!

@DiTo97
Author

DiTo97 commented May 30, 2024

sounds great, feel free to add the link to the dataset in the README!

I will expand the dataset using the additional examples; likely keeping the manual annotations format.

FYI, I put the dataset together because at scrapegraph AI we are developing the deep search graph, and I was looking for annotated SERPs to evaluate the link re-ranker node, which is arguably the most critical node.

I felt like I could share it as a dataset as I found it quite useful; happy you liked the idea!

@DiTo97
Author

DiTo97 commented May 30, 2024

@heinrichreimer exploring the additional examples, it seems like they are not in line with the dataset spirit.

IIUC, serps.jsonl contains search queries with "url query" and "serp query" in place of "query" and "interpreted query", with lots of additional information, and a separation between "url" and "wayback url" which I have yet to understand.

if that could be fine, results.jsonl instead contains a single search result per line, with its rank, and not the full set of search results per query, with the corresponding ranks, which is critical to train a semantic model for re-ranking.

@janheinrichmerker
Contributor

> IIUC, serps.jsonl contains search queries with "url query" and "serp query" in place of "query" and "interpreted query", with lots of additional information, and a separation between "url" and "wayback url" which I have yet to understand.

Yes, correct: url is the "original" URL of the website, and wayback_raw_url, for example, is the URL of the same page as archived on the Internet Archive. So you could parse the SERP's HTML from there if you want to extract more data.

If you care only for SERPs that have results, it's relatively easy to filter:

cat serps.jsonl | grep "\"results\": \[{" > /path/to/filtered.jsonl
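
The same filter can also be sketched in Python, which parses each line instead of matching on the exact spacing of the serialized JSON. This is only a minimal sketch: it assumes each line of serps.jsonl is one JSON object whose "results" field is either a list of result objects or missing/null, as the grep pattern above suggests. The function names and file paths are hypothetical.

```python
import json
from typing import Iterable, Iterator


def serps_with_results(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed SERP records whose "results" field is a non-empty list.

    Assumption: one JSON object per line, with an optional "results" key.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        results = record.get("results")
        if isinstance(results, list) and results:
            yield record


def filter_file(src: str, dst: str) -> int:
    """Copy SERPs that have results from src to dst; return how many were kept."""
    kept = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for record in serps_with_results(fin):
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    return kept
```

For example, `filter_file("serps.jsonl", "/path/to/filtered.jsonl")` would write only the SERPs with at least one result.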

> if that could be fine, results.jsonl instead contains a single search result per line, with its rank, and not the full set of search results per query, with the corresponding ranks, which is critical to train a semantic model for re-ranking.

Yes, results.jsonl is just a "flipped" version of serps.jsonl, where we have one JSON line per search result instead of one JSON line per SERP.
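
If the per-result file is the starting point, the full ranked list per SERP can be rebuilt by grouping lines on a SERP identifier. The following is a rough sketch under stated assumptions, not the repository's own tooling: the key names "serp_id" and "rank" are hypothetical placeholders and should be replaced with whatever fields the actual results.jsonl uses.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def group_results_by_serp(
    lines: Iterable[str],
    serp_key: str = "serp_id",  # hypothetical field naming the parent SERP
    rank_key: str = "rank",     # hypothetical field holding the result rank
) -> Dict[str, List[dict]]:
    """Group one-result-per-line records into rank-sorted lists per SERP."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        result = json.loads(line)
        grouped[result[serp_key]].append(result)
    # Restore the ranking order within each SERP.
    for results in grouped.values():
        results.sort(key=lambda r: r[rank_key])
    return dict(grouped)
```

Calling `group_results_by_serp(open("results.jsonl", encoding="utf-8"))` would then give one ranked list of results per SERP, which is the shape needed for training a re-ranker.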

janheinrichmerker added a commit that referenced this issue Jun 3, 2024
Thanks @DiTo97!
Fixes #49


Co-authored-by: Federico Minutoli <fede97.minutoli@gmail.com>

Signed-off-by: Jan Heinrich Reimer <heinrich.reimer@uni-jena.de>