Kaggle dataset #49
Comments
Of course, that's fine as you are giving attribution to the GitHub repository. |
@heinrichreimer, added the other authors! AFAIK, I cannot set the existing DOI on the dataset (Kaggle only prompts to generate a new DOI for the Kaggle dataset), so I just added the DOI as a link to the paper in the dataset description. I didn't know you had more examples in the repository! At first glance, they seem to use a different data format/schema from the manual annotations; I might add those to the dataset as well if standardizing the schema is straightforward. |
Alright, thanks for the clarification for the DOI 👍 |
Sounds great, feel free to add the link to the dataset in the README! I will expand the dataset using the additional examples, likely keeping the manual annotations format. FYI, I put the dataset together because at scrapegraph AI we are developing the deep search graph, and I was looking for annotated SERPs to evaluate the link re-ranker node, which is arguably the most critical node. I found it quite useful, so I felt I could share it as a dataset; happy you liked the idea! |
@heinrichreimer, exploring the additional examples, it seems they are not in line with the spirit of the dataset. IIUC, serps.jsonl contains search queries with "url query" and "serp query" in place of "query" and "interpreted query", along with lots of additional information and a separation between "url" and "wayback url" that I have yet to understand. While that could be fine, results.jsonl instead contains a single search result per line with its rank, rather than the full set of search results per query with the corresponding ranks, which is critical for training a semantic re-ranking model. |
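For illustration only, the regrouping described above (one result per line turned into one ranked list per query) could be sketched as follows. The grouping key `query_id` is a hypothetical field name, since the thread does not say which field links a result to its query; `rank` is assumed to be a sortable value stored under that exact key.

```python
import json
from collections import defaultdict
from typing import Iterable


def group_results(lines: Iterable[str], key: str = "query_id") -> dict:
    """Group one-result-per-line JSONL records into rank-ordered lists.

    `key` is a HYPOTHETICAL field name for whatever links a result to its
    query; "rank" is assumed to be present on every record.
    """
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        grouped[record[key]].append(record)
    # Sort each query's results by rank so the list mirrors the SERP order.
    return {k: sorted(v, key=lambda r: r["rank"]) for k, v in grouped.items()}
```

Calling `group_results(open("results.jsonl", encoding="utf-8"))` would then return a dictionary mapping each query key to its ranked results, provided the real file exposes fields matching these assumptions.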
Yes correct, the If you care only for SERPs that have results, it's relatively easy to filter: cat serps.jsonl | grep "\"results\": \[{" > /path/to/filtered.jsonl
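A rough Python equivalent of the grep filter above, as a sketch: it parses each line as JSON and keeps only records whose "results" field is a non-empty list. The function names `serps_with_results` and `filter_file` are made up for this example. Unlike the grep pattern, it only inspects a top-level "results" key and does not depend on the exact whitespace of the serialized JSON.

```python
import json
from typing import Iterable, Iterator


def serps_with_results(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed SERP records whose "results" field is a non-empty list."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        results = record.get("results")
        if isinstance(results, list) and results:
            yield record


def filter_file(src_path: str, dst_path: str) -> int:
    """Write the filtered records to dst_path; return how many were kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for record in serps_with_results(src):
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    return kept
```

For example, `filter_file("serps.jsonl", "/path/to/filtered.jsonl")` would write the same kind of filtered file as the shell one-liner.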
Yes, the |
I extracted the manually curated search results as a Kaggle dataset citing @heinrichreimer as the author; is that okay?