datasets hackathon

Albert Villanova del Moral edited this page Dec 3, 2021 · 23 revisions

BigScience🌸 Datasets Hackathon

Thank you for participating in the BigScience🌸 Datasets hackathon!

Video demo

Watch the video demo:

Watch the video

Setup

Install:

How-to Guide

How to add a Collection

By default, collections are added as private community raw datasets in the 🤗 Hub, under the bigscience-catalogue-data namespace.

  1. Choose an unassigned open issue from Collections.

    The issues are sorted by priority, based on license, size, and other criteria.

    On each issue page, you can find detailed information about the collection, such as its identifier (UID) and location.

  2. Assign yourself to that issue.

    On the issue page, post a comment containing only the keyword:

    #self-assign
    
  3. Check if the dataset already exists in the 🤗 Hub:

    • Search for it in the 🤗 Hub: https://huggingface.co/datasets
    • If it already exists, say so in a comment on the issue page, include the link to the 🤗 Hub dataset, and choose another unassigned open issue.
  4. Create a 🤗 Dataset repository: https://huggingface.co/new-dataset

    • Set Owner: bigscience-catalogue-data
    • Set Dataset name: the collection identifier (UID)
    • Select Private
    • Create dataset
  5. Clone the 🤗 Dataset repository:

    Replace <collection UID> with the collection identifier.

    git clone https://huggingface.co/datasets/bigscience-catalogue-data/<collection UID>
  6. Initialize Git LFS in the <collection UID> directory:

    cd <collection UID>
    git lfs install
  7. Download the collection to the <collection UID> directory.

    Expected formats are:

    • TXT
    • JSON/JSONL
    • CSV
    • HTML/XML
    • WARC

    If you find another format:

    • Search for (or create) an issue labeled "data format" to decide whether and how to convert that format. The conversion will be handled as an optional later step.

    • In the "data catalog" collection issue, post a comment referencing the "data format" issue number:

      Replace <"data format" issue number> with the corresponding "data format" issue number:

      This dataset format needs to be converted:
      - #<"data format" issue number>
      

    Other formats you may find:

    • PDF
  8. If you need help:

    • On the issue page, post a comment explaining your problem.

    • Post another comment below it, containing only the keyword:

      #help
      

    If you would like to help others, see the list of open issues requiring help.

  9. Compress the files with gzip or zip.

    If you compress each file separately, please use gzip: the resulting filename keeps the original file extension (for example, data.jsonl becomes data.jsonl.gz), which helps with file format inference.

  10. Commit the files and push:

    git add .
    git commit -m "Add dataset"
    git push
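
For step 9, compressing each file separately with gzip can be sketched as below. This is a minimal illustration, not part of the official workflow: the helper name `compress_each`, the demo file names, and the temporary directory are all hypothetical stand-ins for the cloned <collection UID> directory and its downloaded files. The helper skips Git metadata and files that are already compressed.

```shell
# Hypothetical helper (not part of the hackathon tooling): gzip every regular
# file under a directory, one by one, skipping Git metadata and files that are
# already compressed.
compress_each() {
    dir="$1"
    find "$dir" -type f \
        ! -path '*/.git/*' \
        ! -name '.gitattributes' \
        ! -name '*.gz' ! -name '*.zip' \
        -exec gzip {} +
}

# Demo on a throwaway directory, standing in for the cloned <collection UID>
# directory. File names and contents are made up for illustration.
demo_dir="$(mktemp -d)"
printf '{"text": "hello"}\n' > "$demo_dir/part-0001.jsonl"
printf 'id,text\n1,hello\n' > "$demo_dir/table.csv"

compress_each "$demo_dir"

# Each file is replaced by a .gz file that keeps its original extension,
# e.g. part-0001.jsonl becomes part-0001.jsonl.gz.
ls "$demo_dir"
```

In a real run, you would call the helper on the cloned repository directory before the `git add .` in step 10. Because gzip appends `.gz` to the existing name, the original format remains visible in the filename.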