datasets hackathon

Albert Villanova del Moral edited this page Nov 24, 2021 · 23 revisions

BigScience🌸 Datasets Hackathon

Thank you for participating in the BigScience🌸 Datasets hackathon!

Setup

Install:

How-to Guide

How to add a Collection

By default, collections are added as private community raw datasets on the 🤗 Hub, under the bigscience namespace.

  1. Take an unassigned open issue from the Collections.

    The issues are sorted by priority, based on license, size, and other criteria.

    On each issue page, you can find detailed information about the collection, such as its identifier (UID) and location.

  2. Create a 🤗 Dataset repository: https://huggingface.co/new-dataset

    • Set Owner: bigscience
    • Set Dataset name: the collection identifier (UID)
    • Select Private
    • Create dataset
  3. Clone the 🤗 Dataset repository:

    Replace <collection UID> with the collection identifier.

    git clone https://huggingface.co/datasets/bigscience/<collection UID>
    cd <collection UID>
  4. Initialize Git LFS in the <collection UID> directory:

    git lfs install
  5. Download the collection to the <collection UID> directory.

  6. Compress the files with gzip or zip.

  7. Commit the files and push:

    git add .
    git commit -m "Add dataset"
    git push
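
The compression in step 6 could be sketched as follows. This is only an illustration, not the procedure prescribed for the hackathon: the function name compress_collection is hypothetical, and the choice to skip the .git directory, .gitattributes, README.md and already-compressed files is an assumption about what a typical dataset repository contains.

```shell
# Hypothetical helper: gzip every regular file under the given directory.
# Assumption: Git metadata, .gitattributes, README.md and files that are
# already compressed (.gz, .zip) should be left untouched.
compress_collection() {
  dir="$1"
  find "$dir" -type f \
    ! -path '*/.git/*' \
    ! -name '.gitattributes' \
    ! -name 'README.md' \
    ! -name '*.gz' \
    ! -name '*.zip' \
    -exec gzip -- {} +
}
```

From inside the <collection UID> directory, it would be called as `compress_collection .` before running `git add .`. Note that gzip replaces each file with a `.gz` version, so keep a copy of the originals elsewhere if you still need them.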