datasets hackathon
Thank you for participating in the BigScience🌸 Datasets hackathon!
- Create an account at Hugging Face
- Request to be added to the BigScience organization
Install:
By default, collections are added as private community raw datasets in the 🤗 Hub, under the bigscience
namespace.
-
Choose an unassigned open issue from Collections.
Issues are sorted by priority, which depends on license and size, among other criteria.
On each Issue page, you can find detailed information about the collection, such as its identifier (UID) and location.
-
Assign yourself to that issue.
On the Issue page, in the right column, under Assignees, click "assign yourself".
-
Create a 🤗 Dataset repository: https://huggingface.co/new-dataset
- Set Owner: bigscience
- Set Dataset name: the collection identifier (UID)
- Select Private
- Create dataset
-
Clone the 🤗 Dataset repository, replacing <collection UID> with the collection identifier:
git clone https://huggingface.co/datasets/bigscience/<collection UID>
cd <collection UID>
-
Initialize Git LFS in the <collection UID> directory:
cd <collection UID>
git lfs install
-
Download the collection to the <collection UID> directory.
Expected formats are:
- TXT
- JSON/JSONL
- CSV
- HTML/XML
- WARC
If you find another format, create an Issue (labeled "data format") to decide whether and how to convert it.
- PDF: this format is hard to convert.
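As a rough aid for this step, the sketch below checks file extensions against the expected formats listed above. The function names `classify_format` and `report_unexpected` are hypothetical and not part of the hackathon tooling. Matching on the extension alone is an assumption: it says nothing about the actual file contents, and files that are already compressed (for example `.warc.gz`) are reported as "other".

```shell
# Classify a file name by its extension.
# Prints "expected" for the formats listed above, "pdf" for PDF, "other" otherwise.
classify_format() {
    # Lowercase the name so that .TXT and .txt are treated alike.
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        *.txt|*.json|*.jsonl|*.csv|*.html|*.xml|*.warc) echo expected ;;
        *.pdf) echo pdf ;;
        *) echo other ;;
    esac
}

# Print one "<kind><TAB><path>" line for each file under directory $1
# that is not in an expected format. The .git directory is skipped.
report_unexpected() {
    find "$1" -type f ! -path '*/.git/*' | while IFS= read -r f; do
        kind=$(classify_format "$f")
        [ "$kind" = expected ] || printf '%s\t%s\n' "$kind" "$f"
    done
}
```

Running `report_unexpected .` from the collection directory lists the files that may need a "data format" Issue. Repository metadata files such as `.gitattributes` will also show up as "other" and can be ignored.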
-
Compress the files with gzip or zip.
-
Commit the files and push:
git add .
git commit -m "Add dataset"
git push
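The compression step, which comes before the commit, could be scripted along the following lines. This is a minimal sketch and not an official hackathon script: the function name `compress_collection` is hypothetical, and the choice to gzip each file individually and in place (rather than build a single zip archive) is an assumption.

```shell
# Gzip every uncompressed regular file under directory $1, in place.
# Each file "name" is replaced by "name.gz".
# Skipped: the .git directory, .gitattributes, and files already ending in .gz or .zip.
compress_collection() {
    find "$1" -type f \
        ! -path '*/.git/*' \
        ! -name '.gitattributes' \
        ! -name '*.gz' \
        ! -name '*.zip' \
        -exec gzip {} +
}
```

Run it as `compress_collection .` from inside the collection directory, then commit and push as shown above. Add further `! -name` tests for any files that should stay uncompressed, such as a README.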