📑 Paper | 🔨 fastText Classifier | 🤗 Released Dataset | 📦 Repo
We release our trained fastText classifier and a 100B-token filtered high-quality dataset on Hugging Face for direct use.
| Name | Type | Huggingface Link |
|---|---|---|
| preselect-fasttext-classifier | Model | 🤗Huggingface |
| preselect-100B | Dataset | 🤗Huggingface |
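To fetch the classifier programmatically, you can use `huggingface_hub` (a minimal sketch; the repo id below is a placeholder, substitute the one linked in the table above):

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id and file name; use the ones from the table above.
model_path = hf_hub_download(
    repo_id="<org>/preselect-fasttext-classifier",
    filename="PreSelect-classifier.bin",
)
print(model_path)  # local path to the downloaded .bin file
```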
We provide a Dockerfile that contains the environment for filtering, training, and evaluation.
```bash
docker build -t preselect:latest .
docker run --gpus all --network host -it --shm-size=20g --privileged preselect:latest
```
After that, you need to prepare your pretraining corpus (e.g., download a Common Crawl subset). We provide an example that downloads DCLM's RefinedWeb. Note that this requires setting up AWS credentials beforehand.
```bash
cd data_processing/data/clean_pool
python download.py
python unzip.py
```
You can also prepare your own data; a sketch of the expected format is shown below.
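The filtering script below reads JSONL shards and expects each document's body under a `"text"` key (see the `JsonlReader` call), so a minimal conversion sketch for your own data looks like this (paths are illustrative):

```python
import json

documents = ["first document ...", "second document ..."]  # your raw texts

# One JSON object per line, with the document body under the "text" key.
with open("my_corpus/shard_000.jsonl", "w") as f:
    for text in documents:
        f.write(json.dumps({"text": text}) + "\n")
```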
If you want to directly use our trained fastText classifier, you can download it from Hugging Face and run the following code:
```python
import argparse
from pathlib import Path

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import FastTextClassifierFilter
from datatrove.pipeline.readers import JsonlReader, ParquetReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

parser = argparse.ArgumentParser("Filter")
parser.add_argument("--input_path", type=str, help="input path name")
parser.add_argument("--output_path", type=str, help="output path name")
args = parser.parse_args()

Path(args.output_path).mkdir(parents=True, exist_ok=True)

dist_executor = LocalPipelineExecutor(
    skip_completed=True,
    pipeline=[
        # Read JSONL shards with the document body under the "text" key;
        # swap in ParquetReader if your corpus is stored as parquet.
        JsonlReader(args.input_path, text_key="text", default_metadata={}),
        # Keep documents the classifier labels "1" with score >= 0.5.
        FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
        # Write the surviving documents back out as uncompressed JSONL.
        JsonlWriter(args.output_path, compression=None),
    ],
    tasks=100,
)
dist_executor.run()
```
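Assuming you save the script above as `filter.py`, a typical invocation (paths are illustrative) is `python filter.py --input_path data/clean_pool --output_path data/filtered`.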
The first step is to pick a small subset of the corpus and compute the bits-per-character (BPC) of each example under each reference model.
```bash
cd data_processing/bpc
python -u main.py \
    --model_name {MODEL_NAME} \
    --block_size 1900 \
    --stride 512 \
    --batch_size 1
```
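For intuition, BPC is the model's total negative log-likelihood of the text, in bits, divided by the number of characters. Below is a minimal sketch of the computation (assuming a Hugging Face causal LM and following the standard sliding-window perplexity recipe; the repo's `main.py` may differ in details):

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for {MODEL_NAME}
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


@torch.no_grad()
def bits_per_character(text: str, block_size: int = 1900, stride: int = 512) -> float:
    """Sliding-window NLL, converted from nats/token to bits/character."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    seq_len = ids.size(1)
    total_nll, prev_end = 0.0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + block_size, seq_len)
        target_len = end - prev_end          # only score tokens not seen before
        input_ids = ids[:, begin:end]
        labels = input_ids.clone()
        labels[:, :-target_len] = -100       # mask the overlapping context
        out = model(input_ids, labels=labels)
        # out.loss is the mean NLL in nats over the (approximately) target_len
        # scored tokens, so multiply back to get the window's total NLL.
        total_nll += out.loss.item() * target_len
        prev_end = end
        if end == seq_len:
            break
    # Convert nats to bits and normalize by the number of characters.
    return total_nll / math.log(2) / len(text)
```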
Then you can train the fastText classifier using the data computed in Step 1.
```bash
cd data_processing/fasttext
python train_fasttext.py
```
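For reference, supervised fastText training boils down to a call like the following (a minimal sketch; the file name, label scheme, and hyperparameters are assumptions, not the repo's exact configuration). Note that label "1" matches the `keep_labels=[("1", 0.5)]` setting used in the filtering script above:

```python
import fasttext

# train.txt holds one example per line in fastText's supervised format,
# e.g. documents whose BPC ranking marks them as high quality get label "1":
#   __label__1 <text of a high-ranked document>
#   __label__0 <text of a low-ranked document>
model = fasttext.train_supervised(input="train.txt", lr=0.1, epoch=3, wordNgrams=2)
model.save_model("PreSelect-classifier.bin")
```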
Finally, you can filter your large corpus using the fastText classifier. The provided script works on a single CPU machine, but it can easily be extended to multi-machine filtering.

```bash
bash pipeline.sh {FASTTEXT_NAME} filter NO NO NO NO 0 NO 1 0.1
```
If you are training on a single node (e.g., 8 GPUs), you can use the following command:

```bash
bash pipeline.sh {FASTTEXT_NAME} NO tokenize train convert NO 0 NO 1 0.1 {HOME_PATH} 1 {TRAINING_STEPS}
```
If you are training on multiple nodes (e.g., 8 GPUs × 4 nodes), you can use the following command:

```bash
bash pipeline_multi_node.sh {FASTTEXT_NAME} NO tokenize train convert NO {MAIN_NODE_ADDRESS} NO 1 0.1 {HOME_PATH} {N_NODE} {TRAINING_STEPS}
```
For more information, you can refer to the pipeline script.
You can refer to OpenCompass and LM-Evaluation-Harness to set up evaluation of the trained checkpoints to fit your needs.
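For example, a minimal LM-Evaluation-Harness run on a trained checkpoint might look like this (the checkpoint path and task list are placeholders; check the harness documentation for the API of the version you install):

```python
from lm_eval import simple_evaluate

# Placeholder checkpoint path and tasks; pick the benchmarks you care about.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/your/checkpoint",
    tasks=["hellaswag", "arc_easy"],
)
print(results["results"])
```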
If you find this work helpful, please kindly cite as:
```bibtex
@article{shum2025predictivedataselectiondata,
  title={Predictive Data Selection: The Data That Predicts Is the Data That Teaches},
  author={Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
  journal={arXiv preprint arXiv:2503.00808},
  year={2025},
  eprint={2503.00808},
}
```
Thanks to the following open-source projects, from which some of the code in this repository is adapted and modified: