Phase 2: Preprocessing
Convert screenshot captures to HuggingFace datasets in Qwen chat format, apply quality filtering, compute statistics, and upload to HF Hub.
Status: complete
Key Metrics
| Metric | Value |
|---|---|
total samples | 32658 |
train samples | 26126 |
val samples | 3265 |
test samples | 3267 |
Technologies
HuggingFace DatasetsPillowWebP encoding
Outputs
- 32,658-sample HF dataset
- train/val/test splits
Commands
python -m src.phase2_preprocessing.scripts.build_dataset --config src/config/v6_config.yaml
python -m src.phase2_preprocessing.scripts.upload_to_hub --config src/config/v6_config.yaml