Phase 2: Preprocessing

Convert screenshot captures to HuggingFace datasets in Qwen chat format, apply quality filtering, compute statistics, and upload to HF Hub.

Status: complete

Key Metrics

MetricValue
total samples 32658
train samples 26126
val samples 3265
test samples 3267

Technologies

  • HuggingFace Datasets
  • Pillow
  • WebP encoding

Outputs

  • 32,658-sample HF dataset
  • train/val/test splits

Commands

python -m src.phase2_preprocessing.scripts.build_dataset --config src/config/v6_config.yaml

python -m src.phase2_preprocessing.scripts.upload_to_hub --config src/config/v6_config.yaml
← Phase 1 Phase 3 →