Phase 1: Data Collection

Scrape high-quality GitHub repositories, filter their code files, and capture Monaco Editor screenshots of the code in 8 VS Code themes across 8 programming languages.
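
The "filter code files" step could look roughly like the sketch below. It is only an illustration: the extension map, the line-count bounds, and the function names `detect_language` and `is_candidate` are assumptions made here, since this page does not list the 8 languages or the actual filtering rules.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical extension-to-language map. The page does not name the
# 8 languages, so these four entries are placeholders.
LANGUAGE_BY_EXTENSION = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".go": "go",
}

# Hypothetical size bounds (in lines) so a file fits a single screenshot.
MIN_LINES = 5
MAX_LINES = 500


def detect_language(path: str) -> Optional[str]:
    """Return the language for a repo-relative path, or None if unsupported."""
    return LANGUAGE_BY_EXTENSION.get(PurePosixPath(path).suffix.lower())


def is_candidate(path: str, text: str) -> bool:
    """Keep files in a supported language whose length is within bounds."""
    if detect_language(path) is None:
        return False
    line_count = len(text.splitlines())
    return MIN_LINES <= line_count <= MAX_LINES
```

A real filter would likely also skip vendored, generated, or minified files; that is left out here to keep the sketch short.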

Status: complete

Key Metrics

Metric          Value
repos scraped   4,000
captures        32,727
languages       8
themes          8

Technologies

  • Playwright
  • GitPython
  • SQLite
  • asyncio
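
As a rough illustration of how asyncio might drive the capture step, the sketch below fans out one task per (sample, theme) pair and caps concurrency with a semaphore. Everything named here is an assumption: `capture` is a stand-in for the real Playwright screenshot call, and the theme names, `capture_all`, and `max_concurrency` are hypothetical.

```python
import asyncio
from itertools import product

# Hypothetical theme names; the page only states that there are 8.
THEMES = ["theme-a", "theme-b"]


async def capture(sample_id: str, theme: str) -> str:
    """Stand-in for a Playwright screenshot; returns the output file name."""
    await asyncio.sleep(0)  # yield control, as real browser I/O would
    return f"{sample_id}__{theme}.png"


async def capture_all(sample_ids, themes, max_concurrency: int = 4):
    """Capture every (sample, theme) pair with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(sample_id: str, theme: str) -> str:
        async with semaphore:
            return await capture(sample_id, theme)

    tasks = [bounded(s, t) for s, t in product(sample_ids, themes)]
    # gather returns results in the same order as the tasks were listed
    return await asyncio.gather(*tasks)


results = asyncio.run(capture_all(["s1", "s2"], THEMES))
```

Bounding concurrency keeps the number of simultaneously open browser pages predictable, which tends to matter more than raw parallelism for screenshot workloads.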

Outputs

  • 32,727 screenshot captures
  • catalog.db with repo metadata
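
For orientation, a catalog of repo metadata in SQLite might be set up along these lines. The schema is an assumption: the `repos` table, its columns, and the helpers `open_catalog` and `upsert_repo` are hypothetical and may not match the real catalog.db.

```python
import sqlite3


def open_catalog(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a catalog database with a hypothetical repos table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS repos (
            full_name        TEXT PRIMARY KEY,  -- e.g. "owner/name"
            stars            INTEGER NOT NULL,
            primary_language TEXT
        )
        """
    )
    return conn


def upsert_repo(conn, full_name: str, stars: int, primary_language: str) -> None:
    """Insert a repo row, replacing any existing row with the same name."""
    conn.execute(
        "INSERT OR REPLACE INTO repos (full_name, stars, primary_language) "
        "VALUES (?, ?, ?)",
        (full_name, stars, primary_language),
    )
    conn.commit()


def count_repos(conn) -> int:
    """Return the number of cataloged repositories."""
    return conn.execute("SELECT COUNT(*) FROM repos").fetchone()[0]
```

Keying rows on the repository name makes re-running the scraper idempotent: a second pass over the same repo updates its row instead of adding a duplicate.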

Commands

python -m src.phase1_data_collection.scripts.run_collection --config src/config/v6_config.yaml

python -m src.phase1_data_collection.scripts.validate_samples --config src/config/v6_config.yaml