Phase 1: Data Collection
Scrape high-quality GitHub repositories, filter code files, and capture Monaco Editor screenshots with 8 VS Code themes across 8 programming languages.
Status: complete
Key Metrics
| Metric | Value |
|---|---|
| repos scraped | 4,000 |
| captures | 32,727 |
| languages | 8 |
| themes | 8 |
Technologies
Playwright, GitPython, SQLite, asyncio
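As a rough illustration of how asyncio could bound the number of concurrent screenshot jobs, here is a minimal sketch. It is not the project's actual collection code: `capture`, `capture_all`, `max_concurrent`, and the file and theme names are all hypothetical, and the real Playwright rendering step is replaced by a stand-in coroutine so the example runs on the standard library alone.

```python
import asyncio
from itertools import product

# Placeholder inputs; the real file and theme lists are assumed to come from
# the project's config, which is not shown in this document.
THEMES = ["theme-a", "theme-b"]
FILES = ["a.py", "b.rs", "c.go"]


async def capture(file_path: str, theme: str) -> str:
    """Stand-in for rendering a file in Monaco and saving a screenshot."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"{file_path}.{theme}.png"


async def capture_all(files, themes, max_concurrent: int = 4):
    """Run one capture per (file, theme) pair, at most max_concurrent at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(file_path: str, theme: str) -> str:
        async with sem:
            return await capture(file_path, theme)

    jobs = (bounded(f, t) for f, t in product(files, themes))
    # gather preserves the order in which the jobs were submitted
    return await asyncio.gather(*jobs)


results = asyncio.run(capture_all(FILES, THEMES))
print(len(results), "captures")
```

A semaphore is only one way to cap concurrency; a fixed pool of worker tasks reading from an `asyncio.Queue` would serve the same purpose.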
Outputs
- 32,727 screenshot captures
- catalog.db with repo metadata
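The layout of catalog.db is not documented here, so the following is only a sketch of what a SQLite catalog linking repositories to their captures might look like. Every table name, column name, and helper function below is an assumption made for illustration, and the sample rows are invented.

```python
import sqlite3

# Hypothetical schema: the real catalog.db tables and columns may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS repos (
    id               INTEGER PRIMARY KEY,
    full_name        TEXT NOT NULL UNIQUE,
    stars            INTEGER,
    primary_language TEXT
);
CREATE TABLE IF NOT EXISTS captures (
    id         INTEGER PRIMARY KEY,
    repo_id    INTEGER NOT NULL REFERENCES repos(id),
    file_path  TEXT NOT NULL,
    language   TEXT NOT NULL,
    theme      TEXT NOT NULL,
    image_path TEXT NOT NULL
);
"""


def open_catalog(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a catalog database and ensure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def captures_per_language(conn: sqlite3.Connection) -> dict:
    """Return a {language: capture count} mapping."""
    rows = conn.execute(
        "SELECT language, COUNT(*) FROM captures GROUP BY language ORDER BY language"
    )
    return dict(rows.fetchall())


# Invented sample data, inserted into an in-memory database.
conn = open_catalog()
cur = conn.execute(
    "INSERT INTO repos (full_name, stars, primary_language) VALUES (?, ?, ?)",
    ("example/repo", 1200, "Python"),
)
repo_id = cur.lastrowid
conn.executemany(
    "INSERT INTO captures (repo_id, file_path, language, theme, image_path) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (repo_id, "src/main.py", "python", "theme-a", "out/0001.png"),
        (repo_id, "src/main.py", "python", "theme-b", "out/0002.png"),
        (repo_id, "lib/util.rs", "rust", "theme-a", "out/0003.png"),
    ],
)
conn.commit()
print(captures_per_language(conn))
```

A per-language count like this is one simple way to check that captures are spread across the languages as expected.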
Commands
python -m src.phase1_data_collection.scripts.run_collection --config src/config/v6_config.yaml
python -m src.phase1_data_collection.scripts.validate_samples --config src/config/v6_config.yaml