Architecture
Pipeline Overview
GitHub Repos
     │
     ▼
Phase 1: Data Collection
  ├─ github_scraper.py       → filter high-quality repos
  ├─ vscode_automation.py    → headless Monaco screenshot capture
  └─ sqlite_catalog.py       → track in catalog.db
     │
     ▼
Phase 2: Preprocessing
  ├─ hf_dataset_converter.py → PNG + source.txt → HF Dataset
  ├─ chat_formatter.py       → Qwen chat message format
  └─ upload_to_hub.py        → push to HuggingFace Hub
     │
     ▼
Phase 3: Vision Model (local RTX 5060 Ti)
  ├─ Swin-B (frozen) vision encoder
  ├─ MLP projector
  └─ Qwen2.5-Coder-1.5B + LoRA decoder
     │
     ▼
Phase 4: Qwen Fine-tuning (cloud A100)
  ├─ Qwen2.5-Coder-14B-Instruct
  ├─ LoRA sweep: conservative / standard / aggressive
  └─ Full training on top-2 configs
     │
     ▼
Phase 5: GGUF Deployment
  ├─ Merge LoRA weights
  ├─ GGUF Q4_K_M quantization via llama.cpp
  └─ Upload to HuggingFace Hub
     │
     ▼
Phase 6: Inference (planned)
  ├─ vLLM serving
  ├─ Qwen-Agent integration
  └─ MCP tool interface

Capture Format
Each screenshot capture is stored as:

data/sample-data/captures/
  <2-char-hex>/
    <16-char-sha256>/
      0000.png         # First viewport screenshot
      0001.png         # Scrolled continuation
      source.txt       # Full source code
      metadata.json    # language, line_count, theme, viewport

Dataset Format
Each training sample follows the Qwen chat format:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "<base64-webp>"},
        {"type": "text", "text": "Generate the source code shown in this VS Code screenshot."}
      ]
    },
    {
      "role": "assistant",
      "content": "<source code>"
    }
  ]
}
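The two layouts above can be tied together in a short sketch: derive the sharded capture directory from a SHA-256 digest, and wrap an image/source pair in the chat schema shown. This is a minimal illustration, not the actual `chat_formatter.py`; the helper names (`capture_dir`, `to_chat_sample`) are hypothetical, and it assumes the `<16-char-sha256>` component is a hex-digest prefix of the source text and that the image bytes are already WebP-encoded.

```python
import base64
import hashlib
from pathlib import Path

PROMPT = "Generate the source code shown in this VS Code screenshot."


def capture_dir(root: Path, source: str) -> Path:
    """Derive the sharded capture directory (hypothetical helper).

    Assumes the <2-char-hex>/<16-char-sha256>/ layout is keyed on the
    SHA-256 hex digest of the source text: the first 2 hex chars form
    the shard, the first 16 form the capture directory.
    """
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return root / digest[:2] / digest[:16]


def to_chat_sample(image_bytes: bytes, source: str) -> dict:
    """Format one capture as a Qwen chat-format sample (schema above).

    Assumes image_bytes is already an encoded WebP; it is only
    base64-encoded here to fill the "image" content part.
    """
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image",
                     "image": base64.b64encode(image_bytes).decode("ascii")},
                    {"type": "text", "text": PROMPT},
                ],
            },
            {"role": "assistant", "content": source},
        ]
    }
```

A converter in the spirit of hf_dataset_converter.py would then walk `data/sample-data/captures/`, read `0000.png` and `source.txt` from each capture directory, and emit one such dict per sample.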