Architecture

Pipeline Overview

GitHub Repos
    │
    ▼
Phase 1: Data Collection
  ├─ github_scraper.py   → filter high-quality repos
  ├─ vscode_automation.py → headless Monaco screenshot capture
  └─ sqlite_catalog.py   → track in catalog.db
    │
    ▼
Phase 2: Preprocessing
  ├─ hf_dataset_converter.py → PNG + source.txt → HF Dataset
  ├─ chat_formatter.py       → Qwen chat message format
  └─ upload_to_hub.py        → push to HuggingFace Hub
    │
    ▼
Phase 3: Vision Model (local RTX 5060 Ti)
  ├─ Swin-B (frozen) vision encoder
  ├─ MLP projector
  └─ Qwen2.5-Coder-1.5B + LoRA decoder
    │
    ▼
Phase 4: Qwen Fine-tuning (cloud A100)
  ├─ Qwen2.5-Coder-14B-Instruct
  ├─ LoRA sweep: conservative / standard / aggressive
  └─ Full training on top-2 configs
    │
    ▼
Phase 5: GGUF Deployment
  ├─ Merge LoRA weights
  ├─ GGUF Q4_K_M quantization via llama.cpp
  └─ Upload to HuggingFace Hub
    │
    ▼
Phase 6: Inference (planned)
  ├─ vLLM serving
  ├─ Qwen-Agent integration
  └─ MCP tool interface
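Phase 4 runs a LoRA sweep before committing compute to full training. As a minimal sketch of what such a sweep might look like, the configuration values below are illustrative assumptions (the actual hyperparameters are not given in this document), and `top_k_configs` is a hypothetical helper for the "top-2 configs" selection step:

```python
# Hypothetical LoRA sweep presets for Phase 4 -- the rank/alpha/dropout
# values are assumptions for illustration, not the project's real settings.
LORA_SWEEP = {
    "conservative": {"r": 8,  "lora_alpha": 16, "lora_dropout": 0.10},
    "standard":     {"r": 16, "lora_alpha": 32, "lora_dropout": 0.05},
    "aggressive":   {"r": 32, "lora_alpha": 64, "lora_dropout": 0.05},
}

def top_k_configs(eval_losses: dict, k: int = 2) -> list:
    """Pick the k sweep configs with the lowest eval loss,
    i.e. the ones promoted to full training."""
    return sorted(eval_losses, key=eval_losses.get)[:k]
```

For example, with eval losses of 0.9 / 0.7 / 0.8 for the three presets, `top_k_configs` would promote "standard" and "aggressive" to full training.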

Capture Format

Each screenshot capture is stored in a content-addressed directory, sharded by SHA-256 prefix:

data/sample-data/captures/
  <2-char-hex>/
    <16-char-sha256>/
      0000.png        # First viewport screenshot
      0001.png        # Scrolled continuation
      source.txt      # Full source code
      metadata.json   # language, line_count, theme, viewport
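The layout above suggests the capture path is derived from a SHA-256 digest of the source. A minimal sketch of that derivation, assuming the shard directory is the first 2 hex characters of the digest and the leaf directory the first 16 (the exact scheme is inferred from the layout, and `capture_dir` is a hypothetical helper, not necessarily what sqlite_catalog.py does):

```python
import hashlib
from pathlib import Path

def capture_dir(source: str, root: str = "data/sample-data/captures") -> Path:
    """Map source text to its content-addressed capture directory.
    Assumption: shard = first 2 hex chars of the SHA-256 digest,
    leaf = first 16 hex chars."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return Path(root) / digest[:2] / digest[:16]
```

Content addressing like this deduplicates captures automatically: scraping the same file twice resolves to the same directory.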

Dataset Format

Each training sample follows the Qwen chat format:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "<base64-webp>"},
        {"type": "text",  "text": "Generate the source code shown in this VS Code screenshot."}
      ]
    },
    {
      "role": "assistant",
      "content": "<source code>"
    }
  ]
}
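Assembling one such sample from a capture is straightforward. Below is a minimal sketch of what chat_formatter.py plausibly does; the helper name `to_chat_sample` is hypothetical, and the WebP re-encoding of the stored PNGs is assumed to happen upstream (here the image bytes are simply base64-encoded as given):

```python
import base64

PROMPT = "Generate the source code shown in this VS Code screenshot."

def to_chat_sample(image_bytes: bytes, source_code: str) -> dict:
    """Build one training sample in the Qwen chat format shown above.
    Hypothetical helper; assumes image_bytes is already WebP-encoded."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image",
                     "image": base64.b64encode(image_bytes).decode("ascii")},
                    {"type": "text", "text": PROMPT},
                ],
            },
            {"role": "assistant", "content": source_code},
        ]
    }
```

Pairing the screenshot with a fixed instruction keeps the supervision signal uniform, so the model learns image-to-code transcription rather than prompt following.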