Phase 3: Vision Model

Train a multimodal model locally on an RTX 5060 Ti, combining a frozen Swin-B vision encoder, an MLP projector, and a Qwen2.5-Coder-1.5B decoder with LoRA adapters. This phase establishes the multimodal baseline.
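The data flow described above can be sketched as shape arithmetic. This is an illustration only, not the project's code: the 224x224 input resolution, the 1536-wide decoder hidden size, and the two-layer shape of the projector are assumptions, and `PipelineShapes` and `mlp_projector_params` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineShapes:
    image_size: int = 224    # assumption: standard Swin-B input resolution
    downsample: int = 32     # Swin total stride: 4x patch embedding, then three 2x merges
    vision_dim: int = 1024   # Swin-B final-stage channel width
    decoder_dim: int = 1536  # assumption: Qwen2.5-Coder-1.5B hidden size

    @property
    def num_visual_tokens(self) -> int:
        # The encoder emits one token per cell of the final feature grid.
        side = self.image_size // self.downsample
        return side * side


def mlp_projector_params(in_dim: int, out_dim: int) -> int:
    """Parameter count of a hypothetical 2-layer MLP projector:
    Linear(in_dim, out_dim) -> activation -> Linear(out_dim, out_dim), with biases."""
    first = in_dim * out_dim + out_dim
    second = out_dim * out_dim + out_dim
    return first + second


shapes = PipelineShapes()
print(shapes.num_visual_tokens)                                     # 49
print(mlp_projector_params(shapes.vision_dim, shapes.decoder_dim))  # 3935232
```

Under these assumptions each image becomes 49 visual tokens of width 1024, which the projector maps to the decoder's embedding width before they are fed to the decoder alongside the text tokens.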

Status: ready

Key Metrics

Metric            Value
Vision encoder    Swin-B (frozen)
Decoder           Qwen2.5-Coder-1.5B
LoRA rank (r)     16
Batch size        2
Epochs            10
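To make these settings concrete, here is a small sketch of what they imply for adapter size and training length. The `Phase3Hyperparams` class, the helper names, the dataset size, and the 1536-wide projection are all assumptions for illustration; none of them are read from `v6_config.yaml`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase3Hyperparams:
    # Values taken from the table above.
    lora_r: int = 16
    batch_size: int = 2
    epochs: int = 10


def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA trains A (r x d_in) and B (d_out x r) beside a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


def total_optimizer_steps(num_examples: int, hp: Phase3Hyperparams, grad_accum: int = 1) -> int:
    """Optimizer steps over the full run, assuming the last partial batch is kept."""
    batches_per_epoch = math.ceil(num_examples / hp.batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * hp.epochs


hp = Phase3Hyperparams()
# Assumption: one square 1536 x 1536 projection in the decoder.
print(lora_added_params(1536, 1536, hp.lora_r))  # 49152
# Assumption: a hypothetical dataset of 1001 examples.
print(total_optimizer_steps(1001, hp))           # 5010
```

With a batch size of 2, gradient accumulation (the hypothetical `grad_accum` argument) is one common way to raise the effective batch size without using more GPU memory; whether this project uses it is not stated here.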

Technologies

  • PyTorch
  • Transformers
  • PEFT
  • TRL
  • W&B

Outputs

  • LoRA adapter weights
  • BLEU/ROUGE baseline metrics
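For readers unfamiliar with these metrics, the following is a simplified sketch of two of their building blocks: ROUGE-L F1 (based on the longest common subsequence) and the clipped unigram precision that BLEU starts from. It uses whitespace tokenization and omits BLEU's higher-order n-grams and brevity penalty, so it is not a substitute for the project's evaluate script, whose exact metric implementation is not shown here. All function names are hypothetical.

```python
from collections import Counter


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (simplified: no stemming, equal P/R weight)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """BLEU-style modified 1-gram precision: each candidate token counts
    at most as often as it appears in the reference."""
    cand = candidate.split()
    if not cand:
        return 0.0
    overlap = Counter(cand) & Counter(reference.split())
    return sum(overlap.values()) / len(cand)


print(rouge_l_f1("a c d e", "a b c d"))           # 0.75
print(clipped_unigram_precision("a a a", "a b"))  # 0.3333333333333333
```

Clipping is what keeps a candidate from scoring well by repeating a single correct token, as the second example shows.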

Commands

python -m src.phase3_vision_model.scripts.train --config src/config/v6_config.yaml

python -m src.phase3_vision_model.scripts.evaluate --config src/config/v6_config.yaml