Phase 3: Vision Model
Train a multimodal model locally on an RTX 5060 Ti: a frozen Swin-B vision encoder, an MLP projector, and a Qwen2.5-Coder-1.5B decoder with LoRA adapters. This phase establishes the multimodal baseline.
Status: ready
Key Metrics
| Metric | Value |
|---|---|
| Vision encoder | Swin-B (frozen) |
| Decoder | Qwen2.5-Coder-1.5B |
| LoRA rank (r) | 16 |
| Batch size | 2 |
| Epochs | 10 |
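
To give a feel for what a LoRA rank of 16 means in terms of trainable weights, the arithmetic can be sketched as below. LoRA adds two low-rank factors to each adapted linear layer, so the added parameter count is r * (d_in + d_out). The hidden size of 1536 is an assumption about the decoder, not a value taken from this document, and should be checked against the model config. Which layers are actually adapted is also not stated here, so the sketch covers a single square projection only.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    LoRA learns two low-rank factors, A (r x d_in) and B (d_out x r),
    so the added parameter count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)


R = 16          # LoRA rank from the Key Metrics table
HIDDEN = 1536   # ASSUMPTION: decoder hidden size; verify against the model config

per_layer = lora_param_count(HIDDEN, HIDDEN, R)
full_layer = HIDDEN * HIDDEN  # weights in the frozen square projection

print(per_layer)                         # 49152
print(f"{per_layer / full_layer:.2%}")   # 2.08%
```

Under that assumption, each adapted square projection trains roughly 2% as many weights as the frozen layer it wraps.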
Technologies
PyTorch, Transformers, PEFT, TRL, W&B
Outputs
- LoRA adapter weights
- BLEU/ROUGE baseline metrics
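
As a rough illustration of what an overlap-based metric such as ROUGE measures, here is a minimal sketch of a unigram ROUGE-1 F1 score using whitespace tokenization. It is a simplified stand-in, not the project's evaluation code: the function name `rouge1_f1` and the example strings are hypothetical, and a real evaluation would more likely rely on an established metrics library with proper tokenization.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a generated string."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())

    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example strings, for illustration only.
score = rouge1_f1("a b c d", "a b d")
print(round(score, 3))  # 0.857
```

In the example, all three candidate tokens appear in the reference (precision 1.0) while the reference has four tokens (recall 0.75), which gives an F1 of 6/7.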
Commands
python -m src.phase3_vision_model.scripts.train --config src/config/v6_config.yaml
python -m src.phase3_vision_model.scripts.evaluate --config src/config/v6_config.yaml
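
Both commands take the same `--config` flag. How the scripts handle it internally is not shown here; the following is a hypothetical sketch of how an entry point invoked this way might parse the flag with the standard library. The function name `parse_args` and the help text are assumptions for illustration.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the --config flag (hypothetical; the real scripts may differ)."""
    parser = argparse.ArgumentParser(description="Phase 3 entry point (sketch)")
    parser.add_argument(
        "--config",
        type=Path,
        required=True,
        help="Path to the YAML config file",
    )
    return parser.parse_args(argv)


# Simulate the command-line invocation shown above.
args = parse_args(["--config", "src/config/v6_config.yaml"])
print(args.config.name)  # v6_config.yaml
```

Passing `argv` explicitly keeps the parser easy to exercise without touching `sys.argv`.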