# Phase 5: GGUF Deployment
Merge the LoRA adapter weights into the base model, quantize the merged model to GGUF Q4_K_M with llama.cpp, and upload the result to the Hugging Face Hub for local serving via llama.cpp or Ollama.
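Before quantizing, it helps to sanity-check the expected file size. The sketch below uses ~4.85 bits per weight, a commonly cited average for Q4_K_M (a mixed 4/6-bit k-quant scheme); the exact figure varies by model, and real files also carry metadata and a few higher-precision tensors, so treat it as an estimate.

```python
# Rough on-disk size estimate for a Q4_K_M GGUF file.
# ASSUMPTION: ~4.85 bits/weight average for Q4_K_M; actual size
# depends on the model's tensor mix and GGUF metadata overhead.
def gguf_size_gb(n_params: float, bits_per_weight: float = 4.85) -> float:
    """Approximate GGUF file size in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

print(f"{gguf_size_gb(7e9):.2f} GB")  # ~4.24 GB for a 7B model
```

This is useful for checking that the quantized artifact will fit in the target machine's RAM/VRAM before uploading it to the Hub.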
Status: ready
## Key Metrics
| Metric | Value |
|---|---|
| Quantization | Q4_K_M |
| Context length | 4096 tokens |
| Serve port | 8080 |
## Technologies
- llama.cpp
- GGUF Q4_K_M
- HF Hub
## Outputs
- GGUF model on HF Hub
- llama.cpp server config
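For the Ollama serving path, the GGUF artifact can be wrapped in a Modelfile. This is a minimal sketch: the file name is illustrative, and `num_ctx 4096` mirrors the context length above.

```
FROM ./model-Q4_K_M.gguf
PARAMETER num_ctx 4096
```

Registering it with `ollama create <name> -f Modelfile` (name of your choosing) makes the model available to `ollama run`.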
## Commands

```shell
python -m src.phase5_deployment.scripts.convert_to_gguf --config src/config/v6_config.yaml
python -m src.phase5_deployment.scripts.benchmark --config src/config/v6_config.yaml
```
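Once the model is served (e.g. `llama-server -m model.gguf -c 4096 --port 8080`), llama.cpp exposes an OpenAI-compatible endpoint. The sketch below builds, but does not send, a chat-completions request against the port configured above; the server invocation and model file name are assumptions.

```python
import json
import urllib.request

# Build a request against llama.cpp's OpenAI-compatible chat endpoint.
# ASSUMPTION: the server from this phase is listening on localhost:8080.
payload = {
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send it once the server is running.
print(req.full_url)  # → http://localhost:8080/v1/chat/completions
```

The same request works unchanged against any OpenAI-compatible client library by pointing its base URL at port 8080.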