Phase 5: GGUF Deployment

Merge the LoRA adapter weights into the base model, quantize the merged model to GGUF Q4_K_M with llama.cpp, and upload the result to the Hugging Face Hub for local serving via llama.cpp or Ollama.
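A rough size estimate shows why Q4_K_M is the target for local serving. Q4_K_M averages roughly 4.85 bits per weight in llama.cpp (an approximate figure; real files also carry metadata and a few higher-precision tensors), versus 16 bits for FP16. The 7B parameter count below is illustrative, not necessarily this project's model size:

```python
def gguf_size_gb(n_params: float, bits_per_weight: float = 4.85) -> float:
    """Approximate GGUF file size in GB for a given parameter count.

    4.85 bits/weight is a rough average for Q4_K_M; actual files are
    slightly larger due to metadata and mixed-precision tensors.
    """
    return n_params * bits_per_weight / 8 / 1e9

# Illustrative 7B model: ~4.2 GB at Q4_K_M vs 14 GB at FP16.
print(round(gguf_size_gb(7e9), 2))        # Q4_K_M
print(round(gguf_size_gb(7e9, 16.0), 2))  # FP16 baseline
```

The roughly 3.3x reduction is what makes CPU and single-GPU serving of the merged model practical.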

Status: ready

Key Metrics

Metric          Value
quantization    Q4_K_M
context length  4096 tokens
serve port      8080
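These metrics would typically live in the deployment section of `src/config/v6_config.yaml`. The key names below are assumptions about the schema, not the project's actual config:

```yaml
# Hypothetical deployment section of v6_config.yaml (key names are illustrative)
deployment:
  quantization: Q4_K_M   # llama.cpp k-quant, ~4.85 bits/weight on average
  context_length: 4096   # tokens
  serve:
    port: 8080
  hub_repo: null         # set to the target HF Hub repo id before upload
```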

Technologies

  • llama.cpp
  • GGUF Q4_K_M
  • HF Hub

Outputs

  • GGUF model on HF Hub
  • llama.cpp server config

Commands

Convert the merged model to GGUF and quantize:

python -m src.phase5_deployment.scripts.convert_to_gguf --config src/config/v6_config.yaml

Benchmark the deployed model:

python -m src.phase5_deployment.scripts.benchmark --config src/config/v6_config.yaml