Introduction
Running large language models (LLMs) locally gives you complete control over your AI infrastructure, ensuring privacy, eliminating API costs, and enabling offline usage. In this tutorial, we’ll walk through setting up Ollama with Podman to run the Qwen2.5-14B model on your local machine.
Prerequisites
Before starting, ensure you have:
- A Linux system (Ubuntu/Debian-based in this example)
- Sufficient disk space (the q4_k_m model is roughly 9GB, and the split download, the merged file, and the copy inside the container each hold a full copy, so budget 25-30GB free during setup)
- Basic familiarity with command-line operations
- Sudo privileges
Step 1: Install Podman
Podman is a daemonless container engine that’s an excellent alternative to Docker. Let’s install it:
sudo apt update
sudo apt install -y podman
Step 2: Pull the Ollama Container Image
Ollama provides a convenient interface for running LLMs. Pull the official image:
sudo podman pull docker.io/ollama/ollama
Step 3: Run the Ollama Container
Launch Ollama as a background service with persistent storage:
sudo podman run -d \
--name ollama \
-p 11434:11434 \
-v ollama:/root/.ollama \
docker.io/ollama/ollama
Single-line version:
sudo podman run -d --name ollama -p 11434:11434 -v ollama:/root/.ollama docker.io/ollama/ollama
Verify the container is running:
sudo podman ps
Step 4: Download the Model Files
Create a directory for your models:
mkdir -p ~/models/qwen2.5-14b/
cd ~/models/qwen2.5-14b/
The Qwen2.5-14B model in GGUF format is split into multiple files. Download all parts:
wget https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GGUF/resolve/main/qwen2.5-14b-instruct-q4_k_m-00001-of-00003.gguf
wget https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GGUF/resolve/main/qwen2.5-14b-instruct-q4_k_m-00002-of-00003.gguf
wget https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GGUF/resolve/main/qwen2.5-14b-instruct-q4_k_m-00003-of-00003.gguf
Note: The q4_k_m quantization offers a good balance between model size and quality: 4-bit weights stored with llama.cpp’s “k-quant” block-wise scheme, medium variant.
Step 5: Build llama.cpp for Merging Split Files
Since the model files are split, we need llama.cpp’s merge utility. Clone and build it:
cd ~/models
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Build with CMake:
cmake -B build
cmake --build build --config Release -j $(nproc)
This will use all available CPU cores for faster compilation. The build process may take several minutes.
Verify the installation:
./build/bin/llama-gguf-split --version
You should see output similar to:
version: 7850 (f2571df8b)
built with GNU 13.3.0 for Linux x86_64
Step 6: Merge the Split Model Files
Use the llama-gguf-split tool to merge the parts. You only need to specify the first file:
cd ~/models
./llama.cpp/build/bin/llama-gguf-split --merge \
qwen2.5-14b/qwen2.5-14b-instruct-q4_k_m-00001-of-00003.gguf \
qwen2.5-14b/qwen2.5-14b-instruct-q4_k_m.gguf
Expected output:
gguf_merge: reading metadata done
gguf_merge: writing tensors done
gguf_merge: merged from 3 split with 579 tensors.
The merged file will be created at ~/models/qwen2.5-14b/qwen2.5-14b-instruct-q4_k_m.gguf.
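Before moving the file around, you can sanity-check that the merge produced a valid model file: every GGUF file begins with the 4-byte magic b"GGUF". A minimal Python check (the example path is the one used in this tutorial; adjust it to your setup):

```python
# Sanity-check a merged model: GGUF files begin with the 4-byte magic b"GGUF".
def is_gguf(path):
    """Return True if the file at `path` starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Example usage (adjust the path to wherever your merged file lives):
# print(is_gguf("/home/you/models/qwen2.5-14b/qwen2.5-14b-instruct-q4_k_m.gguf"))
```

This only verifies the header, not the tensor data, but it quickly catches a truncated download or a merge that wrote the wrong file.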
Step 7: Copy the Model into the Container
Transfer the merged model file to the Ollama container:
sudo podman cp ~/models/qwen2.5-14b/qwen2.5-14b-instruct-q4_k_m.gguf \
ollama:/root/.ollama/models/
Step 8: Create a Modelfile
The Modelfile defines how Ollama should load and configure your model. Create it on your host system:
vi ~/Modelfile
Add the following content:
FROM /root/.ollama/models/qwen2.5-14b-instruct-q4_k_m.gguf
TEMPLATE """{{ .System }}{{ .Prompt }}"""
PARAMETER temperature 0.7
Alternative: If you’ve already registered the model with Ollama (via ollama pull or ollama create), you can reference it by its registered name instead of a file path:
FROM qwen2.5-14b-instruct-q4_k_m
TEMPLATE """{{ .System }}{{ .Prompt }}"""
PARAMETER temperature 0.7
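The two-line template above simply concatenates the system and user text. Qwen2.5-Instruct models are trained on a ChatML-style conversation format, so a Modelfile that spells that format out usually gives better results. The sketch below is one way to do it; the parameter values and system prompt are illustrative, not required:

```
FROM /root/.ollama/models/qwen2.5-14b-instruct-q4_k_m.gguf
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER stop "<|im_end|>"
SYSTEM "You are a helpful assistant."
```

The stop parameter prevents the model from generating past its own end-of-turn marker, and num_ctx sets the context window size.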
Copy the Modelfile into the container:
sudo podman cp ~/Modelfile ollama:/root/Modelfile
Step 9: Register the Model with Ollama
Create a named model within Ollama:
sudo podman exec -it ollama ollama create qwen25 -f /root/Modelfile
This registers the model as “qwen25” (you can choose any name you prefer).
Step 10: Run and Test Your Model
Interactive Mode
Start a chat session with your model:
sudo podman exec -it ollama ollama run qwen25
Type your questions and press Enter. Use /bye to exit.
REST API
Query the model programmatically:
curl http://localhost:11434/api/generate -d '{
"model": "qwen25",
"prompt": "Explain quantum computing in simple terms."
}'
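The same endpoint can be called from Python. By default, /api/generate streams its answer as newline-delimited JSON objects, each carrying a "response" fragment and a "done" flag. The helper below stitches those fragments together; the live-request portion is commented out so the snippet stands alone (it assumes the container from Step 3 is running and the model name "qwen25" from Step 9):

```python
import json
from urllib import request  # stdlib; used only in the live call sketched below

def collect_response(ndjson_lines):
    """Join the "response" fragments from Ollama's streaming NDJSON output."""
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        parts.append(obj.get("response", ""))
        if obj.get("done"):
            break
    return "".join(parts)

# Live usage against the container started earlier:
# payload = json.dumps({"model": "qwen25", "prompt": "Say hi"}).encode()
# req = request.Request("http://localhost:11434/api/generate", data=payload,
#                       headers={"Content-Type": "application/json"})
# with request.urlopen(req) as resp:
#     print(collect_response(resp))
```

Passing "stream": false in the payload instead returns a single JSON object, which is simpler for short scripted queries.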
List Available Models
Check all models in your Ollama instance:
sudo podman exec -it ollama ollama list
Troubleshooting Tips
Container won’t start: Check whether port 11434 is already in use (ss ships with modern distributions; netstat requires the net-tools package):
sudo ss -tulpn | grep 11434
Out of memory errors: Running Qwen2.5-14B at q4_k_m needs roughly 10-12GB of RAM (the ~9GB model plus context buffers). Consider using a smaller model or adding swap space.
Model not found: Verify the model was copied correctly:
sudo podman exec -it ollama ls -lh /root/.ollama/models/
Slow responses: GGUF models run on CPU by default. For better performance, consider:
- Using GPU acceleration (requires additional setup)
- Selecting a smaller quantization (e.g., q4_0 instead of q4_k_m)
- Using a smaller model (e.g., 7B instead of 14B)
Understanding Quantization
The q4_k_m in the filename breaks down as:
- q4: 4-bit quantization (cuts the file to roughly a quarter of the 16-bit size)
- k: llama.cpp’s “k-quant” scheme, which stores weights in blocks with per-block scaling factors (better quality than the older round-to-nearest formats)
- m: medium variant (balances size and accuracy)
Other common quantization options include q4_0 (smallest), q5_k_m (better quality), and q8_0 (highest quality, largest size).
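As a rough rule of thumb, a GGUF file’s size is the parameter count times the bits per weight, plus a few percent of overhead for embeddings and metadata. The bits-per-weight figures below are approximate effective values, not exact, but they explain the size differences between the formats listed above:

```python
# Back-of-the-envelope GGUF size estimate: params * bits-per-weight / 8 bytes.
# The bits-per-weight values are approximate effective figures, not exact.
BITS_PER_WEIGHT = {"q4_0": 4.5, "q4_k_m": 4.8, "q5_k_m": 5.5, "q8_0": 8.5}

def estimate_gib(params_billion, quant):
    """Approximate model file size in GiB for a given quantization."""
    bytes_total = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return bytes_total / 2**30

for quant in BITS_PER_WEIGHT:
    print(f"14B @ {quant}: ~{estimate_gib(14, quant):.1f} GiB")
```

Actual files run slightly larger than these estimates because some layers (such as embeddings) are kept at higher precision.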
Conclusion
You now have a fully functional local LLM setup running in a containerized environment. This approach gives you the flexibility to experiment with different models, maintain privacy, and avoid API costs. The Podman + Ollama combination provides a robust, repeatable way to run AI models on your own infrastructure.
Next Steps
- Experiment with different models from Hugging Face
- Integrate the REST API with your applications
- Set up GPU acceleration for faster inference
- Create custom Modelfiles with specific prompting templates
- Explore Ollama’s embedding and completion APIs
Related Resources: