Introduction
Running large language models (LLMs) locally gives you complete control over your AI infrastructure, ensuring privacy, eliminating API costs, and enabling offline usage. In this tutorial, we’ll walk through setting up Ollama with Podman to run the Qwen2.5-14B model on your local machine.
Prerequisites
Before starting, ensure you have:
- A Linux system (Ubuntu/Debian-based in this example)
- Sufficient disk space (the q4_k_m model is roughly 9GB, and the split download, the merged file, and the copy inside the container each hold a full copy, so budget 25-30GB free during setup)
- Basic familiarity with command-line operations
- Sudo privileges
Step 1: Install Podman
Podman is a daemonless container engine that’s an excellent alternative to Docker. Let’s install it:
sudo apt update
sudo apt install -y podman
Step 2: Pull the Ollama Container Image
Ollama provides a convenient interface for running LLMs. Pull the official image:
sudo podman pull docker.io/ollama/ollama
Step 3: Run the Ollama Container
Launch Ollama as a background service with persistent storage:
sudo podman run -d \
--name ollama \
-p 11434:11434 \
-v ollama:/root/.ollama \
docker.io/ollama/ollama
Single-line version:
sudo podman run -d --name ollama -p 11434:11434 -v ollama:/root/.ollama docker.io/ollama/ollama
Verify the container is running:
sudo podman ps
Step 4: Download the Model Files
Create a directory for your models:
mkdir -p ~/models/qwen2.5-14b/
cd ~/models/qwen2.5-14b/
The Qwen2.5-14B model in GGUF format is split into multiple files. Download all parts:
wget https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GGUF/resolve/main/qwen2.5-14b-instruct-q4_k_m-00001-of-00003.gguf
wget https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GGUF/resolve/main/qwen2.5-14b-instruct-q4_k_m-00002-of-00003.gguf
wget https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GGUF/resolve/main/qwen2.5-14b-instruct-q4_k_m-00003-of-00003.gguf
Note: The q4_k_m quantization offers a good balance between model size and quality: 4-bit weights stored with llama.cpp’s “k-quant” block-wise scheme, medium variant.
Step 5: Build llama.cpp for Merging Split Files
Since the model files are split, we need llama.cpp’s merge utility. Clone and build it:
cd ~/models
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Build with CMake:
cmake -B build
cmake --build build --config Release -j $(nproc)
This will use all available CPU cores for faster compilation. The build process may take several minutes.
Verify the installation:
./build/bin/llama-gguf-split --version
You should see output similar to:
version: 7850 (f2571df8b)
built with GNU 13.3.0 for Linux x86_64
Step 6: Merge the Split Model Files
Use the llama-gguf-split tool to merge the parts. You only need to specify the first file:
cd ~/models
./llama.cpp/build/bin/llama-gguf-split --merge \
qwen2.5-14b/qwen2.5-14b-instruct-q4_k_m-00001-of-00003.gguf \
qwen2.5-14b/qwen2.5-14b-instruct-q4_k_m.gguf
Expected output:
gguf_merge: reading metadata done
gguf_merge: writing tensors done
gguf_merge: merged from 3 split with 579 tensors.
The merged file will be created at ~/models/qwen2.5-14b/qwen2.5-14b-instruct-q4_k_m.gguf.
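Before moving the file around, you can sanity-check that the merge produced a valid model file: every GGUF file begins with the 4-byte magic b"GGUF". A minimal Python check (the example path is the one used in this tutorial; adjust it to your setup):

```python
# Sanity-check a merged model: GGUF files begin with the 4-byte magic b"GGUF".
def is_gguf(path):
    """Return True if the file at `path` starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Example usage (adjust the path to wherever your merged file lives):
# print(is_gguf("/home/you/models/qwen2.5-14b/qwen2.5-14b-instruct-q4_k_m.gguf"))
```

This only verifies the header, not the tensor data, but it quickly catches a truncated download or a merge that wrote the wrong file.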
Step 7: Copy the Model into the Container
Transfer the merged model file to the Ollama container:
sudo podman cp ~/models/qwen2.5-14b/qwen2.5-14b-instruct-q4_k_m.gguf \
ollama:/root/.ollama/models/
Step 8: Create a Modelfile
The Modelfile defines how Ollama should load and configure your model. Create it on your host system:
vi ~/Modelfile
Add the following content:
FROM /root/.ollama/models/qwen2.5-14b-instruct-q4_k_m.gguf
TEMPLATE """{{ .System }}{{ .Prompt }}"""
PARAMETER temperature 0.7
Alternative: If you’ve already registered the model with Ollama (via ollama pull or ollama create), you can reference it by its registered name instead of a file path:
FROM qwen2.5-14b-instruct-q4_k_m
TEMPLATE """{{ .System }}{{ .Prompt }}"""
PARAMETER temperature 0.7
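The two-line template above simply concatenates the system and user text. Qwen2.5-Instruct models are trained on a ChatML-style conversation format, so a Modelfile that spells that format out usually gives better results. The sketch below is one way to do it; the parameter values and system prompt are illustrative, not required:

```
FROM /root/.ollama/models/qwen2.5-14b-instruct-q4_k_m.gguf
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER stop "<|im_end|>"
SYSTEM "You are a helpful assistant."
```

The stop parameter prevents the model from generating past its own end-of-turn marker, and num_ctx sets the context window size.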
Copy the Modelfile into the container:
sudo podman cp ~/Modelfile ollama:/root/Modelfile
Step 9: Register the Model with Ollama
Create a named model within Ollama:
sudo podman exec -it ollama ollama create qwen25 -f /root/Modelfile
This registers the model as “qwen25” (you can choose any name you prefer).
Step 10: Run and Test Your Model
Interactive Mode
Start a chat session with your model:
sudo podman exec -it ollama ollama run qwen25
Type your questions and press Enter. Use /bye to exit.
REST API
Query the model programmatically:
curl http://localhost:11434/api/generate -d '{
"model": "qwen25",
"prompt": "Explain quantum computing in simple terms."
}'
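The same endpoint can be called from Python. By default, /api/generate streams its answer as newline-delimited JSON objects, each carrying a "response" fragment and a "done" flag. The helper below stitches those fragments together; the live-request portion is commented out so the snippet stands alone (it assumes the container from Step 3 is running and the model name "qwen25" from Step 9):

```python
import json
from urllib import request  # stdlib; used only in the live call sketched below

def collect_response(ndjson_lines):
    """Join the "response" fragments from Ollama's streaming NDJSON output."""
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        parts.append(obj.get("response", ""))
        if obj.get("done"):
            break
    return "".join(parts)

# Live usage against the container started earlier:
# payload = json.dumps({"model": "qwen25", "prompt": "Say hi"}).encode()
# req = request.Request("http://localhost:11434/api/generate", data=payload,
#                       headers={"Content-Type": "application/json"})
# with request.urlopen(req) as resp:
#     print(collect_response(resp))
```

Passing "stream": false in the payload instead returns a single JSON object, which is simpler for short scripted queries.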
List Available Models
Check all models in your Ollama instance:
sudo podman exec -it ollama ollama list
Troubleshooting Tips
Container won’t start: Check whether port 11434 is already in use (ss ships with modern distributions; netstat requires the net-tools package):
sudo ss -tulpn | grep 11434
Out of memory errors: Running Qwen2.5-14B at q4_k_m needs roughly 10-12GB of RAM (the ~9GB model plus context buffers). Consider using a smaller model or adding swap space.
Model not found: Verify the model was copied correctly:
sudo podman exec -it ollama ls -lh /root/.ollama/models/
Slow responses: GGUF models run on CPU by default. For better performance, consider:
- Using GPU acceleration (requires additional setup)
- Selecting a smaller quantization (e.g., q4_0 instead of q4_k_m)
- Using a smaller model (e.g., 7B instead of 14B)
Understanding Quantization
The q4_k_m in the filename breaks down as:
- q4: 4-bit quantization (cuts the file to roughly a quarter of the 16-bit size)
- k: llama.cpp’s “k-quant” scheme, which stores weights in blocks with per-block scaling factors (better quality than the older round-to-nearest formats)
- m: medium variant (balances size and accuracy)
Other common quantization options include q4_0 (smallest), q5_k_m (better quality), and q8_0 (highest quality, largest size).
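As a rough rule of thumb, a GGUF file’s size is the parameter count times the bits per weight, plus a few percent of overhead for embeddings and metadata. The bits-per-weight figures below are approximate effective values, not exact, but they explain the size differences between the formats listed above:

```python
# Back-of-the-envelope GGUF size estimate: params * bits-per-weight / 8 bytes.
# The bits-per-weight values are approximate effective figures, not exact.
BITS_PER_WEIGHT = {"q4_0": 4.5, "q4_k_m": 4.8, "q5_k_m": 5.5, "q8_0": 8.5}

def estimate_gib(params_billion, quant):
    """Approximate model file size in GiB for a given quantization."""
    bytes_total = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return bytes_total / 2**30

for quant in BITS_PER_WEIGHT:
    print(f"14B @ {quant}: ~{estimate_gib(14, quant):.1f} GiB")
```

Actual files run slightly larger than these estimates because some layers (such as embeddings) are kept at higher precision.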
Conclusion
You now have a fully functional local LLM setup running in a containerized environment. This approach gives you the flexibility to experiment with different models, maintain privacy, and avoid API costs. The Podman + Ollama combination provides a robust, repeatable way to run AI models on your own infrastructure.
Next Steps
- Experiment with different models from Hugging Face
- Integrate the REST API with your applications
- Set up GPU acceleration for faster inference
- Create custom Modelfiles with specific prompting templates
- Explore Ollama’s embedding and completion APIs
Related Resources: