Deploy your Local LLM Web App in Kubernetes using Ollama and OpenWebUI
- Phase I
- Phase II
Get up and running with large language models. ollama.com
Download and Install
- Windows (via WSL)
wsl.exe --install --no-distribution
wsl --set-version Ubuntu-18.04 2
- Linux
curl -fsSL https://ollama.com/install.sh | sh
https://ollama.com/search?q=&p=1
- Models:
llama3.1
codellama => debug code
mistral
llava => multimodal / images (vision transformer)
Can my machine run an LLM? Go to this HF portal and check: can-it-run-llm
ollama -h
Available Commands:
serve Start ollama
create Create a model from a Modelfile
show Show information for a model
run Run a model
pull Pull a model from a registry
push Push a model to a registry
list List models
cp Copy a model
rm Remove a model
help Help about any command
- Ollama Commands
- Start Ollama Server
ollama serve
- Run Ollama Model
ollama run llama2-uncensored
- Download Ollama Model
ollama pull llama2-uncensored
- List Installed Ollama Models
ollama list
- Delete Installed Ollama Models
ollama rm llama2-uncensored
New: with Ollama you can now run private GGUFs from the Hugging Face Hub.
ollama run hf.co/{username}/{repository}:{quantization}
- Examples with the GGUF models Qwen2-VL (Q4_K_M) and DeepSeek-R1:
ollama pull hf.co/bartowski/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
ollama pull hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q4_K_M
Note:
By default, the Q4_K_M quantization scheme is used, when it’s present inside the model repo. If not, we default to picking one reasonable quant type present inside the repo.
Custom Chat Template and Parameters
- template: create a template file inside the repo. It must be a Go template.
- system prompt: create a system file inside the repo.
- params: create a parameters file inside the repo. The file must be in JSON format. For the list of all available parameters, please refer to the documentation.
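For example, a minimal parameters file could look like the following (the keys shown are standard Ollama options; the values here are only illustrative):
{
  "temperature": 0.7,
  "num_ctx": 4096,
  "stop": ["<|eot_id|>"]
}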
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Why is the sky blue?"
}'
A stream of JSON objects is returned:
{
"model": "llama3",
"created_at": "2023-08-04T08:52:19.385406455-07:00",
"response": "The",
"done": false
}
The final response in the stream also includes additional data about the generation:
- total_duration: time spent generating the response
- load_duration: time spent in nanoseconds loading the model
- prompt_eval_count: number of tokens in the prompt
- prompt_eval_duration: time spent in nanoseconds evaluating the prompt
- eval_count: number of tokens in the response
- eval_duration: time in nanoseconds spent generating the response
- context: an encoding of the conversation used in this response; this can be sent in the next request to keep a conversational memory
- response: empty if the response was streamed; if not streamed, this will contain the full response
To calculate how fast the response is generated in tokens per second (token/s), divide eval_count / eval_duration * 10^9.
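As a small sketch, the following Python snippet streams a generation from the local /api/generate endpoint and computes tokens per second from the final chunk (it assumes the Ollama server is running on localhost:11434 and that a model named llama3 is available):
import json
import requests

# Stream a generation from the local Ollama server
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?"},
    stream=True,
)

for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    if not chunk["done"]:
        print(chunk["response"], end="", flush=True)
    else:
        # The final chunk carries the generation statistics
        tokens_per_second = chunk["eval_count"] / chunk["eval_duration"] * 1e9
        print(f"\n{tokens_per_second:.1f} token/s")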
HuggingFace meets Ollama
- HF hosts the raw weights, while an Ollama model bundles the weights together with the system prompt, template, and parameters.
- Where are models stored?
- macOS :
~/.ollama/models.
- Linux :
/usr/share/ollama/.ollama/models.
- Windows :
C:\Users\%username%\.ollama\models.
- VRAM prerequisite:
Remember that you will need a GPU with sufficient memory (VRAM) to run models with Ollama. You can check out a calculator Hugging Face created called "Model Memory Calculator" (model_size_estimator), and there is an article that walks you through the exact mathematical calculation, "Calculating GPU memory for serving LLMs".
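As a rough sketch of that math (a commonly used rule of thumb is parameter count × bytes per parameter × ~1.2 overhead; real requirements also depend on context length and the runtime):
# Back-of-the-envelope VRAM estimate: params * bytes per param * ~20% overhead
def estimate_vram_gb(params_billion, bits_per_param=16, overhead=1.2):
    bytes_per_param = bits_per_param / 8
    return params_billion * bytes_per_param * overhead

print(estimate_vram_gb(7))                      # ~16.8 GB for a 7B model in FP16
print(estimate_vram_gb(7, bits_per_param=4))    # ~4.2 GB for a 4-bit quantized 7B model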
- CPU-compatible models:
llama.cpp can run using the CPU only:
- llama.cpp: the source project for GGUF. Offers a CLI and a server option.
How:
- Applies a custom quantization approach to compress the models into the GGUF format. This reduces their size and resource needs.
Where:
- Llama 2: thanks to TheBloke, there are already pre-made models which can be used directly with the mentioned framework.
- Example: Llama-2-7B-Chat-GGUF; smallest => llama-2-7b-chat.Q2_K.gguf, the most compressed version of the 7B chat model, which requires the least resources.
- Llama 3.1: GGUF models are also available thanks to accounts like bartowski, with his Meta-Llama-3.1-8B-Instruct-GGUF model.
- Qwen2-VL: quantized models are available in Qwen2-VL-7B-Instruct-GGUF.
- DeepSeek-R1: quantized unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF.
Text to Image
- 1111 DirectML list, see [link](https://medium.com/ai-bytes/top-6-open-source-text-to-image-generation-models-in-2024-ee5a2fc39046)
- DeepFloyd IF
- Stable Diffusion v1-5
- [OpenJourney](/~https://github.com/paddleboard-ai/useful-notebooks/blob/main/LocalJourney_CPU.ipynb)
- DreamShaper
- Dreamlike Photoreal
- Waifu Diffusion
- Guidelines on designing the chatbot's functionality, including understanding user inputs, generating responses, and handling different types of conversations.
Test using Docker
- Install Python, then the llama-cpp-python package:
pip install llama-cpp-python
# CPU-only wheels:
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
# Check the latest available version:
pip-versions latest llama-cpp-python
- Create a Python script llama_cpu.py to query the API
from llama_cpp import Llama
# Put the path to the GGUF model that you've downloaded from HuggingFace here
model_path = "**path to your llama-2-7b-chat.Q2_K.gguf**"
# Create a llama model
model = Llama(model_path=model_path)
# Prompt creation
system_message = "You are a helpful assistant"
user_message = "Generate a list of 5 funny dog names"
prompt = f"""<s>[INST] <<SYS>>
{system_message}
<</SYS>>
{user_message} [/INST]"""
# Model parameters
max_tokens = 100
# Run the model
output = model(prompt, max_tokens=max_tokens, echo=True)
# Print the model output
print(output)
- Run the script: python llama_cpu.py
- Download the model using the HF CLI: QuantFactory/Meta-Llama-3-8B-Instruct-GGUF: Meta-Llama-3-8B-Instruct.Q3_K_S.gguf (3.67 GB)
--- install cli
pip install huggingface-hub
pip install --upgrade huggingface_hub
--- download model
huggingface-cli download \
QuantFactory/Meta-Llama-3-8B-Instruct-GGUF Meta-Llama-3-8B-Instruct.Q3_K_S.gguf \
--local-dir ai_models --local-dir-use-symlinks False
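Alternatively, the same file can be fetched from a Python script with the huggingface_hub API (a small sketch; the ai_models directory matches the Modelfile below):
from huggingface_hub import hf_hub_download

# Download a single GGUF file from the Hub into ./ai_models
hf_hub_download(
    repo_id="QuantFactory/Meta-Llama-3-8B-Instruct-GGUF",
    filename="Meta-Llama-3-8B-Instruct.Q3_K_S.gguf",
    local_dir="ai_models",
)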
- Create a llama<*>.loc Modelfile, here llama3.loc
# set the base model
FROM ./ai_models/Meta-Llama-3-8B-Instruct.Q3_K_S.gguf
# Set custom parameter values
PARAMETER temperature 1
PARAMETER stop <|start_header_id|>
PARAMETER stop <|end_header_id|>
PARAMETER stop <|eot_id|>
PARAMETER stop <|reserved_special_token
# Set the model template
TEMPLATE """
{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>
"""
# Set the system message
SYSTEM You are a helpful AI assistant named Llama3 Droid
- We then build an Ollama model using the following command:
ollama create llama3 -f llama3.loc
---
transferring model data
using existing layer sha256:774ba422eeac30b2390e72960694b35eba746acd82785b2d644c92716ed479bb
creating new layer sha256:549c786ef375489a2379ecfe1d244fde24c5ab78b9398a9343375556fffc6a14
creating new layer sha256:0d6f8e890c228c44821390929a20714ca1990ac586ca2abf9e78ac907d33d173
creating new layer sha256:319eccebdb61c018484f30aa1f24c381dff9843888b02daef9d1795a4f27cbad
creating new layer sha256:b2d73007e6a1f6485fb56a0e73a6ede58416132abff6de29c00791425ebb71c8
writing manifest
success
> ollama list
NAME ID SIZE MODIFIED
llama3:latest 37773b35f97c 3.7 GB 7 seconds ago
- And now let's see if we can get the model to tell us all about the famous King of Pop:
ollama run llama3 "Who is Michael Jackson?"
--- Use Ctrl + d or /bye to exit.
llama31.loc
- download
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF Meta-Llama-3.1-8B-Instruct-Q4_K_S.gguf \
--local-dir ai_models \
--local-dir-use-symlinks False
- modelfile
# set the base model
FROM ./ai_models/Meta-Llama-3.1-8B-Instruct-Q4_K_S.gguf
# Set custom parameter values
PARAMETER temperature 1.0
PARAMETER stop <|start_header_id|>
PARAMETER stop <|end_header_id|>
PARAMETER stop <|eot_id|>
# Define the model template
TEMPLATE """
{{- if or .System .Tools }}<|start_header_id|>system<|end_header_id|>
{{- if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}
Cutting Knowledge Date: December 2023
When you receive a tool call response, use the output to format an answer to the original user question.
You are a helpful assistant with tool calling capabilities.
{{- end }}
{{- end }}<|eot_id|>
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- if eq .Role "user" }}<|start_header_id|>user<|end_header_id|>
{{- if and $.Tools $last }}
Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers
the given prompt.
Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}. Do not use variables.
{{ range $.Tools }}
{{- . }}
{{ end }}
{{- end }}
{{ .Content }}<|eot_id|>{{ if $last }}<|start_header_id|>assistant<|end_header_id|>
{{ end }}
{{- else if eq .Role "assistant" }}<|start_header_id|>assistant<|end_header_id|>
{{- if .ToolCalls }}
{{- range .ToolCalls }}{"name": "{{ .Function.Name }}", "parameters": {{ .Function.Arguments }}}{{ end }}
{{- else }}
{{ .Content }}{{ if not $last }}<|eot_id|>{{ end }}
{{- end }}
{{- else if eq .Role "tool" }}<|start_header_id|>ipython<|end_header_id|>
{{ .Content }}<|eot_id|>{{ if $last }}<|start_header_id|>assistant<|end_header_id|>
{{ end }}
{{- end }}
{{- end }}
"""
# Set the system message
SYSTEM You are a helpful AI assistant named e-llmo Assistant
- Load the model in Ollama
ollama create llama3.1 -f llama31.loc
Streamlit is a popular Python library that makes it easy to create and deploy web applications for machine learning and data science projects. It's widely used for its simplicity in turning Python scripts into interactive web apps.
install
pip install llama-cpp-python streamlit
Detailed steps on how to integrate the chatbot model or framework (Ollama) into a Streamlit web application, including handling user inputs from the web interface and displaying responses (a minimal sketch of ollama_chatbot.py follows the feature list below).
Description:
- ollama_chatbot.py uses Streamlit to create a web-based chat interface with the Ollama language models.
- config.py sets up the configuration for the app, including available models and page title.
- requirements.txt specifies the Python dependencies needed.
- helpers/llm_helper.py defines the logic for interacting with the Ollama models and parsing the streamed responses.
Chatbot Features:
- a. Ability to select different Ollama models to be used by the chatbot
- b. Streaming output when responding to users like ChatGPT
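A minimal sketch of what ollama_chatbot.py could look like (assuming the Ollama server is reachable at localhost:11434; the model list and the details of the repo's helpers/llm_helper.py are placeholders here):
import json
import requests
import streamlit as st

OLLAMA_URL = "http://localhost:11434"  # assumed default Ollama address

st.title("Ollama Chatbot")

# Feature (a): let the user pick one of the locally installed models (illustrative list)
model = st.sidebar.selectbox("Model", ["llama3", "llama3.1", "mistral"])

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

def stream_chat(model, messages):
    # Stream NDJSON chunks from Ollama's /api/chat endpoint
    with requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={"model": model, "messages": messages, "stream": True},
        stream=True,
    ) as resp:
        for line in resp.iter_lines():
            if line:
                chunk = json.loads(line)
                yield chunk.get("message", {}).get("content", "")

# Feature (b): stream the answer token by token, ChatGPT-style
if prompt := st.chat_input("Ask something"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)
    with st.chat_message("assistant"):
        answer = st.write_stream(stream_chat(model, st.session_state.messages))
    st.session_state.messages.append({"role": "assistant", "content": answer})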
Step 1: Prepare the Dockerfile
Create a Dockerfile in the root directory of your application with the following content:
# Use an official Python runtime as a parent image
FROM python:3.8
# Set the working directory in the container to /app
WORKDIR /app
# Copy the requirements.txt file and install Python dependencies
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt
# Copy the rest of the application files and folders into the container at /app
COPY . /app
# Expose the port Streamlit will run on
EXPOSE 8501
# If ollama runs on a different port, expose it as well
# EXPOSE <ollama_port>
EXPOSE 11434
# Start the Ollama model server and the Streamlit app
#CMD ["streamlit", "run", "ollama_chatbot.py"]
CMD ollama serve & streamlit run ollama_chatbot.py
Note: You might want to consider using a process manager like supervisord to handle multiple processes in a container more robustly.
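For reference, a hypothetical supervisord.conf for this container could look like the following (program names and options are assumptions, not taken from the repo):
[supervisord]
nodaemon=true

[program:ollama]
command=ollama serve
autorestart=true

[program:streamlit]
command=streamlit run ollama_chatbot.py --server.port 8501 --server.address 0.0.0.0
autorestart=true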
Step 2: Build the Docker Image
- Run the following command in the same directory as your Dockerfile to build your Docker image:
docker build -t streamlit-ollama-chatbot .
Step 3: Run the Docker Container:
- Once the image is built, you can run your container:
docker run -p 8501:8501 -p 11434:11434 streamlit-ollama-chatbot
- This command starts the streamlit-ollama-chatbot container, maps local ports 8501 and 11434 to the same ports in the container, and runs your web app.
This guide walks you through the process of containerizing the Streamlit-Ollama-Chatbot, pushing the Docker image to Oracle Container Registry, and deploying it to a Kubernetes cluster.
Step 1: Create Docker Image with the Model preloaded
There are 2 options (a sketch of the second follows this list):
- Either include the model in your Docker image.
- Or add the model download task to the Dockerfile (during the build process).
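As a sketch of the second option, one common pattern is to start the Ollama server temporarily inside a RUN step and pull the model so it gets baked into an image layer. This assumes the Ollama binary is installed in the image and uses llama3 only as an example; the sleep is a crude wait for the server to come up:
# Install Ollama in the image (assumed; adjust to your base image)
RUN curl -fsSL https://ollama.com/install.sh | sh
# Start the server in the background just long enough to pull the model
RUN ollama serve & sleep 5 && ollama pull llama3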
Step 2: Build the Docker Image
Execute the following command in your project directory to build your Docker image:
docker build -t container-registry.oracle.com/your-username/streamlit-ollama-chatbot:latest .
Step 3: Log in to Oracle Container Registry
- Use the Docker CLI to log in to the Oracle Container Registry; enter your Oracle Cloud credentials when prompted.
docker login container-registry.oracle.com
Step 4: Push the Docker Image
- After logging in, push your Docker image to the Oracle Container Registry. Ensure the image name/tag matches your OCR repository's.
docker push container-registry.oracle.com/your-username/streamlit-ollama-chatbot:latest
Step 5: Deploy to OCI Kubernetes engine
Create a Kubernetes Deployment
Write a Kubernetes manifest that references the app image in the OCR.
apiVersion: apps/v1
kind: Deployment
metadata:
name: streamlit-ollama-chatbot-deployment
spec:
replicas: 1
selector:
matchLabels:
app: streamlit-ollama-chatbot
template:
metadata:
labels:
app: streamlit-ollama-chatbot
spec:
containers:
- name: streamlit-ollama-chatbot
image: container-registry.oracle.com/your-username/streamlit-ollama-chatbot:latest
ports:
- containerPort: 8501 # Streamlit's port
- containerPort: 11434 # Ollama's port, for internal communication only
The Streamlit app can communicate with the Ollama API over localhost:11434 within the pod, but this port won't be accessible from outside the pod.
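A hypothetical config.py reflecting this setup (the names and values are illustrative):
# Base URL the Streamlit app uses to reach the Ollama API inside the pod
OLLAMA_API_BASE = "http://localhost:11434"

# Models offered in the UI and the page title
AVAILABLE_MODELS = ["llama3", "llama3.1"]
PAGE_TITLE = "Streamlit Ollama Chatbot"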
Create a Kubernetes Service
- Expose your application to the internet by creating a service of type LoadBalancer.
- If external access to the Ollama API is not required, you don't need to expose the Ollama port (11434) through the Kubernetes service.
apiVersion: v1
kind: Service
metadata:
name: streamlit-ollama-chatbot-service
spec:
selector:
app: streamlit-ollama-chatbot
ports:
- protocol: TCP
port: 80 # This is the port you'll access externally, like through a LoadBalancer
targetPort: 8501 # This is the internal port on which Streamlit is running
type: LoadBalancer
Apply Your Configuration
Apply the configuration to your Kubernetes cluster:
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
Step 6: Access Your Application
Retrieve the external IP address assigned to your service by Oracle's cloud infrastructure:
kubectl get service streamlit-ollama-chatbot-service
Access your Streamlit app in the browser via http://[external-ip].
Oracle Cloud Infrastructure Specifics
Ensure your oci-cli is configured correctly to manage your container registry and Kubernetes cluster. Set the proper permissions and policies within OCI to allow your Kubernetes cluster to pull images from your container registry. Secure sensitive data using Kubernetes secrets or a secure mechanism; never include it in your Dockerfile or image.
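For example, a credential such as a Hugging Face token can be created as a Kubernetes secret (the secret and key names below are placeholders):
kubectl create secret generic chatbot-secrets --from-literal=HF_TOKEN=<your-token>
and then injected into the container as an environment variable in the deployment spec:
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: chatbot-secrets
        key: HF_TOKEN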