
Ollama and Open WebUI: Local LLM Deployment with Docker

Self-hosted architecture for generative AI: configuration, optimization, and use cases

By Angelo Lima

Local inference architecture: challenges and technical solutions

Deploying language models in production raises critical challenges: recurring API costs, dependency on cloud services, and data confidentiality. These constraints motivate robust, autonomous architectural alternatives.

Ollama, an open-source tool for running local language models, combined with Open WebUI, an extensible web interface for generative AI¹, constitutes a mature technical solution for high-performance self-hosted AI inference.


Ollama technical architecture: optimization and performance

Optimized inference engine

Ollama integrates specific optimizations for modern hardware architectures² to maximize inference performance:

  • Dynamic quantization: Memory footprint reduction without significant quality loss
  • GPU parallelization: Optimal exploitation of CUDA and ROCm architectures
  • Adaptive memory management: Dynamic allocation according to system constraints
  • Smart caching: Activation caching to reduce latency
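
In practice, quantization is selected through model tags rather than separate flags. A minimal sketch, assuming the q4_0 / q8_0 tag variants published in the Ollama model library (exact tag names vary per model):

# 4-bit variant: roughly 4-5 GB on disk, lowest memory footprint
ollama pull llama3:8b-instruct-q4_0

# 8-bit variant: larger footprint, closer to full-precision quality
ollama pull llama3:8b-instruct-q8_0

ollama run llama3:8b-instruct-q4_0 "Summarize the trade-offs of 4-bit quantization."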

Supported model ecosystem

The Ollama architecture supports a wide range of reference models:

  • Meta Llama 3: High-performance conversational models (8B, 70B parameters)
  • Mistral 7B: Architecture optimized for computational efficiency
  • CodeLlama: Specialized models for code generation and analysis
  • Gemma: Google models optimized for local deployments

Open WebUI: User interface and advanced features

Modular architecture

Open WebUI constitutes an extensible self-hosted AI platform, designed to work entirely offline³. Its capabilities include:

  • Multi-backend support: Ollama integration, OpenAI-compatible APIs
  • Integrated RAG engine: Retrieval-Augmented Generation for document querying
  • Multi-user management: Authentication and session isolation
  • RESTful APIs: Programmatic integration into existing workflows
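
The RESTful API mentioned above can be exercised directly. A minimal sketch, assuming an API key generated in the Open WebUI user settings and the interface exposed on port 3001 as in the Compose configuration further below:

# The /api/chat/completions route follows the OpenAI chat format;
# the model name must match a model available in Ollama.
curl http://localhost:3001/api/chat/completions \
  -H "Authorization: Bearer $OPEN_WEBUI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3:8b",
        "messages": [{"role": "user", "content": "Hello from the API"}]
      }'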

Enterprise features

The platform integrates advanced capabilities for professional deployments:

  • Conversation persistence: Complete history with semantic search
  • Custom models: Fine-tuning and deployment of specific models
  • Performance metrics: Response time monitoring and resource usage
  • Export/Import: Configuration backup and migration

Containerized deployment with Docker Compose

Reference configuration

Docker Compose deployment optimizes dependency management and scalability⁴:

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama            # model store and cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia            # requires the NVIDIA Container Toolkit on the host
              count: 2
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3001:8080"                     # web interface exposed on host port 3001
    volumes:
      - open-webui:/app/backend/data    # conversations, users, settings
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

# Named volumes referenced by the services must be declared at the top level
volumes:
  ollama:
  open-webui:
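
Once this file is saved as docker-compose.yml, the whole stack starts with a single command; the web interface is then reachable on port 3001 and the Ollama API on port 11434:

# Start both services in the background and follow Ollama's logs
docker compose up -d
docker compose logs -f ollama

# Open WebUI: http://localhost:3001
# Ollama API: http://localhost:11434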

Performance optimizations

GPU configuration requires specific adjustments to maximize performance⁵:

  • GPU allocation: Explicit CUDA resource reservation
  • Persistent volumes: Optimized storage for models (up to 7GB per model)
  • Network configuration: Service isolation with inter-container communication
  • Environment variables: Timeout and memory limit configuration
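
Ollama exposes several environment variables for this kind of tuning; they can be set in the Compose environment section or passed directly to docker run. A minimal sketch with illustrative values (size them against available VRAM and expected load):

# Variable names come from Ollama's documentation; values below are examples only.
# OLLAMA_KEEP_ALIVE        - how long a model stays loaded after the last request
# OLLAMA_NUM_PARALLEL      - parallel requests served per loaded model
# OLLAMA_MAX_LOADED_MODELS - number of models kept in memory simultaneously
docker run -d --gpus=all \
  -e OLLAMA_KEEP_ALIVE=30m \
  -e OLLAMA_NUM_PARALLEL=2 \
  -e OLLAMA_MAX_LOADED_MODELS=2 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama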

Model management and optimizations

Download and caching

Models require initial download via the interface or command line⁶:

# Model download via CLI
ollama pull llama3:8b
ollama pull mistral:7b  
ollama pull codellama:13b

# Check installed models
ollama list

Deployment strategies

Multi-model configuration: The architecture allows simultaneous execution of multiple models according to hardware constraints⁷.

Memory optimization: Model size should be chosen according to available RAM (8GB is a practical minimum for quantized 7B models; 16GB or more is comfortable for 13B-class models).
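
Simultaneous execution is easy to verify from the CLI: ollama ps lists the models currently resident in memory, their size, and whether they run on GPU or CPU. A minimal check, assuming the models pulled above and enough memory for both to stay loaded:

# Load two models, then list what is resident in memory
ollama run llama3:8b "ping"
ollama run mistral:7b "ping"
ollama ps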


Use cases and enterprise integration

Practical applications

Development assistant: Code generation and analysis with CodeLlama, integration into CI/CD workflows for automatic documentation.
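
As an illustration, a CI job can pipe a diff through a local model. A minimal sketch, assuming ollama run accepts a prompt argument alongside piped standard input and that codellama:13b has already been pulled:

# Hypothetical CI step: ask a local model to comment on the current diff
git diff origin/main...HEAD | \
  ollama run codellama:13b "Review this diff and flag potential bugs or missing documentation:"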

Document analysis: RAG (Retrieval-Augmented Generation) for querying technical document bases⁸, enabling precise contextual information extraction.

Conversational prototyping: Test interface for chatbot applications before production deployment.

OpenAI integration

API compatibility enables cloud/local hybridization⁹:

  • Local models for sensitive or repetitive tasks
  • Cloud APIs for use cases requiring the latest models
  • Automatic switching based on load and costs
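
Since Ollama exposes an OpenAI-compatible endpoint, switching between local and cloud backends is largely a matter of base URL and API key. A minimal sketch against the local endpoint (the model name must match a pulled model):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3:8b",
        "messages": [
          {"role": "user", "content": "Classify this ticket: the login page times out after 30 seconds."}
        ]
      }'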

Performance metrics and sizing

System requirements

Ollama + Open WebUI deployments require appropriate sizing¹⁰:

  • Minimum RAM: 16GB (32GB recommended for 13B+ models)
  • Recommended GPU: NVIDIA GTX 1060 (6GB VRAM) minimum, RTX 4090 optimal
  • Storage: 50-100GB depending on number of deployed models
  • Docker images: Ollama (4.76GB), Open WebUI (3.77GB)

Performance optimizations

  • Startup time: 30-60 seconds depending on model size
  • Inference latency: 50-200ms per token depending on GPU architecture
  • Concurrent throughput: Up to 10 simultaneous users on RTX 4090 hardware
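
These figures can be reproduced locally: ollama run with the --verbose flag prints token counts and evaluation rates after each response, which gives the per-token latency directly.

# Print timing statistics (load time, prompt eval rate, eval rate in tokens/s)
ollama run llama3:8b --verbose "Explain retrieval-augmented generation in one paragraph."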

Security and governance

Data isolation

The self-hosted architecture guarantees:

  • Absolute confidentiality: No data transmission to third-party services
  • Access control: Local authentication and permission management
  • Audit trails: Complete traceability of interactions and queries
  • Backup: Full export capability for business continuity
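
In practice, backup reduces to archiving the two named volumes from the Compose file. A minimal sketch using a disposable container (actual volume names may carry the Compose project prefix; check docker volume ls):

# Archive Open WebUI data (conversations, users, settings) to the current directory
docker run --rm -v open-webui:/data -v "$PWD":/backup alpine \
  tar czf /backup/open-webui-backup.tar.gz -C /data .

# The ollama volume can be archived the same way; alternatively, models can
# simply be re-pulled after a restore since only model files live there.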

Conclusion: Autonomy and performance for generative AI

The Ollama + Open WebUI combination establishes a new standard for self-hosted AI deployments. This architecture solves critical challenges of cost, confidentiality, and cloud dependency while maintaining performance comparable to SaaS solutions.

The containerized approach facilitates deployment and maintenance, while platform extensibility ensures scalability according to business needs. This solution positions organizations for autonomous and controlled AI adoption.


Sources

  1. Open WebUI: User-friendly AI Interface - GitHub
  2. Run LLMs locally or in Docker with Ollama & Ollama-WebUI - Medium
  3. How to setup Open WebUI with Ollama and Docker Desktop - DEV Community
  4. How to use Ollama with Open WebUI with Docker and Docker Compose - Geshan Blog
  5. How to locally deploy ollama and Open-WebUI with Docker Compose - Medium
  6. Running LLM locally with Ollama and Open WebUI - My Playground
  7. Setting Up Ollama with Open-WebUI: A Docker Compose Guide - Archy.net
  8. Running LLMs Locally Using Ollama and Open WebUI on Linux - Linux TLDR
  9. Running Ollama and Open WebUI with Docker Compose - A-Chacon
  10. A local AI setup using Ollama and Open WebUI - Emcken.dk
Tags: AI Development