Are there any quick/ready deploy self hosted containers + images ?

matt spring 0 Reputation points
2026-01-03T09:18:42.9833333+00:00

My goal is to create a self-hosted LLM. It's important that images are contained and that virtual machines are used for separation. I am open to various model and container types, as long as they come in a near-complete, ready-to-deploy configuration for my Windows machine with 32 GB RAM. Advice on locating a setup that will deploy with less effort than a complete build would be appreciated. Thanks!

Azure Container Instances

2 answers

Sort by: Most helpful
  1. Jilakara Hemalatha 11,600 Reputation points Microsoft External Staff Moderator
    2026-01-05T02:50:53.85+00:00

    Hi Matt,

    Thanks for your question. Based on your requirements, you are looking for self-hosted LLMs that are ready to deploy, run in containers with VM isolation, and work on a Windows system with ~32 GB RAM — ideally without building everything from scratch.

    Here are some community-proven options that fit your scenario:

    1. Ollama provides an official Docker image that lets you run local models such as Mistral, Llama 3, and Gemma with almost no configuration. You can deploy it in minutes inside a Linux VM (Hyper‑V / VMware) for full isolation.

    Reference: https://docs.ollama.com/docker

    2. Open WebUI: if you want a web UI, Open WebUI has a turnkey Docker image and pairs perfectly with Ollama.

    https://docs.openwebui.com/getting-started/quick-start/

    3. vLLM (high‑performance, OpenAI‑compatible inference server): if you want an API server similar to OpenAI's but running locally, vLLM's official Docker image provides optimized inference with strong performance.

    https://docs.vllm.ai/en/latest/deployment/docker/

    4. text-generation-webui: if you want a more feature‑rich environment with support for many loaders (Transformers, GGUF, ExLlama, etc.) and GPU options, this project provides pre-built Docker images for NVIDIA, AMD, Intel, and CPU‑only systems.

    https://github.com/Atinoda/text-generation-webui-docker

    For a walkthrough of self-hosting LLMs with Docker inside a VM, see: https://www.virtualizationhowto.com/2025/05/self-hosting-llms-with-docker-and-proxmox-how-to-run-your-own-gpt/
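    For reference, the Ollama + Open WebUI pairing from options 1 and 2 above can be sketched as a single docker-compose.yml, so one `docker compose up -d` brings up both services (service and volume names here are illustrative, not from any official template):

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama   # persist downloaded models across restarts
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434   # compose DNS resolves the service name
    depends_on:
      - ollama
volumes:
  ollama: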

    Reference Links:

    Tutorial: Build and deploy container images in the cloud with Azure Container Registry Tasks

    Deploy and run containers on Azure Container Instance
    Hope this helps! Please let me know if you have any queries.


  2. Marcin Policht 85,255 Reputation points MVP Volunteer Moderator
    2026-01-03T12:17:20.5833333+00:00

    Yep - there are near-turnkey options that fit your constraints: self-hosted LLMs, containerized images, and VM-level isolation, with minimal build effort on a Windows host with 32 GB RAM. Probably the easiest path is to treat the LLM stack as an appliance rather than a framework.

    For the fastest deployment with strong isolation, run a Linux VM (Hyper-V, VMware Workstation, or VirtualBox) and deploy prebuilt Docker images inside it. This gives you VM separation from Windows while still benefiting from container images that are already wired together. WSL2 works but does not give you the same hard isolation boundary as a VM, so if isolation matters, prefer a full VM.
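    As a sketch, assuming an Ubuntu (or similar Debian-based) guest, Docker itself can be installed inside the VM with Docker's upstream convenience script before deploying any of the images below:

    ```shell
    # install Docker Engine inside the Linux VM (Docker's convenience script)
    curl -fsSL https://get.docker.com | sh
    # allow the current user to run docker without sudo (takes effect on next login)
    sudo usermod -aG docker "$USER"
    ```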

    Ollama is a low-friction way to self-host an LLM. It ships as a single binary and also has an official Docker image. Inside a Linux VM you can deploy it in minutes, pull a model, and serve an API immediately. Models are already quantized and tuned for local inference, which matters on a 32 GB system.

    Example inside a Linux VM with Docker installed:

    docker run -d \
      --name ollama \
      -p 11434:11434 \
      -v ollama:/root/.ollama \
      ollama/ollama
    

    Then from inside the VM:

    docker exec -it ollama ollama pull mistral
    docker exec -it ollama ollama run mistral
    

    This gives you a local LLM with an HTTP API at port 11434, minimal configuration, and clean separation via VM + container. Swapping models is just a pull command.
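    Once the container is up, you can verify the API from inside the VM with a plain HTTP call (the model name assumes you pulled mistral as above):

    ```shell
    # non-streaming generation request against the Ollama HTTP API
    curl http://localhost:11434/api/generate \
      -d '{"model": "mistral", "prompt": "Say hello in one sentence.", "stream": false}'
    ```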

    If you want a more “LLM server appliance” feel with a web UI, Open WebUI pairs well with Ollama. It can run as a second container and talk to Ollama over Docker networking.

    docker network create llm-net
    docker network connect llm-net ollama

    docker run -d \
      --name open-webui \
      -p 3000:8080 \
      -e OLLAMA_BASE_URL=http://ollama:11434 \
      --network llm-net \
      ghcr.io/open-webui/open-webui:main
    

    (A user-defined network is used here because Docker does not allow publishing ports with `-p` when joining another container's network namespace via `--network container:...`.)

    That yields a ready-to-use ChatGPT-like interface without writing any glue code.

    If you want something closer to “enterprise inference server” with explicit model loading and performance tuning, you can try vLLM. It has official Docker images and supports OpenAI-compatible APIs. This requires more RAM awareness and model selection but still avoids a full build.

    Example:

    docker run -d \
      --name vllm \
      --ipc=host \
      -p 8000:8000 \
      vllm/vllm-openai \
      --model mistralai/Mistral-7B-Instruct-v0.2 \
      --dtype float16
    

    (`--ipc=host` is recommended in the vLLM Docker documentation, since PyTorch uses shared memory between processes.)

    On a 32 GB machine you will generally want 7B models or smaller, or quantized variants. vLLM is best if you care about throughput and API compatibility, less so if you want a zero-effort deployment.
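    Since the server speaks the OpenAI API, you can test it with the same request shape you would send to OpenAI (model name matching the one passed at startup):

    ```shell
    # query the OpenAI-compatible chat endpoint exposed by vLLM
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "mistralai/Mistral-7B-Instruct-v0.2",
            "messages": [{"role": "user", "content": "Hello"}]
          }'
    ```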

    If GPU is not involved and you want maximum simplicity, llama.cpp-based containers are fairly stable and predictable. There are community images that already expose a REST API and include GGUF models. These tend to be slower but are deterministic and easy to isolate.

    Example:

    docker run -d \
      -p 8080:8080 \
      -v ./models:/models \
      ghcr.io/ggerganov/llama.cpp:server \
      -m /models/mistral-7b-instruct.Q4_K_M.gguf
    

    You mount a model directory and the server is live. No Python, no CUDA, no extra services.
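    The llama.cpp server exposes a small HTTP API of its own, so a quick smoke test from inside the VM looks like this:

    ```shell
    # basic health check against the llama.cpp server
    curl http://localhost:8080/health
    # then a short completion request
    curl http://localhost:8080/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Hello", "n_predict": 32}'
    ```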

    For finding ready-to-deploy setups, check GitHub Container Registry and Docker Hub. Look for repositories that include a docker-compose.yml and describe themselves as an "inference server" or "OpenAI-compatible API". If a project requires conda, source builds, or multi-stage scripts, it is not what you want.


    If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.

    hth

    Marcin
