Case Study

Self-Hosted LLMs: Full AI Capabilities Without Cloud API Costs

Most businesses exploring AI assume it means sending their data to OpenAI or paying per-query API fees indefinitely. It doesn't. We deploy production-grade language and vision models on your own infrastructure so the capability is yours, the data stays private, and the cost is a fixed server bill instead of an ever-growing API invoice.

Cloud AI APIs work well for prototyping. At any real production volume, the economics quickly stop making sense. Every query costs money. Every document sent to an external model leaves your infrastructure. Every API key is a dependency on a vendor's pricing, availability, and terms.

Self-hosted models solve all of these. The tradeoff used to be quality. That gap has closed significantly. Modern open-weight models run well on modest hardware and handle the majority of business use cases without touching a cloud API.

We run our own AI chat platform at chat.steelcitysolutions.io, built on Open WebUI backed by locally hosted models. It serves as both our internal tooling and a live reference deployment we use when scoping similar work for clients.

Open WebUI Frontend

A full-featured chat interface with conversation history, model switching, document uploads, and system prompt configuration. Comparable to ChatGPT in capability, running entirely on our own hardware.

Ollama Model Backend

Models are served via Ollama, which handles downloading, versioning, and serving open-weight models locally. Switching between Llama, Mistral, Qwen, and others is a single command. No API keys, no usage limits.

Cloud Model Fallback

For tasks where a frontier model is genuinely needed, the same interface can route to OpenAI, Anthropic, or other cloud providers. Local models handle the volume; cloud models handle edge cases. Cost is controlled by design.
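The local-first routing idea can be sketched as follows. The backend names and the edge-case list here are hypothetical, not the platform's actual configuration:

```python
# Hypothetical sketch of local-first routing: routine task types stay on
# the local model, and only named edge cases escalate to a cloud provider.
FRONTIER_TASKS = {"long-context-analysis", "complex-reasoning"}

def pick_backend(task_type: str) -> str:
    """Default to the local model; route listed edge cases to the cloud."""
    return "cloud" if task_type in FRONTIER_TASKS else "local"
```

Because the default is local, cloud spend only occurs for task types someone deliberately placed on the escalation list.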

Access Control and Audit Logs

Users authenticate before accessing the platform. All conversations are logged internally. No data is shared with any third party unless a cloud model fallback is explicitly selected.

The Fairway H&C dispatch platform includes receipt tracking tied directly to jobs. Technicians upload a photo of a receipt from the field. A vision-language model reads the image and extracts the relevant data automatically.

  1. Technician snaps a photo of the receipt

    No special hardware, no scanner. A phone camera is all that's needed. The image is uploaded directly through the job detail page in the web app.

  2. Vision-language model reads the image

    The uploaded image is passed to a locally hosted vision-language model. The model reads the receipt and extracts vendor name, date, line items, and total. No cloud API call, no external service, no per-image fee.

  3. Structured data is written to the job record

    The extracted fields are formatted and saved directly to the job. The receipt is stored as an attached image alongside the parsed data. The job record now has a complete, searchable expense entry with zero manual data entry.

  4. Available for review, audit, and reporting

    The receipt image and its parsed data stay attached to the job permanently. Admins can review, export, or audit any receipt at any time. The original image is always available to verify the extracted data against.

The result: a technician spends three seconds taking a photo. The system handles everything else. No manual entry, no lost receipts, no end-of-month scramble to reconcile expenses.
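The middle steps of that pipeline can be sketched in a few lines. The model name, the prompt, and the extracted field names below are illustrative assumptions, not the production implementation:

```python
import base64
import json
import re

# Sketch of the receipt pipeline: build a request carrying the uploaded
# image for a locally hosted vision-language model, then pull structured
# fields out of the model's free-text reply. "llava" and the field names
# are assumptions for illustration.

def build_vision_request(image_bytes: bytes, model: str = "llava") -> dict:
    """Assemble an Ollama-style chat payload with a base64-encoded image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": "Extract vendor, date, line_items, and total as JSON.",
            "images": [base64.b64encode(image_bytes).decode()],
        }],
        "stream": False,
    }

def parse_receipt_fields(model_output: str) -> dict:
    """Pull the first JSON object out of the model's reply, if any."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    return json.loads(match.group(0)) if match else {}
```

The payload would be POSTed to the local model server's chat endpoint, and the parsed fields written to the job record alongside the original image.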

Receipt scanning is one example of a broader pattern: take something a human is currently doing by reading a document and typing into a system, and replace the reading and typing with a vision or language model. The human takes the photo or uploads the file. The model does the rest.

A self-hosted AI deployment for a business typically involves three components: the model server, the interface or integration layer, and the application logic that connects it to the workflow.

Model Server

Ollama or a similar runtime manages model files and serves a local API. It runs on a dedicated machine, a VPS with sufficient RAM, or on-premises hardware. A machine with 16GB of RAM handles most text models well; a GPU shortens response times significantly for heavier workloads.

Interface Layer

Open WebUI for general-purpose chat and document interaction, or a custom API integration built directly into your existing application. Both options are production-ready. The choice depends on whether users need a standalone chat tool or an embedded experience.

Application Integration

Calling a local model from your application is the same as calling any HTTP API. The model server exposes an OpenAI-compatible endpoint, so most existing tooling and libraries work without modification.
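As a minimal sketch of that call, the base URL below is Ollama's default OpenAI-compatible endpoint and the model name is an assumed example:

```python
import json
import urllib.request

# Minimal sketch of calling a local model server through its
# OpenAI-compatible endpoint. The base URL is Ollama's default;
# the model name is an assumption.
OLLAMA_BASE = "http://localhost:11434/v1"

def build_chat_request(prompt: str, model: str = "llama3.1") -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, model: str = "llama3.1") -> str:
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=json.dumps(build_chat_request(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the endpoint mirrors OpenAI's API shape, existing client libraries can also be pointed at it by overriding their base URL.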

Model Selection

The right model depends on the task. Text generation, summarization, and classification use different models than vision tasks like receipt or document scanning. We select and test models against your specific use case before committing to a deployment.

The economics shift significantly at moderate usage. A self-hosted deployment has a fixed infrastructure cost. A cloud API has a variable cost that grows with every query.
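A back-of-envelope comparison makes the break-even point concrete. All figures below are illustrative assumptions, not quoted prices:

```python
# Break-even between a fixed monthly server cost and per-query API pricing.
# The figures in the example are illustrative assumptions, not quotes.
def breakeven_queries(server_cost_per_month: float, cost_per_query: float) -> float:
    """Monthly query volume at which self-hosting matches the API bill."""
    return server_cost_per_month / cost_per_query

# e.g. an $80/month server vs roughly $0.01 per API query:
# above ~8,000 queries a month, the fixed server cost wins.
```

Past that volume every additional query is free on the self-hosted side, while the API bill keeps climbing.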

Cloud APIs still make sense for low-volume or irregular workloads where standing up dedicated infrastructure isn't justified. The right answer depends on your actual usage pattern. We scope both options honestly before recommending one.

Whether you want a private chat platform for your team, a vision model integrated into an existing app, or a document processing pipeline that runs on your own hardware, we can build it.

Tell us what you're trying to automate and we'll tell you whether a local model can handle it and what the infrastructure would cost.

Get in Touch