Case Study

Self-Hosted LLMs: Full AI Capabilities Without Cloud API Costs

Most businesses exploring AI assume it means sending their data to OpenAI or paying per-query API fees indefinitely. It doesn't. We deploy production-grade language and vision models on your own infrastructure so the capability is yours, the data stays private, and the cost is a fixed server bill instead of an ever-growing API invoice.

Cloud AI APIs work well for prototyping. At any real production volume, the economics quickly stop making sense. Every query costs money. Every document sent to an external model leaves your infrastructure. Every API key is a dependency on a vendor's pricing, availability, and terms.

Self-hosted models solve all of these. The tradeoff used to be quality. That gap has closed significantly. Modern open-weight models run well on modest hardware and handle the majority of business use cases without touching a cloud API.

We run our own AI chat platform at chat.steelcitysolutions.io, built on Open WebUI backed by locally hosted models. It serves as both our internal tooling and a live reference deployment we use when scoping similar work for clients.

Open WebUI Frontend

A full-featured chat interface with conversation history, model switching, document uploads, and system prompt configuration. Comparable to ChatGPT in capability, running entirely on our own hardware.

Ollama Model Backend

Models are served via Ollama, which handles downloading, versioning, and serving open-weight models locally. Switching between Llama, Mistral, Qwen, and others is a single command. No API keys, no usage limits.

Cloud Model Fallback

For tasks where a frontier model is genuinely needed, the same interface can route to OpenAI, Anthropic, or other cloud providers. Local models handle the volume; cloud models handle edge cases. Cost is controlled by design.
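The local-first routing idea can be sketched as follows. The backend names and the edge-case list here are hypothetical, not the platform's actual configuration:

```python
# Hypothetical sketch of local-first routing: routine task types stay on
# the local model, and only named edge cases escalate to a cloud provider.
FRONTIER_TASKS = {"long-context-analysis", "complex-reasoning"}

def pick_backend(task_type: str) -> str:
    """Default to the local model; route listed edge cases to the cloud."""
    return "cloud" if task_type in FRONTIER_TASKS else "local"
```

Because the default is local, cloud spend only occurs for task types someone deliberately placed on the escalation list.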

Access Control and Audit Logs

Users authenticate before accessing the platform. All conversations are logged internally. No data is shared with any third party unless a cloud model fallback is explicitly selected.

The Fairway H&C dispatch platform includes receipt tracking tied directly to jobs. Technicians upload a photo of a receipt from the field. A vision-language model reads the image and extracts the relevant data automatically.

  1. Technician snaps a photo of the receipt

    No special hardware, no scanner. A phone camera is all that's needed. The image is uploaded directly through the job detail page in the web app.

  2. Vision-language model reads the image

    The uploaded image is passed to a locally hosted vision-language model. The model reads the receipt and extracts vendor name, date, line items, and total. No cloud API call, no external service, no per-image fee.

  3. Structured data is written to the job record

    The extracted fields are formatted and saved directly to the job. The receipt is stored as an attached image alongside the parsed data. The job record now has a complete, searchable expense entry with zero manual data entry.

  4. Available for review, audit, and reporting

    The receipt image and its parsed data stay attached to the job permanently. Admins can review, export, or audit any receipt at any time. The original image is always available to verify the extracted data against.

The result: a technician spends three seconds taking a photo. The system handles everything else. No manual entry, no lost receipts, no end-of-month scramble to reconcile expenses.
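The middle steps of that pipeline can be sketched in a few lines. The model name, the prompt, and the extracted field names below are illustrative assumptions, not the production implementation:

```python
import base64
import json
import re

# Sketch of the receipt pipeline: build a request carrying the uploaded
# image for a locally hosted vision-language model, then pull structured
# fields out of the model's free-text reply. "llava" and the field names
# are assumptions for illustration.

def build_vision_request(image_bytes: bytes, model: str = "llava") -> dict:
    """Assemble an Ollama-style chat payload with a base64-encoded image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": "Extract vendor, date, line_items, and total as JSON.",
            "images": [base64.b64encode(image_bytes).decode()],
        }],
        "stream": False,
    }

def parse_receipt_fields(model_output: str) -> dict:
    """Pull the first JSON object out of the model's reply, if any."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    return json.loads(match.group(0)) if match else {}
```

The payload would be POSTed to the local model server's chat endpoint, and the parsed fields written to the job record alongside the original image.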

Receipt scanning is one example of a broader pattern: take something a human is currently doing by reading a document and typing into a system, and replace the reading and typing with a vision or language model. The human takes the photo or uploads the file. The model does the rest.

A self-hosted AI deployment for a business typically involves three components: the model server, the interface or integration layer, and the application logic that connects it to the workflow.

Model Server

Ollama or a similar runtime manages model files and serves a local API. It runs on a dedicated machine, a VPS with sufficient RAM, or on-premises hardware. A machine with 16GB of RAM handles most text models well; a GPU shortens response times significantly for heavier workloads.

Interface Layer

Open WebUI for general-purpose chat and document interaction, or a custom API integration built directly into your existing application. Both options are production-ready. The choice depends on whether users need a standalone chat tool or an embedded experience.

Application Integration

Calling a local model from your application is the same as calling any HTTP API. The model server exposes an OpenAI-compatible endpoint, so most existing tooling and libraries work without modification.
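As a minimal sketch of that call, the base URL below is Ollama's default OpenAI-compatible endpoint and the model name is an assumed example:

```python
import json
import urllib.request

# Minimal sketch of calling a local model server through its
# OpenAI-compatible endpoint. The base URL is Ollama's default;
# the model name is an assumption.
OLLAMA_BASE = "http://localhost:11434/v1"

def build_chat_request(prompt: str, model: str = "llama3.1") -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, model: str = "llama3.1") -> str:
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=json.dumps(build_chat_request(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the endpoint mirrors OpenAI's API shape, existing client libraries can also be pointed at it by overriding their base URL.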

Model Selection

The right model depends on the task. Text generation, summarization, and classification use different models than vision tasks like receipt or document scanning. We select and test models against your specific use case before committing to a deployment.

The economics shift significantly at moderate usage. A self-hosted deployment has a fixed infrastructure cost. A cloud API has a variable cost that grows with every query.
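A back-of-envelope comparison makes the break-even point concrete. All figures below are illustrative assumptions, not quoted prices:

```python
# Break-even between a fixed monthly server cost and per-query API pricing.
# The figures in the example are illustrative assumptions, not quotes.
def breakeven_queries(server_cost_per_month: float, cost_per_query: float) -> float:
    """Monthly query volume at which self-hosting matches the API bill."""
    return server_cost_per_month / cost_per_query

# e.g. an $80/month server vs roughly $0.01 per API query:
# above ~8,000 queries a month, the fixed server cost wins.
```

Past that volume every additional query is free on the self-hosted side, while the API bill keeps climbing.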

Cloud APIs still make sense for low-volume or irregular workloads where standing up dedicated infrastructure isn't justified. The right answer depends on your actual usage pattern. We scope both options honestly before recommending one.

Whether you want a private chat platform for your team, a vision model integrated into an existing app, or a document processing pipeline that runs on your own hardware, we can build it.

Tell us what you're trying to automate and we'll tell you whether a local model can handle it and what the infrastructure would cost.

Get in Touch