Hackathon Date: April 22, 2026 • 2:00 PM - 7:00 PM IST
What You’re Building
Self-host an open-source LLM that can replace our Gemini API dependency for CAI use cases — matching response accuracy while delivering low latency under real production load. Our current Gemini API spend is 2,500 USD per month; your self-hosted solution must stay within this cost envelope. The team can provision instances up to $6/hour.
The Challenge
Getting a model to answer questions is easy. Getting it to answer accurately, fast, and at scale — on our use cases — is the real challenge. You’ll need to:
- Pick and claim a model — Choose an open-source LLM and claim it on the general channel (first come, first served — see Model Claiming below)
- Set up inference — Deploy it with an optimized serving stack (vLLM, TGI, Ollama, or your own setup)
- Match Gemini’s accuracy — Your endpoint will be evaluated via automated evals against the same prompt set
- Survive under load — Handle 100 concurrent requests without dropping any
- Stay fast — Target average latency under 500ms across 10 runs of 100 parallel requests
Model Claiming
Use Case for Benchmarking
Hindi Collection
Use the Hindi Collection use case — richer transcript data is available for model training and fine-tuning if needed.
Judging Criteria
End-to-End Working + Single API call Latency < 1s — 20%
The self-hosted model is deployed, serves responses correctly via API, and a single API call completes in under 1 second. This is the minimum bar — it works, and it’s fast.
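To check the single-call bar before demo time, you can time one round-trip to your endpoint. A minimal sketch in Python: the URL path and payload shape here assume an OpenAI-style JSON API, so adapt both to whatever your serving stack actually exposes.

```python
import json
import time
import urllib.request

def single_call_latency(url: str, payload: dict) -> float:
    """POST one request to the model endpoint and return the
    round-trip time in seconds."""
    body = json.dumps(payload).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()  # include response transfer in the measurement
    return time.perf_counter() - start
```

Run it a few times and discard the first call if your stack does lazy model loading — a cold first request can dominate the number.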
Accuracy of Responses (via Evals) — 30%
Response quality evaluated via automated evals run against your endpoint. Keep your endpoint ready with all required parameters — we will run the eval suite against it. Scored on correctness and completeness vs Gemini. Gold standard test cases can be found here and existing evaluation scoring can be found here.
Handles 100 Parallel Requests — 10%
System stays up and responds correctly under 100 concurrent requests. No crashes, no dropped requests, no errors.
Average Latency < 500ms at Load — 30%
Run 100 parallel requests and measure average latency. Repeat this 10 times and report the average latency across all 10 runs. You need to build a harness/dashboard that demonstrates this. Target: average latency under 500ms.
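A harness for this can be small. The sketch below runs N rounds of M parallel calls and reports per-run and overall average latency; the `request_fn` callable is a placeholder for whatever actually hits your endpoint (e.g. the single-call function above, or an HTTP client of your choice).

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(request_fn, concurrency=100, runs=10):
    """Run `runs` rounds of `concurrency` parallel requests.

    Returns (per_run_averages, overall_average) in seconds.
    `request_fn` is any zero-argument callable that performs one
    request and raises on failure (a failed call fails the run).
    """
    per_run_averages = []
    for _ in range(runs):
        def timed_call(_):
            start = time.perf_counter()
            request_fn()
            return time.perf_counter() - start

        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            latencies = list(pool.map(timed_call, range(concurrency)))
        per_run_averages.append(statistics.mean(latencies))
    return per_run_averages, statistics.mean(per_run_averages)
```

Threads are fine here because each worker is blocked on I/O; if you want finer-grained metrics for the dashboard, also record p95/max per run rather than averages alone, since a 500ms average can hide a long tail.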
Bonus: Fine-Tuning with Existing Transcripts — 10%
Use existing call transcripts (Hindi Collection) for model fine-tuning. Engineers who go beyond prompt engineering and actually fine-tune on real production data to improve accuracy and domain fit will score here. Show your training approach, the dataset used, and a before/after comparison.
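Whatever training framework you pick, the first step is turning raw transcripts into supervised pairs. A minimal sketch: the transcript schema below (`{"turns": [{"speaker": ..., "text": ...}]}`) is an assumption, not the real Hindi Collection export format, so adapt the field names to the actual data.

```python
import json

def transcripts_to_jsonl(transcripts, out_path):
    """Convert call transcripts into prompt/completion JSONL pairs
    for supervised fine-tuning. Pairs each customer turn with the
    agent turn that immediately follows it."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for call in transcripts:
            turns = call.get("turns", [])
            for prev, nxt in zip(turns, turns[1:]):
                if prev["speaker"] == "customer" and nxt["speaker"] == "agent":
                    record = {"prompt": prev["text"], "completion": nxt["text"]}
                    # ensure_ascii=False keeps Hindi text readable in the file
                    f.write(json.dumps(record, ensure_ascii=False) + "\n")
                    count += 1
    return count
```

Keep a held-out slice of transcripts out of the training file so your before/after comparison is measured on data the model never saw.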
Getting it working is the entry ticket (20%). The biggest weight goes to accuracy (30%) and latency under load (30%) — because matching Gemini at scale is the whole point. The 10% bonus rewards engineers who invest in fine-tuning with real data, which is the path to production.
Endpoint Requirements
Your self-hosted model must be accessible via an HTTP API endpoint. Make sure:
- The endpoint accepts the same parameters needed for our eval suite
- It is stable and accessible at demo time — we will hit it live
- You have a load testing harness ready that shows 100 parallel requests running with latency measurements across 10 runs
Rules
- Individual work only — this is a solo competition
- Claude Code is allowed and encouraged — use it aggressively
- Any open-source model — but you must claim your model + token size first (first come, first served)
- Any inference framework — vLLM, TGI, Ollama, llama.cpp, or custom
- Budget constraint — instance cost must not exceed the equivalent of $2,500/month
- Final demo: Live demo showing single-call latency, load test harness results (10 runs × 100 requests), and endpoint ready for accuracy evals
🏆 Hackathon Prize
The winner receives an INR 2,000 Amazon coupon.