Hackathon Date: April 22, 2026 • 2:00 PM - 7:00 PM IST
What You’re Building
Self-host an open-source LLM that can replace our Gemini API dependency for CAI use cases — matching response accuracy while delivering low latency under real production load. Our current Gemini API spend is 2,500 USD per month; your self-hosted solution must stay within this cost envelope. The team can provision instances up to $6/hour.
The Challenge
Getting a model to answer questions is easy. Getting it to answer accurately, fast, and at scale — on our use cases — is the real challenge. You’ll need to:
- Pick and claim a model — Choose an open-source LLM and claim it on the general channel (first come, first served — see Model Claiming below)
- Set up inference — Deploy it with an optimized serving stack (vLLM, TGI, Ollama, or your own setup)
- Match Gemini’s accuracy — Your endpoint will be evaluated via automated evals against the same prompt set
- Survive under load — Handle 100 concurrent requests without dropping any
- Stay fast — Target average latency under 500ms across 10 runs of 100 parallel requests
Model Claiming
Use Case for Benchmarking
Hindi Collection
Use the Hindi Collection use case — richer transcript data is available for model training and fine-tuning if needed.
Judging Criteria
End-to-End Working + Single API call Latency < 1s — 20%
The self-hosted model is deployed, serves responses correctly via API, and a single API call completes in under 1 second. This is the minimum bar — it works, and it’s fast.
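To check the single-call bar before demo time, you can time one round-trip to your endpoint. A minimal sketch in Python: the URL path and payload shape here assume an OpenAI-style JSON API, so adapt both to whatever your serving stack actually exposes.

```python
import json
import time
import urllib.request

def single_call_latency(url: str, payload: dict) -> float:
    """POST one request to the model endpoint and return the
    round-trip time in seconds."""
    body = json.dumps(payload).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()  # include response transfer in the measurement
    return time.perf_counter() - start
```

Run it a few times and discard the first call if your stack does lazy model loading — a cold first request can dominate the number.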
Accuracy of Responses (via Evals) — 30%
Response quality evaluated via automated evals run against your endpoint. Keep your endpoint ready with all required parameters — we will run the eval suite against it. Scored on correctness and completeness vs Gemini. Gold standard test cases can be found here and existing evaluation scoring can be found here.
Handles 100 Parallel Requests — 10%
System stays up and responds correctly under 100 concurrent requests. No crashes, no dropped requests, no errors.
Average Latency < 500ms at Load — 30%
Run 100 parallel requests and measure average latency. Repeat this 10 times and report the average latency across all 10 runs. You need to build a harness/dashboard that demonstrates this. Target: average latency under 500ms.
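A harness for this can be small. The sketch below runs N rounds of M parallel calls and reports per-run and overall average latency; the `request_fn` callable is a placeholder for whatever actually hits your endpoint (e.g. the single-call function above, or an HTTP client of your choice).

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(request_fn, concurrency=100, runs=10):
    """Run `runs` rounds of `concurrency` parallel requests.

    Returns (per_run_averages, overall_average) in seconds.
    `request_fn` is any zero-argument callable that performs one
    request and raises on failure (a failed call fails the run).
    """
    per_run_averages = []
    for _ in range(runs):
        def timed_call(_):
            start = time.perf_counter()
            request_fn()
            return time.perf_counter() - start

        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            latencies = list(pool.map(timed_call, range(concurrency)))
        per_run_averages.append(statistics.mean(latencies))
    return per_run_averages, statistics.mean(per_run_averages)
```

Threads are fine here because each worker is blocked on I/O; if you want finer-grained metrics for the dashboard, also record p95/max per run rather than averages alone, since a 500ms average can hide a long tail.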
Bonus: Fine-Tuning with Existing Transcripts — 10%
Use existing call transcripts (Hindi Collection) for model fine-tuning. Engineers who go beyond prompt engineering and actually fine-tune on real production data to improve accuracy and domain fit will score here. Show your training approach, the dataset used, and a before/after comparison.
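Whatever training framework you pick, the first step is turning raw transcripts into supervised pairs. A minimal sketch: the transcript schema below (`{"turns": [{"speaker": ..., "text": ...}]}`) is an assumption, not the real Hindi Collection export format, so adapt the field names to the actual data.

```python
import json

def transcripts_to_jsonl(transcripts, out_path):
    """Convert call transcripts into prompt/completion JSONL pairs
    for supervised fine-tuning. Pairs each customer turn with the
    agent turn that immediately follows it."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for call in transcripts:
            turns = call.get("turns", [])
            for prev, nxt in zip(turns, turns[1:]):
                if prev["speaker"] == "customer" and nxt["speaker"] == "agent":
                    record = {"prompt": prev["text"], "completion": nxt["text"]}
                    # ensure_ascii=False keeps Hindi text readable in the file
                    f.write(json.dumps(record, ensure_ascii=False) + "\n")
                    count += 1
    return count
```

Keep a held-out slice of transcripts out of the training file so your before/after comparison is measured on data the model never saw.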
Getting it working is the entry ticket (20%). The biggest weight goes to accuracy (30%) and latency under load (30%) — because matching Gemini at scale is the whole point. The 10% bonus rewards engineers who invest in fine-tuning with real data, which is the path to production.
Endpoint Requirements
Your self-hosted model must be accessible via an HTTP API endpoint. Make sure:
- The endpoint accepts the same parameters needed for our eval suite
- It is stable and accessible at demo time — we will hit it live
- You have a load testing harness ready that shows 100 parallel requests running with latency measurements across 10 runs
Rules
- Individual work only — this is a solo competition
- Claude Code is allowed and encouraged — use it aggressively
- Any open-source model — but you must claim your model + token size first (first come, first served)
- Any inference framework — vLLM, TGI, Ollama, llama.cpp, or custom
- Budget constraint — instance cost must not exceed the equivalent of $2,500/month
- Final demo: Live demo showing single-call latency, load test harness results (10 runs × 100 requests), and endpoint ready for accuracy evals
🏆 Hackathon Prize
The winner receives an INR 2,000 Amazon coupon.