Smart Voice AI Pipeline - Mint Starter Kit

Hackathon Date: March 31, 2026 • 2:00 PM - 7:00 PM IST
Format: Individual

What You’re Building

Build an intelligent Pipecat voice AI pipeline that goes beyond basic conversation. Your pipeline will process a live voice call with our Shilpa agent (Kotak Securities demat account opening) and add four real-time capabilities on top:

Human Escalation: Detect when the caller needs a human agent and execute a live SIP transfer via FreeSWITCH
Gender Detection: Infer the caller’s gender from the conversation transcript in real time and make it available as pipeline metadata
Language Detection: Detect the caller’s language (Hindi, English, code-mixed) from the transcript in real time and make it available as pipeline metadata
Prompt Optimization: Reduce latency and cost through one or more of: prompt compression, faster tool calling, or RAG-based dynamic knowledge injection

Pipeline Requirements

Base Pipeline

Build a working Pipecat pipeline from scratch with:

STT (any provider)
LLM (Gemini or any provider)
TTS (any provider)
The Shilpa agent prompt loaded and working

Agent

The agent is Shilpa (Kotak Securities demat account opening).

The Four Capabilities

1. Human Escalation

Detect in real time when the conversation should be handed off to a human agent, then execute the SIP transfer via FreeSWITCH. Detection triggers (at minimum):

Caller explicitly asks to speak to a human / manager / supervisor
Caller expresses extreme frustration or anger (repeated objections, raised voice cues in transcript)
Conversation is stuck in a loop (agent repeating itself, caller not progressing)
Agent is unable to answer a question outside its domain
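
The four triggers above could be approximated with a small rule-based detector before reaching for an LLM classifier. The following is a minimal sketch, not a reference implementation: the phrase lists, the thresholds, the `EscalationDetector` class, and the `[OUT_OF_DOMAIN]` marker (a tag the agent prompt would have to be instructed to emit) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical phrase lists; tune them against real call transcripts.
EXPLICIT_REQUEST = re.compile(
    r"\b(human|manager|supervisor|real person)\b", re.IGNORECASE
)
FRUSTRATION = re.compile(
    r"\b(useless|ridiculous|fed up|stop calling|waste of time)\b", re.IGNORECASE
)
# Assumed tag that the agent prompt asks the LLM to emit when it cannot answer.
OUT_OF_DOMAIN_MARKER = "[OUT_OF_DOMAIN]"


@dataclass
class EscalationDetector:
    frustration_threshold: int = 2  # frustrated caller turns before escalating
    loop_threshold: int = 2         # identical agent replies tolerated
    frustration_hits: int = 0
    agent_history: List[str] = field(default_factory=list)

    def on_caller_turn(self, text: str) -> Optional[str]:
        """Return an escalation reason for this caller turn, or None."""
        if EXPLICIT_REQUEST.search(text):
            return "explicit_request"
        if FRUSTRATION.search(text):
            self.frustration_hits += 1
            if self.frustration_hits >= self.frustration_threshold:
                return "frustration"
        return None

    def on_agent_turn(self, text: str) -> Optional[str]:
        """Return an escalation reason for this agent turn, or None."""
        if OUT_OF_DOMAIN_MARKER in text:
            return "out_of_domain"
        # Normalize case and whitespace so near-identical replies compare equal.
        normalized = " ".join(text.lower().split())
        self.agent_history.append(normalized)
        if self.agent_history.count(normalized) > self.loop_threshold:
            return "stuck_loop"
        return None
```

Counting frustration across turns rather than reacting to a single phrase is one way to avoid false positives on mild complaints, while explicit requests escalate immediately.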

When escalation is triggered, the agent should inform the caller, execute a SIP transfer via FreeSWITCH ESL, and log the escalation reason, turn number, and transcript up to that point.
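
One possible shape for the transfer and logging step is sketched below. It speaks the plain-text FreeSWITCH event socket protocol directly and issues the `uuid_transfer` API command. The helper names, the `XML`/`default` dialplan and context, the log record fields, and the single-`recv` response handling are assumptions; a real deployment would use its own dialplan and a proper ESL client that parses `Content-Length` framed replies.

```python
import json
import socket
import time
from typing import Dict, List


def build_transfer_command(call_uuid: str, extension: str,
                           dialplan: str = "XML", context: str = "default") -> str:
    """Format the ESL command that transfers a live channel to an extension."""
    return f"api uuid_transfer {call_uuid} {extension} {dialplan} {context}\n\n"


def build_escalation_log(reason: str, turn_number: int,
                         transcript: List[Dict[str, str]],
                         metadata: Dict[str, str]) -> str:
    """Serialize the escalation record (field names are hypothetical)."""
    record = {
        "timestamp": time.time(),
        "reason": reason,
        "turn_number": turn_number,
        "transcript": transcript,
        "metadata": metadata,  # e.g. detected gender and language
    }
    return json.dumps(record, ensure_ascii=False)


def esl_transfer(host: str, port: int, password: str,
                 call_uuid: str, extension: str, timeout: float = 5.0) -> str:
    """Authenticate against the event socket and send the transfer command.

    Simplified: assumes each reply arrives in one recv() call.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.recv(4096)  # "Content-Type: auth/request" banner
        sock.sendall(f"auth {password}\n\n".encode())
        auth_reply = sock.recv(4096).decode(errors="replace")
        if "+OK" not in auth_reply:
            raise ConnectionError(f"ESL auth failed: {auth_reply!r}")
        sock.sendall(build_transfer_command(call_uuid, extension).encode())
        return sock.recv(4096).decode(errors="replace")
```

In this sketch the handoff message would be spoken first, the log line written next, and `esl_transfer` called last, so the reason and transcript are recorded even if the transfer itself fails.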

2. Gender Detection

Infer the caller’s likely gender from the conversation transcript in real time using NLP/LLM analysis. Use linguistic cues: name mentions, pronoun usage, and Hindi gendered verb forms (e.g., “मैं करता हूँ” vs “मैं करती हूँ”). Produce a classification (male, female, unknown) with a confidence score that updates as the conversation progresses.
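
As a rough illustration of the verb-form cue only, the sketch below counts first-person masculine and feminine verb endings in Devanagari and Romanized Hindi and turns the running counts into a label and a confidence. The `GenderDetector` class, the regex patterns, and the confidence formula are assumptions chosen for simplicity; name and pronoun cues, or an LLM pass, would need to be layered on top.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# "hoon" is written with either chandrabindu (U+0901) or anusvara (U+0902).
NASAL = "[\u0901\u0902]"

# First-person forms such as "करता हूँ" / "रहा हूँ" and "karta hoon" / "raha hoon".
MASCULINE = re.compile(
    rf"(?:ता|रहा)\s+हू{NASAL}|\b\w*(?:ta|raha)\s+(?:hu|hun|hoon|hoo)\b",
    re.IGNORECASE,
)
# First-person forms such as "करती हूँ" / "रही हूँ" and "karti hoon" / "rahi hoon".
FEMININE = re.compile(
    rf"(?:ती|रही)\s+हू{NASAL}|\b\w*(?:ti|rahi)\s+(?:hu|hun|hoon|hoo)\b",
    re.IGNORECASE,
)


@dataclass
class GenderDetector:
    masculine: int = 0
    feminine: int = 0

    def update(self, caller_text: str) -> Tuple[str, float]:
        """Fold one caller turn into the counts and return (label, confidence)."""
        self.masculine += len(MASCULINE.findall(caller_text))
        self.feminine += len(FEMININE.findall(caller_text))
        return self.current()

    def current(self) -> Tuple[str, float]:
        total = self.masculine + self.feminine
        # The +2 acts as a weak prior, so one cue alone yields low confidence.
        confidence = abs(self.masculine - self.feminine) / (total + 2)
        if self.masculine == self.feminine:
            return "unknown", confidence
        label = "male" if self.masculine > self.feminine else "female"
        return label, confidence
```

With this formula a single feminine cue gives a confidence of about 0.33 and a second consistent cue raises it to 0.5, so the score grows as evidence accumulates and drops when cues conflict.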

3. Language Detection

Detect the caller’s language in real time from the transcript. Classify each caller turn as hindi, english, code-mixed, or other. Track the dominant language across the conversation. Handle edge cases such as single-word and numbers-only responses.
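
A lightweight baseline for per-turn classification is to label each token by script, with a small lexicon to catch Romanized Hindi, and then look at the mix. The sketch below takes that approach. The lexicon contents, the `mix_threshold` value, the `LanguageTracker` class, and the choice to carry the dominant language forward on numbers-only turns are all assumptions, not requirements of the brief.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

# Tiny hypothetical lexicon of Romanized Hindi words; extend it for real use.
ROMANIZED_HINDI = {
    "haan", "nahi", "nahin", "bolo", "kya", "hai", "hain", "mujhe", "chahiye",
    "theek", "accha", "achha", "ji", "kaise", "kyun", "mera", "aap", "karna", "hoon",
}


def _token_language(token: str) -> Optional[str]:
    """Label one token as hindi, english, or other; None for digits/punctuation."""
    word = token.strip(".,!?;:\"'()।").lower().replace("'", "").replace("’", "")
    if not any(ch.isalpha() for ch in word):
        return None
    if any("\u0900" <= ch <= "\u097f" for ch in word):  # Devanagari block
        return "hindi"
    if word.isascii() and word.isalpha():
        return "hindi" if word in ROMANIZED_HINDI else "english"
    return "other"


def classify_turn(text: str, mix_threshold: float = 0.25) -> Optional[str]:
    """Classify a caller turn; None means there was no language evidence."""
    counts = Counter(lang for lang in map(_token_language, text.split()) if lang)
    hindi, english = counts["hindi"], counts["english"]
    if hindi == 0 and english == 0:
        return "other" if counts["other"] else None
    if min(hindi, english) / (hindi + english) >= mix_threshold:
        return "code-mixed"
    return "hindi" if hindi > english else "english"


@dataclass
class LanguageTracker:
    counts: Counter = field(default_factory=Counter)

    def update(self, text: str) -> str:
        """Return the label for this turn and update the running tally."""
        label = classify_turn(text)
        if label is None:
            # Numbers-only or empty turn: reuse the dominant language so far.
            return self.dominant()
        self.counts[label] += 1
        return label

    def dominant(self) -> str:
        return self.counts.most_common(1)[0][0] if self.counts else "other"
```

For example, `classify_turn("haan bolo")` yields `hindi` through the lexicon, while a turn containing only digits returns `None` and leaves the tally untouched.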

4. Prompt Optimization

Reduce LLM latency and/or cost through intelligent prompt engineering at the pipeline level. Implement one or more of: prompt compression (reduce token count while preserving instruction fidelity), fast tool calling (parallel execution, caching, speculative selection), or RAG-based knowledge injection (index agent knowledge into a vector store, retrieve relevant chunks per turn instead of stuffing the full prompt).
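
To make the RAG option concrete, the sketch below retrieves the most relevant knowledge chunks for each caller turn and appends only those to a short core prompt. It deliberately swaps in bag-of-words cosine similarity for a real embedding model and vector store so that it runs with the standard library alone; the `ChunkIndex` class, the `build_prompt` helper, and the prompt layout are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class ChunkIndex:
    """In-memory index; a vector store would replace this in practice."""

    def __init__(self, chunks: List[str]) -> None:
        self.chunks = chunks
        self.vectors = [Counter(tokenize(chunk)) for chunk in chunks]

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        """Return up to k chunks with non-zero similarity, best first."""
        q = Counter(tokenize(query))
        scored = sorted(
            zip((cosine(q, v) for v in self.vectors), self.chunks),
            key=lambda pair: pair[0],
            reverse=True,
        )
        return [chunk for score, chunk in scored[:k] if score > 0]


def build_prompt(core_instructions: str, index: ChunkIndex,
                 caller_turn: str, k: int = 2) -> str:
    """Combine the short core prompt with only the retrieved knowledge."""
    context = "\n".join(index.retrieve(caller_turn, k))
    return f"{core_instructions}\n\nRelevant knowledge:\n{context}"
```

Comparing the token count and response latency of the full prompt against the output of `build_prompt` on the same set of caller turns gives the before/after numbers for this approach.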

Judging Criteria

Criteria	Weight	What We’re Looking For
Gender & Language Detection	30%	Do the detectors produce correct results in real time? Do they handle edge cases: code-mixed utterances, Romanized Hindi (“haan bolo”), ambiguous gender, single-word responses? Does confidence improve with more turns? Is the processor design clean, with proper error handling and logging?
Human Escalation	30%	Does the detector trigger correctly: no false positives on mild complaints, no misses on explicit requests like “get me a manager”? Does the SIP transfer actually execute via FreeSWITCH? Does the agent deliver a smooth handoff message before transfer? Is the escalation logged with reason, transcript, and metadata (gender, language)?
Prompt Optimization	20%	Is there a measurable improvement in latency, token count, or cost? Is response quality preserved after compression/RAG? Is the approach technically sound, intelligently reducing the prompt rather than just truncating it? Are before/after metrics shown?
Ambition & Polish	20%	How far beyond the core requirements did you go? Dynamic STT/TTS language switching? Escalation context handoff to human agent screen? Real-time dashboard? Prompt compression + RAG combined? Does the overall system feel production-ready?

Rules

Individual work only: this is a solo competition
Claude Code is allowed and encouraged: use it aggressively
Programming language: Python is required (Pipecat is Python)
Any STT/TTS/LLM provider: you provision your own keys (Gemini is available)
No pre-written code: start from scratch at the hackathon start
Final demo: a 5-minute live demo in which you run a conversation showing all four capabilities in action
