Hackathon Date: March 31, 2026 • 2:00 PM - 7:00 PM IST • Format: Individual

What You’re Building

Build a drop-in voice AI plugin that any client can embed on their website. When a customer clicks it, they can have a live voice conversation with the client’s AI agent — right from the browser. After the conversation ends, your system extracts key entities from the transcript and stores them in a CRM. Think of it as: Intercom chat widget, but for voice AI, with automatic CRM sync.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        CLIENT'S WEBPAGE                         │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    Your Plugin / Widget                   │  │
│  │                                                           │  │
│  │  ┌─────────────┐    ┌───────────────────────────────┐     │  │
│  │  │ Browser     │───▶│ Pipecat WebSocket             │     │  │
│  │  │ Microphone  │    │ (Deployed CAI Agent)          │     │  │
│  │  │ (User's     │    │ Audio In  ← PCM 16-bit 16kHz  │     │  │
│  │  │  voice)     │    │ Audio Out → PCM 16-bit 16kHz  │     │  │
│  │  │             │◀───│                               │     │  │
│  │  │ Speaker     │    └───────────────────────────────┘     │  │
│  │  └─────────────┘                                          │  │
│  │         │                                                 │  │
│  │         ▼  (on conversation end)                          │  │
│  │  ┌─────────────┐    ┌─────────────┐    ┌──────────────┐   │  │
│  │  │ Transcript  │───▶│ Entity      │───▶│ CRM          │   │  │
│  │  │ (STT from   │    │ Extraction  │    │ (Store       │   │  │
│  │  │  agent +    │    │ (LLM)       │    │  entities)   │   │  │
│  │  │  user)      │    └─────────────┘    └──────────────┘   │  │
│  │  └─────────────┘                                          │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Key Components

  1. Web Plugin — A JavaScript widget (floating button, sidebar, modal — your design) that captures the user’s microphone audio and plays the agent’s audio response through the browser speaker.
  2. Voice Transport — Connect to the Pipecat WebSocket, send user’s microphone audio as PCM 16-bit 16kHz mono, receive and play the agent’s audio response in the same format.
  3. Transcription — Transcribe both sides of the conversation (user’s speech and agent’s speech) to build a full transcript. You can use any STT provider or the browser’s built-in Web Speech API.
  4. Entity Extraction — After the conversation ends, use an LLM (Gemini) to extract structured entities from the transcript based on the agent’s domain.
  5. CRM Storage — Store the extracted entities. This can be a real CRM API (HubSpot, Salesforce, etc.) or a mock CRM (local database, JSON file, Airtable, Google Sheets — whatever you prefer).
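The Voice Transport component implies a small conversion layer in the browser: the Web Audio API delivers Float32 samples at the device sample rate (often 48 kHz), while the agent expects PCM 16-bit 16 kHz mono. A minimal sketch, assuming you already have a Float32Array of microphone samples (function names are illustrative; a production widget should low-pass filter before decimating to avoid aliasing):

```javascript
// Naive downsampler: picks every Nth sample (48000 / 16000 = 3).
// Assumes mono input; a real widget should filter before decimating.
function downsampleTo16k(float32Samples, inputRate) {
  const ratio = inputRate / 16000;
  const outLength = Math.floor(float32Samples.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    out[i] = float32Samples[Math.floor(i * ratio)];
  }
  return out;
}

// Convert floats in [-1, 1] to little-endian signed 16-bit PCM,
// the format the Pipecat WebSocket expects on the wire.
function floatTo16BitPCM(float32Samples) {
  const buffer = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

The resulting ArrayBuffer can be sent directly over the WebSocket; playback is the reverse conversion into an AudioBuffer.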

Target Agent

The agent is exposed via a Pipecat WebSocket.

Entity Extraction Reference

Based on the agent’s domain, here are the kinds of entities your system should extract from the conversation transcript.
# Example entity schema (actual schema depends on agent domain)
entities:
  - name: customer_name
    type: string
    description: Full name of the customer

  - name: customer_age
    type: integer
    description: Age of the customer

  - name: interested_product
    type: string
    description: Product or plan the customer showed interest in

  - name: callback_requested
    type: boolean
    description: Whether the customer asked to be called back

  - name: callback_datetime
    type: datetime
    description: Preferred callback date and time (if requested)

  - name: objections_raised
    type: list[string]
    description: Any concerns or objections the customer mentioned

  - name: conversation_outcome
    type: enum [interested, not_interested, callback, do_not_call]
    description: Final disposition of the call
Your entity extraction should work with any entity schema — don’t hardcode it. The schema should be configurable so the plugin works for different clients and agent domains.
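One way to keep the schema configurable is to render it into the extraction prompt at call time rather than baking field names into your code. A sketch, assuming the YAML above has been parsed into a plain object; the prompt wording is illustrative and should be tuned for your model (e.g. Gemini):

```javascript
// Build an extraction prompt from any entity schema, so the same code
// works for every client and agent domain.
function buildExtractionPrompt(schema, transcript) {
  const fieldLines = schema.entities
    .map((e) => `- ${e.name} (${e.type}): ${e.description}`)
    .join("\n");
  return [
    "Extract the following entities from the call transcript.",
    "Return ONLY a JSON object with exactly these keys;",
    "use null for any value that is not mentioned.",
    "",
    "Entities:",
    fieldLines,
    "",
    "Transcript:",
    transcript,
  ].join("\n");
}
```

Pairing this with the model's JSON output mode (and validating the parsed result against the schema's types) keeps the pipeline robust to messy transcripts.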

Milestones

Milestone 1 — Voice Widget

Goal: A working browser-based voice widget that talks to the agent.
Deliverables:
  • A web page with an embeddable voice widget (button, sidebar, modal — your choice)
  • Widget captures microphone audio from the browser
  • Audio sent to Pipecat WebSocket
  • Agent’s audio response played back through browser speakers
  • User can have a real multi-turn voice conversation with the agent from the browser
  • Visual feedback — user can see when the agent is speaking, when it’s their turn, connection status
Validation: Live demo of a 3+ turn voice conversation happening entirely in the browser.
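The visual-feedback deliverable is easiest to keep consistent if a small state machine drives the UI. A sketch; the state and event names here are assumptions about your widget, not part of the Pipecat protocol:

```javascript
// Widget UI states: idle -> connecting -> listening <-> agent_speaking.
// Each entry maps an event to the next state.
const WIDGET_STATES = {
  idle:           { connect: "connecting" },
  connecting:     { connected: "listening", error: "idle" },
  listening:      { agent_audio: "agent_speaking", disconnect: "idle" },
  agent_speaking: { agent_done: "listening", disconnect: "idle" },
};

// Returns the next state, or the current state if the event doesn't apply.
function nextWidgetState(state, event) {
  const transitions = WIDGET_STATES[state] || {};
  return transitions[event] ?? state;
}
```

The render layer then only has to map each state to a visual (pulsing mic, "agent is speaking" indicator, connection badge).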

Milestone 2 — Transcript + Entity Extraction

Goal: After a conversation ends, extract structured entities and store them.
Deliverables:
  • Full conversation transcript generated (both user and agent speech)
  • Transcript displayed in the widget or a side panel after the call ends
  • LLM-based entity extraction that takes the transcript + entity schema → structured JSON output
  • Entity extraction works with a configurable schema (not hardcoded to one agent)
  • Extracted entities displayed to the user after the call
Validation: Show a completed conversation → transcript → extracted entities JSON, with correct values matching what was discussed.
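Building the full transcript usually means interleaving two STT streams (user and agent) by time. A sketch, assuming each utterance carries a start time in milliseconds; the `{ t, text }` shape is an assumption about your STT layer, not a Pipecat type:

```javascript
// Merge user and agent utterances into one chronological transcript string.
function mergeTranscript(userUtterances, agentUtterances) {
  const tagged = [
    ...userUtterances.map((u) => ({ ...u, speaker: "User" })),
    ...agentUtterances.map((u) => ({ ...u, speaker: "Agent" })),
  ];
  return tagged
    .sort((a, b) => a.t - b.t)
    .map((u) => `${u.speaker}: ${u.text}`)
    .join("\n");
}
```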

Milestone 3 — CRM Integration & Polish

Goal: Store extracted entities in a CRM and polish the experience.
Pick one or more:
  • CRM Storage: Push extracted entities to a real or mock CRM via API: HubSpot, Salesforce, Airtable, Google Sheets, Notion
  • CRM Dashboard: Simple page showing all past conversations with their extracted entities — searchable, filterable
  • Embeddable Script: Package the plugin as a single <script> tag that any website can drop in (like Google Analytics or Intercom)
  • Conversation Summary: In addition to entities, generate a human-readable summary of the call
  • Multi-language Support: Handle Hindi/English conversations — entity extraction works correctly regardless of language
  • Plugin Configurability: A config object where the client specifies agent WebSocket URL, entity schema, CRM endpoint, widget theme/colors
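The configurability item could look like the object below, plus a minimal validator so a misconfigured embed fails loudly. Every key name here is an assumption about your plugin's API, not a standard:

```javascript
// Keys the widget cannot function without.
const REQUIRED_KEYS = ["agentWsUrl", "entitySchema", "crmEndpoint"];

// Returns the list of missing required keys (empty array = valid config).
function validateWidgetConfig(config) {
  return REQUIRED_KEYS.filter((k) => config[k] == null);
}

// Example of what a client might pass when embedding the widget.
const exampleConfig = {
  agentWsUrl: "wss://agent.example.com/ws",       // Pipecat WebSocket URL
  entitySchema: { entities: [] },                  // per-client entity schema
  crmEndpoint: "https://crm.example.com/webhook",  // where entities are pushed
  theme: { primaryColor: "#1a73e8", position: "bottom-right" }, // optional
};
```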

Judging Criteria

  • Voice Experience (30%): Does the browser voice widget work smoothly? Low latency? Clear audio? Good visual feedback? Does it feel like a real phone call in the browser?
  • Entity Extraction Quality (30%): Are entities extracted accurately? Does it handle messy transcripts, partial information, and multilingual conversations? Is the schema configurable?
  • Integration & Polish (20%): Is CRM storage working? Is the plugin embeddable? Is there a dashboard or summary view?
  • Engineering Quality (20%): Clean code, good abstractions, error handling. Could this be shipped to a client with minimal changes?