
Building an AI Phone Receptionist: A Technical Deep Dive Into WorkPhoneFox

March 5, 2026 • AILabTX

Contractors lose thousands of dollars in potential revenue every year from missed calls. When you're knee-deep in a plumbing repair or up on a roof, you can't answer the phone. By the time you call back, the customer has already moved on.

WorkPhoneFox solves this with an AI-powered phone receptionist. When a contractor misses a call, the call automatically forwards to a Twilio number, where an AI agent answers, collects the caller's information, and notifies the business owner, all in real time.

Here's how we built it.

The Stack

Framework: Next.js (App Router) with a custom WebSocket server
Auth & Database: Firebase Auth + Firestore (multi-tenant)
Voice AI: OpenAI Realtime API
Telephony: Twilio (number provisioning, media streams, SMS)
Email: Resend (call summaries)
UI: shadcn/ui + Tailwind CSS
Deployment: Railway (persistent WebSocket support)

Architecture: Why a Custom Server Wraps Next.js

The most significant architectural decision was wrapping Next.js with a custom HTTP + WebSocket server. Twilio's media stream sends real-time audio over a persistent WebSocket connection. Next.js route handlers don't expose the HTTP upgrade event, so they can't accept WebSocket connections on their own. The solution: a custom Node.js HTTP server that intercepts WebSocket upgrades on a dedicated endpoint and delegates everything else to Next.js.
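The routing decision at the heart of that server can be sketched as a small pure function. This is a minimal sketch, not the production code: the endpoint path `/api/media-stream` and the names `classifyUpgrade` and `UpgradeRequest` are assumptions made for illustration.

```typescript
// Minimal structural type so the sketch needs no Node.js type imports.
interface UpgradeRequest {
  url?: string;
  headers: Record<string, string | undefined>;
}

type UpgradeTarget = "media-stream" | "next";

// Hypothetical endpoint path; the post does not name the real one.
const MEDIA_STREAM_PATH = "/api/media-stream";

// Decide whether a request is a WebSocket upgrade for the media stream
// endpoint or something that should fall through to Next.js.
function classifyUpgrade(req: UpgradeRequest): UpgradeTarget {
  const upgradeHeader = (req.headers["upgrade"] ?? "").toLowerCase();
  // Strip the query string before comparing paths.
  const pathname = (req.url ?? "/").split("?")[0];
  return upgradeHeader === "websocket" && pathname === MEDIA_STREAM_PATH
    ? "media-stream"
    : "next";
}
```

In a real server this check would sit inside the HTTP server's `upgrade` event listener. A "media-stream" result would hand the socket to a WebSocket library, and anything else would go to the Next.js request handler.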

Each active call is tracked as a session in memory, associating incoming Twilio streams with the AI processing pipeline.
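An in-memory session registry along these lines might look like the following. This is a sketch under assumptions: the `CallSession` fields and the `SessionRegistry` name are hypothetical, and keying by Twilio's stream SID is one plausible choice rather than a confirmed detail.

```typescript
interface CallSession {
  streamSid: string; // Twilio's identifier for the media stream
  tenantId: string;  // which business this call belongs to
  startedAt: number; // epoch milliseconds
}

class SessionRegistry {
  private sessions = new Map<string, CallSession>();

  // Register a session when the media stream starts.
  start(streamSid: string, tenantId: string, now: number = Date.now()): CallSession {
    const session: CallSession = { streamSid, tenantId, startedAt: now };
    this.sessions.set(streamSid, session);
    return session;
  }

  get(streamSid: string): CallSession | undefined {
    return this.sessions.get(streamSid);
  }

  // Remove the session and return the call duration in whole seconds,
  // or undefined if the stream was never registered.
  end(streamSid: string, now: number = Date.now()): number | undefined {
    const session = this.sessions.get(streamSid);
    if (!session) return undefined;
    this.sessions.delete(streamSid);
    return Math.round((now - session.startedAt) / 1000);
  }

  get activeCount(): number {
    return this.sessions.size;
  }
}
```

Keeping sessions in process memory is simple and fast, but it means active calls do not survive a restart and are not shared across multiple server instances.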

The Call Flow: End to End

Here's what happens when someone calls a WorkPhoneFox user:

1. The caller dials the contractor's personal number and gets no answer.
2. The call forwards to a Twilio number via carrier-specific forwarding codes.
3. Twilio hits our webhook, which returns TwiML connecting a media stream.
4. Twilio opens a WebSocket to our server and audio frames start flowing.
5. The OpenAI Realtime API processes the audio: greeting, conversation, lead capture.
6. On call end, the email summary is sent, the lead is saved, and usage is tracked.

The entire voice interaction happens over WebSocket with bidirectional audio streaming. No polling, no REST calls during the conversation.
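To make the webhook step concrete, here is a sketch of a helper that produces the TwiML response. `<Connect>`, `<Stream>`, and `<Parameter>` are standard TwiML elements. The function name, the stream URL, and the `tenantId` parameter are assumptions for illustration.

```typescript
// Escape characters that are unsafe inside XML attribute values.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build TwiML that tells Twilio to open a bidirectional media stream.
// Custom values travel as <Parameter> elements and arrive in the
// stream's "start" message.
function buildStreamTwiml(streamUrl: string, params: Record<string, string> = {}): string {
  const parameters = Object.entries(params)
    .map(([name, value]) => `<Parameter name="${escapeXml(name)}" value="${escapeXml(value)}"/>`)
    .join("");
  return (
    `<?xml version="1.0" encoding="UTF-8"?>` +
    `<Response><Connect><Stream url="${escapeXml(streamUrl)}">${parameters}</Stream></Connect></Response>`
  );
}
```

A webhook handler would return this string with a `text/xml` content type, for example `buildStreamTwiml("wss://example.com/api/media-stream", { tenantId: "t_1" })`.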

Zero Transcoding: The Key Latency Optimization

Real-time voice is latency-sensitive. Every codec conversion adds delay. The critical optimization: Twilio and OpenAI Realtime both support g711_ulaw natively.

Audio flows directly from Twilio to OpenAI and back without any transcoding step. Previous architectures would convert ulaw to PCM to another format and back, adding 100–200ms of latency. With zero transcoding, end-to-end latency stays under 500ms.
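Because both sides exchange base64-encoded μ-law, the relay reduces to re-wrapping payloads. The sketch below assumes Twilio's Media Streams message shapes and the event names from the beta OpenAI Realtime API (`session.update`, `input_audio_buffer.append`). Exact field names can differ between API versions, and the function names are hypothetical.

```typescript
// Subset of Twilio Media Streams messages relevant to audio relay.
type TwilioInbound =
  | { event: "start"; start: { streamSid: string } }
  | { event: "media"; media: { payload: string } } // base64 mu-law, 8 kHz
  | { event: "stop" };

// Session settings asking the model to use mu-law in both directions,
// so no decode or re-encode step is needed.
const sessionUpdate = {
  type: "session.update",
  session: { input_audio_format: "g711_ulaw", output_audio_format: "g711_ulaw" },
};

// Caller audio: pass the base64 payload through untouched.
function toOpenAI(msg: TwilioInbound): { type: "input_audio_buffer.append"; audio: string } | null {
  if (msg.event !== "media") return null;
  return { type: "input_audio_buffer.append", audio: msg.media.payload };
}

// Model audio: wrap the base64 delta in a Twilio media message.
function toTwilio(streamSid: string, base64Delta: string) {
  return { event: "media", streamSid, media: { payload: base64Delta } };
}
```

Neither function inspects the audio bytes. The only work per frame is JSON parsing and serialization.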

Multi-Tenancy With Firestore

Every tenant gets an isolated data hierarchy — business info, AI agent configurations, provisioned phone numbers, call records with transcripts, team members with roles, and usage tracking. All queries are scoped through a tenant reference helper.

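When a call comes in, the system looks up which tenant owns the called number using a Firestore collection group query, which searches the phone-number records across all tenants in one query. The webhook resolves the tenant from the dialed number alone, without needing a tenant ID up front and without exposing the tenant structure.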
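A tenant reference helper and the reverse lookup can be sketched with plain path logic. The layout `tenants/{tenantId}/{collection}/{docId}` and both function names are assumptions, since the post does not show the actual schema.

```typescript
// Hypothetical layout: tenants/{tenantId}/{collection}/{docId}
// Build a tenant-scoped document path so every query stays inside one tenant.
function tenantPath(tenantId: string, ...segments: string[]): string {
  if (!/^[A-Za-z0-9_-]+$/.test(tenantId)) {
    throw new Error(`invalid tenant id: ${tenantId}`);
  }
  return ["tenants", tenantId, ...segments].join("/");
}

// Recover the owning tenant from the path of a document found by a
// collection group query, e.g. "tenants/t_1/phoneNumbers/p_2".
function tenantIdFromPath(docPath: string): string | null {
  const parts = docPath.split("/");
  if (parts.length < 2 || parts[0] !== "tenants") return null;
  return parts[1] ?? null;
}
```

With the Firebase Admin SDK, the lookup itself would be a `collectionGroup` query filtered on the called number. The matched document's path then yields the tenant through `tenantIdFromPath`.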

AI Agent: Prompt Generation and Tool Use

Each agent's system prompt is dynamically generated from the business context — name, industry, services, hours, emergency handling preferences. The AI knows it's answering for "Acme Plumbing" and that the owner "John" is busy, without any manual prompt engineering from the user.
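As an illustration of that generation step, here is a minimal sketch of a prompt builder. The `BusinessContext` fields, the two emergency modes, and the exact wording are assumptions. The real prompt template is not shown in this post.

```typescript
interface BusinessContext {
  businessName: string;
  ownerName: string;
  industry: string;
  services: string[];
  hours: string;
  emergencyHandling: "notify_owner" | "take_message"; // hypothetical options
}

// Assemble the system prompt from structured business data, one
// instruction per line.
function buildSystemPrompt(ctx: BusinessContext): string {
  const emergency =
    ctx.emergencyHandling === "notify_owner"
      ? `If the caller describes an emergency, say that ${ctx.ownerName} will be alerted right away.`
      : `If the caller describes an emergency, take a detailed message and mark it urgent.`;
  return [
    `You are the phone receptionist for ${ctx.businessName}, a ${ctx.industry} business.`,
    `${ctx.ownerName} is busy on a job and cannot take the call.`,
    `Services offered: ${ctx.services.join(", ")}.`,
    `Business hours: ${ctx.hours}.`,
    emergency,
    `Collect the caller's name, phone number, the service needed, and how urgent it is.`,
  ].join("\n");
}
```

Because the prompt is derived from stored fields, editing the business profile regenerates it without the user ever seeing prompt text.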

The agent has a single critical tool: lead info collection. When the AI has gathered enough information (caller name, phone number, service needed, urgency level), it calls this tool. The system saves the lead, and for urgent calls, immediately sends an SMS to the contractor's emergency number.
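The tool and its handler might be sketched as follows. The tool name `collect_lead_info`, the field names, and the `LeadDeps` interface are hypothetical. The definition uses the function-tool shape from the OpenAI Realtime API, and the dependencies are synchronous here only to keep the sketch short.

```typescript
type Urgency = "low" | "normal" | "urgent";

interface Lead {
  callerName: string;
  phone: string;
  serviceNeeded: string;
  urgency: Urgency;
}

// Tool definition sent to the model (name and fields are assumptions).
const collectLeadTool = {
  type: "function",
  name: "collect_lead_info",
  description: "Record the caller's details once all fields are known.",
  parameters: {
    type: "object",
    properties: {
      callerName: { type: "string" },
      phone: { type: "string" },
      serviceNeeded: { type: "string" },
      urgency: { type: "string", enum: ["low", "normal", "urgent"] },
    },
    required: ["callerName", "phone", "serviceNeeded", "urgency"],
  },
};

// Injected side effects; real versions would write to Firestore and call Twilio.
interface LeadDeps {
  saveLead(lead: Lead): void;
  sendSms(to: string, body: string): void;
}

// Handle the model's tool call: always save, text the owner only if urgent.
function handleLeadToolCall(argsJson: string, emergencyNumber: string, deps: LeadDeps): Lead {
  const lead = JSON.parse(argsJson) as Lead;
  deps.saveLead(lead);
  if (lead.urgency === "urgent") {
    deps.sendSms(
      emergencyNumber,
      `Urgent call from ${lead.callerName} (${lead.phone}): ${lead.serviceNeeded}`,
    );
  }
  return lead;
}
```

Injecting `saveLead` and `sendSms` keeps the urgency rule testable without touching a database or the SMS provider.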

Onboarding: From Signup to Live in 3 Steps

Step 1: Signup — Firebase Auth creates the user. The API creates a Firestore tenant with free plan defaults.

Step 2: Business Info — User enters their business name, industry, services, and phone number. On submit, the backend auto-provisions a Twilio number matched to their area code, auto-generates an AI agent with a context-aware system prompt, and returns carrier-specific forwarding codes.

Step 3: Forwarding Setup — User dials a code on their phone to forward missed calls. When the first forwarded call arrives, the system automatically verifies the setup and marks the tenant as live.
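Matching a provisioned number to the user's area code starts with parsing the phone number they entered. The following is a minimal sketch for US numbers only. `parseUsNumber` and `ParsedNumber` are hypothetical names, and the area code is returned as a number because that is the type the Twilio search expects.

```typescript
interface ParsedNumber {
  e164: string;     // e.g. "+15125550142"
  areaCode: number; // numeric, ready to pass to a number search
}

// Normalize a user-entered US phone number and extract its area code.
// Returns null when the input is not a valid 10-digit NANP number.
function parseUsNumber(input: string): ParsedNumber | null {
  const digits = input.replace(/\D/g, "");
  // Drop a leading country code "1" if present.
  const national = digits.length === 11 && digits.startsWith("1") ? digits.slice(1) : digits;
  // NANP area codes are three digits and never start with 0 or 1.
  if (!/^[2-9]\d{9}$/.test(national)) return null;
  return { e164: `+1${national}`, areaCode: Number(national.slice(0, 3)) };
}
```

If no local numbers are available for the parsed area code, the provisioning step needs a fallback, such as searching without the area code filter.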

Two Voice Pipelines

The system supports two voice processing pipelines:

OpenAI Realtime (Primary): Direct WebSocket connection to OpenAI. Sub-200ms latency, native tool calling, zero transcoding. Cost: ~$0.31/min.

Claude Chain (Secondary): A modular pipeline — Deepgram for speech-to-text, Claude for reasoning, ElevenLabs for text-to-speech. Lower cost (~$0.10–0.20/min) but higher latency. Uses TTS connection pooling (pre-warming 3 WebSocket connections) to cut ~300ms of handshake overhead per turn.
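The pre-warming idea can be sketched as a small generic pool. This is an assumption about the shape of the technique, not the production code: `WarmPool` is a hypothetical name, and `connect` stands in for whatever opens a TTS WebSocket.

```typescript
// Keeps `size` connections started ahead of time so each turn can skip
// the handshake. `connect` is supplied by the caller.
class WarmPool<T> {
  private ready: Promise<T>[] = [];
  private readonly connect: () => Promise<T>;
  private readonly size: number;

  constructor(connect: () => Promise<T>, size: number = 3) {
    this.connect = connect;
    this.size = size;
    this.refill();
  }

  // Start new connections until the pool is back at its target size.
  private refill(): void {
    while (this.ready.length < this.size) {
      this.ready.push(this.connect());
    }
  }

  // Hand out the oldest warm connection (or start a fresh one if the
  // pool is empty), then start a replacement in the background.
  acquire(): Promise<T> {
    const next = this.ready.shift() ?? this.connect();
    this.refill();
    return next;
  }

  get warmCount(): number {
    return this.ready.length;
  }
}
```

A production pool would also need to handle failed or stale connections, for example by checking connection health before handing one out.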

A pipeline router selects the appropriate engine based on the agent's configuration. Both implement the same session interface, making them interchangeable.
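A sketch of that router and the shared interface is shown below. The `VoiceSession` interface, the engine identifiers, and the default-to-primary behavior are assumptions for illustration. The method bodies are placeholders.

```typescript
type Engine = "openai-realtime" | "claude-chain";

// Contract both pipelines satisfy, so call handling code never needs
// to know which engine is running.
interface VoiceSession {
  readonly engine: Engine;
  handleAudio(base64Ulaw: string): void;
  close(): void;
}

class OpenAIRealtimeSession implements VoiceSession {
  readonly engine = "openai-realtime" as const;
  handleAudio(_base64Ulaw: string): void {
    // Placeholder: forward the frame to the Realtime WebSocket.
  }
  close(): void {}
}

class ClaudeChainSession implements VoiceSession {
  readonly engine = "claude-chain" as const;
  handleAudio(_base64Ulaw: string): void {
    // Placeholder: feed the frame into speech-to-text.
  }
  close(): void {}
}

// Pick the engine from the agent's configuration, defaulting to the
// primary pipeline when none is set.
function createSession(agent: { engine?: Engine }): VoiceSession {
  switch (agent.engine ?? "openai-realtime") {
    case "claude-chain":
      return new ClaudeChainSession();
    default:
      return new OpenAIRealtimeSession();
  }
}
```

Since both classes satisfy the same interface, switching an agent between pipelines is a configuration change rather than a code change.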

Lessons Learned

Twilio SDK quirks are real. The areaCode parameter must be a number, not a string. Local and toll-free number searches have separate type signatures. Some response fields aren't in the TypeScript types and require casting.

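Lazy initialization patterns are essential for serverless-adjacent architectures. Any SDK that initializes eagerly at import time (Firebase, database clients) can break Next.js builds, because the build evaluates those modules before runtime credentials and environment variables are available. A proxy pattern that defers initialization until first use at runtime is clean and transparent to callers.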
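One way to write such a proxy is sketched here. The helper name `lazy` is hypothetical, and the sketch covers property reads and method calls only, which is usually enough for SDK clients.

```typescript
// Wrap an expensive client so it is only constructed on first use.
// Importing a module that exports lazy(...) has no side effects.
function lazy<T extends object>(init: () => T): T {
  let instance: T | undefined;
  const resolve = (): T => {
    if (instance === undefined) instance = init();
    return instance;
  };
  return new Proxy({} as T, {
    get(_target, prop) {
      const target = resolve();
      const value = Reflect.get(target, prop);
      // Bind methods so `this` refers to the real instance, not the proxy.
      return typeof value === "function" ? value.bind(target) : value;
    },
  });
}
```

A module can then export something like `export const db = lazy(() => createClient())`, where `createClient` is whatever factory the SDK provides. Call sites use `db` exactly as they would use the eager version.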

Codec alignment matters more than model speed. Shaving 150ms off transcoding latency had a bigger perceptual impact than any model optimization. Users notice when a voice agent hesitates.

Auto-provisioning reduces onboarding friction dramatically. Every manual step is a drop-off point. Automatically provisioning the phone number and generating the agent prompt cut onboarding from ~10 minutes to under 2.

What's Next

Analytics dashboard with call volume trends and lead conversion
Calendar and CRM integrations via agent tools
Outbound calling campaigns
Team collaboration features

The core insight: contractors don't need another app to manage. They need something that works the moment they set it up and sends them a text when it matters. Everything in the architecture — from zero-transcoding audio to auto-provisioned numbers — serves that goal.
