← Back to Blog
Deep Dives·Jan 17, 2025·8 min read

The Real Cost of Running AI in Production

Token pricing looks cheap until you're processing 100k requests/day. Here's what AI actually costs at scale, and how to not go broke.

AIProductionCostsArchitecture
JV

Jose Viscasillas

January 17, 2025 · 8 min read

The Real Cost of Running AI in Production

"GPT-4 is only $0.03 per 1K tokens!"

Cool. Now multiply that by 100,000 daily users, average 2,000 tokens per request, and your monthly bill is $180,000.

AI pricing looks cheap in isolation. At scale, it can bankrupt a startup faster than AWS ever could. I've seen teams blow through $50k in their first month because nobody did the math.

Let me show you the real costs, the gotchas, and how to actually run AI in production without selling a kidney.

The Token Math Nobody Does

First, let's understand how tokens work:

1 token ≈ 4 characters (English)

  • "Hello" = 1 token
  • "authentication" = 3 tokens
  • Your average 500-word prompt = ~700 tokens
  • The response you get back = 200-500 tokens typically

The multiplier effect:

text
Input tokens: Your prompt + system prompt + context
Output tokens: The AI's response
Total: All of the above × number of requests

Real Example: Customer Support Bot

A typical customer support query:

  • System prompt: 200 tokens
  • Conversation history: 1,000 tokens (last 5 messages)
  • User message: 100 tokens
  • Retrieved context (RAG): 500 tokens
  • AI response: 300 tokens

Total per request: 2,100 tokens

At GPT-4 Turbo pricing ($10/M input, $30/M output):

  • Input: 1,800 tokens × $0.00001 = $0.018
  • Output: 300 tokens × $0.00003 = $0.009
  • Per request: $0.027

Looks cheap, right? Now scale it:

  • 1,000 requests/day = $27/day = $810/month
  • 10,000 requests/day = $270/day = $8,100/month
  • 100,000 requests/day = $2,700/day = $81,000/month

And this is GPT-4 Turbo, the "cheap" GPT-4 variant.

The Pricing Landscape (2025)

Here's what you're actually looking at:

OpenAI

ModelInput (per 1M)Output (per 1M)Use Case
GPT-4o$5$15Best balance
GPT-4 Turbo$10$30Complex tasks
GPT-3.5 Turbo$0.50$1.50Simple tasks
Embedding (ada)$0.10N/AVector search

Anthropic

ModelInput (per 1M)Output (per 1M)Use Case
Claude 3 Opus$15$75Best quality
Claude 3 Sonnet$3$15Balanced
Claude 3 Haiku$0.25$1.25Fast/cheap

The Hidden Costs

Rate limits and throttling:

  • OpenAI: 10,000 TPM (tokens per minute) on free tier
  • Hitting limits means queuing → latency → unhappy users
  • Higher tiers cost more ($$$)

Retries and failures:

  • APIs fail ~1-3% of the time
  • You still pay for failed tokens
  • Retry logic doubles some costs

Context overhead:

  • System prompts sent every request
  • Conversation history accumulates
  • RAG context adds up fast

Strategies That Actually Reduce Costs

I've helped teams cut AI costs by 70-90%. Here's what works:

1. Model Routing

Don't use GPT-4 for everything. Route based on complexity:

typescript
async function smartRoute(message: string): Promise<Model> {
  // Simple queries → cheap model
  if (message.length < 50 && !requiresReasoning(message)) {
    return 'gpt-3.5-turbo';
  }
  
  // Complex queries → expensive model
  if (requiresReasoning(message) || message.includes('analyze')) {
    return 'gpt-4-turbo';
  }
  
  // Default to mid-tier
  return 'gpt-4o';
}

function requiresReasoning(msg: string): boolean {
  const complexPatterns = [
    /why.*(happen|work|fail)/i,
    /compare.*(and|vs|versus)/i,
    /explain.*(how|why)/i,
    /analyze/i
  ];
  return complexPatterns.some(p => p.test(msg));
}

Impact: 40-60% cost reduction with minimal quality loss.

2. Aggressive Caching

Most AI requests aren't unique. Cache them:

typescript
import { Redis } from 'ioredis';

const redis = new Redis();
const CACHE_TTL = 86400; // 24 hours

async function cachedCompletion(prompt: string): Promise<string> {
  const cacheKey = `ai:${hashPrompt(prompt)}`;
  
  // Check cache
  const cached = await redis.get(cacheKey);
  if (cached) return cached;
  
  // Generate
  const result = await openai.chat.completions.create({
    model: 'gpt-4-turbo',
    messages: [{ role: 'user', content: prompt }]
  });
  
  const text = result.choices[0].message.content;
  
  // Cache
  await redis.setex(cacheKey, CACHE_TTL, text);
  
  return text;
}

For more sophisticated caching, use semantic similarity:

typescript
async function semanticCache(prompt: string): Promise<string | null> {
  const embedding = await embed(prompt);
  
  // Find similar cached prompts
  const similar = await vectorDB.search(embedding, { threshold: 0.95 });
  
  if (similar.length > 0) {
    return similar[0].response;
  }
  
  return null;
}

Impact: 20-70% cost reduction depending on query diversity.

3. Prompt Compression

Long system prompts eat tokens every request:

Before: 500-token system prompt

text
You are a helpful customer service agent for TechCorp Inc. 
Your role is to assist customers with questions about our 
products, services, and policies. You should be friendly 
but professional. Always greet the customer warmly...
[continues for 500 tokens]

After: 100-token system prompt

text
Role: TechCorp support agent
Tone: Friendly, professional
Do: Answer product questions, resolve issues, cite policies
Don't: Give refunds over $100 (escalate), share internal info

Same behavior, 80% fewer tokens.

4. Context Window Management

Don't send entire conversation history:

typescript
function prepareContext(messages: Message[], maxTokens: number = 2000) {
  let tokenCount = 0;
  const included: Message[] = [];
  
  // Always include system message
  included.push(messages[0]);
  tokenCount += countTokens(messages[0]);
  
  // Add recent messages until we hit limit
  for (let i = messages.length - 1; i > 0; i--) {
    const msgTokens = countTokens(messages[i]);
    if (tokenCount + msgTokens > maxTokens) break;
    included.unshift(messages[i]);
    tokenCount += msgTokens;
  }
  
  return included;
}

Or summarize older context:

typescript
async function compressHistory(messages: Message[]): Promise<string> {
  if (messages.length < 10) return messages;
  
  // Summarize older messages
  const old = messages.slice(0, -5);
  const summary = await summarize(old); // Use cheap model
  
  // Keep recent messages verbatim
  const recent = messages.slice(-5);
  
  return [
    { role: 'system', content: `Previous context: ${summary}` },
    ...recent
  ];
}

5. Streaming and Early Termination

Stream responses and stop when you have enough:

typescript
async function streamWithLimit(prompt: string, maxTokens: number = 200) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4-turbo',
    messages: [{ role: 'user', content: prompt }],
    stream: true,
    max_tokens: maxTokens  // Hard limit
  });
  
  let output = '';
  for await (const chunk of stream) {
    output += chunk.choices[0]?.delta?.content || '';
    
    // Early termination if we detect completion
    if (output.includes('[DONE]') || output.endsWith('.')) {
      // Check if it's a complete thought
      if (looksComplete(output)) break;
    }
  }
  
  return output;
}

6. Batching Requests

For non-real-time use cases, batch requests:

typescript
async function batchProcess(items: string[]): Promise<string[]> {
  // Combine into single prompt
  const batchPrompt = `
Process each item and respond in JSON array format:
${items.map((item, i) => `${i + 1}. ${item}`).join('\n')}
`;
  
  const result = await openai.chat.completions.create({
    model: 'gpt-4-turbo',
    messages: [{ role: 'user', content: batchPrompt }]
  });
  
  return JSON.parse(result.choices[0].message.content);
}

// Instead of 10 requests, 1 request
const results = await batchProcess(tenItems);

Impact: 50-80% cost reduction for batch workloads.

The Latency-Cost Tradeoff

Cheaper often means slower:

ApproachCostLatency
GPT-4 Turbo, no cache$$$2-5s
GPT-3.5 with cache$0.1-2s
Local LLMFixed0.5-3s
Hybrid (cache + fallback)$$0.1-5s

For user-facing features, latency matters. Sometimes paying more for GPT-4o is worth it because responses are faster.

Monitoring and Alerting

You need visibility into costs:

typescript
// Wrap all AI calls
async function trackedCompletion(params: CompletionParams) {
  const start = Date.now();
  
  const result = await openai.chat.completions.create(params);
  
  // Track usage
  await metrics.record({
    model: params.model,
    inputTokens: result.usage.prompt_tokens,
    outputTokens: result.usage.completion_tokens,
    latencyMs: Date.now() - start,
    endpoint: params.endpoint || 'unknown',
    success: true
  });
  
  // Alert on anomalies
  if (result.usage.total_tokens > 10000) {
    await alerts.send(`High token usage: ${result.usage.total_tokens}`);
  }
  
  return result;
}

Set up alerts for:

  • Daily spend exceeding threshold
  • Single request with unusual token count
  • Error rate spikes (you're paying for failures)
  • Latency degradation

The Self-Hosted Option

At extreme scale, self-hosting makes sense:

The math:

  • GPT-4 equivalent at 1M requests/day = ~$100k/month
  • High-end GPU server = ~$3-5k/month
  • Running Llama 70B = Unlimited requests

The catch:

  • Quality gap is real (though shrinking)
  • You need ML ops expertise
  • Hardware management is work

The middle ground:

typescript
// Route based on cost threshold
async function costAwareRoute(prompt: string) {
  const estimate = estimateTokens(prompt);
  
  // Expensive query → self-hosted
  if (estimate > 5000) {
    return await localLLM.complete(prompt);
  }
  
  // Normal query → API
  return await openai.complete(prompt);
}

Real Production Architecture

Here's what a cost-optimized production setup looks like:

text
User Request
[Semantic Cache] → Hit? → Return cached response
     ↓ Miss
[Request Classifier] → Simple? → GPT-3.5 Turbo
     ↓ Complex
[Context Compressor] → Trim history, compress system prompt
[GPT-4o/Turbo] → Generate response
[Response Cache] → Store for future
Return to user

With this setup, my typical cost reduction: 65-75% vs naive implementation.

The Bottom Line

AI in production isn't expensive by accident—it's expensive by negligence. The difference between $10k/month and $100k/month is often just:

  1. Routing: Use cheap models for simple tasks
  2. Caching: Don't regenerate identical responses
  3. Compression: Minimize tokens per request
  4. Batching: Combine requests where possible
  5. Monitoring: Catch cost spikes early

Do the math before you launch. I've seen startups die from AI costs. Don't be one of them.

And if someone tells you AI is "basically free"? They've never run it in production.

---

The best AI architecture is the one that delivers value without bankrupting you. Optimize for both.

JV

Written by Jose Viscasillas

Senior Software Engineer building video platforms at ON24. 21 years of coding experience. I write about React, TypeScript, AI, and developer tools.

Recommended Reads

📬

Subscribe to the Newsletter

New articles delivered to your inbox. No spam, unsubscribe anytime.

Join 500+ developers getting weekly insights on React, TypeScript, and building products.