
How to use the OpenAI API in Node.js (with streaming, retries, and cost control)

How to use the OpenAI API in Node.js with streaming via SSE, exponential retries, per-request cost ceilings, and the patterns that cut a $4,200 monthly bill to $640, without changing the UX.

The first time I shipped an integration around the OpenAI API in Node.js for a paying client, the bill landed at $4,200 for the first month. The app was a customer-facing chatbot doing maybe 800 conversations a day. Without streaming, without retries, and with the entire chat history sent on every turn, the cost was a runaway. After two weeks of work — streaming responses, prompt caching, a hard token ceiling per conversation, and a tier-based model fallback — the same traffic cost $640 the next month. Same UX, 85% lower bill.

That gap is what most OpenAI tutorials skip. They show you openai.chat.completions.create() and stop. The setup below is what I now ship for production: streaming via Server-Sent Events, exponential backoff, structured retries, a per-request cost ceiling, and the failure modes you only hit at 3 a.m. on a Friday. Node 20 LTS, TypeScript, OpenAI SDK 4.x.

Quick start: streaming a response in 30 lines

Working baseline. Read the production sections after.

bash
npm i openai
npm i -D typescript @types/node tsx
TypeScript
// quickstart.ts
import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const stream = await client.chat.completions.create({
  model: 'gpt-4o-mini',
  stream: true,
  messages: [
    { role: 'system', content: 'You are a terse Node.js engineer.' },
    { role: 'user', content: 'Why does Express return text/html by default?' },
  ],
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
bash
OPENAI_API_KEY=sk-... npx tsx quickstart.ts

Works. Also wrong for production: no retries on transient 429s, no cost cap, no graceful degradation when the model is slow, no abort signal when the user closes the tab.

What is wrong with the typical OpenAI tutorial

Four production failures I have personally been paged for:

  1. No retry logic on 429 / 500. OpenAI returns rate-limit errors during traffic spikes and 500s during their own incidents. Default behaviour is to crash the request — your user sees an error toast.
  2. No cost ceiling. One prompt-injection that asks the model to “repeat the previous text 100 times” will happily generate until it hits the output limit, on every poisoned request, at GPT-4o output prices. OpenAI’s own production best-practices docs recommend per-request limits and you should listen.
  3. Sending the full chat history every turn. A 20-turn conversation with no summarisation hits the context window in dollars before it hits it in tokens.
  4. No abort handling. When the user closes the browser tab mid-stream, your server keeps generating tokens and pays for the full response. At scale this is real money.
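
Failure 1 is worth understanding mechanically even though the SDK handles it for you; don't stack a second loop on top of maxRetries in production. A sketch of exponential backoff with full jitter, with illustrative base and cap values:

```typescript
// Exponential backoff with full jitter: the window doubles per attempt and
// the actual delay is a random point inside it, which avoids thundering herds.
function backoffDelay(attempt: number, baseMs = 500, capMs = 8_000): number {
  const windowMs = Math.min(capMs, baseMs * 2 ** attempt); // 500, 1000, 2000, ...
  return Math.random() * windowMs;                         // full jitter in [0, windowMs)
}

// Retry only transient failures (429 / 5xx). Errors without an HTTP status
// (network drops) are left to the SDK, which retries those itself.
async function withRetries<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const transient = err.status === 429 || (err.status >= 500 && err.status < 600);
      if (!transient || attempt >= maxRetries) throw err;
      await new Promise(resolve => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
}
```

The jitter matters: if every client waits exactly 500 ms after a 429, they all retry in the same instant and the spike repeats.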

Project structure for an AI integration

text
src/
  ai/
    client.ts         # OpenAI client singleton + retry config
    chat.ts           # streaming chat with cost ceiling
    cost.ts           # token + dollar accounting
    prompts/          # system prompts as named exports
  routes/
    chat.ts           # SSE endpoint, abort handling
  env.ts              # zod-validated OPENAI_API_KEY etc.

Two rules that pay off the first time something breaks:

  • One OpenAI client per process. The SDK reuses HTTP keep-alive — multiple instances mean multiple TCP pools.
  • System prompts live in named files, not inline string concatenations. They become product surface area; treat them like code.
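
What rule two looks like in practice: a hypothetical src/ai/prompts/support.ts with one named export per prompt, so every prompt change is a reviewable diff and call sites are greppable.

```typescript
// src/ai/prompts/support.ts (hypothetical file and wording)
// One named export per prompt. Versioned in git, reviewed like code.
export const SUPPORT_AGENT = `
You are a terse support engineer for a Node.js SaaS.
Answer in at most three sentences. If unsure, say so and point to the docs.
`.trim();
```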

Resilient client with retries and timeouts

The OpenAI SDK has retries built in but the defaults are conservative. For production you want explicit configuration:

TypeScript
// src/ai/client.ts
import OpenAI from 'openai';
import { env } from '../env';

export const openai = new OpenAI({
  apiKey: env.OPENAI_API_KEY,
  maxRetries: 3,                       // SDK does exponential backoff: 0.5s, 1s, 2s
  timeout: 60_000,                     // hard ceiling per request
});

export const MODELS = {
  default:  'gpt-4o-mini',             // cheap, fast, good enough for most things
  reasoning: 'gpt-4o',                 // when default fails twice in a row
  fallback:  'gpt-3.5-turbo',          // when both above are degraded
} as const;

The SDK’s default retry logic handles 429 and 5xx automatically. For application-level fallback to a cheaper model when the primary fails, wrap the call:

TypeScript
// src/ai/chat.ts
import type OpenAI from 'openai';
import { openai, MODELS } from './client';

export async function chatWithFallback(messages: OpenAI.ChatCompletionMessageParam[]) {
  for (const model of [MODELS.default, MODELS.reasoning, MODELS.fallback]) {
    try {
      return await openai.chat.completions.create({ model, messages });
    } catch (err: any) {
      // 429 / 503: try next model. 4xx (other): user error, surface it.
      if (err.status && err.status < 500 && err.status !== 429) throw err;
      console.warn(`Model ${model} failed (${err.status}), trying next`);
    }
  }
  throw new Error('All models exhausted');
}
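
The cascade is easier to unit-test when the ordering logic is separated from the SDK. A sketch, generic over any list of async attempts (names are mine, not the SDK's):

```typescript
// Try each attempt in order. Fall through on 429/5xx (the next attempt is a
// cheaper or healthier model); rethrow anything else, because a 400 will fail
// identically on every model.
async function firstAvailable<T>(attempts: Array<() => Promise<T>>): Promise<T> {
  let lastErr: unknown;
  for (const attempt of attempts) {
    try {
      return await attempt();
    } catch (err: any) {
      if (err.status && err.status < 500 && err.status !== 429) throw err;
      lastErr = err;
    }
  }
  throw lastErr ?? new Error('All models exhausted');
}
```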

Streaming via Server-Sent Events

The user wants the response to start appearing in the first second, not after 12 seconds when the full message arrives. SSE is the cheapest, most boring way to do that — works through every CDN, no WebSocket complexity, no client library needed.

TypeScript
// src/routes/chat.ts (Express)
import { Router } from 'express';
import { openai, MODELS } from '../ai/client';

export const chat = Router();

chat.post('/stream', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.flushHeaders();

  // Hook the request abort so we cancel the OpenAI call when user disconnects.
  const controller = new AbortController();
  req.on('close', () => controller.abort());

  try {
    const stream = await openai.chat.completions.create({
      model: MODELS.default,
      stream: true,
      max_tokens: 800,                 // hard cap per response
      messages: req.body.messages,
    }, { signal: controller.signal });

    for await (const chunk of stream) {
      const delta = chunk.choices[0]?.delta?.content ?? '';
      if (delta) res.write(`data: ${JSON.stringify({ delta })}\n\n`);
    }
    res.write('data: [DONE]\n\n');
    res.end();
  } catch (err: any) {
    if (controller.signal.aborted) return;             // user closed tab; abort error names vary, the signal does not
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
    res.end();
  }
});

Three details that matter at scale:

  • The req.on('close') hook. Without it, when the browser disconnects, your server keeps streaming tokens and you pay for them. With it, the AbortController stops the OpenAI request immediately.
  • max_tokens: 800 per response. Hard ceiling. Even at GPT-4o output pricing ($10 per million tokens) this caps any single completion at about $0.008; on gpt-4o-mini, the model this route actually uses, it is a small fraction of that. Survivable for any prompt-injection.
  • res.flushHeaders() before the loop. Some proxies (nginx, CloudFront) buffer until the first byte; flushing forces the connection to open. Behind nginx you may also need an X-Accel-Buffering: no response header to switch off proxy buffering for this route.

Client side is about twenty lines:

TypeScript
// client/chat.ts
const res = await fetch('/api/chat/stream', {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({ messages }),
  signal: AbortSignal.timeout(120_000),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  // Split on the SSE event delimiter. The last piece may be an incomplete
  // event, so keep it in the buffer for the next read instead of parsing it.
  const events = buffer.split('\n\n');
  buffer = events.pop()!;
  for (const line of events) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6);
    if (data === '[DONE]') return;
    const { delta } = JSON.parse(data);
    onToken(delta);                  // append to UI
  }
}
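
The buffer handling is the part people get wrong, so I keep it as a pure function: feed it the leftover buffer plus the new bytes, get back complete events and the new leftover. A sketch for data-only SSE events:

```typescript
// Extract complete SSE events from buffered text. Events end with a blank
// line ("\n\n"); whatever follows the last delimiter is an incomplete event
// and must be carried over to the next read, never parsed.
function parseSSE(carry: string, incoming: string): { events: string[]; carry: string } {
  const parts = (carry + incoming).split('\n\n');
  const rest = parts.pop() ?? '';
  const events = parts
    .filter(part => part.startsWith('data: '))
    .map(part => part.slice('data: '.length));
  return { events, carry: rest };
}
```

A pure function like this is trivially unit-testable with chunk boundaries placed mid-event, which is exactly where streaming bugs hide.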

Cost control: token counting and per-request ceilings

Three layers of cost control in production. Skip any of them and the bill bites:

  Layer                  Mechanism                                      Saves you from
  Per-response cap       max_tokens on the request                      Prompt injections asking for huge outputs
  Per-conversation cap   Track total tokens, refuse new turns past N    Looped conversations that grow unbounded
  Per-user daily cap     Redis counter on tokens consumed today         Abuse, scraping, runaway integrations

Token counting before sending — use tiktoken for exact numbers, or estimate at 1 token per 4 characters for budgeting:

TypeScript
// src/ai/cost.ts
import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4o-mini');

export function countTokens(messages: { role: string; content: string }[]) {
  return messages.reduce((sum, m) => sum + enc.encode(m.content).length + 4, 0);
}

const PRICING = {
  'gpt-4o-mini':   { in: 0.150 / 1e6, out: 0.600 / 1e6 },  // $/token, Mar 2026
  'gpt-4o':        { in: 2.50  / 1e6, out: 10.00 / 1e6 },
  'gpt-3.5-turbo': { in: 0.50  / 1e6, out: 1.50  / 1e6 },
};

export function estimateCost(model: keyof typeof PRICING, inTokens: number, outTokens: number) {
  const p = PRICING[model];
  return inTokens * p.in + outTokens * p.out;
}
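
A worked example against the pricing table above, to sanity-check the mini-versus-full gap on a typical 500-in/300-out chat turn (pricing inlined so the snippet runs standalone):

```typescript
// Pricing per token, matching the table in src/ai/cost.ts.
const PRICING = {
  'gpt-4o-mini': { in: 0.150 / 1e6, out: 0.600 / 1e6 },
  'gpt-4o':      { in: 2.50  / 1e6, out: 10.00 / 1e6 },
} as const;

function estimateCost(model: keyof typeof PRICING, inTokens: number, outTokens: number): number {
  const p = PRICING[model];
  return inTokens * p.in + outTokens * p.out;
}

// A typical chat turn: ~500 tokens of context in, ~300 tokens out.
const miniCost = estimateCost('gpt-4o-mini', 500, 300); // ≈ $0.000255
const fullCost = estimateCost('gpt-4o', 500, 300);      // ≈ $0.00425
```

At roughly 17× per turn, the model choice dominates everything else on the bill, which is why the fallback chain starts cheap.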

Refuse the request before sending if the user has blown their daily budget. rate-limiter-flexible against Redis is the one I use:

TypeScript
import { RateLimiterRedis } from 'rate-limiter-flexible';
import { redis } from '../db/redis';

const dailyTokenLimit = new RateLimiterRedis({
  storeClient: redis,
  keyPrefix: 'ai_tokens_daily',
  points: 50_000,                                        // tokens per user per day
  duration: 24 * 60 * 60,
});

export async function consumeBudget(userId: string, tokens: number) {
  await dailyTokenLimit.consume(userId, tokens);         // throws on exceed
}
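
For local development without Redis, the same consume-or-throw contract can be faked in memory. A sketch only: single process, state dies with the process, not a production substitute.

```typescript
// In-memory stand-in for the Redis limiter: one counter per user, reset when
// the 24h window rolls over.
class MemoryTokenBudget {
  private usage = new Map<string, { tokens: number; windowStart: number }>();

  constructor(private limit = 50_000, private windowMs = 24 * 60 * 60 * 1000) {}

  consume(userId: string, tokens: number): void {
    const now = Date.now();
    const entry = this.usage.get(userId);
    const expired = !entry || now - entry.windowStart >= this.windowMs;
    const spent = expired ? 0 : entry!.tokens;
    if (spent + tokens > this.limit) throw new Error('Daily token budget exceeded');
    this.usage.set(userId, {
      tokens: spent + tokens,
      windowStart: expired ? now : entry!.windowStart,
    });
  }
}
```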

Conversation memory without burning the budget

The naive approach sends all 20 turns of history every request. By turn 10 you are paying for the full conversation on every message — quadratic cost growth.

Three patterns that work:

  1. Sliding window: keep last N turns. Simple, loses context.
  2. Summarisation: when total tokens exceed a threshold, replace older turns with a one-paragraph summary generated by GPT-3.5-turbo (cheap). Pattern of choice for chatbots.
  3. Embeddings + retrieval: store turn history as embeddings, retrieve relevant past turns at query time. Worth it only if conversations are long-lived (days, not hours).

TypeScript
// src/ai/chat.ts (continued; uses countTokens from './cost')
type Message = { role: 'system' | 'user' | 'assistant'; content: string };

async function trimHistory(messages: Message[], targetTokens = 4000) {
  if (countTokens(messages) <= targetTokens) return messages;

  const system = messages[0];                            // keep system prompt
  const recent = messages.slice(-6);                     // keep last 6 turns
  const older  = messages.slice(1, -6);

  const summary = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [
      { role: 'system', content: 'Summarise the conversation below in 3 sentences.' },
      { role: 'user', content: older.map(m => `${m.role}: ${m.content}`).join('\n') },
    ],
  });

  return [
    system,
    { role: 'system', content: `Summary of earlier conversation: ${summary.choices[0].message.content}` },
    ...recent,
  ];
}
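
Pattern 1, the sliding window, needs no extra API call at all. A sketch, assuming the usual { role, content } message shape:

```typescript
type Message = { role: 'system' | 'user' | 'assistant'; content: string };

// Keep the system prompt plus the last `keep` messages, drop the middle.
// Cheap and predictable; the trade-off is total amnesia about older turns.
function slidingWindow(messages: Message[], keep = 6): Message[] {
  const [system, ...rest] = messages;
  return [system, ...rest.slice(-keep)];
}
```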

Production checklist

  • Single OpenAI client per process, reused across all requests for HTTP keep-alive.
  • max_tokens set on every request, sized to the realistic response length plus a little buffer.
  • AbortController wired to req.on('close') so user disconnects stop the OpenAI call.
  • Per-user daily token budget enforced before the request goes out, backed by Redis.
  • Conversation history trimmed via summarisation past 4k tokens.
  • Model fallback chain from cheap → expensive → emergency, switching on 429/5xx only.
  • Streaming for any response longer than 1 second. User-perceived latency falls dramatically.
  • Retries handled by SDK maxRetries — don’t add your own loop on top, you will create thundering herds.
  • Logs include model, input tokens, output tokens, dollar cost, latency. Without this you cannot debug bill shocks.
  • System prompts in version control, not in environment variables, not in the database.
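
The logging item on the checklist, concretely: one structured record per completion. Field names here are my convention, not anything the SDK emits.

```typescript
// One structured record per completion: enough to reconstruct the bill and
// to spot which model a latency spike came from.
function usageLogEntry(fields: {
  model: string;
  inTokens: number;
  outTokens: number;
  costUsd: number;
  latencyMs: number;
}): string {
  return JSON.stringify({ event: 'openai.completion', ts: new Date().toISOString(), ...fields });
}
```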

When not to use the OpenAI API

Three cases where the right answer is something else:

  • You need predictable latency under 200 ms. Even GPT-4o mini varies from 400 ms to 6 s depending on load. For UX-critical paths, fine-tune a smaller open-source model and self-host, or accept a different shape of feature.
  • You handle data that cannot leave your network. Healthcare with HIPAA, EU data residency for some regulated sectors, internal trade secrets. Use AWS Bedrock with Claude or self-hosted Llama 3.
  • You are doing structured extraction at high volume. Fine-tuned smaller models (or even regex + DOM parsing) are 10× cheaper and faster than asking GPT-4o to extract three fields from a page.

Troubleshooting FAQ

Why am I getting random 429 errors when I’m not anywhere near my rate limit?

OpenAI rate limits are per organisation per model, not per API key, and the dashboard often lags. Check x-ratelimit-remaining-tokens in the response headers — that is the live number. The SDK retries on 429 by default; if you are seeing them surface to your code, increase maxRetries or rate-limit your callers.

Should I use the Vercel AI SDK instead of the OpenAI SDK directly?

If you ship a chat UI in Next.js or React, yes. The Vercel AI SDK gives you typed React hooks and provider-agnostic abstraction. For backend-only or non-React frontends, the OpenAI SDK directly is one less layer of indirection.

How do I handle prompt injection?

Three layers: never include user input in the system prompt, validate any structured output the model returns (don’t trust JSON it generates), and cap max_tokens so an injected “repeat 1000 times” prompt cannot spend money. Full prevention is impossible; mitigation is doable.
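
The second mitigation layer, validating structured output, in miniature and without a schema library. The fields here are illustrative; the point is to parse defensively, check types and bounds, and only then trust the data.

```typescript
// Never trust model-generated JSON. Parse defensively, then check every
// field's type and bounds before letting it touch the rest of the system.
function parseExtraction(raw: string): { name: string; priority: number } {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    throw new Error('Model returned invalid JSON');
  }
  const obj = (typeof data === 'object' && data !== null ? data : {}) as Record<string, unknown>;
  const { name, priority } = obj;
  if (typeof name !== 'string' || name.length === 0 || name.length > 200) {
    throw new Error('name missing or out of bounds');
  }
  if (typeof priority !== 'number' || priority < 1 || priority > 5) {
    throw new Error('priority missing or out of range');
  }
  return { name, priority };
}
```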

Streaming or batch responses?

Stream when the user is watching. Batch when nothing watches — analytics jobs, summaries, embeddings backfills. OpenAI’s Batch API is 50% cheaper for non-realtime workloads.

How do I store and version system prompts?

In src/ai/prompts/*.ts as named string exports. Treat them like code: PR review, tests against expected outputs, semantic versioning if changes affect behaviour. Storing them in the database means production prompt changes happen without a deploy — usually the wrong thing.

What about LangChain?

Useful for prototyping multi-step agent workflows. Heavy and opinionated for production single-shot completions. I have removed LangChain from two production codebases and replaced it with ~150 lines of explicit code; both got faster and easier to debug.

How do I test code that calls OpenAI?

Mock the client at the module boundary. Don’t hit the real API from CI — flaky and expensive. For integration tests against the real API, gate them behind an environment flag and run on a schedule, not on every PR.
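
Mocking at the module boundary works best when application code depends on a narrow interface you own. A sketch; the interface and names are mine, not the SDK's:

```typescript
// The only OpenAI surface application code is allowed to touch. The real
// implementation wraps the SDK; tests hand in a canned fake instead.
interface ChatClient {
  complete(messages: { role: string; content: string }[]): Promise<string>;
}

async function answer(client: ChatClient, question: string): Promise<string> {
  const text = await client.complete([{ role: 'user', content: question }]);
  return text.trim();
}

// No network, no flakiness, no spend: the fake returns a canned completion.
const fakeClient: ChatClient = {
  complete: async () => '  a canned completion  ',
};
```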

GPT-4o mini vs GPT-4o for production?

Default to mini. GPT-4o mini handles 80% of typical chatbot, summarisation, classification, and basic Q&A workloads at roughly one-fifteenth the cost. Reserve full GPT-4o for tasks where mini visibly fails: complex reasoning, multi-step extraction, code generation longer than 50 lines.

What ships next

This article covers the OpenAI integration core. Two adjacent topics are worth their own posts: structured output with JSON mode and Zod validation (where the model's output becomes type-safe input to the rest of your app), and a Redis-backed prompt cache that knocks roughly 40% off the cost of repeated questions. Both are queued. If your AI endpoint needs auth, a standard JWT middleware drops in in front of the SSE route; and if the SSE pattern above is your first taste of Node streams, the rest of the streams API is worth learning next.