The first Node.js memory leak I had to chase in production took six days to find and one line of code to fix. The app was an Express service handling about 200 req/s. Memory climbed from 180 MB at boot to 1.4 GB over twelve hours, at which point PM2 killed the worker and the cycle restarted. The leak was a Sentry breadcrumb collector with no upper bound, accumulating one entry per request. Six days, one line: maxBreadcrumbs: 50.
The hard part of memory leaks is never the fix. It is the diagnosis. The workflow below is the one I use in production: confirm there is a leak, find which object is leaking, find which code path is allocating it, and prove the patch fixed it. Tools: Node’s built-in --inspect with Chrome DevTools, clinic.js for a higher-level view, and heapdump when you need to capture a snapshot from a process you can’t attach to.
Confirm a leak before you start hunting
Half the time engineers come to me with a “memory leak,” it is a process doing exactly what it is supposed to: caching things, with the cache size bounded by something that takes a few hours to fill. A real leak grows monotonically over hours and never plateaus. Three things to check first.
1. Plot RSS over time. If you have monitoring (Datadog, New Relic, even top in a tmux), watch the resident set size over a 24-hour window. A sawtooth pattern (climbs, GC, drops) is healthy. A staircase that only goes up is a leak.
2. Check if it actually crashes. Node’s V8 heap caps at roughly 1.5 GB by default. RSS includes the heap, native bindings, and the loaded code; a 700 MB process at idle might be entirely fine. Crashes look like:
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
1: 0xb09980 node::Abort()
2: 0xa1c193 node::FatalError(...)If you see that, you have a leak. If memory grows but the process never crashes, you might have a cache or buffer doing its job.
3. Check the GC. Run with --trace-gc for an hour:
node --trace-gc dist/server.js | tee gc.logLook at the heap size after each “Mark-sweep” line. If it grows unbounded across multiple full GCs, the leak is real. If it grows then drops back, GC is doing its job and the apparent climb was incremental allocation between collections.
Quick start: capture a heap snapshot in 5 commands
Working flow before reading the analysis sections.
node --inspect=0.0.0.0:9229 dist/server.jsFrom your laptop, open chrome://inspect, click “Open dedicated DevTools for Node,” go to the Memory tab. Take a heap snapshot. Run the suspect workload for ten minutes. Take a second snapshot. Use the dropdown above the snapshot list to switch the second one to “Comparison” mode, sort by Delta. The objects with the largest positive delta are your leak candidates.
For a process you cannot reach over a port (managed Lambda, container without an inspector port), use heapdump to write a .heapsnapshot file from inside the process:
import * as heapdump from 'heapdump';
process.on('SIGUSR2', () => {
const file = `/tmp/${Date.now()}.heapsnapshot`;
heapdump.writeSnapshot(file, (err) => {
console.log(err ? `failed: ${err}` : `wrote ${file}`);
});
});kill -USR2 $(pgrep -f 'node dist/server.js')
scp server:/tmp/1742... ~/Downloads/ # then load in Chrome DevTools Memory tabWhat is wrong with the typical “fix the memory leak” article
Five things most leak articles get wrong:
- They start with the fix patterns and skip the diagnosis. “Common leaks include unbounded Maps” is true and useless when you don’t know which Map.
- They show one tool and stop. Heap snapshots tell you what objects are retained. They don’t tell you which code allocated them. You need both views.
- They don’t validate the fix. A patched leak should produce a flat memory line; if you don’t measure post-fix, you don’t know.
- They confuse heap with RSS. Native memory leaks (libraries with C++ bindings) don’t show in heap snapshots; you need a different tool.
- They never mention the boring fixes. Most production leaks I find are unbounded caches, listener accumulation, or closures over request objects. Exotic V8 internals are vanishingly rare.
Find which object is leaking with heap snapshots
The Comparison view in Chrome DevTools is the highest-leverage tool. Workflow:
- Boot the app. Take snapshot #1 once memory has stabilised (usually 60 seconds in).
- Run the workload that you suspect leaks. autocannon against the suspect endpoint for 5 minutes works.
- Force a full GC: in the DevTools Memory tab, click the trash-can icon. This rules out “this object would have been collected on next GC.”
- Take snapshot #2.
- Switch view to “Comparison,” set “Compared to” → snapshot #1, sort by “# Delta” descending.
The first row is what is growing. Click it; the bottom pane shows retainers — the chain of references holding it alive. Trace the chain back to a global, a closure, or a long-lived object. That is your leak.
One pattern that catches engineers off guard: the leak is not the largest object class. It is the class with the largest delta. A million Strings retained because they are keys in a Map you forgot to clean up looks like “lots of strings”; the Map itself is small. Comparison view spots it.
Find which code allocated the leak with the Allocation Profile
Snapshots tell you what is retained. They don’t tell you where it was created. Switch to the “Allocation instrumentation on timeline” profile in the same Memory tab, start recording, run your workload, stop. The flame graph that appears shows allocations grouped by call stack — including the file and line that allocated each retained chunk.
For headless / production captures, use --cpu-prof --heap-prof on the Node command line:
node --heap-prof --heap-prof-interval=512000 dist/server.js
# After running the workload, kill the process; it writes Heap.20240310.heapprofile
# Load that file in Chrome DevTools Performance tab.The leaks I actually find in production
Eight years of paid debugging, sample of one — but the ranking holds across most Node.js apps I have audited:
| Pattern | How often | Fix shape |
|---|---|---|
| Unbounded Map / object cache | ~40% | LRU with size cap (lru-cache) |
| Event listener accumulation | ~20% | Move .on() outside request handler; use .once() |
| Closure over request object | ~15% | Extract values, drop the reference |
| Third-party library default config | ~10% | Read the docs, set the cap (Sentry breadcrumbs, OpenTelemetry buffers) |
| Promise chains that never resolve | ~5% | Add timeout; reject pending promises on shutdown |
| setInterval that closes over context | ~5% | Store handle, clearInterval when context dies |
| Native binding leaks (sharp, node-canvas) | ~3% | Update library; report upstream |
| Genuine V8 bug | ~2% | Update Node; report |
Pattern 1: the unbounded cache
The most common production leak. Looks innocent:
// LEAKS — userCache grows forever
const userCache = new Map<string, User>();
export async function getUser(id: string) {
if (userCache.has(id)) return userCache.get(id)!;
const user = await db.user.findUnique({ where: { id } });
userCache.set(id, user!);
return user!;
}One million unique users over a week, one million entries in the Map, 400 MB of heap. Fix: bounded LRU.
import { LRUCache } from 'lru-cache';
const userCache = new LRUCache<string, User>({
max: 5000, // hard upper bound
ttl: 5 * 60 * 1000, // 5 minutes
});The same pattern in 30 different shapes — request body cache, JWT verification cache, schema compilation cache, parsed user-agent cache. lru-cache with a max size and a TTL is the right answer for almost all of them.
Pattern 2: event listener accumulation
Looks like:
// LEAKS — every request adds a new listener
app.get('/notify', (req, res) => {
eventBus.on('user-update', (user) => {
res.write(`data: ${JSON.stringify(user)}\n\n`);
});
// res never closes; listener never removed
});Node will warn at 10 listeners on the same emitter:
(node:1234) MaxListenersExceededWarning: Possible EventEmitter memory leak detected.
11 user-update listeners added to [EventEmitter]. Use emitter.setMaxListeners() to increase limitDo not silence the warning. Fix the listener:
app.get('/notify', (req, res) => {
const handler = (user: User) => res.write(`data: ${JSON.stringify(user)}\n\n`);
eventBus.on('user-update', handler);
req.on('close', () => eventBus.off('user-update', handler));
});Pattern 3: closure over request object
Looks innocent because the function is small:
// LEAKS — closure captures req, which holds the entire request body
const pending = new Map<string, () => void>();
app.post('/job', (req, res) => {
const id = crypto.randomUUID();
pending.set(id, () => { // closes over req
console.log(`done: ${req.body.user}`);
});
res.json({ id });
});Each pending entry retains the callback, which retains req, which retains the request body. A request with a 5 MB JSON payload that hangs for an hour leaks 5 MB for the duration. Multiply by traffic; you have a problem.
// FIXED — extract the value, drop the request reference
app.post('/job', (req, res) => {
const id = crypto.randomUUID();
const userId = req.body.user; // primitive, no retention
pending.set(id, () => console.log(`done: ${userId}`));
res.json({ id });
});Pattern 4: third-party defaults
The Sentry leak from the opening was this. Sentry’s maxBreadcrumbs defaults to 100 per scope, which is fine — but if your code calls Sentry.addBreadcrumb() from inside a request and you don’t reset the scope per request, breadcrumbs from every request accumulate. Same shape with OpenTelemetry’s batch span processor (cap the queue), Pino transports (close them on shutdown), and a long tail of “telemetry library that retains data until you tell it not to.”
The fix is always the same: read the config docs, set explicit limits, never trust framework defaults for production.
clinic.js: the higher-level view
clinic.js wraps Node with a profiler that produces an HTML report you can read in your browser. Two commands relevant to memory work:
npm install -g clinic autocannon
# Heap profile + flame graph
clinic doctor -- node dist/server.js
# In another shell:
autocannon -c 50 -d 60 http://localhost:3000/suspect-endpoint
# Ctrl-C the clinic command. Browser opens with diagnosis.clinic doctor diagnoses the shape of the problem (event loop blocked, memory growing, GC under pressure). For deeper memory work, clinic heapprofiler samples allocations and shows a flame graph of which functions allocated the most retained memory:
clinic heapprofiler -- node dist/server.jsThe flame graph shows code paths weighted by allocation. The widest stacks at the top are usually where your leak lives.
Production checklist
- Plot RSS in monitoring with a 24-hour-minimum window. Page on monotonic growth past a threshold.
- Set
--max-old-space-sizeexplicitly in the Node command. Default 1.5 GB is rarely the right number for your droplet’s available RAM. - PM2’s
--max-memory-restartas a last-resort safety net, not a fix. Restarts are visible. - Bounded caches everywhere. If a Map can grow with traffic, it must have a max size and TTL.
- Listener cleanup on every long-lived response. SSE endpoints, WebSocket handlers, anything subscribing to event emitters.
- Capture heap snapshots regularly in staging under realistic load. Compare week-over-week to catch slow leaks before they ship.
- Sentry / OpenTelemetry / Pino: read the cap settings, set them explicitly, document the values.
- Regression tests for memory: run the leaky workload against the patched version, compare snapshot deltas.
- Update Node and your runtime libraries. V8 leaks are rare but real; LTS upgrades fix them.
- Don’t run heap snapshots on a hot production process. They pause the event loop for 1–10 seconds. Take them on a single worker drained of traffic.
When not to chase the leak
- The process restarts cleanly every few hours and you do zero-downtime restart. If PM2 cluster mode rotates a worker every 6 hours and no user notices, the cost of finding the leak might exceed the cost of accepting it. (I will judge you, but the math sometimes wins.)
- The “leak” is a startup cost amortised over the process lifetime. Caches that fill in the first hour and stay flat are doing their job.
- Native bindings in third-party libraries. sharp, node-canvas, sometimes node-postgres binary — you can confirm the leak is there but you cannot fix it from JavaScript. Update the library, report upstream, accept restart-on-threshold as a workaround.
Troubleshooting FAQ
My RSS is high but the heap is small. What is leaking?
Native memory. JavaScript heap snapshots won’t show it. Inspect with node --report-on-fatalerror and look at the diagnostic report for “external” memory. Common culprits: Buffer.allocUnsafe usage, native modules (sharp, node-canvas, libuv I/O buffers).
Why does --max-old-space-size not match my droplet RAM?
Node’s default is 1.5 GB regardless of available RAM. On a 4 GB droplet you can safely set --max-old-space-size=3000 (3 GB) and leave headroom for the OS. Don’t set it to total RAM — your app will OOM-kill the kernel.
Should I use global.gc() in production?
No. global.gc() requires --expose-gc at startup, blocks the event loop for the duration of the GC, and is almost never the right tool. Use it only in tests to force collection between snapshots.
Heap snapshot crashed my process. What now?
Snapshots double peak memory briefly because V8 builds the graph in memory. Run them on a worker that has been drained from the load balancer. PM2 lets you target one worker by ID for the inspect port.
How do I attach Chrome DevTools to a Node process I already started?
kill -USR1 <pid> opens the inspector on a process that wasn’t started with --inspect. Then visit chrome://inspect on your laptop.
Does TypeScript prevent memory leaks?
No. TypeScript is a compile-time check; leaks are a runtime problem. The patterns above all happen in fully-typed code.
autocannon vs k6 for memory testing?
autocannon is simpler and node-native; k6 is more powerful with scripting. For “hammer one endpoint and watch memory,” autocannon. For multi-step workflows that simulate users, k6.
Should I worry about memory in serverless functions?
Different problem. Serverless functions get their memory cap from the platform; leaks within a single invocation matter. Cross-invocation state (module-level Maps that persist across thaws) leaks more subtly. Treat the function as a process that lives for unknown duration; same patterns apply.
What ships next
This article covers diagnosis and the common patterns. Two adjacent topics worth their own posts: event-loop blocking diagnosis (different problem, similar tools) and proper graceful shutdown that drains in-flight requests before the process exits — half of “production crashes” turn out to be ungraceful exits, not leaks. If your leak is in a stream pipeline, the unbounded-buffer patterns in that article match the leak shapes here. If you are weighing frameworks, both Express and Fastify leak in the same shapes; the framework is rarely the cause.