Build an AI PDF Summarizer with Node.js and OpenAI
Step-by-step tutorial to create a PDF summarization API that handles documents up to 500 pages using text splitting, per-chunk summaries, and recursive summarization.
Processing large PDFs (research papers, contracts, books) with LLMs is challenging due to token limits. This tutorial builds a production-ready PDF summarizer that chunks documents, summarizes each chunk, and then combines them into a cohesive summary.
Key Takeaways
- Extract text from PDFs using pdf-parse
- Split text into overlapping chunks for context preservation
- Use recursive summarization for very long documents
- Cache summaries to reduce API costs
Prerequisites
- Node.js 20+
- OpenAI API key
- Basic understanding of async/await and streams
Step 1: Project Setup
mkdir pdf-summarizer
cd pdf-summarizer
npm init -y
# pdf-parse is pinned to 1.1.1, whose default-export function this tutorial calls
npm install express multer pdf-parse@1.1.1 openai dotenv
npm install -D typescript @types/node @types/express @types/multer @types/pdf-parse
npx tsc --init
Step 2: PDF Text Extraction
Create src/pdfExtractor.ts:
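import { readFile } from 'fs/promises';
import pdf from 'pdf-parse';

// Read the PDF from disk and return its plain-text content.
export async function extractTextFromPDF(filePath: string): Promise<string> {
  // Non-blocking read, so a large upload does not stall the event loop
  const dataBuffer = await readFile(filePath);
  const data = await pdf(dataBuffer);
  return data.text;
}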
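Extracted PDF text often carries words hyphenated across line breaks and uneven whitespace, which can end up in the chunks and the summaries. A minimal cleanup sketch follows; `normalizeExtractedText` is a hypothetical helper introduced here for illustration, not part of pdf-parse.

```typescript
// Hypothetical helper (not part of pdf-parse): tidy raw extracted text
// before it is chunked.
export function normalizeExtractedText(raw: string): string {
  return raw
    // Normalize Windows and old Mac line endings to "\n"
    .replace(/\r\n?/g, '\n')
    // Re-join words hyphenated across a line break: "summar-\nize" -> "summarize"
    .replace(/([A-Za-z])-\n([a-z])/g, '$1$2')
    // Drop trailing spaces and tabs at the end of each line
    .replace(/[ \t]+\n/g, '\n')
    // Collapse runs of three or more newlines into a single paragraph break
    .replace(/\n{3,}/g, '\n\n')
    .trim();
}
```

One way to use it would be to wrap the return value of `extractTextFromPDF`. Note the trade-off: the hyphen rule also merges genuine compounds that happen to break at a line end ("well-\nknown" becomes "wellknown"), so drop that step if your documents rely on them.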
Step 3: Text Chunking with LangChain
Install LangChain’s text splitters:
npm install @langchain/textsplitters
Create src/chunker.ts:
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

// chunkSize and chunkOverlap are measured in characters, not tokens.
export async function chunkText(
  text: string,
  chunkSize: number = 2000,
  chunkOverlap: number = 200
): Promise<string[]> {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize,
    chunkOverlap,
    // Prefer paragraph breaks, then line breaks, then spaces, then any character
    separators: ['\n\n', '\n', ' ', ''],
  });
  const chunks = await splitter.splitText(text);
  console.log(`Split into ${chunks.length} chunks`);
  return chunks;
}
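To see what chunkSize and chunkOverlap mean, here is a deliberately simplified fixed-window chunker. It is an illustration only: `slidingWindowChunks` is a hypothetical function, and unlike RecursiveCharacterTextSplitter it ignores separators and will cut mid-word.

```typescript
// Simplified fixed-window chunker, for illustration only.
export function slidingWindowChunks(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
  }
  const step = chunkSize - chunkOverlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the window has reached the end of the text
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Each chunk repeats the last chunkOverlap characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk. With the tutorial's 2000/200 settings the window advances 1,800 characters at a time, so this simplified version would produce 111 chunks for a 200,000-character document; the real splitter's count will differ somewhat because it breaks on separators.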
Step 4: Summarize Each Chunk
Create src/summarizer.ts using OpenAI:
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const CHUNK_SUMMARY_PROMPT = `
Summarize the following text in 3-5 sentences. Focus on key facts, arguments, and conclusions.
Text: {text}
Summary:
`;

export async function summarizeChunk(chunk: string): Promise<string> {
  // Use a replacer function so "$&" or "$1" inside the chunk is inserted
  // literally instead of being treated as a replacement pattern
  const prompt = CHUNK_SUMMARY_PROMPT.replace('{text}', () => chunk);
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'You are a precise summarization assistant.' },
      { role: 'user', content: prompt },
    ],
    temperature: 0.3,
    max_tokens: 500,
  });
  return response.choices[0].message.content || '';
}

export async function summarizeChunks(chunks: string[]): Promise<string[]> {
  const summaries: string[] = [];
  for (let i = 0; i < chunks.length; i++) {
    console.log(`Summarizing chunk ${i + 1}/${chunks.length}`);
    const summary = await summarizeChunk(chunks[i]);
    summaries.push(summary);
    // Crude throttle: one request every 1.2 s is about 50 requests per minute.
    // Adjust the delay to the rate limits of your own account.
    await new Promise(resolve => setTimeout(resolve, 1200));
  }
  return summaries;
}
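A fixed delay keeps the request rate down but does not recover from an occasional 429 or server error. The sketch below retries with exponential backoff. `withRetry` is a hypothetical helper, and it assumes the thrown error exposes a numeric `status` property, as the openai SDK's API errors do.

```typescript
// Hypothetical helper: retry an async operation with exponential backoff.
export async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Assumption: the error carries an HTTP status; retry only 429 and 5xx
      const status = (error as { status?: number }).status;
      const retryable = status === 429 || (status !== undefined && status >= 500);
      if (!retryable || attempt >= maxAttempts) throw error;
      // Wait 1x, 2x, 4x ... the base delay before the next attempt
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise<void>(resolve => setTimeout(resolve, delayMs));
    }
  }
}
```

Inside `summarizeChunks` this could wrap the call as `await withRetry(() => summarizeChunk(chunks[i]))`. The openai client also accepts a `maxRetries` option and retries some failures on its own, so check that the two layers do not multiply your worst-case wait.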
Step 5: Recursive Summarization for Long Documents
If there are too many chunk summaries to combine in one request, reduce them recursively. Add this function to src/summarizer.ts; it merges the summaries in batches of four and repeats until a single summary remains:
export async function recursiveSummarize(
  chunks: string[],
  targetLength: 'short' | 'medium' | 'long' = 'medium'
): Promise<string> {
  if (chunks.length === 0) return '';
  let currentLevel = [...chunks];
  while (currentLevel.length > 1) {
    const nextLevel: string[] = [];
    // Merge four summaries per request
    for (let i = 0; i < currentLevel.length; i += 4) {
      const batch = currentLevel.slice(i, i + 4);
      const combined = batch.join('\n\n');
      const summaryPrompt = `
Combine and condense these summaries into a single cohesive paragraph.
Remove redundancies and maintain logical flow.
${combined}
`;
      const response = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [
          { role: 'system', content: 'You combine summaries into tighter summaries.' },
          { role: 'user', content: summaryPrompt },
        ],
        temperature: 0.2,
        // 'medium' and 'long' currently share the same output cap
        max_tokens: targetLength === 'short' ? 300 : 800,
      });
      nextLevel.push(response.choices[0].message.content || '');
    }
    currentLevel = nextLevel;
  }
  return currentLevel[0];
}
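The batch-of-four reduction adds API calls on top of the per-chunk summaries. The following sketch counts them; `planReduction` is a hypothetical helper that only mirrors the loop structure above and makes no requests.

```typescript
// Hypothetical helper: count the levels and API calls of a batch-wise
// reduction that merges `batchSize` summaries per call until one remains.
export function planReduction(
  summaryCount: number,
  batchSize = 4
): { levels: number; calls: number } {
  if (batchSize < 2) throw new Error('batchSize must be at least 2');
  let remaining = summaryCount;
  let levels = 0;
  let calls = 0;
  while (remaining > 1) {
    remaining = Math.ceil(remaining / batchSize); // one call per batch
    calls += remaining;
    levels++;
  }
  return { levels, calls };
}
```

For 100 chunk summaries the levels shrink 100 → 25 → 7 → 2 → 1, which is 4 levels and 35 extra calls. A larger batch size means fewer calls but longer prompts per call.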
Step 6: Express API with File Upload
Create src/index.ts:
// Load .env first: summarizer.ts reads OPENAI_API_KEY as soon as it is imported
import 'dotenv/config';
import express from 'express';
import multer from 'multer';
import fs from 'fs';
import { extractTextFromPDF } from './pdfExtractor';
import { chunkText } from './chunker';
import { summarizeChunks, recursiveSummarize } from './summarizer';

const app = express();
const upload = multer({ dest: 'uploads/' });
app.use(express.json());
app.post('/api/summarize', upload.single('pdf'), async (req, res) => {
  try {
    if (!req.file) {
      res.status(400).json({ error: 'No PDF file uploaded' });
      return;
    }
    const filePath = req.file.path;

    // Extract text
    const text = await extractTextFromPDF(filePath);
    console.log(`Extracted ${text.length} characters`);

    // Chunk text
    const chunks = await chunkText(text, 2000, 200);

    // Summarize chunks
    const chunkSummaries = await summarizeChunks(chunks);

    // Combine summaries: a handful are joined directly, more are reduced recursively
    let finalSummary: string;
    if (chunkSummaries.length > 5) {
      finalSummary = await recursiveSummarize(chunkSummaries, 'medium');
    } else {
      finalSummary = chunkSummaries.join('\n\n');
    }

    res.json({
      summary: finalSummary,
      metadata: {
        originalLength: text.length,
        chunksProcessed: chunks.length,
        model: 'gpt-4o-mini',
      },
    });
  } catch (error) {
    console.error(error);
    res.status(500).json({ error: 'Summarization failed' });
  } finally {
    // Delete the uploaded file whether summarization succeeded or failed
    if (req.file) {
      fs.rmSync(req.file.path, { force: true });
    }
  }
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});
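The endpoint accepts any uploaded file, and the file name and MIME type are supplied by the client. A cheap sanity check, sketched below, looks at the first bytes for the `%PDF-` signature. `looksLikePdf` is a hypothetical helper; it is strict about the signature sitting at offset 0.

```typescript
// Hypothetical helper: check for the "%PDF-" signature at the start of the file.
export function looksLikePdf(bytes: Uint8Array): boolean {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (bytes.length < signature.length) return false;
  return signature.every((byte, i) => bytes[i] === byte);
}
```

In the route handler this could run on the file's leading bytes before extraction, returning a 400 when it fails. It does not prove the file is a valid PDF, only that it is not obviously something else.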
Step 7: Client Example
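A simple HTML/JavaScript client. Serve it from the same origin as the API (for example with express.static) so the relative /api/summarize URL resolves:
<!DOCTYPE html>
<html>
  <head>
    <title>PDF Summarizer</title>
  </head>
  <body>
    <input type="file" id="pdfInput" accept=".pdf">
    <button onclick="summarize()">Summarize</button>
    <pre id="output"></pre>
    <script>
      async function summarize() {
        const output = document.getElementById('output');
        const file = document.getElementById('pdfInput').files[0];
        if (!file) {
          output.textContent = 'Choose a PDF first.';
          return;
        }
        const formData = new FormData();
        formData.append('pdf', file);
        output.textContent = 'Summarizing...';
        const response = await fetch('/api/summarize', {
          method: 'POST',
          body: formData,
        });
        const data = await response.json();
        // The API returns { error } on failure and { summary } on success
        output.textContent = response.ok ? data.summary : data.error;
      }
    </script>
  </body>
</html>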
Performance Optimizations
- Cache summaries in Redis using a hash of the PDF content
- Parallel chunk summarization with Promise.all (respect rate limits)
- Stream progress via Server-Sent Events for long PDFs
- Use a small, cheap model (such as gpt-4o-mini) for first-pass summaries and reserve a larger model for the final combine
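Two of these optimizations can be sketched briefly. `contentKey` derives a cache key from the PDF bytes (storing and reading the entry in Redis is left out), and `mapWithConcurrency` is a bounded alternative to a bare Promise.all over every chunk. Both names are hypothetical helpers introduced here.

```typescript
import { createHash } from 'node:crypto';

// Cache key: SHA-256 of the PDF bytes, so identical uploads share one entry.
export function contentKey(pdfBytes: Uint8Array): string {
  return createHash('sha256').update(pdfBytes).digest('hex');
}

// Hypothetical helper: map over items with at most `limit` calls in flight.
// Results keep the order of the input array.
export async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (limit < 1) throw new Error('limit must be at least 1');
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, runLane);
  await Promise.all(lanes);
  return results;
}
```

With the tutorial's functions, `await mapWithConcurrency(chunks, 5, chunk => summarizeChunk(chunk))` would replace the sequential loop; the limit of 5 is an arbitrary starting point to tune against your rate limits.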
Cost Analysis
For a 100-page PDF (~200,000 characters, ~100 chunks):
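- Input tokens: ~50,000 (the chunk text, at roughly 4 characters per token)
- Output tokens: ~10,000 (about 100 tokens per chunk summary)
- Cost with GPT-4o-mini: ~$0.01 per document (at $0.15 per million input tokens and $0.60 per million output tokens)
- Cost with GPT-4: ~$2.10 per document (at $30 per million input tokens and $60 per million output tokens)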
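The arithmetic behind an estimate like this can be sketched as a small function. `estimateCostUsd` and `TokenPrices` are hypothetical names, the 4-characters-per-token ratio is a rough rule of thumb for English text, and the prices are parameters because they change over time.

```typescript
// Prices are per one million tokens; pass in current values from your
// provider's pricing page. Nothing here is tied to a specific model.
export interface TokenPrices {
  inputPerMillion: number;
  outputPerMillion: number;
}

export function estimateCostUsd(
  documentChars: number,
  chunkCount: number,
  outputTokensPerChunk: number,
  prices: TokenPrices,
  charsPerToken = 4 // rough rule of thumb for English text (assumption)
): number {
  const inputTokens = documentChars / charsPerToken;
  const outputTokens = chunkCount * outputTokensPerChunk;
  return (
    (inputTokens / 1_000_000) * prices.inputPerMillion +
    (outputTokens / 1_000_000) * prices.outputPerMillion
  );
}
```

This covers only the first pass over the chunks. It ignores prompt overhead, the text repeated by chunk overlap, and the extra calls made by recursive summarization, so treat the result as a lower bound.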
Limitations & Solutions
| Problem | Solution |
|---|---|
| Scanned PDFs (images) | Use OCR (Tesseract.js or AWS Textract) |
| Tables & charts | Extract with pdf-table-extractor |
| Non-English text | Text-based PDFs extract as Unicode; state the desired summary language in the prompt |
| Very large PDFs (>500MB) | Use streaming chunk processing |
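Scanned PDFs can be detected before any tokens are spent: if extraction yields almost no text per page, the document probably needs OCR. A sketch of that heuristic follows. `probablyScanned` is a hypothetical helper and the 100-characters-per-page threshold is an arbitrary starting point; the page count is assumed to come from the `numpages` field of the pdf-parse result.

```typescript
// Hypothetical heuristic: a PDF whose pages yield almost no text is probably
// made of scanned images and needs OCR.
export function probablyScanned(
  extractedText: string,
  pageCount: number,
  minCharsPerPage = 100
): boolean {
  if (pageCount <= 0) return true; // nothing readable at all
  // Count only non-whitespace characters
  const visibleChars = extractedText.replace(/\s/g, '').length;
  return visibleChars / pageCount < minCharsPerPage;
}
```

When it returns true, the API could route the file to an OCR step or return an error that says the PDF contains no extractable text, rather than summarizing an empty string.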
Conclusion
You now have a PDF summarizer that scales to long documents by reducing chunk summaries recursively. Deploy it to a hosting platform like Railway or Fly.io, add authentication, and you have a SaaS product ready for beta.
Extensions:
- Add question answering over the PDF (RAG with embeddings)
- Support for DOCX, PPTX, and TXT files
- Generate bullet-point summaries and key takeaways
- Email summaries to users using Resend