AI Projects
May 1, 2026 • 6 min read

Build an AI PDF Summarizer with Node.js and OpenAI

Step-by-step tutorial to create a PDF summarization API that handles documents up to 500 pages using text extraction, overlapping text chunks, and recursive summarization.

Processing large PDFs (research papers, contracts, books) with LLMs is challenging due to token limits. This tutorial builds a production-ready PDF summarizer that chunks documents, summarizes each chunk, and then combines them into a cohesive summary.

Key Takeaways

  • Extract text from PDFs using pdf-parse
  • Split text into overlapping chunks for context preservation
  • Use recursive summarization for very long documents
  • Cache summaries to reduce API costs

Prerequisites

  • Node.js 20+
  • OpenAI API key
  • Basic understanding of async/await and streams

Step 1: Project Setup

bash
mkdir pdf-summarizer
cd pdf-summarizer
npm init -y
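# Step 2 relies on the pdf-parse 1.x API (a default-exported function), so pin that major version
npm install express multer pdf-parse@1 openai langchain dotenv
npm install -D typescript @types/node @types/express @types/multer @types/pdf-parse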
npx tsc --init

Step 2: PDF Text Extraction

Create src/pdfExtractor.ts:

typescript
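import { readFile } from 'fs/promises';
import pdf from 'pdf-parse';

// Reads a PDF from disk and returns its plain-text content.
export async function extractTextFromPDF(filePath: string): Promise<string> {
  // Non-blocking read, so a large upload does not stall other requests
  const dataBuffer = await readFile(filePath);
  const data = await pdf(dataBuffer);
  return data.text;
}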

Step 3: Text Chunking with LangChain

Install LangChain’s text splitters:

bash
npm install @langchain/textsplitters

Create src/chunker.ts:

typescript
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

export async function chunkText(
  text: string,
  chunkSize: number = 2000,
  chunkOverlap: number = 200
): Promise<string[]> {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize,
    chunkOverlap,
    separators: ['\n\n', '\n', ' ', ''],
  });
  
  const chunks = await splitter.splitText(text);
  console.log(`Split into ${chunks.length} chunks`);
  return chunks;
}
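
To see what chunkOverlap buys you, here is a simplified fixed-width splitter. It is only an illustration, not LangChain's algorithm: RecursiveCharacterTextSplitter also tries to break at the separators listed above, so its real chunk boundaries will differ. The name slidingWindowChunks is hypothetical.

```typescript
// Simplified illustration of overlapping chunks (not LangChain's algorithm):
// each chunk starts `chunkSize - chunkOverlap` characters after the previous one.
function slidingWindowChunks(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// The last 2 characters of each chunk reappear at the start of the next one.
console.log(slidingWindowChunks('abcdefghij', 4, 2));
// [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

Because neighbouring chunks share text, a sentence cut at one boundary still appears whole in the adjacent chunk, at the price of summarizing a little text twice.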

Step 4: Summarize Each Chunk

Create src/summarizer.ts using OpenAI:

typescript
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const CHUNK_SUMMARY_PROMPT = `
Summarize the following text in 3-5 sentences. Focus on key facts, arguments, and conclusions.

Text: {text}

Summary:
`;

export async function summarizeChunk(chunk: string): Promise<string> {
  // Use a replacer function so "$" sequences inside the chunk are inserted literally
  const prompt = CHUNK_SUMMARY_PROMPT.replace('{text}', () => chunk);
  
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'You are a precise summarization assistant.' },
      { role: 'user', content: prompt },
    ],
    temperature: 0.3,
    max_tokens: 500,
  });
  
  return response.choices[0].message.content || '';
}

export async function summarizeChunks(chunks: string[]): Promise<string[]> {
  const summaries: string[] = [];
  
  for (let i = 0; i < chunks.length; i++) {
    console.log(`Summarizing chunk ${i + 1}/${chunks.length}`);
    const summary = await summarizeChunk(chunks[i]);
    summaries.push(summary);
    
    // Crude throttle: 1.2 s between calls stays under ~50 requests per minute.
    // Rate limits vary by account and model, so adjust this to your own limits.
    await new Promise(resolve => setTimeout(resolve, 1200));
  }
  
  return summaries;
}
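
The fixed delay above spaces requests out, but it does nothing when a call still fails with a rate-limit error. One common complement is retrying with exponential backoff. The sketch below is a generic helper, not part of the OpenAI SDK: withRetry and isRateLimitError are hypothetical names, and it assumes that a failed HTTP call throws an error carrying a numeric status field.

```typescript
// Hypothetical helper: retry an async call with exponential backoff.
// `isRetryable` decides which errors are worth retrying (e.g. HTTP 429).
async function withRetry<T>(
  fn: () => Promise<T>,
  options: {
    retries?: number;
    baseDelayMs?: number;
    isRetryable?: (error: unknown) => boolean;
  } = {}
): Promise<T> {
  const { retries = 3, baseDelayMs = 1000, isRetryable = () => true } = options;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= retries || !isRetryable(error)) throw error;
      // Wait 1x, 2x, 4x, ... the base delay before the next attempt
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise<void>(resolve => setTimeout(resolve, delayMs));
    }
  }
}

// Assumption: HTTP failures surface as errors with a numeric `status` field.
function isRateLimitError(error: unknown): boolean {
  return (
    typeof error === 'object' &&
    error !== null &&
    (error as { status?: unknown }).status === 429
  );
}
```

Inside the loop this would be used as withRetry(() => summarizeChunk(chunks[i]), { isRetryable: isRateLimitError }). The openai client also accepts a maxRetries option, so check whether its built-in retries already cover your case before adding another layer.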

Step 5: Recursive Summarization for Long Documents

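When a document produces many chunk summaries, simply joining them no longer gives a readable result, and the combined text can eventually exceed the model's context window. Add a recursive pass to src/summarizer.ts that merges the summaries four at a time until a single one remains: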

typescript
export async function recursiveSummarize(
  chunks: string[],
  targetLength: 'short' | 'medium' | 'long' = 'medium'
): Promise<string> {
  let currentLevel = [...chunks];
  
  while (currentLevel.length > 1) {
    const nextLevel: string[] = [];
    
    for (let i = 0; i < currentLevel.length; i += 4) {
      const batch = currentLevel.slice(i, i + 4);
      const combined = batch.join('\n\n');
      
      const summaryPrompt = `
        Combine and condense these summaries into a single cohesive paragraph.
        Remove redundancies and maintain logical flow.
        
        ${combined}
      `;
      
      const response = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [
          { role: 'system', content: 'You combine summaries into tighter summaries.' },
          { role: 'user', content: summaryPrompt },
        ],
        temperature: 0.2,
        // Note: 'medium' and 'long' currently share the same 800-token cap
        max_tokens: targetLength === 'short' ? 300 : 800,
      });
      
      nextLevel.push(response.choices[0].message.content || '');
    }
    
    currentLevel = nextLevel;
  }
  
  // An empty input has nothing to summarize; return '' rather than undefined
  return currentLevel[0] ?? '';
}
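
Each pass of the loop above shrinks the list by a factor of four, so the number of passes grows logarithmically while the number of API calls grows roughly linearly with the number of summaries. The sketch below only does that arithmetic; planReduction is a hypothetical helper and makes no API calls.

```typescript
// Hypothetical planner: how many passes (levels) and API calls does a
// batch-of-N reduction need to collapse `count` summaries into one?
function planReduction(
  count: number,
  batchSize = 4
): { levels: number; calls: number } {
  if (!Number.isInteger(count) || count < 0) {
    throw new Error('count must be a non-negative integer');
  }
  if (!Number.isInteger(batchSize) || batchSize < 2) {
    throw new Error('batchSize must be an integer >= 2');
  }
  let levels = 0;
  let calls = 0;
  let remaining = count;
  while (remaining > 1) {
    remaining = Math.ceil(remaining / batchSize); // one call per batch
    calls += remaining;
    levels++;
  }
  return { levels, calls };
}

// 100 -> 25 -> 7 -> 2 -> 1
console.log(planReduction(100)); // { levels: 4, calls: 35 }
```

For a 100-chunk document, the combine step therefore adds about a third as many calls again on top of the per-chunk summaries.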

Step 6: Express API with File Upload

Create src/index.ts:

typescript
import 'dotenv/config'; // Load .env first: ./summarizer reads OPENAI_API_KEY at import time
import express from 'express';
import multer from 'multer';
import fs from 'fs';
import { extractTextFromPDF } from './pdfExtractor';
import { chunkText } from './chunker';
import { summarizeChunks, recursiveSummarize } from './summarizer';

const app = express();
const upload = multer({ dest: 'uploads/' });

app.use(express.json());

app.post('/api/summarize', upload.single('pdf'), async (req, res) => {
  if (!req.file) {
    res.status(400).json({ error: 'No PDF file uploaded' });
    return;
  }

  const filePath = req.file.path;

  try {
    // Extract text
    const text = await extractTextFromPDF(filePath);
    console.log(`Extracted ${text.length} characters`);

    // Chunk text
    const chunks = await chunkText(text, 2000, 200);

    // Summarize chunks
    const chunkSummaries = await summarizeChunks(chunks);

    // Combine summaries: merge recursively when there are too many to read as-is
    const finalSummary =
      chunkSummaries.length > 5
        ? await recursiveSummarize(chunkSummaries, 'medium')
        : chunkSummaries.join('\n\n');

    res.json({
      summary: finalSummary,
      metadata: {
        originalLength: text.length,
        chunksProcessed: chunks.length,
        model: 'gpt-4o-mini',
      },
    });
  } catch (error) {
    console.error(error);
    res.status(500).json({ error: 'Summarization failed' });
  } finally {
    // Always delete the temporary upload, including when summarization fails
    await fs.promises.unlink(filePath).catch(() => undefined);
  }
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});
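
The endpoint above accepts whatever file the client sends. A cheap first check is to look for the "%PDF-" marker that PDF files start with before handing the file to the parser. The sketch below is a heuristic under that assumption: hasPdfHeader is a hypothetical name, it inspects only the first five bytes, and it does not prove the rest of the file is valid.

```typescript
// Hypothetical check: PDF files begin with the ASCII bytes "%PDF-".
// This only inspects the header; it does not validate the whole document.
function hasPdfHeader(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (bytes.length < magic.length) return false;
  return magic.every((value, index) => bytes[index] === value);
}

// "%PDF-1.7" as raw bytes
const pdfStart = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]);
// "PK" followed by two control bytes, as found at the start of a ZIP archive
const zipStart = new Uint8Array([0x50, 0x4b, 0x03, 0x04]);

console.log(hasPdfHeader(pdfStart)); // true
console.log(hasPdfHeader(zipStart)); // false
```

A Node.js Buffer is a Uint8Array, so the buffer read from the uploaded file can be passed in directly and a failed check answered with a 400 response. multer's limits.fileSize and fileFilter options can additionally reject oversized or mislabeled uploads before they reach the handler.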

Step 7: Client Example

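A simple HTML/JavaScript client. It posts to a relative URL, so it must be served from the same origin as the API (for example, with express.static):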

html
<!DOCTYPE html>
<html>
<head>
  <title>PDF Summarizer</title>
</head>
<body>
  <input type="file" id="pdfInput" accept=".pdf">
  <button onclick="summarize()">Summarize</button>
  <pre id="output"></pre>

  <script>
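    async function summarize() {
      const output = document.getElementById('output');
      const file = document.getElementById('pdfInput').files[0];
      if (!file) {
        output.textContent = 'Choose a PDF first.';
        return;
      }

      const formData = new FormData();
      formData.append('pdf', file);

      output.textContent = 'Summarizing...';
      const response = await fetch('/api/summarize', {
        method: 'POST',
        body: formData,
      });

      const data = await response.json();
      // The API returns { error } on failure and { summary } on success
      output.textContent = response.ok ? data.summary : `Error: ${data.error}`;
    }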
  </script>
</body>
</html>

Performance Optimizations

  1. Cache summaries in Redis using a hash of the PDF content
  2. Parallel chunk summarization with Promise.all (respect rate limits)
  3. Stream progress via Server-Sent Events for long PDFs
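  4. Keep first-pass summaries on the cheapest capable model (gpt-4o-mini is already priced below GPT-3.5-turbo) and reserve larger models for the final combine step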
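
For the second item, a bare Promise.all over every chunk would fire all requests at once and run straight into rate limits. A minimal sketch of a bounded alternative follows. mapWithConcurrency is a hypothetical helper, not a library function, and the right limit depends on your account's rate limits.

```typescript
// Hypothetical helper: map over items with at most `limit` calls in flight.
// Results keep the order of the input array. If a worker throws, the returned
// promise rejects, while calls that are already running finish in the background.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error('limit must be a positive integer');
  }
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

In summarizeChunks, the sequential loop could then become mapWithConcurrency(chunks, 5, chunk => summarizeChunk(chunk)), where 5 is an arbitrary starting point to tune.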

Cost Analysis

For a 100-page PDF (~200,000 characters, ~100 chunks):

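  • Input tokens: ~50,000 for the chunk text (about 4 characters per token), plus a smaller overhead for prompts and the combine passes
  • Output tokens: ~10,000
  • Cost with GPT-4o-mini: roughly $0.01 to $0.02 per document, assuming $0.15 per million input tokens and $0.60 per million output tokens
  • Cost with GPT-4: roughly $2 per document, assuming $30 per million input tokens and $60 per million output tokens

Prices change, so check the current rates before budgeting.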
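
The arithmetic behind an estimate like this is simple enough to script. In the sketch below the names are hypothetical and the prices are deliberately parameters: the example values are assumptions for illustration, not authoritative rates.

```typescript
// Hypothetical estimator. Prices are USD per 1 million tokens and are passed
// in on purpose: look up current values before relying on the result.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  pricing: Pricing
): number {
  return (
    (inputTokens / 1e6) * pricing.inputPerMillion +
    (outputTokens / 1e6) * pricing.outputPerMillion
  );
}

// Rough rule of thumb for English text: about 4 characters per token.
function estimateTokens(charCount: number, charsPerToken = 4): number {
  return Math.ceil(charCount / charsPerToken);
}

// Assumed example prices, for illustration only.
const examplePricing: Pricing = { inputPerMillion: 0.15, outputPerMillion: 0.6 };

// 200,000 characters of input and 10,000 output tokens
const cost = estimateCostUsd(estimateTokens(200_000), 10_000, examplePricing);
console.log(cost.toFixed(4)); // prints 0.0135
```

The figure covers only the chunk text and its summaries. Prompt wrappers and the recursive combine calls add tokens on top, so treat it as a lower bound.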

Limitations & Solutions

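Problem                  | Solution
Scanned PDFs (images)    | Use OCR (Tesseract.js or AWS Textract)
Tables & charts          | Extract with pdf-table-extractor
Non-English text         | pdf-parse needs no language setting for text-based PDFs; state the desired output language in the prompt, and pick the matching OCR language for scans
Very large PDFs (>500MB) | Use streaming chunk processing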
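
Scanned PDFs are the easiest of these to miss, because extraction succeeds but returns almost no text. One way to catch them before spending API calls is a characters-per-page heuristic, sketched below. looksScanned and its threshold of 100 are hypothetical, and the sketch assumes a page count is available (the pdf-parse result used in Step 2 exposes one as numpages next to text).

```typescript
// Hypothetical heuristic: a PDF whose extracted text is nearly empty relative
// to its page count is probably image-only and needs OCR.
function looksScanned(
  text: string,
  pageCount: number,
  minCharsPerPage = 100
): boolean {
  if (pageCount <= 0) return true; // nothing usable was parsed
  const visibleChars = text.replace(/\s+/g, '').length;
  return visibleChars / pageCount < minCharsPerPage;
}

console.log(looksScanned('\n\n \n', 12)); // true: 12 pages, no visible text
console.log(looksScanned('x'.repeat(5000), 2)); // false: 2,500 characters per page
```

When the check fires, the API can return a clear error or route the file to an OCR step instead of producing an empty summary.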

Conclusion

You now have a PDF summarizer whose recursive design scales to long documents. Deploy it to a hosting platform like Railway or Fly.io, add authentication, and you have the core of a SaaS product ready for beta.

Extensions:

  • Add question answering over the PDF (RAG with embeddings)
  • Support for DOCX, PPTX, and TXT files
  • Generate bullet-point summaries and key takeaways
  • Email summaries to users using Resend

Find the complete code on GitHub and try the live demo.
