Crawling using Puppeteer

So you want to scrape a website, take a screenshot, or download an image. The cool kids nowadays reach for n8n plus Firecrawl for that, but there is a much cheaper route: Puppeteer. All it takes is a self-hosted n8n instance and a willingness to get your hands a little dirty, as shown below.

OS

Ubuntu 24.04 LTS

Dockerfile

# Based on the official n8n image
FROM n8nio/n8n:1.99.0

# Switch to root to install dependencies
USER root

# Install Chromium and the libraries required for headless rendering
RUN apk add --no-cache \
  chromium \
  nss \
  freetype \
  harfbuzz \
  ca-certificates \
  ttf-freefont \
  nodejs \
  npm \
  udev \
  bash

# Install the Node.js libraries used for crawling
RUN npm install -g \
  puppeteer-core@latest \
  axios \
  cheerio

# Tell Puppeteer where to find the Chromium binary
ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser

# Switch back to the node user
USER node

Build the image: docker build -t n8n-with-modules .
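
Quick sanity check that Chromium actually made it into the image (the binary name differs between Alpine releases, hence the fallback):

docker run --rm --entrypoint sh n8n-with-modules -c 'chromium-browser --version || chromium --version'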

docker-compose.yml

version: '3.8'

services:
  postgres:
    image: postgres:15
    restart: always
    environment:
      POSTGRES_USER: ${DB_POSTGRESDB_USER}
      POSTGRES_PASSWORD: ${DB_POSTGRESDB_PASSWORD}
      POSTGRES_DB: ${DB_POSTGRESDB_DATABASE}
    volumes:
      - ./postgres_data:/var/lib/postgresql/data

  n8n:
    image: n8n-with-modules
    restart: always
    ports:
      - "5678:5678"
    environment:
      N8N_BASIC_AUTH_ACTIVE: ${N8N_BASIC_AUTH_ACTIVE}
      N8N_BASIC_AUTH_USER: ${N8N_BASIC_AUTH_USER}
      N8N_BASIC_AUTH_PASSWORD: ${N8N_BASIC_AUTH_PASSWORD}

      GENERIC_TIMEZONE: ${GENERIC_TIMEZONE}

      DB_TYPE: ${DB_TYPE}
      DB_POSTGRESDB_HOST: ${DB_POSTGRESDB_HOST}
      DB_POSTGRESDB_PORT: ${DB_POSTGRESDB_PORT}
      DB_POSTGRESDB_DATABASE: ${DB_POSTGRESDB_DATABASE}
      DB_POSTGRESDB_USER: ${DB_POSTGRESDB_USER}
      DB_POSTGRESDB_PASSWORD: ${DB_POSTGRESDB_PASSWORD}

      N8N_PORT: ${N8N_PORT}
      N8N_HOST: ${N8N_HOST}
      N8N_PROTOCOL: ${N8N_PROTOCOL}
      WEBHOOK_URL: ${WEBHOOK_URL}

      NODE_FUNCTION_ALLOW_BUILTIN: '*'
      NODE_FUNCTION_ALLOW_EXTERNAL: cheerio,axios,puppeteer-core
    volumes:
      - ./n8n_data:/home/node/.n8n
      - ./scripts:/data/scripts
    depends_on:
      - postgres
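
.env

The compose file reads all of its settings from a .env file next to it. A minimal sketch with placeholder values (change them for your environment; DB_TYPE must be postgresdb for n8n's Postgres backend, and DB_POSTGRESDB_HOST matches the postgres service name above):

N8N_BASIC_AUTH_ACTIVE=true
N8N_BASIC_AUTH_USER=admin
N8N_BASIC_AUTH_PASSWORD=change-me

GENERIC_TIMEZONE=Asia/Ho_Chi_Minh

DB_TYPE=postgresdb
DB_POSTGRESDB_HOST=postgres
DB_POSTGRESDB_PORT=5432
DB_POSTGRESDB_DATABASE=n8n
DB_POSTGRESDB_USER=n8n
DB_POSTGRESDB_PASSWORD=change-me

N8N_PORT=5678
N8N_HOST=localhost
N8N_PROTOCOL=http
WEBHOOK_URL=http://localhost:5678/

Start the stack: docker compose up -d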

The ./scripts directory can hold crawler scripts that you run from an Execute Command node or a Code node.

crawl-an-url.js

root@localhost:~/puppeteer-scripts# vi scripts/crawl-an-url.js

const puppeteer = require('puppeteer-core'); // matches the package installed in the Dockerfile and allow-listed in NODE_FUNCTION_ALLOW_EXTERNAL

(async () => {
  const inputUrl = process.argv[2];
  if (!inputUrl) {
    console.error('❌ No URL provided!');
    process.exit(1);
  }

  const browser = await puppeteer.launch({
    headless: 'new',
    // puppeteer-core does not download a browser, so point it at the Chromium baked into the image
    executablePath: process.env.PUPPETEER_EXECUTABLE_PATH || '/usr/bin/chromium-browser',
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--window-size=1920,1080',
    ],
  });

  const page = await browser.newPage();

  try {
    await page.goto(inputUrl, { waitUntil: 'domcontentloaded', timeout: 60000 });

    // 👇 Promise-based timeout to stay compatible across Puppeteer versions
    await new Promise(resolve => setTimeout(resolve, 5000));

    const html = await page.content();
    console.log(html);

  } catch (error) {
    console.error(`❌ Failed to load page: ${error.message}`);
  } finally {
    await browser.close();
  }
})();

💡 Use an n8n Code node, or save this as a .js file and call it from an Execute Command node.
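
If you go the Code node route instead, here is a minimal sketch (assuming the image built above, puppeteer-core allow-listed via NODE_FUNCTION_ALLOW_EXTERNAL as in the compose file, mode "Run Once for All Items", and incoming items carrying a target_url field):

// n8n Code node — sketch only, not production-hardened
const puppeteer = require('puppeteer-core');

const browser = await puppeteer.launch({
  headless: 'new',
  executablePath: process.env.PUPPETEER_EXECUTABLE_PATH || '/usr/bin/chromium-browser',
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
});

const results = [];
try {
  for (const item of $input.all()) {
    const page = await browser.newPage();
    await page.goto(item.json.target_url, { waitUntil: 'domcontentloaded', timeout: 60000 });
    results.push({ json: { url: item.json.target_url, html: await page.content() } });
    await page.close();
  }
} finally {
  await browser.close();
}

return results;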

n8n's Execute Command node

# Paste into Command input
node /data/scripts/crawl-an-url.js "{{ $json.target_url }}"

Paste the command above into the "Command" field (the quotes keep URLs with ? or & intact in the shell) and turn the "Execute Once" toggle on.

Note

Running Puppeteer inside the same container as n8n is fine for short-term use; for long-term production, run the browser as a separate service.
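
One way to split them is to run the browser as its own service (for example a browserless/chrome container) and have your script connect to it over WebSocket instead of launching Chromium locally. A minimal sketch, assuming a hypothetical browser service reachable at ws://chrome:3000:

const puppeteer = require('puppeteer-core');

(async () => {
  // Connect to a browser running in another container instead of launching one here.
  // ws://chrome:3000 is an example endpoint; use whatever your browser service exposes.
  const browser = await puppeteer.connect({ browserWSEndpoint: 'ws://chrome:3000' });

  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'domcontentloaded' });
  console.log(await page.content());

  await browser.disconnect(); // detach without killing the remote browser
})();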