Crawling using Puppeteer
So you want to scrape a website, take a screenshot, or download an image. The cool kids nowadays use n8n with Firecrawl for this, but there is a much cheaper option: Puppeteer. All you need is a self-hosted n8n instance and a willingness to get your hands a little dirty, as shown below.
OS
Ubuntu 24.04 LTS
Dockerfile
# Based on the official n8n image
FROM n8nio/n8n:1.99.0
# Switch to root to install dependencies
USER root
# Install Chromium and the libraries required for headless rendering
RUN apk add --no-cache \
    chromium \
    nss \
    freetype \
    harfbuzz \
    ca-certificates \
    ttf-freefont \
    nodejs \
    npm \
    udev \
    bash
# Install the Node.js libraries used for crawling
RUN npm install -g \
    puppeteer-core@latest \
    axios \
    cheerio
# Tell Puppeteer where the Chromium binary lives
ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser
# Drop back to the node user
USER node
Build image: docker build -t n8n-with-modules .
docker-compose.yml
version: '3.8'
services:
  postgres:
    image: postgres:15
    restart: always
    environment:
      POSTGRES_USER: ${DB_POSTGRESDB_USER}
      POSTGRES_PASSWORD: ${DB_POSTGRESDB_PASSWORD}
      POSTGRES_DB: ${DB_POSTGRESDB_DATABASE}
    volumes:
      - ./postgres_data:/var/lib/postgresql/data
  n8n:
    image: n8n-with-modules
    restart: always
    ports:
      - "5678:5678"
    environment:
      N8N_BASIC_AUTH_ACTIVE: ${N8N_BASIC_AUTH_ACTIVE}
      N8N_BASIC_AUTH_USER: ${N8N_BASIC_AUTH_USER}
      N8N_BASIC_AUTH_PASSWORD: ${N8N_BASIC_AUTH_PASSWORD}
      GENERIC_TIMEZONE: ${GENERIC_TIMEZONE}
      DB_TYPE: ${DB_TYPE}
      DB_POSTGRESDB_HOST: ${DB_POSTGRESDB_HOST}
      DB_POSTGRESDB_PORT: ${DB_POSTGRESDB_PORT}
      DB_POSTGRESDB_DATABASE: ${DB_POSTGRESDB_DATABASE}
      DB_POSTGRESDB_USER: ${DB_POSTGRESDB_USER}
      DB_POSTGRESDB_PASSWORD: ${DB_POSTGRESDB_PASSWORD}
      N8N_PORT: ${N8N_PORT}
      N8N_HOST: ${N8N_HOST}
      N8N_PROTOCOL: ${N8N_PROTOCOL}
      WEBHOOK_URL: ${WEBHOOK_URL}
      NODE_FUNCTION_ALLOW_BUILTIN: '*'
      NODE_FUNCTION_ALLOW_EXTERNAL: cheerio,axios,puppeteer-core
    volumes:
      - ./n8n_data:/home/node/.n8n
      - ./scripts:/data/scripts
    depends_on:
      - postgres
The ./scripts directory can hold crawler scripts that you run from an Execute Command node or a Code node.
crawl-an-url.js
root@localhost:~/puppeteer-scripts# vi scripts/crawl-an-url.js
// puppeteer-core is the package installed in the Dockerfile above
const puppeteer = require('puppeteer-core');

(async () => {
  // The target URL is passed as the first command-line argument
  const inputUrl = process.argv[2];
  if (!inputUrl) {
    console.error('❌ No URL provided!');
    process.exit(1);
  }

  const browser = await puppeteer.launch({
    // puppeteer-core ships without a browser, so point it at the system Chromium
    executablePath: process.env.PUPPETEER_EXECUTABLE_PATH || '/usr/bin/chromium-browser',
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--window-size=1920,1080',
    ],
  });
  const page = await browser.newPage();
  try {
    await page.goto(inputUrl, { waitUntil: 'domcontentloaded', timeout: 60000 });
    // 👇 Use a Promise-based delay so this works regardless of Puppeteer version
    await new Promise(resolve => setTimeout(resolve, 5000));
    const html = await page.content();
    console.log(html);
  } catch (error) {
    console.error(`❌ Failed to load page: ${error.message}`);
    // Signal the failure to the caller without skipping browser.close()
    process.exitCode = 1;
  } finally {
    await browser.close();
  }
})();
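The script only checks that an argument is present, so any string reaches page.goto(). As a minimal sketch, assuming you want to accept only absolute http(s) URLs, a small helper like the one below could run first. The name normalizeTargetUrl is hypothetical and not part of the original script; it uses only Node's built-in URL class.

```javascript
// Hypothetical helper (not in the original script): validate the CLI argument
// before handing it to page.goto().
function normalizeTargetUrl(raw) {
  if (typeof raw !== 'string' || raw.trim() === '') {
    throw new Error('No URL provided');
  }
  let parsed;
  try {
    // The built-in URL constructor throws on relative or malformed input
    parsed = new URL(raw.trim());
  } catch {
    throw new Error(`Not a valid absolute URL: ${raw}`);
  }
  // Assumption: only web pages should be crawled, so reject file:, data:, etc.
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return parsed.href;
}
```

In the script, this would replace the plain presence check, for example const inputUrl = normalizeTargetUrl(process.argv[2]); wrapped in a try/catch that prints the message and exits with code 1.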
💡 Run this from an n8n Code node, or save it as a .js file and call it from an Execute Command node.
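For the Code node route, the same logic might look roughly like the sketch below. It is not the author's exact code: the function name fetchRenderedHtml is hypothetical, the Puppeteer module is passed in as a parameter so the function stays easy to stub, and the commented usage assumes the incoming item has a target_url field and that puppeteer-core is allowed through NODE_FUNCTION_ALLOW_EXTERNAL as in the compose file.

```javascript
// Sketch: render one page with an injected Puppeteer-compatible module.
// executablePath is passed explicitly because puppeteer-core bundles no browser.
async function fetchRenderedHtml(puppeteer, url, executablePath) {
  const browser = await puppeteer.launch({
    executablePath,
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
    return await page.content();
  } finally {
    // Always release the browser, even if navigation throws
    await browser.close();
  }
}

// Possible use inside an n8n Code node ("Run Once for All Items"), as an assumption:
// const puppeteer = require('puppeteer-core');
// const url = $input.first().json.target_url;
// const html = await fetchRenderedHtml(puppeteer, url, '/usr/bin/chromium-browser');
// return [{ json: { html } }];
```

Passing the module in as an argument is a design choice for testability; inside n8n you could just as well require it at the top of the node.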
n8n's Execute Command node
# Paste into Command input
node /data/scripts/crawl-an-url.js "{{ $json.target_url }}"
Paste this into the "Command" input and turn the "Execute Once" toggle on.
Note
Running n8n and Puppeteer in the same container is fine for short-term use; for long-term production, run them as separate services.
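One way to separate them, sketched here under assumptions, is to run Chromium in its own container and attach to it over WebSocket with puppeteer.connect() instead of launching a browser inside the n8n container. The function name fetchViaRemoteBrowser and the endpoint ws://chrome:3000 in the usage comment are hypothetical; the real endpoint depends on how the browser container is set up.

```javascript
// Sketch: render a page through a browser running in a separate service.
// `puppeteer` is injected (for example puppeteer-core) to keep this easy to stub.
async function fetchViaRemoteBrowser(puppeteer, browserWSEndpoint, url) {
  const browser = await puppeteer.connect({ browserWSEndpoint });
  try {
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
      return await page.content();
    } finally {
      // Close only the tab we opened
      await page.close();
    }
  } finally {
    // disconnect() detaches from the remote browser and leaves it running
    await browser.disconnect();
  }
}

// Possible use (endpoint is a placeholder, not a real default):
// const html = await fetchViaRemoteBrowser(require('puppeteer-core'), 'ws://chrome:3000', url);
```

With this layout the n8n image would no longer need Chromium installed; only puppeteer-core is required on the n8n side.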