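The goal of this procedure is the following configuration:
- Install FireCrawl (API server and workers) on Ubuntu 24
- Install Node.js system-wide and make pnpm available globally
- Set up the Rust toolchain (rustup, etc.) and build FireCrawl's Rust HTML conversion library (html-transformer)
- Install the Playwright dependency packages and create a custom Playwright microservice script
- Create two systemd services that launch FireCrawl (one for the server, one for the workers) so that:
  - they start automatically when the OS boots
  - they can be controlled manually (optional) with commands such as sudo systemctl restart firecrowl-server and sudo systemctl restart firecrowl-workers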
+-----------------+
|      Dify       |
+-----------------+
         |
         | HTTP / REST API
         v
+------------------------------------------------------------------+
| FireCrawl (Node.js, TypeScript, pnpm)                            |
| Directory: /home/firecrawl/apps/api                              |
|                                                                  |
| +----------------------+  (pnpm run start)                       |
| | API server           |-----------------------------------------+
| | - Express            |                                         |
| | - BullMQ Dashboard   |                                         |
| +----------------------+                                         |
|            ^                                                     |
|            | (queue jobs)                                        |
| +----------------------+  (pnpm run workers)                     |
| | Workers              |-----------------------------------------+
| | - scraping, parsing, |                                         |
| |   indexing           |                                         |
| +----------------------+                                         |
|                                                                  |
| - Options set in .env (API keys, PORT, HOST,                     |
|   PLAYWRIGHT_MICROSERVICE_URL, etc.)                             |
|                                                                  |
| - Rust HTML Transformer:                                         |
|   -> built with Cargo in:                                        |
|      /home/firecrawl/apps/api/sharedLibs/html-transformer        |
|   -> produces: libhtml_transformer.so                            |
|   -> used by FireCrawl for fast HTML parsing                     |
|      (or falls back to Cheerio)                                  |
+------------------------------------------------------------------+
         |
         | Job queue / rate limiting
         v
+----------------------------------+
| Redis (localhost)                |
| - used for BullMQ job queues     |
+----------------------------------+
+---------------------------------------------+
| systemd (Ubuntu 24)                         |
|                                             |
| +----------------------------+              |
| | firecrowl-server           |              |
| | - ExecStart=pnpm run start |              |
| +----------------------------+              |
| +----------------------------+              |
| | firecrowl-workers          |              |
| | - ExecStart=pnpm run       |              |
| |     workers                |              |
| +----------------------------+              |
| +----------------------------+              |
| | firecrowl-playwright       |              |
| | - ExecStart=pnpm run       |              |
| |     playwright-service     |              |
| +----------------------------+              |
| (starts at boot, process supervision,       |
|  logs managed through systemd)              |
+---------------------------------------------+
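Prerequisites:
- FireCrawl runs on Ubuntu 24 (codename "noble") without Docker
- The OS user has administrator privileges
- The FireCrawl repository is assumed to be cloned into /home/firecrawl
- Redis is installed (sudo apt install -y redis-server) and running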
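Before going further, it can help to confirm that Redis really answers on the address FireCrawl will use. The following is a minimal sketch, assuming an unauthenticated Redis at redis://localhost:6379 (the value this guide later uses for REDIS_URL). The helper names parseRedisUrl and pingRedis are made up for illustration and are not part of FireCrawl.

```javascript
const net = require('net');

// Extract host and port from a redis:// URL, defaulting to Redis' standard port 6379.
function parseRedisUrl(redisUrl) {
  const u = new URL(redisUrl);
  return { host: u.hostname, port: Number(u.port || 6379) };
}

// Send an inline PING and resolve to true when the server answers +PONG.
function pingRedis(redisUrl, timeoutMs = 2000) {
  const { host, port } = parseRedisUrl(redisUrl);
  return new Promise((resolve, reject) => {
    const socket = net.connect({ host, port });
    socket.setTimeout(timeoutMs, () => socket.destroy(new Error('Redis ping timed out')));
    socket.once('error', reject);
    socket.once('connect', () => socket.write('PING\r\n'));
    socket.once('data', (buf) => {
      socket.end();
      resolve(buf.toString().trim() === '+PONG');
    });
  });
}

// Example:
// pingRedis('redis://localhost:6379').then((ok) => console.log('Redis reachable:', ok));
```

A result of false usually means something answered on the port but did not reply with +PONG, for example a Redis instance that requires authentication.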
STEP1. System preparation and Node.js / pnpm installation
STEP1.1 Update the system and install development tools
sudo apt update
sudo apt install -y build-essential pkg-config curl git libssl-dev
STEP1.2 Install Node.js from NodeSource
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt-get install -y nodejs
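STEP1.3 Install pnpm globally
sudo npm install -g pnpm
Note: run which pnpm to confirm the executable path (e.g. /usr/local/bin/pnpm). The systemd units in STEP5 only find pnpm if that directory is on their PATH.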
STEP2. Cloning the FireCrawl repository and installing dependencies
STEP2.1 Clone FireCrawl
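cd /home/firecrawl
git clone https://github.com/mendableai/firecrawl.git .
Note: the trailing . clones into the current directory, which git requires to be empty. Without it the repository lands in /home/firecrawl/firecrawl, and the paths used in the rest of this guide (such as /home/firecrawl/apps/api) will not match.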
STEP2.2 Install dependencies
cd /home/firecrawl/apps/api
pnpm install
STEP2.3 Configure the .env file
Create or edit /home/firecrawl/apps/api/.env and set the required environment variables. Example:
# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://localhost:6379
REDIS_RATE_LIMIT_URL=redis://localhost:6379
PLAYWRIGHT_MICROSERVICE_URL=http://localhost:3000/html
USE_DB_AUTHENTICATION=false
# ===== Optional ENVS ======
TEST_API_KEY=fc-bestnet
BULL_AUTH_KEY=fc-bestnet
...(set other optional variables as needed)
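A missing or misspelled variable in .env tends to surface only later, as a confusing runtime error. As a rough sketch, the check below parses a .env file and reports which of the variables listed under "Required ENVS" above are absent. The functions parseEnv and missingKeys are illustrative helpers, not part of FireCrawl, and the parser only handles plain KEY=VALUE lines (no quoting or multi-line values).

```javascript
// Keys taken from the "Required ENVS" section of the example .env above.
const REQUIRED_KEYS = [
  'NUM_WORKERS_PER_QUEUE', 'PORT', 'HOST', 'REDIS_URL',
  'REDIS_RATE_LIMIT_URL', 'PLAYWRIGHT_MICROSERVICE_URL', 'USE_DB_AUTHENTICATION',
];

// Parse KEY=VALUE lines, skipping blank lines and "#" comments.
function parseEnv(text) {
  const env = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (!line || line.startsWith('#')) continue;
    const eq = line.indexOf('=');
    if (eq === -1) continue; // not an assignment, e.g. a stray "..." line
    env[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return env;
}

// Return the required keys that are missing or empty.
function missingKeys(env, required = REQUIRED_KEYS) {
  return required.filter((key) => !env[key]);
}

// Example:
// const fs = require('fs');
// const env = parseEnv(fs.readFileSync('/home/firecrawl/apps/api/.env', 'utf8'));
// console.log('Missing:', missingKeys(env));
```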
STEP3. Setting up the Rust toolchain and building the HTML Transformer
STEP3.1 Install the Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
rustc --version
cargo --version
STEP3.2 Locate the Rust library directory inside FireCrawl
Run the following command from the repository root to locate the directory that contains Cargo.toml.
cd /home/firecrawl
find . -type f -name Cargo.toml | grep -i html-transformer
Example: if ./apps/api/sharedLibs/html-transformer/Cargo.toml is found, change into that directory.
STEP3.3 Build the Rust library
cd /home/firecrawl/apps/api/sharedLibs/html-transformer
cargo build --release
A successful build produces target/release/libhtml_transformer.so. To confirm:
ls target/release/libhtml_transformer.so
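STEP3.4 Deploying the library or setting environment variables
- Option A: if FireCrawl's Node.js code loads the library via a relative path, no extra steps are needed
- Option B: add the build directory to LD_LIBRARY_PATH as needed
export LD_LIBRARY_PATH=/home/firecrawl/apps/api/sharedLibs/html-transformer/target/release:$LD_LIBRARY_PATH
To have systemd configure this automatically, add Environment=LD_LIBRARY_PATH=... to the service file.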
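One detail of the export form: when LD_LIBRARY_PATH is initially unset, appending :$LD_LIBRARY_PATH leaves a trailing empty entry, which the dynamic loader treats as the current directory. The sketch below builds the value without empty or duplicate entries and prints a line that could go into a unit file. prependLibraryPath is a hypothetical helper, and LIB_DIR assumes the build directory shown in the architecture diagram.

```javascript
// Prepend dir to a colon-separated search path, dropping empty entries and duplicates of dir.
function prependLibraryPath(dir, current = '') {
  const rest = current.split(':').filter((entry) => entry && entry !== dir);
  return [dir, ...rest].join(':');
}

// Assumed build output directory; adjust if the find command in STEP3.2 reports another location.
const LIB_DIR = '/home/firecrawl/apps/api/sharedLibs/html-transformer/target/release';

// Line that could be added to the [Service] section of a unit file.
console.log(`Environment=LD_LIBRARY_PATH=${prependLibraryPath(LIB_DIR, process.env.LD_LIBRARY_PATH)}`);
```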
STEP4. Setting up Playwright and creating the microservice script
STEP4.1 Install the Playwright dependency packages
Example dependency libraries for Ubuntu 24.04:
sudo apt-get install -y libnss3 libnspr4 libatk1.0-0 libatk-bridge2.0-0 libcups2 \
  libdrm2 libxkbcommon0 libgtk-3-0 libpango-1.0-0 libcairo2 libgdk-pixbuf2.0-0 \
  libgbm1 libatspi2.0-0 libx11-xcb1 libxcomposite1 libxdamage1 libxfixes3 \
  libxrandr2 libxrender1 libxtst6 libxcb1 libxi6 libxcursor1 ca-certificates \
  fonts-liberation xdg-utils
STEP4.2 Install the Playwright browsers
pnpm exec playwright install
STEP4.3 Create the Playwright microservice script
Create playwright-service.js in /home/firecrawl/apps/api and paste in the following sample code.
// playwright-service.js
// Minimal HTTP wrapper around Playwright: GET /html?url=<target> returns the rendered HTML.
const http = require('http');
const { chromium } = require('playwright');

const PORT = 3000;

const server = http.createServer(async (req, res) => {
  if (req.method !== 'GET' || !req.url.startsWith('/html')) {
    res.writeHead(404, { 'Content-Type': 'text/plain' });
    return res.end('Not found');
  }
  const urlParam = new URL(req.url, `http://localhost:${PORT}`).searchParams.get('url');
  if (!urlParam) {
    res.writeHead(400, { 'Content-Type': 'text/plain' });
    return res.end('Missing ?url parameter');
  }
  let browser;
  try {
    // A fresh headless browser is launched for every request.
    browser = await chromium.launch({ headless: true });
    const page = await browser.newPage();
    // Wait until network activity settles so client-side rendering can finish.
    await page.goto(urlParam, { waitUntil: 'networkidle' });
    const content = await page.content();
    res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
    res.end(content);
  } catch (err) {
    console.error('Playwright microservice error:', err);
    if (!res.headersSent) {
      res.writeHead(500, { 'Content-Type': 'text/plain' });
    }
    res.end('Playwright error occurred');
  } finally {
    // Close the browser even when navigation fails, so Chromium processes do not pile up.
    if (browser) await browser.close().catch(() => {});
  }
});

server.listen(PORT, '0.0.0.0', () => {
  console.log(`Playwright microservice listening on http://0.0.0.0:${PORT}/html`);
});
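The service reads its target from the ?url query parameter, so a target that carries its own query string has to be percent-encoded or part of it will be lost. Here is a small client sketch, assuming the service above is listening on http://localhost:3000/html. buildRenderUrl and fetchRenderedHtml are illustrative names, and the global fetch requires Node.js 18 or newer.

```javascript
// Build the request URL; searchParams.set percent-encodes the target for us.
function buildRenderUrl(serviceUrl, targetUrl) {
  const u = new URL(serviceUrl);
  u.searchParams.set('url', targetUrl);
  return u.toString();
}

// Ask the microservice to render targetUrl and return the resulting HTML as a string.
async function fetchRenderedHtml(serviceUrl, targetUrl) {
  const res = await fetch(buildRenderUrl(serviceUrl, targetUrl));
  if (!res.ok) throw new Error(`Playwright service returned HTTP ${res.status}`);
  return res.text();
}

// Example:
// fetchRenderedHtml('http://localhost:3000/html', 'https://example.com/?q=a&page=2')
//   .then((html) => console.log(html.slice(0, 200)));
```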
STEP4.4 Add the script to package.json
Add the following entry to "scripts" in /home/firecrawl/apps/api/package.json (merge it with the existing entries).
"playwright-service": "node playwright-service.js"
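STEP5. Configuring the systemd services
To start the FireCrawl server, the workers, and the Playwright microservice automatically, create a systemd unit file for each of them.
STEP5.1 FireCrawl server service (/etc/systemd/system/firecrowl-server.service)
[Unit]
Description=FireCrawl Server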
After=network.target
[Service]
User=firecrawl
Group=firecrawl
WorkingDirectory=/home/firecrawl/apps/api
Environment=PATH=/usr/local/bin:/usr/bin:/bin
ExecStart=/bin/bash -c 'pnpm run start'
Restart=always
RestartSec=5
Type=simple
[Install]
WantedBy=multi-user.target
STEP5.2 FireCrawl workers service (/etc/systemd/system/firecrowl-workers.service)
[Unit]
Description=FireCrawl Workers
After=network.target
[Service]
User=firecrawl
Group=firecrawl
WorkingDirectory=/home/firecrawl/apps/api
Environment=PATH=/usr/local/bin:/usr/bin:/bin
ExecStart=/bin/bash -c 'pnpm run workers'
Restart=always
RestartSec=5
Type=simple
[Install]
WantedBy=multi-user.target
STEP5.3 Playwright microservice service (/etc/systemd/system/firecrowl-playwright.service)
[Unit]
Description=FireCrawl Playwright Microservice
After=network.target
[Service]
User=firecrawl
Group=firecrawl
WorkingDirectory=/home/firecrawl/apps/api
Environment=PATH=/home/firecrawl/.nvm/versions/node/v20.18.3/bin:/usr/local/bin:/usr/bin:/bin
ExecStart=/bin/bash -c 'pnpm run playwright-service'
Restart=always
RestartSec=5
Type=simple
[Install]
WantedBy=multi-user.target
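Notes:
- When using User=firecrawl, create the firecrawl user beforehand and set ownership with a command such as sudo chown -R firecrawl:firecrawl /home/firecrawl.
- Environment=PATH=... in the Playwright unit includes the path of a Node.js installed through nvm. Adjust it to your environment; if Node.js was installed system-wide as in STEP1.2, the nvm entry is likely unnecessary.
- If the Rust library is not loaded via a relative path, LD_LIBRARY_PATH must be configured as well.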
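The three unit files differ only in their description, command, and (for Playwright) PATH. As an illustration of that shared structure, the sketch below renders a unit from a few parameters. renderUnit is a hypothetical helper, and its defaults are copied from the unit files in this guide rather than from any FireCrawl tooling.

```javascript
// Render a unit file as text. Defaults mirror the server and workers units above.
function renderUnit({
  description,
  command,
  user = 'firecrawl',
  workingDirectory = '/home/firecrawl/apps/api',
  path = '/usr/local/bin:/usr/bin:/bin',
}) {
  return [
    '[Unit]',
    `Description=${description}`,
    'After=network.target',
    '[Service]',
    `User=${user}`,
    `Group=${user}`,
    `WorkingDirectory=${workingDirectory}`,
    `Environment=PATH=${path}`,
    `ExecStart=/bin/bash -c '${command}'`,
    'Restart=always',
    'RestartSec=5',
    'Type=simple',
    '[Install]',
    'WantedBy=multi-user.target',
    '',
  ].join('\n');
}

// Print the workers unit; redirect the output to a file and copy it into /etc/systemd/system.
console.log(renderUnit({ description: 'FireCrawl Workers', command: 'pnpm run workers' }));
```

Generating the files this way keeps the three units in sync when a shared setting such as WorkingDirectory changes.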
STEP5.4 Enable and start the systemd services
sudo systemctl daemon-reload
sudo systemctl enable firecrowl-server
sudo systemctl enable firecrowl-workers
sudo systemctl enable firecrowl-playwright
sudo systemctl start firecrowl-server
sudo systemctl start firecrowl-workers
sudo systemctl start firecrowl-playwright
Use sudo systemctl status <service> to confirm that each service is in the active (running) state.
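Checking three services by hand gets tedious after each restart. The following sketch queries systemctl is-active for every unit created above and lists the ones that are not running. The function names are illustrative; the probe argument exists only so the logic can be exercised on a machine without systemd.

```javascript
const { spawnSync } = require('child_process');

const SERVICES = ['firecrowl-server', 'firecrowl-workers', 'firecrowl-playwright'];

// Ask systemd for the state of one unit, e.g. "active", "inactive" or "failed".
// Returns "unknown" when systemctl is unavailable or prints nothing.
function systemctlIsActive(name) {
  const result = spawnSync('systemctl', ['is-active', name], { encoding: 'utf8' });
  return (result.stdout || '').trim() || 'unknown';
}

// Map each service name to its reported state.
function serviceStates(names = SERVICES, probe = systemctlIsActive) {
  return Object.fromEntries(names.map((name) => [name, probe(name)]));
}

// List the services whose state is anything other than "active".
function notRunning(states) {
  return Object.keys(states).filter((name) => states[name] !== 'active');
}

// Example:
// const states = serviceStates();
// console.log(states, 'not running:', notRunning(states));
```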
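STEP6. Restarting and verifying FireCrawl
- Check the .env settings:
  In particular, confirm that PLAYWRIGHT_MICROSERVICE_URL is set to http://localhost:3000/html
- Confirm that the FireCrawl server and workers are running:
  Messages such as "Scrape via fetch…" in the logs indicate that HTTP requests are being processed normally
- Confirm that the Playwright service is working:
  Open http://<server IP>:3000/html?url=https://example.com in a browser; if HTML is returned, the service is working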
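STEP7. Final checks
- Rust library:
  Confirm that target/release/libhtml_transformer.so exists and that the logs show no errors
- Playwright microservice:
  Test that it starts correctly and performs JavaScript rendering
- End-to-end integration:
  Finally, check that the FireCrawl server, the workers, and Playwright work together and respond correctly to requests from Dify and similar clients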
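For the end-to-end check, a request can be sent to the FireCrawl API in the same way a client such as Dify would. The sketch below is written under several assumptions: the server listens on port 3002 as set in .env, the checkout exposes a POST /v1/scrape endpoint that accepts a JSON body with url and formats, and the key is the TEST_API_KEY value from the example .env. Check the API version of your checkout before relying on the path or body shape. buildScrapeRequest and scrape are illustrative names.

```javascript
// Build the URL and fetch options for a scrape request.
// Assumed, not confirmed by this guide: the /v1/scrape path and the { url, formats } body.
function buildScrapeRequest(baseUrl, apiKey, targetUrl) {
  return {
    url: new URL('/v1/scrape', baseUrl).toString(),
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ url: targetUrl, formats: ['markdown'] }),
    },
  };
}

// Send the request (global fetch needs Node.js 18+) and return the parsed JSON response.
async function scrape(baseUrl, apiKey, targetUrl) {
  const { url, options } = buildScrapeRequest(baseUrl, apiKey, targetUrl);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`FireCrawl returned HTTP ${res.status}`);
  return res.json();
}

// Example:
// scrape('http://localhost:3002', 'fc-bestnet', 'https://example.com')
//   .then((result) => console.log(result))
//   .catch((err) => console.error(err.message));
```

If the request fails, the journal for firecrowl-server and firecrowl-workers is usually the first place to look.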