January 3, 2025

ScrapeGraphAI: A Next-Generation Intelligent Data Scraping Solution - Efficient Data Extraction with Modern AI

Installation Steps

Before using ScrapeGraphAI, install the library and its dependencies as follows:

Install ScrapeGraphAI:
Use pip to install ScrapeGraphAI and its required dependencies.
```
pip install scrapegraphai
```
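Install the browser driver (Playwright):
ScrapeGraphAI uses Playwright to scrape pages that need JavaScript rendering. The following command downloads the browsers that Playwright drives.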
```
playwright install
```
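Install optional extras (as needed):
Depending on your use case, you can install extras for additional language models or extended features:
- Support for additional language models:
```
pip install "scrapegraphai[other-language-models]"
```
- Additional semantic options:
```
pip install "scrapegraphai[more-semantic-options]"
```
- Additional browser options:
```
pip install "scrapegraphai[more-browser-options]"
```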
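Local model support (optional):
To use a local model through Ollama, make sure Ollama is installed and download the models you need with the ollama pull command.
Install the DuckDuckGo search module (optional):
If you need DuckDuckGo search, install the duckduckgo-search package.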
```
pip install -U duckduckgo-search
```
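As a quick sanity check after the steps above, the following sketch reports which packages are importable in the current environment. The helper name `check_installed` is hypothetical, and the import names `scrapegraphai`, `playwright`, and `duckduckgo_search` are assumed to correspond to the pip packages installed above.

```python
from importlib.util import find_spec


def check_installed(modules):
    """Return a dict mapping each top-level module name to True if it is importable."""
    status = {}
    for name in modules:
        try:
            status[name] = find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for malformed or half-initialised modules
            status[name] = False
    return status


# Assumed import names for the packages installed above.
for name, ok in check_installed(["scrapegraphai", "playwright", "duckduckgo_search"]).items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

This only confirms that the Python packages can be found; it does not verify that the Playwright browsers themselves were downloaded.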

Examples

The following examples show typical uses of ScrapeGraphAI in different scenarios:

Example 1: SmartScraper with a local model

This example uses a local model served by Ollama to extract data.

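Make sure that:

Ollama is installed.
The required models have been pulled, for example with ollama pull mistral and ollama pull nomic-embed-text (referenced in the configuration as ollama/mistral and ollama/nomic-embed-text).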
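One way to confirm the second prerequisite is to ask the local Ollama server which models it has pulled. The sketch below assumes Ollama's `GET /api/tags` endpoint at the default address `http://localhost:11434` and a response of the form `{"models": [{"name": "mistral:latest"}, ...]}`; the helper names `has_model` and `fetch_tags` are hypothetical.

```python
import json
from urllib.request import urlopen


def has_model(tags_payload, model):
    """Check whether `model` appears in a parsed /api/tags response.

    A name without a tag matches any tag: "mistral" matches "mistral:latest".
    """
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    if ":" in model:
        return model in names
    return any(n.split(":", 1)[0] == model for n in names)


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the list of local models from a running Ollama server (assumed endpoint)."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With the server running, `has_model(fetch_tags(), "mistral")` and `has_model(fetch_tags(), "nomic-embed-text")` should both be true before the example is run.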

from scrapegraphai.graphs import SmartScraperGraph

# Graph configuration
graph_config = {
    "llm": {
        "model": "ollama/mistral",  # use a local model
        "temperature": 0,
        "format": "json",           # request JSON output explicitly
        "base_url": "http://localhost:11434",  # URL of the local Ollama server
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    source="https://perinim.github.io/projects",  # page to scrape
    config=graph_config
)

# Run the graph and print the result
result = smart_scraper_graph.run()
print(result)

Sample output:

{
    "projects": [
        {
            "title": "Rotary Pendulum RL",
            "description": "Open Source project aimed at controlling a real-life rotary pendulum using RL algorithms"
        },
        {
            "title": "DQN Implementation from scratch",
            "description": "Developed a Deep Q-Network algorithm to train a simple and double pendulum"
        },
        ...
    ]
}
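Assuming `run()` returns a dictionary shaped like the sample output above, the result can be post-processed with ordinary Python. The sketch below is only illustrative: the structure is produced by the LLM, so the `projects`, `title`, and `description` keys may differ in practice, and `format_projects` is a hypothetical helper.

```python
def format_projects(result):
    """Turn a result shaped like the sample output into 'title: description' lines."""
    lines = []
    for project in result.get("projects", []):
        title = project.get("title", "(untitled)")
        description = project.get("description", "")
        lines.append(f"{title}: {description}")
    return lines


# A one-item sample copied from the output above.
sample = {
    "projects": [
        {
            "title": "Rotary Pendulum RL",
            "description": "Open Source project aimed at controlling a real-life rotary pendulum using RL algorithms",
        },
    ]
}
for line in format_projects(sample):
    print(line)
```

Using `.get()` with defaults keeps the helper from failing when the model omits a key or returns no projects at all.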

Example 2: SearchGraph with mixed models

This example uses a Groq model as the LLM and an Ollama model for embeddings to extract specific information from search results.

from scrapegraphai.graphs import SearchGraph

# Graph configuration
graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",  # Groq-hosted language model
        "api_key": "GROQ_API_KEY",    # placeholder: replace with your Groq API key
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",  # Ollama embedding model
        "base_url": "http://localhost:11434",
    },
    "max_results": 5,  # maximum number of search results to use
}

# Create the SearchGraph instance
search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config
)

# Run the graph and print the result
result = search_graph.run()
print(result)

Sample output:

{
    "recipes": [
        {"name": "Sarde in Saòre"},
        {"name": "Bigoli in salsa"},
        {"name": "Seppie in umido"},
        {"name": "Moleche frite"},
        {"name": "Risotto alla pescatora"}
    ]
}
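The string "GROQ_API_KEY" in the configuration above is a placeholder. A common alternative to pasting the real key into the source is to read it from an environment variable. The sketch below assumes a variable named `GROQ_API_KEY` and a hypothetical helper `build_search_config` that otherwise reproduces the same configuration.

```python
import os


def build_search_config(max_results=5):
    """Build the SearchGraph configuration, taking the Groq key from the environment."""
    api_key = os.environ.get("GROQ_API_KEY")
    if not api_key:
        raise RuntimeError("Set the GROQ_API_KEY environment variable first")
    return {
        "llm": {
            "model": "groq/gemma-7b-it",
            "api_key": api_key,
            "temperature": 0,
        },
        "embeddings": {
            "model": "ollama/nomic-embed-text",
            "base_url": "http://localhost:11434",
        },
        "max_results": max_results,
    }
```

The returned dictionary can be passed as `config=build_search_config()` in place of the literal `graph_config`. Failing early with a clear message is usually easier to debug than an authentication error raised midway through a run.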

Example 3: SpeechGraph with OpenAI

This example converts the extracted content into an audio file, which is useful when spoken output is needed.

from scrapegraphai.graphs import SpeechGraph

# Graph configuration
graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",   # placeholder: replace with your OpenAI API key
        "model": "openai/gpt-3.5-turbo"
    },
    "tts_model": {
        "api_key": "OPENAI_API_KEY",
        "model": "tts-1",             # text-to-speech model
        "voice": "alloy"              # voice to use
    },
    "output_path": "audio_summary.mp3",  # path of the generated audio file
}

# Create the SpeechGraph instance
speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://perinim.github.io/projects/",
    config=graph_config
)

# Run the graph and print the result
result = speech_graph.run()
print(result)

Output:
The run writes an audio file named audio_summary.mp3 containing a spoken summary of the projects on the page.
