Hyperbrowser Web Scraping Tools
Hyperbrowser is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for any web scraping need, from scraping a single page to crawling an entire site.
Key Features:
- Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches
- Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright
- Powerful APIs - Easy-to-use APIs for scraping/crawling any site, and much more
- Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies
This notebook provides a quick overview for getting started with the Hyperbrowser web tools.
For more information about Hyperbrowser, visit the Hyperbrowser website, or if you want to check out the docs, visit the Hyperbrowser docs.
Key Capabilities
Scrape
Hyperbrowser provides powerful scraping capabilities that let you extract data from any webpage. The scrape tool converts webpage content into structured formats such as markdown or HTML, making the data easy to process and analyze.
Crawl
The crawl capability lets you automatically traverse multiple pages of a website. You can set parameters such as a page limit to control how extensively the crawler explores the site, collecting data from every page it visits.
Extract
Hyperbrowser's extract capability uses AI to pull specific information from webpages according to a schema you define. This lets you turn unstructured web content into structured data that matches your exact needs.
Overview
Integration details
| Tool | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| Crawl Tool | langchain-hyperbrowser | ❌ | ❌ | ❌ |
| Scrape Tool | langchain-hyperbrowser | ❌ | ❌ | ❌ |
| Extract Tool | langchain-hyperbrowser | ❌ | ❌ | ❌ |
Setup
To access the Hyperbrowser web tools you'll need to install the langchain-hyperbrowser integration package and create a Hyperbrowser account to get an API key.
Credentials
Head to Hyperbrowser to sign up and generate an API key. Once you've done this, set the HYPERBROWSER_API_KEY environment variable:
export HYPERBROWSER_API_KEY=<your-api-key>
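Alternatively, you can set the key at runtime from a notebook; a minimal sketch using only the standard library:
import getpass
import os

# Prompt for the key only if it isn't already set in the environment
if not os.environ.get("HYPERBROWSER_API_KEY"):
    os.environ["HYPERBROWSER_API_KEY"] = getpass.getpass("Hyperbrowser API key: ")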
Installation
Install langchain-hyperbrowser.
%pip install -qU langchain-hyperbrowser
Instantiation
Crawl Tool
The HyperbrowserCrawlTool is a powerful tool that can crawl entire websites, starting from a given URL. It supports configurable page limits and scrape options.
from langchain_hyperbrowser import HyperbrowserCrawlTool
tool = HyperbrowserCrawlTool()
Scrape Tool
The HyperbrowserScrapeTool is a tool that can scrape content from webpages. It supports both markdown and HTML output formats, along with metadata extraction.
from langchain_hyperbrowser import HyperbrowserScrapeTool
tool = HyperbrowserScrapeTool()
Extract Tool
The HyperbrowserExtractTool is a powerful tool that uses AI to extract structured data from webpages. It can pull out information based on a predefined schema.
from langchain_hyperbrowser import HyperbrowserExtractTool
tool = HyperbrowserExtractTool()
Invocation
Basic Usage
Crawl Tool
from langchain_hyperbrowser import HyperbrowserCrawlTool
result = HyperbrowserCrawlTool().invoke(
    {
        "url": "https://example.com",
        "max_pages": 2,
        "scrape_options": {"formats": ["markdown"]},
    }
)
print(result)
{'data': [CrawledPage(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None, url='https://example.com', status='completed', error=None)], 'error': None}
Scrape Tool
from langchain_hyperbrowser import HyperbrowserScrapeTool
result = HyperbrowserScrapeTool().invoke(
    {"url": "https://example.com", "scrape_options": {"formats": ["markdown"]}}
)
print(result)
{'data': ScrapeJobData(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None), 'error': None}
Extract Tool
from langchain_hyperbrowser import HyperbrowserExtractTool
from pydantic import BaseModel
class SimpleExtractionModel(BaseModel):
    title: str
result = HyperbrowserExtractTool().invoke(
    {
        "url": "https://example.com",
        "schema": SimpleExtractionModel,
    }
)
print(result)
{'data': {'title': 'Example Domain'}, 'error': None}
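Note that, as the outputs above show, each tool returns a dict with a data payload and an error field that is None on success.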
With Custom Options
Crawl Tool with custom options
result = HyperbrowserCrawlTool().run(
    {
        "url": "https://example.com",
        "max_pages": 2,
        "scrape_options": {
            "formats": ["markdown", "html"],
        },
        "session_options": {"use_proxy": True, "solve_captchas": True},
    }
)
print(result)
{'data': [CrawledPage(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None, url='https://example.com', status='completed', error=None)], 'error': None}
Scrape Tool with custom options
result = HyperbrowserScrapeTool().run(
    {
        "url": "https://example.com",
        "scrape_options": {
            "formats": ["markdown", "html"],
        },
        "session_options": {"use_proxy": True, "solve_captchas": True},
    }
)
print(result)
{'data': ScrapeJobData(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html='<html><head>\n <title>Example Domain</title>\n\n <meta charset="utf-8">\n <meta http-equiv="Content-type" content="text/html; charset=utf-8">\n <meta name="viewport" content="width=device-width, initial-scale=1">\n \n</head>\n\n<body>\n<div>\n <h1>Example Domain</h1>\n <p>This domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.</p>\n <p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n\n\n</body></html>', markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None), 'error': None}
Extract Tool with custom schema
from typing import List
from pydantic import BaseModel
class ProductSchema(BaseModel):
    title: str
    price: float


class ProductsSchema(BaseModel):
    products: List[ProductSchema]
result = HyperbrowserExtractTool().run(
    {
        "url": "https://dummyjson.com/products?limit=10",
        "schema": ProductsSchema,
        "session_options": {"use_proxy": True},
    }
)
print(result)
{'data': {'products': [{'price': 9.99, 'title': 'Essence Mascara Lash Princess'}, {'price': 19.99, 'title': 'Eyeshadow Palette with Mirror'}, {'price': 14.99, 'title': 'Powder Canister'}, {'price': 12.99, 'title': 'Red Lipstick'}, {'price': 8.99, 'title': 'Red Nail Polish'}, {'price': 49.99, 'title': 'Calvin Klein CK One'}, {'price': 129.99, 'title': 'Chanel Coco Noir Eau De'}, {'price': 89.99, 'title': "Dior J'adore"}, {'price': 69.99, 'title': 'Dolce Shine Eau de'}, {'price': 79.99, 'title': 'Gucci Bloom Eau de'}]}, 'error': None}
Async Usage
All tools support async usage:
from typing import List
from langchain_hyperbrowser import (
    HyperbrowserCrawlTool,
    HyperbrowserExtractTool,
    HyperbrowserScrapeTool,
)
from pydantic import BaseModel
class ExtractionSchema(BaseModel):
    popular_library_name: List[str]


async def web_operations():
    # Crawl
    crawl_tool = HyperbrowserCrawlTool()
    crawl_result = await crawl_tool.arun(
        {
            "url": "https://example.com",
            "max_pages": 5,
            "scrape_options": {"formats": ["markdown"]},
        }
    )

    # Scrape
    scrape_tool = HyperbrowserScrapeTool()
    scrape_result = await scrape_tool.arun(
        {"url": "https://example.com", "scrape_options": {"formats": ["markdown"]}}
    )

    # Extract
    extract_tool = HyperbrowserExtractTool()
    extract_result = await extract_tool.arun(
        {
            "url": "https://npmjs.com",
            "schema": ExtractionSchema,
        }
    )

    return crawl_result, scrape_result, extract_result
results = await web_operations()
print(results)
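The top-level await above works in a notebook; in a plain Python script, drive the coroutine with asyncio instead:
import asyncio

# Outside a notebook, run the event loop explicitly
results = asyncio.run(web_operations())
print(results)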
Use within an Agent
Here's how to use any of the web tools within an agent:
from langchain_hyperbrowser import HyperbrowserCrawlTool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
# Initialize the crawl tool
crawl_tool = HyperbrowserCrawlTool()
# Create the agent with the crawl tool
llm = ChatOpenAI(temperature=0)
agent = create_react_agent(llm, [crawl_tool])
user_input = "Crawl https://example.com and get content from up to 5 pages"
for step in agent.stream(
    {"messages": user_input},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()
================================ Human Message =================================
Crawl https://example.com and get content from up to 5 pages
================================== Ai Message ==================================
Tool Calls:
hyperbrowser_crawl_data (call_G2ofdHOqjdnJUZu4hhbuga58)
Call ID: call_G2ofdHOqjdnJUZu4hhbuga58
Args:
url: https://example.com
max_pages: 5
scrape_options: {'formats': ['markdown']}
================================= Tool Message =================================
Name: hyperbrowser_crawl_data
{'data': [CrawledPage(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None, url='https://example.com', status='completed', error=None)], 'error': None}
================================== Ai Message ==================================
I have crawled the website [https://example.com](https://example.com) and retrieved content from the first page. Here is the content in markdown format:
```
Example Domain
# Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
[More information...](https://www.iana.org/domains/example)
```
If you would like to crawl more pages or need additional information, please let me know!
Configuration Options
Common Options
All tools support these basic configuration options (a combined sketch follows this list):
- url: The URL to process
- session_options: Browser session configuration
  - use_proxy: Whether to use a proxy
  - solve_captchas: Whether to automatically solve CAPTCHAs
  - accept_cookies: Whether to accept cookies
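A minimal sketch combining the session flags above; the flag values are illustrative, and accept_cookies is assumed to take a boolean like the other flags:
from langchain_hyperbrowser import HyperbrowserScrapeTool

# Sketch: session-level flags are grouped under session_options
result = HyperbrowserScrapeTool().run(
    {
        "url": "https://example.com",
        "session_options": {"use_proxy": True, "accept_cookies": True},
    }
)
print(result)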
Tool-Specific Options
Crawl Tool
- max_pages: Maximum number of pages to crawl
- scrape_options: Options for scraping each page
  - formats: List of output formats (markdown, html)
Scrape Tool
- scrape_options: Options for scraping the page
  - formats: List of output formats (markdown, html)
Extract Tool
- schema: Pydantic model defining the structure to extract
- extraction_prompt: Natural-language prompt for the extraction (see the sketch below)
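The examples above all pass a schema; here is a minimal sketch of the prompt-based variant, assuming extraction_prompt accepts a plain natural-language string, with the shape of the returned data then depending on the prompt:
from langchain_hyperbrowser import HyperbrowserExtractTool

# Sketch: prompt-only extraction, no Pydantic schema
result = HyperbrowserExtractTool().run(
    {
        "url": "https://example.com",
        "extraction_prompt": "Extract the main heading of the page",
    }
)
print(result)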
For more details, see the corresponding API reference documentation.