# MinerU v2.0 Multi-GPU Server

[English](README.md)

This is a streamlined multi-GPU server implementation.

## Quick Start

### 1. Install MinerU

```bash
pip install --upgrade pip
pip install uv
uv pip install -U "mineru[core]"
uv pip install litserve aiohttp loguru
```

### 2. Start the Server

```bash
python server.py
```

### 3. Start the Client

```bash
python client.py
```

The PDF files in the [demo](../../demo/) folder will now be processed in parallel. For example, with 2 GPUs and `workers_per_device` set to `2`, 4 PDF files can be processed at the same time.

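
The arithmetic behind that figure is the number of GPUs multiplied by `workers_per_device`. A minimal sketch, assuming every worker handles one PDF at a time; the helper name `max_parallel_jobs` is hypothetical and not part of MinerU or LitServe:

```python
def max_parallel_jobs(num_gpus: int, workers_per_device: int) -> int:
    """Upper bound on PDFs processed at once: one job per worker per GPU."""
    if num_gpus < 1 or workers_per_device < 1:
        raise ValueError("both arguments must be at least 1")
    return num_gpus * workers_per_device


print(max_parallel_jobs(2, 2))  # 4
```

Each additional worker usually loads its own copy of the models, so available GPU memory is likely to be the practical limit on `workers_per_device`.
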
## Customization

### Server

The following example shows how to start the server with custom settings:

```python
import litserve as ls

# MinerUAPI is assumed to be the LitAPI class defined in server.py
server = ls.LitServer(
    MinerUAPI(output_dir='/tmp/mineru_output'),  # Custom output folder
    accelerator='auto',  # You can also specify 'cuda'
    devices='auto',  # "auto" uses all available GPUs
    workers_per_device=1,  # Start one worker instance per GPU
    timeout=False  # Disable the timeout for long-running jobs
)
server.run(port=8000, generate_client_file=False)
```

### Client

The client supports both synchronous and asynchronous processing:

```python
import asyncio
import aiohttp
from client import mineru_parse_async


async def process_documents():
    async with aiohttp.ClientSession() as session:
        # Basic usage
        result = await mineru_parse_async(session, 'document.pdf')

        # With custom options
        result = await mineru_parse_async(
            session,
            'document.pdf',
            backend='pipeline',
            lang='ch',
            formula_enable=True,
            table_enable=True
        )


# Run the async processing
asyncio.run(process_documents())
```

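
To call an async function like this from plain synchronous code, one option is a thin wrapper around `asyncio.run`. This is only a sketch: `run_sync` and `fake_parse` are hypothetical names, and `fake_parse` is a stand-in for a real client call, with a made-up result shape.

```python
import asyncio


async def fake_parse(path: str) -> dict:
    # Stand-in for an async client call; the returned dict is illustrative only.
    await asyncio.sleep(0)
    return {"file": path, "status": "ok"}


def run_sync(coro_fn, *args, **kwargs):
    """Run an async function to completion from synchronous code."""
    return asyncio.run(coro_fn(*args, **kwargs))


result = run_sync(fake_parse, "document.pdf")
print(result["status"])  # ok
```

`asyncio.run` raises a `RuntimeError` when called from a thread that already has a running event loop, so this wrapper only suits scripts that are not themselves async.
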
### Parallel Processing

Process multiple files at the same time:

```python
async def process_multiple_files():
    files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf']

    async with aiohttp.ClientSession() as session:
        tasks = [mineru_parse_async(session, file) for file in files]
        results = await asyncio.gather(*tasks)

    return results
```
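
For large batches it can help to cap how many requests are in flight and to keep one failed file from discarding the other results. The sketch below does this with `asyncio.Semaphore` and `return_exceptions=True`. The names `gather_bounded` and `fake_parse` are hypothetical, and `fake_parse` stands in for the real client call so the example runs without a server.

```python
import asyncio


async def gather_bounded(parse_fn, files, limit=4):
    """Run parse_fn(file) for every file with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(file):
        async with semaphore:
            return await parse_fn(file)

    # return_exceptions=True returns failures as exception objects
    # in the result list instead of raising the first one.
    return await asyncio.gather(
        *(run_one(f) for f in files), return_exceptions=True
    )


async def fake_parse(file):
    # Stand-in for a real client call; fails on one file to show error handling.
    await asyncio.sleep(0.01)
    if file == "bad.pdf":
        raise ValueError(f"cannot parse {file}")
    return f"parsed:{file}"


results = asyncio.run(
    gather_bounded(fake_parse, ["doc1.pdf", "bad.pdf", "doc2.pdf"], limit=2)
)
for item in results:
    print("failed:" if isinstance(item, Exception) else "ok:", item)
```

With the real client, `parse_fn` could be `lambda file: mineru_parse_async(session, file)` inside the `aiohttp.ClientSession` block. Results come back in the same order as `files`.
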