MinerU in Practice

pdf2md pdf2json

2024-11-24

MinerU is a one-stop, open-source, high-quality data extraction tool that converts PDFs into Markdown and JSON.

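Truth be told, MinerU is the whole reason I set up tailscale in the first place. Reading papers has always come with a few pain points:

Language barrier. Nearly all the good papers in my field are in English.
Note-taking is hard. I could annotate the original directly, but the source text is long and cluttered.
It is hard to hand papers to an LLM for further processing, and so on.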

Running on the lab's 104c server, it parses a page in roughly 15 s.

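Starting from the JSON file the parser generates, I call an LLM API to append a translation to each text entry, then assemble everything into a Markdown file. The result is an immersive-translation layout: one paragraph of English followed by one paragraph of Chinese.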
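The interleaving idea can be sketched in a few lines. This is a minimal illustration, not the full script: `interleave`, `fake_translate`, and the sample entries are hypothetical, and the field names (`type`, `text`, `text_level`) are simply the ones the script below reads.

```python
def interleave(items, translate):
    """Build bilingual Markdown: each source paragraph followed by its translation."""
    parts = []
    for item in items:
        # This sketch only handles non-empty text entries
        if item.get("type") != "text" or not item.get("text"):
            continue
        prefix = "#" * item.get("text_level", 0)
        source = f"{prefix} {item['text']}" if prefix else item["text"]
        parts.append(source)
        parts.append(translate(item["text"]))
    return "\n\n".join(parts) + "\n"


def fake_translate(text):
    # Stand-in for the LLM call so the sketch runs offline
    return f"[zh] {text}"


# Hypothetical sample shaped like the entries the script reads
sample = [
    {"type": "text", "text": "Introduction", "text_level": 1},
    {"type": "text", "text": "Large models are useful."},
    {"type": "image", "img_path": "images/fig1.jpg"},
]

bilingual_md = interleave(sample, fake_translate)
print(bilingual_md)
```

Swapping `fake_translate` for a real API call gives the paragraph-by-paragraph layout described above.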

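The cost is about 20k tokens per medium-sized paper (8 pages). With Doubao that comes to about 2 fen (0.02 CNY) per paper; with GPT-4o through a proxy API bought on Taobao, it is a few mao (tenths of a CNY) per paper.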
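As a rough sanity check on those numbers, the arithmetic can be written out. The two constants come from the estimate above; the function names and the 100-paper batch are made up for illustration, and real pricing usually distinguishes input and output tokens, which this ignores.

```python
TOKENS_PER_PAPER = 20_000      # rough estimate from the text (8-page paper)
DOUBAO_CNY_PER_PAPER = 0.02    # "2 fen" per paper


def implied_price_per_million(cost_per_paper, tokens_per_paper):
    """Price per million tokens implied by a per-paper cost."""
    return cost_per_paper / tokens_per_paper * 1_000_000


def batch_cost(n_papers, cost_per_paper):
    """Total cost of translating n_papers at a flat per-paper cost."""
    return n_papers * cost_per_paper


price = implied_price_per_million(DOUBAO_CNY_PER_PAPER, TOKENS_PER_PAPER)
print(f"{price:.2f} CNY per million tokens")

# Hypothetical batch size, just to show the scale
print(f"{batch_cost(100, DOUBAO_CNY_PER_PAPER):.2f} CNY for 100 papers")
```

In other words, the Doubao figure works out to about 1 CNY per million tokens, so even an overnight batch of a hundred papers costs only a couple of CNY.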

I wrote a script to batch-convert and translate: drop a directory of papers in at night, and the translations are waiting in the morning.

scripts/json_trans_to_md.py

import json
import os
# from openai import OpenAI
from volcenginesdkarkruntime import Ark

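PROMPT = """
You are a professional bilingual translator. Translate the following Markdown text from [source language] into Chinese.

Requirements:
1. Keep the original Markdown formatting and markup unchanged, including but not limited to:
   - Heading levels (#, ##, ###)
   - List formatting (ordered and unordered lists)
   - Emphasis markers (**, *, ~~, `)
   - Links and images
   - Code blocks and inline code
   - Table structure
   - Blockquotes (>)

2. Translation principles:
   - Keep technical terms accurate
   - Follow the conventions of the target language
   - Preserve the tone and style of the original
   - Do not translate code inside code blocks
   - Do not translate formulas
   - Keep the original URLs and file paths
   - Keep the original paragraph structure

3. For proper nouns:
   - Use the official translation if one exists
   - On first occurrence you may keep the original, formatted as: translation (original)
   - For code-related terms, prefer the translation commonly used in the industry

4. Output requirements:
   - Output the translated Markdown text directly
   - Do not add extra explanations or notes
   - Keep the original line breaks and blank lines
"""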

# OPENAI
# Specify the base URL when initializing the client
# client = OpenAI(
#     api_key=os.getenv('OPENAI_API_KEY'),
#     base_url=os.getenv('OPENAI_BASE')
# )

# # The rest of the API call stays the same
# def get_chat_response(content):
#     try:
#         response = client.chat.completions.create(
#             model="gpt-4o",
#             messages=[
#                 {"role": "system", "content": PROMPT},
#                 {"role": "user", "content": content}
#             ]
#         )
#         return response.choices[0].message.content
#     except Exception as e:
#         return f"Error: {str(e)}"

# Initialize the Ark client (Doubao)
ark_client = Ark(api_key=os.getenv('ARK_API_KEY'))

def get_chat_response_ark(content):
    try:
        response = ark_client.chat.completions.create(
            model=os.getenv('ARK_MODEL'),  # set ARK_MODEL to your model endpoint ID
            messages=[
                {"role": "system", "content": PROMPT},
                {"role": "user", "content": content}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        # Note: the error text is returned in place of the translation,
        # so it ends up in the output Markdown
        return f"Error: {str(e)}"

from multiprocessing import Pool
from functools import partial

def process_item(item, output_file):
    # output_file is unused here; workers only return Markdown strings
    result = ""
    print("doing", item['type'])

    if item['type'] == 'text':
        text = item['text']
        # Append the translation after the original paragraph (skip empty text)
        if text != '':
            text += '\n\n' + get_chat_response_ark(text)
        print("translated", text)

        # text_level marks a heading; prefix it with that many '#'
        if 'text_level' in item:
            result = '#' * item['text_level'] + ' ' + text + '\n\n'
        else:
            result = text + '\n\n'

    elif item['type'] in ('image', 'table'):
        caption = ' '.join(item['img_caption']) if 'img_caption' in item else ''
        caption = caption.strip()

        image_md = f'![]({item["img_path"]})'
        result = image_md + '\n'

        if caption:
            result += f'*{caption}*\n'
        result += '\n'

    # No lock is needed: each worker returns a string and the main process does all the writing
    return result

def json_to_markdown(json_file_path, output_file_path):
    # Read the JSON file
    with open(json_file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Create a process pool
    with Pool() as pool:
        # Use a partial function to fix the output_file argument
        process_func = partial(process_item, output_file=output_file_path)
        # Process all items concurrently; pool.map returns results in input order
        results = pool.map(process_func, data)

    # Write all results to the output file
    with open(output_file_path, 'w', encoding='utf-8') as out_file:
        for result in results:
            out_file.write(result)

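if __name__ == "__main__":
    import argparse
    import sys

    # Set up the command-line argument parser
    parser = argparse.ArgumentParser(description='Convert a JSON file to a Markdown file')
    parser.add_argument('input_file', help='Path to the input JSON file')
    parser.add_argument('output_file', help='Path to the output Markdown file')

    # Parse the command-line arguments
    args = parser.parse_args()

    try:
        json_to_markdown(args.input_file, args.output_file)
        print(f"Conversion succeeded!\nOutput file: {args.output_file}")
    except Exception as e:
        print(f"Conversion failed: {str(e)}")
        # Exit non-zero so that callers checking the exit status see the failure
        sys.exit(1)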
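The script above spreads the API calls across a `multiprocessing.Pool` and depends on `pool.map` returning results in input order, which is what keeps the paragraphs in sequence. Since the work is mostly waiting on the network, a thread pool would probably do the same job. Here is a minimal sketch of that alternative, not what I actually ran: `map_in_order` and `slow_upper` are hypothetical names, and `slow_upper` only stands in for the translation call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_upper(text):
    # Hypothetical stand-in for a network-bound translation call;
    # longer inputs sleep longer, so later items tend to finish first
    time.sleep(0.01 * len(text))
    return text.upper()


def map_in_order(func, items, max_workers=8):
    """Run func over items in worker threads, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the order of the inputs,
        # regardless of which call completes first
        return list(executor.map(func, items))


ordered = map_in_order(slow_upper, ["aaa", "bb", "c"])
print(ordered)
```

With threads there is no pickling of items between processes and the client object is shared, at the price of being limited by the GIL for any CPU-heavy work, which this script does not have.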

batch_do.sh

#!/bin/bash

# Default value
need_trans=true

# # Parse options
# OPTIND=1 
# while getopts "t:" opt; do
#     case $opt in
#         t)
#             need_trans=true
#             ;;
#         \?)
#             echo "Invalid option: -$OPTARG"
#             exit 1
#             ;;
#     esac
# done

if $need_trans; then
    # Check that the ARK_API_KEY and ARK_MODEL environment variables are set
    if [ -z "$ARK_API_KEY" ]; then
        echo "ARK_API_KEY environment variable is not set"
        exit 1
    fi

    if [ -z "$ARK_MODEL" ]; then
        echo "ARK_MODEL environment variable is not set"
        exit 1
    fi
fi

# Check the remaining arguments
if [ $# -lt 2 ]; then
    echo "Usage: $0 <PDF files...> <output dir>"
    echo "     Translation is always on; the -t option is currently commented out"
    exit 1
fi

# Use the last argument as the output directory
output_dir="${@: -1}"
echo "Output directory: $output_dir"
files=()
# Every argument except the last one is an input PDF
for file in "${@:1:$(($# - 1))}"; do
    files+=("$file")
    echo "Queued file: $file"
done

output_json_files=()

# Create a timestamped log file (and make sure the logs directory exists)
mkdir -p logs
log_file="logs/pdf_process_$(date '+%Y%m%d_%H%M%S').log"

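total=0
success=0
failed=0
for file in "${files[@]}"; do
    ((total++))
    name="$(basename "$file" .pdf)"
    echo "Processing: $file, output dir: $output_dir/$name/" | tee -a "$log_file"
    # Run magic-pdf inside the MinerU conda environment; passing the arguments
    # directly avoids nested quoting and keeps paths with spaces intact
    conda run -n MinerU --no-capture-output \
        magic-pdf -p "$file" -o "$output_dir/" -m auto 2>&1 | tee -a "$log_file"
    # PIPESTATUS[0] is the exit status of conda run, not of tee
    if [ "${PIPESTATUS[0]}" -eq 0 ]; then
        ((success++))
        echo "✅ $file processed successfully" | tee -a "$log_file"
    else
        ((failed++))
        echo "❌ $file failed to process" | tee -a "$log_file"
    fi
done

echo "Done: processed $total files, $success succeeded, $failed failed" | tee -a "$log_file"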

# Translate each paper whose content list was generated
if $need_trans; then
    for file in "${files[@]}"; do
        name="$(basename "$file" .pdf)"
        output_json_file="$output_dir/$name/auto/${name}_content_list.json"
        if [ -f "$output_json_file" ]; then
            echo "Translating ${name}_trans.md" | tee -a "$log_file"
            python3 scripts/json_trans_to_md.py "$output_json_file" "$output_dir/$name/auto/${name}_trans.md"
            if [ $? -eq 0 ]; then
                echo "✅ $file translated: ${name}_trans.md" | tee -a "$log_file"
            else
                echo "❌ $file translation failed: ${name}_trans.md" | tee -a "$log_file"
            fi
        fi
    done
fi
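The batch script relies on a path convention: for each PDF, the content list sits at `<output dir>/<name>/auto/<name>_content_list.json` and the translation is written next to it as `<name>_trans.md`. If I ever move the orchestration into Python, the same layout could be captured like this. It is a minimal sketch; the function names are hypothetical, and the `auto` subdirectory is taken from the script above rather than from any MinerU documentation.

```python
from pathlib import Path


def content_list_path(pdf_path, output_dir, mode="auto"):
    """Where the batch script expects the parsed content list for a PDF."""
    # Path.stem drops the final suffix, matching `basename file .pdf` for .pdf inputs
    stem = Path(pdf_path).stem
    return Path(output_dir) / stem / mode / f"{stem}_content_list.json"


def translated_md_path(pdf_path, output_dir, mode="auto"):
    """Where the batch script writes the bilingual Markdown for a PDF."""
    stem = Path(pdf_path).stem
    return Path(output_dir) / stem / mode / f"{stem}_trans.md"


print(content_list_path("papers/paper.pdf", "out"))
print(translated_md_path("papers/paper.pdf", "out"))
```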