Deploying an OCR Model

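I. Model Principles and Architecture

Optical character recognition (OCR) models convert text in images into editable text. Modern OCR systems typically combine:

Vision Encoder: extracts image features, for example with a CNN or a ViT (Vision Transformer)
Text Decoder: converts the sequence of visual features into a text sequence, commonly a Transformer or an LSTM
Attention mechanism: aligns image regions with text tokens and handles complex layouts
Multimodal fusion: combines vision and language pretraining (such as CLIP) to improve semantic understanding

Models such as DeepSeek-OCR and PaddleOCR-VL adopt a vision-language model (VLM) architecture and support tasks ranging from simple text recognition to complex document understanding.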
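
To make the attention step concrete, here is a minimal sketch of scaled dot-product attention for a single decoder query over a few image regions. It uses only the standard library. The feature vectors and the query are toy values chosen for illustration, not output from any real model, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, region_features):
    """Scaled dot-product attention of one decoder query over image regions.

    Returns (weights, context): one weight per region, plus the weighted
    average of the region feature vectors.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, feat)) / scale
        for feat in region_features
    ]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * feat[i] for w, feat in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context


# Three hypothetical image-region feature vectors (toy values).
regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
# A hypothetical decoder query that points in the direction of region 0.
weights, context = cross_attention([4.0, 0.0], regions)
best_region = max(range(len(weights)), key=weights.__getitem__)
print("most attended region:", best_region)
print("attention weights:", [round(w, 3) for w in weights])
```

The weights sum to 1, and the region whose features best match the query receives the largest share. This is the sense in which attention "aligns" a text token with an image region.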

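II. Application Scenarios

OCR models can be applied to:

Document digitization: converting scans into editable documents, archive management
Receipt and ID recognition: extracting information from invoices, ID cards, and bank cards
Content moderation: detecting prohibited text in images
Smart translation: photo translation, road sign recognition
Table extraction: financial report analysis, structured data entry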

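III. Deployment Guide and Examples

This guide uses the OCR capability of a multimodal vision-language model to demonstrate the deployment workflow for image text recognition on domestic (Chinese-made) compute hardware. PaddleOCR-VL-1.5 serves as the example model.

Inference Framework Overview

Hugging Face Transformers: provides unified loading and inference for vision-language models.

Prerequisites

Resource preparation:
- Built-in models: prefer loading models from the /mnt/moark-models/ path.
- Image assets: prepare the image to be recognized, for example /mnt/moark-models/ocr_demo.jpg.
Environment consistency:
- Container image matching: create instances strictly according to the image version given in each chapter to avoid driver incompatibility.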
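
Before starting a notebook session, it can help to confirm that the resources listed above are actually present. The following is a minimal sketch using only the standard library. The function name `check_resources` is hypothetical, and the check for `config.json` is an assumption based on the usual layout of a Hugging Face model directory; adjust the paths to your own instance.

```python
from pathlib import Path


def check_resources(model_dir, image_path):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    model_dir, image_path = Path(model_dir), Path(image_path)
    if not model_dir.is_dir():
        problems.append(f"model directory not found: {model_dir}")
    elif not (model_dir / "config.json").is_file():
        # Assumption: a Transformers-compatible model ships a config.json.
        problems.append(f"no config.json in {model_dir}")
    if not image_path.is_file():
        problems.append(f"image not found: {image_path}")
    return problems


# Paths taken from this guide; change them if your files live elsewhere.
problems = check_resources(
    "/mnt/moark-models/PaddleOCR-VL-1.5",
    "/mnt/moark-models/ocr_demo.jpg",
)
if problems:
    for problem in problems:
        print("WARNING:", problem)
else:
    print("model directory and image are in place")
```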

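I. MetaX (沐曦) Deployment Guide

This chapter applies to MetaX accelerator cards such as the 曦云 (Xiyun) C500.

1. General Environment Setup

Accelerator model: 曦云 C500 (64GB)
Version requirement: pytorch>=2.4

2. Model Deployment in Practice

2.1 PaddleOCR-VL-1.5

This example demonstrates how to run OCR on the 曦云 C500.

Run the inference code: create a new Notebook cell and run it.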
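
The version requirement can be checked programmatically. The sketch below is one possible approach using only the standard library; it assumes the installed distribution is named `torch`, which may not hold for every vendor build. The helper names are hypothetical.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn a string such as '2.4.0+local' into (2, 4, 0).

    Anything after '+' is dropped and missing components are padded with 0.
    """
    numbers = re.findall(r"\d+", version.split("+")[0])
    numbers = (numbers + ["0", "0", "0"])[:3]
    return tuple(int(n) for n in numbers)


def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("torch")  # assumed distribution name
if found is None:
    print("torch is not installed in this environment")
elif meets_minimum(found, "2.4"):
    print(f"torch {found} satisfies pytorch>=2.4")
else:
    print(f"torch {found} is older than 2.4; recreate the instance with a matching image")
```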

!pip install "transformers>=5.0.0"
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "/mnt/moark-models/PaddleOCR-VL-1.5"
image_path = "/mnt/moark-models/ocr_demo.png"
task = "ocr" # Options: 'ocr' | 'table' | 'chart' | 'formula' | 'spotting' | 'seal'

image = Image.open(image_path).convert("RGB")
orig_w, orig_h = image.size
spotting_upscale_threshold = 1500

# Preprocessing for the spotting task: upscale small images 2x so the text stays legible, which improves recognition.
if task == "spotting" and orig_w < spotting_upscale_threshold and orig_h < spotting_upscale_threshold:
    process_w, process_h = orig_w * 2, orig_h * 2
    try:
        resample_filter = Image.Resampling.LANCZOS
    except AttributeError:
        resample_filter = Image.LANCZOS
    image = image.resize((process_w, process_h), resample_filter)

max_pixels = 2048 * 28 * 28 if task == "spotting" else 1280 * 28 * 28

DEVICE = "cuda"
PROMPTS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
    "spotting": "Spotting:",
    "seal": "Seal Recognition:",
}

model = AutoModelForImageTextToText.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(DEVICE).eval()
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": PROMPTS[task]},
        ]
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    images_kwargs={"size": {"shortest_edge": processor.image_processor.min_pixels, "longest_edge": max_pixels}},
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
result = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:-1])
print(result)
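
The preprocessing decisions in the code above (whether to upscale, and which pixel budget to use) can be isolated into a pure function and checked without loading the model. This is a sketch that mirrors the notebook logic; the function name `plan_preprocessing` and the constant names are hypothetical.

```python
SPOTTING_UPSCALE_THRESHOLD = 1500  # same threshold as in the notebook code
PATCH_AREA = 28 * 28               # pixel area used in the notebook's budget


def plan_preprocessing(task, width, height):
    """Return (target_size, max_pixels) for an image of the given size.

    Spotting images are doubled in size only when both sides are below the
    threshold; every other task keeps the original size.
    """
    upscale = (
        task == "spotting"
        and width < SPOTTING_UPSCALE_THRESHOLD
        and height < SPOTTING_UPSCALE_THRESHOLD
    )
    target_size = (width * 2, height * 2) if upscale else (width, height)
    max_pixels = (2048 if task == "spotting" else 1280) * PATCH_AREA
    return target_size, max_pixels


for task, size in [("ocr", (800, 600)), ("spotting", (800, 600)), ("spotting", (1600, 600))]:
    print(task, size, "->", plan_preprocessing(task, *size))
```

Note that a spotting image with one side at or above 1500 pixels is not upscaled, even if the other side is small.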

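II. Enflame (燧原) Deployment Guide

This chapter applies to Enflame accelerator cards such as the S60. The torch_gcu adaptation library is required.

1. General Environment Setup

Accelerator model: Enflame S60 (48GB)
Image selection: vLLM / 0.11.0 / Python 3.12 / ef 1.7.0.14

2. Model Deployment in Practice

2.1 PaddleOCR-VL-1.5

This example demonstrates how to run OCR in the Enflame S60 environment.

Run the inference code: create a new Notebook cell and run it.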

# Install dependencies
!pip install "transformers>=5.0.0"

import torch
import torch_gcu
from torch_gcu import transfer_to_gcu
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "/mnt/moark-models/PaddleOCR-VL-1.5"
image_path = "/mnt/moark-models/L1_exam/ocr_demo.png"
task = "ocr" # Options: 'ocr' | 'table' | 'chart' | 'formula' | 'spotting' | 'seal'

image = Image.open(image_path).convert("RGB")
orig_w, orig_h = image.size
spotting_upscale_threshold = 1500

# Preprocessing for the spotting task: upscale small images 2x so the text stays legible, which improves recognition.
if task == "spotting" and orig_w < spotting_upscale_threshold and orig_h < spotting_upscale_threshold:
    process_w, process_h = orig_w * 2, orig_h * 2
    try:
        resample_filter = Image.Resampling.LANCZOS
    except AttributeError:
        resample_filter = Image.LANCZOS
    image = image.resize((process_w, process_h), resample_filter)

max_pixels = 2048 * 28 * 28 if task == "spotting" else 1280 * 28 * 28

DEVICE = "cuda"
PROMPTS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
    "spotting": "Spotting:",
    "seal": "Seal Recognition:",
}

model = AutoModelForImageTextToText.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(DEVICE).eval()
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": PROMPTS[task]},
        ]
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    images_kwargs={"size": {"shortest_edge": processor.image_processor.min_pixels, "longest_edge": max_pixels}},
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
result = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:-1])
print(result)

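IV. FAQ

Incomplete recognition: try increasing the image resolution or using a clearer scan.
Model fails to load: confirm that the model path exists, or download the model in a terminal first.

V. Local Access and Service Verification

Refer to the "SSH Tunnel Configuration Guide" to establish a secure connection.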
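
Another possible cause of incomplete output, beyond image quality, is the generation budget: the example code calls `model.generate` with `max_new_tokens=512`, so a dense page may simply run out of tokens. The sketch below is a heuristic for spotting that case from plain token-id lists. The function name `looks_truncated` is hypothetical, and the ids in the demo are toy values.

```python
def looks_truncated(output_ids, prompt_length, max_new_tokens, eos_token_id):
    """Heuristic: True when generation used its whole budget without ending on EOS.

    output_ids is the full sequence (prompt plus generated tokens) as a list
    of ints; prompt_length is the number of prompt tokens at the front.
    """
    generated = output_ids[prompt_length:]
    used_whole_budget = len(generated) >= max_new_tokens
    ended_on_eos = generated[-1:] == [eos_token_id]
    return used_whole_budget and not ended_on_eos


# Toy example: 3 prompt tokens, a budget of 4 new tokens, EOS id 2.
cut_off = looks_truncated([5, 6, 7, 10, 11, 12, 13], 3, 4, eos_token_id=2)
finished = looks_truncated([5, 6, 7, 10, 11, 2], 3, 4, eos_token_id=2)
print("cut off:", cut_off, "| finished normally:", finished)
```

In the notebook this could be called with `outputs[0].tolist()`, `inputs["input_ids"].shape[-1]`, and `512`; taking the EOS id from `processor.tokenizer.eos_token_id` is an assumption about the processor object. If the check returns True, raising `max_new_tokens` is worth trying before rescanning the image.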
