📄 ocr.md

← 返回目录

Qwen-VL-OCR Text Extraction Guide

Content validity: 2026-03 | Source: Qwen-VL-OCR


Overview

Qwen-VL-OCR is optimized for text extraction and structured data parsing from images: scanned documents, tables, receipts, tickets, ID cards, and handwritten text. Higher accuracy than general VL models for text-heavy images.


Supported Models

| Model | Region | Notes | |-------|--------|-------| | qwen-vl-ocr (stable) | China (cn-beijing) | For pricing, see official pricing page | | qwen-vl-ocr-2025-11-20 | China (cn-beijing) | Pinned version |

Context: 38,192 tokens. Max input: 30,000 tokens per image. Max output: 8,192 tokens.


API Reference

Same endpoint: POST /compatible-mode/v1/chat/completions

Pixel Control Parameters

| Parameter | Type | Description | |-----------|------|-------------| | min_pixels | int | Minimum pixel threshold. Default: 32323 (3,072). Controls minimum resolution. | | max_pixels | int | Maximum pixel threshold. Default: 32328192 (8,388,608). Controls max tokens consumed. |

Pass these inside the image_url object:

{
  "type": "image_url",
  "image_url": {
    "url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
    "min_pixels": 3072,
    "max_pixels": 8388608
  }
}

Token Calculation

For qwen-vl-ocr-2025-11-20, token_pixels = 32*32 = 1024. Formula:

tokens = (h_bar × w_bar) / token_pixels + 2

Where h_bar and w_bar are the image dimensions after resizing to fit within pixel bounds.

Default Behavior

If no prompt is provided, the model uses: "Please output only the text content from the image without any additional descriptions or formatting."


Code Examples

Basic OCR (Python)

from openai import OpenAI
import os

client = OpenAI( api_key=os.getenv("DASHSCOPE_API_KEY"), base_url="https://dashscope.aliyuncs.com/compatible-mode/v1", )

completion = client.chat.completions.create( model="qwen-vl-ocr", messages=[{ "role": "user", "content": [ {"type": "image_url", "image_url": { "url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg", "min_pixels": 3072, "max_pixels": 8388608, }}, {"type": "text", "text": "Extract all text from this image."}, ], }], ) print(completion.choices[0].message.content)

Structured Extraction — Train Ticket (Python)

PROMPT = """Extract the following fields from this train ticket image and return as JSON:
{"invoice_number": "...", "train_number": "...", "departure_station": "...",
 "destination_station": "...", "departure_time": "...", "seat": "...",
 "ticket_price": "...", "passenger_name": "..."}"""

completion = client.chat.completions.create( model="qwen-vl-ocr-2025-11-20", messages=[{ "role": "user", "content": [ {"type": "image_url", "image_url": { "url": "https://img.alicdn.com/imgextra/i1/NotRealJustExample/ticket.jpg", "min_pixels": 3072, "max_pixels": 8388608, }}, {"type": "text", "text": PROMPT}, ], }], ) print(completion.choices[0].message.content)

curl

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-vl-ocr",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://img.alicdn.com/imgextra/i1/NotRealJustExample/doc.jpg", "min_pixels": 3072, "max_pixels": 8388608}},
        {"type": "text", "text": "Extract all text from this image."}
      ]
    }]
  }'


Capabilities

- Multi-language: Chinese, English, Japanese, Korean, and many more

2. Pixel parameters control cost. Higher max_pixels = more tokens = better accuracy but higher cost. For simple text, lower values suffice. 3. SDK version requirements: DashScope Python SDK >= 1.22.2, Java SDK >= 2.21.8. 4. DashScope-only features: Image rotation correction and built-in OCR task types are only available through the DashScope native API, not through the OpenAI-compatible API.