# Qwen3.5-0.8B Vision OCR – GGUF Q4_K_M
A quantized GGUF export of `Mustafaege/Qwen3.5-0.8B-vision-LORA-16bit`, a Qwen3.5-0.8B vision-language model fine-tuned for document OCR and image-to-LaTeX conversion.

Converted from the merged 16-bit weights to Q4_K_M quantization using Unsloth's GGUF pipeline. Includes the multimodal projector (mmproj) required for vision inference.
Full-precision LoRA adapter: `Mustafaege/Qwen3.5-0.8B-vision-LORA-16bit`
## Model Details
| Property | Value |
|---|---|
| Base Model | unsloth/Qwen3.5-0.8B |
| Fine-tuned Adapter | Mustafaege/Qwen3.5-0.8B-vision-LORA-16bit |
| Task | Document OCR / Image-to-LaTeX |
| Format | GGUF |
| Quantization | Q4_K_M |
| Multimodal Projector | Included (`Qwen3.5-0.8B.BF16-mmproj.gguf`) |
| Conversion Tool | Unsloth GGUF pipeline |
| Training Platform | Lightning.ai |
| Training Hardware | NVIDIA A100-SXM4-80GB |
| License | Apache 2.0 |
| Developed by | Mustafaege |
## Available Files
| File | Description | Required |
|---|---|---|
| `Qwen3.5-0.8B.Q4_K_M.gguf` | Main model weights, Q4_K_M quantized | ✅ Yes |
| `Qwen3.5-0.8B.BF16-mmproj.gguf` | Vision multimodal projector, BF16 precision | ✅ Yes |

> ⚠️ Both files must be downloaded for vision/multimodal inference. The `mmproj` file handles image encoding and is separate from the LLM weights.
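After downloading, a quick sanity check is possible because every valid GGUF file begins with the 4-byte magic `GGUF`. A minimal sketch (the helper below is hypothetical, not part of this repo; the file names match the files listed above):

```python
from pathlib import Path


def looks_like_gguf(path: str) -> bool:
    """Return True if the file starts with the 4-byte GGUF magic b'GGUF'."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"


# Verify both downloaded files before running inference
for name in ["Qwen3.5-0.8B.Q4_K_M.gguf", "Qwen3.5-0.8B.BF16-mmproj.gguf"]:
    if Path(name).exists():
        print(name, "OK" if looks_like_gguf(name) else "not a GGUF file")
```

This catches the common failure mode of downloading an HTML error page or a Git LFS pointer file instead of the actual weights.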
## Quantization: Q4_K_M
Q4_K_M is a 4-bit K-quant in its medium ("M") variant, the most widely recommended quantization for balancing quality against memory footprint.
| Property | Value |
|---|---|
| Bits | 4-bit mixed precision |
| Method | K-Quant (block-wise optimization) |
| Estimated Model Size | ~700 MB |
| Estimated VRAM | ~1.5 GB (GPU) / runs on CPU |
| Quality Loss | Minimal for OCR and structured text tasks |
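The size estimate above is roughly consistent with back-of-envelope arithmetic: Q4_K_M averages somewhere near 4.5–5 bits per weight (some tensors stay at higher precision), so an ~0.8B-parameter model lands in the mid hundreds of MB before counting the higher-precision embedding/output tensors and metadata that push the real file toward ~700 MB. A sketch of that estimate (the bits-per-weight figure is an assumption, not a published spec):

```python
def estimated_gguf_size_mb(n_params: float, bits_per_weight: float) -> float:
    """Rough quantized file size in MB: parameters * bits / 8, ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e6


# ~0.8B parameters at an assumed ~4.85 bits/weight average for Q4_K_M
print(round(estimated_gguf_size_mb(0.8e9, 4.85)))  # -> 485
```

The same arithmetic with ~16 bits/weight gives the ~1.6 GB BF16 figure in the comparison table below.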
## Quantization Format Comparison
| Format | Size | VRAM | Quality | Best For |
|---|---|---|---|---|
| BF16 (original merged) | ~1.6 GB | ~2.5 GB | Highest | Requantization / GPU with headroom |
| Q4_K_M (this repo) | ~700 MB | ~1.5 GB | Balanced | Most deployments (recommended) |
| Q5_K_M | ~900 MB | ~1.8 GB | Better | When slightly more accuracy is needed |
| Q8_0 | ~1.3 GB | ~2 GB | Near-lossless | High-accuracy CPU inference |
## GGUF Conversion Process
This GGUF was produced by:
1. Fine-tuning `unsloth/Qwen3.5-0.8B` with a 16-bit LoRA on `Mustafaege/qwen3.5-vision-ocr-v1`
2. Merging the LoRA weights into the base model (bf16)
3. Converting the merged model to BF16 GGUF format
4. Quantizing to Q4_K_M, including the `mmproj` vision projector
5. Uploading both the LLM and mmproj files to this repository
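For reference, steps 3–4 can also be reproduced with llama.cpp's own tooling; this is a hedged sketch under assumed paths, not the exact commands used for this repo:

```shell
# Convert the merged HF weights to a BF16 GGUF
# (convert_hf_to_gguf.py ships with the llama.cpp source tree)
python convert_hf_to_gguf.py ./merged-model \
  --outfile Qwen3.5-0.8B.BF16.gguf --outtype bf16

# Quantize the BF16 GGUF down to Q4_K_M
llama-quantize Qwen3.5-0.8B.BF16.gguf Qwen3.5-0.8B.Q4_K_M.gguf Q4_K_M
```

The `mmproj` projector is exported separately and is kept at BF16, which is why this repo ships two files.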
## Usage

### llama.cpp – Multimodal CLI (Recommended)
```bash
# Step 1: Download both files
#   Qwen3.5-0.8B.Q4_K_M.gguf
#   Qwen3.5-0.8B.BF16-mmproj.gguf

# Step 2: Run multimodal inference
llama-mtmd-cli \
  -m Qwen3.5-0.8B.Q4_K_M.gguf \
  --mmproj Qwen3.5-0.8B.BF16-mmproj.gguf \
  --image formula.png \
  -p "Write the LaTeX representation for this image." \
  -n 512
```
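Step 1 can be done with `huggingface-cli` (from the `huggingface_hub` package); the repo id matches this page, the target directory is an assumption:

```shell
pip install -U "huggingface_hub[cli]"

# Fetch both required files into the current directory
huggingface-cli download Mustafaege/Qwen3.5-0.8B-GGUF-q4_k_m \
  Qwen3.5-0.8B.Q4_K_M.gguf Qwen3.5-0.8B.BF16-mmproj.gguf \
  --local-dir .
```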
### llama.cpp – Direct Hub Download
```bash
llama-mtmd-cli \
  -hf Mustafaege/Qwen3.5-0.8B-GGUF-q4_k_m \
  --image formula.png \
  -p "Write the LaTeX representation for this image." \
  --jinja
```
### Ollama
```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM Mustafaege/Qwen3.5-0.8B-GGUF-q4_k_m
PARAMETER num_predict 512
PARAMETER temperature 0.7
EOF

ollama create qwen3.5-ocr -f Modelfile
ollama run qwen3.5-ocr
```
### Python – llama-cpp-python
```bash
pip install llama-cpp-python
```

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

# The mmproj file is loaded through the chat handler's CLIP model path
chat_handler = MiniCPMv26ChatHandler(
    clip_model_path="Qwen3.5-0.8B.BF16-mmproj.gguf"
)

llm = Llama(
    model_path="Qwen3.5-0.8B.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/formula.png"}},
                {"type": "text", "text": "Write the LaTeX representation for this image."},
            ],
        }
    ],
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
```
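llama-cpp-python chat handlers also accept base64 data URIs in the `image_url` field, which sidesteps `file://` path resolution issues. A small hypothetical helper for that pattern:

```python
import base64


def image_to_data_uri(path: str, mime: str = "image/png") -> str:
    """Encode a local image as a base64 data URI for chat-completion image_url fields."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"


# Usage in the messages payload above:
# {"type": "image_url", "image_url": {"url": image_to_data_uri("formula.png")}}
```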
## Training Details
| Parameter | Value |
|---|---|
| LoRA Rank | 16 |
| Learning Rate | 2e-4 |
| Batch Size | 4 |
| Gradient Accumulation | 4 (effective batch size 16) |
| Precision | bf16 |
| Platform | Lightning.ai Β· NVIDIA A100-SXM4-80GB |
## Related Resources
| Resource | Link |
|---|---|
| Full-precision LoRA Adapter | Mustafaege/Qwen3.5-0.8B-vision-LORA-16bit |
| Training Dataset | Mustafaege/qwen3.5-vision-ocr-v1 |
| Base Model | unsloth/Qwen3.5-0.8B |
| Unsloth (GGUF conversion) | github.com/unslothai/unsloth |
## Limitations
- Optimized for document and formula images; not suited for natural scene understanding.
- Q4_K_M quantization may reduce accuracy on highly complex mathematical notation compared to the full-precision adapter.
- Output quality depends on input image resolution and clarity.
## Citation
```bibtex
@misc{mustafaege2026qwen35visionocr,
  title  = {Qwen3.5-0.8B Vision OCR: GGUF Q4_K_M for Local Inference},
  author = {Mustafaege},
  year   = {2026},
  url    = {https://huggingface.co/Mustafaege/Qwen3.5-0.8B-GGUF-q4_k_m}
}

@misc{qwen3_5,
  title     = {Qwen3.5 Technical Report},
  author    = {{Qwen Team}},
  year      = {2025},
  publisher = {Alibaba Cloud}
}
```
Converted with Unsloth on Lightning.ai.