# Qwen3.5-0.8B Vision OCR – GGUF Q4_K_M
A quantized GGUF export of `Mustafaege/Qwen3.5-0.8B-vision-LORA-16bit`, a Qwen3.5-0.8B vision-language model fine-tuned for document OCR and image-to-LaTeX conversion.

Converted from the merged 16-bit weights to Q4_K_M quantization using Unsloth's GGUF pipeline. Includes the multimodal projector (mmproj) required for vision inference.
Full-precision LoRA adapter: `Mustafaege/Qwen3.5-0.8B-vision-LORA-16bit`
## Model Details
| Property | Value |
|---|---|
| Base Model | unsloth/Qwen3.5-0.8B |
| Fine-tuned Adapter | Mustafaege/Qwen3.5-0.8B-vision-LORA-16bit |
| Task | Document OCR / Image-to-LaTeX |
| Format | GGUF |
| Quantization | Q4_K_M |
| Multimodal Projector | Included (`Qwen3.5-0.8B.BF16-mmproj.gguf`) |
| Conversion Tool | Unsloth GGUF pipeline |
| Training Platform | Lightning.ai |
| Training Hardware | NVIDIA A100-SXM4-80GB |
| License | Apache 2.0 |
| Developed by | Mustafaege |
## Available Files
| File | Description | Required |
|---|---|---|
| `Qwen3.5-0.8B.Q4_K_M.gguf` | Main model weights, Q4_K_M quantized | ✅ Yes |
| `Qwen3.5-0.8B.BF16-mmproj.gguf` | Vision multimodal projector, BF16 precision | ✅ Yes |

> ⚠️ Both files must be downloaded for vision/multimodal inference. The `mmproj` file handles image encoding and is separate from the LLM weights.
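After downloading, a quick sanity check is possible because every valid GGUF file begins with the 4-byte magic `GGUF`. A minimal sketch (the helper below is hypothetical, not part of this repo; the file names match the files listed above):

```python
from pathlib import Path


def looks_like_gguf(path: str) -> bool:
    """Return True if the file starts with the 4-byte GGUF magic b'GGUF'."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"


# Verify both downloaded files before running inference
for name in ["Qwen3.5-0.8B.Q4_K_M.gguf", "Qwen3.5-0.8B.BF16-mmproj.gguf"]:
    if Path(name).exists():
        print(name, "OK" if looks_like_gguf(name) else "not a GGUF file")
```

This catches the common failure mode of downloading an HTML error page or a Git LFS pointer file instead of the actual weights.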
## Quantization: Q4_K_M
Q4_K_M is a 4-bit K-quant in its medium ("M") variant, the most widely recommended quantization for balancing quality against memory footprint.
| Property | Value |
|---|---|
| Bits | 4-bit mixed precision |
| Method | K-Quant (block-wise optimization) |
| Estimated Model Size | ~700 MB |
| Estimated VRAM | ~1.5 GB (GPU) / runs on CPU |
| Quality Loss | Minimal for OCR and structured text tasks |
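The size estimate above is roughly consistent with back-of-envelope arithmetic: Q4_K_M averages somewhere near 4.5–5 bits per weight (some tensors stay at higher precision), so an ~0.8B-parameter model lands in the mid hundreds of MB before counting the higher-precision embedding/output tensors and metadata that push the real file toward ~700 MB. A sketch of that estimate (the bits-per-weight figure is an assumption, not a published spec):

```python
def estimated_gguf_size_mb(n_params: float, bits_per_weight: float) -> float:
    """Rough quantized file size in MB: parameters * bits / 8, ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e6


# ~0.8B parameters at an assumed ~4.85 bits/weight average for Q4_K_M
print(round(estimated_gguf_size_mb(0.8e9, 4.85)))  # -> 485
```

The same arithmetic with ~16 bits/weight gives the ~1.6 GB BF16 figure in the comparison table below.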
## Quantization Format Comparison
| Format | Size | VRAM | Quality | Best For |
|---|---|---|---|---|
| BF16 (original merged) | ~1.6 GB | ~2.5 GB | Highest | Requantization / GPU with headroom |
| Q4_K_M (this repo) | ~700 MB | ~1.5 GB | Balanced | Most deployments (recommended) |
| Q5_K_M | ~900 MB | ~1.8 GB | Better | When slightly more accuracy is needed |
| Q8_0 | ~1.3 GB | ~2 GB | Near-lossless | High-accuracy CPU inference |
## GGUF Conversion Process
This GGUF was produced by:
1. Fine-tuning `unsloth/Qwen3.5-0.8B` with a 16-bit LoRA on `Mustafaege/qwen3.5-vision-ocr-v1`
2. Merging the LoRA weights into the base model (bf16)
3. Converting the merged model to BF16 GGUF format
4. Quantizing to Q4_K_M, including the `mmproj` vision projector
5. Uploading both the LLM and mmproj files to this repository
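For reference, steps 3–4 can also be reproduced with llama.cpp's own tooling; this is a hedged sketch under assumed paths, not the exact commands used for this repo:

```shell
# Convert the merged HF weights to a BF16 GGUF
# (convert_hf_to_gguf.py ships with the llama.cpp source tree)
python convert_hf_to_gguf.py ./merged-model \
  --outfile Qwen3.5-0.8B.BF16.gguf --outtype bf16

# Quantize the BF16 GGUF down to Q4_K_M
llama-quantize Qwen3.5-0.8B.BF16.gguf Qwen3.5-0.8B.Q4_K_M.gguf Q4_K_M
```

The `mmproj` projector is exported separately and is kept at BF16, which is why this repo ships two files.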
## Usage

### llama.cpp – Multimodal CLI (Recommended)
```bash
# Step 1: Download both files
#   Qwen3.5-0.8B.Q4_K_M.gguf
#   Qwen3.5-0.8B.BF16-mmproj.gguf

# Step 2: Run multimodal inference
llama-mtmd-cli \
  -m Qwen3.5-0.8B.Q4_K_M.gguf \
  --mmproj Qwen3.5-0.8B.BF16-mmproj.gguf \
  --image formula.png \
  -p "Write the LaTeX representation for this image." \
  -n 512
```
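Step 1 can be done with `huggingface-cli` (from the `huggingface_hub` package); the repo id matches this page, the target directory is an assumption:

```shell
pip install -U "huggingface_hub[cli]"

# Fetch both required files into the current directory
huggingface-cli download Mustafaege/Qwen3.5-0.8B-GGUF-q4_k_m \
  Qwen3.5-0.8B.Q4_K_M.gguf Qwen3.5-0.8B.BF16-mmproj.gguf \
  --local-dir .
```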
### llama.cpp – Direct Hub Download
```bash
llama-mtmd-cli \
  -hf Mustafaege/Qwen3.5-0.8B-GGUF-q4_k_m \
  --image formula.png \
  -p "Write the LaTeX representation for this image." \
  --jinja
```
### Ollama
```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM Mustafaege/Qwen3.5-0.8B-GGUF-q4_k_m
PARAMETER num_predict 512
PARAMETER temperature 0.7
EOF

ollama create qwen3.5-ocr -f Modelfile
ollama run qwen3.5-ocr
```
### Python – llama-cpp-python
```bash
pip install llama-cpp-python
```

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

# The mmproj file is loaded through the chat handler's CLIP model path
chat_handler = MiniCPMv26ChatHandler(
    clip_model_path="Qwen3.5-0.8B.BF16-mmproj.gguf"
)

llm = Llama(
    model_path="Qwen3.5-0.8B.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/formula.png"}},
                {"type": "text", "text": "Write the LaTeX representation for this image."},
            ],
        }
    ],
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
```
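llama-cpp-python chat handlers also accept base64 data URIs in the `image_url` field, which sidesteps `file://` path resolution issues. A small hypothetical helper for that pattern:

```python
import base64


def image_to_data_uri(path: str, mime: str = "image/png") -> str:
    """Encode a local image as a base64 data URI for chat-completion image_url fields."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"


# Usage in the messages payload above:
# {"type": "image_url", "image_url": {"url": image_to_data_uri("formula.png")}}
```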
## Training Details
| Parameter | Value |
|---|---|
| LoRA Rank | 16 |
| Learning Rate | 2e-4 |
| Batch Size | 4 |
| Gradient Accumulation | 4 (effective batch size 16) |
| Precision | bf16 |
| Platform | Lightning.ai Β· NVIDIA A100-SXM4-80GB |
## Related Resources
| Resource | Link |
|---|---|
| Full-precision LoRA Adapter | Mustafaege/Qwen3.5-0.8B-vision-LORA-16bit |
| Training Dataset | Mustafaege/qwen3.5-vision-ocr-v1 |
| Base Model | unsloth/Qwen3.5-0.8B |
| Unsloth (GGUF conversion) | github.com/unslothai/unsloth |
## Limitations
- Optimized for document and formula images; not suited for natural scene understanding.
- Q4_K_M quantization may reduce accuracy on highly complex mathematical notation compared to the full-precision adapter.
- Output quality depends on input image resolution and clarity.
## Citation
```bibtex
@misc{mustafaege2026qwen35visionocr,
  title  = {Qwen3.5-0.8B Vision OCR: GGUF Q4_K_M for Local Inference},
  author = {Mustafaege},
  year   = {2026},
  url    = {https://huggingface.co/Mustafaege/Qwen3.5-0.8B-GGUF-q4_k_m}
}

@misc{qwen3_5,
  title     = {Qwen3.5 Technical Report},
  author    = {{Qwen Team}},
  year      = {2025},
  publisher = {Alibaba Cloud}
}
```
Converted with Unsloth on Lightning.ai.