Qwen3.5-0.8B Vision OCR: GGUF Q4_K_M

A quantized GGUF export of Mustafaege/Qwen3.5-0.8B-vision-LORA-16bit, a Qwen3.5-0.8B vision-language model fine-tuned for document OCR and image-to-LaTeX conversion.

Converted from the merged 16-bit weights to Q4_K_M quantization using Unsloth's GGUF pipeline. Includes the multimodal projector (mmproj) required for vision inference.

Full-precision LoRA adapter: Mustafaege/Qwen3.5-0.8B-vision-LORA-16bit


Model Details

| Property | Value |
|---|---|
| Base Model | unsloth/Qwen3.5-0.8B |
| Fine-tuned Adapter | Mustafaege/Qwen3.5-0.8B-vision-LORA-16bit |
| Task | Document OCR / Image-to-LaTeX |
| Format | GGUF |
| Quantization | Q4_K_M |
| Multimodal Projector | Included: Qwen3.5-0.8B.BF16-mmproj.gguf |
| Conversion Tool | Unsloth GGUF pipeline |
| Training Platform | Lightning.ai |
| Training Hardware | NVIDIA A100-SXM4-80GB |
| License | Apache 2.0 |
| Developed by | Mustafaege |

Available Files

| File | Description | Required |
|---|---|---|
| Qwen3.5-0.8B.Q4_K_M.gguf | Main model weights, Q4_K_M quantized | ✅ Yes |
| Qwen3.5-0.8B.BF16-mmproj.gguf | Vision multimodal projector, BF16 precision | ✅ Yes |

⚠️ Both files must be downloaded for vision/multimodal inference. The mmproj file handles image encoding and is separate from the LLM weights.
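Both files can also be fetched programmatically. A minimal sketch using `huggingface_hub` (assuming the repo id Mustafaege/Qwen3.5-0.8B-GGUF-q4_k_m and that `huggingface_hub` is installed; the helper name is illustrative):

```python
REPO_ID = "Mustafaege/Qwen3.5-0.8B-GGUF-q4_k_m"

# Both files are needed for vision inference: the quantized LLM weights
# and the separate multimodal projector.
REQUIRED_FILES = [
    "Qwen3.5-0.8B.Q4_K_M.gguf",       # main model weights (Q4_K_M)
    "Qwen3.5-0.8B.BF16-mmproj.gguf",  # vision projector (BF16)
]

def download_all(repo_id: str = REPO_ID) -> list[str]:
    """Download every required file and return the local cache paths."""
    # Imported lazily so the file list above is usable offline.
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub
    return [hf_hub_download(repo_id=repo_id, filename=f) for f in REQUIRED_FILES]

# Usage:
# for path in download_all():
#     print(path)
```

Calling `download_all()` returns the local paths of both files, ready to pass to llama.cpp.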


Quantization: Q4_K_M

Q4_K_M is a 4-bit K-quant in the "medium" size variant, the most widely recommended quantization for balancing quality against memory footprint.

| Property | Value |
|---|---|
| Bits | 4-bit mixed precision |
| Method | K-quant (block-wise optimization) |
| Estimated Model Size | ~700 MB |
| Estimated VRAM | ~1.5 GB (GPU); also runs on CPU |
| Quality Loss | Minimal for OCR and structured text tasks |

Quantization Format Comparison

| Format | Size | VRAM | Quality | Best For |
|---|---|---|---|---|
| BF16 (original merged) | ~1.6 GB | ~2.5 GB | Highest | Requantization / GPU with headroom |
| Q4_K_M (this repo) | ~700 MB | ~1.5 GB | Balanced | Most deployments (recommended) |
| Q5_K_M | ~900 MB | ~1.8 GB | Better | When slightly more accuracy is needed |
| Q8_0 | ~1.3 GB | ~2 GB | Near-lossless | High-accuracy CPU inference |

GGUF Conversion Process

This GGUF was produced by:

  1. Fine-tuning unsloth/Qwen3.5-0.8B with 16-bit LoRA on Mustafaege/qwen3.5-vision-ocr-v1
  2. Merging LoRA weights into the base model (bf16)
  3. Converting the merged model to BF16 GGUF format
  4. Quantizing to Q4_K_M, including the mmproj vision projector
  5. Uploading both the LLM and mmproj files to this repository
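For reference, steps 3 and 4 map onto the standard llama.cpp tooling. A minimal sketch of those two commands (the merged-model directory and the llama.cpp paths are illustrative; it assumes a local llama.cpp checkout with `convert_hf_to_gguf.py` and a built `llama-quantize` binary):

```python
import subprocess

MERGED_DIR = "merged-qwen3.5-0.8b-vision"  # illustrative: merged bf16 HF model dir
BF16_GGUF = "Qwen3.5-0.8B.BF16.gguf"
Q4_GGUF = "Qwen3.5-0.8B.Q4_K_M.gguf"

# Step 3: HF safetensors -> BF16 GGUF
convert_cmd = [
    "python", "llama.cpp/convert_hf_to_gguf.py",
    MERGED_DIR, "--outtype", "bf16", "--outfile", BF16_GGUF,
]

# Step 4: BF16 GGUF -> Q4_K_M GGUF
quantize_cmd = ["llama.cpp/build/bin/llama-quantize", BF16_GGUF, Q4_GGUF, "Q4_K_M"]

def run_all() -> None:
    """Run both conversion steps, stopping on the first failure."""
    subprocess.run(convert_cmd, check=True)
    subprocess.run(quantize_cmd, check=True)
```

Unsloth's pipeline wraps equivalent steps; this sketch is only meant to show where the BF16 and Q4_K_M artifacts come from.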

Usage

llama.cpp: Multimodal CLI (Recommended)

# Step 1: Download both files
# Qwen3.5-0.8B.Q4_K_M.gguf
# Qwen3.5-0.8B.BF16-mmproj.gguf

# Step 2: Run multimodal inference
llama-mtmd-cli \
  -m Qwen3.5-0.8B.Q4_K_M.gguf \
  --mmproj Qwen3.5-0.8B.BF16-mmproj.gguf \
  --image formula.png \
  -p "Write the LaTeX representation for this image." \
  -n 512

llama.cpp β€” Direct Hub Download

llama-mtmd-cli \
  -hf Mustafaege/Qwen3.5-0.8B-GGUF-q4_k_m \
  --image formula.png \
  -p "Write the LaTeX representation for this image." \
  --jinja

Ollama

# Create Modelfile
cat > Modelfile << 'EOF'
FROM hf.co/Mustafaege/Qwen3.5-0.8B-GGUF-q4_k_m
PARAMETER num_predict 512
PARAMETER temperature 0.7
EOF

ollama create qwen3.5-ocr -f Modelfile
ollama run qwen3.5-ocr
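Once created, the model can also be called from Python. A minimal sketch using the official `ollama` client (assuming `pip install ollama`, a running Ollama server, and the `qwen3.5-ocr` model name created above):

```python
MODEL = "qwen3.5-ocr"
PROMPT = "Write the LaTeX representation for this image."

def build_message(image_path: str) -> dict:
    """Build a single multimodal user message for ollama.chat()."""
    return {"role": "user", "content": PROMPT, "images": [image_path]}

def ocr_image(image_path: str) -> str:
    """Send one image to the local Ollama server and return the model's text."""
    import ollama  # imported lazily; requires a running Ollama server
    response = ollama.chat(model=MODEL, messages=[build_message(image_path)])
    return response["message"]["content"]

# Usage:
# print(ocr_image("formula.png"))
```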

Python β€” llama-cpp-python

# pip install llama-cpp-python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

# The chat handler loads the vision projector (mmproj) and formats image inputs.
chat_handler = MiniCPMv26ChatHandler(
    clip_model_path="Qwen3.5-0.8B.BF16-mmproj.gguf"
)

llm = Llama(
    model_path="Qwen3.5-0.8B.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/formula.png"}},
                {"type": "text", "text": "Write the LaTeX representation for this image."},
            ],
        }
    ],
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
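Building on the snippet above, OCR can be run over a folder of images. A small sketch (the extension filter and the `batch_ocr` helper are illustrative; `llm` is the `Llama` instance from the snippet above):

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg"}

def is_image(path: str) -> bool:
    """True if the filename has a supported image extension."""
    return Path(path).suffix.lower() in IMAGE_EXTS

def batch_ocr(llm, folder: str, prompt: str) -> dict[str, str]:
    """Run the chat completion above on every image in `folder`."""
    results = {}
    for img in sorted(Path(folder).iterdir()):
        if not is_image(img.name):
            continue
        response = llm.create_chat_completion(
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": img.resolve().as_uri()}},
                    {"type": "text", "text": prompt},
                ],
            }],
            max_tokens=512,
        )
        results[img.name] = response["choices"][0]["message"]["content"]
    return results
```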

Training Details

| Parameter | Value |
|---|---|
| LoRA Rank | 16 |
| Learning Rate | 2e-4 |
| Batch Size | 4 |
| Gradient Accumulation | 4 (effective batch size 16) |
| Precision | bf16 |
| Platform | Lightning.ai · NVIDIA A100-SXM4-80GB |

Related Resources

| Resource | Link |
|---|---|
| Full-precision LoRA Adapter | Mustafaege/Qwen3.5-0.8B-vision-LORA-16bit |
| Training Dataset | Mustafaege/qwen3.5-vision-ocr-v1 |
| Base Model | unsloth/Qwen3.5-0.8B |
| Unsloth (GGUF conversion) | github.com/unslothai/unsloth |

Limitations

  • Optimized for document and formula images; not suited for natural scene understanding.
  • Q4_K_M quantization may reduce accuracy on highly complex mathematical notation compared to the full-precision adapter.
  • Output quality depends on input image resolution and clarity.

Citation

@misc{mustafaege2026qwen35visionocr,
  title   = {Qwen3.5-0.8B Vision OCR: GGUF Q4_K_M for Local Inference},
  author  = {Mustafaege},
  year    = {2026},
  url     = {https://huggingface.co/Mustafaege/Qwen3.5-0.8B-GGUF-q4_k_m}
}

@misc{qwen3_5,
  title     = {Qwen3.5 Technical Report},
  author    = {Qwen Team},
  year      = {2025},
  publisher = {Alibaba Cloud}
}

Converted with Unsloth on Lightning.ai.
