Anubis OSS - Local LLM Benchmarking for Apple Silicon with Real-Time Hardware Telemetry (Looking for Testers + Open Data) - now with live leaderboards!
Hey everyone!
I built Anubis - a native macOS app for benchmarking and comparing local LLMs on Apple Silicon. It's open source (GPL-3.0) and free. I'm sharing it here because the data it produces might be useful to this community, and I'm looking for testers to help grow the dataset. Results can be submitted straight from the app, exported in one click as a .png or dataset, and the full history of past runs and charts is always stored locally for recall.
The problem it solves
If you run local models on a Mac, you’ve probably noticed the tooling gap. Chat wrappers like Ollama and LM Studio are great for conversation but don’t give you systematic performance data. CLI monitors like asitop show hardware stats but have no LLM context. And evaluation frameworks like promptfoo require YAML configs and terminal expertise.
No existing tool correlates real-time Apple Silicon hardware telemetry with inference performance. That’s what Anubis does.
What makes it different
Anubis is a native SwiftUI app (no Electron, no web wrapper) that captures metrics you can’t get anywhere else during local inference:
- Real-time power telemetry - GPU, CPU, ANE (Neural Engine), and DRAM power draw in watts via IOReport. You can see the actual watts-per-token efficiency of different models and quantizations.
- GPU frequency tracking - a weighted average from P-state residency, so you can see when your GPU is actually throttling.
- Process-level memory - tracks `phys_footprint`, including Metal/GPU buffer allocations, not just the generic RSS that most tools report.
- Thermal state monitoring - see when your Mac hits thermal pressure during long inference runs.
- 8 metric cards + 11 live charts, and more - updating in real time during generation, all correlated with the token stream.
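To make the GPU frequency item above concrete: a residency-weighted average pairs each P-state's frequency with the time the GPU spent in it. Here is a minimal sketch of that calculation (illustrative only - the P-state values and sample data are hypothetical, not Anubis's actual internals):

```python
# Sketch: weighted-average GPU frequency from P-state residency.
# Each P-state pairs a frequency (MHz) with residency (time spent at that
# frequency). When the weighted average drops well below the top P-state,
# the GPU is throttling.

def weighted_gpu_frequency(pstates):
    """pstates: list of (frequency_mhz, residency_ticks) tuples."""
    total = sum(ticks for _, ticks in pstates)
    if total == 0:
        return 0.0  # no residency recorded in this sample window (GPU idle)
    return sum(freq * ticks for freq, ticks in pstates) / total

# Hypothetical sample: GPU mostly at 1398 MHz, partly throttled to 972 MHz
sample = [(0, 10), (389, 0), (972, 30), (1398, 60)]
print(round(weighted_gpu_frequency(sample), 1))  # 1130.4
```

The idle-state residency (frequency 0) pulls the average down, which is exactly what makes this metric useful for spotting throttling versus low utilization.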
Beyond the benchmark dashboard, it includes an Arena mode for blind A/B model comparison with voting, and a Vault for managing models across all your backends in one place.
It works with any OpenAI-compatible endpoint - Ollama, LM Studio, MLX, vLLM, LocalAI, OpenWebUI, Docker Models, etc.
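"OpenAI-compatible" here means any server exposing the standard `/v1/chat/completions` route. A minimal sketch of the request shape such tools rely on (the base URLs below are the common default ports for each backend - assumptions on my part, not Anubis configuration):

```python
# Sketch: building an OpenAI-compatible chat completion request.
# Base URLs are the usual defaults for each backend (assumed, verify locally).
import json

BACKEND_DEFAULTS = {
    "ollama":    "http://localhost:11434/v1",
    "lm_studio": "http://localhost:1234/v1",
    "vllm":      "http://localhost:8000/v1",
}

def chat_request(base_url, model, prompt):
    """Return the URL and JSON body for a streaming chat completion."""
    url = f"{base_url}/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # streaming exposes per-token timing (TTFT, eval rate)
    }
    return url, json.dumps(body)

url, body = chat_request(BACKEND_DEFAULTS["ollama"], "llama3.2", "Hello!")
print(url)  # http://localhost:11434/v1/chat/completions
```

Because every backend speaks this same protocol, one client implementation covers all of them; only the base URL changes.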
Open benchmark data - free for the community
This is the part I think the HF community might find especially useful. Anubis users can submit their benchmark results to a community leaderboard that’s publicly accessible. The dataset includes real-world performance across Apple Silicon chips (M1 through M4/M5) running various models and quantizations, with datapoints like:
- Tokens/sec (eval rate)
- Time to first token (TTFT)
- GPU/CPU utilization during inference
- Power consumption (GPU, CPU, ANE, DRAM)
- Process memory footprint
- Model load time
- Thermal state
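For readers new to these metrics, the first two in the list derive directly from token-arrival timestamps. A minimal sketch under one common convention (the timestamps and the exact definition of eval rate here are illustrative, not necessarily how Anubis samples them):

```python
# Sketch: deriving TTFT and eval rate from a token stream.
# TTFT = gap from request start to the first token; eval rate = tokens
# generated per second over the generation span. Hypothetical timestamps.

def ttft_and_eval_rate(request_start, token_times):
    """token_times: monotonic timestamps (seconds) of each generated token."""
    ttft = token_times[0] - request_start
    span = token_times[-1] - token_times[0]
    eval_rate = (len(token_times) - 1) / span if span > 0 else 0.0
    return ttft, eval_rate

# Hypothetical run: first token at t=0.25 s, then 4 more tokens over 0.4 s
ttft, rate = ttft_and_eval_rate(0.0, [0.25, 0.35, 0.45, 0.55, 0.65])
print(f"TTFT={ttft:.2f}s, eval rate={rate:.1f} tok/s")
```

TTFT mostly reflects prompt processing (and model load, if cold), while eval rate reflects sustained generation throughput - which is why the dataset reports them separately.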
This is real hardware performance data from real Macs - not synthetic benchmarks or self-reported numbers. If you’re working on model optimization, quantization research, Apple Silicon performance analysis, or MLX development, this data could be a valuable resource.
There’s also a Data Explorer for filtering and analyzing the results by chip, model, quantization, and backend.
Looking for testers
I’m actively looking for people to try Anubis and contribute benchmark results. The more diverse the hardware coverage (different M-series chips, different RAM configurations), the more useful the dataset becomes for everyone.
Getting started takes about 2 minutes:
1. Download the latest release (notarized .app) or build from source
2. Point it at your existing Ollama / LM Studio / MLX setup
3. Run a benchmark and watch the metrics in real time
4. Submit your results to the leaderboard
You need macOS 15+ and Apple Silicon (M1 or later). If you already have Ollama or any local LLM backend running, Anubis auto-detects it on launch.
75 stars → Homebrew Cask + Hugging Face distribution
Here’s where I need the community’s help: if the repo hits 75 GitHub stars, we’ll package Anubis as a Homebrew Cask for one-line installs (brew install --cask anubis), and we’ll explore distributing the aggregated benchmark dataset as a proper Hugging Face dataset so it’s easily accessible for research and other AI projects.
We're currently at 34 stars - if you find this useful or interesting, a star on the GitHub repo would go a long way.
Links
- GitHub: github.com/uncSoft/anubis-oss
- Project page: devpadapp.com/anubis-oss.html
- Leaderboard: devpadapp.com/leaderboard.html
- Download: Latest release
- Support: Ko-fi · GitHub Sponsors
A sandboxed version (less capable due to Apple's sandboxing rules) is also available on the Mac App Store, either as part of The Architect's Toolkit bundle or on its own, if you prefer a managed install - purchasing it supports continued open source development.
Happy to answer any questions about the architecture, the data, or Apple Silicon performance in general. Feedback, bug reports, and PRs are all welcome.
— JT