Anubis OSS - Local LLM Benchmarking for Apple Silicon with Real-Time Hardware Telemetry (Looking for Testers + Open Data) - now with live leaderboards!
Hey everyone!
I built Anubis - a native macOS app for benchmarking and comparing local LLMs on Apple Silicon. It's open source (GPL-3.0) and free. I'm sharing it here because the data it produces might be useful to this community, and I'm looking for testers to help grow the dataset. Results can be submitted straight from the app, exported in one click as a .png or dataset, and the full history of past runs and charts is always stored locally for recall.
The problem it solves
If you run local models on a Mac, you’ve probably noticed the tooling gap. Chat wrappers like Ollama and LM Studio are great for conversation but don’t give you systematic performance data. CLI monitors like asitop show hardware stats but have no LLM context. And evaluation frameworks like promptfoo require YAML configs and terminal expertise.
No existing tool correlates real-time Apple Silicon hardware telemetry with inference performance. That’s what Anubis does.
What makes it different
Anubis is a native SwiftUI app (no Electron, no web wrapper) that captures metrics you can’t get anywhere else during local inference:
- Real-time power telemetry - GPU, CPU, ANE (Neural Engine), and DRAM power draw in watts via IOReport. You can see the actual watts-per-token efficiency of different models and quantizations.
- GPU frequency tracking - a weighted average from P-state residency, so you can see when your GPU is actually throttling.
- Process-level memory - tracks `phys_footprint`, including Metal/GPU buffer allocations, not just the generic RSS that most tools report.
- Thermal state monitoring - see when your Mac hits thermal pressure during long inference runs.
- 8 metric cards + 11 live charts, and more - updating in real time during generation, all correlated with the token stream.
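To make the GPU frequency item above concrete: a residency-weighted average pairs each P-state's frequency with the time the GPU spent in it. Here is a minimal sketch of that calculation (illustrative only - the P-state values and sample data are hypothetical, not Anubis's actual internals):

```python
# Sketch: weighted-average GPU frequency from P-state residency.
# Each P-state pairs a frequency (MHz) with residency (time spent at that
# frequency). When the weighted average drops well below the top P-state,
# the GPU is throttling.

def weighted_gpu_frequency(pstates):
    """pstates: list of (frequency_mhz, residency_ticks) tuples."""
    total = sum(ticks for _, ticks in pstates)
    if total == 0:
        return 0.0  # no residency recorded in this sample window (GPU idle)
    return sum(freq * ticks for freq, ticks in pstates) / total

# Hypothetical sample: GPU mostly at 1398 MHz, partly throttled to 972 MHz
sample = [(0, 10), (389, 0), (972, 30), (1398, 60)]
print(round(weighted_gpu_frequency(sample), 1))  # 1130.4
```

The idle-state residency (frequency 0) pulls the average down, which is exactly what makes this metric useful for spotting throttling versus low utilization.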
Beyond the benchmark dashboard, it includes an Arena mode for blind A/B model comparison with voting, and a Vault for managing models across all your backends in one place.
It works with any OpenAI-compatible endpoint - Ollama, LM Studio, MLX, vLLM, LocalAI, OpenWebUI, Docker Models, etc.
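"OpenAI-compatible" here means any server exposing the standard `/v1/chat/completions` route. A minimal sketch of the request shape such tools rely on (the base URLs below are the common default ports for each backend - assumptions on my part, not Anubis configuration):

```python
# Sketch: building an OpenAI-compatible chat completion request.
# Base URLs are the usual defaults for each backend (assumed, verify locally).
import json

BACKEND_DEFAULTS = {
    "ollama":    "http://localhost:11434/v1",
    "lm_studio": "http://localhost:1234/v1",
    "vllm":      "http://localhost:8000/v1",
}

def chat_request(base_url, model, prompt):
    """Return the URL and JSON body for a streaming chat completion."""
    url = f"{base_url}/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # streaming exposes per-token timing (TTFT, eval rate)
    }
    return url, json.dumps(body)

url, body = chat_request(BACKEND_DEFAULTS["ollama"], "llama3.2", "Hello!")
print(url)  # http://localhost:11434/v1/chat/completions
```

Because every backend speaks this same protocol, one client implementation covers all of them; only the base URL changes.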
Open benchmark data - free for the community
This is the part I think the HF community might find especially useful. Anubis users can submit their benchmark results to a community leaderboard that’s publicly accessible. The dataset includes real-world performance across Apple Silicon chips (M1 through M4/M5) running various models and quantizations, with datapoints like:
- Tokens/sec (eval rate)
- Time to first token (TTFT)
- GPU/CPU utilization during inference
- Power consumption (GPU, CPU, ANE, DRAM)
- Process memory footprint
- Model load time
- Thermal state
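For readers new to these metrics, the first two in the list derive directly from token-arrival timestamps. A minimal sketch under one common convention (the timestamps and the exact definition of eval rate here are illustrative, not necessarily how Anubis samples them):

```python
# Sketch: deriving TTFT and eval rate from a token stream.
# TTFT = gap from request start to the first token; eval rate = tokens
# generated per second over the generation span. Hypothetical timestamps.

def ttft_and_eval_rate(request_start, token_times):
    """token_times: monotonic timestamps (seconds) of each generated token."""
    ttft = token_times[0] - request_start
    span = token_times[-1] - token_times[0]
    eval_rate = (len(token_times) - 1) / span if span > 0 else 0.0
    return ttft, eval_rate

# Hypothetical run: first token at t=0.25 s, then 4 more tokens over 0.4 s
ttft, rate = ttft_and_eval_rate(0.0, [0.25, 0.35, 0.45, 0.55, 0.65])
print(f"TTFT={ttft:.2f}s, eval rate={rate:.1f} tok/s")
```

TTFT mostly reflects prompt processing (and model load, if cold), while eval rate reflects sustained generation throughput - which is why the dataset reports them separately.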
This is real hardware performance data from real Macs - not synthetic benchmarks or self-reported numbers. If you’re working on model optimization, quantization research, Apple Silicon performance analysis, or MLX development, this data could be a valuable resource.
There’s also a Data Explorer for filtering and analyzing the results by chip, model, quantization, and backend.
Looking for testers
I’m actively looking for people to try Anubis and contribute benchmark results. The more diverse the hardware coverage (different M-series chips, different RAM configurations), the more useful the dataset becomes for everyone.
Getting started takes about 2 minutes:
1. Download the latest release (notarized .app) or build from source
2. Point it at your existing Ollama / LM Studio / MLX setup
3. Run a benchmark and watch the metrics in real time
4. Submit your results to the leaderboard
You need macOS 15+ and Apple Silicon (M1 or later). If you already have Ollama or any local LLM backend running, Anubis auto-detects it on launch.
75 stars → Homebrew Cask + Hugging Face distribution
Here’s where I need the community’s help: if the repo hits 75 GitHub stars, we’ll package Anubis as a Homebrew Cask for one-line installs (brew install --cask anubis), and we’ll explore distributing the aggregated benchmark dataset as a proper Hugging Face dataset so it’s easily accessible for research and other AI projects.
We're currently at 34 stars - if you find this useful or interesting, a star on the GitHub repo would go a long way.
Links
- GitHub: github.com/uncSoft/anubis-oss
- Project page: devpadapp.com/anubis-oss.html
- Leaderboard: devpadapp.com/leaderboard.html
- Download: Latest release
- Support: Ko-fi · GitHub Sponsors
A sandboxed version (less capable due to Apple's sandboxing rules) is also available on the Mac App Store, either as part of The Architect's Toolkit bundle or on its own, if you prefer a managed install - purchasing it supports continued open source development.
Happy to answer any questions about the architecture, the data, or Apple Silicon performance in general. Feedback, bug reports, and PRs are all welcome.
— JT