Disclaimer: Personal homelab project. Views are my own. Built and described in the open as a working reference, not a product pitch, not a roadmap, not a customer reference.
TL;DR
I extended my existing tri-stack observability layer to cover a second LLM cluster (Apple Silicon MLX, alongside the existing NVIDIA llama.cpp side) and added Apple-silicon power telemetry as a first-class OTel metric. One LiteLLM front door routes traffic to either backend. One OpenTelemetry Collector fans the metrics, traces, and logs out to SigNoz (self-hosted), Splunk Enterprise (on-prem), and Splunk O11y Cloud. A 120-line Python adapter wraps the macmon binary that ships inside EXO.app and emits mac.power.{total,cpu,gpu,ane,ram} every five seconds.
If you skim, jump to §4 (the four gotchas); that's the engineering content. The architecture sections are scaffolding for them.
Here's what it actually looks like
Two clusters, one operator view each, both serving a real model right now.

The first two shots are the whole post in a single screen each. The rest of this article is how those views happen and what made them annoying to build.
1. Why two clusters (and why hardware backstory matters here)
The NVIDIA side already ran fine: an NVIDIA M6000 24 GB plus three llama-server --rpc workers (rpc-w-01/02/03) for the 70B+ models that don't fit on one card.
A small digression on the M6000, because it's the part everyone in the local-LLM scene rolls their eyes at. The M6000 is a 2015 Maxwell card. By most people's reading, it's e-waste: no FP16, no tensor cores, no FlashAttention, ECC-disabled when used as a desktop card. The cope is real. But it's still 24 GB of VRAM with 317 GB/s memory bandwidth at idle wattage roughly equal to a modern mid-range card, and llama.cpp knows exactly how to use it. With Q4_K_M weights it'll do ~13 tok/s on a 7B model, hold a 128k context, and serve as the prefill node for a 70B RPC-sharded run where the three VM workers carry the rest. It's not fast in 2026 terms. It is, however, capable, cheap, and here, and that's worth more than a benchmark on a card you don't own. The setup is already wired for an SXM2 V100 32 GB drop-in next month; second-hand server-pull SXM2 boards with custom PCB carriers are a real path for the patient hardware crowd. Once that lands, the M6000 stays in the rig as a second card for embeddings, vision pre-processing, and the small models the router doesn't need to swap. Nothing gets thrown out; everything gets a job.
What changed this week: I added an EXO cluster on the Mac side, Mac Studio M4 Max (128 GB) plus a Mac Mini (24 GB), so the same LiteLLM proxy can route to either MLX-on-Apple-Silicon or GGUF-on-CUDA depending on the model. Two reasons:
- MLX dominates on Apple Silicon for memory-bound models. A 35B Qwen3-Coder in 6-bit MLX fits trivially in 128 GB of unified memory and saturates the M4 Max GPU without any swapping. The M6000 has to RPC-shard the same model across three VMs.
- It's a real-world A/B. Same prompt, same LiteLLM client, two backends, instant comparison of latency, throughput, and (now) power-per-token across architectures. That comparison is the whole point of this post.
The cluster boundary lives in cluster=mac-exo vs cluster=nvidia resource attributes on every OTel emission. Dashboards filter on it; alerts route on it.
Aside: the AI hardware squeeze of 2026. I bought the Mac Studio M4 Max with 128 GB unified memory when that tier was still on the order page. The squeeze has now hit both chip lines:
- M4 Max (the one I have) used to offer a 128 GB unified memory tier. As of May 2026, the maximum M4 Max config you can currently order from apple.com is 96 GB, the 128 GB tier I bought into is gone from the order page.
- M3 Ultra is the high-memory line, that's where the 256 GB and 512 GB unified-memory tiers lived. Apple has pulled the 512 GB Ultra upgrade entirely and the price for the remaining high-memory Ultra upgrades jumped over the same window per Tom's Hardware. Delivery on Ultra configs has slipped to 11 to 13 weeks globally as of writing.
Australian buyers feel it harder once duty and the apple.com.au markup are applied. Today on apple.com/au:
- M4 Max base (36 GB): from A$3,499
- M4 Max 64 GB: A$4,549 (16-18 week ship)
- M3 Ultra (96 GB unified memory included): from A$6,999 (11-13 week ship)
- M3 Ultra with 32-core CPU + 80-core GPU (96 GB): A$9,249
The pricing jumps are not gentle. Going from the M4 Max 64 GB to the entry M3 Ultra 96 GB costs roughly A$2,450 more (A$6,999 vs A$4,549), for the privilege of more unified memory. Because you can't currently buy a higher-memory M4 Max in Australia, the path forward at >96 GB is Ultra-only.
RAM commodity prices are at record highs, the squeeze isn't a rumour, it's right there on the order page. I've been quietly hoarding unified memory and DDR for years (this lab's NVIDIA side has 256 GB+ of DDR5 across the workers, plus the 128 GB Mac Studio M4 Max), and that turns out to have been accidentally well-timed. Anyone trying to build this exact stack from scratch today would pay considerably more than I did, in both USD and AUD, and wait roughly twice as long. If you have memory in a drawer, the AI-on-prem story is still open. If you don't, the window narrows.
"Just wait for WWDC / M5 Mac Studio?" Tempting. The natural move is to hold for WWDC 2026 in June, expect Apple to announce the M5 Mac Studio, and hope the unified-memory ceiling goes up. The leaks don't agree with that hope. Per Macworld and Geeky Gadgets, the M5 Mac Studio is now expected to slip from a WWDC reveal to roughly October 2026, explicitly because of the same global DRAM shortage. And the rumoured ceilings actually regress: the M5 Max is expected to top out at 128 GB (same as the M4 Max tier Apple just removed from the order page), and the M5 Ultra is expected to top out at 256 GB, down from the M3 Ultra's old 512 GB ceiling that already got pulled in March.
So we're in the absurd position where, today, the easiest way to buy a brand-new 128 GB Apple Silicon machine is a MacBook Pro M4 Max, the laptop still ships with that tier on apple.com.au, while the desktop workstation in the same chip line has been quietly downgraded to 96 GB. If your AI-on-prem plan was "wait for the next Mac Studio and get more memory than I can get today", the current best guess is: you're going to wait until late 2026, and you're probably going to get the same memory ceiling or less. So "wait" might also mean "buy a MacBook Pro instead and pretend it's a server", which, honestly, more than one person in the local-LLM scene is now doing.
2. The router: one LiteLLM, two backends
LiteLLM sits in the middle as the OpenAI-compatible front door. Every model is registered with a friendly name and routed to its right backend:
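As a rough illustration of that mapping, here is a minimal Python sketch shaped like a LiteLLM `model_list` (a `model_name` plus `litellm_params` with an `api_base`). The side-car hosts and ports are the ones named in this post. The friendly model names, the `http://` scheme, and the `/v1` path are assumptions for illustration, not the actual config.

```python
# Hypothetical routing table shaped like a LiteLLM model_list.
# Hosts/ports are from the post; model names and URL paths are assumptions.
MODEL_LIST = [
    {
        "model_name": "qwen-coder-mlx",  # hypothetical friendly name (Mac / MLX side)
        "litellm_params": {
            "model": "openai/qwen-coder-mlx",
            "api_base": "http://dockerhost:8767/v1",  # exo-router side-car
        },
    },
    {
        "model_name": "qwen-coder-gguf",  # hypothetical friendly name (NVIDIA side)
        "litellm_params": {
            "model": "openai/qwen-coder-gguf",
            "api_base": "http://ai-stack:4004/v1",  # llama-swap side-car
        },
    },
]


def resolve_api_base(model_name: str) -> str:
    """Return the backend URL that a friendly model name routes to."""
    for entry in MODEL_LIST:
        if entry["model_name"] == model_name:
            return entry["litellm_params"]["api_base"]
    raise ValueError(f"unknown model: {model_name!r}")
```

The point of the shape is that clients only ever see the friendly name; which cluster serves it is a property of the table, not of the client.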
The OTel callback emits gen_ai.client.* metrics (LLM request duration, token usage, operation duration) in addition to the standard span traces. That's the telemetry surface every dashboard panel in this post is built on. Important detail: you have to set LITELLM_OTEL_INTEGRATION_ENABLE_METRICS=true as an env var. Without it, only traces flow, no metrics. I lost an hour chasing that one yesterday.
Two side-cars sit between LiteLLM and the actual model servers:
- `exo-router` (`:8767` on dockerhost): a FastAPI lazy-loader I wrote for the Mac side. EXO.app on Mac Studio holds the model; the side-car wakes it on first request and applies a 30-min TTL eviction.
- `llama-swap` (`:4004` on ai-stack): TTL-managed `llama-server` invocations for the NVIDIA side. Drop-in upstream tool. Swaps between models on demand based on the OpenAI-style `model` field in the request.
Both side-cars are decorated with the OTel SDK so the trace tree shows client → litellm → exo-router → mac-studio or client → litellm → llama-swap → m6000 → rpc-w-01 end-to-end in SigNoz APM.
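The wake-on-first-request plus 30-minute TTL behaviour described for exo-router can be sketched in a few lines. This is a simplified stand-in, not the actual side-car: the class name, the `load`/`unload` callbacks, and the injectable clock are all hypothetical, and a real version would also need locking around concurrent requests.

```python
import time
from typing import Callable, Optional


class LazyModel:
    """Load a model on first use and evict it after an idle TTL (sketch)."""

    def __init__(self, load: Callable[[], None], unload: Callable[[], None],
                 ttl_s: float = 30 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load
        self._unload = unload
        self._ttl_s = ttl_s
        self._clock = clock
        self._last_used: Optional[float] = None  # None means "not loaded"

    @property
    def loaded(self) -> bool:
        return self._last_used is not None

    def touch(self) -> None:
        """Call at the start of every request: wake the model if needed."""
        if not self.loaded:
            self._load()
        self._last_used = self._clock()

    def evict_if_idle(self) -> bool:
        """Call periodically; unloads once the model has been idle >= TTL."""
        if self.loaded and self._clock() - self._last_used >= self._ttl_s:
            self._unload()
            self._last_used = None
            return True
        return False
```

In a FastAPI app, `touch()` would sit in the request path and `evict_if_idle()` in a background task that fires every minute or so.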
2.1 The gateway layer above LiteLLM
LiteLLM is the router. It is not the policy/guardrails surface; that lives in front of it. Three OpenAI-compatible gateways on ai-gw-01 decorate every request before it ever reaches the router:
| Gateway | Role | Why it's its own box |
|---|---|---|
| AI-GW-01 / OpenCLAW | Policy + structured logging + per-tenant token accounting | OpenAI-compatible endpoint that wraps every call with audit-grade logging (CLAW = "Control, Logging, Audit, Workflow"). Sits between the agentic clients and LiteLLM. Lets me say "show me every prompt that touched a model in the last 24h, by team, with the redacted body" without hot-patching the proxy. |
| NeMo CLAW | Content safety, jailbreak detection, output rails | Wraps NVIDIA NeMo Guardrails (Python sidecar). Refuses inputs that match prompt-injection patterns; rewrites outputs that leak secrets. Same OpenAI shape, so any client that talks to GPT-4 talks to this one. |
| Hermes Agent | Multi-step agentic reasoning loop | Tool-using agent that wraps LiteLLM. Plans, calls MCP tools (Splunk, Proxmox, TrueNAS, Outline, Plane, Grist), and returns the synthesized answer. The client thinks it's talking to a smart model; it's really a planner-executor chain backed by whatever model LiteLLM routes to. |
All three speak the OpenAI /v1/chat/completions shape, route into LiteLLM, and ultimately land on either the NVIDIA cluster (default for tool-using agents because llama-swap holds the model warm) or the Mac MLX side (default for high-context coding). Each gateway emits its own OTel traces, so the SigNoz APM tree shows client → ai-gw-01 → litellm → llama-swap → m6000 end-to-end, with per-hop latency. The trace tree is the killer feature here: most local-LLM stacks treat the gateway as opaque, but with OTel everywhere, every hop is queryable.
The Claude Code CLI is the one client that bypasses the gateways and talks to the Mac side directly via the exo-router lazy-loader. Reason: Claude Code drives long agentic loops that thrash the gateway logging volume, and for IDE-style coding work I want the MLX latency floor (~50 ms vs ~200 ms through the full stack).
3. Power telemetry: macmon → OpenTelemetry
This is the new bit, and it's the reason I'm posting.
EXO.app bundles a small Rust binary called macmon that reads Apple-silicon power and temperature data via Apple's IOReport framework. EXO's own topology UI uses it to show per-node wattage live. That data is exactly what I want as a queryable metric series across my whole observability stack, but it's locked inside EXO.app's UI.
So I wrote a 120-line Python adapter that runs macmon pipe --interval 5000 as a subprocess, parses each JSON sample, and POSTs it as OTLP-format gauges to my central OTel collector on ai-stack:
Stdlib only, no requests, no opentelemetry-sdk. Just urllib.request.urlopen. It runs as a per-user LaunchAgent on both Macs and sends every 5 s. Happy to share the full adapter if you want it published, drop me a message via the contact form.
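For readers who want the shape of such an adapter before asking for the real one, here is a stdlib-only sketch. It is not the 120-line adapter itself. The macmon JSON field names (`all_power`, `cpu_power`, and so on) are my assumption about its output, the scope name `macmon-adapter` is made up, and the endpoint is whatever OTLP/HTTP metrics URL your collector exposes (conventionally port 4318, path `/v1/metrics`).

```python
import json
import subprocess
import time
import urllib.request

# Assumed macmon JSON field -> (OTel metric name, unit). Metric names are the
# ones used in this post; the macmon field names are an assumption.
POWER_FIELDS = {
    "all_power": ("mac.power.total", "W"),
    "cpu_power": ("mac.power.cpu", "W"),
    "gpu_power": ("mac.power.gpu", "W"),
    "ane_power": ("mac.power.ane", "W"),
    "ram_power": ("mac.power.ram", "W"),
}


def sample_to_otlp(sample: dict, host: str, now_ns: int) -> dict:
    """Convert one macmon sample into an OTLP/HTTP JSON metrics payload."""
    metrics = [
        {"name": name, "unit": unit,
         "gauge": {"dataPoints": [{"asDouble": float(sample[field]),
                                   "timeUnixNano": str(now_ns)}]}}
        for field, (name, unit) in POWER_FIELDS.items() if field in sample
    ]
    attrs = {"host.name": host, "cluster": "mac-exo",
             "deployment.environment": "ozteklab-lab"}
    return {"resourceMetrics": [{
        "resource": {"attributes": [
            {"key": k, "value": {"stringValue": v}} for k, v in attrs.items()]},
        "scopeMetrics": [{"scope": {"name": "macmon-adapter"},
                          "metrics": metrics}],
    }]}


def post_otlp(payload: dict, endpoint: str) -> None:
    """POST one payload to the collector's OTLP/HTTP metrics endpoint."""
    req = urllib.request.Request(
        endpoint, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()


def run(macmon_path: str, endpoint: str, host: str) -> None:
    """Stream samples from macmon and forward each one; never returns."""
    proc = subprocess.Popen([macmon_path, "pipe", "--interval", "5000"],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        try:
            post_otlp(sample_to_otlp(json.loads(line), host, time.time_ns()),
                      endpoint)
        except (ValueError, OSError) as exc:  # bad JSON or collector down
            print(f"skipping sample: {exc}")
```

Keeping `sample_to_otlp` pure makes the payload shape easy to check without a Mac or a collector; `run(...)` is the only part that touches the outside world.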
What lands at the central otelcol (and then fans out to all three backends):
| Metric | Unit | What |
|---|---|---|
| mac.power.total | W | Whole-board power draw |
| mac.power.gpu | W | GPU domain (where MLX inference runs) |
| mac.power.cpu / .ane / .ram / .system | W | Per-domain breakdown |
| mac.temp.{cpu,gpu} | Cel | Avg die temperatures |
| mac.gpu.usage | 0–1 | GPU utilization fraction |
| mac.cpu.{e,p}cpu_usage | 0–1 | E-core / P-core utilization |
| mac.memory.ram_{used,total} | bytes | Memory pressure |
Resource attributes: host.name, cluster=mac-exo, deployment.environment=ozteklab-lab. Standard OTel hygiene, everything is filterable.
A SigNoz query against this looks like any other gauge:
And the answer right now, with the cluster idle: ~20 W across both Macs. Under inference (Qwen3-Coder generating): ~60–80 W. That's the raw power signal, the same number EXO's UI shows internally, now landing as a real OTel gauge that any dashboard can query. The ratio (tokens / watt, the actual efficiency number) is the obvious next panel to build, the two source streams are both flowing into SigNoz already, see §7.
4. Four gotchas worth their own headlines
These are the things that ate my time. Documenting them so they don't eat yours.
4.1 macOS silently blocks unsigned binaries from LAN, even though ping, curl, and nc all work
This one took me 30 minutes the first time and 5 minutes every time after because I finally wrote it down.
You install otelcol-contrib from the upstream GitHub release on a Mac. It's a Go binary with ad-hoc linker signing (Signature=adhoc, no Apple notarization). You start it. Logs look fine. Then this in the export-error stream:
dial tcp <ai-stack-ip>:4318: connect: no route to host
Meanwhile, from the same shell:
macOS 15 (Sequoia) and later enforces a Local Network privacy framework on every process trying to reach an RFC1918 address. Unsigned binaries can't request the permission (no app bundle to attach the prompt to), so the syscall returns EHOSTUNREACH silently. There's no log line. The TCC database doesn't show a denial. You just get a fake "no route to host" forever.
Fix: codesign --force --deep -s - /path/to/binary (ad-hoc signature). After that, the first launch of the process triggers the popup; user clicks Allow; the grant persists per cdhash.
Sub-gotcha I burned on later: re-signing an already-signed binary strips the grant, even when the resulting Signature=adhoc identifier looks identical. Don't re-codesign on every deploy, make it conditional:
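One way to make the signing step conditional, sketched in Python. The heuristic is an assumption on my part rather than the only possible check: it treats a binary as needing a sign if `codesign -dv` fails (no signature) or if its output carries the `linker-signed` flag that toolchain-signed binaries show, and leaves anything already explicitly signed alone so the existing grant survives.

```python
import subprocess


def needs_signing(returncode: int, codesign_output: str) -> bool:
    """Decide from `codesign -dv` whether an ad-hoc sign is still required."""
    if returncode != 0:  # codesign found no usable signature
        return True
    # Assumption: a toolchain-only signature shows the linker-signed flag.
    return "linker-signed" in codesign_output


def ensure_signed(path: str) -> bool:
    """Ad-hoc sign `path` only if needed; returns True if it was signed now."""
    probe = subprocess.run(["codesign", "-dv", path],
                           capture_output=True, text=True)
    if not needs_signing(probe.returncode, probe.stdout + probe.stderr):
        return False  # already signed: leave it alone, keep the grant
    subprocess.run(["codesign", "--force", "--deep", "-s", "-", path],
                   check=True)
    return True
```

A deploy script would call `ensure_signed(...)` on every run; only the first run after a fresh download actually re-signs.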
4.2 open -a EXO --args ... silently drops arguments
When the Mac peers got out of sync, I tried pinning EXO's libp2p port to make bootstrap deterministic across reboots:
EXO bound a random port. Args were dropped. macOS's open accepts --args but the receiving .app has to declare it intends to read them; most bundled apps don't.
What works: invoke the bundled binary directly.
What breaks if you do that: the bundled macmon subprocess can't spawn anymore (loses the .app's signed entitlements), and the topology UI's power column goes to 0. You can't have both. I chose the .app launch (wattage matters more than fixed port; cross-VLAN peer discovery via mDNS reflector works fine without).
4.3 libp2p mDNS across VLANs needs a UniFi gateway mDNS reflector
EXO's peer discovery is libp2p mDNS only in the current v1 build, no config.toml, no bootstrap-peers for the Mac UI to consume even though the CLI flag exists. (I confirmed by grep -r config.toml inside the bundle: one TODO comment, never loaded.)
Apple's Bonjour service browser announcements are link-local multicast, so they don't cross L3 boundaries unless something forwards them. UniFi has had a built-in "Multicast DNS" / Bonjour reflector for years; you enable it per-VLAN. Mine had to span VLAN 60 (Mac Studio) and VLAN 80 (Mac Mini, on the Coder workspaces VLAN).
Two clicks in UniFi UI → both Macs see each other → EXO topology shows both nodes connected → MLX Ring inference works across the pair.
4.4 Splunk Observability Cloud silently rejects new metric names on trial accounts
The OTel collector reports otelcol_exporter_sent_metric_points{exporter="signalfx"} 979675, send_failed=0. The SignalFX ingest API returns 200 OK to direct mac.power.total POSTs. The metric never appears in /v2/metric or in SignalFlow queries.
You've hit the trial-account custom-MTS cap. The org-info endpoint returns metricTimeSeriesLimit: null when you're already past it. No error message anywhere in the pipeline.
Mac power data lands in SigNoz (self-hosted, no quota) and Splunk Enterprise (lab_otel_metrics index, on-prem, no quota) just fine. Only the cloud trial silently dropped it. Diagnostic check: confirm the metric exists in lab_otel_metrics:
You'll see all 13 mac.* series. The pipeline is healthy; the trial backend is just done with you.
5. Dashboards in three places
Same shape across backends so the screenshots are comparable:
- SigNoz (exo, NVIDIA): row 1 cluster headline tiles, row 2 per-host load, rows 3-5 LLM perf, rows 6-7 token rate + cost, rows 8-9 power (the new bit).
- Splunk O11y (exo, NVIDIA): identical layout, SignalFlow programs instead of SigNoz QueryBuilder. Limitation: SignalFX flattens OTel histograms to count/min/max only, so no p50/p95/p99 latency panels, just min/max/rate.
- Splunk Enterprise dashboards: SPL with `tstats` over `lab_otel` (logs) and `mstats` over `lab_otel_metrics` (metrics). Power data queryable today with `| mstats avg("mac.power.total") WHERE index=lab_otel_metrics BY host.name`.

Two SigNoz-specific patterns that paid off:
- The `Macs reporting` count tile uses a ClickHouse SQL panel rather than a builder query, because SigNoz's value-panel `reduceTo: count` shows the first series's label (e.g. `mac-studio`) instead of the series count. Direct `count(DISTINCT host.name)` SQL fixed it.
- For the LLM perf panels, use `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`, not `prompt_tokens`/`completion_tokens`. Both names exist in SigNoz schema discovery; only the new semconv names are actually populated by LiteLLM. Hours of empty panels until I queried both.
5b. The operator UI: cluster-dashboard
There's a piece of glue I haven't talked about in this post but it's the one I look at most often: cluster-dashboard, a FastAPI + React app on :8801 that gives me an octagon-topology view of every node, a live nvidia-smi/macmon panel, model load/unload buttons that talk to llama-swap, and an in-browser benchmark that streams tokens directly from llama.cpp's OpenAI endpoint while recording p50/p95/TTFT/tokens-per-second into a SQLite history table. It's how I made the side-by-side Mac-vs-NVIDIA comparison concrete instead of vibes.
The Models library page is the unified view across HuggingFace cache and Ollama-format weights, with quick-load buttons and live VRAM accounting. The benchmark compare view puts two runs side-by-side:

The benchmark flow: pick a model, drop in a prompt, click Run. Tokens stream live; metrics fill in as the response completes. Every run is persisted so you can compare apples-to-apples across models, prompts, max_tokens settings, and RPC vs LOCAL backends.
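To make the recorded numbers concrete, here is a small sketch of how one run could be summarised and persisted. It is not the cluster-dashboard code. The table layout and function names are hypothetical, and I am assuming the p50/p95 figures refer to the gap between consecutive tokens, with token arrival times measured in seconds from the start of the request.

```python
import math
import sqlite3


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(arrivals_s):
    """Summarise one run from token arrival times (needs >= 2 tokens)."""
    gaps = [b - a for a, b in zip(arrivals_s, arrivals_s[1:])]
    return {
        "ttft_s": arrivals_s[0],                        # time to first token
        "tok_per_s": len(arrivals_s) / arrivals_s[-1],  # end-to-end throughput
        "p50_gap_s": percentile(gaps, 50),
        "p95_gap_s": percentile(gaps, 95),
    }


def record_run(db: sqlite3.Connection, model: str, backend: str, arrivals_s):
    """Persist one benchmark run into a hypothetical `runs` history table."""
    db.execute("CREATE TABLE IF NOT EXISTS runs (model TEXT, backend TEXT, "
               "ttft_s REAL, tok_per_s REAL, p50_gap_s REAL, p95_gap_s REAL)")
    s = summarize(arrivals_s)
    db.execute("INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?)",
               (model, backend, s["ttft_s"], s["tok_per_s"],
                s["p50_gap_s"], s["p95_gap_s"]))
    db.commit()
    return s
```

Comparing two backends then becomes a plain `SELECT` over `runs` filtered by model and grouped by backend.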

I also put a Definitions tab in the dashboard with the quantization / precision / context-length / MLX-vs-GGUF glossary I keep wishing existed when I'm onboarding the next person to the lab, sources cited, no AI-generated waffle:

And the same Mac cluster, viewed through EXO.app's native UI on Mac Studio, same model, different operator perspective:

Sample output from a qwen-coder run against the local NVIDIA cluster, same OpenAI API contract any client expects, no remote calls, ~13 tok/s on the M6000 with the 7B model:

5c. The hardware story behind this lab (and what's next on the NVIDIA side)
This is the part for the hardware nerds. The M6000 is doing a job that the internet wrote off as "e-waste", and the lab is wired today to slot in a V100 SXM2 32 GB drop-in as soon as one lands. The plan:
- SXM2 V100 32 GB arrives → install in a desktop-PCIe carrier (the $200 V100 SXM2 hack on Tom's Hardware showed it's a real path, not a Twitter party trick).
- M6000 stays in the rig, second card for embeddings, vision pre-processing, and small models the router shouldn't have to swap. Nothing gets retired; everything earns its rack space.
- Add the V100 as a fourth member of the RPC ring, currently 1 M6000 prefill + 3 RPC workers; V100 becomes the workhorse for the 70B-class models the M6000 has to RPC-shard.
- EXO ring + NVIDIA RPC ring side-by-side, same prompt, same client, same dashboard, four backends now (Mac Studio, Mac Mini, M6000, V100). The watts/token comparison gets sharper because Volta in INT8 is a different efficiency curve than Maxwell or Apple Silicon.
The honest constraint on the SXM2 plan is software, not silicon. The serving ecosystem is moving on from Volta (compute capability sm_70):
- vLLM still technically supports sm_70, but in practice v0.20+ has broken kernel paths, BF16 isn't supported on Volta at all (BF16 requires compute capability 8.0+), and newer models hit `no kernel image is available` errors on V100 (vLLM issue #25456). Community forks like `1Cat-vLLM` backport CUDA 12.8 + AWQ kernels to keep V100 viable, but you're maintaining your own fork the moment you commit to that path.
- TensorRT-LLM, TGI, and Triton have all dropped sm_70 from their current container releases.
- llama.cpp, the actual home of this lab's NVIDIA side, treats Volta as a first-class target. CUDA 12.x compiles cleanly, Q4_K_M/Q5_K_M paths work, FlashAttention-2 isn't supported on Volta but the regular attention path is fine for inference. This is why the cluster doesn't depend on vLLM at all, and why a V100 drop-in won't break it.
So: when the SXM2 arrives, the work is to rebuild the llama.cpp container with sm_70 enabled, plug in the carrier, register it as a new RPC node, and update the cluster-dashboard topology. That's an afternoon. The vLLM rebuild, if I ever need it, is a fork-and-pin-and-pray exercise; the lab's design deliberately avoids needing that.
6. What this looks like as a hiring artifact
I'm writing this with one eye on the Splunk Platform AI Architect / NVIDIA SE roles I'm targeting. So the explicit "why this matters" framing for a hiring manager:
- It's not a script, it's a platform. Two backends, three observability stacks, one router, full GitOps, idempotent deploy. ~120 lines of Python for the power exporter, ~80 for the EXO LaunchAgent wrappers, the rest is OTel config + dashboard JSON, all managed in my private GitOps repo. If you want to see any specific piece on a public GitHub mirror, send me a note via the contact form and I'll publish the relevant bits.
- The dashboards exist on a real backend you can audit. The SigNoz instance is publicly reachable on `signoz.ozteklab.com`; the Splunk side is mine. Screenshots are nice; live URLs you can poke at are nicer.
- The gotchas are documented in the open so the next engineer doesn't spend 30 minutes on the macOS Local Network grant. That's the difference between a homelab and a platform.
If you're a Splunk SE looking at this thinking "the AI Ops angle is real and he can actually run the gear", that's the message. Get in touch.
7. What's still missing
- The Splunk O11y trial dies in a few days. Power data won't be in the cloud screenshots until either the trial extends or I move to paid. Not worth chasing further on a clock.
- NVIDIA-side power telemetry. `nvidia-smi` gives me total board watts at one-second resolution; need a similar Python adapter to emit `gpu.power.{board,sm}` as gauges. Should take an evening.
- Tokens-per-watt as a dashboard metric. `rate(gen_ai.usage.output_tokens) / mac.power.gpu` gives a tokens/W series per model. The maths is one formula away; just need to ship the panel.
- Cost overlay. When the metric lands, multiply power by your local electricity rate and divide by token throughput to get $/1k-tokens on-prem. That's the real comparison vs OpenAI API pricing, and the bit that makes the on-prem ROI story land.
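The arithmetic behind the last two items, as a small sketch. Dividing a token rate (tokens/s) by power (W, i.e. J/s) gives tokens per joule, which is what the "tokens/W" panel really plots. The function names are mine, and the cost figure covers electricity only, not hardware.

```python
JOULES_PER_KWH = 3.6e6


def tokens_per_joule(tokens_per_s: float, watts: float) -> float:
    """Token rate divided by power: (tokens/s) / (J/s) = tokens per joule."""
    return tokens_per_s / watts


def cost_per_1k_tokens(tokens_per_s: float, watts: float,
                       price_per_kwh: float) -> float:
    """Electricity cost of generating 1,000 tokens at a steady rate and draw."""
    joules_per_token = watts / tokens_per_s
    kwh_per_1k_tokens = joules_per_token * 1000 / JOULES_PER_KWH
    return kwh_per_1k_tokens * price_per_kwh
```

As a purely hypothetical input, 20 tok/s at 80 W is 4 J per token, so 1,000 tokens take 4,000 J (about 0.0011 kWh); at a rate of 0.30 per kWh that is roughly 0.00033 per 1k tokens in the same currency.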
The full architecture, scripts, OpenTelemetry collector config, and dashboard JSON live in my private homelab GitOps repo. If anything in this post is useful for your setup and you'd like the specific piece published on GitHub, the macmon adapter, the SigNoz dashboard JSON, the otelcol config, the LaunchAgent plists, please drop me a note via the contact form and I'll publish the relevant bits. If the gotcha section saved you an hour, that's why I wrote it.