OztekLab

Heterogeneous AI Inference at Home: Mac MLX + NVIDIA llama.cpp Side-by-Side, One OpenTelemetry Stream, Watts and Tokens in Splunk + SigNoz

19 min read
splunksplunk-observability-cloudsignozopentelemetryotelcollitellmmlxexollama-cppapple-siliconmacmonmacoslaunchdlibp2pnvidiahomelab

Disclaimer: Personal homelab project. Views are my own. Built and described in the open as a working reference, not a product pitch, not a roadmap, not a customer reference.

TL;DR

I extended my existing tri-stack observability layer to cover a second LLM cluster, Apple Silicon MLX on top of the existing NVIDIA llama.cpp side, and added Apple-silicon power telemetry as a first-class OTel metric. One LiteLLM front door routes traffic to either backend. One OpenTelemetry Collector fans the metrics, traces, and logs out to SigNoz (self-hosted), Splunk Enterprise (on-prem), and Splunk O11y Cloud. A 120-line Python adapter wraps the macmon binary that ships inside EXO.app and emits mac.power.{total,cpu,gpu,ane,ram} every five seconds.

If you skim, jump to §4 (the four gotchas), that's the engineering content. The architecture sections are scaffolding for them.

Dual-Cluster Architecture, LiteLLM front-door routes to Mac MLX (EXO) and NVIDIA llama.cpp clusters, with AI-GW-01 (OpenCLAW), NeMo CLAW, and Hermes Agent as the OpenAI-compatible gateway layer

Here's what it actually looks like

Two clusters, one operator view each, both serving a real model right now.

NVIDIA cluster operator view, ai-stack (M6000) at the center of an octagon topology with three RPC worker nodes (rpc-w-01/02/03) and the Mac Mini visible. Loaded model: gemma-4-26b-a4b-rpc, 16.9 GB, state RPC ready. Live nvidia-smi panel underneath showing M6000 utilization, processes, and VRAM
NVIDIA cluster operator view, ai-stack (M6000) at the center of an octagon topology with three RPC worker nodes (rpc-w-01/02/03) and the Mac Mini visible. Loaded model: gemma-4-26b-a4b-rpc, 16.9 GB, state RPC ready. Live nvidia-smi panel underneath showing M6000 utilization, processes, and VRAM
1/3

The first two shots are the whole post in a single screen each, the rest of this article is how those views happen and what made them annoying to build.


1. Why two clusters (and why hardware backstory matters here)

The NVIDIA side already ran fine: an NVIDIA M6000 24 GB plus three llama-server --rpc workers (rpc-w-01/02/03) for the 70B+ models that don't fit on one card.

A small digression on the M6000, because it's the part everyone in the local-LLM scene rolls their eyes at. The M6000 is a 2015 Maxwell card. By most people's reading, it's e-waste, no FP16, no tensor cores, no FlashAttention, ECC-disabled when used as a desktop card. The cope is real. But: it's still 24 GB of VRAM with 317 GB/s memory bandwidth at idle wattage roughly equal to a modern mid-range card, and llama.cpp knows exactly how to use it. With Q4_K_M weights it'll do ~13 tok/s on a 7B model, hold a 128k context, and serve as the prefill node for a 70B RPC-sharded run that the three VM workers carry the rest of. It's not fast in 2026 terms. It is, however, capable, cheap, and here, and that's worth more than a benchmark on a card you don't own. The setup is already wired for an SXM2 V100 32 GB drop-in next month, second-hand server-pull SXM2 boards with custom PCB carriers are a real path for the patient hardware crowd, once that lands, the M6000 stays in the rig as a second card for embeddings, vision pre-processing, and the small models the router doesn't need to swap. Nothing gets thrown out; everything gets a job.

What changed this week: I added an EXO cluster on the Mac side, Mac Studio M4 Max (128 GB) plus a Mac Mini (24 GB), so the same LiteLLM proxy can route to either MLX-on-Apple-Silicon or GGUF-on-CUDA depending on the model. Two reasons:

  1. MLX dominates on Apple Silicon for memory-bound models. A 35B Qwen3-Coder in 6-bit MLX fits trivially in 128 GB of unified memory and saturates the M4 Max GPU without any swapping. The M6000 has to RPC-shard the same model across three VMs.
  2. It's a real-world A/B. Same prompt, same LiteLLM client, two backends, instant comparison of latency, throughput, and (now) power-per-token across architectures. That comparison is the whole point of this post.

The cluster boundary lives in cluster=mac-exo vs cluster=nvidia resource attributes on every OTel emission. Dashboards filter on it; alerts route on it.

Aside: the AI hardware squeeze of 2026. I bought the Mac Studio M4 Max with 128 GB unified memory when that tier was still on the order page. The squeeze has now hit both chip lines:

  • M4 Max (the one I have) used to offer a 128 GB unified memory tier. As of May 2026, the maximum M4 Max config you can currently order from apple.com is 96 GB, the 128 GB tier I bought into is gone from the order page.
  • M3 Ultra is the high-memory line, that's where the 256 GB and 512 GB unified-memory tiers lived. Apple has pulled the 512 GB Ultra upgrade entirely and the price for the remaining high-memory Ultra upgrades jumped over the same window per Tom's Hardware. Delivery on Ultra configs has slipped to 11 to 13 weeks globally as of writing.

Australian buyers feel it harder once duty and the apple.com.au markup are applied. Today on apple.com/au:

  • M4 Max base (36 GB): from A$3,499
  • M4 Max 64 GB: A$4,549 (16-18 week ship)
  • M3 Ultra (96 GB unified memory included): from A$6,999 (11-13 week ship)
  • M3 Ultra with 32-core CPU + 80-core GPU (96 GB): A$9,249

The pricing jumps are not gentle. Going from M4 Max 64 GB to the entry M3 Ultra 96 GB is roughly AU$2,450 for the privilege of more unified memory (because today you can't actually buy a higher-memory M4 Max in Australia, the path forward at >96 GB is Ultra-only).

RAM commodity prices are at record highs, the squeeze isn't a rumour, it's right there on the order page. I've been quietly hoarding unified memory and DDR for years (this lab's NVIDIA side has 256 GB+ of DDR5 across the workers, plus the 128 GB Mac Studio M4 Max), and that turns out to have been accidentally well-timed. Anyone trying to build this exact stack from scratch today would pay considerably more than I did, in both USD and AUD, and wait roughly twice as long. If you have memory in a drawer, the AI-on-prem story is still open. If you don't, the window narrows.

"Just wait for WWDC / M5 Mac Studio?" Tempting. The natural move is to hold for WWDC 2026 in June, expect Apple to announce the M5 Mac Studio, and hope the unified-memory ceiling goes up. The leaks don't agree with that hope. Per Macworld and Geeky Gadgets, the M5 Mac Studio is now expected to slip from a WWDC reveal to roughly October 2026, explicitly because of the same global DRAM shortage. And the rumoured ceilings actually regress: the M5 Max is expected to top out at 128 GB (same as the M4 Max tier Apple just removed from the order page), and the M5 Ultra is expected to top out at 256 GB, down from the M3 Ultra's old 512 GB ceiling that already got pulled in March.

So we're in the absurd position where, today, the easiest way to buy a brand-new 128 GB Apple Silicon machine is a MacBook Pro M4 Max, the laptop still ships with that tier on apple.com.au, while the desktop workstation in the same chip line has been quietly downgraded to 96 GB. If your AI-on-prem plan was "wait for the next Mac Studio and get more memory than I can get today", the current best guess is: you're going to wait until late 2026, and you're probably going to get the same memory ceiling or less. So "wait" might also mean "buy a MacBook Pro instead and pretend it's a server", which, honestly, more than one person in the local-LLM scene is now doing.


2. The router: one LiteLLM, two backends

LiteLLM sits in the middle as the OpenAI-compatible front door. Every model is registered with a friendly name and routed to its right backend:

YAML
# /home/jp/litellm/config.yaml (excerpt)
model_list:
  # MLX side, routed through exo-router lazy-loader on dockerhost
  - model_name: qwen-coder-mac
    litellm_params:
      model: openai/mlx-community/Qwen3-Coder-Next-6bit
      api_base: http://exo-router.local:8767/v1  # exo-router

  # NVIDIA side, routed through llama-swap
  - model_name: qwen-coder-nvidia
    litellm_params:
      model: openai/qwen-coder-30b
      api_base: http://ai-stack.local:4004/v1  # llama-swap

litellm_settings:
  success_callback: [langfuse, otel]
  failure_callback: [langfuse, otel]
  drop_params: true

The OTel callback emits gen_ai.client.* metrics (LLM request duration, token usage, operation duration) in addition to the standard span traces. That's the telemetry surface every dashboard panel in this post is built on. Important detail: you have to set LITELLM_OTEL_INTEGRATION_ENABLE_METRICS=true as an env var, without it, only traces flow, no metrics. I lost an hour chasing that one yesterday.

Two side-cars sit between LiteLLM and the actual model servers:

  • exo-router (:8767 on dockerhost): a FastAPI lazy-loader I wrote for the Mac side. EXO.app on Mac Studio holds the model; the side-car wakes it on first request and applies a 30-min TTL eviction.
  • llama-swap (:4004 on ai-stack): TTL-managed llama-server invocations for the NVIDIA side. Drop-in upstream tool. Swaps between models on demand based on the OpenAI-style model field in the request.

Both side-cars are decorated with the OTel SDK so the trace tree shows client → litellm → exo-router → mac-studio or client → litellm → llama-swap → m6000 → rpc-w-01 end-to-end in SigNoz APM.

2.1 The gateway layer above LiteLLM

LiteLLM is the router. It is not the policy/guardrails surface, that lives in front of it. Three OpenAI-compatible gateways on ai-gw-01 decorate every request before it ever reaches the router:

GatewayRoleWhy it's its own box
AI-GW-01 / OpenCLAWPolicy + structured logging + per-tenant token accountingOpenAI-compatible endpoint that wraps every call with audit-grade logging (CLAW = "Control, Logging, Audit, Workflow"). Sits between the agentic clients and LiteLLM. Lets me say "show me every prompt that touched a model in the last 24h, by team, with the redacted body" without hot-patching the proxy.
NeMo CLAWContent safety, jailbreak detection, output railsWraps NVIDIA NeMo Guardrails (Python sidecar). Refuses inputs that match prompt-injection patterns; rewrites outputs that leak secrets. Same OpenAI shape, so any client that talks to GPT-4 talks to this one.
Hermes AgentMulti-step agentic reasoning loopTool-using agent that wraps LiteLLM. Plans, calls MCP tools (Splunk, Proxmox, TrueNAS, Outline, Plane, Grist), and returns the synthesized answer. The client thinks it's talking to a smart model; it's really a planner-executor chain backed by whatever model LiteLLM routes to.

All three speak the OpenAI /v1/chat/completions shape, route into LiteLLM, and ultimately land on either the NVIDIA cluster (default for tool-using agents because llama-swap holds the model warm) or the Mac MLX side (default for high-context coding). Each gateway emits its own OTel traces, so the SigNoz APM tree shows client → ai-gw-01 → litellm → llama-swap → m6000 end-to-end, with per-hop latency. The trace tree is the killer feature here, most local-LLM stacks treat the gateway as opaque; with OTel everywhere, every hop is queryable.

The Claude Code CLI is the one client that bypasses the gateways and talks to the Mac side directly via the exo-router lazy-loader. Reason: Claude Code drives long agentic loops that thrash the gateway logging volume, and for IDE-style coding work I want the MLX latency floor (~50 ms vs ~200 ms through the full stack).


3. Power telemetry: macmon → OpenTelemetry

This is the new bit, and it's the reason I'm posting.

EXO.app bundles a small Rust binary called macmon that reads Apple-silicon power and temperature data via Apple's IOReport framework. EXO's own topology UI uses it to show per-node wattage live. That data is exactly what I want as a queryable metric series across my whole observability stack, but it's locked inside EXO.app's UI.

So I wrote a 120-line Python adapter that runs macmon pipe --interval 5000 as a subprocess, parses each JSON sample, and POSTs it as OTLP-format gauges to my central OTel collector on ai-stack:

PYTHON
# Excerpt: macmon-otel-exporter.py
proc = subprocess.Popen(
    [MACMON, "pipe", "--interval", str(interval_ms)],
    stdout=subprocess.PIPE, bufsize=1, text=True,
)
for line in proc.stdout:
    sample = json.loads(line)
    metrics = []
    metrics.append(gauge("mac.power.total",  sample["all_power"], "W"))
    metrics.append(gauge("mac.power.cpu",    sample["cpu_power"], "W"))
    metrics.append(gauge("mac.power.gpu",    sample["gpu_power"], "W"))
    metrics.append(gauge("mac.power.ane",    sample["ane_power"], "W"))
    metrics.append(gauge("mac.power.ram",    sample["ram_power"], "W"))
    metrics.append(gauge("mac.power.system", sample["sys_power"], "W"))
    # ... temps, GPU usage, memory ...
    post_otlp(metrics, host_name=HOST, cluster="mac-exo")

Stdlib only, no requests, no opentelemetry-sdk. Just urllib.request.urlopen. It runs as a per-user LaunchAgent on both Macs and sends every 5 s. Happy to share the full adapter if you want it published, drop me a message via the contact form.

What lands at the central otelcol (and then fans out to all three backends):

MetricUnitWhat
mac.power.totalWWhole-board power draw
mac.power.gpuWGPU domain (where MLX inference runs)
mac.power.cpu / .ane / .ram / .systemWPer-domain breakdown
mac.temp.{cpu,gpu}CelAvg die temperatures
mac.gpu.usage0–1GPU utilization fraction
mac.cpu.{e,p}cpu_usage0–1E-core / P-core utilization
mac.memory.ram_{used,total}bytesMemory pressure

Resource attributes: host.name, cluster=mac-exo, deployment.environment=ozteklab-lab. Standard OTel hygiene, everything is filterable.

A SigNoz query against this looks like any other gauge:

SQL
-- Live mac-exo cluster total power
SELECT avg("mac.power.total") AS value
FROM signoz_metrics.distributed_time_series_v4
WHERE labels:cluster = 'mac-exo'

And the answer right now, with the cluster idle: ~20 W across both Macs. Under inference (Qwen3-Coder generating): ~60–80 W. That's the raw power signal, the same number EXO's UI shows internally, now landing as a real OTel gauge that any dashboard can query. The ratio (tokens / watt, the actual efficiency number) is the obvious next panel to build, the two source streams are both flowing into SigNoz already, see §7.


4. Four gotchas worth their own headlines

These are the things that ate my time. Documenting them so they don't eat yours.

4.1 macOS silently blocks unsigned binaries from LAN, even though ping, curl, and nc all work

This one took me 30 minutes the first time and 5 minutes every time after because I finally wrote it down.

You install otelcol-contrib from the upstream GitHub release on a Mac. It's a Go binary with ad-hoc linker signing (Signature=adhoc, no Apple notarization). You start it. Logs look fine. Then this in the export-error stream:

dial tcp <ai-stack-ip>:4318: connect: no route to host

Meanwhile, from the same shell:

BASH
$ curl http://<ai-stack-ip>:4318/v1/metrics
200 OK
$ ping <ai-stack-ip>   # fine
$ nc -zv <ai-stack-ip> 4318   # fine

macOS 14+ enforces a Local Network privacy framework on every process trying to reach an RFC1918 address. Unsigned binaries can't request the permission (no app bundle to attach the prompt to), so the syscall returns EHOSTUNREACH silently. There's no log line. The TCC database doesn't show a denial. You just get a fake "no route to host" forever.

Fix: codesign --force --deep -s - /path/to/binary (ad-hoc signature). After that, the first launch of the process triggers the popup; user clicks Allow; the grant persists per cdhash.

Sub-gotcha I burned on later: re-signing an already-signed binary strips the grant, even when the resulting Signature=adhoc identifier looks identical. Don't re-codesign on every deploy, make it conditional:

BASH
sig_ok=$(codesign -dv "$bin" 2>&1 | grep -c 'Signature=adhoc\|Identifier=otelcol' || true)
if [ "$sig_ok" -gt 0 ]; then
  echo "already signed, leaving alone"
else
  codesign --force --deep -s - "$bin"
fi

4.2 open -a EXO --args ... silently drops arguments

When the Mac peers got out of sync, I tried pinning EXO's libp2p port to make bootstrap deterministic across reboots:

BASH
open -a EXO --args --libp2p-port 9090

EXO bound a random port. Args were dropped. macOS's open accepts --args but the receiving .app has to declare it intends to read them; most bundled apps don't.

What works: invoke the bundled binary directly.

BASH
exec /Applications/EXO.app/Contents/Resources/exo/exo --libp2p-port 9090

What breaks if you do that: the bundled macmon subprocess can't spawn anymore (loses the .app's signed entitlements), and the topology UI's power column goes to 0. You can't have both. I chose the .app launch (wattage matters more than fixed port; cross-VLAN peer discovery via mDNS reflector works fine without).

4.3 libp2p mDNS across VLANs needs a UniFi gateway mDNS reflector

EXO's peer discovery is libp2p mDNS only in the current v1 build, no config.toml, no bootstrap-peers for the Mac UI to consume even though the CLI flag exists. (I confirmed by grep -r config.toml inside the bundle: one TODO comment, never loaded.)

Apple's Bonjour service browser announcements are link-local multicast, they don't cross L3 boundaries unless something forwards them. UniFi has had a built-in "Multicast DNS" / Bonjour reflector for years; you enable it per-VLAN. Mine had to span VLAN 60 (Mac Studio) and VLAN 80 (Mac Mini, on the Coder workspaces VLAN).

Two clicks in UniFi UI → both Macs see each other → EXO topology shows both nodes connected → MLX Ring inference works across the pair.

4.4 Splunk Observability Cloud silently rejects new metric names on trial accounts

The OTel collector reports otelcol_exporter_sent_metric_points{exporter="signalfx"} 979675, send_failed=0. The SignalFX ingest API returns 200 OK to direct mac.power.total POSTs. The metric never appears in /v2/metric or in SignalFlow queries.

You've hit the trial-account custom-MTS cap. The org-info endpoint returns metricTimeSeriesLimit: null when you're already past it. No error message anywhere in the pipeline.

Mac power data lands in SigNoz (self-hosted, no quota) and Splunk Enterprise (lab_otel_metrics index, on-prem, no quota) just fine. Only the cloud trial silently dropped it. Diagnostic check: confirm the metric exists in lab_otel_metrics:

SPL
| mcatalog values(metric_name) WHERE index=lab_otel_metrics metric_name=mac.*

You'll see all 13 mac.* series. The pipeline is healthy; the trial backend is just done with you.


5. Dashboards in three places

Same shape across backends so the screenshots are comparable:

  • SigNoz (exo, NVIDIA): row 1 cluster headline tiles, row 2 per-host load, rows 3-5 LLM perf, rows 6-7 token rate + cost, rows 8-9 power (the new bit).
  • Splunk O11y (exo, NVIDIA): identical layout, SignalFlow programs instead of SigNoz QueryBuilder. Limitation: SignalFX flattens OTel histograms to count/min/max only, so no p50/p95/p99 latency panels, just min/max/rate.
  • Splunk Enterprise dashboards: SPL with tstats over lab_otel (logs) and mstats over lab_otel_metrics (metrics). Power data queryable today with | mstats avg("mac.power.total") WHERE index=lab_otel_metrics BY host.name.
SigNoz NVIDIA cluster dashboard, per-host load, llama-server tokens/s, RPC worker fan-out across rpc-w-01/02/03
SigNoz NVIDIA cluster dashboard, per-host load, llama-server tokens/s, RPC worker fan-out across rpc-w-01/02/03
1/3

Two SigNoz-specific patterns that paid off:

  1. The Macs reporting count tile uses a ClickHouse SQL panel rather than a builder query, because SigNoz's value-panel reduceTo: count shows the first series's label (e.g. mac-studio) instead of the series count. Direct count(DISTINCT host.name) SQL fixed it.

  2. For the LLM perf panels, use gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, not prompt_tokens / completion_tokens. Both names exist in SigNoz schema discovery; only the new semconv names are actually populated by LiteLLM. Hours of empty panels until I queried both.


5b. The operator UI: cluster-dashboard

There's a piece of glue I haven't talked about in this post but it's the one I look at most often: cluster-dashboard, a FastAPI + React app on :8801 that gives me an octagon-topology view of every node, a live nvidia-smi/macmon panel, model load/unload buttons that talk to llama-swap, and an in-browser benchmark that streams tokens directly from llama.cpp's OpenAI endpoint while recording p50/p95/TTFT/tokens-per-second into a SQLite history table. It's how I made the side-by-side Mac-vs-NVIDIA comparison concrete instead of vibes.

The Models library page is the unified view across HuggingFace cache and Ollama-format weights, with quick-load buttons and live VRAM accounting. The benchmark compare view puts two runs side-by-side so the A/B is concrete instead of a vibe-check:

cluster-dashboard Models library page, unified view of HuggingFace cache and Ollama-format weights with sizes, lifecycle state, and quick-load buttons
cluster-dashboard Models library page, unified view of HuggingFace cache and Ollama-format weights with sizes, lifecycle state, and quick-load buttons
1/2

The benchmark flow: pick a model, drop in a prompt, click Run. Tokens stream live; metrics fill in as the response completes. Every run is persisted so you can compare apples-to-apples across models, prompts, max_tokens settings, and RPC vs LOCAL backends.

Benchmark step 1, model dropdown auto-synced to whatever llama-swap has loaded, prompt textarea, max_tokens slider
Benchmark step 1, model dropdown auto-synced to whatever llama-swap has loaded, prompt textarea, max_tokens slider
1/5

I also put a Definitions tab in the dashboard with the quantization / precision / context-length / MLX-vs-GGUF glossary I keep wishing existed when I'm onboarding the next person to the lab, sources cited, no AI-generated waffle:

Definitions, quantization formats (Q4_K_M, Q5_K_M, Q8_0, AWQ, GPTQ) with citation links and trade-off table
Definitions, quantization formats (Q4_K_M, Q5_K_M, Q8_0, AWQ, GPTQ) with citation links and trade-off table
1/4

And the same Mac cluster, viewed through EXO.app's native UI on Mac Studio, same model, different operator perspective:

EXO.app on Mac Studio, model loaded into MLX, libp2p peer to Mac Mini visible, ring inference idle
EXO.app on Mac Studio, model loaded into MLX, libp2p peer to Mac Mini visible, ring inference idle
1/2

Sample output from a qwen-coder run against the local NVIDIA cluster, same OpenAI API contract any client expects, no remote calls, ~13 tok/s on the M6000 with the 7B model:

qwen-coder local test output, full code completion from llama-server on M6000, no cloud round-trip
qwen-coder local test output, full code completion from llama-server on M6000, no cloud round-trip
1/2

5c. The hardware story behind this lab (and what's next on the NVIDIA side)

This is the part for the hardware nerds. The M6000 is doing a job that the internet wrote off as "e-waste", and the lab is wired today to slot in a V100 SXM2 32 GB drop-in as soon as one lands. The plan:

  1. SXM2 V100 32 GB arrives → install in a desktop-PCIe carrier (the $200 V100 SMX hack on Tom's Hardware showed it's a real path, not a Twitter party trick).
  2. M6000 stays in the rig, second card for embeddings, vision pre-processing, and small models the router shouldn't have to swap. Nothing gets retired; everything earns its rack space.
  3. Add the V100 as a fourth member of the RPC ring, currently 1 M6000 prefill + 3 RPC workers; V100 becomes the workhorse for the 70B-class models the M6000 has to RPC-shard.
  4. EXO ring + NVIDIA RPC ring side-by-side, same prompt, same client, same dashboard, four backends now (Mac Studio, Mac Mini, M6000, V100). The watts/token comparison gets sharper because Volta in INT8 is a different efficiency curve than Maxwell or Apple Silicon.

The honest constraint on the SXM2 plan is software, not silicon. The serving ecosystem is moving on from Volta (compute capability sm_70):

  • vLLM still technically supports sm_70, but in practice, v0.20+ has broken kernel paths, BF16 isn't supported on Volta at all (BF16 requires compute capability 8.0+), and newer models hit no kernel image is available errors on V100 (vLLM issue #25456). Community forks like 1Cat-vLLM backport CUDA 12.8 + AWQ kernels to keep V100 viable, but you're maintaining your own fork the moment you commit to that path.
  • TensorRT-LLM, TGI, and Triton have all dropped sm_70 from their current container releases.
  • llama.cpp, the actual home of this lab's NVIDIA side, treats Volta as a first-class target. CUDA 12.x compiles cleanly, Q4_K_M/Q5_K_M paths work, FlashAttention-2 isn't supported on Volta but the regular attention path is fine for inference. This is why the cluster doesn't depend on vLLM at all, and why a V100 drop-in won't break it.

So: when the SXM2 arrives, the work is rebuild the llama.cpp container with sm_70 enabled, plug in the carrier, register it as a new RPC node, update the cluster-dashboard topology. That's an afternoon. The vLLM rebuild, if I ever need it, is a fork-and-pin-and-pray exercise; the lab's design deliberately avoids needing that.


6. What this looks like as a hiring artifact

I'm writing this with one eye on the Splunk Platform AI Architect / NVIDIA SE roles I'm targeting. So the explicit "why this matters" framing for a hiring manager:

  • It's not a script, it's a platform. Two backends, three observability stacks, one router, full GitOps, idempotent deploy. ~120 lines of Python for the power exporter, ~80 for the EXO LaunchAgent wrappers, the rest is OTel config + dashboard JSON, all managed in my private GitOps repo. If you want to see any specific piece on a public GitHub mirror, send me a note via the contact form and I'll publish the relevant bits.
  • The dashboards exist on a real backend you can audit. SigNoz instance is publicly reachable on signoz.ozteklab.com; the Splunk side is mine. Screenshots are nice; live URLs you can poke at are nicer.
  • The gotchas are documented in the open so the next engineer doesn't spend 30 minutes on the macOS Local Network grant. That's the difference between a homelab and a platform.

If you're a Splunk SE looking at this thinking "the AI Ops angle is real and he can actually run the gear", that's the message. Get in touch.


7. What's still missing

  • The Splunk O11y trial dies in a few days. Power data won't be in the cloud screenshots until either the trial extends or I move to paid. Not worth chasing further on a clock.
  • NVIDIA-side power telemetry. nvidia-smi gives me total board watts per second; need a similar Python adapter to emit gpu.power.{board,sm} as gauges. Should take an evening.
  • Tokens-per-watt as a dashboard metric. rate(gen_ai.usage.output_tokens) / mac.power.gpu gives a tokens/W series per model. The maths is one formula away; just need to ship the panel.
  • Cost overlay. When the metric lands, divide power × your local electricity rate to get $/1k-tokens on-prem. That's the real comparison vs OpenAI API pricing, and the bit that makes the on-prem ROI story land.

The full architecture, scripts, OpenTelemetry collector config, and dashboard JSON live in my private homelab GitOps repo. If anything in this post is useful for your setup and you'd like the specific piece published on GitHub, the macmon adapter, the SigNoz dashboard JSON, the otelcol config, the LaunchAgent plists, please drop me a note via the contact form and I'll publish the relevant bits. If the gotcha section saved you an hour, that's why I wrote it.

Powered by Ragtronic

Assistant

Ozteklab Logo

Hi! How can I help you with Ozteklab documentation today?

Ask me anything about the documentation, code examples, or best practices.