Deep dive

Everything beyond the Quick start for the msb-metrics sidecar: the complete flag set, what it emits, the attributes attached to each datapoint, and the operational notes you’ll want once it’s running in earnest.

Deployment constraints

msb-metrics reads the shm registry directly, so the following constraints apply.

Same Unix user as `msb`

The shm object is mode 0600 (owner read/write only). Running msb-metrics as a different user produces EACCES on attach.

Same `$MSB_HOME`

The shm name is derived from stable_hash($MSB_HOME), so both processes must agree on it. Pass --msb-home explicitly if your environment doesn’t set $MSB_HOME; the default is ~/.microsandbox.
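The lookup order above can be sketched as follows. This is only an illustration, not the collector's code: `resolve_msb_home` is a hypothetical helper, the precedence (flag, then environment, then default) is inferred from the text, and the hashing step is left out because the source names it only as stable_hash.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_msb_home(flag_value: Optional[str], env: Mapping[str, str]) -> Path:
    """Hypothetical helper: pick the directory the shm name is derived from.

    Assumed precedence: --msb-home, then $MSB_HOME, then ~/.microsandbox.
    """
    if flag_value:
        return Path(flag_value)
    if env.get("MSB_HOME"):
        return Path(env["MSB_HOME"])
    return Path.home() / ".microsandbox"


# Both processes must resolve to the same path, or they attach to
# different registries. The /srv/msb path is made up for the example.
runtime_home = resolve_msb_home(None, {"MSB_HOME": "/srv/msb"})
sidecar_home = resolve_msb_home("/srv/msb", {})
print(runtime_home == sidecar_home)
```

If the two values differ, the sidecar attaches to a registry the runtime never writes, which shows up as empty metrics rather than an error.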

Per-host

The registry is per-host. One msb-metrics process per host covers every running sandbox there.

Metrics emitted

All metrics are emitted under the microsandbox.* namespace so they don’t collide with OTel semantic-convention system.* host metrics in the same backend tenant. The table below shows the suffix only; the fully-qualified name is microsandbox.<suffix>.

Suffix	Type	Unit	Notes
`cpu.utilization`	gauge	`1` (ratio)	Process CPU usage as vCPU-seconds per wall-second. A 2-vCPU sandbox at full load reports `2.0`. Divide by allocated vCPUs for a 0..1 fraction.
`memory.usage`	gauge	`By`	Resident memory in bytes.
`memory.limit`	gauge	`By`	Configured guest memory limit.
`disk.bytes_read`	gauge	`By`	Cumulative bytes read by the sandbox process.
`disk.bytes_written`	gauge	`By`	Cumulative bytes written.
`network.bytes_received`	gauge	`By`	Cumulative bytes from runtime to guest.
`network.bytes_sent`	gauge	`By`	Cumulative bytes from guest to runtime.
`upper.used`	gauge	`By`	Guest-visible used bytes on the OCI upper filesystem. Emitted only when the protected bundled-kernel reporter is available and fresh.
`upper.free`	gauge	`By`	Guest-visible bytes available to ordinary allocation on the OCI upper filesystem. Emitted only when the protected bundled-kernel reporter is available and fresh.
`upper.host_allocated`	gauge	`By`	Host-allocated bytes for the writable upper image. Emitted only when the host can observe an OCI upper image.
`uptime`	gauge	`s`	Sandbox uptime at sample time.
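As a small worked example of the cpu.utilization note in the table, the conversion to a 0..1 fraction might look like this. The function name is hypothetical, and the allocated vCPU count is assumed to come from your own sandbox configuration, since it is not one of the emitted metrics.

```python
def cpu_fraction(cpu_utilization: float, allocated_vcpus: int) -> float:
    """Convert vCPU-seconds per wall-second into a fraction of the allocation."""
    if allocated_vcpus <= 0:
        raise ValueError("allocated_vcpus must be positive")
    return cpu_utilization / allocated_vcpus


print(cpu_fraction(2.0, 2))  # 1.0: a 2-vCPU sandbox at full load
print(cpu_fraction(0.5, 2))  # 0.25
```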

Cumulative byte fields are emitted as gauges carrying the absolute cumulative value, because each shm snapshot already holds an absolute value and counter add() semantics would require the collector to track per-sandbox deltas across runs. Use rate() (PromQL, OTel-flavored Prom) to turn them into throughput.

The collector also emits its own operational series; see Collector self-observability below.

Collector self-observability

msb-metrics otel ships its own operational metrics through the same OTLP pipeline as the per-sandbox series, so you can confirm that exports are actually flowing with the same Prometheus / Grafana / Datadog queries the rest of your telemetry runs through:

Suffix	Type	Notes
`collector.exports.success`	counter	Cumulative successful OTLP exports since process start.
`collector.exports.failure`	counter	Cumulative failed OTLP exports (timeouts, transport errors, non-2xx).
`collector.collections.dropped`	counter	Collections evicted from the per-exporter buffer because the cap was hit (drop-oldest, see `--max-buffered`).
`collector.last_success_timestamp`	gauge	Unix epoch seconds at the last successful export. `time() - microsandbox_collector_last_success_timestamp_seconds` is a sensible staleness alert source.

Scope: these series share the same OTel scope as the sandbox metrics, so they show up under otel_scope_name="microsandbox-metrics-collector" with otel_scope_version=<msb version>. A few queries you’ll want to wire up:

Are exports flowing?

Non-zero rate means yes.

rate(microsandbox_collector_exports_success_total[1m])

Failure ratio over the last 5 minutes

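increase() counts exports inside the window, and clamp_min floors the denominator at one export so an idle window doesn’t divide by zero. (Per-second rate() values sit well below 1 at the default 10s flush interval, so a floor of 1 on rates would replace the real denominator.)

increase(microsandbox_collector_exports_failure_total[5m])
  /
clamp_min(
  increase(microsandbox_collector_exports_success_total[5m]) +
  increase(microsandbox_collector_exports_failure_total[5m]),
  1
)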

Staleness alert: no successful export in 5 minutes

Wire this to your alerting backend; the > 300 threshold is in seconds.

time() - microsandbox_collector_last_success_timestamp_seconds > 300
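For a backend without PromQL, the same two checks can be computed from raw values. The sketch below assumes you can read the increase of each export counter over a window and the last-success gauge; the function names are hypothetical.

```python
import time
from typing import Optional


def failure_ratio(success_increase: float, failure_increase: float) -> float:
    """Failed exports as a share of all exports in the window.

    The denominator is floored at one export so an idle window yields 0.0
    instead of dividing by zero.
    """
    total = max(success_increase + failure_increase, 1.0)
    return failure_increase / total


def is_stale(last_success_epoch_s: float, threshold_s: float = 300.0,
             now: Optional[float] = None) -> bool:
    """True when no export has succeeded within threshold_s seconds."""
    current = time.time() if now is None else now
    return current - last_success_epoch_s > threshold_s


print(failure_ratio(27, 3))        # 3 failures out of 30 exports
print(failure_ratio(0, 0))         # idle window
print(is_stale(1000.0, now=1301))  # 301 s since the last success
```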

Attributes

Every datapoint carries a configurable set of attributes. Resource attributes describe the source. Defaults are set automatically; --resource KEY=VALUE overrides or adds.

Key	Default
`service.name`	`microsandbox`
`service.instance.id`	hostname, best-effort from `HOSTNAME` / `COMPUTERNAME`

Identity attributes describe which sandbox a datapoint belongs to. run_id and pid are opt-in because they create a fresh time series per sandbox restart, which inflates active-series counts on cardinality-billed backends.

Attribute	Default	Notes
`sandbox.name`	on	Low cardinality.
`sandbox.id`	on	Catalog id; low cardinality.
`sandbox.run_id`	off	Opt-in via `--emit-run-id`. Fresh series per restart.
`sandbox.pid`	off	Opt-in via `--emit-pid`. Fresh series per restart.

Label attributes are the user-defined labels set at sandbox creation (msb create --label user.id=alice). They are read from the catalog on first sight of each sandbox, cached, and attached to every datapoint as a plain attribute with the same key and value, so backends can group and filter by them. On by default; disable with --no-labels, or drop individual keys with --exclude-label-key <key> (repeatable).

Every label on a sandbox is emitted by default, and keys are not pre-declared, so a high-cardinality key like user.id multiplies active series far more than the opt-in run_id / pid do. Hosted backends such as Grafana Cloud bill on active series, and a self-hosted Prometheus pays for them in memory. Keep label keys low-cardinality, pass --no-labels where attribution is not needed, or use --exclude-label-key to drop specific noisy keys (e.g. an image’s commit-SHA label) while keeping the rest. Excluded keys stay in the catalog and remain visible to msb inspect; only the metric attribute is withheld.
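The effect of the two label flags can be modeled in a few lines. This is a sketch of the behavior described above, not the collector's implementation; `labels_to_attributes` and the example label keys other than user.id are made up for illustration.

```python
from typing import Dict, Iterable


def labels_to_attributes(labels: Dict[str, str], no_labels: bool = False,
                         exclude_keys: Iterable[str] = ()) -> Dict[str, str]:
    """Hypothetical model of --no-labels and --exclude-label-key.

    Returns the label attributes that would be attached to a datapoint.
    The input dict (the catalog's view) is never modified.
    """
    if no_labels:
        # --no-labels wins; --exclude-label-key is ignored in this mode.
        return {}
    excluded = set(exclude_keys)
    return {k: v for k, v in labels.items() if k not in excluded}


catalog_labels = {"team": "payments", "user.id": "alice", "image.sha": "3f9c2e1"}
print(labels_to_attributes(catalog_labels, exclude_keys=["image.sha"]))
print(labels_to_attributes(catalog_labels, no_labels=True))
```

Because the input is left untouched, the sketch also mirrors the point that excluded keys remain in the catalog.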

Label enrichment is best-effort and never blocks metrics. If the catalog DB is missing (for example msb-metrics started before msb initialized $MSB_HOME) or a label query fails, that tick ships without labels and the collector retries on the next tick, so enrichment switches on automatically once the catalog is available. No restart required.
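One way to picture the best-effort behavior is a cache that swallows lookup failures and simply tries again on the next tick. The class below is a hypothetical sketch under that assumption; the real collector's catalog access is not shown in this page.

```python
from typing import Callable, Dict


class LabelCache:
    """Hypothetical best-effort label cache keyed by sandbox id."""

    def __init__(self, fetch: Callable[[int], Dict[str, str]]):
        self._fetch = fetch  # e.g. a catalog query; may raise
        self._cache: Dict[int, Dict[str, str]] = {}

    def labels_for(self, sandbox_id: int) -> Dict[str, str]:
        if sandbox_id in self._cache:
            return self._cache[sandbox_id]
        try:
            labels = self._fetch(sandbox_id)
        except Exception:
            # Catalog missing or query failed: ship this tick without
            # labels and leave the cache empty so the next tick retries.
            return {}
        self._cache[sandbox_id] = labels
        return labels


attempts = {"n": 0}


def flaky_catalog(sandbox_id: int) -> Dict[str, str]:
    """Stand-in catalog that fails once, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise FileNotFoundError("catalog DB not initialized yet")
    return {"user.id": "alice"}


cache = LabelCache(flaky_catalog)
first_tick = cache.labels_for(33)   # enrichment skipped
second_tick = cache.labels_for(33)  # labels appear without a restart
print(first_tick, second_tick)
```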

All flags

msb-metrics stdout

For local inspection of what msb-metrics is reading from shm without standing up an OTLP receiver. One human-readable line per snapshot.

msb-metrics stdout [--collect-interval=<dur>]
                   [--flush-interval=<dur>]
                   [--max-buffered=<n>]
                   [--export-timeout=<dur>]
                   [--msb-home=<path>]
                   [--no-labels]
                   [--exclude-label-key=<key>]...

The output format is not a stable contract; don’t pipe it into production parsers. Sample line:

2026-05-30T02:44:31Z sandbox=devbox id=33 cpu=0.000107 \
    mem=13.6 MiB / 512.0 MiB disk_r=89.3 MiB disk_w=644.7 MiB \
    net_rx=48.0 MiB net_tx=268.5 KiB uptime=2475m15s

msb-metrics otel

msb-metrics otel --endpoint=<URL>
                 [--protocol=grpc|http]
                 [--compression=none|gzip]
                 [--ca-cert=<path>]
                 [--header=KEY=VALUE]...
                 [--resource=KEY=VALUE]...
                 [--emit-run-id] [--emit-pid]
                 [--collect-interval=<dur>]
                 [--flush-interval=<dur>]
                 [--max-buffered=<n>]
                 [--export-timeout=<dur>]
                 [--msb-home=<path>]
                 [--no-labels]
                 [--exclude-label-key=<key>]...

Flag	Default	Notes
`--endpoint`	(required)	OTLP endpoint URL. With `--protocol=http`, pass the complete metrics signal URL (usually ending in `/v1/metrics`).
`--protocol`	`grpc`	`grpc` (port `4317`) or `http` (Protobuf body, port `4318`). HTTP endpoints are used exactly as provided.
`--compression`	`none`	`gzip` or `none`. gRPC-only in the current build; rejected at startup with `--protocol=http`. Meaningful bandwidth saving for direct provider gateways over public internet.
`--ca-cert`	none	Path to a PEM-encoded CA certificate to trust when negotiating TLS. Added on top of webpki roots, so a corporate gateway signed by a private CA works without disabling system trust. gRPC only; rejected at startup with `--protocol=http`.
`--header`	none	`KEY=VALUE`, repeatable. For auth (`Authorization`, `api-key`, etc.). Applied via `OTEL_EXPORTER_OTLP_HEADERS`.
`--resource`	none	`KEY=VALUE`, repeatable. Overrides or adds OTel resource attributes.
`--emit-run-id`	off	Add `sandbox.run_id` to every datapoint. Opt-in: high cardinality.
`--emit-pid`	off	Add `sandbox.pid` to every datapoint. Opt-in: high cardinality.
`--no-labels`	off	Stop attaching per-sandbox labels (skips the catalog lookup). Use to cap series cardinality from high-cardinality label keys.
`--exclude-label-key`	none	Label key to drop from emitted metrics, repeatable. The key stays in the catalog (visible to `msb inspect`) and is only withheld from metric attributes. Ignored when `--no-labels` is set.
`--collect-interval`	`1s`	How often shm is read. `humantime` durations (`1s`, `500ms`, `2m`).
`--flush-interval`	`10s`	Per-exporter scheduled flush cadence.
`--max-buffered`	`60`	Per-exporter buffer cap. Oldest collection drops on overflow; drop count surfaces on the next batch.
`--export-timeout`	`30s`	Per-call timeout for a single OTLP export.
`--msb-home`	`$MSB_HOME`, else `~/.microsandbox`	Used to derive the shm registry name.
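The protocol-specific rules in the table can be checked before launching the sidecar, for example from a deployment script. The helper below is hypothetical and encodes one reading of the table: it assumes only `--compression=gzip` (not the default `none`) is rejected with `--protocol=http`.

```python
from typing import Optional


def check_otel_flags(protocol: str = "grpc", compression: str = "none",
                     ca_cert: Optional[str] = None) -> None:
    """Hypothetical pre-flight check mirroring the startup rules above."""
    if protocol not in ("grpc", "http"):
        raise ValueError(f"unknown protocol: {protocol}")
    if compression not in ("none", "gzip"):
        raise ValueError(f"unknown compression: {compression}")
    if protocol == "http" and compression != "none":
        raise ValueError("--compression=gzip is gRPC-only in the current build")
    if protocol == "http" and ca_cert is not None:
        raise ValueError("--ca-cert is gRPC-only")


# The CA path is a made-up example value.
check_otel_flags("grpc", "gzip", "/etc/ssl/private-ca.pem")  # accepted
try:
    check_otel_flags("http", "gzip")
except ValueError as err:
    print(f"rejected: {err}")
```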

Global flags

Flag	Default	Notes
`--log-level`	`info`	`error`, `warn`, `info`, `debug`, `trace`. Overridden by `RUST_LOG` if set.
`--log-format`	`text`	`text` for the human-readable tracing formatter, `json` for newline-delimited JSON, one object per line. Use `json` when shipping the collector’s own logs into the same aggregator as your application logs.

Or just run msb-metrics otel --help for the full prose.

Tuning at scale

At ~1000 sandboxes per host the per-exporter buffer dominates heap usage. The shm registry stays a fixed ~512 KiB regardless of count, and the hot path is pure shm (no sqlite read).

`--max-buffered`	Worst-case heap, per exporter
`60` (default)	~21 MB
`20`	~7 MB

Worst-case heap is --max-buffered × active sandboxes × ~350 B (the table assumes ~1000 active sandboxes). It is reached only when the backend is slow enough to fill the buffer.
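The table values follow directly from that formula. A short calculation, using the approximate 350 bytes per sandbox per collection quoted above (the function name is just for illustration):

```python
BYTES_PER_SANDBOX_SAMPLE = 350  # approximate figure quoted above


def worst_case_heap_bytes(max_buffered: int, active_sandboxes: int) -> int:
    """Upper bound on buffered-collection heap for one exporter."""
    return max_buffered * active_sandboxes * BYTES_PER_SANDBOX_SAMPLE


for cap in (60, 20):
    mb = worst_case_heap_bytes(cap, 1000) / 1_000_000
    print(f"--max-buffered={cap}: ~{mb:.0f} MB")
```

Plugging in your own sandbox count gives a quick way to choose a --max-buffered value that fits the memory you want to reserve for the sidecar.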

Shutdown behavior

SIGINT or SIGTERM triggers a clean drain:

1. Stop the collect ticker.
2. Push any buffered collections through one final export.
3. Call each exporter’s shutdown() (OTel: flushes and closes the OTLP transport).
4. Exit.

If an exporter’s final export hangs, it’s bounded by --export-timeout.
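The drain sequence can be sketched as a plain function. Everything here is hypothetical scaffolding: the exporter interface, the single shared buffer (the real buffers are per exporter), and the recording stand-in exist only to show the order of the four steps.

```python
import threading
from typing import Any, List


def drain(stop_collecting: threading.Event, exporters: List[Any],
          buffered: List[Any], export_timeout_s: float = 30.0) -> None:
    """Hypothetical drain routine following the four steps above."""
    stop_collecting.set()  # 1. stop the collect ticker
    for exporter in exporters:
        if buffered:
            # 2. one final export, bounded by the export timeout
            exporter.export(list(buffered), timeout_s=export_timeout_s)
        exporter.shutdown()  # 3. flush and close the transport
    buffered.clear()
    # 4. the caller exits once drain() returns


class RecordingExporter:
    """Stand-in exporter that records the calls it receives."""

    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def export(self, batch: list, timeout_s: float) -> None:
        self.calls.append(("export", len(batch), timeout_s))

    def shutdown(self) -> None:
        self.calls.append(("shutdown",))


stop = threading.Event()
exporter = RecordingExporter()
pending = ["collection-1", "collection-2"]
drain(stop, [exporter], pending)
print(exporter.calls)  # [('export', 2, 30.0), ('shutdown',)]
```

In a real process this function would be invoked from a SIGINT/SIGTERM handler rather than called directly.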

Backend unreachable

Failed exports are retried on the next flush; the failed batch is restored to the front of the buffer. If failures keep arriving, oldest collections drop first and the next successful export’s droppedCollectionCount reports how many were lost (and increments the microsandbox.collector.collections.dropped counter). The collector itself does not crash.

The worker uses capped exponential backoff between scheduled retries: flush_interval, then 2× flush_interval, 4×, up to a 32× cap. At the default 10s flush interval that’s a worst case of ~5 minutes between retries during a sustained outage, instead of hammering the backend every 10s. Explicit RunningCollector::flush() calls bypass the backoff gate, so a caller that knows the upstream has recovered can force-retry immediately.

On the first successful export the worker logs metrics exporter recovered at INFO and the multiplier resets to 1×.
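The backoff schedule and the drop-oldest buffer can be modeled as below. This is a sketch of the described behavior with hypothetical names, not the worker's actual code; it assumes the first failure waits 1× the flush interval.

```python
from collections import deque

MAX_MULTIPLIER = 32


def retry_delay_s(consecutive_failures: int, flush_interval_s: float = 10.0) -> float:
    """Delay before the next scheduled retry after N consecutive failures.

    One failure waits 1x the flush interval, then 2x, 4x, ... capped at 32x.
    """
    multiplier = min(2 ** max(consecutive_failures - 1, 0), MAX_MULTIPLIER)
    return flush_interval_s * multiplier


class DropOldestBuffer:
    """Bounded buffer that evicts the oldest collection and counts drops."""

    def __init__(self, cap: int = 60) -> None:
        self._items: deque = deque()
        self._cap = cap
        self.dropped = 0

    def push(self, collection) -> None:
        if len(self._items) >= self._cap:
            self._items.popleft()  # drop-oldest on overflow
            self.dropped += 1
        self._items.append(collection)

    def __len__(self) -> int:
        return len(self._items)


print([retry_delay_s(n) for n in range(1, 8)])
# [10.0, 20.0, 40.0, 80.0, 160.0, 320.0, 320.0]

buf = DropOldestBuffer(cap=3)
for tick in range(5):
    buf.push(tick)
print(len(buf), buf.dropped)  # 3 2
```

The 320 s ceiling is the "~5 minutes" quoted above for the default 10s flush interval.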

Stopped sandboxes

A sandbox that stops releases its shm slot. msb-metrics reads the active snapshot only, so a stopped sandbox simply stops appearing in the export stream. Downstream the series goes stale (no fresh datapoints), which is the standard “host gone” signal in Prometheus and most TSDBs. There is no explicit “stopped” event.

Crashed sandboxes behave the same way. A runtime killed hard (SIGKILL, OOM, host reboot) never releases its slot, and registry readers used to keep reporting its frozen last sample as if the sandbox were running. Readers now verify that the slot’s owner PID is alive and retire dead entries on first read, so a crashed sandbox drops out of the export stream on the next collection tick, the same “series goes stale” signal as a clean stop.
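The page does not say how readers test that a slot's owner PID is alive. A common Unix technique is to send signal 0, which checks for the process without delivering anything; the sketch below uses that as an assumption, and the slot shape (sandbox name mapped to owner PID) is hypothetical.

```python
import os
from typing import Dict


def pid_alive(pid: int) -> bool:
    """Unix-only liveness probe using signal 0 (nothing is delivered)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def live_slots(slots: Dict[str, int]) -> Dict[str, int]:
    """Keep only entries whose owner PID is still running."""
    return {name: pid for name, pid in slots.items() if pid_alive(pid)}


print(live_slots({"devbox": os.getpid(), "invalid": -1}))
```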

Counter resets across sandbox restarts

Disk and network byte fields are cumulative from the sandbox process’s point of view. When a sandbox restarts, the runtime gets a fresh slot and the counters start from zero again. rate() is robust to this (it treats any decrease as a counter reset), but a query that differences raw samples, such as delta(), can return a negative value for the window spanning the restart. This is normal counter-reset behavior, not a bug.
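To make the reset handling concrete, here is a simplified reset-aware rate over raw samples. It illustrates the idea behind PromQL's rate() but is not a reimplementation of it (there is no extrapolation to the window edges), and the function name is hypothetical.

```python
from typing import List, Tuple


def reset_aware_rate(samples: List[Tuple[float, float]]) -> float:
    """Average per-second rate over (timestamp, cumulative_value) samples.

    Any decrease between consecutive samples is treated as a restart from
    zero, so the post-reset value counts as the whole increase for that step.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# The sandbox restarts between t=10 and t=20, so the counter falls to 300.
samples = [(0.0, 1000.0), (10.0, 2000.0), (20.0, 300.0), (30.0, 1300.0)]
naive = (samples[-1][1] - samples[0][1]) / 30.0
print(naive)                      # 10.0, badly understated
print(reset_aware_rate(samples))  # 2300 bytes over 30 s
```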

Troubleshooting

EACCES opening the shm region

You’re running msb-metrics as a different Unix user from the one that owns the registry. Switch users or use sudo -u <msb-user>.

Empty metrics, no sandboxes show up

Either no sandboxes are running, or msb-metrics is reading a different registry than msb writes. Check --msb-home matches the runtime’s $MSB_HOME. Use --log-level=debug to see the registry name and collect cadence.

OTLP backend rejects the request (HTTP 401/403/422)

Auth or schema mismatch. Verify the --header value (especially Authorization base64 encoding) and that the endpoint URL matches the protocol. gRPC endpoints typically use port 4317; HTTP/Protobuf endpoints should be the full metrics URL expected by that backend (often /v1/metrics, but deployments can route it differently).
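If the backend expects HTTP Basic credentials, a frequent mistake is encoding only the password or leaving out the colon. The sketch below builds a KEY=VALUE string in the form --header accepts; the helper name and credentials are placeholders, and whether your backend wants Basic, Bearer, or an api-key header is something its documentation decides.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build a KEY=VALUE string for --header carrying HTTP Basic credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Authorization=Basic {token}"


print(basic_auth_header("user", "pass"))  # Authorization=Basic dXNlcjpwYXNz
```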

Sandbox restarts produce fresh time series

Expected if --emit-run-id or --emit-pid is on. Drop them if you want a single series per sandbox name across restarts.
