Everything beyond the Quick start for the msb-metrics sidecar: the complete flag set, what it emits, the attributes attached to each datapoint, and the operational notes you’ll want once it’s running in earnest.

Deployment constraints

msb-metrics reads the shm registry directly, which imposes the following deployment constraints.

Same Unix user as msb

The shm object is mode 0600 (owner read/write only). Running msb-metrics as a different user produces EACCES on attach.

Same $MSB_HOME

The shm name is derived from stable_hash($MSB_HOME), so both processes must agree on it. Pass --msb-home explicitly if your environment doesn’t set $MSB_HOME; the default is ~/.microsandbox.

Per-host

The registry is per-host. One msb-metrics process per host covers every running sandbox there.

Metrics emitted

All metrics are emitted under the microsandbox.* namespace so they don’t collide with OTel semantic-convention system.* host metrics in the same backend tenant. The table below shows the suffix only; the fully-qualified name is microsandbox.<suffix>.
| Suffix | Type | Unit | Notes |
| --- | --- | --- | --- |
| cpu.utilization | gauge | 1 (ratio) | Process CPU usage as vCPU-seconds per wall-second. A 2-vCPU sandbox at full load reports 2.0. Divide by allocated vCPUs for a 0..1 fraction. |
| memory.usage | gauge | By | Resident memory in bytes. |
| memory.limit | gauge | By | Configured guest memory limit. |
| disk.bytes_read | gauge | By | Cumulative bytes read by the sandbox process. |
| disk.bytes_written | gauge | By | Cumulative bytes written. |
| network.bytes_received | gauge | By | Cumulative bytes from runtime to guest. |
| network.bytes_sent | gauge | By | Cumulative bytes from guest to runtime. |
| uptime | gauge | s | Sandbox uptime at sample time. |
Cumulative byte fields are emitted as gauges carrying the absolute cumulative value. Use rate() (PromQL, OTel-flavored Prom) for throughput. The reason: each shm snapshot already carries an absolute value, and counter add() semantics would require us to track per-sandbox deltas across runs. The collector also emits its own operational series; see Collector self-observability below.
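As a rough illustration of how the raw values can be interpreted on the consumer side, the sketch below converts cpu.utilization into a 0..1 fraction and derives throughput from two absolute cumulative samples. The function names and the sample tuple layout are hypothetical and not part of msb-metrics.

```python
def cpu_fraction(utilization: float, allocated_vcpus: int) -> float:
    """Convert vCPU-seconds per wall-second into a 0..1 fraction of the allocation."""
    if allocated_vcpus <= 0:
        raise ValueError("allocated_vcpus must be positive")
    return utilization / allocated_vcpus


def throughput(prev: tuple[float, int], curr: tuple[float, int]) -> float:
    """Bytes per second between two (unix_time, cumulative_bytes) samples.

    Assumes no sandbox restart happened between the two samples.
    """
    (t0, b0), (t1, b1) = prev, curr
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (b1 - b0) / (t1 - t0)
```

For example, a 2-vCPU sandbox reporting 2.0 maps to a fraction of 1.0, and two disk.bytes_read samples 10 seconds apart that differ by 5000 bytes correspond to 500 bytes per second.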

Collector self-observability

msb-metrics otel ships its own operational metrics through the same OTLP pipeline as the per-sandbox series, so you can confirm that data is actually flowing from the sidecar using the same Prometheus / Grafana / Datadog queries the rest of your telemetry runs through:
| Suffix | Type | Notes |
| --- | --- | --- |
| collector.exports.success | counter | Cumulative successful OTLP exports since process start. |
| collector.exports.failure | counter | Cumulative failed OTLP exports (timeouts, transport errors, non-2xx). |
| collector.collections.dropped | counter | Collections evicted from the per-exporter buffer because the cap was hit (drop-oldest, see --max-buffered). |
| collector.last_success_timestamp | gauge | Unix epoch seconds at the last successful export. time() - microsandbox_collector_last_success_timestamp_seconds is a sensible staleness alert source. |
Scope: these series share the same OTel scope as the sandbox metrics, so they show up under otel_scope_name="microsandbox-metrics-collector" with otel_scope_version=<msb version>. A few queries you’ll want to wire up:
Is the collector exporting? A non-zero rate means yes.
rate(microsandbox_collector_exports_success_total[1m])
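Failure ratio over the last 5 minutes. clamp_min keeps the denominator at 1 or above so an idle window doesn’t divide by zero. The query counts exports per window with increase() rather than using a per-second rate, because at a 10s flush interval the per-second rate is about 0.1 and a floor of 1 would swamp it.
increase(microsandbox_collector_exports_failure_total[5m])
  /
clamp_min(
  increase(microsandbox_collector_exports_success_total[5m]) +
  increase(microsandbox_collector_exports_failure_total[5m]),
  1
)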
Wire this to your alerting backend; the > 300 threshold is in seconds.
time() - microsandbox_collector_last_success_timestamp_seconds > 300
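The queries above use Prometheus-style names, while the tables list dotted OTel suffixes. The sketch below approximates the usual OTLP-to-Prometheus translation (dots become underscores, a unit suffix is appended, counters gain _total). It is an assumption about the translation layer, not msb-metrics behavior: the function name is hypothetical, the unit of collector.last_success_timestamp is inferred from the _seconds name in the staleness query, and your backend may name series differently.

```python
# Assumed unit-to-suffix mapping, following the common OTLP-to-Prometheus convention.
UNIT_SUFFIXES = {"By": "bytes", "s": "seconds"}


def prometheus_name(suffix: str, kind: str, unit: str = "") -> str:
    """Approximate the Prometheus series name for a microsandbox.<suffix> metric."""
    name = f"microsandbox.{suffix}".replace(".", "_")
    if unit == "1" and kind == "gauge":
        # Dimensionless gauges are conventionally suffixed with "ratio".
        unit_suffix = "ratio"
    else:
        unit_suffix = UNIT_SUFFIXES.get(unit, "")
    if unit_suffix and not name.endswith("_" + unit_suffix):
        name += "_" + unit_suffix
    if kind == "counter" and not name.endswith("_total"):
        name += "_total"
    return name
```

Under these assumptions, prometheus_name("memory.usage", "gauge", "By") yields microsandbox_memory_usage_bytes. Treat the result as a starting point and confirm the real series names in your backend.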

Attributes

Every datapoint carries a configurable set of attributes. Resource attributes describe the source. Defaults are set automatically; --resource KEY=VALUE overrides or adds.
| Key | Default |
| --- | --- |
| service.name | microsandbox |
| service.instance.id | hostname, best-effort from HOSTNAME / COMPUTERNAME |
Identity attributes describe which sandbox a datapoint belongs to. run_id and pid are opt-in because they create a fresh time series per sandbox restart, which inflates active-series counts on cardinality-billed backends.
| Attribute | Default | Notes |
| --- | --- | --- |
| sandbox.name | on | Low cardinality. |
| sandbox.id | on | Catalog id; low cardinality. |
| sandbox.run_id | off | Opt-in via --emit-run-id. Fresh series per restart. |
| sandbox.pid | off | Opt-in via --emit-pid. Fresh series per restart. |
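To make the cardinality trade-off concrete, the sketch below counts distinct label sets for one metric across several restarts of the same sandbox, with and without the opt-in attributes. The helper name and the run_id and pid values are made up for illustration.

```python
def series_count(observations, emit_run_id=False, emit_pid=False):
    """Count distinct identity label sets for a single metric."""
    keys = set()
    for obs in observations:
        key = [obs["sandbox.name"], obs["sandbox.id"]]
        if emit_run_id:
            key.append(obs["sandbox.run_id"])
        if emit_pid:
            key.append(obs["sandbox.pid"])
        keys.add(tuple(key))
    return len(keys)


# One sandbox observed across three restarts (hypothetical values).
restarts = [
    {"sandbox.name": "devbox", "sandbox.id": 33, "sandbox.run_id": "run-a", "sandbox.pid": 4101},
    {"sandbox.name": "devbox", "sandbox.id": 33, "sandbox.run_id": "run-b", "sandbox.pid": 4977},
    {"sandbox.name": "devbox", "sandbox.id": 33, "sandbox.run_id": "run-c", "sandbox.pid": 5230},
]
```

With the defaults the three restarts collapse into one series per metric; enabling either opt-in attribute turns them into three.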

All flags

msb-metrics stdout is for local inspection of what msb-metrics is reading from shm, without standing up an OTLP receiver. It prints one human-readable line per snapshot.
msb-metrics stdout [--collect-interval=<dur>]
                   [--flush-interval=<dur>]
                   [--max-buffered=<n>]
                   [--export-timeout=<dur>]
                   [--msb-home=<path>]
The output format is not a stable contract; don’t pipe it into production parsers. Sample line:
2026-05-30T02:44:31Z sandbox=devbox id=33 cpu=0.000107 \
    mem=13.6 MiB / 512.0 MiB disk_r=89.3 MiB disk_w=644.7 MiB \
    net_rx=48.0 MiB net_tx=268.5 KiB uptime=2475m15s
msb-metrics otel exports to an OTLP endpoint:
msb-metrics otel --endpoint=<URL>
                 [--protocol=grpc|http]
                 [--compression=none|gzip]
                 [--ca-cert=<path>]
                 [--header=KEY=VALUE]...
                 [--resource=KEY=VALUE]...
                 [--emit-run-id] [--emit-pid]
                 [--collect-interval=<dur>]
                 [--flush-interval=<dur>]
                 [--max-buffered=<n>]
                 [--export-timeout=<dur>]
                 [--msb-home=<path>]
| Flag | Default | Notes |
| --- | --- | --- |
| --endpoint | (required) | OTLP endpoint URL. With --protocol=http, pass the complete metrics signal URL (usually ending in /v1/metrics). |
| --protocol | grpc | grpc (port 4317) or http (Protobuf body, port 4318). HTTP endpoints are used exactly as provided. |
| --compression | none | gzip or none. gRPC-only in the current build; rejected at startup with --protocol=http. Meaningful bandwidth saving for direct provider gateways over public internet. |
| --ca-cert | none | Path to a PEM-encoded CA certificate to trust when negotiating TLS. Added on top of webpki roots, so a corporate gateway signed by a private CA works without disabling system trust. gRPC only; rejected at startup with --protocol=http. |
| --header | none | KEY=VALUE, repeatable. For auth (Authorization, api-key, etc.). Applied via OTEL_EXPORTER_OTLP_HEADERS. |
| --resource | none | KEY=VALUE, repeatable. Overrides or adds OTel resource attributes. |
| --emit-run-id | off | Add sandbox.run_id to every datapoint. Opt-in: high cardinality. |
| --emit-pid | off | Add sandbox.pid to every datapoint. Opt-in: high cardinality. |
| --collect-interval | 1s | How often shm is read. humantime durations (1s, 500ms, 2m). |
| --flush-interval | 10s | Per-exporter scheduled flush cadence. |
| --max-buffered | 60 | Per-exporter buffer cap. Oldest collection drops on overflow; drop count surfaces on the next batch. |
| --export-timeout | 30s | Per-call timeout for a single OTLP export. |
| --msb-home | $MSB_HOME, else ~/.microsandbox | Used to derive the shm registry name. |

| Flag | Default | Notes |
| --- | --- | --- |
| --log-level | info | error, warn, info, debug, trace. Overridden by RUST_LOG if set. |
| --log-format | text | text for the human-readable tracing formatter, json for newline-delimited JSON (one object per line). Use json when shipping the collector’s own logs into the same aggregator as your application logs. |
Or just run msb-metrics otel --help for the full prose.
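If you launch the sidecar from a supervisor script, a small helper can assemble the command line and catch the documented gRPC-only combinations before the process starts. This is a sketch only: the helper name, the endpoint, and the header and resource values are placeholders, and it covers just a subset of the flags listed above.

```python
def otel_argv(endpoint, protocol="grpc", compression=None, ca_cert=None,
              headers=None, resources=None, emit_run_id=False, emit_pid=False):
    """Build an argv list for `msb-metrics otel` from the documented flags."""
    if protocol not in ("grpc", "http"):
        raise ValueError("protocol must be 'grpc' or 'http'")
    # The docs state both options are rejected at startup with --protocol=http.
    if protocol == "http" and (compression == "gzip" or ca_cert):
        raise ValueError("--compression=gzip and --ca-cert are gRPC-only")
    argv = ["msb-metrics", "otel", f"--endpoint={endpoint}", f"--protocol={protocol}"]
    if compression:
        argv.append(f"--compression={compression}")
    if ca_cert:
        argv.append(f"--ca-cert={ca_cert}")
    for key, value in (headers or {}).items():
        argv.append(f"--header={key}={value}")  # repeatable flag
    for key, value in (resources or {}).items():
        argv.append(f"--resource={key}={value}")  # repeatable flag
    if emit_run_id:
        argv.append("--emit-run-id")
    if emit_pid:
        argv.append("--emit-pid")
    return argv


# Placeholder endpoint and values; pass the result to subprocess.run() to launch.
example = otel_argv(
    "http://localhost:4318/v1/metrics",
    protocol="http",
    headers={"api-key": "PLACEHOLDER"},
    resources={"deployment.environment": "staging"},
)
```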

Tuning at scale

At ~1000 sandboxes per host the per-exporter buffer dominates heap usage. The shm registry stays a fixed ~512 KiB regardless of count, and the hot path is pure shm (no sqlite read).
| --max-buffered | Worst-case heap, per exporter |
| --- | --- |
| 60 (default) | ~21 MB |
| 20 | ~7 MB |
Worst-case heap is --max-buffered × active sandboxes × ~350B, reached only when the backend is slow enough to fill the buffer.
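The formula is easy to evaluate for your own host sizes. The sketch below reproduces the table values for 1000 sandboxes; the function name is hypothetical, and the 350-byte figure is the approximate per-entry size quoted above.

```python
def worst_case_heap_bytes(max_buffered: int, active_sandboxes: int,
                          bytes_per_entry: int = 350) -> int:
    """Worst-case per-exporter heap: buffered collections x sandboxes x entry size."""
    return max_buffered * active_sandboxes * bytes_per_entry


default_heap = worst_case_heap_bytes(60, 1000)  # 21_000_000 bytes, about 21 MB
reduced_heap = worst_case_heap_bytes(20, 1000)  # 7_000_000 bytes, about 7 MB
```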

Shutdown behavior

SIGINT or SIGTERM triggers a clean drain:
  1. Stop the collect ticker.
  2. Push any buffered collections through one final export.
  3. Call each exporter’s shutdown() (OTel: flushes and closes the OTLP transport).
  4. Exit.
If an exporter’s final export hangs, it’s bounded by --export-timeout.

Backend unreachable

Failed exports are retried on the next flush; the failed batch is restored to the front of the buffer. If failures keep arriving, the oldest collections drop first, and the next successful export’s droppedCollectionCount reports how many were lost (the microsandbox.collector.collections.dropped counter is incremented as well). The collector itself does not crash.
The worker uses capped exponential backoff between scheduled retries: flush_interval, then 2× flush_interval, …, up to a 32× cap. At the default 10s flush interval that’s a worst case of 320s (~5 minutes) between retries during a sustained outage, instead of hammering the backend every 10s. Explicit RunningCollector::flush() calls bypass the backoff gate, so a caller that knows the upstream has recovered can force-retry immediately.
On the first successful export the worker logs metrics exporter recovered at INFO and the multiplier resets to 1×.
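The retry behavior described above can be modeled in a few lines. This sketch is not the collector's implementation: it assumes the multiplier doubles on each consecutive failure (only the 1×, 2×, and 32× steps are stated here), and the class and function names are hypothetical.

```python
from collections import deque


def backoff_delays(flush_interval_s: float, failures: int, cap_multiplier: int = 32):
    """Delay before each retry for a run of consecutive failures (assumed doubling)."""
    return [flush_interval_s * min(2 ** n, cap_multiplier) for n in range(failures)]


class CollectionBuffer:
    """Bounded buffer with drop-oldest overflow and restore-to-front on failure."""

    def __init__(self, cap: int):
        self.cap = cap
        self.items = deque()
        self.dropped = 0

    def push(self, collection):
        self.items.append(collection)
        self._trim()

    def take_batch(self):
        """Remove and return everything currently buffered."""
        batch = list(self.items)
        self.items.clear()
        return batch

    def restore(self, batch):
        """Put a failed batch back in front of anything collected since."""
        self.items.extendleft(reversed(batch))
        self._trim()

    def _trim(self):
        while len(self.items) > self.cap:
            self.items.popleft()  # the oldest collection is dropped first
            self.dropped += 1
```

With a 10s flush interval, backoff_delays(10, 7) gives 10, 20, 40, 80, 160, 320, 320 seconds, which matches the roughly five-minute ceiling during a sustained outage.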

Stopped sandboxes

A sandbox that stops releases its shm slot. msb-metrics reads the active snapshot only, so a stopped sandbox simply stops appearing in the export stream. Downstream the series goes stale (no fresh datapoints), which is the standard “host gone” signal in Prometheus and most TSDBs. There is no explicit “stopped” event.

Counter resets across sandbox restarts

Disk and network byte fields are cumulative from the sandbox process’s point of view. When a sandbox restarts, the runtime gets a fresh slot and the counters start from zero again. rate() is robust to this because it treats any decrease as a counter reset. Queries that subtract raw samples, such as delta() or a dashboard-side difference, can show a negative value in the brief window spanning the restart, before the next sample lands. This is normal counter-reset behavior, not a bug.
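If you post-process the cumulative values yourself rather than through rate(), a reset-aware sum avoids the negative artifact. The sketch below mirrors the usual counter-reset heuristic (any decrease is treated as a restart from zero). The function name and sample values are illustrative.

```python
def increase_with_resets(samples):
    """Total increase across cumulative samples, treating any drop as a reset to zero."""
    total = 0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The sandbox restarted: the counter began again from zero.
            total += curr
    return total


# Hypothetical disk.bytes_read samples with a restart after the third one.
samples = [100, 250, 400, 30, 90]
naive = samples[-1] - samples[0]             # -10, misleading across the restart
reset_aware = increase_with_resets(samples)  # 150 + 150 + 30 + 60 = 390
```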

Troubleshooting

EACCES on attach: You’re running msb-metrics as a different Unix user from the one that owns the registry. Switch users or use sudo -u <msb-user>.
No sandboxes appear in the output: Either no sandboxes are running, or msb-metrics is reading a different registry than msb writes. Check that --msb-home matches the runtime’s $MSB_HOME. Use --log-level=debug to see the registry name and collect cadence.
Exports are rejected by the backend: Auth or schema mismatch. Verify the --header value (especially Authorization base64 encoding) and that the endpoint URL matches the protocol. gRPC endpoints typically use port 4317; HTTP/Protobuf endpoints should be the full metrics URL expected by that backend (often /v1/metrics, but deployments can route it differently).
A new series appears after every sandbox restart: Expected if --emit-run-id or --emit-pid is on. Drop them if you want a single series per sandbox name across restarts.
