Everything beyond the Quick start for the
msb-metrics sidecar: the complete flag
set, what it emits, the attributes attached to each datapoint, and the
operational notes you’ll want once it’s running in earnest.
Deployment constraints
msb-metrics reads the shm registry directly. Two constraints follow.
Same Unix user as msb
The shm object is mode 0600 (owner read/write only). Running
msb-metrics as a different user produces EACCES on attach.
Same $MSB_HOME
The shm name is derived from stable_hash($MSB_HOME), so both
processes must agree on it. Pass --msb-home explicitly if your
environment doesn’t set $MSB_HOME; the default is ~/.microsandbox.
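As a rough illustration of how the two processes end up agreeing on a home directory, here is a minimal Python sketch of the lookup order implied above. The function name `resolve_msb_home` is hypothetical, and the precedence (flag first, then `$MSB_HOME`, then the default) is an assumption drawn from the flag description, not a statement about the actual implementation.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_msb_home(flag_value: Optional[str] = None,
                     env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the home directory that msb and msb-metrics must share.

    Assumed precedence: --msb-home, then $MSB_HOME, then ~/.microsandbox.
    """
    env = os.environ if env is None else env
    if flag_value:                          # explicit --msb-home
        return Path(flag_value).expanduser()
    if env.get("MSB_HOME"):                 # environment variable
        return Path(env["MSB_HOME"]).expanduser()
    return Path.home() / ".microsandbox"    # documented default
```

If the runtime and the sidecar resolve to different paths, they derive different shm names and the sidecar sees an empty registry.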
Per-host
The registry is per-host. One msb-metrics process per host covers
every running sandbox there.
Metrics emitted
All metrics are emitted under the microsandbox.* namespace so they
don’t collide with OTel semantic-convention system.* host metrics in
the same backend tenant. The table below shows the suffix only; the
fully-qualified name is microsandbox.<suffix>.
| Suffix | Type | Unit | Notes |
|---|---|---|---|
| cpu.utilization | gauge | 1 (ratio) | Process CPU usage as vCPU-seconds per wall-second. A 2-vCPU sandbox at full load reports 2.0. Divide by allocated vCPUs for a 0..1 fraction. |
| memory.usage | gauge | By | Resident memory in bytes. |
| memory.limit | gauge | By | Configured guest memory limit. |
| disk.bytes_read | gauge | By | Cumulative bytes read by the sandbox process. |
| disk.bytes_written | gauge | By | Cumulative bytes written. |
| network.bytes_received | gauge | By | Cumulative bytes from runtime to guest. |
| network.bytes_sent | gauge | By | Cumulative bytes from guest to runtime. |
| uptime | gauge | s | Sandbox uptime at sample time. |
Cumulative fields are emitted as gauges rather than counters; apply
rate() (PromQL, OTel-flavored Prom) for
throughput. The reason: each shm snapshot already carries an absolute
value, and counter add() semantics would require us to track
per-sandbox deltas across runs.
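To make the two derived quantities concrete, here is a small Python sketch: normalizing cpu.utilization to a 0..1 fraction, and turning two cumulative byte samples into a throughput. The function names are hypothetical and the sample values are made up for illustration.

```python
def cpu_fraction(cpu_utilization: float, allocated_vcpus: int) -> float:
    """Convert vCPU-seconds per wall-second into a share of the allocation."""
    if allocated_vcpus <= 0:
        raise ValueError("allocated_vcpus must be positive")
    return cpu_utilization / allocated_vcpus


def throughput(prev_value: float, prev_ts: float,
               curr_value: float, curr_ts: float) -> float:
    """Bytes per second between two cumulative samples (timestamps in seconds)."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr_value - prev_value) / elapsed


# A 2-vCPU sandbox reporting 2.0 is fully loaded.
print(cpu_fraction(2.0, 2))
# 5000 bytes written over 10 seconds.
print(throughput(1000, 0, 6000, 10))
```

This is what rate() does for you on the backend side; the sketch only shows the arithmetic.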
The collector also emits its own operational series; see
Collector self-observability below.
Collector self-observability
msb-metrics otel ships its own operational metrics through the same
OTLP pipeline as the per-sandbox series, so you can confirm the
sidecar is actually flowing using the same Prometheus / Grafana / Datadog
queries the rest of your telemetry runs through:
| Suffix | Type | Notes |
|---|---|---|
| collector.exports.success | counter | Cumulative successful OTLP exports since process start. |
| collector.exports.failure | counter | Cumulative failed OTLP exports (timeouts, transport errors, non-2xx). |
| collector.collections.dropped | counter | Collections evicted from the per-exporter buffer because the cap was hit (drop-oldest, see --max-buffered). |
| collector.last_success_timestamp | gauge | Unix epoch seconds at the last successful export. time() - microsandbox_collector_last_success_timestamp_seconds is a sensible staleness alert source. |
The collector’s own series carry
otel_scope_name="microsandbox-metrics-collector" with
otel_scope_version=<msb version>.
A few queries you’ll want to wire up:
Are exports flowing?
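A minimal Python sketch of the check, assuming you have two readings of collector.exports.success taken some seconds apart. The function name is hypothetical; on a Prometheus-style backend the equivalent is a rate() over the success counter.

```python
def exports_flowing(prev_success: int, curr_success: int, elapsed_s: float) -> bool:
    """True when the success counter advanced between the two readings."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    rate_per_second = (curr_success - prev_success) / elapsed_s
    return rate_per_second > 0
```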
Non-zero rate means yes.
Failure ratio over the last 5 minutes
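A sketch of the arithmetic in Python, assuming the inputs are the increases of collector.exports.success and collector.exports.failure over the 5-minute window. Using failures plus successes as the denominator is an assumption; `max(..., 1.0)` plays the role of clamp_min.

```python
def failure_ratio(success_increase: float, failure_increase: float) -> float:
    """Share of failed exports in the window, safe for idle windows."""
    # Equivalent of clamp_min(denominator, 1): never divide by zero.
    denominator = max(success_increase + failure_increase, 1.0)
    return failure_increase / denominator
```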
clamp_min keeps the denominator at 1 so an idle window doesn’t
divide by zero.
Staleness alert: no successful export in 5 minutes
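A Python sketch of the same comparison, built from the staleness expression given in the table above. The function name is hypothetical, and the strict greater-than is an assumption about how you would write the alert.

```python
import time
from typing import Optional

# PromQL form, using the series name from the table above:
#   time() - microsandbox_collector_last_success_timestamp_seconds > 300
STALENESS_THRESHOLD_S = 300


def is_stale(last_success_ts: float,
             now: Optional[float] = None,
             threshold_s: float = STALENESS_THRESHOLD_S) -> bool:
    """True when the last successful export is older than the threshold."""
    now = time.time() if now is None else now
    return (now - last_success_ts) > threshold_s
```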
Wire this to your alerting backend; the > 300 threshold is in
seconds.
Attributes
Every datapoint carries a configurable set of attributes. Resource attributes describe the source. Defaults are set automatically; --resource KEY=VALUE overrides or adds.
| Key | Default |
|---|---|
| service.name | microsandbox |
| service.instance.id | hostname, best-effort from HOSTNAME / COMPUTERNAME |
run_id and pid are opt-in because they create a fresh time series
per sandbox restart, which inflates active-series counts on
cardinality-billed backends.
| Attribute | Default | Notes |
|---|---|---|
| sandbox.name | on | Low cardinality. |
| sandbox.id | on | Catalog id; low cardinality. |
| sandbox.run_id | off | Opt-in via --emit-run-id. Fresh series per restart. |
| sandbox.pid | off | Opt-in via --emit-pid. Fresh series per restart. |
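The defaults in the table can be sketched in Python as follows. This is an illustration of the opt-in behaviour only: the function name, argument names, and value types are hypothetical.

```python
from typing import Dict, Optional


def datapoint_attributes(name: str,
                         sandbox_id: str,
                         run_id: Optional[str] = None,
                         pid: Optional[int] = None,
                         emit_run_id: bool = False,
                         emit_pid: bool = False) -> Dict[str, str]:
    """Build the attribute set attached to one datapoint."""
    attrs = {"sandbox.name": name, "sandbox.id": sandbox_id}  # always on
    if emit_run_id and run_id is not None:   # --emit-run-id
        attrs["sandbox.run_id"] = run_id
    if emit_pid and pid is not None:         # --emit-pid
        attrs["sandbox.pid"] = str(pid)
    return attrs
```

With both flags off, a restarted sandbox keeps the same attribute set and therefore the same series.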
All flags
msb-metrics stdout
For local inspection of what msb-metrics is reading from shm
without standing up an OTLP receiver. One human-readable line per
snapshot. The output format is not a stable contract; don’t pipe it into
production parsers. Sample line:
msb-metrics otel
| Flag | Default | Notes |
|---|---|---|
| --endpoint | (required) | OTLP endpoint URL. With --protocol=http, pass the complete metrics signal URL (usually ending in /v1/metrics). |
| --protocol | grpc | grpc (port 4317) or http (Protobuf body, port 4318). HTTP endpoints are used exactly as provided. |
| --compression | none | gzip or none. gRPC-only in the current build; rejected at startup with --protocol=http. Meaningful bandwidth saving for direct provider gateways over public internet. |
| --ca-cert | none | Path to a PEM-encoded CA certificate to trust when negotiating TLS. Added on top of webpki roots, so a corporate gateway signed by a private CA works without disabling system trust. gRPC only; rejected at startup with --protocol=http. |
| --header | none | KEY=VALUE, repeatable. For auth (Authorization, api-key, etc.). Applied via OTEL_EXPORTER_OTLP_HEADERS. |
| --resource | none | KEY=VALUE, repeatable. Overrides or adds OTel resource attributes. |
| --emit-run-id | off | Add sandbox.run_id to every datapoint. Opt-in: high cardinality. |
| --emit-pid | off | Add sandbox.pid to every datapoint. Opt-in: high cardinality. |
| --collect-interval | 1s | How often shm is read. humantime durations (1s, 500ms, 2m). |
| --flush-interval | 10s | Per-exporter scheduled flush cadence. |
| --max-buffered | 60 | Per-exporter buffer cap. Oldest collection drops on overflow; drop count surfaces on the next batch. |
| --export-timeout | 30s | Per-call timeout for a single OTLP export. |
| --msb-home | $MSB_HOME, else ~/.microsandbox | Used to derive the shm registry name. |
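If you launch the sidecar from a supervisor script, it can help to check the protocol constraints from the table before spawning it. The Python sketch below does that. The helper name, the example endpoints, and the exact argument spelling (`--flag=value` versus `--flag value`) are assumptions; only the flag names and the gRPC-only rules come from the table.

```python
from typing import Dict, List, Optional


def build_otel_args(endpoint: str,
                    protocol: str = "grpc",
                    compression: str = "none",
                    ca_cert: Optional[str] = None,
                    headers: Optional[Dict[str, str]] = None) -> List[str]:
    """Assemble an argv for `msb-metrics otel`, rejecting invalid combinations."""
    if protocol not in ("grpc", "http"):
        raise ValueError("protocol must be 'grpc' or 'http'")
    if protocol == "http" and compression != "none":
        raise ValueError("--compression is gRPC-only in the current build")
    if protocol == "http" and ca_cert is not None:
        raise ValueError("--ca-cert is gRPC-only")

    args = ["msb-metrics", "otel", "--endpoint", endpoint, f"--protocol={protocol}"]
    if compression != "none":
        args.append(f"--compression={compression}")
    if ca_cert is not None:
        args += ["--ca-cert", ca_cert]
    for key, value in (headers or {}).items():
        args += ["--header", f"{key}={value}"]   # repeatable KEY=VALUE
    return args


# Hypothetical local gRPC receiver with an API key header.
print(build_otel_args("http://localhost:4317", headers={"api-key": "secret"}))
```

The returned list can be handed to `subprocess.run` as-is, which avoids shell quoting problems with header values.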
Global flags
| Flag | Default | Notes |
|---|---|---|
| --log-level | info | error, warn, info, debug, trace. Overridden by RUST_LOG if set. |
| --log-format | text | text for the human-readable tracing formatter, json for newline-delimited JSON (one object per line). Use json when shipping the collector’s own logs into the same aggregator as your application logs. |
Run msb-metrics otel --help for the full prose.
Tuning at scale
At ~1000 sandboxes per host the per-exporter buffer dominates heap usage. The shm registry stays a fixed ~512 KiB regardless of count, and the hot path is pure shm (no sqlite read).
| --max-buffered | Worst-case heap, per exporter |
|---|---|
| 60 (default) | ~21 MB |
| 20 | ~7 MB |
The worst case is --max-buffered × active sandboxes × ~350B,
reached only when the backend is slow enough to fill the buffer.
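The table values follow directly from that product. A short Python sketch of the estimate, where the function name is hypothetical and ~350 bytes per sandbox per buffered collection is the approximate figure quoted above:

```python
BYTES_PER_SANDBOX_PER_COLLECTION = 350  # approximate figure from the text


def worst_case_heap_bytes(max_buffered: int, active_sandboxes: int) -> int:
    """Upper bound on buffer heap for one exporter with a full buffer."""
    return max_buffered * active_sandboxes * BYTES_PER_SANDBOX_PER_COLLECTION


for cap in (60, 20):
    mb = worst_case_heap_bytes(cap, 1000) / 1_000_000
    print(f"--max-buffered={cap}: ~{mb:.0f} MB")
```

Multiply by the number of exporters if you run more than one.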
Shutdown behavior
SIGINT or SIGTERM triggers a clean drain:
- Stop the collect ticker.
- Push any buffered collections through one final export.
- Call each exporter’s shutdown() (OTel: flushes and closes the OTLP transport).
- Exit.
The final export is bounded by --export-timeout.
Backend unreachable
Failed exports are retried on the next flush; the failed batch is restored to the front of the buffer. If failures keep arriving, oldest collections drop first and the next successful export’s droppedCollectionCount reports how many were lost (and increments the
microsandbox.collector.collections.dropped
counter). The collector itself does not crash.
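The restore-then-drop-oldest behaviour described above can be modelled with a bounded queue. This Python sketch is an illustration of the semantics only, not the collector’s code; the class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, List


class CollectionBuffer:
    """Bounded buffer: failed batches return to the front, oldest drops first."""

    def __init__(self, max_buffered: int = 60) -> None:
        self.max_buffered = max_buffered
        self.items: Deque[object] = deque()
        self.dropped = 0  # surfaced with the next successful export

    def _enforce_cap(self) -> None:
        while len(self.items) > self.max_buffered:
            self.items.popleft()      # oldest collection goes first
            self.dropped += 1

    def push(self, collection: object) -> None:
        self.items.append(collection)
        self._enforce_cap()

    def take_batch(self) -> List[object]:
        batch = list(self.items)
        self.items.clear()
        return batch

    def restore(self, failed_batch: List[object]) -> None:
        # Put the failed batch back ahead of anything collected meanwhile.
        self.items.extendleft(reversed(failed_batch))
        self._enforce_cap()
```

A failed export followed by new collections therefore loses the oldest entries of the restored batch, not the newest data.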
The worker uses capped exponential backoff between scheduled retries:
flush_interval, then 2× flush_interval, 4×, up to a 32× cap. At
the default 10s flush interval that’s a worst case of ~5 minutes
between retries during a sustained outage, instead of hammering the
backend every 10s. Explicit RunningCollector::flush() calls bypass
the backoff gate, so a caller that knows the upstream has recovered
can force-retry immediately. On the first successful export the worker
logs metrics exporter recovered at INFO and the multiplier resets
to 1×.
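The schedule above works out as follows. A minimal Python sketch, where the function name and the zero-based retry index are assumptions made for illustration:

```python
def retry_delay_s(flush_interval_s: float, retry_index: int,
                  cap_multiplier: int = 32) -> float:
    """Delay before scheduled retry number `retry_index` (0-based)."""
    if retry_index < 0:
        raise ValueError("retry_index must be non-negative")
    multiplier = min(2 ** retry_index, cap_multiplier)  # 1x, 2x, 4x ... 32x
    return flush_interval_s * multiplier


# Default 10s flush interval during a sustained outage.
print([retry_delay_s(10, i) for i in range(7)])
```

At the cap the delay is 320 seconds, which is the "~5 minutes" worst case quoted above.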
Stopped sandboxes
A sandbox that stops releases its shm slot. msb-metrics reads the
active snapshot only, so a stopped sandbox simply stops appearing in
the export stream. Downstream the series goes stale (no fresh
datapoints), which is the standard “host gone” signal in Prometheus and
most TSDBs. There is no explicit “stopped” event.
Counter resets across sandbox restarts
Disk and network byte fields are cumulative from the sandbox process’s point of view. When a sandbox restarts, the runtime gets a fresh slot and the counters start from zero again. rate() is robust to this (it
detects counter resets), but in the brief window spanning the restart
a query may return a small negative interval before the next sample
lands. This is normal counter-reset behavior, not a bug.
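If you compute totals yourself rather than relying on rate(), the usual approach is to treat any decrease as a reset and count the new value as the increase since the restart. A minimal Python sketch of that convention, with a hypothetical function name and made-up sample values:

```python
from typing import Sequence


def total_increase(samples: Sequence[float]) -> float:
    """Sum the growth of a cumulative series, tolerating resets to zero."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # Value went down: the sandbox restarted and began again from zero.
            total += curr
    return total


# 100 -> 150 (+50), restart, 20 (+20), 50 (+30)
print(total_increase([100, 150, 20, 50]))
```

Bytes transferred between the last sample before the restart and the restart itself are not observable and are not counted.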
Troubleshooting
EACCES opening the shm region
You’re running msb-metrics as a different Unix user from the one
that owns the registry. Switch users or use sudo -u <msb-user>.
Empty metrics, no sandboxes show up
Either no sandboxes are running, or msb-metrics is reading a
different registry than msb writes. Check that --msb-home matches
the runtime’s $MSB_HOME. Use --log-level=debug to see the
registry name and collect cadence.
OTLP backend rejects the request (HTTP 401/403/422)
Auth or schema mismatch. Verify the --header value (especially
Authorization base64 encoding) and that the endpoint URL matches
the protocol. gRPC endpoints typically end at 4317; HTTP/Protobuf
endpoints should be the full metrics URL expected by that backend
(often /v1/metrics, but deployments can route it differently).
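A common source of 401s is a hand-encoded Authorization value. If your backend expects HTTP Basic auth (an assumption; many providers use bearer tokens or API-key headers instead), this Python sketch produces the KEY=VALUE string for --header. The function name and credentials are hypothetical.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return a --header argument for HTTP Basic auth (RFC 7617)."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Authorization=Basic {token}"


print(basic_auth_header("user", "pass"))
```

Encode the whole `username:password` pair once; encoding the parts separately or including a trailing newline yields a token the backend will reject.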
Sandbox restarts produce fresh time series
Expected if --emit-run-id or --emit-pid is on. Drop them if you
want a single series per sandbox name across restarts.
See also
- Metrics collector: overview and quick start for msb-metrics.
- Metrics backends: end-to-end recipes per provider.