Operations runbook
Running zecd on mainnet: what to back up, how to restore, what to monitor, how sends behave under failure, and how to upgrade. For getting the stack up in the first place, see Deployment; for config keys, see the configuration reference.
What to back up
Funds are recoverable from the mnemonic alone. Everything else is convenience.
| Artifact | Where | What it protects |
|---|---|---|
| 24-word mnemonic | shown once by zecd init | The funds. Record offline (paper/HSM). Loss of the server without it is loss of funds. |
| Birthday height | inside keys.toml; also record it with the mnemonic | Makes a from-seed restore fast. Any height at or before the wallet's first transaction works. |
| keys.toml | <wallet dir>/keys.toml, or wherever keys_file points | The age-encrypted mnemonic plus network and birthday. Useless without the identity; pair the two for a full server restore. This is the file you ship as a Secret. |
| identity.txt (age identity) | [keys] age_identity, default <datadir>/identity.txt | Decrypts keys.toml. This is spend authority. Store its backup separately from keys.toml backups. |
Do not back up data.sqlite or blocks/. They are caches derived from the chain: zecd keeps
no state in them that it cannot rebuild, so with the mnemonic (and birthday) the whole data
directory can be recreated. Shielded funds are unconditionally recoverable from seed;
transparent funds only within the gap-limit / initial-scan window (see
Transparent support).
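A backup job can apply this rule as a path filter. The sketch below is a hypothetical helper (backup_class is not a zecd tool). It assumes the default file names from the tables in this runbook, so extend SECRETS if keys_file or age_identity point elsewhere.

```python
from pathlib import PurePosixPath

# File and directory names taken from the tables in this runbook.
SECRETS = {"keys.toml", "identity.txt"}
SKIP_FILES = {"data.sqlite", "data.sqlite-wal", "data.sqlite-shm", ".cookie", ".lock"}
SKIP_DIRS = {"blocks"}


def backup_class(relpath: str) -> str:
    """Classify a path relative to the data directory for a backup job.

    "secret": back it up (keep keys.toml and identity.txt backups apart).
    "skip":   cache or ephemeral file that zecd rebuilds or recreates.
    "other":  not covered by this runbook; decide case by case.
    """
    p = PurePosixPath(relpath)
    if SKIP_DIRS.intersection(p.parts) or p.name in SKIP_FILES:
        return "skip"
    if p.name in SECRETS:
        return "secret"
    return "other"
```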
Minimal runtime file set
Per wallet directory <dir>:
| Path | Role | Ship it? |
|---|---|---|
| <dir>/keys.toml | Secret: encrypted seed + birthday/network | Yes. Mount as a Secret; relocate with keys_file / ZECD_KEYS_FILE. |
| identity.txt | Secret: decrypts the seed (spend authority) | Yes, if auto-unlocking. Mount as a Secret (ZECD_AGE_IDENTITY). |
| <dir>/data.sqlite (+ -wal/-shm) | Cache: account, scan progress, balances, history. Rebuilt from keys.toml plus a rescan. | No. |
| <dir>/blocks/ | Cache: downloaded compact blocks. Can grow large; fully re-derivable. | No. Exclude from every snapshot. |
| <datadir>/.cookie | Ephemeral RPC cookie, minted at startup, removed on clean shutdown | No. |
Keep secrets out of the TOML (which typically lives in a ConfigMap):
- RPC password: ZECD_RPC_PASSWORD, --rpcpassword, or [rpc] password_file (precedence: flag/env > password_file > inline password). Prefer the env var or password_file: a password on the command line is visible to any local user via ps, and zecd warns at startup when it is passed that way.
- keys.toml location: ZECD_KEYS_FILE / --keys-file / [keys] keys_file (per wallet: [wallets.<name>] keys_file).
- age identity: ZECD_AGE_IDENTITY / --age-identity / [keys] age_identity.
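The password precedence can be written as a resolver. This helps when a wrapper or health script needs the same credential zecd will use. This is a hypothetical sketch. Two details are assumptions, not documented zecd behavior: the flag beats the env var within the top tier, and trailing newlines in password_file are trimmed.

```python
from pathlib import Path
from typing import Optional


def resolve_rpc_password(flag: Optional[str], env: Optional[str],
                         password_file: Optional[str], inline: Optional[str]) -> Optional[str]:
    """Return the effective RPC password: flag/env > password_file > inline password.

    Assumption: if both --rpcpassword and ZECD_RPC_PASSWORD are set, the flag
    wins; the runbook only places them in the same top tier.
    """
    for explicit in (flag, env):
        if explicit:
            return explicit
    if password_file:
        # Assumption: trailing newline characters are not part of the password.
        return Path(password_file).read_text(encoding="utf-8").rstrip("\r\n")
    return inline
```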
Restore procedures
Server restore (you have keys.toml + identity.txt)
Put both files back at their configured paths and start the daemon. With
[keys] bootstrap_from_keys (default true), an empty data directory next to a present
keys.toml is rebuilt automatically on boot: zecd recreates the account from the seed and
rescans from the stored birthday. No init needed. This is the disposable-datadir pattern:
mount one Secret, start with an empty volume.
When the rebuild runs depends on the custody model:
- Identity / auto_unlock: the seed decrypts at startup, so the rebuild runs as soon as Zebra is reachable. No human action.
- Encrypted (init --encrypt): the wallet starts locked with no account yet; address and spend RPCs return "account is not ready", and /status reports locked: true. The rebuild runs at the first walletpassphrase, after which the wallet syncs (and stays synced while locked). zecd probes datadir writability when it loads the wallet, so a read-only datadir fails at startup rather than at unlock time.
- Watch-only (--ufvk): no seed, so not covered by bootstrap. Recreate with zecd init --ufvk against an empty datadir (see Watch-only wallets).
Set bootstrap_from_keys = false to fail fast on an empty datadir instead.
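The rules above amount to a small decision table. The sketch below restates them in code so a deployment script can predict what a given boot will do. boot_outcome and its return labels are hypothetical names. The case of an empty datadir with no keys.toml is left to the from-seed procedure rather than guessed.

```python
from enum import Enum


class Custody(Enum):
    AUTO_UNLOCK = "age identity available at startup"
    ENCRYPTED = "zecd init --encrypt"
    WATCH_ONLY = "zecd init --ufvk"


def boot_outcome(datadir_empty: bool, keys_present: bool, custody: Custody,
                 bootstrap_from_keys: bool = True) -> str:
    """Model the server-restore rules above as a decision table (not zecd's code)."""
    if not datadir_empty:
        return "open-existing"
    if custody is Custody.WATCH_ONLY:
        return "manual-init"            # no seed: recreate with zecd init --ufvk
    if not keys_present:
        return "not-a-server-restore"   # see From-seed restore
    if not bootstrap_from_keys:
        return "fail-fast"
    if custody is Custody.AUTO_UNLOCK:
        return "rebuild-when-zebra-reachable"
    return "rebuild-at-first-walletpassphrase"  # starts locked; /status shows locked: true
```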
From-seed restore (you have only the mnemonic)
```sh
zecd init --datadir /var/lib/zecd --restore --birthday <height>
# paste the mnemonic when prompted
```
Always pass --birthday (any height at or before the wallet's first transaction). Without
it, the restore scans from the activation height of the wallet's earliest enabled pool
(Orchard/NU5 for the default Orchard-only config, Sapling activation when Sapling is
enabled): safe (it can never miss notes) but slow on mainnet. History reappears as the scan
progresses; do not trust balances until the scan and enhancement backlog finish ("synced"
readiness, or /status showing fully_scanned at the tip and pending_enhancements 0; the
default "connected" readiness reports ready long before that).
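A restore script can gate its "done" signal on the same condition. This sketch makes several assumptions. It treats one wallet's /status entry as a mapping whose fully_scanned is a block height and whose pending_enhancements is a count; that layout is hypothetical. It takes chain_tip from Zebra (for example via getblockcount). Missing fields count as "not done".

```python
from typing import Any, Mapping


def restore_complete(wallet: Mapping[str, Any], chain_tip: int) -> bool:
    """True once balances can be trusted after a from-seed restore.

    Requires both: the scan has reached the tip, and the memo-enhancement
    backlog is empty. Missing fields count as "not done".
    """
    return (
        wallet.get("fully_scanned", -1) >= chain_tip
        and wallet.get("pending_enhancements", 1) == 0
    )
```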
Non-interactive restore: set ZECD_MNEMONIC, or pass --mnemonic-file <path>
(ZECD_MNEMONIC takes precedence; stdin is the fallback). For init --encrypt, set
ZECD_WALLET_PASSPHRASE instead of answering the prompt.
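For automation, a wrapper can build the restore invocation with the mnemonic in ZECD_MNEMONIC rather than on the command line. This hypothetical helper only assembles argv and the environment from the documented flags and variable. The 24-word check mirrors the mnemonic that zecd init shows. The caller is responsible for keeping the phrase out of logs.

```python
import os
from typing import Dict, List, Mapping, Tuple


def restore_command(datadir: str, birthday: int, mnemonic: str,
                    base_env: Mapping[str, str] = os.environ) -> Tuple[List[str], Dict[str, str]]:
    """Build argv and env for a non-interactive from-seed restore."""
    words = mnemonic.split()
    if len(words) != 24:
        raise ValueError(f"expected a 24-word mnemonic, got {len(words)} words")
    if birthday <= 0:
        raise ValueError("pass a real birthday height; without it the scan starts at pool activation")
    argv = ["zecd", "init", "--datadir", datadir, "--restore", "--birthday", str(birthday)]
    env = dict(base_env)
    # ZECD_MNEMONIC takes precedence over --mnemonic-file and stdin, and keeps
    # the phrase out of argv (and therefore out of ps output).
    env["ZECD_MNEMONIC"] = " ".join(words)
    return argv, env
```

Run the result with subprocess.run(argv, env=env, check=True), passing the wallet's real birthday height.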
Watch-only replica
Export the viewing key on the spending host with zecd export-ufvk, then
zecd init --ufvk "uview1..." --birthday <height> on the replica. A watch-only wallet is
fully reconstructable from UFVK + birthday; record both. The UFVK cannot spend but reveals
the wallet's entire transaction graph, so treat it as confidential.
Monitoring and alerting
zecd serves unauthenticated probes on a separate port (default 9233) when [health] enabled
(the default):
| Endpoint | Semantics |
|---|---|
| GET /healthz | Liveness. 200 ok while the process runs. |
| GET /readyz | Readiness, 200/503, gated by [health] readiness. |
| GET /status | JSON snapshot: per-wallet sync state, active upstream endpoint, conn_state (down, syncing, or ready), pending_enhancements, locked. |
Readiness modes:
- "connected" (default): ready once Zebra is connected and its tip is past the wallet's birthday. Does not wait for the scan, so readiness never flaps during a long catch-up; reads may lag the tip.
- "synced": ready only once every wallet is connected, within [health] max_scan_lag blocks of the tip (default 4), and with an empty enhancement backlog. A from-birthday restore stays not-ready until it has scanned to its own funds and finished backfilling memos.
A 503 body carries a reason. Route alerts on it:
| reason | Meaning | Action |
|---|---|---|
| upstream_down | Zebra unreachable | Page someone. |
| actor_down | A wallet's writer actor died | Restart the process. |
| enhancing | Scanned to tip, still backfilling memos ("synced" mode only) | Wait; watch pending_enhancements trend to zero. |
| syncing | Normal block catch-up | Wait. |
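An alert router can key off the reason directly. The sketch assumes the 503 body is JSON with a top-level "reason" string; the exact body format is not specified here. It escalates anything it cannot parse rather than silently waiting. Time-based gating, such as the 5-minute upstream_down rule in Suggested alerts, belongs in the alerting layer.

```python
import json

# reason -> action, from the table above.
ACTIONS = {
    "upstream_down": "page",
    "actor_down": "restart",
    "enhancing": "wait",
    "syncing": "wait",
}


def route_readyz(status: int, body: str) -> str:
    """Map a /readyz response to an alert action (assumes a JSON body on 503)."""
    if status == 200:
        return "none"
    try:
        reason = json.loads(body).get("reason")
    except (ValueError, AttributeError):  # not JSON, or JSON that is not an object
        reason = None
    return ACTIONS.get(reason, "page")    # unknown reasons escalate
```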
"Scanned to tip" is not "ready". Compact blocks carry no memos, so after the block scan
catches up, an enhancement pass fetches each transaction's full data from Zebra and decrypts
it to backfill memos. On a from-birthday restore of a busy wallet that is one fetch + decrypt
per transaction, potentially hours of work after scan_progress hits 1.0. While the
backlog drains, conn_state stays syncing, getwalletinfo.scanning and
getblockchaininfo.initialblockdownload stay truthy, and "synced" readiness holds 503 with
reason="enhancing". Watch /status pending_enhancements; if it drains slowly, check that
Zebra's getrawtransaction is fast.
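To tell a slow-but-draining backlog from a stuck one, poll /status and estimate a drain rate. The sketch assumes you have already extracted (timestamp, pending_enhancements) pairs. drain_eta_seconds is a hypothetical helper that uses the average rate between the first and last samples.

```python
from typing import Optional, Sequence, Tuple


def drain_eta_seconds(samples: Sequence[Tuple[float, int]]) -> Optional[float]:
    """Estimate seconds until pending_enhancements reaches zero.

    samples: (unix_time, pending_enhancements), oldest first. Returns None when
    there is too little data or the backlog is flat or growing (then check
    Zebra's getrawtransaction latency).
    """
    if len(samples) < 2:
        return None
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    if t1 <= t0 or p1 >= p0:
        return None
    rate = (p0 - p1) / (t1 - t0)  # transactions enhanced per second
    return p1 / rate
```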
locked (top-level on both /readyz and /status, plus per-wallet) is true when a
passphrase-encrypted wallet needs a walletpassphrase before it can spend. It is reported
independently of readiness (a locked wallet can be ready: true), so a controller can drive
an unlock without mistaking it for a sync stall.
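A controller that drives unlocks should read only the locked flag. The sketch assumes /status nests per-wallet objects under a "wallets" map keyed by wallet name, each with a boolean "locked". Check the actual JSON before relying on that layout.

```python
from typing import Any, Dict, List


def wallets_to_unlock(status: Dict[str, Any]) -> List[str]:
    """Names of wallets waiting for walletpassphrase, from a /status snapshot.

    Looks only at the per-wallet locked flag, never at readiness or conn_state:
    a locked wallet can be ready and in sync, so "locked" is not a sync stall.
    Assumed layout: status["wallets"] maps wallet name -> {"locked": bool, ...}.
    """
    wallets = status.get("wallets", {})
    return sorted(name for name, w in wallets.items() if w.get("locked") is True)
```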
For load visibility, getrpcinfo returns active_commands: one entry per executing call
with method and duration (microseconds).
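A watchdog can flag long-running calls from this list. The sketch assumes entries shaped like bitcoind's getrpcinfo output ({"method": ..., "duration": <microseconds>}). The 30-second default threshold is an arbitrary example and should sit above normal send latency.

```python
from typing import Any, Dict, List, Tuple


def slow_commands(active_commands: List[Dict[str, Any]],
                  threshold_s: float = 30.0) -> List[Tuple[str, float]]:
    """Return (method, seconds) for calls running longer than threshold_s, slowest first."""
    running = [
        (cmd["method"], cmd["duration"] / 1_000_000)  # duration is in microseconds
        for cmd in active_commands
    ]
    return sorted((m for m in running if m[1] > threshold_s), key=lambda m: m[1], reverse=True)
```

With a JSON-RPC client in hand, call it as slow_commands(getrpcinfo_result["active_commands"]).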
Logs: set [log] format = "json" for aggregation (Loki/CloudWatch/Elastic). Every RPC call
logs method, wallet, elapsed_ms (debug on success; errors log at info and add
code/message). Sync and connection lifecycle events log at info; connection failures at
warn.
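With JSON logs, per-method latency and error counts fall out of a simple fold. The sketch assumes one flat JSON object per line, with method, elapsed_ms and (on errors) code; zecd's actual layout may nest these fields. Successful calls log at debug, so the calls count includes successes only when the log level is debug.

```python
import json
from collections import defaultdict
from typing import Any, Dict, Iterable


def rpc_latency_summary(lines: Iterable[str]) -> Dict[str, Dict[str, Any]]:
    """Fold JSON log lines into per-method call count, error count and max elapsed_ms."""
    stats: Dict[str, Dict[str, Any]] = defaultdict(lambda: {"calls": 0, "errors": 0, "max_ms": 0.0})
    for line in lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # not a JSON record
        if not isinstance(rec, dict) or "method" not in rec or "elapsed_ms" not in rec:
            continue  # sync/connection lifecycle events, not RPC calls
        s = stats[rec["method"]]
        s["calls"] += 1
        if "code" in rec:  # errors add code/message
            s["errors"] += 1
        s["max_ms"] = max(s["max_ms"], float(rec["elapsed_ms"]))
    return dict(stats)
```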
Suggested alerts:
- /readyz 503 with reason=upstream_down for more than 5 minutes.
- /status sync lag (chain tip minus scanned height) not shrinking for 30 minutes.
- Sustained HTTP 503 from the RPC port (work queue exhausted).
- Daemon restarts.
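The first two rules need a little history to evaluate. The sketch below works on samples you have already polled from /readyz and /status. Two choices are assumptions, not zecd behavior: the tuple layouts, and the use of the max_scan_lag default (4) as a noise tolerance.

```python
from typing import Callable, Optional, Sequence, Tuple

UPSTREAM_DOWN_FOR_S = 5 * 60   # "for more than 5 minutes"
LAG_STALL_FOR_S = 30 * 60      # "not shrinking for 30 minutes"
LAG_TOLERANCE = 4              # assumption: the [health] max_scan_lag default


def held_for(samples: Sequence[tuple], pred: Callable[[tuple], bool], now: float) -> float:
    """Seconds a condition has held: measured from the oldest sample in the trailing
    run that satisfies pred, or 0.0 if the newest sample does not."""
    start: Optional[float] = None
    for sample in reversed(samples):
        if not pred(sample):
            break
        start = sample[0]
    return 0.0 if start is None else now - start


def upstream_down_alert(readyz: Sequence[Tuple[float, int, Optional[str]]], now: float) -> bool:
    """readyz: (unix_time, http_status, reason) polled from /readyz, oldest first."""
    def down(s: Tuple[float, int, Optional[str]]) -> bool:
        return s[1] == 503 and s[2] == "upstream_down"
    return held_for(readyz, down, now) > UPSTREAM_DOWN_FOR_S


def lag_stalled_alert(status: Sequence[Tuple[float, int, int]], now: float) -> bool:
    """status: (unix_time, chain_tip, scanned_height) polled from /status, oldest first.
    Fires when the lag exceeds the tolerance and is no smaller than 30 minutes ago."""
    old = [s for s in status if s[0] <= now - LAG_STALL_FOR_S]
    if not old:
        return False  # not enough history yet
    lag_then = old[-1][1] - old[-1][2]
    lag_now = status[-1][1] - status[-1][2]
    return lag_now > LAG_TOLERANCE and lag_now >= lag_then
```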
The health server starts only after wallets load, so cover boot-time wallet loading and prover
initialization with a startupProbe or initialDelaySeconds. The port is unauthenticated by design
and exposes only sync status; keep it off the public internet anyway.
Send semantics under failure
See Sending for the RPC surface; this is the operational contract.
- sendtoaddress and sendmany are synchronous and compute Orchard proofs, so a call holds the HTTP connection for a few seconds plus any queueing behind other sends (sends serialize per wallet). Set client-side send timeouts well above that. (z_sendmany returns an operation id immediately; see async operations.)
- A client timeout is not a failure. The send may still complete on the server. Retrying a send that actually succeeded pays twice, exactly as with bitcoind, but the longer proving window makes it likelier. On timeout, reconcile with listtransactions (or gettransaction) before retrying.
- A send whose initial broadcast fails in transport still returns the txid. The transaction is already committed to the wallet, its inputs are locked, and the rebroadcast loop re-submits it (at most once per [sync] rebroadcast_secs, default 60) while it is unmined and unexpired. Never retry a send that returned a txid.
- Only an explicit upstream rejection (Zebra examined the tx and refused it) returns an error, with code -26. The tx's notes stay locked until its expiry height, then become spendable again; an immediate retry fails with -6 rather than double-paying.
- An expired unmined tx reports confirmations: -1 and abandoned: true. Treat it as failed and safe to re-send.
- Rapid back-to-back sends exhaust spendable notes and return -6 until change confirms (freshly created shielded change is not spendable while unmined). The -6 message appends any balance awaiting confirmations, so "retry after the next block" is distinguishable from "the wallet needs funding".
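Together, these rules suggest a client-side send wrapper roughly like the one below. It is a hypothetical sketch. rpc stands in for your JSON-RPC client, which is assumed to raise TimeoutError on a client timeout and RpcError on a JSON-RPC error. After a timeout, the wrapper matches the send by address and amount against new listtransactions entries. That matching assumes bitcoind-style entry fields and no concurrent identical sends.

```python
from typing import Any, Callable, Dict, Set

LOOKBACK = 100  # recent history entries to inspect (assumption)


class RpcError(Exception):
    """JSON-RPC error as raised by a hypothetical client."""

    def __init__(self, code: int, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def _send_txids(rpc: Callable[..., Any]) -> Set[str]:
    return {e["txid"] for e in rpc("listtransactions", "*", LOOKBACK) if e.get("category") == "send"}


def safe_send(rpc: Callable[..., Any], address: str, amount: float) -> Dict[str, Any]:
    """Send once and classify the outcome so the caller never pays twice."""
    before = _send_txids(rpc)
    try:
        txid = rpc("sendtoaddress", address, amount)
    except TimeoutError:
        # Not a failure: the server may have committed the send. Reconcile first.
        for e in rpc("listtransactions", "*", LOOKBACK):
            if (e.get("category") == "send" and e["txid"] not in before
                    and e.get("address") == address
                    and abs(abs(e["amount"]) - amount) < 5e-9):  # within half a zatoshi
                return {"status": "sent", "txid": e["txid"]}
        return {"status": "unknown"}  # may still be proving: re-check later, do not resend
    except RpcError as err:
        if err.code == -26:  # Zebra rejected it; its notes stay locked until expiry
            return {"status": "rejected", "error": err.message}
        if err.code == -6:   # no spendable notes now; message says if funds await confirmation
            return {"status": "insufficient", "error": err.message}
        raise
    # A returned txid is committed and rebroadcast by zecd, even if the first
    # broadcast failed in transport: never retry it.
    return {"status": "sent", "txid": txid}


def safe_to_resend(rpc: Callable[..., Any], txid: str) -> bool:
    """True only for an expired, unmined transaction (abandoned, confirmations -1)."""
    tx = rpc("gettransaction", txid)
    return tx.get("abandoned") is True and tx.get("confirmations") == -1
```

The wrapper only decides what a timeout means; configure the underlying client timeout well above proving time. An "unknown" result means run the reconciliation again later. It does not mean resend.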
Reorgs
zecd follows reorgs automatically: the scanner detects the fork, rewinds, and rescans the
replacement chain. Transactions in reorged-away blocks revert to unconfirmed
(confirmations: 0) until re-mined; confirmation thresholds keep doing their job. One
operator-visible consequence: a listsinceblock cursor pointing at a reorged-away block
returns -5 Block not found (zecd keeps no stale-header history to walk back through, unlike
bitcoind). Treat -5 as "cursor invalid": re-baseline with a parameterless listsinceblock,
dedupe by txid, and store the fresh lastblock. See
Wallet: history & unspent.
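A polling loop that survives reorgs can be sketched as follows. rpc and RpcError are hypothetical client stand-ins. Dedupe is by txid as described above. The loop keeps every entry of a newly seen transaction, because one transaction can produce several entries.

```python
from typing import Any, Callable, Dict, List, Optional, Set, Tuple

BLOCK_NOT_FOUND = -5


class RpcError(Exception):
    """JSON-RPC error as raised by a hypothetical client."""

    def __init__(self, code: int, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


def poll_since(rpc: Callable[..., Any], cursor: Optional[str],
               seen: Set[str]) -> Tuple[List[Dict[str, Any]], str]:
    """One polling step: returns (new entries, next cursor) and updates seen in place."""
    try:
        res = rpc("listsinceblock", cursor) if cursor else rpc("listsinceblock")
    except RpcError as err:
        if err.code != BLOCK_NOT_FOUND:
            raise
        res = rpc("listsinceblock")  # cursor block was reorged away: re-baseline
    new_ids = {tx["txid"] for tx in res["transactions"]} - seen
    seen |= new_ids
    fresh = [tx for tx in res["transactions"] if tx["txid"] in new_ids]
    return fresh, res["lastblock"]
```

Persist the returned cursor only after the new entries are processed. That way, a crash replays entries rather than skipping them.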
Upgrades
- Stop with SIGINT or SIGTERM (both are graceful: in-flight requests finish, new ones get 503). The stop RPC is regtest-only, so a stray RPC call cannot take down a production daemon.
- Replace the binary or pull the new image.
- Start. Wallet DB migrations run automatically at open; the first start after a large librustzcash bump can take longer.
Downgrades across DB migrations are not supported. If you need a rollback path, stop the daemon and snapshot the datadir first. The worst case of a lost datadir is a from-seed restore, not lost funds.
Single-instance datadir lock
zecd takes an exclusive advisory lock on <datadir>/.lock while it owns the data directory
(the daemon for its whole lifetime, zecd init for the duration of the init). A second zecd run or
zecd init on the same datadir fails fast with Cannot lock data directory .... The lock is
an OS advisory lock the kernel releases when the process exits, including a crash or kill, so
there is never a stale lockfile to delete: if the error appears and no zecd is running, just
retry. Two commands are exempt because they never write the datadir: zecd export-ufvk
(read-only DB access, so you can export a UFVK while the daemon runs) and zecd rpcauth.
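The "no stale lockfile" property is a general trait of kernel advisory locks: the lock belongs to the open file, not to the file's existence. The sketch below demonstrates it with Python's fcntl.flock on Unix, and it illustrates the semantics only. The runbook does not say which locking primitive zecd uses, so do not point this at a live datadir as a liveness probe.

```python
import fcntl
import os
import tempfile
from typing import Optional


def try_lock(path: str) -> Optional[int]:
    """Open path and try a non-blocking exclusive flock.

    Returns the open fd (the lock holder), or None if another open file
    already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None  # fail fast, like a second zecd on the same datadir
    return fd  # closing it, or process exit (even a crash), releases the lock


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, ".lock")
        owner = try_lock(path)          # plays the running daemon
        print(try_lock(path))           # None: already held
        os.close(owner)                 # what the kernel does when the owner exits
        again = try_lock(path)          # the file still exists, yet the lock is free
        print(again is not None)        # True
        os.close(again)
```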
Mainnet checklist
- network = "main" and a real [rpc] password (the daemon refuses to start with the CHANGE-ME placeholder).
- RPC bound to 127.0.0.1 or a private network; TLS or a reverse proxy in front if it must cross a network boundary. RPC credentials are spend authority (see the threat model).
- Key custody chosen deliberately: for unattended sending, the age identity stored outside the datadir (secrets manager, separate mount, ZECD_AGE_IDENTITY); for human-operated wallets, zecd init --encrypt so spending requires walletpassphrase with a timeout. See Key custody.
- Mnemonic and birthday recorded offline; restore procedure tested on testnet.
- Local Zebra full node configured (server = "zebra" or zebra://host:port); Docker images pinned to verified releases.
- /readyz wired into the orchestrator with a startupProbe covering initial sync; alerts on upstream_down.
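Several checklist items can be checked mechanically before cutover. The function below is a hypothetical pre-flight helper that takes already-extracted values. This runbook does not name the TOML key that holds the RPC bind address, so the address is passed in directly. The helper treats 0.0.0.0 and :: as failures because they bind every interface.

```python
import ipaddress
from typing import List


def checklist_findings(network: str, rpc_password: str, rpc_bind: str, server: str) -> List[str]:
    """Return human-readable problems; an empty list means these checks pass."""
    findings: List[str] = []
    if network != "main":
        findings.append(f'network is {network!r}, expected "main"')
    if not rpc_password or rpc_password == "CHANGE-ME":
        findings.append("RPC password is empty or the CHANGE-ME placeholder")
    try:
        ip = ipaddress.ip_address(rpc_bind)
    except ValueError:
        findings.append(f"RPC bind {rpc_bind!r} is not an IP literal; verify it resolves privately")
    else:
        if ip.is_unspecified:        # 0.0.0.0 / :: listen on every interface
            findings.append("RPC bound to all interfaces")
        elif not (ip.is_loopback or ip.is_private):
            findings.append(f"RPC bound to public address {rpc_bind}")
    if server != "zebra" and not server.startswith("zebra://"):
        findings.append(f"server {server!r} does not point at a Zebra node")
    return findings
```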