Pre-compiled binaries for Envoy, Coraza WASM, and Elchi Client
Single-VM Ubuntu 24.04 install that brings up a kind cluster and deploys the Helm chart:
Install the entire elchi stack as systemd services on 1, 2, or 3+ Linux VMs.
The script runs once on the first node ("M1", the local machine) and SSHes into the
rest to provision them. Source lives at
deploy/standalone/;
there is no separate release tarball: the installer is unversioned and always runs from the
main branch. Component versions (elchi-backend, UI, envoy, coredns) are
pinned per-flag.
Without --nodes the installer defaults to a single-VM setup on this machine
(auto-detects the first non-loopback IPv4 from hostname -I):
Versions default from deploy/standalone/lib/versions.sh: UI v1.4.4,
backend elchi-v1.4.8-v0.14.0-envoy1.36.2, envoy v1.36.2,
coredns v0.1.4, collector v0.1.8. Override any with the matching
--ui-version= / --backend-version= / --envoy-version= /
--coredns-version= / --collector-version= flag.
GSLB zone: CoreDNS GSLB plugin is enabled by default. If you skip
--gslb-zone=..., the installer falls back to elchi.local, a non-routable
.local-style namespace safe for internal clusters / testing. Pass
--gslb-zone=<your-delegated-domain> for a real authoritative deployment, or
--no-gslb to skip the plugin entirely.
Post-install UI activation (required for GSLB to actually serve records).
The installer ships and boots the CoreDNS daemon (TCP/UDP :53, webhook
:8053), but the backend-side configuration that the plugin polls for the
authoritative snapshot is OFF until you turn it on in the UI:
1. Set the zone to the value passed as --gslb-zone (warning: the zone cannot be changed later without a re-install).
2. Set the secret to the output of sudo elchi-stack show-secret gslb. This is the X-Elchi-Secret the plugin uses to authenticate its /dns/snapshot poll to the backend, and the values MUST match.
3. Within --gslb-sync-interval (default 1 min) every node's CoreDNS plugin pulls the fresh snapshot and starts answering queries.
Verify after activation: dig @<node-ip> <zone> SOA +short on any
node should return the SOA record. If it doesn't,
journalctl -u elchi-coredns -n 50 and the plugin /health
endpoint on 127.0.0.1:8053 will say why (auth failure / snapshot poll error).
Variants & replicas: each --backend-version entry is
ONE variant. The number of variants determines how many backend processes per node:
3 variants = 1 controller + 3 control-planes per node (one control-plane per Envoy
version). The same variant cannot appear twice: duplicates collide on the registry
name <hostname>-controlplane-<X.Y.Z> and the installer rejects
them. Capacity scales by adding nodes, not by replicating a variant on the same node.
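A minimal sketch of the duplicate check described above, assuming the variant tag shape elchi-vX.Y.Z-vA.B.C-envoyP.Q.R and that the collision is decided by the Envoy version embedded in the tag. The function names are hypothetical and are not taken from the installer.

```python
import re

# Tag shape as documented: elchi-vX.Y.Z-vA.B.C-envoyP.Q.R
TAG_RE = re.compile(r"^elchi-v\d+\.\d+\.\d+-v\d+\.\d+\.\d+-envoy(\d+\.\d+\.\d+)$")


def envoy_version(tag: str) -> str:
    """Extract the Envoy version (P.Q.R) from a variant tag."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a variant tag: {tag!r}")
    return match.group(1)


def control_plane_names(hostname: str, variants: list) -> list:
    """Build <hostname>-controlplane-<X.Y.Z> names, rejecting collisions."""
    names = []
    for tag in variants:
        name = f"{hostname}-controlplane-{envoy_version(tag)}"
        if name in names:
            # Assumption: a repeated name is treated as a hard error.
            raise ValueError(f"duplicate variant: {name} is already registered")
        names.append(name)
    return names
```

With three distinct variants this yields three control-plane names per node; repeating a variant raises instead of silently registering the same name twice.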
When no --ssh-key / --ssh-password is passed and
--nodes includes remote hosts, the installer auto-enables
--ssh-bootstrap and prompts for each remote node's password
once. Single-VM installs (only the local IP in --nodes) skip
SSH provisioning entirely.
--ssh-bootstrap mints a fresh ed25519 key on M1, then prompts the
operator once per remote node for that node's password. Each password is
used only for that node's ssh-copy-id and is discarded immediately
after. M1 itself is local, so there is no password prompt for it. Subsequent SSH sessions (orchestration,
upgrades, uninstall) all use the generated key.
Dedicated admin user (default-on). By default the bootstrap also
creates a key-only, passwordless-sudo admin user (elchi-cluster-admin)
on every node (including M1) and locks all subsequent orchestration to that
identity. After the first install, the operator can lock root's password, disable
root SSH login, or delete the root account; upgrade and
uninstall keep working because they run as elchi-cluster-admin
with sudo. Override the name with --admin-user=<name> or opt out
with --no-admin-user (legacy: orchestration stays on root).
install.sh: full flag reference
Every variant tag in --backend-version is a full release-asset name
(elchi-vX.Y.Z-vA.B.C-envoyP.Q.R), downloaded from the public
elchi-archive releases
(mirrored there from the private elchi-backend repo by the
build-elchi-backend workflow). Multiple variants are comma-separated and
each gets its own systemd template unit + /etc/elchi/<variant>/
config dir + /var/lib/elchi/<variant>/ HOME dir.
--nodes=<csv>
--ssh-user=<user>
root.
--ssh-port=<n>
22.
--ssh-key=<path>
--ssh-password=<pwd>
sshpass). Avoid for production.
--ssh-bootstrap
--admin-user=<name>
elchi-cluster-admin (active by default).
Dedicated admin user provisioned on every node during the first bootstrap. Runs
useradd + /etc/sudoers.d/10-elchi-admin (NOPASSWD) + drops
the cluster pubkey into the new user's authorized_keys, then flips the
orchestrator's SSH user to it (persisted to /etc/elchi/orchestrator.env).
After this, the cluster does NOT depend on the initial login user (root): the
operator can rotate root's password, disable root SSH login, or even delete root
entirely without breaking upgrade or uninstall.
Idempotent on rerun.
--no-admin-user
--backend-version=<csv>
elchi-v1.4.8-v0.14.0-envoy1.36.2 (from lib/versions.sh). Alias: --backend-variants=.
--backend-release=<tag>
--ui-version=<vX.Y.Z>
elchi-dist-vX.Y.Z.tar.gz), downloaded from the public elchi-archive releases (mirrored from the private elchi repo by the build-elchi-ui workflow). Default: v1.4.4.
--envoy-version=<vX.Y.Z>
v1.37.0.
--coredns-version=<vX.Y.Z>
--gslb). Default: v0.1.3. v0.1.1 shipped without the elchi plugin compiled in; pinning to that version causes Unknown directive 'elchi'.
--collector-version=<vX.Y.Z>
v0.1.8. Mirrored to the public elchi-archive releases by the build-elchi-collector workflow.
--no-collector
Replica count is fixed by design; there is no flag to tune it:
versions[0]'s binary. Registers as bare <hostname>.
<hostname>-controlplane-<envoy-X.Y.Z>.
Capacity for a different Envoy version: add another variant tag. Capacity for the same Envoy version: add another node. Running the same variant twice on the same host would collide on the registry name and is rejected by topology compute.
--main-address=<dns|ip>
--port=<n>
443.
--hostnames=<csv>
--tls=self-signed|provided
self-signed (10-year ECDSA-P256 generated by openssl).
--cert=<path>
--tls=provided).
--key=<path>
--tls=provided).
--ca=<path>
--timezone=<tz>
UTC.
--mongo=local|external
local.
--mongo-uri=<uri>
mongodb[+srv]://... for --mongo=external; granular flags below win on conflicts.
--mongo-version=auto|6.0|7.0|8.0
auto picks the highest version supported on the detected distro. Default: auto.
--mongo-hosts=<csv>
host1:port1,host2:port2,...
--mongo-username=<user>
--mongo-password=<pwd>
--mongo-database=<name>
elchi.
--mongo-scheme=mongodb|mongodb+srv
mongodb.
--mongo-port=<n>
27017.
--mongo-replicaset=<name>
elchi-rs.
--mongo-tls=true|false
false.
--mongo-auth-source=<db>
admin.
--mongo-auth-mechanism=<mech>
SCRAM-SHA-256. Empty = backend default.
--mongo-timeout-ms=<ms>
9000.
--mongo-data-dir=<path>
/var/lib/mongodb.
--vm=local|external
local.
--vm-endpoint=<url|host:port>
--vm=external.
--vm-data-dir=<path>
/var/lib/elchi/victoriametrics.
--vm-retention=<dur>
15d.
--grafana-user=<user>
elchi.
--grafana-password=<pwd>
grafana-cli on every install since bare-metal persists /var/lib/grafana.
--grafana-allow-plugin=<csv>
--gslb
--no-gslb
--gslb-zone=<domain>
gslb.example.com). Default: elchi.local, a non-routable .local-style domain that's safe out of the box for internal cluster DNS / testing. Override with your own delegated domain in production.
--gslb-admin-email=<email>
hostmaster@<zone> (RFC 2142 convention). Becomes the SOA RNAME (with @ → .). Override if your DNS contact is different.
--gslb-nameservers=<csv>
ns1:ip,ns2:ip,... NS records + glue.
--gslb-regions=<csv>
--gslb-tls-skip-verify
/dns/snapshot over HTTPS.
--gslb-ttl=<sec>
300.
--gslb-sync-interval=<dur>
1m.
--gslb-timeout=<dur>
4s.
--gslb-static-records=<csv>
--gslb-secret=<value>
X-Elchi-Secret shared secret.
--gslb-forwarders=<csv>
8.8.8.8,8.8.4.4.
--internal-communication=true|false
false.
--cors-origins=<csv>
*.
--jwt-access-duration=<dur>
1h.
--jwt-refresh-duration=<dur>
5h.
--enable-demo
--log-level=<level>
info.
--log-format=text|json
text.
--non-interactive
--no-firewall
firewalld/ufw port opening.
--dry-run
/tmp/elchi-dryrun-*; skip every side-effect (no SSH, no SCP, no binary download, no service start).
--force-redownload
--keep-bundle
/tmp/ after orchestration. Also preserves the orchestrator-staged secrets-from file so the operator can re-decrypt mid-incident.
--bundle-key-out=<path>
--quiet-key
/etc/elchi/.bundle-key (sealed via systemd-creds when available) and is recoverable via elchi-stack show-secret bundle-key. Use this when capturing install logs to artifact storage so the key doesn't leak into screen recordings / tmux scrollback / CI logs.
-h | --help
--skip-orchestration
--bundle.
--node-index=<n>
--bundle=<path>
--bundle-key=<hex>
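The --gslb-admin-email entry above says the address becomes the SOA RNAME with @ turned into a dot. The sketch below shows that conversion, assuming the hostmaster@<zone> default, a trailing dot on the result, and the RFC 1035 master-file convention of escaping dots in the local part. The function name is hypothetical; the installer's own rendering may differ.

```python
from typing import Optional


def soa_rname(zone: str, admin_email: Optional[str] = None) -> str:
    """Convert an admin e-mail into SOA RNAME form (mailbox as a domain name)."""
    # Default mirrors the documented fallback: hostmaster@<zone>.
    email = admin_email or f"hostmaster@{zone}"
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an e-mail address: {email!r}")
    # Dots inside the local part are escaped so they are not read as label
    # separators (RFC 1035 master-file convention).
    local = local.replace(".", "\\.")
    return f"{local}.{domain.rstrip('.')}."
```

For the default zone this gives hostmaster.elchi.local. as the RNAME.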
upgrade.sh: version-diff upgrade
Run on M1. Computes the diff against the running cluster
(added / kept / removed variants) and re-runs
install.sh with the union. Every elchi-* systemd unit goes
through hash-based reconcile (install_and_apply /
reconcile_external) so binary or config changes trigger a restart;
unchanged services stay running. Single-flight via flock /run/elchi-upgrade.lock.
No SSH flags needed after install. install.sh persists
ELCHI_SSH_USER / KEY / PORT to /etc/elchi/orchestrator.env
(mode 0600 root). Re-run upgrade or uninstall without
--ssh-user / --ssh-key / --ssh-port and they'll fall back to the
persisted values. Pass a flag explicitly to override (e.g. you switched
the deploy user). The password is never persisted; only key auth is used, since
--ssh-bootstrap already distributed the cluster key during install.
--add-backend-version appends to the current variant set
without making you re-list everything that's already deployed. The new
variant gets a fresh control-plane systemd unit + binary on every node,
ports are allocated deterministically, and the UI's config.js
AVAILABLE_VERSIONS regenerates so the new envoy version
shows up in the UI version dropdown automatically.
Backend / envoy / coredns are not touched: every other component's
fingerprint stays identical so install.sh's reconcile marks them as
noop. Only nginx may restart if its config block changes.
Each form fans out the new artifacts to every node in
/etc/elchi/nodes.list using the persisted SSH credentials,
runs the hash-based reconcile, and gates the result through
verify::deep_health on every node. A failed health check
triggers per-binary rollback to the .prev snapshot on the
bad nodes; healthy nodes keep the new version.
--backend-version=<csv>
--add-backend-version=<csv>
--prune-version / --prune-missing. Triggers control-plane unit creation + UI config.js regeneration cluster-wide.
--ui-version=<vX.Y.Z>
--envoy-version=<vX.Y.Z>
--coredns-version=<vX.Y.Z>
--mongo-version=auto|6.0|7.0|8.0
--grafana-user=<user>
--grafana-password=<pwd>
--prune-version=<tag>
--prune-missing.
--prune-missing
--backend-version list.
--ssh-user=<user>
--ssh-key=<path>
--ssh-port=<n>
--skip-health-gate
verify::deep_health. Faster but less safe; only use when verify itself is the problem.
-h | --help
KEPT=[v1.2.0/envoy1.36.2], ADDED=[v1.2.0/envoy1.37.0]. The kept variant's fingerprint is unchanged → noop. The added variant gets a fresh template unit + per-instance envs + binary download + start. Port allocations for the kept variant stay stable (deterministic offset from variant position).
install.sh re-runs with [new]; the controller's ExecStart now points at the new binary →
fingerprint diff → restart. Then the prune step removes the old variant's unit,
binary, .prev snapshot, /etc/elchi/<old>/, /var/lib/elchi/<old>/,
fingerprint files, and ports.json entry. /etc/hosts + Envoy
bootstrap are re-rendered with the new variant set.
KEPT=∅, ADDED=[v1.2.0/envoy1.37.0, v1.2.0/envoy1.38.0], REMOVED=[everything in CUR not in new list]. Same install→prune flow.
After install.sh finishes, every node runs verify::deep_health:
systemd state + journalctl registration log + Envoy admin
/listeners bind check. A failure triggers per-binary rollback
on the failed nodes (.prev snapshot → restart). Healthy nodes keep
the new version; the operator retries against the bad node.
Orphan warning: if you call upgrade.sh --backend-version=B
against a cluster that has [A] running, and you don't pass
--prune-version=A or --prune-missing, the installer
keeps A running and emits a warning. Variants are never silently removed.
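The KEPT / ADDED / REMOVED diff and the orphan rule can be sketched as below. This is an illustration under assumptions, not the script's code: the function and argument names are hypothetical, and it assumes a tag named in --prune-version is also dropped from the target set.

```python
def plan_upgrade(current, requested=None, add=(), prune=(), prune_missing=False):
    """Compute which variants are kept, added, removed, or left as orphans.

    current       -- variants running now
    requested     -- full list from --backend-version (None if not passed)
    add           -- tags from --add-backend-version
    prune         -- tags from --prune-version
    prune_missing -- True when --prune-missing is passed
    """
    target = list(current) if requested is None else list(requested)
    for tag in add:
        if tag not in target:
            target.append(tag)
    # Assumption: an explicitly pruned tag never stays in the target set.
    target = [t for t in target if t not in prune]

    kept = [t for t in current if t in target]
    added = [t for t in target if t not in current]
    missing = [t for t in current if t not in target]
    removed = [t for t in missing if prune_missing or t in prune]
    # Orphans keep running and only produce a warning.
    orphans = [t for t in missing if t not in removed]
    return {"kept": kept, "added": added, "removed": removed, "orphans": orphans}
```

Calling plan_upgrade(["A"], requested=["B"]) leaves A in orphans, which matches the warning case: nothing is removed unless a prune flag says so.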
uninstall.sh: remove the stack
Two ways to invoke. Both routes hit the same script; pick whichever matches how you installed.
No SSH flags needed for --all-nodes.
Same persistence trick as upgrade: /etc/elchi/orchestrator.env
holds the ELCHI_SSH_USER / KEY / PORT from install time, and
uninstall.sh reads it on startup. The cluster key was distributed at install
(or bootstrapped via --ssh-bootstrap), so password auth isn't
required. Pass --ssh-user=… only if you want to override
the persisted value.
Stops + disables every elchi-* unit, removes unit files,
binaries, the operator helper, the nginx vhost, the journald drop-in,
the managed /etc/hosts block, and reverts firewall ports.
Mongo / VictoriaMetrics / Grafana data + secrets + TLS material are
preserved unless you add a --purge* flag.
Reads /etc/elchi/nodes.list on M1, SSHes into every M2..Mn
using the SSH credentials saved at install time, and runs the local
uninstall on each. Order is reverse-by-design (Mn first, M1 last) so
shared state on M1 is dropped only after the dependents are gone.
Add --continue-on-error if you want partial-cluster
uninstall to finish all reachable nodes instead of aborting on the
first SSH failure.
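The reverse ordering and the --continue-on-error behaviour can be sketched as follows. It assumes nodes.list is ordered M1..Mn and uses a caller-supplied run callback in place of the real SSH step; both names are hypothetical.

```python
def uninstall_all(nodes, run, continue_on_error=False):
    """Run the per-node uninstall from Mn down to M1.

    nodes -- node addresses in M1..Mn order (assumed order of nodes.list)
    run   -- callable invoked with one node; raises on failure
    """
    done, failed = [], []
    for node in reversed(nodes):  # Mn first, M1 last
        try:
            run(node)
        except Exception:
            failed.append(node)
            if not continue_on_error:
                raise  # abort on the first failure
            continue
        done.append(node)
    return done, failed
```

With continue_on_error=True an unreachable M2 is recorded in failed while M3 and M1 are still processed; without it the first failure stops the run before M1 is touched.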
--purge-all is destructive and irreversible: drops
Mongo + VictoriaMetrics + Grafana + nginx packages, deletes
/var/lib/{mongodb,grafana,elchi}, removes the cluster
SSH key + known_hosts pin + our authorized_keys entry, and clears the
CA we added to the system trust store. Combine with
--all-nodes only when you genuinely want a clean slate
across the whole fleet.
--purge
/etc/elchi, /var/lib/elchi, /var/log/elchi, /opt/elchi, system trust-store anchors, and SSH bootstrap material (cluster key, known_hosts.elchi, our authorized_keys entry).
--purge-mongo
/var/lib/mongodb + repo files. Implies --purge.
--purge-vm
--purge.
--purge-grafana
/var/lib/grafana + repo files. Implies --purge.
--purge-nginx
nginx.conf backup. Implies --purge.
--purge-all
--all-nodes
/etc/elchi/nodes.list (M1 last, in reverse, so shared state is dropped last).
--continue-on-error
--ssh-user=<user> --ssh-key=<path> --ssh-port=<n>
--all-nodes.
--yes-i-mean-it
--non-interactive purge.
-h | --help
Default uninstall is non-destructive: services stop, unit files / binaries / installer
payload / nginx vhost / journald drop-in / firewall ports / managed
/etc/hosts block all go. Mongo, VictoriaMetrics, Grafana data + secrets
+ TLS material are preserved unless you opt in via the matching
--purge* flag.
/etc/elchi/validate.sh: per-node post-install audit
Read-only. The installer drops this on EVERY node so you can confirm the install end-to-end without leaning on the orchestrator. Run it on each machine after install (or any time you want a sanity check):
What it walks (in order):
runs_mongo, runs_otel, …), the backend_variants set.
elchi-* unit + mongod / grafana-server / nginx (where present). active = ✓, activating = warning (still coming up), failed or inactive = ✗. Watchdog timer state checked separately.
ss -lntp against the expected per-node set + M1 singletons + per-variant control-plane ports from ports.full.json. Flags any M1-only port that shows up on Mn (and vice-versa).
mongod ping via mongosh, VictoriaMetrics /api/v1/query, Grafana /api/health, otelcol health extension on :13133.
53/tcp + 53/udp (DNS) and :8053 (health/metrics) on every node where the plugin is enabled. Flags missing binds; the typical cause is systemd-resolved still holding :53.
net.core.somaxconn, vm.max_map_count, file-max), Transparent Huge Pages disabled (elchi-thp.service), swap behaviour (vm.swappiness), MongoDB LimitNOFILE drop-in, Envoy LimitNOFILE ≥ 1048576.
/ready, /clusters health flags for every cluster (registry / controller-rest / otel / grafana / victoriametrics + every per-node controller and control-plane), and /listeners bind verification for the public TLS listener and the loopback internal listener.
envoy.yaml, topology.full.yaml, ports.full.json, nodes.list, tls/server.crt. Compare hashes by hand across nodes to confirm the bundle distributed cleanly.
/etc/elchi/<variant>/ dir or elchi-control-plane-<sanitized>@.service unit whose variant tag isn't in the topology's backend_variants list. (install.sh auto-prunes these on its next run, so this should normally show "no stale variants".)
Output is colored, with a final
PASS / WARN / FAIL count. Exit code is non-zero on any
FAIL, which makes it friendly to ssh node N -- 'sudo /etc/elchi/validate.sh'
in a loop.
Why per-node? The installer renders Envoy bootstrap + bundle on M1 and SCPs to Mn, so drift between nodes is the most common "weird symptom" cause. Running validate on each box and diffing the sha256 lines surfaces it in one shell command.
elchi-stack: operator helper (/usr/local/bin/elchi-stack)
elchi-stack status
systemctl is-active for every elchi-* unit).
elchi-stack logs <unit> [-f]
-f follows.
elchi-stack reload-envoy
elchi-stack add-node <ip>
/etc/hosts + Envoy bootstrap to all peers.
elchi-stack init-replica-set
rs.initiate() on M1 (idempotent: checks rs.status() first).
elchi-stack mongo-status
rs.status() snapshot: PRIMARY identification, per-member state / health / uptime / replication lag / lastHeartbeatMessage. Recovery hint surfaces when NotYetInitialized (gate dropped between phase 1 and rs.initiate()).
elchi-stack clickhouse-status
system.clusters member health, elchi database engine (Replicated in cluster mode), replicated tables.
elchi-stack mongosh [args...]
/etc/elchi/mongo/root.env. Pass any further mongosh args (--eval 'rs.status()', scripts, etc.).
elchi-stack ch-client [args...]
elchi user from secrets.env.
elchi-stack ssh <node>
/etc/elchi/orchestrator.env; no flags needed.
elchi-stack stack-version
topology.full.yaml + binary versions on this host (mongo, clickhouse, envoy, otel, grafana, nginx, elchi-collector, backend variants).
elchi-stack tls-info
elchi-stack endpoint-test
/, VictoriaMetrics /api/v1/query, Grafana /grafana/api/health, internal plaintext listener :8080/. Prints HTTP status for each.
elchi-stack collector-stats
:18091/metrics): events received / dropped, active ALS streams, batcher queue depth, ClickHouse rows inserted, ClickHouse / Mongo errors, flush count, pipeline panics. Best-effort; missing metrics render as 0.
elchi-stack verify
/clusters health flags + Envoy /listeners public listener bind + ClickHouse Keeper leader probe + (M1-side) mongo RS PRIMARY probe.
elchi-stack export-bundle <out> [--reuse-bundle-key]
--reuse-bundle-key reuses the install-time key persisted via systemd-creds at /etc/elchi/.bundle-key so the bundle can be reapplied without redistributing a fresh key.
elchi-stack show-secret <name>
name is one of grafana (UI admin login), jwt (backend API auth), gslb (CoreDNS plugin → backend auth), mongo-app, mongo-root, clickhouse (CH user/pwd), collector (HASH_SALT; never rotatable, rotating breaks event correlation), bundle-key (re-decrypts /etc/elchi/.bundle-key when sealed), or all (full table). Persisted in /etc/elchi/secrets.env (mode 0600 root:root); preserved across re-runs and upgrades.
elchi-stack rotate-secret <jwt|gslb|grafana>
common.env and pushed cluster-wide via SSH; Grafana password gets re-applied via grafana-cli on M1 only (singleton).
--port)
Envoy internal | 127.0.0.1:8080 | Plaintext loopback (UI/API to backend)
Envoy admin | 127.0.0.1:9901 | Hardcoded loopback only
nginx (UI) | 127.0.0.1:8081 | SPA + config.js, fronted by Envoy
Registry gRPC | 0.0.0.0:1870 | HA peer set on every node; Envoy gRPC HC picks the leader
Registry metrics | :9091 | Hardcoded in backend; OTel scrape target
Controller REST | :1980 | Singleton per node (uses versions[0] binary)
Controller gRPC | :1960 | Singleton per node
Control-plane | :1990, 1991, … | One port per variant by 0-indexed list position; same variant gets same port on every node
MongoDB | :27017 | Standalone for 1-2 VM topology, RS-3 for 3+
Grafana | 127.0.0.1:3000 | M1 only; reverse-proxied at /grafana/
VictoriaMetrics | 0.0.0.0:8428 | M1 only (with --vm=local)
OTel gRPC | :4317 | Every node (per-node sink for envoy /opentelemetry); each collector remote-writes to M1 VM
OTel HTTP | :4318 | Every node
OTel health | :13133 | Every node
OTel prom | :8888 | Every node (collector self-metrics)
CoreDNS | :53/tcp+udp | Every node (only with --gslb)
CoreDNS webhook | 0.0.0.0:8053 | M1 → M2/M3 push notifications (X-Elchi-Secret auth)
ClickHouse native | 0.0.0.0:9000 | CH server TCP wire protocol; cluster member on every node
ClickHouse HTTP | 0.0.0.0:8123 | CH HTTP interface; used by collector + backend for queries
ClickHouse interserver | 0.0.0.0:9009 | Inter-replica replication (only on 3+ node clusters)
ClickHouse Keeper | 0.0.0.0:9181 | Embedded Raft coordination client port (3+ node clusters only)
ClickHouse Keeper Raft | 0.0.0.0:9234 | Keeper inter-peer Raft consensus traffic (3+ node)
elchi-collector gRPC | 0.0.0.0:18090 | ALS sink; Envoy data-plane proxies push Access Log Service streams here
elchi-collector HTTP | 0.0.0.0:18091 | Prometheus /metrics + health endpoint
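The control-plane port rule in the listing above (one port per variant by 0-indexed list position, starting at 1990) can be sketched like this. The constant and function names are hypothetical, and the sketch assumes nothing beyond that rule.

```python
CONTROL_PLANE_BASE_PORT = 1990  # first control-plane port in the port listing


def control_plane_ports(variants):
    """Map each variant tag to base port + its 0-indexed list position."""
    return {tag: CONTROL_PLANE_BASE_PORT + index for index, tag in enumerate(variants)}
```

Because the port depends only on list position, every node that receives the same ordered variant list computes the same mapping.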
1 VM: all-in-one. 2 VM: Mongo standalone on M1; M2
connects over LAN. 3+ VM: Mongo replica set across the first 3 nodes;
additional nodes (4+) run no mongod. Registry runs on every node with HA leader
election (Mongo lease, TTL 30s, renew 10s). UI/Envoy/backend run on every node: each
node's front-door Envoy round-robins UI traffic across all peers' nginx instances and
uses ext_proc + the registry to decide which control-plane / controller to route each
request to (x-target-cluster header).
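The node-count rule for MongoDB described above can be written as a small function. This is a sketch of the stated rule only (standalone on M1 below three nodes, replica set on the first three nodes otherwise); the function name and the returned dictionary shape are hypothetical.

```python
def mongo_layout(nodes):
    """Decide where mongod runs for a given ordered node list (M1 first)."""
    if not nodes:
        raise ValueError("at least one node is required")
    if len(nodes) < 3:
        # 1 or 2 VMs: standalone on M1; a second VM connects over the LAN.
        return {"mode": "standalone", "members": [nodes[0]]}
    # 3+ VMs: replica set on the first three nodes; nodes 4+ run no mongod.
    return {"mode": "replica-set", "members": list(nodes[:3])}
```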
OTEL collector on every node. Each node ships its own
otelcol-contrib instance bound to 0.0.0.0:4317/4318;
that node's Envoy routes /opentelemetry traffic to
127.0.0.1:4317 (no cross-node hop). All collectors export to the
singleton VictoriaMetrics on M1, or to --vm-endpoint when
--vm=external. Failure mode: M1 OTEL outage no longer cascades to
M2/M3 envoys, and the per-node collector's sending_queue buffers
writes if the VM is briefly unreachable.
Storage tier stays on M1: VictoriaMetrics TSDB and Grafana UI
are still singletons. With --vm=external the TSDB moves out
entirely; Grafana stays on M1.
ClickHouse cluster on every node. 1-2 node installs run
clickhouse-server standalone. 3+ node installs run a clustered CH
with embedded Keeper on each member: the elchi database is created
as ENGINE = Replicated('/clickhouse/databases/elchi', '{shard}', '{replica}'),
so the cluster-unaware collector's plain CREATE TABLE DDL is
auto-promoted to ReplicatedMergeTree and tables replicate across
all members via Keeper. Each node's collector writes to 127.0.0.1
(the Replicated engine handles fan-out), which keeps a 2 → 3 node growth from
creating a peer-DDL race.
elchi-collector on every node. The collector ingests the Envoy
ALS (Access Log Service) gRPC stream from data-plane proxies on its local
:18090 and writes events to local ClickHouse
(via loopback) + MongoDB. /metrics on :18091 is
scraped by the per-node OTEL collector. ClickHouse replication carries the
rows cluster-wide; cross-node ALS routing is not needed.
Every install lands a production tuning baseline. The defaults below come from the upstream production checklists for Envoy, MongoDB, and the Linux kernel; they're not opinionated guesses but the values these projects explicitly call out.
/etc/sysctl.d/99-elchi-stack.conf)
/etc/systemd/system/mongod.service.d/10-elchi.conf)
Mongo's package unit ships almost no resource limits; we override:
Plus a one-shot elchi-disable-thp.service
(Before=mongod.service) that writes
never to
/sys/kernel/mm/transparent_hugepage/{enabled,defrag}.
THP-induced khugepaged compaction is the most common cause of
second-scale latency spikes in WiredTiger.
Every elchi-* unit (envoy, otel, victoriametrics, grafana, registry, controller, control-plane@, coredns) ships with a uniform hardening set:
NoNewPrivileges=true, PrivateTmp=true
ProtectSystem=strict, ProtectHome=true, ReadWritePaths= minimum
ProtectKernelTunables/Modules/ControlGroups/Logs=true
ProtectClock=true, ProtectHostname=true, ProtectProc=invisible, ProcSubset=pid
RestrictSUIDSGID=true, LockPersonality=true, RestrictRealtime=true, RestrictNamespaces=true
SystemCallArchitectures=native, KeyringMode=private, RemoveIPC=yes, UMask=0077
CapabilityBoundingSet= (drop ALL), except Envoy + CoreDNS keep CAP_NET_BIND_SERVICE for :443 / :53
Per-service resource limits:
ELCHI_ENVOY_NOFILE): front-door scale needs 1M FDs
control-plane / controller / registry | LimitNOFILE=65536, LimitNPROC=65536, LimitMEMLOCK=64M | gRPC fan-in
otel / victoriametrics / coredns | LimitNOFILE=65536, LimitNPROC=65536/4096 | local sink + TSDB + DNS
grafana-server (drop-in) | LimitNOFILE=65536, LimitNPROC=4096, MemoryMax=1G | UI; not in hot path
Before any side-effect, preflight::check_ram_swap warns if
total system RAM is below 4 GB and if any swap is active. Both are
soft warnings on a normal install; set
ELCHI_REQUIRE_HEALTHY=1 to escalate to fatal. To remove
swap permanently:
Verifying the hardening landed: run
sudo /etc/elchi/validate.sh on every node. §8 "System tuning"
checks somaxconn, vm.max_map_count,
vm.swappiness, fs.file-max,
THP state, swap state, mongo's LimitNOFILE/MEMLOCK, and
envoy's LimitNOFILE.
Every setup module uses hash-based reconcile
(systemd::install_and_apply for elchi-* units;
systemd::reconcile_external for grafana-server / mongod / nginx).
The fingerprint = sha256(unit_file + EnvironmentFile contents + ExecStart binary)
and is persisted at /var/lib/elchi/.unit-fingerprint/<unit>.
Decision matrix on rerun:
restart | start | noop (zero downtime) | start (crash recovery)
Binary downloads keep a .prev snapshot for rollback. upgrade.sh
fails closed if any node fails the deep-health gate; per-binary rollback is automatic.
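A minimal sketch of the hash-based reconcile, assuming the three fingerprint inputs are simply concatenated before hashing. The conditions attached to each outcome are an inferred reading of the decision matrix, not a statement of the installer's logic, and both function names are hypothetical.

```python
import hashlib


def unit_fingerprint(unit_file: bytes, env_contents: bytes, binary: bytes) -> str:
    """sha256 over unit file, EnvironmentFile contents and ExecStart binary."""
    digest = hashlib.sha256()
    for part in (unit_file, env_contents, binary):
        digest.update(part)  # assumption: plain concatenation, no separators
    return digest.hexdigest()


def reconcile_action(stored, current, is_active):
    """Pick an action from the stored fingerprint, the new one, and unit state."""
    if stored is None:
        return "start"      # assumed: no fingerprint on disk means a new unit
    if stored != current:
        return "restart"    # binary or config changed
    if is_active:
        return "noop"       # unchanged and running: zero downtime
    return "start"          # unchanged but not running: crash recovery
```

A rerun with identical inputs produces the same digest, so a healthy unit falls through to noop and keeps running.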
Ubuntu 22.04 + 24.04 · Debian 12 · RHEL / Rocky / Alma / Oracle 9. amd64 only (arm64 lands when upstream backend ships arm64 binaries).
MongoDB 8.0 is the cluster-wide canonical default and pins the apt/yum floor: Debian 11 (bullseye) and Ubuntu 20.04 (focal) are dropped because Mongo 8.0 has no apt repo for them. RHEL / Rocky / Alma / Oracle 8 is dropped on a separate axis: the systemd hardening directives we rely on (ProtectKernelLogs, ProtectClock, ProcSubset, …) require systemd ≥ 247, and RHEL 8 ships an older systemd. EL10 (Rocky/Alma/CentOS-Stream 10) is also not accepted yet: MongoDB has not published the el10 server RPMs (the el10 repo ships only client tools), so the bundled local mongod can't be installed there. The pre-flight homogeneity check refuses heterogeneous clusters (mixed major / family / arch) upfront so version drift can't sneak in via a re-imaged node.
Bring up the entire elchi stack on Docker Swarm with a single
docker stack deploy, online or fully offline
(docker save/docker load). It reuses the pre-built
jhonbrownn/* images (the same ones the Helm chart consumes); third-party
services (MongoDB, ClickHouse, VictoriaMetrics, Grafana, OpenTelemetry, Envoy) use
their official upstream images; nothing is built locally. Source lives at
deploy/docker/;
the installer is unversioned (always runs from the main branch), and component
image tags are pinned per-flag, defaulting from
deploy/docker/versions.env.
Prerequisites: none beyond a Linux host. get.sh
auto-installs whatever's missing: Docker Engine (via the official
get.docker.com), plus curl/tar/gzip/openssl
(needs root, i.e. sudo). The install command runs on a
single machine (the Swarm manager) and initializes Swarm
automatically. For multi-node HA, join the other machines with
docker swarm join first; Swarm then distributes the containers itself
(no SSH fan-out, unlike the bare-metal installer).
Initializes Swarm if needed, mints secrets, generates a self-signed cert, renders every
config and deploys the elchi stack. MongoDB + ClickHouse run standalone
(single-node). Prints the UI / Grafana URLs and the Grafana password when done:
Image tags default from versions.env: UI v1.4.6, backend
v1.4.9-v0.14.0-envoy1.36.2, coredns v0.1.4, collector
v0.1.8, plus official mongo:8.0 /
clickhouse/clickhouse-server / victoriametrics /
grafana / otel-collector-contrib / envoyproxy/envoy.
Each variant gets its own control-plane service + Envoy cluster. The embedded envoy version must be unique per variant:
Publish the edge on a non-standard port (TLS stays on); --port=80 implies
plaintext unless TLS is forced. --dry-run renders config + the stack file
into the state dir without deploying; inspect ~/.elchi-docker/gen/:
Every flag below has an env-var equivalent; CLI flags win. Run
install.sh --help for the canonical list.
--main-address=<dns|ip>
API_URL. Use a real DNS name if you want ACME / browser-trusted certs.
--port=<n>
443. 80 implies plaintext unless TLS is set explicitly.
--ui-port=<n>
80.
--backend-version=<csv>
jhonbrownn/elchi-backend tags, comma-separated. One control-plane service + Envoy cluster per variant; the embedded envoy version must be unique. Default: v1.4.9-v0.14.0-envoy1.36.2 (from versions.env).
--ui-version=<tag>
jhonbrownn/elchi) tag. Default: v1.4.6.
--coredns-version=<tag>
jhonbrownn/elchi-coredns) tag. Default: v0.1.4.
--collector-version=<tag>
jhonbrownn/elchi-collector) tag. Default: v0.1.8.
--image-repo=<repo>
jhonbrownn. Point at a private mirror (e.g. a local registry:2) for air-gapped multi-node.
--tls=self-signed|provided
self-signed: a 10-year ECDSA-P256 cert with --main-address + elchi-envoy + loopback as SANs, mounted into Envoy as a Docker secret.
--cert=<path> --key=<path>
--tls=provided: your own cert/key files.
--no-gslb
--gslb-zone=<domain>
elchi.local.
--gslb-publish
:53 on the host (Swarm ingress). Off by default to avoid clashing with the host resolver; GSLB is reachable on the overlay regardless.
--gslb-forwarders=<csv>
8.8.8.8,8.8.4.4.
--gslb-regions=<csv>
--no-collector
--mongo=local|external
local (a container with a scoped elchi app user). For external, also pass --mongo-uri= (collector) and --mongo-hosts= --mongo-username= --mongo-password= --mongo-database= --mongo-replicaset= (backend).
--clickhouse=local|external
local. External: --clickhouse-uri=clickhouse://user:pass@host:9000/elchi.
--vm=local|external
local. External: --vm-endpoint=<url|host:port> (used by OTel remote-write + Grafana datasource).
--grafana-user=<u>
admin (set at first DB init).
--grafana-password=<p>
__FILE). Default: random elchi-<hex>, printed in the install summary.
--enable-demo
window.APP_CONFIG.ENABLE_DEMO).
--log-level=<level>
info.
--nodes=<csv>
node<i>-* in the Envoy config. Default: single node (this host).
--ssh-user=<user>
root (non-root needs passwordless sudo).
--ssh-key=<path>
--ssh-password=<pwd>
sshpass).
--no-ssh
docker swarm join command).
There are NO storage / HA flags. MongoDB / ClickHouse clustering is derived purely from the node count, exactly like the bare-metal installer:
--nodes host is always M1 (VictoriaMetrics + Grafana).
--offline=<tarball>
docker load a save-images.sh bundle before deploy and resolve images never (air-gapped).
--stack-name=<name>
elchi.
--placement-m1="<expr>"
--nodes host (node.hostname == …), or node.role == manager for a single host.
--state-dir=<path>
~/.elchi-docker. Must persist (Grafana bind-mounts dashboards from here).
--dry-run
--non-interactive
All services share the elchi-net overlay network and address each other by
Swarm service DNS (tasks.<service>).
elchi-envoy
envoyproxy/envoy · global - edge L7 router + TLS, publishes :<port>.
elchi-registry
elchi-backend · global - xDS routing / ext_proc target (leader-elected via Mongo).
elchi-controller-node<i>
elchi-backend - REST + gRPC API; one per node (version-agnostic singleton).
elchi-cp-<envoy>-node<i>
elchi-backend - control-plane (xDS); one per node per variant, addressable as node<i>-controlplane-<X.Y.Z> in the Envoy config.
elchi-ui
jhonbrownn/elchi · global - SPA (nginx); config.js injected.
elchi-mongo[-1..3]
mongo:8.0 - single instance, or a 3-member replica set automatically at 3+ nodes.
elchi-clickhouse[-1..3]
clickhouse-server - event store; single instance, or a Keeper cluster automatically at 3+ nodes.
elchi-victoriametrics
victoria-metrics - metrics TSDB (M1).
elchi-grafana
grafana/grafana - served at /grafana/ (M1).
elchi-otel
otel-collector-contrib · global - per-node metrics sink.
elchi-collector
jhonbrownn/elchi-collector · global - Envoy ALS → ClickHouse.
elchi-coredns
jhonbrownn/elchi-coredns · global - GSLB DNS (optional).
On a machine with internet, save the pinned image set to a single tarball (honours the same version flags so the bundle matches your install):
Copy both the image tarball and the installer (a
git clone of this repo, or its main tarball) to the air-gapped
host; it has no internet access, so it cannot bootstrap via curl. Run the
transferred installer with --offline (it runs docker load on the
images and deploys with --resolve-image=never):
Multi-node air-gapped: docker load the bundle on every node,
or run a throwaway registry:2, push the loaded images there and pass
--image-repo=<registry>:<port>.
Run it once on M1 (the first --nodes host). Like the
standalone installer, M1 SSHes into the other nodes, installs Docker, joins them to the
Swarm (logging each step), then deploys: no manual join, no HA flags. At 3+ nodes the
first 3 form the storage cluster; the first node is M1:
Open the Swarm ports between nodes (2377/tcp, 7946/tcp+udp,
4789/udp). SSH auto-join is idempotent; use --no-ssh to join the
workers yourself.
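As a rough illustration of the port list above, the sketch below prints one firewall rule per peer node and Swarm port. It assumes ufw as the host firewall, which this document does not prescribe, and swarm_firewall_rules is a hypothetical helper name. The rules are only printed, so they can be reviewed before being run with sudo.

```shell
# Hypothetical helper: prints (does not execute) ufw rules that open the
# Swarm ports listed above to each peer node passed as an argument.
# ufw is an assumption; translate to your own firewall tooling as needed.
swarm_firewall_rules() {
  for peer in "$@"; do
    for rule in 2377/tcp 7946/tcp 7946/udp 4789/udp; do
      port=${rule%/*}    # text before the slash
      proto=${rule#*/}   # text after the slash
      echo "ufw allow proto $proto from $peer to any port $port"
    done
  done
}

# Example: rules to apply on one node for two peers (placeholder IPs).
swarm_firewall_rules 10.0.0.2 10.0.0.3
```

Each peer yields four rules: 2377/tcp, 7946/tcp, 7946/udp and 4789/udp.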
With 3+ nodes the stateful tier automatically becomes:
a MongoDB replica set (3 pinned services elchi-mongo-1..3
with keyfile auth; member-1 retries rs.initiate until all members are up,
then creates the scoped app user), and a ClickHouse Keeper cluster
(3 servers with embedded Raft; the Replicated elchi database is created
post-deploy so it's never accidentally created as a plain Atomic DB). Stateless services
(envoy / otel / collector / coredns / registry) run as Swarm global services
on every node. Verified live: RS forms PRIMARY + 2 SECONDARY with app
auth + writes; ClickHouse reports Replicated engine on all members with a
healthy Keeper quorum.
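The node-count rule described above can be sketched as a small shell function. This is not the installer's code: storage_plan is a hypothetical helper that only mirrors the documented behaviour (the first node is M1; with 3 or more nodes, the first 3 form the storage cluster).

```shell
# Hypothetical helper mirroring the documented rule, not the installer itself.
# Input: a comma-separated node list, same shape as --nodes=<csv>.
storage_plan() {
  # Split the CSV into positional parameters.
  old_ifs=$IFS
  IFS=','
  set -- $1
  IFS=$old_ifs

  echo "m1=$1"
  if [ "$#" -ge 3 ]; then
    # 3+ nodes: the first three carry the MongoDB replica set and ClickHouse Keeper cluster.
    echo "storage=cluster"
    echo "storage_nodes=$1,$2,$3"
  else
    # 1 or 2 nodes: single MongoDB / ClickHouse instance.
    echo "storage=single"
  fi
}

# Placeholder IPs for illustration.
storage_plan "10.0.0.1,10.0.0.2,10.0.0.3,10.0.0.4"
```

For the four-node example, the function reports 10.0.0.1 as M1 and the first three addresses as the storage cluster; the fourth node would run only the stateless global services.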
Point the stack at managed/external datastores instead of running them in-cluster:
install.sh is fully idempotent (secrets preserved; configs are
content-hashed, so a re-render → new name → Swarm rolling update). upgrade.sh
is the same flow re-run with new --*-version flags:
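To illustrate why content-hashed config names lead to a rolling update, here is a minimal sketch. The naming scheme (logical name plus the first 12 hex digits of the SHA-256) and the config_name helper are assumptions for illustration; the installer's actual scheme is not shown in this document. Swarm configs are immutable, so a changed name is what makes a service pick up new content.

```shell
# Hypothetical naming scheme: <logical-name>-<first 12 hex chars of sha256>.
config_name() {
  # $1 = logical config name, $2 = path to the rendered file
  hash=$(sha256sum "$2" | cut -c1-12)
  echo "$1-$hash"
}

tmp=$(mktemp)

# First render.
printf 'log_level: info\n' > "$tmp"
before=$(config_name elchi-envoy "$tmp")

# Re-render with different content: the derived name changes too.
printf 'log_level: debug\n' > "$tmp"
after=$(config_name elchi-envoy "$tmp")

echo "$before -> $after"
rm -f "$tmp"
```

Re-rendering identical content yields the same name, which is consistent with a re-run of install.sh being a no-op for unchanged configs.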
Full multi-node wipe: pass the same --nodes so M1 SSHes
into the workers (reusing the bootstrap key) to delete THEIR data volumes and make every
node leave the Swarm:
Worker data volumes are node-local; without --nodes the teardown only
cleans this node (the stack itself is removed everywhere by docker stack rm).
--leave-swarm dissolves the Swarm (every node leaves).
<--port> (443)
8080
1870 / 9091
1980 / 1960
1990+
27017
9000 / 8123 / 9181 / 9234
4317 / 4318 / 13133
8428 / 3000
18090 / 18091
53 / 8053
A separate render layer (deploy/docker/lib/render.sh) mirrors the config
shapes of deploy/standalone/ but with these deliberate Docker
divergences:
tasks.<service>), not /etc/hosts aliases; Envoy clusters are STRICT_DNS.
CONTROLLER_ID / CONTROL_PLANE_ID) pinned to DNS-safe service names, so the x-target-cluster header matches the generated Envoy cluster names.
elchi-envoy:8080); this keeps traffic on the overlay and avoids shipping the self-signed CA into every backend container.
~/.elchi-docker and must persist, because Grafana bind-mounts its dashboards from there (pinned to the manager).
--main-address is a real public DNS name with a reachable :443; self-signed (the default) is the safe choice otherwise.
node_ip is set to --main-address (an overlay container can't learn its host's external IP); true multi-region GSLB needs host-network CoreDNS.
jhonbrownn/elchi-* images are linux/amd64, so deploy on an amd64 Docker host (Swarm won't schedule them on arm64 nodes).
Install the elchi-client agent on a remote VM / bare-metal host so the
node registers itself with an existing elchi-stack control plane and starts
receiving Envoy / xDS configuration. Run on the TARGET host (the one that will
serve traffic), NOT on the control-plane M1.
Prerequisites: a running elchi-stack (Helm or Bare-Metal); the
control plane's --main-address reachable over HTTPS; a valid
auth token (min 8 chars) minted from the UI (Settings → Tokens).
--cloud=other)
If you are deploying on OpenStack, pass --cloud=<your-cloud-name>
exactly as it appears in the UI's Settings → Clouds list; the
client uses this to look up the right metadata service / region.
--enable-bgp additionally installs and configures FRR so the node
can announce VIPs / advertise prefixes for the routes elchi-control-plane
pushes via BGP. Without it the client only manages Envoy data-plane config.
--name=<NAME>
web-server-01, edge-router-01,
db-replica-2).
--host=<HOST>
--main-address in the standalone install or the public LB in Helm.
--port=<PORT>
443.
Valid range 1-65535.
--tls=true|false
true.
Only set to false on a dev / inside-VPC plaintext install.
--token=<TOKEN>
--name.
--cloud=<CLOUD>
other. Use the exact name from
Settings → Clouds in the UI when running on OpenStack /
AWS / GCP / etc.; the client uses this to pull region + metadata from
the right provider plugin.
--enable-bgp
frr package and writes
/etc/frr/frr.conf with a baseline session config.
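The constraints stated above (port in 1-65535, --tls limited to true or false, token of at least 8 characters) can be checked before invoking the installer. The function below is a hypothetical pre-flight check, not part of the elchi-client installer; it covers only the rules documented here.

```shell
# Hypothetical pre-flight check for the documented client flag constraints.
# Usage: validate_client_args <port> <tls> <token>
validate_client_args() {
  port=$1
  tls=$2
  token=$3

  # --port must be an integer in 1-65535.
  case $port in
    ''|*[!0-9]*) echo "port must be numeric" >&2; return 1 ;;
  esac
  if [ "$port" -lt 1 ] || [ "$port" -gt 65535 ]; then
    echo "port out of range 1-65535" >&2
    return 1
  fi

  # --tls accepts only true or false.
  case $tls in
    true|false) ;;
    *) echo "tls must be true or false" >&2; return 1 ;;
  esac

  # --token must be at least 8 characters long.
  if [ "${#token}" -lt 8 ]; then
    echo "token must be at least 8 characters" >&2
    return 1
  fi
}
```

A wrapper script could call validate_client_args 443 true "$TOKEN" and abort on a non-zero exit status before running the installer.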
When you'd rather skip the installer wrapper and just drop the binary in place (e.g., from a config-management system or a custom systemd unit):
Verify checksum:
wget https://github.com/CloudNativeWorks/elchi-archive/releases/download/elchi-client-v1.1.0/elchi-client-linux-amd64.sha256 \
  && sha256sum -c elchi-client-linux-amd64.sha256
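For readers unfamiliar with sha256sum -c, the sketch below shows how the check behaves for an intact and a modified file. It uses a local stand-in file rather than the real binary, and assumes GNU coreutils (for the --status option, which suppresses output and reports the result through the exit code only).

```shell
# Stand-in demo: the file below is NOT the real elchi-client binary.
workdir=$(mktemp -d)
printf 'stand-in payload\n' > "$workdir/elchi-client-linux-amd64"

# Publisher side: record the digest next to the file.
(cd "$workdir" && sha256sum elchi-client-linux-amd64 > elchi-client-linux-amd64.sha256)

# Consumer side: exit code 0 means the digest matches.
intact=0
(cd "$workdir" && sha256sum --status -c elchi-client-linux-amd64.sha256) || intact=$?

# Simulate corruption, then verify again: the exit code becomes non-zero.
printf 'extra bytes\n' >> "$workdir/elchi-client-linux-amd64"
tampered=0
(cd "$workdir" && sha256sum --status -c elchi-client-linux-amd64.sha256) || tampered=$?

echo "intact=$intact tampered=$tampered"
rm -rf "$workdir"
```

The check only works when the file named inside the .sha256 file sits in the current directory, so download the binary alongside the checksum before running sha256sum -c.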
Pre-compiled Envoy binaries for Linux AMD64 and ARM64 architectures.
elchi-backend control-plane binaries for Linux AMD64. One binary per control-plane / Envoy variant - pick the one that matches your Envoy fleet.
elchi UI static distribution bundle (index.html + assets), served by nginx on bare-metal.
Compiled WASM modules for Coraza Web Application Firewall.
Pre-built Elchi Client binaries and installation scripts.
CoreDNS with Elchi GSLB plugin - GSLB-like DNS resolution with in-memory caching and webhook-based updates.
elchi-collector - ingests the Envoy ALS (Access Log Service) gRPC stream into ClickHouse and MongoDB.