Elchi Archive

Pre-compiled binaries for Envoy, Coraza WASM, and Elchi Client

🚀 Quick Install Elchi Stack (Kubernetes via kind + Helm)

Single-VM Ubuntu 24.04 install that brings up a kind cluster and deploys the Helm chart:

                    Loading latest install script...
                

Example: Loading...

Loading download link...

⚙️ Bare-Metal Standalone Install (no Docker, no Kubernetes)

Install the entire elchi stack as systemd services on 1, 2, or 3+ Linux VMs. The script runs once on the first node ("M1", the local machine) and SSHes into the rest to provision them. Source lives at deploy/standalone/; no separate release tarball — the installer is unversioned and always runs from the main branch. Component versions (elchi-backend, UI, envoy, coredns) are pinned per-flag.

1. Quick start

Single VM (all-in-one)

Without --nodes the installer defaults to a single-VM setup on this machine (auto-detects the first non-loopback IPv4 from hostname -I):

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/standalone/get.sh \ | sudo bash -s -- \ --main-address=elchi.example.com

Versions default from deploy/standalone/lib/versions.sh: UI v1.4.4, backend elchi-v1.4.8-v0.14.0-envoy1.36.2, envoy v1.36.2, coredns v0.1.4, collector v0.1.8. Override any with the matching --ui-version= / --backend-version= / --envoy-version= / --coredns-version= / --collector-version= flag.

3-VM cluster, multi-version backend, key-based SSH

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/standalone/get.sh \ | sudo bash -s -- \ --nodes=10.10.10.2,10.10.10.3,10.10.10.4 \ --ssh-user=ubuntu --ssh-key=/root/.ssh/cluster_key \ --main-address=elchi.example.com \ --gslb-zone=gslb.example.com \ --backend-version=elchi-v1.4.8-v0.14.0-envoy1.35.3,elchi-v1.4.8-v0.14.0-envoy1.36.2,elchi-v1.4.8-v0.14.0-envoy1.38.0

GSLB zone: CoreDNS GSLB plugin is enabled by default. If you skip --gslb-zone=..., the installer falls back to elchi.local — a non-routable .local-style namespace safe for internal clusters / testing. Pass --gslb-zone=<your-delegated-domain> for a real authoritative deployment, or --no-gslb to skip the plugin entirely.

Post-install UI activation (required for GSLB to actually serve records). The installer ships and boots the CoreDNS daemon (TCP/UDP :53, webhook :8053), but the backend-side configuration that the plugin polls for the authoritative snapshot is OFF until you turn it on in the UI:

Open the UI → Settings → GSLB.
Toggle Enable GSLB.
Set DNS Zone (the same value you passed as --gslb-zone; warning: zone cannot be changed later without a re-install).
Paste the DNS Secret. Grab it on M1 with sudo elchi-stack show-secret gslb — this is the X-Elchi-Secret the plugin uses to authenticate its /dns/snapshot poll to the backend, and the values MUST match.
Click Update Configuration. Within one --gslb-sync-interval (default 1 min) every node's CoreDNS plugin pulls the fresh snapshot and starts answering queries.

Verify after activation: dig @<node-ip> <zone> SOA +short on any node should return the SOA record. If it doesn't, journalctl -u elchi-coredns -n 50 and the plugin /health endpoint on 127.0.0.1:8053 will say why (auth failure / snapshot poll error).

Variants & replicas: each --backend-version entry is ONE variant. The number of variants determines how many backend processes per node: 3 variants = 1 controller + 3 control-planes per node (one control-plane per Envoy version). Same variant cannot appear twice — duplicates collide on the registry name <hostname>-controlplane-<X.Y.Z> and the installer rejects them. Capacity scales by adding nodes, not by replicating a variant on the same node.

3-VM cluster, no SSH key set up yet (interactive bootstrap)

When no --ssh-key / --ssh-password is passed and --nodes includes remote hosts, the installer auto-enables --ssh-bootstrap and prompts for each remote node's password once. Single-VM installs (only the local IP in --nodes) skip SSH provisioning entirely.

--ssh-bootstrap mints a fresh ed25519 key on M1, then prompts the operator once per remote node for that node's password. Each password is used only for that node's ssh-copy-id and is discarded immediately after. M1 itself is local — no password prompt for it. Subsequent SSH (orchestration, upgrades, uninstall) all use the generated key.

Dedicated admin user (default-on). By default the bootstrap also creates a key-only, passwordless-sudo admin user (elchi-cluster-admin) on every node — including M1 — and locks all subsequent orchestration to that identity. After the first install, the operator can lock root's password, disable root SSH login, or delete the root account; upgrade and uninstall keep working because they run as elchi-cluster-admin with sudo. Override the name with --admin-user=<name> or opt out with --no-admin-user (legacy: orchestration stays on root).

2. `install.sh` — full flag reference

Every variant tag in --backend-version is a full release-asset name (elchi-vX.Y.Z-vA.B.C-envoyP.Q.R), downloaded from the public elchi-archive releases — mirrored there from the private elchi-backend repo by the build-elchi-backend workflow. Multiple variants are comma-separated and each gets its own systemd template unit + /etc/elchi/<variant>/ config dir + /var/lib/elchi/<variant>/ HOME dir.

Topology & SSH

--nodes=<csv>

Comma-separated host list, M1 first. REQUIRED. M1 is the local machine; M2..Mn are reached over SSH.

--ssh-user=<user>

SSH login on M2..Mn. Default: root.

--ssh-port=<n>

SSH port. Default: 22.

--ssh-key=<path>

Private key for non-interactive auth (recommended for production).

--ssh-password=<pwd>

Password fallback (uses sshpass). Avoid for production.

--ssh-bootstrap

Mint an ed25519 key on M1 and copy it to every remote node. Prompts INTERACTIVELY for each remote node's password (M1 skipped). Subsequent SSH uses the generated key; passwords are discarded.

--admin-user=<name>

Default: elchi-cluster-admin (active by default). Dedicated admin user provisioned on every node during the first bootstrap. Runs useradd + /etc/sudoers.d/10-elchi-admin (NOPASSWD) + drops the cluster pubkey into the new user's authorized_keys, then flips the orchestrator's SSH user to it (persisted to /etc/elchi/orchestrator.env). After this, the cluster does NOT depend on the initial login user (root): the operator can rotate root's password, disable root SSH login, or even delete root entirely without breaking upgrade or uninstall. Idempotent on rerun.

--no-admin-user

Opt OUT of the default admin-user flow. Orchestration stays on the original login user (root). Use only when your environment forbids provisioning users (rare).

Versioning

--backend-version=<csv>

One or more variant tags (release-asset basenames). Each variant runs side-by-side. Default: elchi-v1.4.8-v0.14.0-envoy1.36.2 (from lib/versions.sh). Alias: --backend-variants=.

--backend-release=<tag>

Deprecated — release tag is now derived per variant from the asset name. Accepted but ignored.

--ui-version=<vX.Y.Z>

UI bundle version (elchi-dist-vX.Y.Z.tar.gz), downloaded from the public elchi-archive releases — mirrored from the private elchi repo by the build-elchi-ui workflow. Default: v1.4.4.

--envoy-version=<vX.Y.Z>

Front-door Envoy proxy binary version. Default: v1.37.0.

--coredns-version=<vX.Y.Z>

Custom CoreDNS-with-elchi-plugin version (used only with --gslb). Default: v0.1.3. v0.1.1 shipped without the elchi plugin compiled in — pinning to that version causes Unknown directive 'elchi'.

--collector-version=<vX.Y.Z>

elchi-collector binary version (ALS gRPC sink → ClickHouse / Mongo). Default: v0.1.8. Mirrored to the public elchi-archive releases by the build-elchi-collector workflow.

--no-collector

Skip the elchi-collector install entirely (cluster runs without ALS ingestion; envoy data-plane logs are not captured).

Backend instance count per node

Replica count is fixed by design — there is no flag to tune it:

Controller — exactly ONE per node. Version-agnostic singleton; uses versions[0]'s binary. Registers as bare <hostname>.
Control-plane — exactly ONE per (node, variant). Total per node = number of variants. Each registers as <hostname>-controlplane-<envoy-X.Y.Z>.

Capacity for a different Envoy version → add another variant tag. Capacity for the same Envoy version → add another node. Running the same variant twice on the same host would collide on the registry name and is rejected by topology compute.

Network & TLS

--main-address=<dns|ip>

Public address — REQUIRED. Cert SAN. Use a DNS name with A records pointing at every node IP for round-robin, or a single VIP.

--port=<n>

Public HTTPS port; Envoy terminates TLS here. Default: 443.

--hostnames=<csv>

Extra cert SANs (e.g. each node's hostname).

--tls=self-signed|provided

TLS mode. Default: self-signed (10-year ECDSA-P256 generated by openssl).

--cert=<path>

PEM cert (with --tls=provided).

--key=<path>

PEM private key (with --tls=provided).

--ca=<path>

Optional CA bundle for client trust verification.

--timezone=<tz>

TZ env var written into every elchi-* unit. Default: UTC.

MongoDB

--mongo=local|external

Use bundled mongod (1 VM standalone / 2 VM standalone on M1 / 3+ VM RS-3) or operator-supplied URI. Default: local.

--mongo-uri=<uri>

Full mongodb[+srv]://... for --mongo=external; granular flags below win on conflicts.

--mongo-version=auto|6.0|7.0|8.0

Mongo major. auto picks the highest version supported on the detected distro. Default: auto.

--mongo-hosts=<csv>

External: host1:port1,host2:port2,...

--mongo-username=<user>

External: app user.

--mongo-password=<pwd>

External: app password.

--mongo-database=<name>

App DB name. Default: elchi.

--mongo-scheme=mongodb|mongodb+srv

URI scheme override. Default: mongodb.

--mongo-port=<n>

Per-host port (used when granular hosts list omits explicit port). Default: 27017.

--mongo-replicaset=<name>

External RS name. Local mode uses elchi-rs.

--mongo-tls=true|false

TLS to external mongo. Default: false.

--mongo-auth-source=<db>

Default: admin.

--mongo-auth-mechanism=<mech>

e.g. SCRAM-SHA-256. Empty = backend default.

--mongo-timeout-ms=<ms>

Server-selection timeout. Default: 9000.

--mongo-data-dir=<path>

Local mode data dir. Default: /var/lib/mongodb.

VictoriaMetrics

--vm=local|external

Bundle a VM instance on M1 or use external endpoint. Default: local.

--vm-endpoint=<url|host:port>

Required when --vm=external.

--vm-data-dir=<path>

Local TSDB path. Default: /var/lib/elchi/victoriametrics.

--vm-retention=<dur>

Storage retention period. Default: 15d.

Grafana

--grafana-user=<user>

Admin login. Default: elchi.

--grafana-password=<pwd>

Admin password. Default: random (printed in summary). Reset is forced via grafana-cli on every install since bare-metal persists /var/lib/grafana.

--grafana-allow-plugin=<csv>

Allow-list of unsigned plugin IDs. Pass once per plugin or comma-separated. Locks plugin admin closed by default; passing this flag opens it.

GSLB / CoreDNS plugin

--gslb

Default: ON. CoreDNS GSLB plugin installs on every node (port 53 TCP+UDP, webhook on 8053). The flag is a no-op when default is already on; kept for explicitness.

--no-gslb

Opt out of the GSLB CoreDNS install entirely.

--gslb-zone=<domain>

Authoritative zone (e.g. gslb.example.com). Default: elchi.local — a non-routable .local-style domain that's safe out of the box for internal cluster DNS / testing. Override with your own delegated domain in production.

--gslb-admin-email=<email>

Optional. Default: hostmaster@<zone> (RFC 2142 convention). Becomes the SOA RNAME (with @ → .). Override if your DNS contact is different.

--gslb-nameservers=<csv>

ns1:ip,ns2:ip,... NS records + glue.

--gslb-regions=<csv>

Region tags for the regions directive.

--gslb-tls-skip-verify

Skip TLS verify when plugin polls backend /dns/snapshot over HTTPS.

--gslb-ttl=<sec>

Default record TTL. Default: 300.

--gslb-sync-interval=<dur>

Backend snapshot poll interval. Default: 1m.

--gslb-timeout=<dur>

Snapshot HTTP timeout. Default: 4s.

--gslb-static-records=<csv>

Inline static A/AAAA/CNAME records.

--gslb-secret=<value>

Override the auto-generated X-Elchi-Secret shared secret.

--gslb-forwarders=<csv>

Recursive resolvers for non-zone queries. Default: 8.8.8.8,8.8.4.4.

Backend behavior & JWT

--internal-communication=true|false

Use internal addresses for inter-service traffic. Default: false.

--cors-origins=<csv>

Backend CORS allow-list. Default: *.

--jwt-access-duration=<dur>

Access token lifetime. Default: 1h.

--jwt-refresh-duration=<dur>

Refresh token lifetime. Default: 5h.

--enable-demo

Backend demo mode (read-only sample data).

--log-level=<level>

Backend log level. Default: info.

--log-format=text|json

Default: text.

Op-mode

--non-interactive

Never prompt; fail if a confirmation would be required.

--no-firewall

Skip firewalld/ufw port opening.

--dry-run

Render config + topology to /tmp/elchi-dryrun-*; skip every side-effect (no SSH, no SCP, no binary download, no service start).

--force-redownload

Bypass sha256 cache; re-download every binary even when local matches upstream.

--keep-bundle

Preserve the encrypted handoff bundle artifact at /tmp/ after orchestration. Also preserves the orchestrator-staged secrets-from file so the operator can re-decrypt mid-incident.

--bundle-key-out=<path>

Write the bundle decryption key to a file (mode 0600).

--quiet-key

Suppress the plaintext bundle-key emission at end of install; only a sha256 fingerprint is shown. Full key is persisted at /etc/elchi/.bundle-key (sealed via systemd-creds when available) and is recoverable via elchi-stack show-secret bundle-key. Use this when capturing install logs to artifact storage so the key doesn't leak into screen recordings / tmux scrollback / CI logs.

-h | --help

Print full usage and exit.

Internal (set automatically by the orchestrator)

--skip-orchestration

Run the local-installer path on this single node; topology + secrets must be supplied via --bundle.

--node-index=<n>

1-based position of this node in the cluster.

--bundle=<path>

Encrypted bundle path on disk.

--bundle-key=<hex>

Bundle decryption key.

3. `upgrade.sh` — version-diff upgrade

Run on M1. Computes the diff against the running cluster (added / kept / removed variants) and re-runs install.sh with the union. Every elchi-* systemd unit goes through hash-based reconcile (install_and_apply / reconcile_external) so binary or config changes trigger a restart; unchanged services stay running. Single-flight via flock /run/elchi-upgrade.lock.

No SSH flags needed after install. install.sh persists ELCHI_SSH_USER / KEY / PORT to /etc/elchi/orchestrator.env (mode 0600 root). Re-run upgrade or uninstall without --ssh-user / --ssh-key / --ssh-port and they'll fall back to the persisted values. Pass a flag explicitly to override (e.g. you switched the deploy user). Password is never persisted — key auth only, since --ssh-bootstrap already distributed the cluster key during install.

Add a new variant alongside what's already running (additive)

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/standalone/get.sh \ | sudo bash -s -- --upgrade \ --add-backend-version=elchi-v1.4.8-v0.14.0-envoy1.37.0

--add-backend-version appends to the current variant set without making you re-list everything that's already deployed. The new variant gets a fresh control-plane systemd unit + binary on every node, ports are allocated deterministically, and the UI's config.js AVAILABLE_VERSIONS regenerates so the new envoy version shows up in the UI version dropdown automatically.

Bump just the UI (sadece UI güncelleme)

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/standalone/get.sh \ | sudo bash -s -- --upgrade --ui-version=v1.1.6

Backend / envoy / coredns are not touched — every other component's fingerprint stays identical so install.sh's reconcile marks them as noop. Only nginx may restart if its config block changes.

Bump just CoreDNS (only when GSLB is enabled)

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/standalone/get.sh \ | sudo bash -s -- --upgrade --coredns-version=v0.1.4

Replace the variant set explicitly (full union)

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/standalone/get.sh \ | sudo bash -s -- --upgrade \ --backend-version=elchi-v1.4.8-v0.14.0-envoy1.36.2,elchi-v1.4.8-v0.14.0-envoy1.37.0

Replace a variant + drop the old one (declarative)

Bump the UI + Envoy proxy together

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/standalone/get.sh \ | sudo bash -s -- --upgrade \ --ui-version=v1.1.4 \ --envoy-version=v1.38.0

Rotate the Grafana admin password

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/standalone/get.sh \ | sudo bash -s -- --upgrade \ --grafana-password=$(openssl rand -hex 16)

Each form fans out the new artifacts to every node in /etc/elchi/nodes.list using the persisted SSH credentials, runs the hash-based reconcile, and gates the result through verify::deep_health on every node. A failed health check triggers per-binary rollback to the .prev snapshot on the bad nodes; healthy nodes keep the new version.

Flags

--backend-version=<csv>

New variant set (replaces current). Omit to keep the current set.

--add-backend-version=<csv>

Additive: appends to the current variant set. Use when you want to start serving an additional Envoy version without re-listing what's already there. Mutually exclusive with --prune-version / --prune-missing. Triggers control-plane unit creation + UI config.js regeneration cluster-wide.

--ui-version=<vX.Y.Z>

Bump UI bundle.

--envoy-version=<vX.Y.Z>

Bump front-door Envoy.

--coredns-version=<vX.Y.Z>

Bump CoreDNS plugin (only with GSLB enabled).

--mongo-version=auto|6.0|7.0|8.0

Forwarded to install.sh; package upgrade if differs.

--grafana-user=<user>

Rotate Grafana admin login.

--grafana-password=<pwd>

Rotate Grafana admin password (also resets via grafana-cli).

--prune-version=<tag>

Remove this specific variant after install. Repeatable / csv. Mutually exclusive with --prune-missing.

--prune-missing

Declarative mode — remove every CURRENT variant that isn't in the new --backend-version list.

--ssh-user=<user>

Override SSH credentials saved at install time.

--ssh-key=<path>

Override SSH key path.

--ssh-port=<n>

Override SSH port.

--skip-health-gate

Bypass post-upgrade verify::deep_health. Faster but unsafer; only use when verify itself is the problem.

-h | --help

Print full usage and exit.

Scenarios & behavior

Add a new variant alongside existing ones

sudo upgrade.sh \ --backend-version=elchi-v1.4.8-v0.14.0-envoy1.36.2,elchi-v1.4.8-v0.14.0-envoy1.37.0

KEPT=[v1.2.0/envoy1.36.2], ADDED=[v1.2.0/envoy1.37.0]. The kept variant's fingerprint is unchanged → noop. The added variant gets a fresh template unit + per-instance envs + binary download + start. Port allocations for the kept variant stay stable (deterministic offset from variant position).

Update an existing variant (replace one tag with another)

sudo upgrade.sh \ --backend-version=elchi-v1.4.8-v0.14.0-envoy1.37.0 \ --prune-version=elchi-v1.4.8-v0.14.0-envoy1.36.2

install.sh re-runs with [new]; controller's ExecStart now points at new binary → fingerprint diff → restart. Then prune step removes the old variant's unit, binary, .prev snapshot, /etc/elchi/<old>/, /var/lib/elchi/<old>/, fingerprint files, and ports.json entry. /etc/hosts + Envoy bootstrap are re-rendered with the new variant set.

Both at once (declarative)

sudo upgrade.sh \ --backend-version=elchi-v1.4.8-v0.14.0-envoy1.37.0,elchi-v1.4.8-v0.14.0-envoy1.38.0 \ --prune-missing

KEPT=∅, ADDED=[v1.2.0/envoy1.37.0, v1.2.0/envoy1.38.0], REMOVED=[everything in CUR not in new list]. Same install→prune flow.

Health gate & rollback

After install.sh finishes, every node runs verify::deep_health: systemd state + journalctl registration log + Envoy admin /listeners bind check. A failure triggers per-binary rollback on the failed nodes (.prev snapshot → restart). Healthy nodes keep the new version; the operator retries against the bad node.

Orphan warning: if you call upgrade.sh --backend-version=B against a cluster that has [A] running, and you don't pass --prune-version=A or --prune-missing, the installer keeps A running and emits a warning. Variants are never silently removed.

4. `uninstall.sh` — remove the stack

Two ways to invoke. Both routes hit the same script — pick whichever matches how you installed.

No SSH flags needed for --all-nodes. Same persistence trick as upgrade: /etc/elchi/orchestrator.env holds the ELCHI_SSH_USER / KEY / PORT from install time, and uninstall.sh reads it on startup. The cluster key was distributed at install (or bootstrapped via --ssh-bootstrap), so password auth isn't required. Pass --ssh-user=… only if you want to override the persisted value.

Single node (this machine only)

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/standalone/get.sh \ | sudo bash -s -- --uninstall --yes-i-mean-it

Stops + disables every elchi-* unit, removes unit files, binaries, the operator helper, the nginx vhost, the journald drop-in, the managed /etc/hosts block, and reverts firewall ports. Mongo / VictoriaMetrics / Grafana data + secrets + TLS material are preserved unless you add a --purge* flag.

Whole cluster — fan out from M1 to every node

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/standalone/get.sh \ | sudo bash -s -- --uninstall --all-nodes --yes-i-mean-it

Reads /etc/elchi/nodes.list on M1, SSHes into every M2..Mn using the SSH credentials saved at install time, and runs the local uninstall on each. Order is reverse-by-design (Mn first, M1 last) so shared state on M1 is dropped only after the dependents are gone. Add --continue-on-error if you want partial-cluster uninstall to finish all reachable nodes instead of aborting on the first SSH failure.

Wipe everything (data + packages + secrets + SSH bootstrap material)

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/standalone/get.sh \ | sudo bash -s -- --uninstall --all-nodes --purge-all --yes-i-mean-it

--purge-all is destructive and irreversible: drops Mongo + VictoriaMetrics + Grafana + nginx packages, deletes /var/lib/{mongodb,grafana,elchi}, removes the cluster SSH key + known_hosts pin + our authorized_keys entry, and clears the CA we added to the system trust store. Combine with --all-nodes only when you genuinely want a clean slate across the whole fleet.

Flags

--purge

Wipe /etc/elchi, /var/lib/elchi, /var/log/elchi, /opt/elchi, system trust-store anchors, and SSH bootstrap material (cluster key, known_hosts.elchi, our authorized_keys entry).

--purge-mongo

Also remove mongo packages + /var/lib/mongodb + repo files. Implies --purge.

--purge-vm

Wipe VictoriaMetrics data directory. Implies --purge.

--purge-grafana

Remove grafana package + /var/lib/grafana + repo files. Implies --purge.

--purge-nginx

Remove nginx package + restore the original nginx.conf backup. Implies --purge.

--purge-all

All purge flags above.

--all-nodes

Fan out to every node from /etc/elchi/nodes.list (M1 last, in reverse, so shared state is dropped last).

--continue-on-error

Don't abort on per-node failure; collect errors and print a summary at the end. Non-zero overall exit if any node failed.

--ssh-user=<user> --ssh-key=<path> --ssh-port=<n>

Forwarded to ssh::configure for --all-nodes.

--yes-i-mean-it

Skip the destructive-action confirmation. Required for --non-interactive purge.

-h | --help

Print usage banner extracted from the script header.

Default uninstall is non-destructive: services stop, unit files / binaries / installer payload / nginx vhost / journald drop-in / firewall ports / managed /etc/hosts block all go. Mongo, VictoriaMetrics, Grafana data + secrets + TLS material are preserved unless you opt in via the matching --purge* flag.

5. `/etc/elchi/validate.sh` — per-node post-install audit

Read-only. The installer drops this on EVERY node so you can confirm the install end-to-end without leaning on the orchestrator. Run it on each machine after install (or any time you want a sanity check):

sudo /etc/elchi/validate.sh

What it walks (in order):

Topology context — this node's index, role flags (runs_mongo, runs_otel, …), the backend_variants set.
Systemd — every elchi-* unit + mongod / grafana-server / nginx (where present). active = ✓, activating = warning (still coming up), failed or inactive = ✗. Watchdog timer state checked separately.
Listening ports — compares ss -lntp against the expected per-node set + M1 singletons + per-variant control-plane ports from ports.full.json. Flags any M1-only port that shows up on Mn (and vice-versa).
Service health — mongod ping via mongosh, VictoriaMetrics /api/v1/query, Grafana /api/health, otelcol health extension on :13133.
CoreDNS GSLB ports — checks 53/tcp + 53/udp (DNS) and :8053 (health/metrics) on every node where the plugin is enabled. Flags missing binds — typical cause is systemd-resolved still holding :53.
System tuning — sysctl drop-in present (net.core.somaxconn, vm.max_map_count, file-max), Transparent Huge Pages disabled (elchi-thp.service), swap behaviour (vm.swappiness), MongoDB LimitNOFILE drop-in, Envoy LimitNOFILE ≥ 1048576.
Envoy admin — /ready, /clusters health flags for every cluster (registry / controller-rest / otel / grafana / victoriametrics + every per-node controller and control-plane), and /listeners bind verification for the public TLS listener and the loopback internal listener.
Config integrity — sha256 of envoy.yaml, topology.full.yaml, ports.full.json, nodes.list, tls/server.crt. Compare hashes by hand across nodes to confirm the bundle distributed cleanly.
Stale variant detection — flags any /etc/elchi/<variant>/ dir or elchi-control-plane-<sanitized>@.service unit whose variant tag isn't in the topology's backend_variants list. (install.sh auto-prunes these on its next run, so this should normally show "no stale variants".)

Output is colored, with a final PASS / WARN / FAIL count. Exit code is non-zero on any FAIL — friendly to ssh node N -- 'sudo /etc/elchi/validate.sh' in a loop.

Why per-node? The installer renders Envoy bootstrap + bundle on M1 and SCPs to Mn — drift between nodes is the most common "weird symptom" cause. Running validate on each box and diffing the sha256 lines surfaces it in one shell command.

6. `elchi-stack` — operator helper (`/usr/local/bin/elchi-stack`)

elchi-stack status

Cluster-wide service summary (each node's systemctl is-active for every elchi-* unit).

elchi-stack logs <unit> [-f]

Tail journalctl for the named unit on every node. -f follows.

elchi-stack reload-envoy

Re-render Envoy bootstrap + restart Envoy on every node (after a topology change).

elchi-stack add-node <ip>

Extend the cluster: provision the new node with the existing bundle, recompute topology, push updated /etc/hosts + Envoy bootstrap to all peers.

elchi-stack init-replica-set

Run rs.initiate() on M1 (idempotent — checks rs.status() first).

elchi-stack mongo-status

M1-only: rs.status() snapshot — PRIMARY identification, per-member state / health / uptime / replication lag / lastHeartbeatMessage. Recovery hint surfaces when NotYetInitialized (gate dropped between phase 1 and rs.initiate()).

elchi-stack clickhouse-status

Local ClickHouse + Keeper diagnostics: version, uptime, Keeper quorum reachability, system.clusters member health, elchi database engine (Replicated in cluster mode), replicated tables.

elchi-stack mongosh [args...]

Open mongosh against the local mongod, authenticated as root via /etc/elchi/mongo/root.env. Pass any further mongosh args (--eval 'rs.status()', scripts, etc.).

elchi-stack ch-client [args...]

Open clickhouse-client against the local server, authenticated as the elchi user from secrets.env.

elchi-stack ssh <node>

Open an SSH session to a cluster node using the persisted cluster credentials from /etc/elchi/orchestrator.env — no flags needed.

elchi-stack stack-version

One-screen "what's actually installed on this node" report: cluster-wide pins from topology.full.yaml + binary versions on this host (mongo, clickhouse, envoy, otel, grafana, nginx, elchi-collector, backend variants).

elchi-stack tls-info

Cluster TLS cert summary: subject, SAN list, validity window, days-to-expiry (colored red < 30, yellow < 90), sha256 fingerprint.

elchi-stack endpoint-test

Round-trip probe through the public Envoy: UI /, VictoriaMetrics /api/v1/query, Grafana /grafana/api/health, internal plaintext listener :8080/. Prints HTTP status for each.

elchi-stack collector-stats

Per-node elchi-collector metrics summary (from :18091/metrics): events received / dropped, active ALS streams, batcher queue depth, ClickHouse rows inserted, ClickHouse / Mongo errors, flush count, pipeline panics. Best-effort — missing metrics render as 0.

elchi-stack verify

End-to-end cluster health: per-node systemd + Envoy admin /clusters health flags + Envoy /listeners public listener bind + ClickHouse Keeper leader probe + (M1-side) mongo RS PRIMARY probe.

elchi-stack export-bundle <out> [--reuse-bundle-key]

Repackage cluster artifacts into an encrypted bundle. --reuse-bundle-key reuses the install-time key persisted via systemd-creds at /etc/elchi/.bundle-key so the bundle can be reapplied without redistributing a fresh key.

elchi-stack show-secret <name>

Print a stored credential without rotating it. name is one of grafana (UI admin login), jwt (backend API auth), gslb (CoreDNS plugin ↔ backend auth), mongo-app, mongo-root, clickhouse (CH user/pwd), collector (HASH_SALT — never rotatable; rotating breaks event correlation), bundle-key (re-decrypts /etc/elchi/.bundle-key when sealed), or all (full table). Persisted in /etc/elchi/secrets.env (mode 0600 root:root); preserved across re-runs and upgrades.

elchi-stack rotate-secret <jwt|gslb|grafana>

Mint a new JWT, GSLB, or Grafana admin password. JWT/GSLB get re-rendered into every variant's common.env and pushed cluster-wide via SSH; Grafana password gets re-applied via grafana-cli on M1 only (singleton).

7. Port atlas

Envoy public0.0.0.0:443/tcpTLS, every node (configurable via --port) Envoy internal127.0.0.1:8080Plaintext loopback (UI/API to backend) Envoy admin127.0.0.1:9901Hardcoded loopback only nginx (UI)127.0.0.1:8081SPA + config.js, fronted by Envoy Registry gRPC0.0.0.0:1870HA peer set on every node; Envoy gRPC HC picks the leader Registry metrics:9091Hardcoded in backend; OTel scrape target Controller REST:1980Singleton per node (uses versions[0] binary) Controller gRPC:1960Singleton per node Control-plane:1990, 1991, …One port per variant by 0-indexed list position; same variant gets same port on every node MongoDB:27017Standalone for 1-2 VM topology, RS-3 for 3+ Grafana127.0.0.1:3000M1 only; reverse-proxied at /grafana/ VictoriaMetrics0.0.0.0:8428M1 only (with --vm=local) OTel gRPC:4317Every node (per-node sink for envoy /opentelemetry); each collector remote-writes to M1 VM OTel HTTP:4318Every node OTel health:13133Every node OTel prom:8888Every node (collector self-metrics) CoreDNS:53/tcp+udpEvery node (only with --gslb) CoreDNS webhook0.0.0.0:8053M1 → M2/M3 push notifications (X-Elchi-Secret auth) ClickHouse native0.0.0.0:9000CH server TCP wire protocol; cluster member on every node ClickHouse HTTP0.0.0.0:8123CH HTTP interface; used by collector + backend for queries ClickHouse interserver0.0.0.0:9009Inter-replica replication (only on 3+ node clusters) ClickHouse Keeper0.0.0.0:9181Embedded Raft coordination client port (3+ node clusters only) ClickHouse Keeper Raft0.0.0.0:9234Keeper inter-peer Raft consensus traffic (3+ node) elchi-collector gRPC0.0.0.0:18090ALS sink — Envoy data-plane proxies push Access Log Service streams here elchi-collector HTTP0.0.0.0:18091Prometheus /metrics + health endpoint

8. Topology

1 VM: all-in-one. 2 VM: Mongo standalone on M1; M2 connects over LAN. 3+ VM: Mongo replica set across the first 3 nodes; additional nodes (4+) run no mongod. Registry runs on every node with HA leader election (Mongo lease, TTL 30s, renew 10s). UI/Envoy/backend run on every node — each node's front-door Envoy round-robins UI traffic across all peers' nginx instances and uses ext_proc + the registry to decide which control-plane / controller to route each request to (x-target-cluster header).

OTEL collector on every node. Each node ships its own otelcol-contrib instance bound to 0.0.0.0:4317/4318; that node's Envoy routes /opentelemetry traffic to 127.0.0.1:4317 (no cross-node hop). All collectors export to the singleton VictoriaMetrics on M1 — or to --vm-endpoint when --vm=external. Failure mode: M1 OTEL outage no longer cascades to M2/M3 envoys, and the per-node collector's sending_queue buffers writes if the VM is briefly unreachable.

Storage tier stays on M1: VictoriaMetrics TSDB and Grafana UI are still singletons. With --vm=external the TSDB moves out entirely; Grafana stays on M1.

ClickHouse cluster on every node. 1-2 node installs run clickhouse-server standalone. 3+ node installs run a clustered CH with embedded Keeper on each member: the elchi database is created as ENGINE = Replicated('/clickhouse/databases/elchi', '{shard}', '{replica}'), so the cluster-unaware collector's plain CREATE TABLE DDL is auto-promoted to ReplicatedMergeTree and tables replicate across all members via Keeper. Each node's collector writes to 127.0.0.1 — the Replicated engine handles fan-out — which keeps a 2 → 3 node growth from creating a peer-DDL race.

elchi-collector on every node. The collector ingests the Envoy ALS (Access Log Service) gRPC stream from data-plane proxies on its local :18090 and writes events to local ClickHouse (via loopback) + MongoDB. /metrics on :18091 is scraped by the per-node OTEL collector. ClickHouse replication carries the rows cluster-wide; cross-node ALS routing is not needed.

9. Production hardening (kernel + systemd)

Every install lands a production tuning baseline. The defaults below come from the upstream production checklists for Envoy, MongoDB, and the Linux kernel — they're not opinionated guesses, they're the values these projects explicitly call out.

Kernel sysctl (/etc/sysctl.d/99-elchi-stack.conf)

net.core.somaxconn65535listen() backlog ceiling for Envoy + nginx + grpc net.core.netdev_max_backlog10000NIC RX queue per CPU net.ipv4.ip_local_port_range10240-65535~55K ephemeral ports for Envoy upstream + mongo failover churn net.ipv4.tcp_tw_reuse1reuse TIME_WAIT sockets (RFC 6191 safe) net.ipv4.tcp_fin_timeout15recycle FIN_WAIT2 (default 60) net.ipv4.tcp_keepalive_time120detect dead peers in 2min, not 2hr net.ipv4.tcp_syncookies1SYN flood protection (explicit) fs.file-max2097152system-wide FD ceiling above any LimitNOFILE vm.swappiness1never page out unless OOM (mongo prerequisite) vm.max_map_count262144WiredTiger mmap regions fs.inotify.max_queued_events65536event queue depth (default 16384) fs.inotify.max_user_instances8192RHEL 9 default 128 — too low for VM/Grafana/mongo together fs.inotify.max_user_watches524288RHEL 9 default 8192; Ubuntu already 524288 user.max_inotify_instances8192per-userns (Linux 5.11+); default 128 user.max_inotify_watches524288per-userns; default 65536

MongoDB systemd drop-in (/etc/systemd/system/mongod.service.d/10-elchi.conf)

Mongo's package unit ships almost no resource limits; we override:

LimitNOFILE64000file per collection + index + cursor + connection LimitNPROC64000WiredTiger thread pool + connection pool LimitMEMLOCKinfinityrequired to silence the "ulimit -l too low" warnings OOMScoreAdjust-1000never let mongod be the OOM victim (lowest oom_score) TasksMaxinfinitycgroup task limit (default ~4915 on RHEL is not enough)

Plus a one-shot elchi-disable-thp.service (Before=mongod.service) that writes never to /sys/kernel/mm/transparent_hugepage/{enabled,defrag}. THP-induced khugepaged compaction is the most common cause of second-scale latency spikes in WiredTiger.

Per-service systemd hardening

Every elchi-* unit (envoy, otel, victoriametrics, grafana, registry, controller, control-plane@, coredns) ships with a uniform hardening set:

NoNewPrivileges=true, PrivateTmp=true
ProtectSystem=strict, ProtectHome=true, ReadWritePaths= minimum
ProtectKernelTunables/Modules/ControlGroups/Logs=true
ProtectClock=true, ProtectHostname=true, ProtectProc=invisible, ProcSubset=pid
RestrictSUIDSGID=true, LockPersonality=true, RestrictRealtime=true, RestrictNamespaces=true
SystemCallArchitectures=native, KeyringMode=private, RemoveIPC=yes, UMask=0077
CapabilityBoundingSet= (drop ALL) — except Envoy + CoreDNS keep CAP_NET_BIND_SERVICE for :443 / :53

Per-service resource limits:

elchi-envoyLimitNOFILE=1048576(override ELCHI_ENVOY_NOFILE) — front-door scale needs 1M FDs control-plane / controller / registryLimitNOFILE=65536, LimitNPROC=65536, LimitMEMLOCK=64MgRPC fan-in otel / victoriametrics / corednsLimitNOFILE=65536, LimitNPROC=65536/4096local sink + TSDB + DNS grafana-server (drop-in)LimitNOFILE=65536, LimitNPROC=4096, MemoryMax=1GUI; not in hot path

Preflight RAM/swap checks

Before any side-effect, preflight::check_ram_swap warns if total system RAM is below 4 GB and if any swap is active. Both are soft warnings on a normal install; set ELCHI_REQUIRE_HEALTHY=1 to escalate to fatal. To remove swap permanently:

sudo swapoff -a sudo sed -i.bak '/\sswap\s/d' /etc/fstab

Verifying the hardening landed: run sudo /etc/elchi/validate.sh on every node. §8 "System tuning" checks somaxconn, vm.max_map_count, vm.swappiness, fs.file-max, THP state, swap state, mongo's LimitNOFILE/MEMLOCK, and envoy's LimitNOFILE.

10. Idempotency & reconcile

Every setup module uses hash-based reconcile (systemd::install_and_apply for elchi-* units; systemd::reconcile_external for grafana-server / mongod / nginx). The fingerprint = sha256(unit_file ‖ EnvironmentFile contents ‖ ExecStart binary) and is persisted at /var/lib/elchi/.unit-fingerprint/<unit>. Decision matrix on rerun:

Fingerprint changed + active → restart
Fingerprint changed + inactive → start
Fingerprint same + active → noop (zero downtime)
Fingerprint same + inactive → start (crash recovery)

Binary downloads keep a .prev snapshot for rollback. upgrade.sh fails closed if any node fails the deep-health gate; per-binary rollback is automatic.

11. Supported distros

Ubuntu 22.04 + 24.04 · Debian 12 · RHEL / Rocky / Alma / Oracle 9. amd64 only (arm64 lands when upstream backend ships arm64 binaries).

MongoDB 8.0 is the cluster-wide canonical default and pins the apt/yum floor — Debian 11 (bullseye) and Ubuntu 20.04 (focal) are dropped because Mongo 8.0 has no apt repo for them. RHEL / Rocky / Alma / Oracle 8 is dropped on a separate axis: the systemd hardening directives we rely on (ProtectKernelLogs, ProtectClock, ProcSubset, …) require systemd ≥ 247, which RHEL 8 ships older. EL10 (Rocky/Alma/CentOS-Stream 10) is also not accepted yet — MongoDB has not published the el10 server RPMs (the el10 repo ships only client tools), so the bundled local mongod can't be installed there. The pre-flight homogeneity check refuses heterogeneous clusters (mixed major / family / arch) upfront so version drift can't sneak in via a re-imaged node.

📖 Full README 💻 Source

🐳 Docker Swarm Install (containers, no Kubernetes)

Bring up the entire elchi stack on Docker Swarm with a single docker stack deploy — online or fully offline (docker save/docker load). It reuses the pre-built jhonbrownn/* images (the same ones the Helm chart consumes); third-party services (MongoDB, ClickHouse, VictoriaMetrics, Grafana, OpenTelemetry, Envoy) use their official upstream images — nothing is built locally. Source lives at deploy/docker/; the installer is unversioned (always runs from the main branch) — component image tags are pinned per-flag, defaulting from deploy/docker/versions.env.

Prerequisites: none beyond a Linux host. get.sh auto-installs whatever's missing — Docker Engine (via the official get.docker.com), plus curl/tar/gzip/openssl (needs root, i.e. sudo). The install command runs on a single machine (the Swarm manager) and initializes Swarm automatically. For multi-node HA, join the other machines with docker swarm join first — Swarm then distributes the containers itself (no SSH fan-out, unlike the bare-metal installer).

1. Quick start

Single host (all-in-one)

Initializes Swarm if needed, mints secrets, generates a self-signed cert, renders every config and deploys the elchi stack. MongoDB + ClickHouse run standalone (single-node). Prints the UI / Grafana URLs and the Grafana password when done:

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/docker/get.sh \ | sudo bash -s -- \ --main-address=<your-dns-or-ip>

Image tags default from versions.env: UI v1.4.6, backend v1.4.9-v0.14.0-envoy1.36.2, coredns v0.1.4, collector v0.1.8, plus official mongo:8.0 / clickhouse/clickhouse-server / victoriametrics / grafana / otel-collector-contrib / envoyproxy/envoy.

Multi-version backend (CSV)

Each variant gets its own control-plane service + Envoy cluster. The embedded envoy version must be unique per variant:

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/docker/get.sh \ | sudo bash -s -- \ --main-address=elchi.example.com \ --backend-version=v1.4.9-v0.14.0-envoy1.36.2,v1.4.9-v0.14.0-envoy1.35.3 \ --ui-version=v1.4.6

Custom port / plaintext

Publish the edge on a non-standard port (TLS stays on); --port=80 implies plaintext unless TLS is forced. --dry-run renders config + the stack file into the state dir without deploying — inspect ~/.elchi-docker/gen/:

# custom edge port (TLS stays on) curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/docker/get.sh \ | sudo bash -s -- --main-address=10.0.0.5 --port=8443 # render only, no deploy (inspect ~/.elchi-docker/gen/) curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/docker/get.sh \ | sudo bash -s -- --main-address=10.0.0.5 --dry-run

2. install.sh — full flag reference

Every flag below has an env-var equivalent; CLI flags win. Run install.sh --help for the canonical list.

Required & network

--main-address=<dns|ip>

Required. Public address — TLS cert SAN + the UI's API_URL. Use a real DNS name if you want ACME / browser-trusted certs.

--port=<n>

Public Envoy edge port (published via Swarm ingress). Default: 443. 80 implies plaintext unless TLS is set explicitly.

--ui-port=<n>

Internal port the UI nginx image listens on (Envoy upstream). Default: 80.

Versioning (image tags)

--backend-version=<csv>

One or more jhonbrownn/elchi-backend tags, comma-separated. One control-plane service + Envoy cluster per variant; the embedded envoy version must be unique. Default: v1.4.9-v0.14.0-envoy1.36.2 (from versions.env).

--ui-version=<tag>

UI image (jhonbrownn/elchi) tag. Default: v1.4.6.

--coredns-version=<tag>

CoreDNS GSLB image (jhonbrownn/elchi-coredns) tag. Default: v0.1.4.

--collector-version=<tag>

elchi-collector image (jhonbrownn/elchi-collector) tag. Default: v0.1.8.

--image-repo=<repo>

Docker Hub namespace / registry for the elchi images. Default: jhonbrownn. Point at a private mirror (e.g. a local registry:2) for air-gapped multi-node.

TLS

--tls=self-signed|provided

Default: self-signed — a 10-year ECDSA-P256 cert with --main-address + elchi-envoy + loopback as SANs, mounted into Envoy as a Docker secret.

--cert=<path> --key=<path>

For --tls=provided — your own cert/key files.

GSLB (CoreDNS)

--no-gslb

Disable the CoreDNS GSLB service. GSLB is ON by default.

--gslb-zone=<domain>

Authoritative zone. Default: elchi.local.

--gslb-publish

Publish CoreDNS :53 on the host (Swarm ingress). Off by default to avoid clashing with the host resolver — GSLB is reachable on the overlay regardless.

--gslb-forwarders=<csv>

Upstream resolvers for non-authoritative queries. Default: 8.8.8.8,8.8.4.4.

--gslb-regions=<csv>

Region directive for the GSLB plugin.

Collector / ClickHouse / MongoDB / VictoriaMetrics

--no-collector

Skip elchi-collector and ClickHouse (no ALS ingestion). Collector is ON by default.

--mongo=local|external

Default: local (a container with a scoped elchi app user). For external, also pass --mongo-uri= (collector) and

--mongo-hosts= --mongo-username= --mongo-password= --mongo-database=
                            --mongo-replicaset=

(backend).

--clickhouse=local|external

Default: local. External: --clickhouse-uri=clickhouse://user:pass@host:9000/elchi.

--vm=local|external

Default: local. External: --vm-endpoint=<url|host:port> (used by OTel remote-write + Grafana datasource).

Grafana / UI behaviour

--grafana-user=

Default: admin (set at first DB init).

--grafana-password=

Admin password (Docker secret via __FILE). Default: random elchi-<hex>, printed in the install summary.

--enable-demo

Enable UI demo mode (window.APP_CONFIG.ENABLE_DEMO).

--log-level=<level>

Backend + collector log level. Default: info.

Multi-node — run once on M1, SSH auto-join (standalone parity)

--nodes=<csv>

Node IPs or hostnames (the FIRST is M1, where you run the installer). M1 SSHes into the others, installs Docker + joins them to the Swarm (per-node logging), then deploys. Every node runs 1 controller + one control-plane PER variant + UI, addressable as node-* in the Envoy config. Default: single node (this host).

--ssh-user=<user>

SSH user for the other nodes. Default: root (non-root needs passwordless sudo).

--ssh-key=<path>

SSH private key for the other nodes.

--ssh-password=<pwd>

SSH password (via sshpass).

--no-ssh

Skip auto-join; join the workers yourself (M1 prints the docker swarm join command).

There are NO storage / HA flags. MongoDB / ClickHouse clustering is derived purely from the node count, exactly like the bare-metal installer:

1-2 nodes → a single mongo / clickhouse on the first node (no quorum possible).
3+ nodes → a 3-member MongoDB replica set + ClickHouse Keeper cluster on the first 3 nodes only. A 4th / 5th node runs the elchi tier and connects to that cluster over the overlay — it does not run mongo / clickhouse.
The first --nodes host is always M1 (VictoriaMetrics + Grafana).

Operational

--offline=<tarball>

docker load a save-images.sh bundle before deploy and resolve images never (air-gapped).

--stack-name=<name>

Swarm stack name. Default: elchi.

--placement-m1="<expr>"

Override the placement constraint for M1 singletons. Default: the first --nodes host (node.hostname == …), or node.role == manager for a single host.

--state-dir=<path>

Where secrets / TLS / rendered config / dashboards live. Default: ~/.elchi-docker. Must persist (Grafana bind-mounts dashboards from here).

--dry-run

Render config + the stack file only — no Swarm init, no deploy.

--non-interactive

Never prompt.

3. What runs

All services share the elchi-net overlay network and address each other by Swarm service DNS (tasks.<service>).

elchi-envoy

envoyproxy/envoy · global — edge L7 router + TLS, publishes :<port>.

elchi-registry

elchi-backend · global — xDS routing / ext_proc target (leader-elected via Mongo).

elchi-controller-node

elchi-backend — REST + gRPC API; one per node (version-agnostic singleton).

elchi-cp-<envoy>-node

elchi-backend — control-plane (xDS); one per node per variant, addressable as node-controlplane-<X.Y.Z> in the Envoy config.

elchi-ui

jhonbrownn/elchi · global — SPA (nginx); config.js injected.

elchi-mongo[-1..3]

mongo:8.0 — single instance, or a 3-member replica set automatically at 3+ nodes.

elchi-clickhouse[-1..3]

clickhouse-server — event store; single instance, or a Keeper cluster automatically at 3+ nodes.

elchi-victoriametrics

victoria-metrics — metrics TSDB (M1).

elchi-grafana

grafana/grafana — served at /grafana/ (M1).

elchi-otel

otel-collector-contrib · global — per-node metrics sink.

elchi-collector

jhonbrownn/elchi-collector · global — Envoy ALS → ClickHouse.

elchi-coredns

jhonbrownn/elchi-coredns · global — GSLB DNS (optional).

4. Offline / air-gapped

On a machine with internet, save the pinned image set to a single tarball (honours the same version flags so the bundle matches your install):

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/docker/get.sh \ | sudo bash -s -- --script=save-images.sh --output=elchi-images.tar

Copy both the image tarball and the installer (a git clone of this repo, or its main tarball) to the air-gapped host — it has no internet, so it can't bootstrap via curl. Run the transferred installer with --offline (it docker loads the images and deploys with --resolve-image=never):

sudo deploy/docker/install.sh --main-address=10.0.0.5 --offline=elchi-images.tar

Multi-node air-gapped: docker load the bundle on every node, or run a throwaway registry:2, push the loaded images there and pass --image-repo=<registry>:<port>.

5. High availability (multi-node)

Run it once on M1 (the first --nodes host). Like the standalone installer, M1 SSHes into the other nodes, installs Docker, joins them to the Swarm (logging each step), then deploys — no manual join, no HA flags. At 3+ nodes the first 3 form the storage cluster; the first node is M1:

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/docker/get.sh \ | sudo bash -s -- \ --main-address=45.13.226.177 \ --nodes=45.13.226.177,45.13.226.226,198.105.112.107 \ --ssh-key=/root/.ssh/id_rsa

Open the Swarm ports between nodes (2377/tcp, 7946/tcp+udp, 4789/udp). SSH auto-join is idempotent; use --no-ssh to join the workers yourself.

With 3+ nodes the stateful tier automatically becomes: a MongoDB replica set (3 pinned services elchi-mongo-1..3 with keyfile auth; member-1 retries rs.initiate until all members are up, then creates the scoped app user), and a ClickHouse Keeper cluster (3 servers with embedded Raft; the Replicated elchi database is created post-deploy so it's never accidentally created as a plain Atomic DB). Stateless services (envoy / otel / collector / coredns / registry) run as Swarm global services on every node. Verified live: RS forms PRIMARY + 2 SECONDARY with app auth + writes; ClickHouse reports Replicated engine on all members with a healthy Keeper quorum.

6. External MongoDB / ClickHouse / VictoriaMetrics

Point the stack at managed/external datastores instead of running them in-cluster:

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/docker/get.sh \ | sudo bash -s -- \ --main-address=elchi.example.com \ --mongo=external --mongo-uri='mongodb://u:p@m1:27017,m2:27017/?replicaSet=rs0' \ --mongo-hosts=m1:27017,m2:27017 --mongo-username=u --mongo-password=p --mongo-replicaset=rs0 \ --clickhouse=external --clickhouse-uri='clickhouse://u:p@ch:9000/elchi' \ --vm=external --vm-endpoint=vm.internal:8428

7. upgrade.sh — rolling update

install.sh is fully idempotent (secrets preserved; configs are content-hashed, so a re-render → new name → Swarm rolling update). upgrade.sh is the same flow re-run with new --*-version flags:

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/docker/get.sh \ | sudo bash -s -- --upgrade \ --main-address=elchi.example.com \ --ui-version=v1.4.6 \ --backend-version=v1.4.9-v0.14.0-envoy1.36.2

8. uninstall.sh — remove the stack

# remove the stack, keep data volumes curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/docker/get.sh \ | sudo bash -s -- --uninstall # also delete volumes + configs + secrets + state (this node) curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/docker/get.sh \ | sudo bash -s -- --uninstall --purge

Full multi-node wipe — pass the same --nodes so M1 SSHes into the workers (reusing the bootstrap key) to delete THEIR data volumes and make every node leave the Swarm:

curl -fsSL https://raw.githubusercontent.com/CloudNativeWorks/elchi-archive/main/deploy/docker/get.sh \ | sudo bash -s -- \ --uninstall --purge --leave-swarm \ --nodes=45.13.226.177,45.13.226.226,198.105.112.107

Worker data volumes are node-local — without --nodes the teardown only cleans this node (the stack itself is removed everywhere by docker stack rm). --leave-swarm dissolves the Swarm (every node leaves).

9. Port atlas

<--port> (443)

Envoy public edge (published, TLS).

8080

Envoy internal plaintext listener (registry client path + CoreDNS).

1870 / 9091

Registry gRPC / metrics.

1980 / 1960

Controller REST / gRPC.

1990+

Control-plane xDS (one per variant).

27017

MongoDB.

9000 / 8123 / 9181 / 9234

ClickHouse native / HTTP / Keeper / Raft.

4317 / 4318 / 13133

OTel gRPC / HTTP / health.

8428 / 3000

VictoriaMetrics / Grafana.

18090 / 18091

Collector gRPC (ALS) / HTTP.

53 / 8053

CoreDNS DNS / webhook.

10. How it differs from the bare-metal installer

A separate render layer (deploy/docker/lib/render.sh) mirrors the config shapes of deploy/standalone/ but with these deliberate Docker divergences:

Service discovery via Swarm DNS (tasks.<service>), not /etc/hosts aliases; Envoy clusters are STRICT_DNS.
Backend identity (CONTROLLER_ID / CONTROL_PLANE_ID) pinned to DNS-safe service names, so the x-target-cluster header matches the generated Envoy cluster names.
Registry client path uses the Envoy internal plaintext listener (elchi-envoy:8080) — keeps traffic on the overlay, avoids shipping the self-signed CA into every backend container.
Secrets: TLS, mongo root creds and the Grafana password are Docker secrets; other rendered config ships as Docker configs (content-hashed for clean rolling updates).
No SSH: Swarm distributes containers across joined nodes; the bare-metal installer SSHes from M1 and ships an encrypted bundle.

11. Notes & limitations

State (secrets / TLS / rendered config / dashboards) lives in ~/.elchi-docker and must persist — Grafana bind-mounts its dashboards from there (pinned to the manager).
ACME (Let's Encrypt) is enabled in the backend config but only works when --main-address is a real public DNS name with a reachable :443; self-signed (the default) is the safe choice otherwise.
The full Grafana dashboard JSON (~850 KB) exceeds the Docker config size limit, so dashboards are bind-mounted from the state dir instead of shipped as configs.
CoreDNS GSLB node_ip is set to --main-address (an overlay container can't learn its host's external IP); true multi-region GSLB needs host-network CoreDNS.
The jhonbrownn/elchi-* images are linux/amd64 — deploy on an amd64 Docker host (Swarm won't schedule them on arm64 nodes).

📖 Full README 💻 Source

🛰️ Elchi Client (data-plane agent)

Install the elchi-client agent on a remote VM / bare-metal host so the node registers itself with an existing elchi-stack control plane and starts receiving Envoy / xDS configuration. Run on the TARGET host (the one that will serve traffic) — NOT on the control-plane M1.

Prerequisites: a running elchi-stack (Helm or Bare-Metal); the control plane's --main-address reachable over HTTPS; a valid auth token (min 8 chars) minted from the UI (Settings → Tokens).

1. Quick start

Production setup (default `--cloud=other`)

wget https://github.com/CloudNativeWorks/elchi-archive/releases/download/elchi-client-v1.1.0/elchi-install.sh sudo bash elchi-install.sh \ --name=web-server-01 \ --host=backend.elchi.io \ --port=443 \ --tls=true \ --token=your-auth-token

OpenStack deployment (cloud name from UI)

sudo bash elchi-install.sh \ --name=openstack-vm \ --host=controller.elchi.io \ --port=443 \ --tls=true \ --token=prod-token \ --cloud=my-openstack

If you are deploying on OpenStack, pass --cloud=<your-cloud-name> exactly as it appears in the UI's Settings → Clouds list — the client uses this to look up the right metadata service / region.

With BGP routing (FRR install)

sudo bash elchi-install.sh \ --enable-bgp \ --name=edge-router \ --host=controller.elchi.io \ --port=443 \ --tls=true \ --token=prod-token \ --cloud=production

--enable-bgp additionally installs and configures FRR so the node can announce VIPs / advertise prefixes for the routes elchi-control-plane pushes via BGP. Without it the client only manages Envoy data-plane config.

2. Configuration options

Required parameters

--name=<NAME>

Client identifier. Shows up in the UI's Clients list — pick something operator-friendly per host (web-server-01, edge-router-01, db-replica-2).

--host=<HOST>

Control-plane address (DNS or IP). Same value the UI is published at; --main-address in the standalone install or the public LB in Helm.

--port=<PORT>

Control-plane HTTPS port. Usually 443. Valid range 1-65535.

--tls=true|false

TLS to control plane. Default: true. Only set to false on a dev / inside-VPC plaintext install.

--token=<TOKEN>

Authentication token (minimum 8 characters) minted from the UI under Settings → Tokens. Bind it to the same project as the client's --name.

Optional parameters

--cloud=<CLOUD>

Cloud / infrastructure provider name. Default: other. Use the exact name from Settings → Clouds in the UI when running on OpenStack / AWS / GCP / etc. — the client uses this to pull region + metadata from the right provider plugin.

--enable-bgp

Install and configure FRR alongside the client so the node can participate in BGP route advertisements (VIPs, prefixes pushed from the elchi control plane). Adds the frr package and writes /etc/frr/frr.conf with a baseline session config.

3. Manual installation

When you'd rather skip the installer wrapper and just drop the binary in place (e.g., from a config-management system or a custom systemd unit):

wget https://github.com/CloudNativeWorks/elchi-archive/releases/download/elchi-client-v1.1.0/elchi-client-linux-amd64 sudo mv elchi-client-linux-amd64 /usr/local/bin/elchi-client sudo chmod +x /usr/local/bin/elchi-client

Verify checksum: wget https://github.com/CloudNativeWorks/elchi-archive/releases/download/elchi-client-v1.1.0/elchi-client-linux-amd64.sha256 && sha256sum -c elchi-client-linux-amd64.sha256

📦 All elchi-client releases 💻 Source

Envoy

Pre-compiled Envoy binaries for Linux AMD64 and ARM64 architectures.

Loading Envoy releases...

Elchi Backend

elchi-backend control-plane binaries for Linux AMD64. One binary per control-plane / Envoy variant - pick the one that matches your Envoy fleet.

Loading Elchi Backend releases...

Elchi UI

elchi UI static distribution bundle (index.html + assets), served by nginx on bare-metal.

Loading Elchi UI releases...

Coraza Proxy WASM

Compiled WASM modules for Coraza Web Application Firewall.

Loading Coroza WASM releases...

Elchi Client

Pre-built Elchi Client binaries and installation scripts.

Loading Elchi Client releases...

Elchi GSLB

CoreDNS with Elchi GSLB plugin - GSLB-like DNS resolution with in-memory caching and webhook-based updates.

Loading Elchi GSLB releases...

Elchi Collector

elchi-collector - ingests the Envoy ALS (Access Log Service) gRPC stream into ClickHouse and MongoDB.

Loading Elchi Collector releases...

Elchi Archive

🚀 Quick Install Elchi Stack (Kubernetes via kind + Helm)

⚙️ Bare-Metal Standalone Install (no Docker, no Kubernetes)

1. Quick start

Single VM (all-in-one)

3-VM cluster, multi-version backend, key-based SSH

3-VM cluster, no SSH key set up yet (interactive bootstrap)

2. install.sh — full flag reference

3. upgrade.sh — version-diff upgrade

Add a new variant alongside what's already running (additive)

Bump just the UI (sadece UI güncelleme)

Bump just CoreDNS (only when GSLB is enabled)

Replace the variant set explicitly (full union)

Replace a variant + drop the old one (declarative)

Bump the UI + Envoy proxy together

Rotate the Grafana admin password

Add a new variant alongside existing ones

Update an existing variant (replace one tag with another)

Both at once (declarative)

Health gate & rollback

4. uninstall.sh — remove the stack

Single node (this machine only)

Whole cluster — fan out from M1 to every node

Wipe everything (data + packages + secrets + SSH bootstrap material)

5. /etc/elchi/validate.sh — per-node post-install audit

6. elchi-stack — operator helper (/usr/local/bin/elchi-stack)

7. Port atlas

8. Topology

9. Production hardening (kernel + systemd)

10. Idempotency & reconcile

11. Supported distros

🐳 Docker Swarm Install (containers, no Kubernetes)

1. Quick start

Single host (all-in-one)

Multi-version backend (CSV)

Custom port / plaintext

2. install.sh — full flag reference

3. What runs

4. Offline / air-gapped

5. High availability (multi-node)

6. External MongoDB / ClickHouse / VictoriaMetrics

7. upgrade.sh — rolling update

8. uninstall.sh — remove the stack

9. Port atlas

10. How it differs from the bare-metal installer

11. Notes & limitations

🛰️ Elchi Client (data-plane agent)

1. Quick start

Production setup (default --cloud=other)

OpenStack deployment (cloud name from UI)

With BGP routing (FRR install)

2. Configuration options

3. Manual installation

Envoy

Elchi Backend

Elchi UI

Coraza Proxy WASM

Elchi Client

Elchi GSLB

Elchi Collector

2. `install.sh` — full flag reference

3. `upgrade.sh` — version-diff upgrade

4. `uninstall.sh` — remove the stack

5. `/etc/elchi/validate.sh` — per-node post-install audit

6. `elchi-stack` — operator helper (`/usr/local/bin/elchi-stack`)

Production setup (default `--cloud=other`)