
Recreate K3s Cluster from Scratch - Disaster Recovery

This guide walks through recreating the entire K3s cluster from scratch using this repository and secrets stored in OCI Vault.

Before starting, ensure you have:

  • OCI Account with Always Free eligibility
  • Cloudflare Account with a managed domain
  • GitHub Account with fork of this repository
  • Local Tools: Terraform, OCI CLI, kubectl

Start by installing and configuring the OCI CLI:

  1. Install OCI CLI

    Terminal window
    brew install oci-cli
  2. Configure authentication

    Terminal window
    oci setup config

    This creates ~/.oci/config with your tenancy details (a sketch of the file follows these steps).

  3. Verify connection

    Terminal window
    oci iam user get --user-id <your-user-ocid>
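
The resulting ~/.oci/config has this shape (all values below are placeholders):

[DEFAULT]
user=ocid1.user.oc1..xxxxx
fingerprint=xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx
tenancy=ocid1.tenancy.oc1..xxxxx
region=us-ashburn-1
key_file=~/.oci/oci_api_key.pem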

If you are recreating the cluster after a disaster, retrieve the required secrets from OCI Vault:

Terminal window
# List all secrets in the vault
oci vault secret list \
--compartment-id <compartment-ocid> \
--query 'data[].{"name":"secret-name","id":id}' \
--output table
Terminal window
# Generic retrieval command
oci secrets secret-bundle get \
--secret-id <secret-ocid> \
--query 'data."secret-bundle-content".content' \
--raw-output | base64 -d
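
If you know the vault OCID, recent versions of the OCI CLI can also fetch a secret by name; the name k3s-token below is just an example:

Terminal window
# Retrieve a secret bundle by name instead of by OCID
oci secrets secret-bundle get-secret-bundle-by-name \
--secret-name k3s-token \
--vault-id <vault-ocid> \
--query 'data."secret-bundle-content".content' \
--raw-output | base64 -d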

Create the Terraform variables file with values from Vault:

Terminal window
cd tf-k3s
cat > terraform.tfvars << 'EOF'
# OCI Authentication (from ~/.oci/config or password manager)
tenancy_ocid = "ocid1.tenancy.oc1..xxxxx"
user_ocid = "ocid1.user.oc1..xxxxx"
fingerprint = "xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx"
private_key_path = "~/.oci/oci_api_key.pem"
region = "us-ashburn-1"
compartment_ocid = "ocid1.compartment.oc1..xxxxx"
# SSH (from Vault: ssh-public-key)
ssh_public_key_path = "./oci_key.pub"
# Cloudflare (from Vault)
cloudflare_api_token = "<from vault: cloudflare-api-token>"
cloudflare_zone_id = "<from vault: cloudflare-zone-id>"
domain_name = "<from vault: domain-name>"
# GitHub (from Vault)
git_repo_url = "<from vault: git-repo-url>"
git_pat = "<from vault: github-pat>"
git_username = "<from vault: github-username>"
# K3s (from Vault)
k3s_token = "<from vault: k3s-token>"
# Let's Encrypt (from Vault)
acme_email = "<from vault: acme-email>"
# ArgoCD (from Vault)
argocd_admin_password = "<from vault: argocd-admin-password>"
EOF
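
To avoid pasting each value by hand, you can wrap the retrieval command from above in a small shell helper and splice the results into the file; the function name and OCID below are illustrative:

Terminal window
# Hypothetical helper around the retrieval command shown earlier
get_secret() {
  oci secrets secret-bundle get --secret-id "$1" \
    --query 'data."secret-bundle-content".content' \
    --raw-output | base64 -d
}
# Example: capture a value for use in terraform.tfvars
k3s_token=$(get_secret "ocid1.vaultsecret.oc1..xxxxx")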

If you don’t have the SSH key:

Terminal window
# Generate new key pair
ssh-keygen -t ed25519 -f ./oci_key -N ""
# Or retrieve from Vault
oci secrets secret-bundle get \
--secret-id <ssh-public-key-ocid> \
--query 'data."secret-bundle-content".content' \
--raw-output | base64 -d > oci_key.pub

With terraform.tfvars in place, initialize and apply the Terraform configuration:

Terminal window
terraform init
terraform plan
terraform apply
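
When the apply finishes, list the outputs; later steps rely on ingress_public_ip, which is defined in this repo's Terraform code:

Terminal window
terraform output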

After Terraform completes, wait approximately 5 minutes for:

  1. Cloud-init to complete on all nodes
  2. K3s cluster to form
  3. ArgoCD to bootstrap
  4. Applications to sync

Once the cluster is up, connect to the server node through the ingress jump host:

Terminal window
# Get ingress IP from Terraform output
INGRESS_IP=$(terraform output -raw ingress_public_ip)
# SSH to server via ingress jump host
ssh -J ubuntu@$INGRESS_IP ubuntu@10.0.2.10
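
Rather than waiting a fixed five minutes, you can block until cloud-init reports completion on the node you just reached:

Terminal window
# Blocks until cloud-init finishes, then prints the final status
cloud-init status --wait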

On the server node, verify that all nodes and pods are healthy:

Terminal window
sudo kubectl get nodes
sudo kubectl get pods -A

Check that the ArgoCD applications have synced:

Terminal window
sudo kubectl get applications -n argocd

The ArgoCD admin password is synced from OCI Vault via the External Secrets Operator.

Terminal window
kubectl -n argocd get secret argocd-initial-admin-secret \
-o jsonpath='{.data.password}' | base64 -d

This retrieves the password that External Secrets synced from Vault.

Log in with the ArgoCD CLI:

Terminal window
argocd login argocd.<your-domain> \
--username admin \
--password <password-from-above>
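
Once logged in, you can inspect and sync applications from the CLI as well; the application name below is a placeholder:

Terminal window
# List every application with its sync and health status
argocd app list
# Force a sync of a single application
argocd app sync <app-name>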

If Terraform state is locked from a previous run, force-unlock it using the lock ID printed in the error message:

Terminal window
terraform force-unlock <lock-id>

If applications aren’t syncing, check the repo credentials:

Terminal window
kubectl -n argocd get secret repo-creds -o yaml
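
To decode a single field instead of reading base64 from the YAML (the username key follows ArgoCD's repository-secret convention; adjust if this repo's secret differs):

Terminal window
kubectl -n argocd get secret repo-creds \
-o jsonpath='{.data.username}' | base64 -d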

Wait for External DNS to create records (up to 5 minutes), then verify:

Terminal window
dig @1.1.1.1 argocd.<your-domain>
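
If the record never appears, the External DNS logs usually say why; the namespace and deployment name here are assumptions about this repo's manifests:

Terminal window
kubectl logs -n external-dns deploy/external-dns --tail=50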

After cluster recreation, worker nodes may fail to join with a TLS error like:

level=error msg="Failed to connect to proxy" error="tls: failed to verify certificate: x509: certificate signed by unknown authority"

Fix by resetting the worker’s certificates:

Terminal window
ssh -J ubuntu@$INGRESS_IP ubuntu@<worker-ip>
sudo systemctl stop k3s-agent
sudo rm -rf /var/lib/rancher/k3s/agent/*.kubeconfig /var/lib/rancher/k3s/agent/client*
sudo systemctl start k3s-agent
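
Then confirm the agent registers cleanly and the node returns to Ready:

Terminal window
# On the worker: follow the agent log
sudo journalctl -u k3s-agent -f
# On the server: watch for the worker to become Ready
sudo kubectl get nodes -w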

If certificate issuance fails with a 429 rateLimited error from Let's Encrypt, you can either:

  1. Wait for the rate limit to reset (up to 7 days)
  2. Create a temporary self-signed certificate as a stopgap (a status-check sketch follows this list):
    Terminal window
    openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout /tmp/tls.key -out /tmp/tls.crt \
    -subj "/CN=<your-domain>"
    kubectl create secret tls docs-tls \
    --cert=/tmp/tls.crt --key=/tmp/tls.key \
    -n default --dry-run=client -o yaml | kubectl apply -f -
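
To understand why issuance failed in the first place, inspect the certificate resources. This assumes cert-manager handles the ACME flow, which the repo's Let's Encrypt setup suggests but this guide doesn't state outright; the resource names are placeholders:

Terminal window
# Requires the cert-manager CRDs to be installed
kubectl get certificates -A
kubectl describe certificate <name> -n <namespace>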

If Envoy pods are stuck in Pending after a restart, the old pod is usually still holding hostPorts 80/443:

Terminal window
# Find the old pod still holding hostPort 80/443
kubectl get pods -n envoy-gateway-system
# Delete it so the replacement can schedule
kubectl delete pod -n envoy-gateway-system <old-pod-name> --grace-period=10

Finally, run through this verification checklist:

  • All nodes are Ready (kubectl get nodes)
  • Worker node joined (check for TLS errors if missing)
  • ArgoCD applications are Synced
  • DNS records are created (wait up to 5 minutes)
  • TLS certificates are issued (check for rate limiting)
  • Applications are accessible via HTTPS
  • Envoy Gateway pod is Running (not Pending)