Recreate K3s Cluster from Scratch - Disaster Recovery
This guide walks through recreating the entire K3s cluster from scratch using this repository and secrets stored in OCI Vault.
Prerequisites
Before starting, ensure you have:
- OCI Account with Always Free eligibility
- Cloudflare Account with a managed domain
- GitHub Account with a fork of this repository
- Local Tools: Terraform, OCI CLI, kubectl
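Before continuing, it can help to confirm the local tools are actually on your PATH. The following is a minimal sketch; the function name check_tools is hypothetical and not part of this repository.

```shell
# Report which required command-line tools are missing from PATH.
# check_tools is a hypothetical helper name, not part of the repository.
# Usage: check_tools terraform oci kubectl
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  # Non-zero exit status when at least one tool was not found
  return "$missing"
}
```

Calling `check_tools terraform oci kubectl` prints one line per missing tool to stderr and returns a non-zero status if any are absent.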
Step 1: OCI CLI Setup
- Install OCI CLI
  brew install oci-cli
- Configure authentication
  oci setup config
  This creates ~/.oci/config with your tenancy details.
- Verify connection
  oci iam user get --user-id <your-user-ocid>
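If the verification call fails, one quick thing to rule out is an incomplete config file. The sketch below assumes API key authentication as written by `oci setup config`, where the profile contains the keys user, fingerprint, tenancy, region and key_file; the function name oci_config_ok is hypothetical.

```shell
# Check that an OCI CLI config file defines the keys used by API key auth.
# oci_config_ok is a hypothetical helper name.
# Usage: oci_config_ok [path]   (defaults to ~/.oci/config)
oci_config_ok() {
  local file="${1:-$HOME/.oci/config}" key status=0
  if [ ! -r "$file" ]; then
    echo "cannot read $file" >&2
    return 1
  fi
  for key in user fingerprint tenancy region key_file; do
    # Match "key=value" or "key = value" at the start of a line
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
      echo "missing key: $key" >&2
      status=1
    fi
  done
  return "$status"
}
```

The check only looks for the presence of each key in the file; it does not validate the values or distinguish between profiles.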
Step 2: Retrieve Secrets from Vault
If recreating after a disaster, secrets are stored in OCI Vault:
# List all secrets in the vault
oci vault secret list \
  --compartment-id <compartment-ocid> \
  --query 'data[].{"name":"secret-name","id":id}' \
  --output table

Retrieve Individual Secrets
# Generic retrieval command
oci secrets secret-bundle get \
  --secret-id <secret-ocid> \
  --query 'data."secret-bundle-content".content' \
  --raw-output | base64 -d

Step 3: Create terraform.tfvars
Create the Terraform variables file with values from Vault:
cd tf-k3s

cat > terraform.tfvars << 'EOF'
# OCI Authentication (from ~/.oci/config or password manager)
tenancy_ocid = "ocid1.tenancy.oc1..xxxxx"
user_ocid = "ocid1.user.oc1..xxxxx"
fingerprint = "xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx"
private_key_path = "~/.oci/oci_api_key.pem"
region = "us-ashburn-1"
compartment_ocid = "ocid1.compartment.oc1..xxxxx"

# SSH (from Vault: ssh-public-key)
ssh_public_key_path = "./oci_key.pub"

# Cloudflare (from Vault)
cloudflare_api_token = "<from vault: cloudflare-api-token>"
cloudflare_zone_id = "<from vault: cloudflare-zone-id>"
domain_name = "<from vault: domain-name>"

# GitHub (from Vault)
git_repo_url = "<from vault: git-repo-url>"
git_pat = "<from vault: github-pat>"
git_username = "<from vault: github-username>"

# K3s (from Vault)
k3s_token = "<from vault: k3s-token>"

# Let's Encrypt (from Vault)
acme_email = "<from vault: acme-email>"

# ArgoCD (from Vault)
argocd_admin_password = "<from vault: argocd-admin-password>"
EOF

Step 4: Create SSH Key
If you don’t have the SSH key:
# Generate new key pair
ssh-keygen -t ed25519 -f ./oci_key -N ""

# Or retrieve the existing public key from Vault
# (SSH access still requires the matching private key)
oci secrets secret-bundle get \
  --secret-id <ssh-public-key-ocid> \
  --query 'data."secret-bundle-content".content' \
  --raw-output | base64 -d > oci_key.pub

Step 5: Initialize Terraform
For a fresh deployment:
terraform init
terraform plan
terraform apply

If state already exists:
# State is in OCI Object Storage bucket
terraform init
terraform plan

Step 6: Wait for Bootstrap
After Terraform completes, wait approximately 5 minutes for:
- Cloud-init to complete on all nodes
- K3s cluster to form
- ArgoCD to bootstrap
- Applications to sync
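Rather than waiting a fixed time, the wait can be scripted as a polling loop. The sketch below is a generic retry helper; the name retry_until and the timeout and interval values in the usage note are illustrative assumptions, not part of this repository.

```shell
# Run a command repeatedly until it succeeds or a timeout (in seconds) expires.
# retry_until is a hypothetical helper name.
# Usage: retry_until <timeout-seconds> <interval-seconds> <command> [args...]
retry_until() {
  local timeout="$1" interval="$2" elapsed=0
  shift 2
  until "$@"; do
    # elapsed counts only sleep time, so the real wait is slightly longer
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s: $*" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

For example, on the server node, `retry_until 600 15 sudo kubectl get nodes` would retry every 15 seconds for up to 10 minutes until the Kubernetes API answers. A successful response only shows the API is reachable, so node and application status still need the checks in Step 7.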
Step 7: Verify Cluster
Connect via SSH
# Get ingress IP from Terraform output
INGRESS_IP=$(terraform output -raw ingress_public_ip)

# SSH to server via ingress jump host
ssh -J "ubuntu@$INGRESS_IP" ubuntu@10.0.2.10

Check Kubernetes
sudo kubectl get nodes
sudo kubectl get pods -A

Check ArgoCD Applications
sudo kubectl get applications -n argocd

Step 8: Access ArgoCD UI
Get Credentials
The ArgoCD admin password is synced from OCI Vault via External Secrets Operator.
To read it from the cluster:
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath='{.data.password}' | base64 -d
This retrieves the password that External Secrets synced from Vault.

To read it directly from Vault:
oci secrets secret-bundle get \
  --secret-id <argocd-admin-password-ocid> \
  --query 'data."secret-bundle-content".content' \
  --raw-output | base64 -d

Then log in:
argocd login argocd.<your-domain> \
  --username admin \
  --password <password-from-above>

Troubleshooting
State Lock Issues
If Terraform state is locked from a previous run:
terraform force-unlock <lock-id>

ArgoCD Sync Issues
If applications aren’t syncing, check the repo credentials:
kubectl -n argocd get secret repo-creds -o yaml

DNS Not Resolving
Wait for External DNS to create records (up to 5 minutes), then verify:
dig @1.1.1.1 argocd.<your-domain>

Known Issues After Recreation
Worker Node Certificate Mismatch
After cluster recreation, worker nodes may fail to join with TLS errors:
level=error msg="Failed to connect to proxy" error="tls: failed to verify certificate: x509: certificate signed by unknown authority"
Fix by resetting the worker’s certificates:
ssh -J "ubuntu@$INGRESS_IP" ubuntu@<worker-ip>
# Then, on the worker:
sudo systemctl stop k3s-agent
sudo rm -rf /var/lib/rancher/k3s/agent/*.kubeconfig /var/lib/rancher/k3s/agent/client*
sudo systemctl start k3s-agent

Let’s Encrypt Rate Limiting
If certificate issuance fails with a 429 rateLimited error, you can:
- Wait for the rate limit to reset (7 days)
- Create a temporary self-signed certificate:
  openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout /tmp/tls.key -out /tmp/tls.crt \
    -subj "/CN=<your-domain>"
  kubectl create secret tls docs-tls \
    --cert=/tmp/tls.crt --key=/tmp/tls.key \
    -n default --dry-run=client -o yaml | kubectl apply -f -
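A temporary certificate is easy to forget once the rate limit resets. The following sketch shows one way to tell whether a PEM certificate file is self-signed, using the heuristic that its subject equals its issuer. The function name is_self_signed is hypothetical, and the heuristic is an assumption that fits the openssl command above rather than a general test.

```shell
# Heuristic: treat a certificate as self-signed when its subject equals its issuer.
# is_self_signed is a hypothetical helper name.
# Usage: is_self_signed <path-to-pem-certificate>
# Exit status: 0 self-signed, 1 issued by another CA, 2 certificate unreadable
is_self_signed() {
  local subject issuer
  subject=$(openssl x509 -in "$1" -noout -subject) || return 2
  issuer=$(openssl x509 -in "$1" -noout -issuer) || return 2
  # Strip the "subject=" / "issuer=" prefixes before comparing the names
  [ "${subject#subject=}" = "${issuer#issuer=}" ]
}
```

Running `is_self_signed /tmp/tls.crt` on the file generated above should succeed, while a certificate issued by Let’s Encrypt should return status 1.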
Envoy Gateway Pod Restart
If Envoy pods are stuck in Pending state after a restart:
# Find and delete the old pod to free hostPort 80/443
kubectl delete pod -n envoy-gateway-system <old-pod-name> --grace-period=10

Post-Recreation Checklist
- All nodes are Ready (kubectl get nodes)
- Worker node joined (check for TLS errors if missing)
- ArgoCD applications are Synced
- DNS records are created (wait up to 5 minutes)
- TLS certificates are issued (check for rate limiting)
- Applications are accessible via HTTPS
- Envoy Gateway pod is Running (not Pending)
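Parts of this checklist can be scripted. The sketch below counts problem rows in kubectl's default table output: it assumes the node STATUS is the second column of `kubectl get nodes --no-headers` and the pod STATUS is the third column of `kubectl get pods -n <namespace> --no-headers`. Both function names are hypothetical.

```shell
# Count nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# A cordoned node ("Ready,SchedulingDisabled") is counted as not ready.
count_not_ready_nodes() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Count pods whose STATUS column is neither "Running" nor "Completed".
# Reads `kubectl get pods -n <namespace> --no-headers` output on stdin.
count_unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}
```

For example, `sudo kubectl get nodes --no-headers | count_not_ready_nodes` prints 0 when every node is Ready, and `kubectl get pods -n envoy-gateway-system --no-headers | count_unhealthy_pods` prints 0 when no Envoy pod is stuck in Pending.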