
Reliability and Disaster Recovery

The QALITA platform stores certain critical data that must be backed up and restored in case of a disaster.

🧩 Backup

The platform backup is performed by the system administrator, ensuring the persistence of the following elements:

| Element | Criticality Level | Recommended Backup Frequency |
| --- | --- | --- |
| PostgreSQL | ➕➕➕ ⚠️ Critical ⚠️ | Daily |
| SeaweedFS | ➕➕ Moderate | Weekly |
| Agents | ➕➕ Moderate | Can be backed up via PVC to ensure optimal service continuity |
| Redis | None (stateless) | None |
| Frontend | None (stateless) | None |
| Backend | None (stateless) | None |

📦 PostgreSQL

Backups can be configured via the backup functionality of the Bitnami PostgreSQL Helm chart.

warning

By default, the QALITA Helm chart does not enable this backup. Once enabled, backups are written to a PVC that must itself be copied to cold or semi-cold storage.
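
As a hedged example, recent versions of the Bitnami PostgreSQL chart expose a logical-backup CronJob that can be enabled through chart values (parameter names can differ between chart versions; the release name and namespace are assumptions):

# Enable the Bitnami logical-backup CronJob and give it a dedicated PVC.
helm upgrade postgresql bitnami/postgresql -n qalita --reuse-values \
  --set backup.enabled=true \
  --set backup.cronjob.schedule="@daily" \
  --set backup.cronjob.storage.size=8Gi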

🗃️ SeaweedFS

SeaweedFS storage is less critical because it contains only:

  • Logs
  • Pack archives whose code is versioned in a VCS (e.g., GitLab)
info

In Kubernetes, back up the PVC containing the data. If you do not manage your cluster, ensure that the PVCs are included in the backup strategy (contact the cluster administrator).
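
If the cluster uses Velero (an assumption; any PVC-aware backup tool works), the namespace and its volumes can be captured with, for example:

# Back up the qalita namespace; whether PVC data is included depends on your
# Velero / CSI snapshot configuration.
velero backup create qalita-storage --include-namespaces qalita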

🛰️ Agent

Agents store important information for their operation:

  • Sources (qalita-conf.yaml): this file contains the source definitions and must be backed up. In the QALITA deployment, a local agent can be deployed with persistence of the ~/.qalita/ directory (a backup sketch follows this list).

  • Platform Connection (.agent): contains recent connection information.

    warning

    ⚠️ These files should not be backed up; use environment variables for authentication instead. The .env-<login> files are temporary and sensitive.

  • Execution Results (~/.qalita/agent_temp_run/): can be configured to be persistent and backed up.
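
A minimal backup sketch for a local agent, assuming the default ~/.qalita/ layout described above (archive name and exclude patterns are illustrative):

# Archive the agent state, excluding the sensitive files that must not be backed up.
tar czf "qalita-agent-$(date +%F).tar.gz" \
  --exclude='.qalita/.agent' \
  --exclude='.qalita/.env-*' \
  -C "$HOME" .qalita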


💾 Restoration

PostgreSQL

Follow the Bitnami PostgreSQL documentation
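
As a hedged sketch, a logical dump (pg_dump / pg_dumpall) can be replayed into the running pod; the pod name, dump file, and secret name/key are placeholders that follow recent Bitnami chart conventions:

# Retrieve the admin password from the chart's secret, then replay the dump.
export POSTGRES_PASSWORD=$(kubectl get secret postgresql -n qalita \
  -o jsonpath='{.data.postgres-password}' | base64 -d)
kubectl exec -i <postgresql-pod> -n qalita -- \
  env PGPASSWORD="$POSTGRES_PASSWORD" psql -U postgres < pg_dumpall.sql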

SeaweedFS

See the official SeaweedFS documentation
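
If the SeaweedFS PVC was captured with Velero as sketched in the backup section (an assumption; the backup name is a placeholder):

velero restore create --from-backup qalita-storage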

Agents

  1. Re-register the Agent: Run qalita agent login and restore the backed-up ~/.qalita/ directory.

  2. Restore Sources: Restore the qalita-conf.yaml file.

  3. Synchronize Source IDs: Use qalita source push to realign local sources with those on the platform (the full sequence is sketched below).
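
The same three steps as a shell sequence (the backup path /backup/ is a placeholder):

qalita agent login                      # 1. re-register the agent against the platform
cp /backup/qalita-conf.yaml ~/.qalita/  # 2. restore the source definitions from backup
qalita source push                      # 3. realign local source IDs with the platform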


⚠️ Degraded Mode

In case of partial loss of a component, the platform can continue to operate in degraded mode. Here are the main scenarios:

| Missing Component | Impact | Possible Workaround |
| --- | --- | --- |
| PostgreSQL | Complete platform blockage | None, mandatory restoration |
| SeaweedFS | Temporary loss of logs and archives | Partial operation possible |
| Agent unavailable | Scheduled or manual executions fail | Restart the agent or use a local agent |
| Web platform (frontend) | Read access impossible | Use the REST API or CLI (if the backend is still active) |
| Backend unavailable | All API access and executions are blocked | None, requires redeployment or restoration |
| Redis | Performance loss on certain operations | Manual re-executions, partially stable operation |

🔭 Supervision and SRE

🧠 Observability

Recommended tools:

| Component | Recommended Tool |
| --- | --- |
| Logs | Loki + Grafana / ELK |
| Metrics | Prometheus + Grafana |
| Uptime / probes | Uptime Kuma / Blackbox |
| Tracing | Jaeger / OpenTelemetry |

📢 Proactive Alerts

Set critical thresholds, for example (a sample Prometheus rule is sketched below):

  • Backend > 2s latency
  • HTTP 5xx rate > 2%
  • PostgreSQL PVC > 85% usage

Send via:

  • Email
  • Slack / MS Teams
  • Opsgenie / PagerDuty
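
As an illustration, the PVC threshold above can be expressed as a PrometheusRule, assuming the prometheus-operator CRDs are installed and kubelet volume metrics are scraped (the PVC name pattern is a placeholder; latency and 5xx rules depend on which exporter exposes the backend's HTTP metrics):

# The rule may also need labels matching your Prometheus ruleSelector.
kubectl apply -n qalita -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: qalita-pvc-usage
spec:
  groups:
    - name: qalita.storage
      rules:
        - alert: PostgresqlPvcAlmostFull
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*postgresql.*"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*postgresql.*"}
              > 0.85
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: PostgreSQL PVC is more than 85% full
EOF
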
Recommended SRE practices by domain:

| Domain | Recommended SRE Practice |
| --- | --- |
| DB | Backups + regular restoration tests |
| Storage | Weekly backups, volume monitoring |
| Network | LB with health checks + retries in the ingress |
| Deployment | Rolling updates |
| Incidents | Runbooks + postmortems |
| Agents | Deployment with PVC, cron job for automatic restart |

🔁 Resilience Tests

  • Intentional deletion of a pod (see the sketch below)
  • Simulated crash on DB
  • Failover test if replicas are available
  • Simulated network outage
  • Real restoration of a backup
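
For example, the pod-deletion test can be as simple as (pod name is a placeholder; run it outside production hours first):

kubectl delete pod <pod-backend> -n qalita   # simulate a crash
kubectl get pods -n qalita -w                # watch the controller recreate the pod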

📚 Runbooks

1. 🔥 Backend Not Responding

Symptoms

  • REST API unavailable (5xx, timeout)
  • Web interface not loading (error 502/504)

Diagnostics

kubectl get pods -n qalita
kubectl logs <pod-backend> -n qalita

Immediate Actions

  • Delete the faulty pod: kubectl delete pod <pod-backend> -n qalita
  • Check resources: kubectl top pod <pod-backend> -n qalita
  • If the error is due to an inaccessible DB: kubectl exec <pod-backend> -- psql <url>

Recovery

  • Pod recreated automatically? ✅
  • API tests: curl <backend-url>/health
  • Test a business API call

Postmortem

  • Reason for the crash? (OOM kill, application crash, logical error)
  • Need to increase resources? Add a readiness probe?

2. 📉 PostgreSQL Down

Symptoms

  • Backend crashing in a loop
  • Logs containing could not connect to server: Connection refused
  • kubectl get pvc indicates an attachment issue

Diagnostics

kubectl get pods -n qalita
kubectl describe pod <postgresql-pod>
kubectl logs <postgresql-pod> -n qalita

Immediate Actions

  • Delete the pod: kubectl delete pod <postgresql-pod> -n qalita
  • Check the PVC: kubectl describe pvc <postgresql-pvc> -n qalita

Restoration

  • If data is lost → restore from backup:
    helm install postgresql bitnami/postgresql \
    --set postgresql.restore.enabled=true \
    --set postgresql.restore.backupSource=<source>

Postmortem

  • Why did the pod crash?
  • Recent valid backups?
  • Automated restoration tests to be scheduled?

3. 🧊 SeaweedFS Inaccessible

Symptoms

  • Archive downloads fail
  • Unable to display task logs

Diagnostics

kubectl logs <seaweedfs-pod> -n qalita
kubectl describe pvc <seaweedfs-pvc> -n qalita

Immediate Actions

  • Check the PVC status (commands sketched below)
  • Delete and restart the pod
  • Restart the volume if using a CSI driver (EBS, Ceph...)
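
A hedged sketch of the actions above (pod and PVC names are placeholders):

kubectl get pvc <seaweedfs-pvc> -n qalita                            # PVC bound and not full?
kubectl delete pod <seaweedfs-pod> -n qalita                         # let the controller recreate the pod
kubectl get events -n qalita --sort-by=.lastTimestamp | tail -n 20   # look for CSI attach/mount errors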

Recovery

  • Validate that objects are accessible via the platform
  • Rerun a task that generates a log

Postmortem

  • Is it a PVC saturation issue?
  • Did the lack of alerting prolong the outage?

4. ⏳ Agent Blocked or Offline

Symptoms

  • Tasks no longer executing
  • Agent no longer appearing in the interface

Diagnostics

kubectl logs <agent-pod> -n qalita
qalita agent ping

Immediate Actions

  • Restart the local agent: qalita agent restart
  • Re-register: qalita agent login
  • Check network access (can it reach the API?)

Recovery

  • Test a simple task via qalita task run
  • Verify the result reception on the platform

Postmortem

  • Agent too old? DNS issue?
  • Monitor agents via regular heartbeat

5. 🟣 Memory or CPU Saturated

Symptoms

  • Pod restarting in a loop (CrashLoopBackOff)
  • High API latency

Diagnostics

kubectl top pod -n qalita
kubectl describe pod <pod-name>

Immediate Actions

  • Increase resources in values.yaml (example below)
  • Check if a process is consuming abnormally (profiling via pprof)
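
One way to apply the change, assuming the chart exposes backend.resources.* values (verify the paths in the chart's values.yaml):

helm upgrade qalita-platform ./chart --reuse-values \
  --set backend.resources.requests.cpu=500m \
  --set backend.resources.limits.memory=2Gi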

Recovery

  • Apply new resource settings:
    helm upgrade qalita-platform ./chart --values values.yaml

Postmortem

  • Is HPA enabled?
  • Is there a spike related to a specific task?

6. 🚧 TLS Certificate Expired

Symptoms

  • Unable to access the interface
  • Browser error "connection not secure"

Diagnostics

kubectl describe certificate -n qalita
kubectl get cert -n qalita

Immediate Actions

  • Manually renew (requires the cert-manager kubectl plugin; an alternative is sketched below):
    kubectl cert-manager renew <cert-name>
  • Force redeploy Traefik/Ingress
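
If the kubectl cert-manager plugin is not installed, cmctl provides the same command, and deleting the certificate's TLS secret forces cert-manager to reissue it (names are placeholders):

cmctl renew <cert-name> -n qalita
kubectl delete secret <cert-secret-name> -n qalita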

Recovery

  • Wait for the certificate to be "Ready":
    kubectl get certificate -n qalita -o wide

Postmortem

  • Is cert-manager functioning correctly?
  • Set up an alert 15 days before expiry (D-15)

7. 🔒 License Issue

Symptoms

  • Messages such as "Invalid or expired license"
  • API returns 401 on login

Diagnostics

  • Check the QALITA_LICENSE_KEY variable
  • Verification test:
    curl -H "Authorization: Bearer <token>" \
    https://<registry>/v2/_catalog

Immediate Actions

  • Check the expiration date (contained in the JWT token)
  • Extend via the portal or contact support

Recovery

  • Redeploy the backend with the new license (if the variable is mounted via secret/env); see the sketch below
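
A hedged sketch, assuming the license is mounted from a Kubernetes secret (secret, key, and deployment names are placeholders):

# Update the license secret in place, then restart the backend to pick it up.
kubectl create secret generic qalita-license -n qalita \
  --from-literal=QALITA_LICENSE_KEY=<new-key> \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/<backend-deployment> -n qalita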

Postmortem

  • Is renewal automated or monitored?
  • Was an alert triggered in advance?

🔁 Automation Suggestions

| Incident | Possible Automation |
| --- | --- |
| Agent offline | Cron job running qalita agent ping with an alert (sketched below) |
| TLS certificate expired | Script to monitor certificates 30 days before expiry (D-30) |
| PostgreSQL saturated | Prometheus alerts on pg_stat_activity |
| PVC nearly full | Alerting on disk usage via kubelet metrics |
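
For instance, the agent-offline check could be a crontab entry on the agent host (webhook URL and message are placeholders):

# Ping the agent every 5 minutes; post to an alerting webhook on failure.
*/5 * * * * qalita agent ping || curl -fsS -X POST -H 'Content-Type: application/json' -d '{"text":"QALITA agent ping failed"}' https://hooks.example.com/alert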

🗂️ Tip: Classify Incidents in Grafana OnCall / Opsgenie

  • Category: backend, db, network, tls, storage
  • Priority: P1 (blocking), P2 (degraded), P3 (minor)
  • Responsible: infra, dev, data