Building an SLO-Oriented GitOps Delivery Loop for Django Applications with Nomad and Prometheus


While maintaining a medium-sized internal service, our team kept running into a thorny problem: the application's release process was completely decoupled from its observability state. Developers commit code, the CI/CD pipeline builds the image, pushes it to the registry, and finally triggers a deployment. On the other side, the SRE team has to update Prometheus alerting rules and adjust Grafana dashboards by hand, sometimes discovering only after a release that the way a key business metric is collected has changed. This disconnect produced a post-release window of monitoring silence, and made troubleshooting painfully slow when something did go wrong. Our goal was clear: bring the service's definition of health, namely its service level indicators (SLIs) and service level objectives (SLOs), into the GitOps workflow, so that the application and its observability configuration are delivered atomically, as a single unit.

The initial idea: every git push must not only build and deploy the application, but also update the accompanying Prometheus alerting and recording rules in the same motion. The core technology stack is as follows:

  • Application framework: Django. Stable and mature, with an excellent Prometheus metrics exporter library, django-prometheus.
  • Container builds: Buildah. Replaces the traditional Docker daemon; safer and lighter in CI environments.
  • Orchestration: Nomad. Simpler than Kubernetes, ships as a single binary, puts less cognitive load on our operations team, and is particularly well suited to mixed workloads.
  • Monitoring and time-series storage: Prometheus. The de facto open-source monitoring standard, with a reliable TSDB.
  • CI/CD and the GitOps approach: GitLab CI as the execution engine, with the application code, the Nomad jobspec, and the Prometheus rules all managed in the Git repository.

The heart of the whole flow is closing the loop from code change to monitoring taking effect.

Step 1: Make the Django application able to describe its own health

A service that cannot be quantified has no meaningful SLO. So the first step is to make our Django application expose high-quality Prometheus metrics. We use the django-prometheus library, which automatically exposes a large number of default metrics, but for business SLIs that is nowhere near enough.

We care about two core SLIs: request success rate and request latency.

Integrate django-prometheus in settings.py:

# settings.py

INSTALLED_APPS = [
    # ... other apps
    'django_prometheus',
    'myapp',
]

MIDDLEWARE = [
    'django_prometheus.middleware.PrometheusBeforeMiddleware',
    # All other middlewares
    # ...
    'django_prometheus.middleware.PrometheusAfterMiddleware',
]

Expose the /metrics endpoint in the root urls.py:

# urls.py

from django.urls import path, include

urlpatterns = [
    # ... other urls
    path('', include('django_prometheus.urls')),
]

That is only the foundation. To measure the SLIs of a specific API precisely, we need custom metrics. Suppose we have a core endpoint, /api/v1/process_data; we create a dedicated latency histogram and request counter for it.

# myapp/metrics.py

from prometheus_client import Counter, Histogram

# Counter for API requests, distinguishing by method, endpoint, and status code.
# The status_code label is crucial for calculating error rates.
api_requests_total = Counter(
    'myapp_api_requests_total',
    'Total number of API requests',
    ['method', 'endpoint', 'status_code']
)

# Histogram for API request latency.
# Buckets are defined based on our SLO for this specific endpoint (e.g., 99% < 500ms).
# A good practice is to have buckets around your target latency.
api_request_latency_seconds = Histogram(
    'myapp_api_request_latency_seconds',
    'API request latency in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, float("inf")]
)

Then instrument the view by hand. The django-prometheus middleware can do this automatically, but manual instrumentation gives us tighter control, especially for cases where a request counts as a business success even though it does not return HTTP 2xx.

# myapp/views.py

from django.http import JsonResponse
from .metrics import api_requests_total, api_request_latency_seconds
import time
import random

def process_data_view(request):
    """
    A view that simulates some processing and might fail.
    This is the core endpoint for our SLO monitoring.
    """
    start_time = time.time()
    endpoint_path = '/api/v1/process_data'

    try:
        # Simulate work
        processing_time = random.uniform(0.05, 0.6)
        time.sleep(processing_time)

        # Simulate random failures
        if random.random() < 0.05: # 5% chance of server error
            raise ValueError("Internal processing error")

        status_code = 200
        response_data = {"status": "success", "processed_in": processing_time}

    except Exception as e:
        status_code = 500
        response_data = {"status": "error", "message": str(e)}

    finally:
        latency = time.time() - start_time
        # Record latency
        api_request_latency_seconds.labels(method=request.method, endpoint=endpoint_path).observe(latency)
        # Record request count with the final status code
        api_requests_total.labels(method=request.method, endpoint=endpoint_path, status_code=str(status_code)).inc()

    return JsonResponse(response_data, status=status_code)

The application can now expose its core SLI data: every call to /api/v1/process_data updates the myapp_api_requests_total and myapp_api_request_latency_seconds metrics.
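
The URL routing for process_data_view is not shown above. A minimal sketch of the wiring, where the module path, the URL names, and the /healthz/ endpoint referenced later by the Nomad health check are all assumptions, might look like this:

# myapp/urls.py (hypothetical wiring; include it from the root urls.py alongside django_prometheus.urls)

from django.http import JsonResponse
from django.urls import path

from .views import process_data_view


def healthz_view(request):
    # Trivial liveness endpoint for the Nomad HTTP check defined in Step 3.
    return JsonResponse({"status": "ok"})


urlpatterns = [
    path('api/v1/process_data', process_data_view, name='process_data'),
    path('healthz/', healthz_view, name='healthz'),
]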

Step 2: Daemonless container builds with Buildah

Relying on Docker-in-Docker (DinD) in a CI pipeline is both heavyweight and insecure. Buildah gives us an excellent alternative. Our build process is wrapped in a simple shell script that runs on the GitLab CI runner.

The Dockerfile stays unchanged; it is still the standard way we define the image environment:

# Dockerfile
FROM python:3.9-slim

WORKDIR /app

# Install poetry (pinned to the 1.x line so the CLI flags below remain valid)
RUN pip install "poetry<2"

# Copy only dependency files to leverage layer caching
COPY poetry.lock pyproject.toml /app/

# Install dependencies into the system interpreter; --no-root skips installing the
# project itself, whose source is only copied in the next step
RUN poetry config virtualenvs.create false && poetry install --without dev --no-root --no-interaction --no-ansi

# Copy the rest of the application
COPY . /app/

EXPOSE 8000

# Run the application
CMD ["gunicorn", "myproject.wsgi:application", "--bind", "0.0.0.0:8000"]

The key piece is the build.sh script. It is invoked by CI and handles building, tagging, and pushing the image.

#!/bin/bash
set -eo pipefail

# These variables are expected to be provided by the CI environment
# CI_REGISTRY_IMAGE: e.g., registry.example.com/my-group/my-project
# CI_COMMIT_SHORT_SHA: the short SHA of the commit (must match the IMAGE_TAG defined in .gitlab-ci.yml)

if [[ -z "$CI_REGISTRY_IMAGE" || -z "$CI_COMMIT_SHORT_SHA" ]]; then
  echo "Error: CI_REGISTRY_IMAGE and CI_COMMIT_SHORT_SHA must be set."
  exit 1
fi

IMAGE_TAG="${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}"
LATEST_TAG="${CI_REGISTRY_IMAGE}:latest"

echo "Building image: ${IMAGE_TAG}"

# buildah bud (build-using-dockerfile) builds an image from a Dockerfile.
# --tag adds a tag to the resulting image.
# The final argument is the context directory.
buildah bud --tag "${IMAGE_TAG}" .

echo "Pushing image: ${IMAGE_TAG}"
# Requires prior authentication, e.g., 'buildah login'
buildah push "${IMAGE_TAG}"

# Optionally, also tag and push a 'latest' tag for the main branch
if [[ "$CI_COMMIT_REF_NAME" == "main" ]]; then
  echo "Tagging and pushing latest tag: ${LATEST_TAG}"
  buildah tag "${IMAGE_TAG}" "${LATEST_TAG}"
  buildah push "${LATEST_TAG}"
fi

echo "Build and push complete."

The script is short and clear, has no dependency on a Docker daemon whatsoever, and is therefore well suited to ephemeral CI executors.
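
One practical note: when the runner itself is an unprivileged container, Buildah usually needs a storage driver and isolation mode that require no extra privileges. The snippet below is a sketch of that environment setup; these settings are assumptions about the runner, not part of the pipeline above.

# Environment tweaks commonly needed to run Buildah inside an unprivileged CI container.
export STORAGE_DRIVER=vfs          # avoid requiring overlayfs / fuse-overlayfs in the runner
export BUILDAH_ISOLATION=chroot    # avoid requiring user namespaces or a privileged container

# Authenticate without putting the password on the command line (visible in process lists).
echo "$CI_REGISTRY_PASSWORD" | buildah login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"

./build.sh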

Step 3: Define the Nomad job and service discovery

Nomad's job specification (an HCL file) is the core of the deployment. We keep it as a template so that CI can inject the correct image tag, and the file lives in the Git repository alongside everything else.

deploy/django-app.nomad.hcl.tmpl:

job "django-app" {
  datacenters = ["dc1"]
  type        = "service"

  group "api" {
    count = 3 # Run 3 instances for availability

    network {
      port "http" {
        to = 8000 # The port gunicorn is listening on inside the container
      }
    }

    service {
      name     = "django-api"
      port     = "http"
      provider = "consul" # Using Consul for service discovery

      # Health check for Nomad to manage task lifecycle
      check {
        type     = "http"
        path     = "/healthz/" # A simple health check endpoint in Django
        interval = "10s"
        timeout  = "2s"
      }

      # Critical for monitoring: telling Prometheus where to find the metrics
      tags = [
        "prometheus-scrape=true",
        "prometheus-path=/metrics",
        "prometheus-port=http"
      ]
    }

    task "server" {
      driver = "docker"

      config {
        image = "${IMAGE_TAG}" # This will be replaced by the CI pipeline
        ports = ["http"]
      }

      resources {
        cpu    = 250 # MHz
        memory = 256 # MB
      }
    }
  }
}

The key here is the tags in the service block. We use the prometheus-scrape=true convention so that services registered in Consul are discovered by Prometheus automatically: Prometheus queries Consul, finds every service carrying this tag, and scrapes metrics according to the prometheus-path and prometheus-port tags.
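
For completeness, the Prometheus side of this convention is a scrape job that uses Consul service discovery plus relabeling. The following is only a sketch: the Consul address and the final job label are assumptions, and the job label has to end up matching whatever the recording rules in Step 4 select.

# prometheus.yml (excerpt) - sketch of the scrape job implementing the tag convention
scrape_configs:
  - job_name: "consul-services"
    consul_sd_configs:
      - server: "consul.service.consul:8500"   # assumed Consul address
    relabel_configs:
      # Keep only services that opted in via the prometheus-scrape=true tag
      - source_labels: [__meta_consul_tags]
        regex: .*,prometheus-scrape=true,.*
        action: keep
      # Take the metrics path from the prometheus-path=<path> tag
      - source_labels: [__meta_consul_tags]
        regex: .*,prometheus-path=([^,]+),.*
        target_label: __metrics_path__
        replacement: $1
      # Make the job label match what the rules in Step 4 expect (an assumption of this sketch)
      - source_labels: [__meta_consul_service]
        regex: django-api
        target_label: job
        replacement: django-app/api

The prometheus-port tag is informational in this sketch: Nomad registers the service in Consul with the mapped http port, so the address Prometheus receives from Consul already points at the right port.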

Step 4: Define the SLOs as code

This is the core of closing the loop. We express the SLO definitions as Prometheus recording rules and alerting rules, and those rule files are managed in the Git repository too.

monitoring/rules/slo-django-app.yml:

groups:
  - name: DjangoAppSLO
    rules:
      # Rule 1: Record the 5-minute request rate for our specific API endpoint
      - record: job_endpoint:myapp_api_requests:rate5m
        expr: >
          sum by (job, endpoint) (
            rate(myapp_api_requests_total{job="django-app/api"}[5m])
          )
        labels:
          team: backend

      # Rule 2: Record the 5-minute error rate (HTTP 5xx)
      - record: job_endpoint:myapp_api_errors:rate5m
        expr: >
          sum by (job, endpoint) (
            rate(myapp_api_requests_total{job="django-app/api", status_code=~"5.."}[5m])
          )
        labels:
          team: backend

      # Rule 3: Calculate the availability SLI over a 28-day window
      # This gives us a long-term view of our service health
      - record: slo:availability:ratio_rate28d
        expr: >
          (
            sum(rate(myapp_api_requests_total{job="django-app/api", status_code!~"5.."}[28d]))
            /
            sum(rate(myapp_api_requests_total{job="django-app/api"}[28d]))
          )
        labels:
          service: "django-api"

      # Rule 4: Record the fraction of requests served within 500ms over a 28-day window.
      # SLO: 99% of requests should be faster than 500ms, so this ratio should stay >= 0.99.
      - record: slo:latency:ratio_rate28d
        expr: >
          sum(rate(myapp_api_request_latency_seconds_bucket{job="django-app/api", le="0.5"}[28d]))
          /
          sum(rate(myapp_api_request_latency_seconds_count{job="django-app/api"}[28d]))
        labels:
          service: "django-api"
          percentile: "p99"
          threshold: "0.5s"

  - name: DjangoAppSLOAlerting
    rules:
      # Alerting based on error budget burn rate.
      # This is far more effective than simple threshold-based alerting.
      #
      # SLO: 99.9% availability over 28 days
      # Error Budget: 0.1% = 0.001
      #
      # Alert condition: burning through ~2% of the 28-day error budget within a single
      # hour is a critical issue that needs immediate attention. That pace corresponds to
      # a burn rate of 0.02 * 28 * 24 ≈ 13.4, rounded to 14 (the multi-window burn-rate
      # multiplier popularized by the SRE Workbook), so we alert when the short-window
      # error ratio exceeds 14 * (1 - 0.999) = 1.4%.
      - alert: HighErrorBudgetBurn
        expr: |
          (
            job_endpoint:myapp_api_errors:rate5m{endpoint="/api/v1/process_data"}
            /
            job_endpoint:myapp_api_requests:rate5m{endpoint="/api/v1/process_data"}
          )
          > (14 * (1 - 0.999))
        for: 5m # Fire only if the condition persists for 5 minutes
        labels:
          severity: critical
        annotations:
          summary: "High error budget burn for django-api"
          description: "The service {{ $labels.job }} is burning through its 28-day error budget too quickly. Current error rate is above the SLO-defined threshold."

This YAML file is "observability as code" in practice. It defines how SLIs are derived from raw metrics, and it drives alerting from error-budget burn rate instead of naive thresholds. When developers change an API's behavior, they can change this file in the same commit, so the monitoring logic evolves in lockstep with the application logic.
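
Since the rule file now changes alongside application code, it is worth failing the pipeline early when the file is invalid. One option, sketched below with an assumed job name and image tag, is a validation job that runs promtool (which ships with Prometheus) before anything is deployed:

# .gitlab-ci.yml (excerpt) - optional rule validation ahead of the deploy stage
validate_rules:
  stage: build
  image:
    name: prom/prometheus:latest
    entrypoint: [""]            # the image's entrypoint is prometheus itself; override it
  script:
    - promtool check rules monitoring/rules/slo-django-app.yml
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'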

Step 5: Assemble the end-to-end GitOps pipeline

Finally, GitLab CI ties all the pieces together. The .gitlab-ci.yml file is the command center of the whole automated flow.

stages:
  - build
  - deploy

variables:
  # Using a sha-based tag for image immutability
  IMAGE_TAG: "${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}"

build_image:
  stage: build
  image: quay.io/buildah/stable:latest # Use an image with Buildah pre-installed
  script:
    - echo "Logging into container registry..."
    # CI_REGISTRY_USER and CI_REGISTRY_PASSWORD are GitLab predefined variables
    - buildah login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" $CI_REGISTRY
    - ./build.sh
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'

deploy_app_and_rules:
  stage: deploy
  image:
    name: hashicorp/nomad:1.4.3
    entrypoint: [""] # Override entrypoint to use shell
  script:
    # The Nomad image is Alpine-based; install the small helpers this job needs.
    - apk add --no-cache gettext curl

    # 1. Template and deploy the Nomad job
    - echo "Deploying Nomad job for image ${IMAGE_TAG}"
    # Use envsubst to replace only ${IMAGE_TAG} in the template file
    - export IMAGE_TAG=${IMAGE_TAG}
    - envsubst '${IMAGE_TAG}' < deploy/django-app.nomad.hcl.tmpl > /tmp/django-app.nomad.hcl
    # NOMAD_ADDR should be configured as a CI/CD variable in GitLab
    - nomad job run /tmp/django-app.nomad.hcl

    # 2. Deploy Prometheus rules
    # The /-/reload endpoint only re-reads configuration and rule files already on disk,
    # so the rule file must first be synced to wherever Prometheus loads rules from.
    # A robust solution would use the Prometheus Operator (on Kubernetes), a config-management
    # run, or a dedicated GitOps agent; the `prom-rule-deployer` call below is a placeholder
    # for such an assumed internal mechanism.
    - echo "Deploying Prometheus rules..."
    - prom-rule-deployer monitoring/rules/slo-django-app.yml
    # Ask Prometheus to reload rules from disk (requires it to run with --web.enable-lifecycle).
    - |
      if ! curl -fsS -X POST http://prometheus.ops.internal/-/reload; then
        echo "Failed to reload Prometheus configuration. Please check the Prometheus server."
        exit 1
      fi
    - echo "Deployment complete."
  needs:
    - build_image
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'

At this point the complete loop has taken shape; the diagram below summarizes it.

graph TD
    A[Developer: git push] --> B{GitLab CI Pipeline};
    B --> C[Build Stage];
    C -- build.sh --> D[Buildah builds Image];
    D --> E[Push to Registry];
    B --> F[Deploy Stage];
    E --> F;
    F -- nomad job run --> G[Nomad Cluster];
    G -- Pulls Image --> E;
    G -- Registers Service --> H[Consul];
    I[Prometheus] -- Discovers from --> H;
    I -- Scrapes /metrics from app instances --> G;
    F -- POST /-/reload --> I;
    I -- Loads Rules from --> J[monitoring/rules/*.yml];
    subgraph Git Repository
        K[Django App Code]
        L[Dockerfile]
        M[nomad.hcl.tmpl]
        J
    end
    A --> K;

When a developer merges a change into the main branch:

  1. The GitLab CI pipeline is triggered.
  2. The build_image job uses Buildah to build a container image tagged with the unique commit SHA and pushes it to the container registry.
  3. The deploy_app_and_rules job starts:
    • It first uses envsubst to inject the image tag into the Nomad job template.
    • It then runs nomad job run; Nomad performs a rolling update against the new jobspec, pulling the new image from the registry and deploying it (see the update stanza sketch below).
    • Finally, the pipeline syncs monitoring/rules/slo-django-app.yml to Prometheus and triggers a configuration reload.
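
The rolling-update behavior mentioned in the list above can be made explicit in the jobspec. A sketch of an update stanza for the api group follows; it is not part of the jobspec shown earlier, and the values are assumptions to tune for your own service.

  group "api" {
    # ... (network, service, task stanzas as shown in Step 3)

    update {
      max_parallel     = 1          # replace one allocation at a time
      health_check     = "checks"   # gate progress on the Consul health check defined in the service block
      min_healthy_time = "30s"
      healthy_deadline = "5m"
      auto_revert      = true       # roll back to the last good version if the new one never becomes healthy
    }
  }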

This flow guarantees that application instances and their monitoring and alerting rules are released together. If a refactor renames a metric or changes its labels, the developer can update the application code and the Prometheus rule file in the same MR, so the monitoring never silently goes stale.

The limitations of this architecture are just as obvious. First, hot-reloading Prometheus rules needs a robust implementation; hitting the /-/reload endpoint directly is crude, and in production you would more likely rely on the Prometheus Operator or a dedicated configuration-distribution service. Second, for more advanced scenarios such as blue-green or canary releases, the pipeline needs richer logic that can look at the SLIs shortly after a deployment and decide whether to keep promoting the release or roll it back automatically. Finally, database migrations (Django migrations, for instance) are not covered by this flow; they usually call for a more careful strategy that is decoupled from application deployment. The next iteration will be to bring in a tool like Argo Rollouts to get progressive delivery and automatic rollback driven by Prometheus SLIs, closing the release loop in a smarter and safer way.
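
As a rough idea of what such an SLI-driven gate could look like, here is a sketch of a post-deploy check that queries Prometheus' HTTP API and fails the pipeline when the short-window error ratio breaches the budget. The Prometheus URL, the threshold, and the jq dependency are all assumptions; this is not part of the pipeline above.

#!/bin/sh
# Post-deploy SLI gate (sketch): abort promotion if the 5m error ratio exceeds the error budget.
set -e

PROM_URL="http://prometheus.ops.internal"
THRESHOLD="0.001"   # 99.9% availability target => 0.1% error budget
QUERY='sum(rate(myapp_api_requests_total{status_code=~"5.."}[5m])) / sum(rate(myapp_api_requests_total[5m]))'

# Instant query against the Prometheus HTTP API; extract the scalar value (default to 0 if empty).
RATIO=$(curl -fsS --data-urlencode "query=${QUERY}" "${PROM_URL}/api/v1/query" \
  | jq -r '.data.result[0].value[1] // "0"')

echo "Current 5m error ratio: ${RATIO}"

# Fail the job (and let the pipeline trigger a rollback) if the ratio breaches the budget.
if awk -v r="$RATIO" -v t="$THRESHOLD" 'BEGIN { exit (r + 0 > t + 0) ? 0 : 1 }'; then
  echo "Error ratio above SLO threshold, aborting promotion."
  exit 1
fi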

