Analysis

In this module we will:

  • Review and understand the concept of an Analysis

  • Deploy a BlueGreen Rollout with an Analysis

  • Perform a successful promotion of an image tested by an Analysis

  • Perform and observe a promotion that fails an Analysis

Analysis Overview

When upgrading a service, the new version being deployed needs to be tested to ensure that functionality is not negatively impacted. The Analysis feature enables Rollouts to collect data and metrics from a variety of providers to validate the new version of the application.

In addition to collecting data, an Analysis can include a Job to drive more advanced use cases. For example, a Job could be used to run load against an application in order to generate the metrics needed to validate the revision.

To use an Analysis, you first create an AnalysisTemplate that defines the analysis to be run, and then associate it with one or more rollouts. How an analysis is associated with a rollout depends on the rollout strategy in use. In this module we will look at it from the perspective of the Blue-Green strategy we deployed previously; in the next module we will delve into using an Analysis with the Canary strategy as well.

In the blue-green strategy, an Analysis can be added as either pre-promotion or post-promotion. A pre-promotion analysis runs against the preview version before traffic is cut over to it, and is useful for validating the new deployment before it serves live traffic. A post-promotion analysis executes after the cut-over and can validate that the deployment is working with live traffic.
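
Schematically, the two hooks sit side by side under the blueGreen strategy. The sketch below shows where each one is declared; the live-traffic-checks template name is purely illustrative and not part of this workshop:

strategy:
  blueGreen:
    activeService: active
    previewService: preview
    # runs against the preview version before the cut-over
    prePromotionAnalysis:
      templates:
      - templateName: smoke-tests
    # runs after the cut-over, while the new version serves live traffic
    postPromotionAnalysis:
      templates:
      - templateName: live-traffic-checks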

Analysis Deployment

In this section we will deploy the same Blue-Green rollout we did previously but with a pre-promotion analysis included along with the corresponding analysis template. Prior to starting, confirm you are still at the correct path.

cd ~/argo-rollouts-workshop/content/modules/ROOT/examples/

Next, let’s explore the manifests that we will be deploying in the ./bluegreen-analysis/base folder:

ls ./bluegreen-analysis/base

Here you will see the same files that we used previously, plus a new file, analysistemplate.yaml. Examining the file, we see the following:

cat ./bluegreen-analysis/base/analysistemplate.yaml
kind: AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
metadata:
  name: smoke-tests
spec:
  args:
  - name: namespace
  - name: route-name
  - name: route-url
  - name: api-token
    valueFrom:
      secretKeyRef:
        name: monitor-auth-secret
        key: token
  metrics:
  - name: success-rate
    interval: 30s
    count: 4
    successCondition: result[0] == 0
    failureLimit: 0
    provider:
      prometheus:
        address: https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
        timeout: 40
        insecure: true
        headers:
          - key: Authorization
            value: "Bearer {{args.api-token}}"
        query: |
          irate(haproxy_backend_http_responses_total{exported_namespace='{{args.namespace}}', route=~"{{args.route-name}}", code="5xx"}[1m]) > 5 or on() vector(0)
  - name: run-load
    provider:
      job:
        spec:
          backoffLimit: 1
          template:
            spec:
              containers:
              - name: siege
                image: quay.io/gnunn/tools:latest
                command: [sh, -c]
                args: ["siege -c 20 -r 15 -d10 -v http://{{args.route-url}}"]
              restartPolicy: Never

This AnalysisTemplate is broken up into two broad sections as follows:

  • .spec.args. This is where you can specify arguments which are passed by the rollout when using the template. These arguments enable the template to be reused across many different rollouts.

  • .spec.metrics. These are the metric providers that will be used to collect data for the analysis, as well as any jobs that need to be executed (see the skeleton below).
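
Stripped of its details, the template above reduces to these two sections; the ellipses stand in for the fields shown earlier:

kind: AnalysisTemplate
apiVersion: argoproj.io/v1alpha1
metadata:
  name: smoke-tests
spec:
  args:      # inputs supplied by the rollout, or resolved from a secret
  - name: ...
  metrics:   # metric providers and jobs that produce the pass/fail verdict
  - name: ...
    provider: ...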

In the metrics section we can see that we have two providers defined. The first, named success-rate, uses the prometheus metric provider to pull metrics from Thanos, an aggregator for Prometheus data. The second, run-load, is a job that runs Siege to drive some load against the application.

The count and interval fields in the success-rate metric indicate that the metric will be checked four times, with a thirty-second interval between checks. The failureLimit field sets the maximum number of failed measurements permitted before the analysis as a whole is considered failed; here we set it to 0, so a single failed measurement fails the analysis.
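
To make the timing concrete, here is the relevant fragment again with the semantics spelled out as comments:

- name: success-rate
  interval: 30s                      # wait 30 seconds between measurements
  count: 4                           # take 4 measurements in total
  successCondition: result[0] == 0   # each measurement must evaluate to 0
  failureLimit: 0                    # one failed measurement fails the analysis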

Finally, remember that we are running this Analysis in the pre-promotion phase of the blue-green strategy, so the application will not yet be receiving load from users. Generating load with Siege is therefore what produces the metrics we need to determine whether to proceed with the promotion; the flags in the Job run 20 concurrent users for 15 repetitions each, with a random delay of up to 10 seconds between requests.

The template takes three arguments that are passed in by the Rollout: route-name, route-url and namespace.

The route-name and namespace arguments are used in the query field of the prometheus metrics provider to select the specific route whose HTTP response codes we want to inspect. The query computes the instantaneous per-second rate of HTTP error responses (5xx) over the last minute; if that rate exceeds 5 it is returned, otherwise the or on() vector(0) clause makes the expression fall back to 0, which is exactly what the successCondition result[0] == 0 checks for.
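
If you want to see what the analysis sees, you can run the same PromQL query against the Thanos querier yourself using the standard Prometheus HTTP API. A sketch, assuming it is run from a pod inside the cluster with a token that is authorized to read cluster monitoring:

TOKEN=$(oc whoami -t)   # or the token stored in monitor-auth-secret
curl -sk -H "Authorization: Bearer $TOKEN" \
  --data-urlencode "query=irate(haproxy_backend_http_responses_total{exported_namespace='USER_PLACEHOLDER-prod', route=~\"preview\", code=\"5xx\"}[1m]) > 5 or on() vector(0)" \
  https://thanos-querier.openshift-monitoring.svc.cluster.local:9091/api/v1/query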

The route-url argument is used by the second provider, the job, where we create load against the route. It specifies the OpenShift route URL where we want to drive the load, i.e. the URL that siege will hit when it generates load.

Finally, a fourth argument, api-token, is populated from a secret rather than passed from the Rollout. The secret provides the token needed to access the OpenShift monitoring stack and was created by the GitOps process which provisioned this workshop.
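
The secret already exists in this environment, but for reference a minimal equivalent could be created along these lines. This is a sketch rather than a workshop step, and in practice the token would come from a service account authorized to read cluster monitoring:

oc create secret generic monitor-auth-secret \
  --from-literal=token="$(oc whoami -t)" \
  -n USER_PLACEHOLDER-prod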

Next let’s look at the rollout and see how the AnalysisTemplate is wired into the rollout.

cat ./bluegreen-analysis/base/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
spec:
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: rollouts-demo
  strategy:
    blueGreen:
      activeService: active
      previewService: preview
      autoPromotionEnabled: true
      prePromotionAnalysis:
        args:
        - name: namespace
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: route-name
          value: preview
        - name: route-url
          value: preview-%USER%-prod.%SUB_DOMAIN%
        templates:
        - templateName: smoke-tests
  template:
    metadata:
      labels:
        app: rollouts-demo
    spec:
      containers:
      - image: quay.io/openshiftdemos/rollouts-demo:blue
        imagePullPolicy: IfNotPresent
        name: rollouts-demo
        ports:
        - containerPort: 8080

Notice that in .spec.strategy.blueGreen we have now defined the prePromotionAnalysis field. Here we reference the analysis template we want to use as well as the arguments that the rollout provides to the template; the argument names must match those declared in the AnalysisTemplate.

To deploy this new version of the blue-green rollout, execute the following command:

kustomize build ./bluegreen-analysis/base | sed "s/%SUB_DOMAIN%/OPENSHIFT_CLUSTER_INGRESS_DOMAIN_PLACEHOLDER/" | sed "s/%USER%/USER_PLACEHOLDER/" | oc apply -n USER_PLACEHOLDER-prod -f -
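
If you prefer the terminal to the dashboard, you can also verify the rollout with the same CLI get command we use later in this module:

oc argo rollouts get rollout rollouts-demo -n USER_PLACEHOLDER-prod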

Check that the rollout has deployed successfully in the Argo Rollouts Dashboard. The dashboard for our blue-green rollout will appear identical to what we saw previously:

[Image: Argo Rollouts dashboard showing the initial blue-green rollout with analysis]

Confirm that the application is displaying blue squares by opening the active route in your browser:

[Image: rollouts demo application displaying blue squares]

Promotion with Analysis

In this section we will promote new images using the Analysis to test the new version of the application. In the first part we will perform a promotion where the analysis succeeds and the promotion occurs. In the second part we will have the application trigger errors causing the analysis to fail and the promotion to be aborted.

Note that auto-promotion is enabled (autoPromotionEnabled: true), so once the pre-promotion analysis succeeds the cut-over happens automatically without a manual promote step.

Analysis Passes

With the updated blue-green rollout deployed, let's run through a promotion to a new image where the analysis succeeds and observe the behavior. In the OpenShift console, go to Pipelines and run a promotion to the green image. As a reminder, this is located here in the console:

[Image: OpenShift console showing the Pipelines overview]

Wait for the pipeline to complete and then go to the Argo Rollouts dashboard. If the Analysis is still running, you will see the Analysis button shown in grey (highlighted in the image with a red outline):

[Image: Argo Rollouts dashboard with the analysis in progress]

When a promotion of a rollout with an analysis is executed, the controller generates an AnalysisRun for each template execution. Since we only use one template in pre-promotion, we will see a single AnalysisRun being created. It is, however, possible for a promotion to have multiple analysis runs depending on the rollout configuration, e.g. a blue-green rollout that defines both pre- and post-promotion analysis.
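
AnalysisRuns are ordinary Kubernetes resources, so you can also list them directly while the promotion is in flight:

oc get analysisruns -n USER_PLACEHOLDER-prod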

Clicking on the analysis button will expand the view to show the in-progress analysis that is running:

[Image: expanded dashboard view showing the in-progress analysis]

Note the two green boxes with graph icons shown in separate rows. Each row represents a provider; recall that we have two providers, the job running Siege and the prometheus metric provider. Hovering the mouse over the buttons will display additional information:

[Image: tooltip details shown when hovering over the in-progress analysis]

Once the analysis is complete the Analysis button will go green to show that it successfully completed:

[Image: dashboard showing the completed analysis with a green Analysis button]

Notice that one row has four graph icons; these represent the four measurements that were taken during the execution of the AnalysisRun. Hovering the mouse over any of these graph icons will show you the measurement result.

You may need to refresh the browser to see new metrics being added to the AnalysisRun depending on connectivity during the workshop.

You can also use the Argo Rollouts CLI to examine the status of the Rollout by executing the get command that we used earlier:

oc argo rollouts get rollout rollouts-demo -n USER_PLACEHOLDER-prod

This will display output similar to the following assuming the analysis run has completed executing:

Name:            rollouts-demo
Namespace:       USER_PLACEHOLDER-prod
Status:          ✔ Healthy
Strategy:        BlueGreen
Images:          quay.io/openshiftdemos/rollouts-demo:green (stable, active)
Replicas:
  Desired:       2
  Current:       2
  Updated:       2
  Ready:         2
  Available:     2

NAME                                                        KIND         STATUS         AGE    INFO
⟳ rollouts-demo                                             Rollout      ✔ Healthy      6m58s
├──# revision:2
│  ├──⧉ rollouts-demo-5999df6cf9                            ReplicaSet   ✔ Healthy      2m23s  stable,active
│  │  ├──□ rollouts-demo-5999df6cf9-cw5bm                   Pod          ✔ Running      2m23s  ready:1/1
│  │  └──□ rollouts-demo-5999df6cf9-rfgzr                   Pod          ✔ Running      2m23s  ready:1/1
│  └──α rollouts-demo-5999df6cf9-2-pre                      AnalysisRun  ✔ Successful   2m20s  ✔ 5
│     └──⊞ 17525abd-96e7-4abe-bac3-e9f62249a353.run-load.1  Job          ✔ Successful   2m20s
└──# revision:1
   └──⧉ rollouts-demo-66d84bcd76                            ReplicaSet   • ScaledDown   6m58s
      ├──□ rollouts-demo-66d84bcd76-95xh8                   Pod          ◌ Terminating  6m58s  ready:1/1
      └──□ rollouts-demo-66d84bcd76-qq8lt                   Pod          ◌ Terminating  6m58s  ready:1/1

Note that for the new revision the result of the AnalysisRun is shown in the INFO column as a checkmark with a 5: the aggregate of the four successful measurement checks plus the successful job execution.
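
To dig into the individual measurements, you can describe the AnalysisRun itself. The name to use is the one shown in the tree above; your hash suffix will differ:

oc describe analysisrun rollouts-demo-5999df6cf9-2-pre -n USER_PLACEHOLDER-prod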

Analysis Fails

Now that we have experienced a successful promotion, let’s examine the behavior when the application has errors and the analysis fails. To start, open the preview version of the application.

Slide the error bar, as shown below, all the way to the right. Notice how all of the flashing squares now have red outlines around them, indicating they are returning errors.

[Image: rollouts demo application with the error bar at maximum and squares outlined in red]
This section should not be overly timing sensitive since we are cheating a bit by pre-generating errors. However, when we promote the new image, the error setting will be reset back to 0 for the preview service. After the promotion pipeline completes, immediately switch back to the tab with the preview service and push the error bar back to 100%.

Now we will promote the yellow image. Go back to the OpenShift Pipeline and start it with the yellow image:

[Image: OpenShift pipeline being started with the yellow image]

As mentioned in the note above, once the pipeline is finished you must immediately switch back to the preview service and set the error rate to 100%.

If you did everything correctly, you should see that the Analysis failed, as indicated by the red button:

[Image: dashboard showing the failed analysis details with a red Analysis button]
If the timing did not work out for you, you can try again by promoting a different color. You can see the list of available colors here; each color is simply a tag in our quay.io registry.

If you expand the Analysis section, you will see at least one metric test has failed as pictured above. You can hover the mouse over the failed analysis to see the details.

Examine the rollout with the CLI using the previous get command:

oc argo rollouts get rollout rollouts-demo -n USER_PLACEHOLDER-prod

The output will now show the following:

Name:            rollouts-demo
Namespace:       USER_PLACEHOLDER-prod
Status:          ✖ Degraded
Message:         RolloutAborted: Rollout aborted update to revision 3: Metric "success-rate" assessed Failed due to failed (1) > failureLimit (0)
Strategy:        BlueGreen
Images:          quay.io/openshiftdemos/rollouts-demo:green (stable, active)
Replicas:
  Desired:       2
  Current:       2
  Updated:       0
  Ready:         2
  Available:     2

NAME                                                        KIND         STATUS        AGE    INFO
⟳ rollouts-demo                                             Rollout      ✖ Degraded    23m
├──# revision:3
│  ├──⧉ rollouts-demo-6b8dccb648                            ReplicaSet   • ScaledDown  8m46s  preview,delay:passed
│  └──α rollouts-demo-6b8dccb648-3-pre                      AnalysisRun  ✖ Failed      8m43s  ✔ 1,✖ 1
│     └──⊞ d484ff69-4ef1-4cf1-9a10-1b78530dadc7.run-load.1  Job          ✔ Successful  8m43s
├──# revision:2
│  ├──⧉ rollouts-demo-5999df6cf9                            ReplicaSet   ✔ Healthy     18m    stable,active
│  │  ├──□ rollouts-demo-5999df6cf9-cw5bm                   Pod          ✔ Running     18m    ready:1/1
│  │  └──□ rollouts-demo-5999df6cf9-rfgzr                   Pod          ✔ Running     18m    ready:1/1
│  └──α rollouts-demo-5999df6cf9-2-pre                      AnalysisRun  ✔ Successful  18m    ✔ 5
│     └──⊞ 17525abd-96e7-4abe-bac3-e9f62249a353.run-load.1  Job          ✔ Successful  18m
└──# revision:1
   └──⧉ rollouts-demo-66d84bcd76                            ReplicaSet   • ScaledDown  23m
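
Because the analysis failed, the rollout aborted the update: the active service was never switched and users remained on the stable green version throughout. If you wanted to re-attempt the same revision after fixing the problem, the Argo Rollouts CLI offers a retry command; this is a sketch and not needed for this workshop, since we clean up next:

oc argo rollouts retry rollout rollouts-demo -n USER_PLACEHOLDER-prod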

Clean-up

Prior to moving on to the next module we need to perform some clean-up activities. First let’s reset the Development environment back to blue:

oc apply -k ./deploy/base -n USER_PLACEHOLDER-dev

Next we will delete the Rollout in the USER_PLACEHOLDER-prod namespace so we can start fresh as we explore the canary strategy next.

kustomize build ./bluegreen-analysis/base | sed "s/%SUB_DOMAIN%/OPENSHIFT_CLUSTER_INGRESS_DOMAIN_PLACEHOLDER/" | sed "s/%USER%/USER_PLACEHOLDER/" | oc delete -n USER_PLACEHOLDER-prod -f -
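
To confirm the clean-up, listing the rollouts and analysis runs in the namespace should return nothing; AnalysisRuns are owned by the Rollout and are garbage-collected with it:

oc get rollouts,analysisruns -n USER_PLACEHOLDER-prod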

More Information