AWS Official Blog

Gaining Insight into Modern Applications with AWS Distro for OpenTelemetry

1. An Introduction to AWS Distro for OpenTelemetry

1.1 What is observability?

As microservices (MicroServices) have become mainstream, more and more companies adopt a microservice architecture to decouple their applications and speed up iteration and release, so they can deliver end-user requirements quickly and innovate faster. Microservices are not a silver bullet, however. Poorly governed microservices can backfire: instead of enjoying the benefits of the architecture, teams can find that its added system complexity increases the cost of development, operations, and deployment, slows down iteration, and even hurts overall system stability. This is why technologies such as container orchestration, service meshes, and application observability are increasingly discussed as answers to the challenges of microservice architectures.

Because a microservice architecture is far more complex than traditional software, operating the system becomes much harder, and the difficulty grows exponentially with the number of distributed nodes. To achieve operational excellence and business innovation, customers need "observability": giving developers and operators insight into what the system is doing internally. Put simply, observability is white-box monitoring for complex IT systems. By some means, it lets developers and operators conveniently observe how an application behaves at any point in time, gain insight into the application, and continuously improve support processes and procedures to deliver business value and reach operational excellence. During normal operation, an observability system helps assess system load and informs operational decisions; when a failure occurs, it helps locate and fix the problem quickly.

1.2 Challenges in deploying an "observability" system

When introducing observability into modern applications, the "three pillars" divide the work: the logs system provides event details, the metrics system handles statistics and aggregation, and the traces system focuses on request latency. But because each pillar does its own job, they are usually separate systems; in the CNCF open-source ecosystem, for example, Prometheus handles metrics, Jaeger handles tracing, and EFK handles logs. When a problem occurs, people are left to correlate the different signals by hand and rely on experience to diagnose and optimize. That is clearly untenable: it is slow, labor-intensive, and often still fails to find the root cause.

It is worth emphasizing that a traces system usually needs to instrument the boundaries of the functions or services being called, which requires some code intrusion. Each tracing solution ships its own incompatible SDK/API, so switching tracing solutions means a costly code migration, which has directly hindered the adoption of tracing.

1.3 OpenTelemetry and ADOT

So how do we solve these problems? The answer is structure and standardization. OpenTelemetry (OTel), formed by merging OpenTracing and OpenCensus and combined with the W3C Trace Context specification, unifies data collection and standards for the observability ecosystem, connecting the previously isolated traces, metrics, and logs into a single, highly "structured" data stream.

OTel consists of the following parts:

  • A cross-language Specification: defines the data standards, semantic conventions, the OTLP protocol, the API, the SDK, and more;
    • API: defines the data types and operations used to generate and correlate traces, metrics, and logs;
    • SDK: per-language implementations of the OTel API that also define configuration, processing, and export concepts, making it easy to generate and export telemetry;
    • Data: defines the OpenTelemetry Protocol (OTLP) that observability backends can support, along with the Semantic Conventions;
  • The Collector, which receives, processes, and exports telemetry: a tool for ingesting OTel data that lets users define, through pipeline configuration, how telemetry is received, processed, and exported to backends;
  • Automatic Instrumentation: out-of-the-box telemetry collection agents

AWS Distro for OpenTelemetry (ADOT) is an AWS-supported, secure, production-ready distribution of the OpenTelemetry project. With ADOT you can send the metrics and trace data generated by your applications to multiple AWS and partner monitoring solutions. ADOT can also collect metadata from AWS resources and managed services, so you can correlate application performance data with the underlying infrastructure data and reduce the mean time to resolution.

On April 21, 2022, EKS released the ADOT EKS add-on, which makes it easy for EKS users to install and manage the ADOT Operator. It simplifies the observability experience for applications running on EKS, sending metrics and traces to multiple monitoring services, including AWS X-Ray, Amazon Managed Service for Prometheus, Amazon CloudWatch, and partner monitoring solutions.

This post uses the ADOT Operator on EKS to demonstrate how to build observability for containerized applications.

2. Installing the ADOT Operator

A Kubernetes Operator is a method of packaging, deploying, and managing a Kubernetes-native application, that is, an application that is both deployed on Kubernetes and managed through the Kubernetes API and the kubectl tooling. A Kubernetes Operator is a custom Kubernetes controller; the ADOT Operator introduces new object types through Custom Resource Definitions (CRDs) to simplify Collector management. With the ADOT Operator, users can manage how telemetry is collected, processed, and exported through a declarative API.

2.1 Installing the tools and the cluster

Before installing the ADOT Operator, you need a machine that can reach AWS, with awscli, eksctl, kubectl, and Helm installed and configured. This is not covered here; see: client tools installation

You also need an EKS cluster to run the ADOT Operator. You can quickly create one with eksctl or use an existing cluster; note that the cluster must have the IAM OIDC provider enabled

cat <<EOF | eksctl create cluster -f -
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: "otel"   # cluster name
  region: "ap-northeast-2"    # region the cluster lives in
  version: "1.23"    # EKS version

iam:
  withOIDC: true    # OIDC configuration; important, since add-ons such as the AWS Load Balancer Controller require it

managedNodeGroups:
  - name: Private-01
    instanceType: m6i.large    # instance type for the worker nodes
    desiredCapacity: 3    # initial size of the Auto Scaling group
    minSize: 3    # minimum size of the Auto Scaling group
    maxSize: 3    # maximum size of the Auto Scaling group
    volumeSize: 30    # node EBS volume size
    volumeType: gp3    # node storage type
EOF

2.2 Configuring IAM

The Operator needs Kubernetes API permissions; configure Kubernetes RBAC with the following command:

kubectl apply -f https://amazon-eks.s3.amazonaws.com/docs/addons-otel-permissions.yaml

The ADOT add-on depends on cert-manager, which must be installed first:

kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.9.1/cert-manager.yaml

Define environment variables. The following commands fetch the AWS account ID, the EKS cluster name, and the OIDC provider, which will be used to configure IRSA (note that the CLUSTER_NAME command assumes there is exactly one EKS cluster in the region; otherwise set it explicitly):

export ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
export CLUSTER_NAME=$(aws eks list-clusters --query "clusters[]" --output text)
export OIDC_PROVIDER=$(aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\\/\\///")

Configure IAM Roles for Service Accounts (IRSA). Unlike the role attached to the EKS worker nodes, IRSA associates an IAM role with a native Kubernetes ServiceAccount: pods running with that ServiceAccount obtain the permissions of the IAM role rather than those of the node, enabling finer-grained access control.

First, create the trust relationship policy document for the IAM role:

read -r -d '' TRUST_RELATIONSHIP <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:aud": "sts.amazonaws.com",
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:aws-otel-eks:aws-otel-collector"
        }
      }
    }
  ]
}
EOF

echo "${TRUST_RELATIONSHIP}" > trust.json

Create the IAM role AmazonADOTCollectorExecutionRole-${CLUSTER_NAME} that the ADOT Collector will run with:

aws iam create-role --role-name AmazonADOTCollectorExecutionRole-${CLUSTER_NAME}  --assume-role-policy-document file://trust.json --description "ADOT Collector Execution Role for EKS Cluster ${CLUSTER_NAME}"

As needed, attach the AmazonPrometheusRemoteWriteAccess, AWSXrayWriteOnlyAccess, and CloudWatchAgentServerPolicy managed policies to AmazonADOTCollectorExecutionRole-${CLUSTER_NAME} so the role can access those services:

aws iam attach-role-policy --role-name AmazonADOTCollectorExecutionRole-${CLUSTER_NAME} --policy-arn="arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess"

aws iam attach-role-policy --role-name AmazonADOTCollectorExecutionRole-${CLUSTER_NAME}   --policy-arn="arn:aws:iam::aws:policy/AWSXrayWriteOnlyAccess"

aws iam attach-role-policy --role-name AmazonADOTCollectorExecutionRole-${CLUSTER_NAME}  --policy-arn="arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"

Create the Kubernetes namespace and the ServiceAccount for IRSA, mapping the ServiceAccount to the IAM role through annotations:

cat <<EOF | kubectl create -f -

apiVersion: v1
kind: Namespace
metadata:
  name: aws-otel-eks
---

apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-otel-collector
  namespace: aws-otel-eks
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::${ACCOUNT_ID}:role/AmazonADOTCollectorExecutionRole-${CLUSTER_NAME}
EOF

2.3 Deploying the ADOT add-on

With IRSA in place, create the EKS ADOT add-on with the following command:

aws eks create-addon --addon-name adot --cluster-name ${CLUSTER_NAME}

Check the add-on status with the following command and make sure it is ACTIVE:

aws eks describe-addon --addon-name adot --cluster-name ${CLUSTER_NAME}
{
    "addon": {
        "addonName": "adot",
        "clusterName": "otel",
        "status": "ACTIVE",
        "addonVersion": "v0.58.0-eksbuild.1",
        "health": {
            "issues": []
        },
        "addonArn": "arn:aws:eks:ap-northeast-2:<AWS ACCOUNT ID>:addon/otel/adot/18c1bee4-3665-9fc2-1fe2-303bf9938657",
        "createdAt": "2022-09-27T07:10:30.740000+00:00",
        "modifiedAt": "2022-09-27T07:11:42.641000+00:00",
        "tags": {}
    }
}

The ADOT Kubernetes Operator is now installed. Next, we show how to use the Operator to manage Collectors.

References: https://github.com/aws-observability/aws-otel-collector

3. Creating and Managing Collectors with the Operator

3.1 The four deployment modes of the ADOT Collector

The ADOT Collector supports several deployment modes to match different scenarios:

  • Deployment: the simplest way to run a Collector that receives metrics and traces. Note that with Prometheus service discovery, every Collector replica holds the full discovery list, so metrics get scraped multiple times
  • Sidecar: runs as a sidecar container next to the application container in the same pod; injection can be triggered automatically with a pod annotation
  • DaemonSet: runs one Collector per Kubernetes node as an agent
  • StatefulSet: gives Collector instances stable names and supports rescheduling and redistribution of Prometheus service-discovery targets
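As a sketch of the sidecar mode, the Collector is declared once with mode: sidecar, and pods opt in through an annotation. The resource names below are illustrative, while the sidecar.opentelemetry.io/inject annotation is the OpenTelemetry Operator's standard opt-in:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-sidecar          # illustrative name
  namespace: aws-otel-eks
spec:
  mode: sidecar               # injected into pods instead of running standalone
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
    exporters:
      awsxray:
        region: "ap-northeast-2"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [awsxray]
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app                # illustrative pod
  namespace: aws-otel-eks
  annotations:
    sidecar.opentelemetry.io/inject: "true"   # opt in to sidecar injection
spec:
  containers:
  - name: my-app
    image: my-app:latest      # illustrative image
```

The Operator's admission webhook then injects the Collector container into annotated pods at creation time.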

For details on the differences, see: https://aws-otel.github.io/docs/getting-started/operator

3.2 An introduction to OTel Collector pipelines

The OTel Collector supports rich pipelines that help customers collect, process, and export telemetry, and can flexibly convert data between different observability systems. Before configuring anything, we need a few OTel Collector pipeline concepts:

  • receivers: define the formats in which telemetry signals are accepted, e.g. Prometheus, X-Ray, StatsD, Jaeger, Zipkin, OpenTelemetry
  • processors: define how telemetry is processed, e.g. label transformation, batched sending, or even the Collector's own memory limits
  • exporters: define how telemetry is sent to backend systems, e.g. X-Ray, Prometheus, files, or partner APM products
  • extensions: extend the Collector's capabilities, e.g. sigv4auth to add authentication to the Prometheus exporter

For the list of pipeline components the ADOT Collector supports, see: https://github.com/aws-observability/aws-otel-collector

3.3 Configuring Collectors with the Kubernetes Operator

Example 1: receive traces over OTLP and export them, after conversion, to AWS X-Ray

---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-trace
  namespace: aws-otel-eks
spec:
  mode: deployment
  serviceAccount: aws-otel-collector
  resources:
    limits: 
      cpu: 500m
      memory: 1024Mi
    requests: 
      cpu: 250m
      memory: 512Mi
  config: |
    receivers:
      otlp:
        protocols: 
          grpc:
          http:

    processors:
      batch:
        timeout: 30s
        send_batch_size: 8192

    exporters:
      awsxray:
        region: "ap-northeast-2"
  
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [awsxray]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-role
subjects:
  - kind: ServiceAccount
    name: aws-otel-collector
    namespace: aws-otel-eks

Example 2: use the Collector in place of a Prometheus server and export the metrics to both CloudWatch and Prometheus

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-prom
  namespace: aws-otel-eks
spec:
  mode: deployment
  serviceAccount: aws-otel-collector
  config: |

    extensions:
      sigv4auth:
        service: "aps"
        region: "ap-northeast-2"

    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-apiservers'
              sample_limit: 8192
              scheme: https

              kubernetes_sd_configs:
              - role: endpoints
              tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

              relabel_configs:
              - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
                action: keep
                regex: kubernetes;https

    processors:
      batch:
        timeout: 30s
        send_batch_size: 8192

    exporters:
      prometheusremotewrite:
        endpoint: "https://<aws-managed-prometheus-endpoint>/api/v1/remote_write"
        auth:
          authenticator: sigv4auth
      awsemf:
        region: "ap-northeast-2"
        log_group_name: "/metrics/my-adot-collector"
        log_stream_name: "adot-stream"

    service:
      extensions: [sigv4auth]
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [prometheusremotewrite,awsemf]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-role
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
  - nonResourceURLs:
      - /metrics
    verbs:
      - get

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-role
subjects:
  - kind: ServiceAccount
    name: aws-otel-collector
    namespace: aws-otel-eks

Example 3: receive OTLP metrics and traces with the Collector and export them to Prometheus and X-Ray

---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-allinone
  namespace: aws-otel-eks
spec:
  mode: deployment
  serviceAccount: aws-otel-collector
  resources:
    limits: 
      cpu: 500m
      memory: 1024Mi
    requests: 
      cpu: 250m
      memory: 512Mi
  config: |

    extensions:
      sigv4auth:
        service: "aps"
        region: "ap-northeast-2"

    receivers:
      otlp:
        protocols: 
          grpc:
          http:

    processors:
      batch:
        timeout: 30s
        send_batch_size: 8192

    exporters:
      prometheusremotewrite:
        endpoint: "https://<aws-managed-prometheus-endpoint>/api/v1/remote_write"
        auth:
          authenticator: sigv4auth
      awsxray:
        region: "ap-northeast-2"
        
    service:
      extensions: [sigv4auth]
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheusremotewrite]
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [awsxray]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-role
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
  - nonResourceURLs:
      - /metrics
    verbs:
      - get

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-role
subjects:
  - kind: ServiceAccount
    name: aws-otel-collector
    namespace: aws-otel-eks

3.4 Deploying the ADOT Collector

To keep things simple, we deploy the configuration from Example 1. Save the contents of Example 1 as collector-trace.yaml and create the resources with the following command:

kubectl create -f collector-trace.yaml

The Operator creates Deployment and Service resources in EKS:

kubectl get all -n aws-otel-eks

NAME                                        READY   STATUS    RESTARTS   AGE
pod/otel-trace-collector-5f45484b79-x582x   1/1     Running   0          72s

NAME                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                       AGE
service/otel-trace-collector              ClusterIP   10.100.5.170    <none>        4317/TCP,4318/TCP,55681/TCP   73s
service/otel-trace-collector-headless     ClusterIP   None            <none>        4317/TCP,4318/TCP,55681/TCP   73s
service/otel-trace-collector-monitoring   ClusterIP   10.100.73.216   <none>        8888/TCP                      73s

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/otel-trace-collector   1/1     1            1           73s

NAME                                              DESIRED   CURRENT   READY   AGE

The Collector is now deployed. Next, a simple example shows how to instrument an application with the OTel SDK.

4. Instrumenting an Application with the OTel SDK

This demo uses the Go SDK for development and deployment:

  • Telemetry is sent over OTLP to the ADOT Collector, which converts it and exports it to X-Ray. Note that X-Ray trace ids are not fully random like OTLP's: they embed a timestamp, and X-Ray rejects traces older than 30 days
  • The code associates trace data with the underlying infrastructure
  • For easy observation the sampler is set to AlwaysSample; adjust the sampling strategy and ratio as needed. OTel offers many sampling strategies, such as the common fixed-ratio and tail-based sampling, to reduce the network and storage pressure of telemetry
  • The application is exposed as an HTTP service listening on 0.0.0.0:4000
package main

import (
	"context"
	"fmt"
	"log"
	"math/rand"
	"net/http"
	"os"
	"time"

	"go.opentelemetry.io/contrib/detectors/aws/ec2"
	"go.opentelemetry.io/contrib/detectors/aws/eks"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/contrib/propagators/aws/xray"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
	"go.opentelemetry.io/otel/trace"
)

var (
	OTLP_ENDPOINT_GRPC = "0.0.0.0:4317" // default endpoint for the OTLP exporter
	SERVICE_NAME       = "OTELDemo"
	REGION             = "ap-northeast-2"
	TR                 trace.Tracer
)

func main() {
	ctx := context.Background()        // initialize the context used to propagate observability signals
	traceStop := InitOTELProvider(ctx) // initialize the meter/tracer provider; stop tracing when main exits
	defer func() {
		if err := traceStop(ctx); err != nil {
			log.Fatal(err)
		}
	}()
	// otelhttp.NewHandler wraps the route, adding automatic trace generation to the handler
	http.Handle("/hello", otelhttp.NewHandler(http.HandlerFunc(helloHandler), "hello"))
	http.Handle("/err", otelhttp.NewHandler(http.HandlerFunc(errHandler), "err"))
	http.Handle("/notfound", otelhttp.NewHandler(http.HandlerFunc(notfoundHandler), "notfound"))
	log.Fatal(http.ListenAndServe(":4000", nil))
}

func helloHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	n := rand.New(rand.NewSource(time.Now().UnixNano())).Intn(200)
	fib, _ := fibonacci(ctx, uint(n))
	w.Write([]byte(fmt.Sprintf("Number: %d Fib: %d\n", n, fib)))
}

func errHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusServiceUnavailable)
}

func notfoundHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusNotFound)
}

// Fibonacci returns the n-th fibonacci number.
func fibonacci(ctx context.Context, n uint) (uint64, error) {
	_, span := TR.Start(ctx, "Fibonacci")
	defer span.End()

	if n <= 1 {
		return uint64(n), nil
	}
	// When n exceeds 93 the result overflows uint64: mark the span as codes.Error and record key details via SetAttributes
	if n > 93 {
		span.SetStatus(codes.Error, fmt.Sprintf("unsupported fibonacci number %d: too large", n))
		span.SetAttributes(attribute.Int("num", int(n)))
		return 0, fmt.Errorf("unsupported fibonacci number %d: too large", n)
	}

	var n2, n1 uint64 = 0, 1
	for i := uint(2); i < n; i++ {
		n2, n1 = n1, n1+n2
	}

	span.SetAttributes(attribute.Int("num", int(n)))

	return n2 + n1, nil
}

// NewResource associates traces with the underlying infrastructure, e.g. the EKS pod ID or EC2 instance ID
func NewResource(ctx context.Context) *resource.Resource {
	// fall back to the default when the RESOURCE_TYPE environment variable is not set
	resType := os.Getenv("RESOURCE_TYPE")
	switch resType {
	case "EC2":
		r, err := ec2.NewResourceDetector().Detect(ctx)
		if err != nil {
			log.Fatalf("%s: %v", "Failed to detect EC2 resource", err)
		}
		res, err := resource.Merge(r, resource.NewSchemaless(semconv.ServiceNameKey.String(SERVICE_NAME)))
		if err != nil {
			log.Fatalf("%s: %v", "Failed to merge resources", err)
		}
		return res

	case "EKS": // EKS Resource的实现依赖Container Insight
		r, err := eks.NewResourceDetector().Detect(ctx)
		if err != nil {
			log.Fatalf("%s: %v", "failed to detect EKS resource", err)
		}
		res, err := resource.Merge(r, resource.NewSchemaless(semconv.ServiceNameKey.String(SERVICE_NAME)))
		if err != nil {
			log.Fatalf("%s: %v", "Failed to merge resources", err)
		}
		return res

	default:
		res := resource.NewWithAttributes(
			semconv.SchemaURL,
			// ServiceName identifies the application in the backend
			semconv.ServiceNameKey.String(SERVICE_NAME),
		)
		return res
	}
}

// InitOTELProvider initializes the TracerProvider and returns the tracer shutdown function
func InitOTELProvider(ctx context.Context) (traceStop func(context.Context) error) {

	ep := os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
	if ep != "" {
		OTLP_ENDPOINT_GRPC = ep // use the environment variable, when set, as the exporter endpoint
	}

	res := NewResource(ctx)

	// initialize the TracerProvider, talking to the Collector over gRPC
	traceExporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure(), otlptracegrpc.WithEndpoint(OTLP_ENDPOINT_GRPC))
	if err != nil {
		log.Fatalf("%s: %v", "failed to create trace exporter", err)
	}

	idg := xray.NewIDGenerator() // X-Ray trace ids embed a timestamp

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.AlwaysSample()), // set the sampling policy
		sdktrace.WithBatcher(traceExporter),
		sdktrace.WithIDGenerator(idg), // use X-Ray-compatible trace ids
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)

	TR = tp.Tracer(SERVICE_NAME)

	return tp.Shutdown
}
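The n > 93 guard in fibonacci above is exact: fib(93) = 12200160415121876738 still fits in uint64, while fib(94) overflows. A stdlib-only sketch (the fibChecked helper is ours) that detects the boundary at runtime with math/bits instead of hard-coding it:

```go
package main

import (
	"fmt"
	"math/bits"
)

// fibChecked computes the n-th Fibonacci number (fib(1) = fib(2) = 1),
// reporting overflow via ok = false instead of relying on a magic constant.
func fibChecked(n uint) (v uint64, ok bool) {
	if n <= 1 {
		return uint64(n), true
	}
	var n2, n1 uint64 = 0, 1
	for i := uint(2); i <= n; i++ {
		sum, carry := bits.Add64(n2, n1, 0)
		if carry != 0 { // the addition wrapped: fib(i) no longer fits in uint64
			return 0, false
		}
		n2, n1 = n1, sum
	}
	return n1, true
}

func main() {
	v, ok := fibChecked(93)
	fmt.Println(v, ok) // 12200160415121876738 true
	_, ok = fibChecked(94)
	fmt.Println(ok) // false
}
```

This is also why the sample's /hello handler, which draws n up to 199, returns Fib: 0 for large inputs: the error path is taken and the span is marked as an error.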

The sample code can be downloaded from https://github.com/xufanglin/otel-exmaple

5. Deploying the Sample Application on EKS

5.1 Packaging the application with Docker

A nice property of Go is that it cross-compiles easily into a single binary, so a multi-stage build produces a clean runtime image. Before building the image, install and start Docker:

sudo yum install -y docker
sudo systemctl enable docker && sudo systemctl start docker

Create a Dockerfile with the following content:

FROM golang:alpine AS build-env

RUN apk update && apk add ca-certificates

WORKDIR /usr/src/app

COPY . .

RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o otel-example -a -ldflags '-extldflags "-static"'

FROM scratch
COPY --from=build-env /usr/src/app/otel-example /otel-example
COPY --from=build-env /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/

CMD ["/otel-example"]

Build the Docker image:

sudo docker build -t otel-example:latest .
sudo docker images
REPOSITORY     TAG       IMAGE ID       CREATED              SIZE
otel-example   latest    1ee8421c3437   About a minute ago   55.1MB
golang         alpine    5dd973625d31   3 weeks ago          352MB

5.2 Storing the image in ECR

First create an ECR repository with the following commands:

export ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
export REGION="ap-northeast-2"

aws ecr get-login-password --region $REGION | sudo docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

aws ecr create-repository --region $REGION --repository-name otel-example --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE

Tag the image and push it to ECR:

sudo docker tag otel-example:latest $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/otel-example:latest

sudo docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/otel-example:latest

After pushing the Docker image, fetch the image URL with the following command; it will be used to deploy the application to EKS:

export IMAGE=$(aws ecr describe-repositories --repository-name otel-example --query "repositories[].repositoryUri" --output text):latest

5.3 Deploying the application to EKS

First use kubectl to find the ADOT Collector endpoint. From the output below, the endpoint (service + namespace + port) is "otel-trace-collector.aws-otel-eks:4317":

kubectl get all -n aws-otel-eks

NAME                                        READY   STATUS    RESTARTS   AGE
pod/otel-trace-collector-5f45484b79-x582x   1/1     Running   0          72s

NAME                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                       AGE
service/otel-trace-collector              ClusterIP   10.100.5.170    <none>        4317/TCP,4318/TCP,55681/TCP   73s
service/otel-trace-collector-headless     ClusterIP   None            <none>        4317/TCP,4318/TCP,55681/TCP   73s
service/otel-trace-collector-monitoring   ClusterIP   10.100.73.216   <none>        8888/TCP                      73s

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/otel-trace-collector   1/1     1            1           73s

NAME                                              DESIRED   CURRENT   READY   AGE

Deploy the application to EKS with OTEL_EXPORTER_OTLP_ENDPOINT set to "otel-trace-collector.aws-otel-eks:4317" and RESOURCE_TYPE set to "EC2", so that infrastructure information is attached when traces are generated, which helps troubleshooting.

cat <<EOF | kubectl create -f -

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-example
  labels:
    app: otel-example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-example
  template:
    metadata:
      labels:
        app: otel-example
    spec:
      containers:
      - name: otel-example
        image: ${IMAGE}
        ports:
        - name: web
          containerPort: 4000
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "otel-trace-collector.aws-otel-eks:4317"
        - name: RESOURCE_TYPE
          value: "EC2"

---
apiVersion: v1
kind: Service
metadata:
  name: otel-example
  labels:
    app: otel-example
  annotations:
    scrape: "true"
spec:
  ports:
  - name: web
    port: 4000
    targetPort: 4000
    protocol: TCP
  selector:
    app: otel-example
EOF

Check the sample application:

kubectl get pod
NAME                           READY   STATUS    RESTARTS   AGE
otel-example-955f66cd5-59xxs   1/1     Running   0          19s

6. Testing Trace Generation and Visualization

6.1 Generating traces

Use curl to exercise the application and generate traces:

kubectl run curl --image=curlimages/curl:latest  -- sleep 1d
kubectl exec -it curl -- sh

/ $ curl otel-example:4000/err
/ $ curl otel-example:4000/notfound
/ $ curl otel-example:4000/notfound
/ $ curl otel-example:4000/err
/ $ curl otel-example:4000/hello
Number: 86 Fib: 420196140727489673
/ $ curl otel-example:4000/hello
Number: 119 Fib: 0
/ $ curl otel-example:4000/hello
Number: 63 Fib: 6557470319842
/ $ curl otel-example:4000/hello
Number: 85 Fib: 259695496911122585
/ $ curl otel-example:4000/hello
Number: 144 Fib: 0

6.2 Viewing traces in X-Ray

Open the X-Ray service map in the AWS console to see statistics on the traffic between the client and the application

Open the X-Ray Traces page to see the traces reported by the application, including http.method, http.status_code, response time, and more

Select a healthy trace to see the call graph of a single request, the total trace duration, the time spent in each span, and so on

Open Fibonacci to see details of that function call, such as the parent span id

On the Resources tab, the EC2 information has been attached: account ID, the instance's AZ, instance type, and so on

Open Metadata to see the span attributes we set in the code

For requests that returned 404 Not Found, the response shows 404

Open OTELDemo to see the HTTP request and response details

For requests that returned 503 Service Unavailable, the trace map node turns red

Opening OTELDemo again shows the HTTP request and response details, with the span's Fault status set to true

Summary

OTel provides standardized APIs/SDKs and semantic conventions for observability. Using context, it links the "three pillars" of metrics, traces, and logs, lowering both the cost of adopting an observability stack and the cost of switching stacks, and its flexible, configurable Collector meets a wide range of needs. OTel still has open issues: the Go metrics SDK is in alpha, the logs signal is in a frozen state, and linking traces to metrics relies on the exemplar feature of the Prometheus SDK. But the community is very active, with support from cloud providers and APM vendors, and it is evolving quickly: in the CNCF 2022 observability report "Cloud Native Observability: hurdles remain to understanding the health of systems", OTel's 49% adoption rate is second only to Prometheus.

ADOT, the AWS distribution of OTel, comes with AWS technical support and helps customers adopt the OTel stack more easily and integrate with EKS, ECS, Lambda, EC2, Amazon Managed Service for Prometheus, Amazon Managed Grafana, X-Ray, CloudWatch, and partner APM products.

About the Authors

Xufang Lin

AWS Solutions Architect, responsible for promoting AWS cloud technologies and solutions, with extensive hands-on experience in containers and disaster recovery.

Bingjiao Yu

Solutions Architect at Amazon Web Services, responsible for architecture consulting and design of cloud solutions for internet customers. He has years of experience in building and operating container platforms, application modernization, and DevOps, and is dedicated to promoting container technology and modern applications.