AWS Official Blog
Observing Modern Applications with AWS Distro for OpenTelemetry
1. Introduction to AWS Distro for OpenTelemetry
1.1 What is observability?
With the spread of microservices (MicroServices), more and more companies are adopting microservice architectures to decouple applications and speed up iterative delivery, shipping end-user requirements faster and enabling rapid business innovation. Microservices are no silver bullet, however. Poorly governed, they can backfire: instead of delivering the promised benefits, the added system complexity can drive up development, operations, and deployment overhead, slow down iteration, and even hurt overall stability. This is why technologies such as container orchestration, service meshes, and application observability are increasingly brought in to address the challenges of microservice architectures.
Because a microservice system is far more complex than traditional software, operating it is much harder, and the difficulty grows exponentially with the number of distributed nodes. To achieve operational excellence and business innovation, customers need to make their systems "observable", giving developers and operators insight into what is happening inside. Put simply, observability is white-box monitoring for complex IT systems: by some means it lets engineers conveniently observe application behavior at any point in time, gain insight into the system, and continuously improve supporting processes and procedures to deliver business value and achieve operational excellence. In normal operation, an observability system helps assess system load and informs operational decisions; when a failure occurs, it helps locate and fix the problem quickly.
1.2 Challenges in building an "observable" system
When introducing observability to modern applications, the "three pillars" split the work: the logs system provides event detail, the metrics system handles statistics and aggregation, and the traces system focuses on request latency. But the three pillars mind their own business and usually live in separate systems; in the CNCF open-source ecosystem, for instance, Prometheus covers metrics, Jaeger covers tracing, and EFK covers logs. When something goes wrong, a human has to hunt for correlations across these silos and then judge and optimize from experience, which is clearly impractical: it is slow, laborious, and often fails to surface the root cause.
It is worth stressing that tracing usually requires instrumenting the boundaries of the functions or services being called, which means some code intrusion. Each tracing solution ships its own SDK/API, and they are mutually incompatible, so switching tracing solutions requires substantial code rework. This has directly held back the adoption of tracing.
1.3 OpenTelemetry and ADOT
So how do we solve these problems? The answer is structure and standardization. OpenTelemetry (OTel), formed from the merger of OpenTracing and OpenCensus and combined with the W3C Trace Context specification, unifies data collection and standards for the observability stack, connecting traces, metrics, and logs, previously isolated islands of information, into a single, highly structured data stream.
OTel consists of the following parts:
- Cross-language Specification: defines the data standards, semantic conventions, the OTLP protocol, the API, the SDK, and more;
- API: defines the data types and operations for generating and correlating traces, metrics, and logs;
- SDK: per-language implementations of the OTel API that also define configuration, data processing, and export concepts, making it easy to generate and export telemetry;
- Data: defines the OpenTelemetry Protocol (OTLP) and the Semantic Conventions that observability backends can support;
- Collector: receives, processes, and exports telemetry; users configure pipelines that define how data is received, how it is processed, and how it is exported to backends;
- Automatic Instrumentation: out-of-the-box telemetry collection that requires no code changes.
AWS Distro for OpenTelemetry (ADOT) is an AWS-supported, secure, production-ready distribution of the OpenTelemetry project. With ADOT you can send the metrics and traces generated by your applications to multiple AWS and partner monitoring solutions. ADOT also collects metadata from AWS resources and managed services, letting you correlate application performance data with the underlying infrastructure data and reducing the mean time to resolve problems.
On April 21, 2022, EKS released the ADOT EKS add-on, which makes it easy for EKS users to install and manage the ADOT Operator. It simplifies the observability experience for applications running on EKS and can send metrics and traces to multiple monitoring services, including AWS X-Ray, Amazon Managed Service for Prometheus, and Amazon CloudWatch, as well as partner monitoring solutions.
This post centers on the ADOT Operator for EKS and demonstrates how to build observability for containerized applications.
2. Installing the ADOT Operator
A Kubernetes Operator is a method of packaging, deploying, and managing a Kubernetes-native application, i.e. one that is both deployed on Kubernetes and managed through the Kubernetes API and the kubectl tool. An Operator is a custom Kubernetes controller; the ADOT Operator introduces a new object type through a Custom Resource Definition (CRD) to simplify Collector management. With the ADOT Operator, users manage how telemetry is collected, processed, and exported through a declarative API.
2.1 Installing the tools and the cluster
Before installing the ADOT Operator you need a machine that can reach AWS, with awscli, eksctl, kubectl, and Helm installed and configured. This is not covered here; see the client tools installation guide.
Besides these tools, you need an EKS cluster to run the ADOT Operator. You can quickly create one with eksctl or use an existing cluster; note that the cluster must have the IAM OIDC provider enabled:
cat <<EOF | eksctl create cluster -f -
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: "otel" # cluster name
  region: "ap-northeast-2" # region the cluster lives in
  version: "1.23" # EKS version
iam:
  withOIDC: true # IAM OIDC provider; important, required by add-ons such as the AWS Load Balancer Controller
managedNodeGroups:
  - name: Private-01
    instanceType: m6i.large # instance type for the worker nodes
    desiredCapacity: 3 # initial size of the Auto Scaling group
    minSize: 3 # minimum size of the Auto Scaling group
    maxSize: 3 # maximum size of the Auto Scaling group
    volumeSize: 30 # node EBS volume size (GiB)
    volumeType: gp3 # node EBS volume type
EOF
2.2 Configuring IAM
The Operator needs Kubernetes API permissions; configure the Kubernetes RBAC with the following command:
kubectl apply -f https://amazon-eks.s3.amazonaws.com/docs/addons-otel-permissions.yaml
The ADOT add-on depends on cert-manager, which must be installed first:
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.9.1/cert-manager.yaml
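Before moving on, it is worth confirming that the cert-manager pods have reached the Running state (a quick check; pod names will differ per installation):
kubectl get pods -n cert-manager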
Define the environment variables: the commands below look up the AWS account ID, the EKS cluster name, and the OIDC provider, all of which are needed to configure IRSA:
export ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
export CLUSTER_NAME=$(aws eks list-clusters --query "clusters[]" --output text) # assumes the region contains only the cluster created above
export OIDC_PROVIDER=$(aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\\/\\///")
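A quick sanity check that the three variables were populated:
echo "ACCOUNT_ID=${ACCOUNT_ID} CLUSTER_NAME=${CLUSTER_NAME} OIDC_PROVIDER=${OIDC_PROVIDER}"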
Now configure IAM Roles for Service Accounts (IRSA). Unlike granting permissions to the EKS worker-node role, IRSA associates an IAM role with a native Kubernetes ServiceAccount: pods running under that ServiceAccount get the role's permissions rather than the node's, enabling finer-grained access control.
First, create the trust relationship document for the IAM role:
read -r -d '' TRUST_RELATIONSHIP <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:aud": "sts.amazonaws.com",
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:aws-otel-eks:aws-otel-collector"
        }
      }
    }
  ]
}
EOF
echo "${TRUST_RELATIONSHIP}" > trust.json
Create the IAM role AmazonADOTCollectorExecutionRole-${CLUSTER_NAME} that the ADOT Collector runs as:
aws iam create-role --role-name AmazonADOTCollectorExecutionRole-${CLUSTER_NAME} --assume-role-policy-document file://trust.json --description "ADOT Collector Execution Role for EKS Cluster ${CLUSTER_NAME}"
As needed, attach the AmazonPrometheusRemoteWriteAccess, AWSXrayWriteOnlyAccess, and CloudWatchAgentServerPolicy managed policies to AmazonADOTCollectorExecutionRole-${CLUSTER_NAME}, so the role has permission to access those services:
aws iam attach-role-policy --role-name AmazonADOTCollectorExecutionRole-${CLUSTER_NAME} --policy-arn="arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess"
aws iam attach-role-policy --role-name AmazonADOTCollectorExecutionRole-${CLUSTER_NAME} --policy-arn="arn:aws:iam::aws:policy/AWSXrayWriteOnlyAccess"
aws iam attach-role-policy --role-name AmazonADOTCollectorExecutionRole-${CLUSTER_NAME} --policy-arn="arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
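You can verify the attachments with the following command; the output should list the three managed policies:
aws iam list-attached-role-policies --role-name AmazonADOTCollectorExecutionRole-${CLUSTER_NAME}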
Create the Kubernetes namespace and the ServiceAccount for IRSA, mapping the ServiceAccount to the IAM role through an annotation:
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Namespace
metadata:
  name: aws-otel-eks
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-otel-collector
  namespace: aws-otel-eks
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::${ACCOUNT_ID}:role/AmazonADOTCollectorExecutionRole-${CLUSTER_NAME}
EOF
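Optionally confirm that the role annotation landed on the ServiceAccount:
kubectl describe serviceaccount aws-otel-collector -n aws-otel-eks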
2.3 Deploying the ADOT add-on
With IRSA created, install the EKS ADOT add-on:
aws eks create-addon --addon-name adot --cluster-name ${CLUSTER_NAME}
Check the add-on with the following command and make sure its status is ACTIVE:
aws eks describe-addon --addon-name adot --cluster-name ${CLUSTER_NAME}
{
    "addon": {
        "addonName": "adot",
        "clusterName": "otel",
        "status": "ACTIVE",
        "addonVersion": "v0.58.0-eksbuild.1",
        "health": {
            "issues": []
        },
        "addonArn": "arn:aws:eks:ap-northeast-2:<AWS ACCOUNT ID>:addon/otel/adot/18c1bee4-3665-9fc2-1fe2-303bf9938657",
        "createdAt": "2022-09-27T07:10:30.740000+00:00",
        "modifiedAt": "2022-09-27T07:11:42.641000+00:00",
        "tags": {}
    }
}
This completes the installation of the ADOT Kubernetes Operator. Next, we show how to use the Operator to manage Collectors.
Reference: https://github.com/aws-observability/aws-otel-collector
3. Creating and managing Collectors with the Operator
3.1 The four deployment modes of the ADOT Collector
The ADOT Collector supports several deployment modes to match different scenarios:
- Deployment: the simplest way to run a collector for metrics and traces. Note that with Prometheus service discovery, every collector replica holds the complete target list, so collectors scrape the same metrics repeatedly;
- Sidecar: runs as a sidecar container next to the application container in the same pod, and can be injected automatically via a pod annotation (see the sketch after this list);
- DaemonSet: one collector per Kubernetes node, running as an agent;
- StatefulSet: gives the running instances stable names and supports rescheduling and redistribution of Prometheus service-discovery targets.
For a detailed comparison, see: https://aws-otel.github.io/docs/getting-started/operator
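As a minimal sketch of sidecar injection (assuming a sidecar-mode OpenTelemetryCollector already exists in the same namespace; the pod name and image below are placeholders), annotating a pod is enough for the Operator to inject the collector container:
apiVersion: v1
kind: Pod
metadata:
  name: my-app # placeholder pod name
  namespace: aws-otel-eks
  annotations:
    sidecar.opentelemetry.io/inject: "true" # "true" selects the namespace's only sidecar-mode collector; use the collector's name when there are several
spec:
  containers:
    - name: my-app
      image: public.ecr.aws/docker/library/busybox:latest # placeholder image
      command: ["sleep", "3600"]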
3.2 The OTel Collector pipeline
The OTel Collector supports rich pipelines for collecting, processing, and exporting telemetry, and can flexibly translate data between different observability systems. Before configuring one, a few OTel Collector pipeline concepts:
- receivers: define the formats in which telemetry is accepted, e.g. Prometheus, X-Ray, StatsD, Jaeger, Zipkin, OpenTelemetry;
- processors: define how telemetry is processed, e.g. label transformation, batched sending, or even the collector's own memory limits;
- exporters: define how telemetry is sent to backend systems, e.g. X-Ray, Prometheus, files, or partner APM products;
- extensions: extend the collector's capabilities, e.g. sigv4auth for Prometheus authentication.
For the pipeline components the ADOT Collector supports, see: https://github.com/aws-observability/aws-otel-collector
3.3 Configuring Collectors with the Kubernetes Operator
Example 1: receive traces over OTLP and export them, after conversion by the collector, to AWS X-Ray
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-trace
  namespace: aws-otel-eks
spec:
  mode: deployment
  serviceAccount: aws-otel-collector
  resources:
    limits:
      cpu: 500m
      memory: 1024Mi
    requests:
      cpu: 250m
      memory: 512Mi
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
        timeout: 30s
        send_batch_size: 8192
    exporters:
      awsxray:
        region: "ap-northeast-2"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [awsxray]
---
# Binds the collector ServiceAccount to the ClusterRole otel-collector-role
# (defined in example 2). The RBAC is only needed for Prometheus service
# discovery and can be omitted for a pure OTLP pipeline like this one.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-role
subjects:
  - kind: ServiceAccount
    name: aws-otel-collector
    namespace: aws-otel-eks
Example 2: use the collector in place of a Prometheus server, exporting metrics to both CloudWatch and Prometheus
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-prom
  namespace: aws-otel-eks
spec:
  mode: deployment
  serviceAccount: aws-otel-collector
  config: |
    extensions:
      sigv4auth:
        service: "aps"
        region: "ap-northeast-2"
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-apiservers'
              sample_limit: 8192
              scheme: https
              kubernetes_sd_configs:
                - role: endpoints
              tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              relabel_configs:
                - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
                  action: keep
                  regex: kubernetes;https
    processors:
      batch:
        timeout: 30s
        send_batch_size: 8192
    exporters:
      prometheusremotewrite:
        # the Amazon Managed Service for Prometheus remote-write path is api/v1/remote_write
        endpoint: "https://<aws-managed-prometheus-endpoint>/api/v1/remote_write"
        auth:
          authenticator: sigv4auth
      awsemf:
        region: "ap-northeast-2"
        log_group_name: "/metrics/my-adot-collector"
        log_stream_name: "adot-stream"
    service:
      extensions: [sigv4auth]
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch] # run the batch processor configured above
          exporters: [prometheusremotewrite, awsemf]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-role
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-role
subjects:
  - kind: ServiceAccount
    name: aws-otel-collector
    namespace: aws-otel-eks
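Note that <aws-managed-prometheus-endpoint> is a placeholder for your Amazon Managed Service for Prometheus workspace endpoint. Assuming a workspace already exists, one way to look it up with the AWS CLI (the workspace ID below is illustrative):
aws amp list-workspaces --query "workspaces[].workspaceId" --output text
aws amp describe-workspace --workspace-id ws-12345678-abcd-1234-abcd-123456789012 --query "workspace.prometheusEndpoint" --output text
Append api/v1/remote_write to the returned endpoint to form the full remote-write URL.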
Example 3: use one collector to receive OTLP metrics and traces, exporting to Prometheus and X-Ray
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-allinone
  namespace: aws-otel-eks
spec:
  mode: deployment
  serviceAccount: aws-otel-collector
  resources:
    limits:
      cpu: 500m
      memory: 1024Mi
    requests:
      cpu: 250m
      memory: 512Mi
  config: |
    extensions:
      sigv4auth:
        service: "aps"
        region: "ap-northeast-2"
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
        timeout: 30s
        send_batch_size: 8192
    exporters:
      prometheusremotewrite:
        endpoint: "https://<aws-managed-prometheus-endpoint>/api/v1/remote_write"
        auth:
          authenticator: sigv4auth
      awsxray:
        region: "ap-northeast-2"
    service:
      extensions: [sigv4auth]
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheusremotewrite]
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [awsxray]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-role
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-role
subjects:
  - kind: ServiceAccount
    name: aws-otel-collector
    namespace: aws-otel-eks
3.4 Deploying the ADOT Collector
To keep things simple, we deploy the configuration from example 1. Save its content as collector-trace.yaml and create the resources:
kubectl create -f collector-trace.yaml
The Operator creates the Deployment and Service resources in EKS:
kubectl get all -n aws-otel-eks
NAME                                        READY   STATUS    RESTARTS   AGE
pod/otel-trace-collector-5f45484b79-x582x   1/1     Running   0          72s

NAME                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                       AGE
service/otel-trace-collector              ClusterIP   10.100.5.170    <none>        4317/TCP,4318/TCP,55681/TCP   73s
service/otel-trace-collector-headless     ClusterIP   None            <none>        4317/TCP,4318/TCP,55681/TCP   73s
service/otel-trace-collector-monitoring   ClusterIP   10.100.73.216   <none>        8888/TCP                      73s

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/otel-trace-collector   1/1     1            1           73s

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/otel-trace-collector-5f45484b79   1         1         1       73s
The collector is now deployed. Next, we use a simple example to show how to instrument an application with the OTel SDK.
4. Instrumenting an application with the OTel SDK
The sample is developed and deployed with the Go SDK:
- It generates traces with the opentelemetry-go SDK;
- It sends telemetry over OTLP to the ADOT Collector, which converts it and exports to X-Ray. Note that X-Ray trace IDs, unlike plain OTLP trace IDs, are not fully random: they embed a timestamp, and X-Ray rejects trace data older than 30 days;
- The code associates traces with the underlying resources;
- For easy observation the sampler is set to AlwaysSample. You can adjust the strategy and ratio as needed; OTel ships rich sampling strategies, such as the common fixed-ratio sampling and tail sampling, to reduce the network and storage pressure of telemetry (see the short sketch after this list);
- The application is exposed as an HTTP service listening on 0.0.0.0:4000.
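As a minimal sketch of swapping the demo's AlwaysSample for a fixed-ratio sampler (ParentBased and TraceIDRatioBased are standard samplers in the Go trace SDK; the 10% ratio and function name are arbitrary illustrations):
package sampling

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// NewSampledProvider keeps roughly 10% of new root traces while always
// honoring the sampling decision of an incoming parent span.
func NewSampledProvider() *sdktrace.TracerProvider {
	sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))
	return sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler))
}
The full sample application follows: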
package main

import (
	"context"
	"fmt"
	"log"
	"math/rand"
	"net/http"
	"os"
	"time"

	"go.opentelemetry.io/contrib/detectors/aws/ec2"
	"go.opentelemetry.io/contrib/detectors/aws/eks"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/contrib/propagators/aws/xray"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
	"go.opentelemetry.io/otel/trace"
)

var (
	OTLP_ENDPOINT_GRPC = "0.0.0.0:4317" // default endpoint for the OTLP exporter
	SERVICE_NAME       = "OTELDemo"
	REGION             = "ap-northeast-2"
	TR                 trace.Tracer
)

func main() {
	ctx := context.Background()        // root context used to propagate telemetry context
	traceStop := InitOTELProvider(ctx) // initialize the tracer provider; stop tracing when main exits
	defer func() {
		if err := traceStop(ctx); err != nil {
			log.Fatal(err)
		}
	}()
	// otelhttp.NewHandler wraps each route handler so traces are generated automatically
	http.Handle("/hello", otelhttp.NewHandler(http.HandlerFunc(helloHandler), "hello"))
	http.Handle("/err", otelhttp.NewHandler(http.HandlerFunc(errHandler), "err"))
	http.Handle("/notfound", otelhttp.NewHandler(http.HandlerFunc(notfoundHandler), "notfound"))
	log.Fatal(http.ListenAndServe(":4000", nil))
}

func helloHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	n := rand.New(rand.NewSource(time.Now().UnixNano())).Intn(200)
	fib, _ := fibonacci(ctx, uint(n))
	w.Write([]byte(fmt.Sprintf("Number: %d Fib: %d\n", n, fib)))
}

func errHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusServiceUnavailable)
}

func notfoundHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusNotFound)
}

// fibonacci returns the n-th Fibonacci number.
func fibonacci(ctx context.Context, n uint) (uint64, error) {
	_, span := TR.Start(ctx, "Fibonacci")
	defer span.End()
	if n <= 1 {
		return uint64(n), nil
	}
	// When n is greater than 93 the result overflows uint64: mark the span
	// with codes.Error and record the key information via SetAttributes.
	if n > 93 {
		span.SetStatus(codes.Error, fmt.Sprintf("unsupported fibonacci number %d: too large", n))
		span.SetAttributes(attribute.Int("num", int(n)))
		return 0, fmt.Errorf("unsupported fibonacci number %d: too large", n)
	}
	var n2, n1 uint64 = 0, 1
	for i := uint(2); i < n; i++ {
		n2, n1 = n1, n1+n2
	}
	span.SetAttributes(attribute.Int("num", int(n)))
	return n2 + n1, nil
}

// NewResource associates traces with the underlying infrastructure,
// e.g. the EKS pod ID or the EC2 instance ID.
func NewResource(ctx context.Context) *resource.Resource {
	resType := os.Getenv("RESOURCE_TYPE") // falls through to the default when RESOURCE_TYPE is unset
	switch resType {
	case "EC2":
		r, err := ec2.NewResourceDetector().Detect(ctx)
		if err != nil {
			log.Fatalf("%s: %v", "Failed to detect EC2 resource", err)
		}
		res, err := resource.Merge(r, resource.NewSchemaless(semconv.ServiceNameKey.String(SERVICE_NAME)))
		if err != nil {
			log.Fatalf("%s: %v", "Failed to merge resources", err)
		}
		return res
	case "EKS": // the EKS resource detector depends on Container Insights
		r, err := eks.NewResourceDetector().Detect(ctx)
		if err != nil {
			log.Fatalf("%s: %v", "failed to detect EKS resource", err)
		}
		res, err := resource.Merge(r, resource.NewSchemaless(semconv.ServiceNameKey.String(SERVICE_NAME)))
		if err != nil {
			log.Fatalf("%s: %v", "Failed to merge resources", err)
		}
		return res
	default:
		res := resource.NewWithAttributes(
			semconv.SchemaURL,
			// ServiceName identifies the application in the backend
			semconv.ServiceNameKey.String(SERVICE_NAME),
		)
		return res
	}
}

// InitOTELProvider initializes the TracerProvider and returns its shutdown function.
func InitOTELProvider(ctx context.Context) (traceStop func(context.Context) error) {
	ep := os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
	if ep != "" {
		OTLP_ENDPOINT_GRPC = ep // the environment variable, when set, overrides the exporter endpoint
	}
	res := NewResource(ctx)
	// create the trace exporter, which talks to the collector over gRPC
	traceExporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure(), otlptracegrpc.WithEndpoint(OTLP_ENDPOINT_GRPC))
	if err != nil {
		log.Fatalf("%s: %v", "failed to create trace exporter", err)
	}
	idg := xray.NewIDGenerator() // X-Ray trace IDs embed a timestamp
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.AlwaysSample()), // sampling strategy
		sdktrace.WithBatcher(traceExporter),
		sdktrace.WithIDGenerator(idg), // use X-Ray-compatible trace IDs
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)
	TR = tp.Tracer(SERVICE_NAME)
	return tp.Shutdown
}
The sample code can be downloaded directly from https://github.com/xufanglin/otel-exmaple
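If you are assembling the project by hand instead of cloning the repository, you also need a Go module definition before the Docker build in the next section (the module name here is illustrative; go mod tidy resolves the OTel dependencies imported above):
go mod init otel-example
go mod tidy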
5. Deploying the sample application on EKS
5.1 Packaging the application with Docker
A nice property of Go is that it cross-compiles easily into a single binary, so a multi-stage build produces a clean runtime image. Before building the image, install and start Docker:
sudo yum install -y docker
sudo systemctl enable docker && sudo systemctl start docker
Create a Dockerfile with the following content:
FROM golang:alpine AS build-env
RUN apk update && apk add ca-certificates
WORKDIR /usr/src/app
COPY . .
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o otel-example -a -ldflags '-extldflags "-static"'
FROM scratch
COPY --from=build-env /usr/src/app/otel-example /otel-example
COPY --from=build-env /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
CMD ["/otel-example"]
Build the Docker image:
sudo docker build -t otel-example:latest .
sudo docker images
REPOSITORY     TAG      IMAGE ID       CREATED              SIZE
otel-example   latest   1ee8421c3437   About a minute ago   55.1MB
golang         alpine   5dd973625d31   3 weeks ago          352MB
5.2 Storing the image in ECR
First, log in to ECR and create an image repository with the following commands:
export ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
export REGION="ap-northeast-2"
aws ecr get-login-password --region $REGION | sudo docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
aws ecr create-repository --region $REGION --repository-name otel-example --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE
Tag the image and push it to ECR:
sudo docker tag otel-example:latest $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/otel-example:latest
sudo docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/otel-example:latest
After pushing the image, fetch its URL for deploying the application to EKS:
export IMAGE=$(aws ecr describe-repositories --repository-name otel-example --query "repositories[].repositoryUri" --output text):latest
5.3 Deploying the application to EKS
First determine the ADOT Collector endpoint: from the kubectl get all -n aws-otel-eks output in section 3.4, the endpoint (service name + namespace + port) is "otel-trace-collector.aws-otel-eks:4317". Deploy the sample with that value as the OTLP endpoint:
cat <<EOF | kubectl create -f -
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-example
  labels:
    app: otel-example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-example
  template:
    metadata:
      labels:
        app: otel-example
    spec:
      containers:
        - name: otel-example
          image: ${IMAGE}
          ports:
            - name: web
              containerPort: 4000
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "otel-trace-collector.aws-otel-eks:4317"
            - name: RESOURCE_TYPE
              value: "EC2"
---
apiVersion: v1
kind: Service
metadata:
  name: otel-example
  labels:
    app: otel-example
  annotations:
    scrape: "true"
spec:
  ports:
    - name: web
      port: 4000
      targetPort: 4000
      protocol: TCP
  selector:
    app: otel-example
EOF
Check the sample application:
kubectl get pod
NAME                           READY   STATUS    RESTARTS   AGE
otel-example-955f66cd5-59xxs   1/1     Running   0          19s
6. Generating and viewing application traces
6.1 Generating traces
Use curl to exercise the application and generate traces:
kubectl run curl --image=curlimages/curl:latest -- sleep 1d
kubectl exec -it curl -- sh
/ $ curl otel-example:4000/err
/ $ curl otel-example:4000/notfound
/ $ curl otel-example:4000/notfound
/ $ curl otel-example:4000/err
/ $ curl otel-example:4000/hello
Number: 86 Fib: 420196140727489673
/ $ curl otel-example:4000/hello
Number: 119 Fib: 0
/ $ curl otel-example:4000/hello
Number: 63 Fib: 6557470319842
/ $ curl otel-example:4000/hello
Number: 85 Fib: 259695496911122585
/ $ curl otel-example:4000/hello
Number: 144 Fib: 0
6.2 Viewing traces in X-Ray
Open the X-Ray Service map in the AWS Console to see statistics about the traffic between the client and the application.
Open the X-Ray Traces menu to see the traces reported by the application, including http.method, http.status_code, response time, and more.
Select a healthy trace to see the call graph of a single request, the total duration of the trace, and the time spent in each span.
Click Fibonacci to see details of that function call, such as its parent span ID.
Under the Resources tab, the associated EC2 information appears: the account ID, the instance's Availability Zone, the instance type, and so on.
Under Metadata you can see the span attributes we set in the code.
For requests that hit the not-found handler, the response code shows 404.
Click OTELDemo to see details such as the HTTP request and response.
For requests that returned 503 Service Unavailable, the corresponding node in the trace map turns red to flag the fault.
Clicking OTELDemo again shows the HTTP request and response, and the span's Fault status is true.
Summary
OTel provides a standardized API/SDK and semantic conventions for observability, using context to correlate the "three pillars" of metrics, traces, and logs. It lowers both the cost of adopting an observability stack and the cost of switching stacks, and its flexible, configurable Collector adapts to varied and changing needs. OTel still has open issues: at the time of writing, the Go metrics SDK is in alpha, the logs signal is frozen, and correlating traces with metrics relies on the exemplar feature of the Prometheus SDK. But the community is highly active, backed by cloud providers and APM vendors, and is evolving fast; in the CNCF 2022 observability report "Cloud Native Observability: hurdles remain to understanding the health of systems", OTel's 49% adoption rate is second only to Prometheus.
ADOT, the AWS distribution of OTel, comes with AWS technical support and helps customers adopt the OTel stack more easily, integrating with EKS, ECS, Lambda, EC2, Amazon Managed Service for Prometheus, Amazon Managed Grafana, X-Ray, CloudWatch, and partner APM products.