news 2026/4/18 1:11:49

A First Look at End-to-End Tracing with OpenTelemetry: Instrumentation and Jaeger


张小明

Front-end Developer


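Preface

One day a developer from one of the business teams came to me for help.

Developer: My service is returning 504s, but I can't tell which hop is failing. Every request touches 4 microservices, 2 databases, 1 Redis, 1 message queue...

Long-suffering ops engineer: Stop, stop, say no more. We don't support distributed tracing yet, so all I can do is go through the services with you by hand, one at a time.

First I asked him to outline the business logic and how the service is called; that took 10 minutes. Then we checked each service and the resources behind it, level by level, and finally located the fault about half an hour later...

That is far too slow. So a tracing project went onto the agenda with a single goal: locate problems quickly and improve stability. For tracing, OpenTelemetry is the current industry favorite, so this ops engineer set out to study it.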

Environment

Component           Version
Operating system    Ubuntu 22.04.4 LTS
opentelemetry-sdk   1.35.0

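Installation

First, a brief word on how OpenTelemetry collects data; then we will get something running before discussing the details.

OpenTelemetry collects data through instrumentation points embedded in the code. That is the job of opentelemetry-sdk.

The data is then uploaded over a standard protocol to a backend that displays it. Here that backend is the Jaeger UI.

Install the OpenTelemetry SDK

pip3 install opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-api

Install Jaeger UI to display the data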

docker pull docker.m.daocloud.io/jaegertracing/all-in-one:latest

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  docker.m.daocloud.io/jaegertracing/all-in-one:latest

Once the container is running, open http://127.0.0.1:16686

A first example

Web service

First, prepare a web service. Here we implement it with tornado, which is installed with: pip3 install tornado

import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        self.finish('hello world')


def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)


def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(10000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()

Check that the service responds normally: curl http://localhost:10000 should return hello world.

Adding instrumentation

import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Register a global TracerProvider; the service appears in Jaeger as "s1".
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({SERVICE_NAME: "s1"}))
)
tracer = trace.get_tracer(__name__)

# Batch finished spans and send them to Jaeger's OTLP/HTTP endpoint.
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
trace.get_tracer_provider().add_span_processor(span_processor)


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        views()
        self.finish('hello world')


def views():
    # The instrumentation point: start a span and end it right away.
    span = tracer.start_span("s1-span")
    span.end()


def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)


def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(10000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()

Request the service again with curl http://localhost:10000, then open the Jaeger UI.

The data is there: the span we just added has been reported to Jaeger.

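Span attributes

Let's enrich the span with a few attributes.

def views():
    span = tracer.start_span("s1-span")
    # Attributes are key/value pairs shown as tags on the span in Jaeger.
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    span.end()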


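Adding database access tracing

def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    # Wrap the span in a context so it can be passed on as the parent.
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    span.end()


def get_db(parent_ctx):
    # Starting the span with the parent's context makes it a child of "s1-span".
    span = tracer.start_span("s1-span-db", context=parent_ctx)
    span.end()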


Adding cross-service tracing

Add a second web service: s2.py

import tornado.httpserver as httpserver
import tornado.web
from tornado.ioloop import IOLoop

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({SERVICE_NAME: "s2"}))
)
tracer = trace.get_tracer(__name__)

span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
trace.get_tracer_provider().add_span_processor(span_processor)


class TestFlow(tornado.web.RequestHandler):
    def get(self):
        # Rebuild the caller's trace context from the incoming request headers.
        ctx = TraceContextTextMapPropagator().extract(self.request.headers)
        # The new span joins the caller's trace as a child span.
        span = tracer.start_span("s2-span", context=ctx)
        span.end()
        self.finish('hello world')


def applications():
    urls = []
    urls.append([r'/', TestFlow])
    return tornado.web.Application(urls)


def main():
    app = applications()
    server = httpserver.HTTPServer(app)
    server.bind(20000, '0.0.0.0')
    server.start(1)
    IOLoop.current().start()


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        IOLoop.current().stop()
    finally:
        IOLoop.current().close()

Modify s1.py

from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
import requests


def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    # Write the trace context into the outgoing headers so s2 can continue the trace.
    headers = {}
    TraceContextTextMapPropagator().inject(headers, context=ctx)
    requests.get("http://localhost:20000", headers=headers)
    span.end()


Moving to k8s

jaeger

Manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: jaeger
  name: jaeger
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - image: docker.m.daocloud.io/jaegertracing/all-in-one:latest
        imagePullPolicy: Always
        name: jaeger
      dnsPolicy: ClusterFirst
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: jaeger-service
  name: jaeger-service
  namespace: default
spec:
  ports:
  - name: port-4317
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: port-4318
    port: 4318
    protocol: TCP
    targetPort: 4318
  - name: port-16686
    port: 16686
    protocol: TCP
    targetPort: 16686
  selector:
    app: jaeger
  type: NodePort

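s2

1) Build the image

Inside the k8s cluster Jaeger is reached through a Service (svc), so s2.py needs a small change.

s2.py

...
import os

# Address of the Jaeger OTLP/HTTP endpoint, injected through the environment.
JAEGER_ADDR = os.environ.get('JAEGER_ADDR')
...
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=JAEGER_ADDR))
...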
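
If JAEGER_ADDR is not set, os.environ.get returns None, which makes it harder to see where spans are being sent. One option is an explicit fallback. The helper below is a hypothetical sketch, not part of the original s2.py; it assumes the local endpoint used earlier in the article is a sensible default.

```python
import os

# Assumed default: the local OTLP/HTTP endpoint used in the single-host example.
DEFAULT_OTLP_ENDPOINT = "http://localhost:4318/v1/traces"


def otlp_endpoint(environ=os.environ):
    """Return the exporter endpoint from JAEGER_ADDR, or the local default."""
    value = environ.get("JAEGER_ADDR", "").strip()
    # An unset or empty variable falls back to the default endpoint.
    return value or DEFAULT_OTLP_ENDPOINT


print(otlp_endpoint({}))
print(otlp_endpoint({"JAEGER_ADDR": "http://jaeger-service:4318/v1/traces"}))
```

The result can be passed straight to OTLPSpanExporter(endpoint=...), so the same file runs both on a workstation and in the cluster.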

Dockerfile

FROM python:3.8
WORKDIR /opt
RUN pip3 install tornado opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp -i https://pypi.tuna.tsinghua.edu.cn/simple
ADD s2.py /opt
CMD python3 s2.py

2) Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: s2
  name: s2
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: s2
  template:
    metadata:
      labels:
        app: s2
    spec:
      containers:
      - env:
        - name: JAEGER_ADDR
          value: http://jaeger-service:4318/v1/traces
        image: s2:v1
        imagePullPolicy: Always
        name: s2
      dnsPolicy: ClusterFirst
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: s2-service
  name: s2-service
  namespace: default
spec:
  ports:
  - name: s2-port
    port: 20000
    protocol: TCP
    targetPort: 20000
  selector:
    app: s2
  type: NodePort

s1

1) Build the image

Inside the k8s cluster both s2 and Jaeger are reached through Services (svc), so s1.py needs a small change.

s1.py

...
import os

# Addresses of s2 and of the Jaeger OTLP/HTTP endpoint, injected through the environment.
S2_ADDR = os.environ.get('S2_ADDR')
JAEGER_ADDR = os.environ.get('JAEGER_ADDR')
...
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=JAEGER_ADDR))
...
def views():
    span = tracer.start_span("s1-span")
    span.set_attribute("name", "wilson")
    span.set_attribute("addr", "cd")
    ctx = trace.set_span_in_context(span)
    get_db(ctx)
    headers = {}
    TraceContextTextMapPropagator().inject(headers, context=ctx)
    requests.get(S2_ADDR, headers=headers)
    span.end()
...

Dockerfile:

FROM python:3.8
WORKDIR /opt
RUN pip3 install tornado requests opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp -i https://pypi.tuna.tsinghua.edu.cn/simple
ADD s1.py /opt
CMD python3 s1.py

2) Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: s1
  name: s1
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: s1
  template:
    metadata:
      labels:
        app: s1
    spec:
      containers:
      - env:
        - name: S2_ADDR
          value: http://s2-service:20000
        - name: JAEGER_ADDR
          value: http://jaeger-service:4318/v1/traces
        image: s1:v1
        imagePullPolicy: Always
        name: s1
      dnsPolicy: ClusterFirst
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: s1-service
  name: s1-service
  namespace: default
spec:
  ports:
  - name: s1-port
    port: 10000
    protocol: TCP
    targetPort: 10000
  selector:
    app: s1
  type: NodePort

Checking the result

▶ kubectl get pod -owide
NAME                     READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
jaeger-6669cd7c4-4pl5j   1/1     Running   0          7m31s   10.244.0.236   minikube   <none>           <none>
s1-5c569c5b4b-lctzq      1/1     Running   0          73s     10.244.0.237   minikube   <none>           <none>
s2-5bb648dcdf-mlnbj      1/1     Running   0          61s     10.244.0.238   minikube   <none>           <none>

▶ kubectl get svc
NAME             TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
jaeger-service   NodePort   10.106.13.217    <none>        4317:31891/TCP,4318:31997/TCP,16686:31002/TCP   5m49s
s1-service       NodePort   10.102.25.195    <none>        10000:32376/TCP                                 4m23s
s2-service       NodePort   10.103.114.198   <none>        20000:30032/TCP                                 3m40s

Run a test request:

Call the s1 service

▶ curl http://192.168.49.2:32376
hello world%

View the trace data in Jaeger at http://192.168.49.2:31002/

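Summary

In this first example we collected trace records from business services: the path a complete request travels, including database reads, cross-service calls, and so on.

Throughout the tracing process, trace_id and span_id play the decisive roles. The trace_id uniquely identifies the request chain and ties all of its steps together; the span_id identifies each individual operation along that chain.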
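
Both identifiers travel between services in the W3C Trace Context traceparent header, which TraceContextTextMapPropagator reads and writes. Its layout is version-trace_id-span_id-flags, with a 32-hex-digit trace_id and a 16-hex-digit span_id. The parser below is a simplified standard-library sketch for illustration only: the function names are hypothetical, and it skips some checks the specification requires, such as rejecting all-zero IDs.

```python
import re

# version (2 hex) - trace_id (32 hex) - span_id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(value):
    """Split a traceparent header value into its four fields."""
    match = TRACEPARENT_RE.match(value.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {value!r}")
    version, trace_id, span_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,  # shared by every span in the request chain
        "span_id": span_id,    # identifies the calling operation
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(parent_value, new_span_id):
    """Build the header for a downstream call: same trace_id, new span_id."""
    parent = parse_traceparent(parent_value)
    flags = "01" if parent["sampled"] else "00"
    return f'{parent["version"]}-{parent["trace_id"]}-{new_span_id}-{flags}'


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(incoming)
outgoing = child_traceparent(incoming, "b7ad6b7169203331")
print(parsed["trace_id"], parsed["span_id"], outgoing)
```

Across the hop the trace_id stays fixed while the span_id changes, which is what lets the backend stitch spans from different services into one trace.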

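Collection: instrumentation embedded in the code collects data from the flows that matter most, such as database read/write latency and downstream service latency.

Processing: opentelemetry-sdk processes the collected data: filtering, buffering, and batching.

Export: the processed data is sent over a standard protocol (OTLP, carried over gRPC or HTTP) to a backend system such as Jaeger.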
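
The three stages can be pictured with a toy model. The class below is a standard-library sketch written for illustration; it is not how opentelemetry-sdk is implemented (the real BatchSpanProcessor also flushes on a timer from a background thread), and the names ToyBatchProcessor, keep, and max_batch_size are hypothetical.

```python
from collections import deque


class ToyBatchProcessor:
    """Toy model of the pipeline: filter spans, buffer them, export in batches."""

    def __init__(self, export, max_batch_size=2, keep=lambda span: True):
        self._export = export            # export stage: called with a list of spans
        self._max_batch_size = max_batch_size
        self._keep = keep                # processing stage: filter predicate
        self._buffer = deque()           # processing stage: buffer

    def on_end(self, span):
        # Collection stage: instrumented code hands over a finished span.
        if not self._keep(span):
            return
        self._buffer.append(span)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered as one batch, then empty the buffer.
        if self._buffer:
            batch = list(self._buffer)
            self._buffer.clear()
            self._export(batch)


exported = []
processor = ToyBatchProcessor(
    exported.append,
    max_batch_size=2,
    keep=lambda span: span["name"] != "healthcheck",
)
for name in ["s1-span", "healthcheck", "s1-span-db", "s2-span"]:
    processor.on_end({"name": name})
processor.flush()  # drain the remainder, as a shutdown hook would

batches = [[span["name"] for span in batch] for batch in exported]
print(batches)
```

Batching trades a little latency for far fewer network calls to the backend, which is why the examples in this article use BatchSpanProcessor for export.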
