Introduction to Observability and Setting up Distributed Tracing using Jaeger
Microservice architecture has enabled developers to build and release software independently and faster. As these distributed systems grow in complexity and scale, it becomes much harder for developers to see how their services behave and how they affect other services. Observability has emerged as a critical solution, giving developers the means to gain comprehensive insight into the systems they build. This blog explores observability and distributed tracing, highlighting their importance and practical uses.
Topics covered in this blog:
- What is Observability?
- What is OpenTelemetry?
- What is distributed tracing?
- Introduction to Jaeger
- Jaeger Components
- Deploying Jaeger on Kubernetes
- Collecting and displaying application traces using Jaeger and OpenTelemetry
- Basic interpretation of traces
What is Observability?
In the software world, observability is the ability to know exactly how our software behaves internally by collecting and analyzing data from its components. Observability allows us to understand the internal state of a software system and enables development teams to quickly and accurately navigate to the root cause of a technical issue.
Observability uses three types of telemetry data (logs, metrics and traces) to provide deep visibility into distributed systems. Together, logs, metrics and traces are known as the three pillars of observability.
What is OpenTelemetry?
OpenTelemetry provides a single, open-source standard and vendor-agnostic SDKs, APIs and tools to capture and export logs, metrics and traces from our distributed application and infrastructure.
Since distributed applications and services make capturing and exporting telemetry data complicated, OpenTelemetry’s standardized tools make it easier to capture and export that data according to our requirements.
What is Distributed Tracing?
Distributed tracing is a method of monitoring and observing service requests in applications built on a microservice architecture. It tracks requests and transactions as they pass through each service, providing observability into the request’s journey. This helps determine which service or function is causing performance issues.
How does distributed tracing work?
When a user makes a request, the request is assigned a unique trace ID. As the request moves through the host system, every operation performed is tagged with the initial request’s trace ID. These operations are called child spans.
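To make the idea concrete, here is a minimal sketch in plain Python of how a trace ID can follow a request through nested operations. It is a simplified model, not the OpenTelemetry API: the names operation, recorded and handle_request are made up for illustration, and the IDs are just random hex strings.

```python
import contextvars
import secrets
from contextlib import contextmanager

# Holds (trace_id, span_id) of the operation currently in progress, if any.
_current = contextvars.ContextVar("current_span", default=None)

# Finished operations are appended here; a real tracer would export them instead.
recorded = []


@contextmanager
def operation(name):
    """Record one operation, reusing the trace ID of the enclosing operation."""
    parent = _current.get()
    # The first operation of a request gets a fresh trace ID; nested ones inherit it.
    trace_id = parent[0] if parent else secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    token = _current.set((trace_id, span_id))
    try:
        yield
    finally:
        _current.reset(token)
        recorded.append({
            "name": name,
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": parent[1] if parent else None,
        })


def handle_request():
    # Hypothetical request handler with two nested steps.
    with operation("add_to_cart"):
        with operation("check_product_exists"):
            pass
        with operation("save_cart"):
            pass


handle_request()
for op in recorded:
    print(op["name"], op["trace_id"], "parent:", op["parent_id"])
```

Every recorded operation carries the same trace ID, and the two nested operations point back to the outer one through parent_id. That shared ID is what lets a tracing backend stitch the pieces of one request together.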
Basic Terminologies in Distributed Tracing and OpenTelemetry:
Trace: A trace exposes the request’s execution path through the distributed system. A trace is composed of one or more spans.
Span: In distributed tracing, a single trace contains a series of tagged time intervals known as spans. The first span is known as the parent span, and subsequent spans are known as child spans. A child span is triggered by a parent span and can represent a function call, a DB call, a call to another service, etc.
Tags: Tags are metadata used to contextualize traces and to support multi-dimensional queries when analyzing them. Traces and spans are annotated with tags.
Introduction to Jaeger
Jaeger is a distributed tracing system released as open source by Uber Technologies. It’s used for monitoring and tracing transactions between services in microservice-based distributed systems.
Read more about Jaeger here: https://www.jaegertracing.io/docs/1.44/
Jaeger Components
Jaeger consists of multiple components that work together to collect, store and visualize traces and spans.
Jaeger Client
The Jaeger client contains language-specific implementations of the OpenTracing API for distributed tracing.
Jaeger Agent
The Jaeger agent is a process that runs continuously in the background. It listens for spans that the client sends over User Datagram Protocol (UDP). The agent is meant to be placed on the same host as the instrumented application. This is usually achieved by deploying it as a sidecar in container environments like Kubernetes.
Jaeger Collector
The Jaeger collector receives spans and traces and places them in a queue for processing. It then processes the traces and writes them to storage.
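The receive, queue, process and store flow can be sketched with Python’s standard library. This is only an illustration of the pattern, not how the Jaeger collector is implemented: the receive and worker functions, the in-memory storage dictionary and the lowercase "processing" step are all assumptions made for the example.

```python
import queue
import threading
from collections import defaultdict

span_queue = queue.Queue()      # spans wait here until the worker picks them up
storage = defaultdict(list)     # stand-in for a storage backend, keyed by trace ID


def receive(span):
    """Entry point: accept a span and enqueue it so the sender is not kept waiting."""
    span_queue.put(span)


def worker():
    """Drain the queue, apply a small processing step, and store each span."""
    while True:
        span = span_queue.get()
        if span is None:        # sentinel value: stop the worker
            break
        # Example processing step: normalize the service name.
        span = {**span, "service": span["service"].lower()}
        storage[span["trace_id"]].append(span)


t = threading.Thread(target=worker, daemon=True)
t.start()

receive({"trace_id": "t1", "service": "Cart", "operation": "add_to_cart"})
receive({"trace_id": "t1", "service": "Product", "operation": "get_product_by_id"})
receive({"trace_id": "t2", "service": "Cart", "operation": "get_cart"})

receive(None)   # signal shutdown once the spans above have been queued
t.join()

print({trace_id: len(spans) for trace_id, spans in storage.items()})
```

The queue decouples receiving from storing, so a short burst of incoming spans does not have to wait for the storage writes to finish.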
Storage
The Jaeger tracing system receives traces and spans and stores them in persistent storage or a database.
Ingester
Jaeger can also be deployed with Kafka as a buffer between the collector and storage. In that setup, the ingester is a module that reads trace data from Kafka and writes it to the storage backend.
Query
Query is a service that retrieves traces from the database/storage. Developers use queries to quickly find traces by time, tags and operation.
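As a rough illustration of what searching by time, tags and operation means, here is a small in-memory filter in Python. The Span class, the find_traces function and the sample data are hypothetical and exist only for this sketch; a real Jaeger query runs against the storage backend.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    trace_id: str
    operation: str
    start: float                  # start time, seconds since epoch
    duration: float               # seconds
    tags: dict = field(default_factory=dict)


def find_traces(spans, operation=None, tags=None, start_min=None, start_max=None):
    """Return IDs of traces with at least one span matching every given filter."""
    tags = tags or {}
    matches = []
    for span in spans:
        if operation is not None and span.operation != operation:
            continue
        if start_min is not None and span.start < start_min:
            continue
        if start_max is not None and span.start > start_max:
            continue
        # Every requested tag must be present with the requested value.
        if any(span.tags.get(key) != value for key, value in tags.items()):
            continue
        if span.trace_id not in matches:
            matches.append(span.trace_id)
    return matches


# Made-up sample spans for two traces.
spans = [
    Span("t1", "add_to_cart", 100.0, 4.0, {"http.status_code": 200}),
    Span("t1", "get_product_by_id", 100.1, 3.8, {"http.status_code": 200}),
    Span("t2", "get_cart", 200.0, 0.2, {"http.status_code": 500, "error": True}),
]

print(find_traces(spans, tags={"error": True}))
print(find_traces(spans, operation="add_to_cart", start_max=150.0))
```

The first call returns the trace that contains a failed span, and the second returns the trace that contains an add_to_cart span starting before the given time.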
Jaeger Console
Jaeger Console is a user interface that lets us visualize and analyze traces. It displays trace data using graphs and charts.
Deploying Jaeger on Kubernetes
To deploy Jaeger on Kubernetes, the prerequisite is a Kubernetes cluster. If you are trying this out locally, you can use minikube, which runs a single-node cluster on your machine. You can also use managed Kubernetes services like Amazon EKS, Google Kubernetes Engine, Azure Kubernetes Service, RedHat OpenShift, etc.
Setting up Kubernetes using minikube:
For minikube installation, you can refer to this guide: https://minikube.sigs.k8s.io/docs/start/
I’m using VirtualBox as the driver for minikube.
$ minikube start --memory=4096 --driver=virtualbox
I’m using the kubectl command provided by minikube by setting up an alias.
$ alias kubectl="minikube kubectl --"
Deploying Jaeger:
Step 1: Install dependencies
Jaeger Operator depends on cert-manager, so let’s install that first.
$ kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.2/cert-manager.yaml
Step 2: Install Jaeger Operator
Jaeger Operator simplifies deploying and running Jaeger on Kubernetes. Read more about Jaeger Operator here: https://www.jaegertracing.io/docs/1.44/operator/
First, create an observability namespace and install the Jaeger Operator into it.
$ kubectl create namespace observability
$ kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.44.0/jaeger-operator.yaml -n observability
Verify that the resources are created in the observability namespace.
Step 3: Deploy Jaeger Instance
Download the code from the repository, change to the deployment directory, and apply the Jaeger instance configuration.
$ git clone https://github.com/cankush625/Jaeger.git
$ cd ./Jaeger/deploy_jaeger
$ kubectl apply -k jaeger_instance
Verify that the deployments and services are created successfully.
Step 4: Access Jaeger Console
The Jaeger console will be available at your minikube cluster IP on port 80. Retrieve the cluster IP using the command below:
$ minikube ip
Open this IP on port 80 in a browser. You should see the Jaeger UI page.
So, we have configured Jaeger successfully.
Note:
I haven’t set up storage for this Jaeger deployment, so it uses in-memory storage. Trace data is not persisted and will be lost on restart. I might cover setting up persistent storage in another blog.
Instrumenting the application using OpenTelemetry and Jaeger
Now, we will instrument the application so that it generates trace information and sends it to Jaeger in a specific format.
I have created two services: a product service and a cart service. The product service creates and returns products. It is written in the Django framework and runs on port 8000. The cart service is written in the Flask framework and runs on port 5000. The cart service calls the product service to check whether the product being added to the cart exists and to perform some other validations. At a high level, the cart service calls the product service, and both services send their traces to Jaeger.
You can find the code for both product and cart services here:
Product Service: https://github.com/cankush625/jaeger-product-service
Cart Service: https://github.com/cankush625/jaeger-cart-service
Now, let’s run both of these services and validate that they are available on the specified ports.
As we hit a service endpoint, we should see a trace generated in Jaeger.
So, finally, we have instrumented the application successfully.
Basic interpretation of traces
Let’s make a request to add a product to the cart.
This request took 4 seconds to execute, which is very slow.
Let’s select the trace generated by the above request and analyze it to understand what makes the request slow.
Now, let’s expand the child spans to see which function or service took the extra time.
Here, we can see that the API to retrieve a product by ID is taking a lot of time. So, we need to optimize this API.
Let’s go to the code and see if there is any room for optimization. Here, we can see that the function get_product_by_id takes a long time to execute. The line time.sleep(1) in the code causes the function to pause for 1 second on every iteration.
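The effect of a per-iteration sleep on a span’s duration can be reproduced with a small timing sketch. This is not the actual product service code: the body of get_product_by_id below, the PRODUCTS list and the timed helper are hypothetical stand-ins, and the delay is shortened to 0.05 seconds so the example runs quickly.

```python
import time
from contextlib import contextmanager

durations = {}   # operation name -> elapsed seconds, a stand-in for span durations


@contextmanager
def timed(name):
    """Measure how long the wrapped block takes, similar to a span's duration."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[name] = time.perf_counter() - start


# Made-up data for the example.
PRODUCTS = [{"id": i, "name": f"product-{i}"} for i in range(1, 5)]


def get_product_by_id(product_id, delay):
    """Hypothetical lookup: scans the list and pauses `delay` seconds per iteration."""
    for product in PRODUCTS:
        time.sleep(delay)   # the kind of per-iteration pause the trace exposed
        if product["id"] == product_id:
            return product
    return None


with timed("slow_lookup"):
    get_product_by_id(4, delay=0.05)   # 4 iterations, so roughly 4 * 0.05 seconds

with timed("fast_lookup"):
    get_product_by_id(4, delay=0)      # same logic with the pause removed

slowest = max(durations, key=durations.get)
print(slowest, {name: round(value, 3) for name, value in durations.items()})
```

Because the pause sits inside the loop, its cost grows with the number of iterations, which is why it shows up so clearly as one long child span in the trace.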
Let’s remove this line of code and restart the service.
Now, hit the API to add the item to the cart.
Now, we can see that the request completes much faster and the function that checks whether the product exists takes less time.
Here, we are able to quickly track exactly which service is causing the performance bottleneck. This is possible because we configured OpenTelemetry to propagate the context from one service to another, so that the child spans from the called service get attached to the parent span from the initial service.
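Context propagation between services usually travels in request headers. The sketch below models it with the W3C Trace Context traceparent header, which has the form version-traceid-spanid-flags. It is a hand-rolled illustration, not the OpenTelemetry propagator API: the inject and extract helpers and the header dictionaries are assumptions made for the example.

```python
import re
import secrets

# traceparent: 2 hex version, 32 hex trace ID, 16 hex span ID, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Caller side: write the current span's identity into the outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Callee side: read (trace_id, parent_span_id) from incoming headers, or None."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    return match.group(1), match.group(2)


# Cart service: start a trace and prepare the call to the product service.
cart_trace_id = secrets.token_hex(16)
cart_span_id = secrets.token_hex(8)
outgoing = inject({"Accept": "application/json"}, cart_trace_id, cart_span_id)

# Product service: continue the same trace with a new child span.
context = extract(outgoing)
if context is None:
    trace_id, parent_id = secrets.token_hex(16), None   # no context: start a new trace
else:
    trace_id, parent_id = context
product_span = {
    "trace_id": trace_id,
    "span_id": secrets.token_hex(8),
    "parent_id": parent_id,
}

print(outgoing["traceparent"])
print(product_span)
```

The product service’s span reuses the cart service’s trace ID and names the cart span as its parent, which is what allows spans from both services to appear in a single trace.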
Getting the stack trace for failing code:
We can also see crash-related data and stack traces in Jaeger. I don’t have a pricing service available. If I hit the get cart API, which calls the pricing service, it results in a connection error. We can see the exact stack trace in Jaeger.
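To show how a stack trace can end up attached to a span, here is a simplified sketch using only the standard library. It is not the cart service code or the OpenTelemetry API: the span helper, get_price and get_cart functions, and the record layout are hypothetical.

```python
import traceback
from contextlib import contextmanager

finished_spans = []   # stand-in for spans exported to a tracing backend


@contextmanager
def span(name):
    """Record a span; on failure, attach the error details and re-raise."""
    record = {"name": name, "tags": {}, "logs": []}
    try:
        yield record
    except Exception as exc:
        record["tags"]["error"] = True
        record["logs"].append({
            "event": "exception",
            "type": type(exc).__name__,
            "message": str(exc),
            "stacktrace": traceback.format_exc(),
        })
        raise
    finally:
        finished_spans.append(record)


def get_price(product_id):
    """Hypothetical call to a pricing service that is not running."""
    raise ConnectionError("pricing service is unreachable")


def get_cart():
    with span("get_cart"):
        return {"items": [1], "total": get_price(1)}


try:
    get_cart()
except ConnectionError:
    pass   # the caller still sees the failure; the span has kept the details

failed = finished_spans[0]
print(failed["tags"], failed["logs"][0]["type"])
print(failed["logs"][0]["stacktrace"])
```

The exception is re-raised after it is recorded, so the application’s own error handling is unchanged while the span keeps the exception type, message and stack trace for later inspection.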
Note:
As Jaeger is an open source tool, when we use it directly, we need to manage the entire infrastructure ourselves. There are managed services on the market that can be used directly, without the overhead of setting up and managing Jaeger. I might cover one of those services in the next blog.