tracing.md

Tracing Temporal Services with OTEL

The Temporal server can be configured with OTEL trace exporters that emit spans and traces for observability. More specifically, the server uses the Go OpenTelemetry library for instrumentation and for multi-protocol, multi-model telemetry exporting. This document is intended to help developers understand how to configure exporters and instrument their code. A full exploration of tracing and telemetry is out of scope for this document; the reader is referred to external reference material, third-party descriptions, and the specification itself.

Configuring

No trace exporters are configured by default, so trace data is neither collected nor emitted until additional configuration is added to the server's YAML configuration files.

The server now supports a new otel YAML stanza which is used to configure a set of process-wide exporters. In OpenTelemetry, the concept of an "exporter" is abstract. The concrete implementation of an exporter is determined by a 3-tuple of values: the exporter signal, model, and protocol. In OTEL, a "signal" is one of traces, metrics, or logs (in this document we will only deal with traces), "model" indicates the abstract data model for the span and trace data being exported, and the "protocol" specifies the concrete application protocol binding for the indicated model. Temporal is known to support exporting trace data as defined by otlp over either grpc or http.

A common configuration is to emit tracing data to an agent such as the otel-collector running locally. To configure such a system add the stanza below to your configuration yaml file(s).

otel:
  exporters:
    - kind:
        signal: traces
        model: otlp
        protocol: grpc
      spec:
        connection:
          insecure: true
          endpoint: localhost:4317

Another example is pointing Temporal directly at the Honeycomb hosted OTLP collection service. To achieve such a configuration you will need an API key from the upstream Honeycomb service and the stanza below.

otel:
  exporters:
    - kind:
        signal: traces
        model: otlp
        protocol: grpc
      spec:
        connection:
          endpoint: api.honeycomb.io:443
        headers:
          x-honeycomb-team: <a honeycomb API key>

Note that the configuration parser supports defining multiple exporters by supplying additional kind and spec declarations. Additional configuration fields can be found in config_test.go and are mostly related to the underlying gRPC client configuration (retries, timeouts, etc).

Note that the Go OTEL SDK will also read a well-known set of environment variables for configuration. So if you prefer setting environment variables to writing YAML then you can use the variables defined in the OTEL spec. If environment variables conflict with YAML-provided configuration then the YAML takes precedence.

Instrumenting

While the exporter configuration described above is executed and set up at process startup time, instrumentation code (the creation and termination of spans) is inserted inline, like logging statements, into normal server processing code. Spans are created by go.opentelemetry.io/otel/trace.Tracer objects, which are themselves created by go.opentelemetry.io/otel/trace.TracerProvider instances. Each TracerProvider instance is bound to a single logical service, so a single Temporal process will have up to four such instances (for the worker, matching, history, and frontend services respectively). A Tracer object is bound to a single logical library, which is different from a service. Consider that a history service instance might run code from the temporal common library, the gRPC library, and the gocql library.

Tracer and TracerProvider object management has been added to the server's fx DI configuration, so both are available to any fx-enabled object constructor. Due to the possibility of multiple services being co-resident within a single process, we do not use the OTEL library's capability to host and access a single global TracerProvider.

By default, gRPC clients and servers are instrumented via the open source otelgrpc library.

Instrumentation Tips

Follow the OTEL attribute naming guidelines

The OpenTelemetry project has published a non-normative set of guidelines for attribute naming.

If nothing else, please

Always check for an appropriate attribute in semconv before creating your own
Always prefix Temporal attributes with io.temporal

Create shared package-appropriate attribute keys

Do not create a single file in common for all attributes

Do not create packages just for OTEL attributes

Do create a set of attribute.Keys in the semantically appropriate package and re-use those to create attribute.KeyValues as needed.

Do create a set of utility functions that can transform frequently used aggregate types (Tasks, WorkflowExecutions, TaskQueues, etc) into an []attribute.KeyValue. The association of attribute.KeyValues to a trace.Span can be verbose in terms of the number of lines of code needed so any reduction in that noise will be a good idea. Not to mention the consistency benefit of sharing a single mapping function.

Start a span in `common` or other non-service-specific code

Q: Given that common code can be called from any service, how can I start a span in common library code that is bound to the appropriate service (frontend/history/matching/worker)?

A: The TracerProvider that created the currently active Span can be retrieved from that Span itself, and the currently active Span can be retrieved from the context.Context.

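import (
   "context"
   "fmt"
   "go.opentelemetry.io/otel/trace"
)

// DoFoo is a function in the common package
func DoFoo(ctx context.Context, x int, y string) string {
   var span trace.Span
   // Reuse the TracerProvider of the active Span; Start takes the parent context as its first argument.
   ctx, span = trace.SpanFromContext(ctx).TracerProvider().Tracer("go.temporal.io/server/common").Start(ctx, "DoFoo")
   defer span.End()
   return fmt.Sprintf("%v-%v", y, x)
}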

`RecordError` does not imply Span failure

Using Span.RecordError is a good idea but not all errors imply failure. Thus if you want to capture an error and also capture that a span failed, you must additionally call Span.SetStatus(codes.Error, err.Error()). A FailSpanWithError utility function might be a good idea.

Propagate TraceContext across things other than function calls

This is taken care of by default for gRPC calls via the otelgrpc interceptors. However you may want to propagate tracing information between goroutines or other places where the context.Context is not passed such as handoffs through a Go channel or an external datastore. There are two broad approaches that are applicable in different situations:

If the object being transferred is not externally durable (e.g. an object put into a Go channel but not spooled to a database) then you can pull the trace.SpanContext out of the current trace.Span with trace.SpanContextFromContext(context.Context) or Span.SpanContext() and pass that object along with the data being transferred. The consuming side can restore the tracing state with trace.ContextWithSpanContext(context.Context, trace.SpanContext), which returns a new context.Context carrying that SpanContext.
If the tracing state needs to be serialized, the OTEL library provides the propagation package to convert trace state into a more serialization-friendly type such as a map[string]string. The propagation.TraceContext type can be used to inject and extract trace state into a key-value-ish object.

carrier := propagation.MapCarrier(map[string]string{})
propagation.TraceContext{}.Inject(ctx, carrier)
// write the carrier object to a durable store

Trace individual tasks that are processed together in batches

OpenTelemetry Spans can be linked together to form a non-parent-child relationship. One of the main use cases for linking is so that a batch process (e.g. a database read that fills a large buffer of work items) can create Spans for each of the individual work items it creates and those Spans can be linked back to the parent batch Span without that span becoming their logical parent.

Still want to log things?

Use Span.AddEvent to write messages that will be associated with that Span. From the OTEL manual:

An event is a human-readable message on a span that represents “something happening” during it’s lifetime
