InsureWatch — Application Guide | otel.guru

What InsureWatch Is

InsureWatch is a fictional insurance operations platform built specifically as an OTel training environment. It does three things well:

It’s polyglot. Five backend services, three languages (Node.js, Python, Java), four frameworks (Express, FastAPI, Spring Boot, React). This is not an accident — real production systems look like this. Learning OTel on a single-language app gives you half the picture.
It has real business logic. Policies, claims, investments, notifications — each service has a distinct domain with actual data models and inter-service dependencies. When you instrument it, you’re adding telemetry to something that means something, not a toy counter.
It has a chaos controller. A dedicated service that injects live failure conditions into the running stack. Latency, database failures, memory pressure, service crashes — you can trigger them on demand and observe exactly what your instrumentation reveals (and doesn’t).

The purpose is not to build InsureWatch. It’s to operate it observably — to understand what’s happening inside a distributed system from the telemetry it produces.

Why Polyglot Matters

If you’ve instrumented one language, you understand OTel’s concepts. Polyglot teaches you OTel’s architecture.

The OTel project ships separate SDKs for each language — opentelemetry-sdk for Python, @opentelemetry/sdk-node for Node.js, the Java agent for JVM services. Each SDK has different initialization patterns, different auto-instrumentation packages, different ways of attaching to frameworks. But every SDK produces the same wire format (OTLP) and the same data model (spans, metrics, logs with shared context).

InsureWatch lets you see this in one running system:

A Python service (Claims) using the OpenTelemetry Python SDK with FastAPIInstrumentor and PymongoInstrumentor
A Node.js service (API Gateway) using @opentelemetry/sdk-node with getNodeAutoInstrumentations()
A Java service (Policy) using the OpenTelemetry Java Agent — a JVM agent that instruments without any code changes

All three produce traces. All three attach to the same trace ID when a request flows through them. The fact that one is Python, one is JavaScript, and one is Java is invisible to Grafana — it shows you one unified waterfall.

That is the vendor-neutral, language-neutral promise of OTel. InsureWatch demonstrates it in practice.

Service Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    InsureWatch Platform                          │
│                                                                 │
│  ┌─────────────┐    ┌──────────────────────────────────────┐   │
│  │   Frontend   │    │           API Gateway                │   │
│  │  React/Vite  │───►│         Node.js / Express            │   │
│  │  Port: 5173  │    │           Port: 3000                 │   │
│  └─────────────┘    └──────┬──────────┬──────────┬─────────┘   │
│                            │          │          │             │
│              ┌─────────────┘    ┌─────┘    ┌────┘             │
│              ▼                  ▼          ▼                   │
│  ┌──────────────────┐  ┌──────────────┐  ┌──────────────────┐ │
│  │  Claims Service  │  │Policy Service│  │Investment Service│ │
│  │  Python/FastAPI  │  │ Java/Spring  │  │  Node.js/Express │ │
│  │    Port: 3001    │  │  Port: 8080  │  │    Port: 3002    │ │
│  └────────┬─────────┘  └──────────────┘  └──────────────────┘ │
│           │  validates       ▲                                  │
│           └──────────────────┘                                  │
│           │  notifies                                           │
│           ▼                                                     │
│  ┌──────────────────────┐    ┌──────────────────────┐          │
│  │ Notification Service │    │   Chaos Controller   │          │
│  │   Python/FastAPI     │    │   Node.js/Express    │          │
│  │      Port: 3003      │    │      Port: 3004      │          │
│  └──────────────────────┘    └──────────────────────┘          │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                     MongoDB                             │   │
│  │   claims, policies, investments, notifications DBs      │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
                         │ OTLP
                         ▼
              ┌──────────────────────┐
              │   OTel Collector     │
              │   (or lgtm image)    │
              │   gRPC: 4317         │
              │   HTTP: 4318         │
              └──────────┬───────────┘
                         │
                         ▼
              ┌──────────────────────┐
              │   Grafana UI :3100   │
              │   Tempo (traces)     │
              │  Prometheus (metrics)│
              │   Loki (logs)        │
              └──────────────────────┘

Service Reference

API Gateway — Node.js / Express

Port: 3000 | Language: Node.js 20 | Framework: Express 4

The single entry point for all external traffic. Stateless — no database. Routes requests to downstream services and injects trace context into every outgoing HTTP call.

Routes it exposes:

POST /api/claims                 → Claims Service
GET  /api/claims/:id             → Claims Service
GET  /api/claims                 → Claims Service
GET  /api/policy/:customerId     → Policy Service
GET  /api/investments/:customerId → Investment Service
GET  /api/chaos/status           → Chaos Controller
POST /api/chaos/toggle           → Chaos Controller
GET  /health

Why it matters for OTel: This is where you see context propagation in action. Every outbound axios call injects the traceparent header — connecting the gateway’s span to the downstream service’s span. Remove or misconfigure the propagator here and every downstream trace becomes a disconnected root span.
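To make the header concrete, here is a minimal stdlib-only sketch of the W3C traceparent format that the propagator reads and writes. The helper names build_traceparent and parse_traceparent are illustrative assumptions, not InsureWatch code or OTel SDK functions; in the real services the SDK's propagator does this work.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id, span_id, sampled=True):
    """Format a version-00 traceparent header value (hypothetical helper)."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


# Gateway side: one trace ID per request, one span ID per outbound call.
trace_id = secrets.token_hex(16)         # 32 hex characters
gateway_span_id = secrets.token_hex(8)   # 16 hex characters
header = build_traceparent(trace_id, gateway_span_id)

# Downstream side: a parsed header lets the service continue the same trace.
# A missing or invalid header (None) is what produces a disconnected root span.
parsed = parse_traceparent(header)
```

If parse_traceparent returns None on the downstream side, the service has nothing to attach to and starts a new trace, which is the fragmentation described above.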

Instrumentation:

Auto: HTTP server spans (Express), outbound HTTP spans (axios)
Manual: forward <path> span wrapping each downstream call
Metrics: gateway.requests.total, gateway.errors.total, gateway.request.duration
Logs: Winston JSON with traceId, spanId, traceFlags injected into every log record

Claims Service — Python / FastAPI

Port: 3001 | Language: Python 3.11 | Framework: FastAPI + Motor (async MongoDB)

The most complex service. Receives claim submissions, validates them against the Policy Service, auto-approves claims under $1,000, stores to MongoDB, and fires a notification asynchronously.

Routes:

POST /claims                     → submit claim
GET  /claims/:claim_id           → retrieve claim
GET  /claims                     → list claims (filter by customer_id)
GET  /health

Key business logic:

Policy lookup before claim acceptance (GET /policy/{customerId}/coverage)
Auto-approval threshold: claims < $1,000 approved immediately; larger claims set to “pending”
Async notification fire-and-forget to Notification Service (3s timeout, non-fatal on failure)
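The approval rule is small enough to sketch directly. The following is an assumption-laden illustration, not the service's actual code: the function names are hypothetical, and only the $1,000 threshold and the attribute names (claim.type, claim.amount, claim.status, policy.coverage_limit) come from this guide.

```python
from decimal import Decimal

# From the guide: claims under $1,000 are approved immediately.
AUTO_APPROVAL_THRESHOLD = Decimal("1000")


def decide_claim_status(amount):
    """Return 'approved' below the threshold, otherwise 'pending' (hypothetical helper)."""
    if amount < AUTO_APPROVAL_THRESHOLD:
        return "approved"
    return "pending"


def claim_span_attributes(claim_type, amount, coverage_limit):
    """Build the business attributes a manual span could carry (hypothetical helper)."""
    return {
        "claim.type": claim_type,
        "claim.amount": float(amount),
        "claim.status": decide_claim_status(amount),
        "policy.coverage_limit": float(coverage_limit),
    }
```

Keeping the decision in one function means the same value can feed both the stored claim and the span attribute, so the trace never disagrees with the database.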

Why it matters for OTel: Claims is the hub of the critical path. It calls two downstream services, writes to MongoDB, and has business logic (the approval threshold) that has no visibility without manual instrumentation. It also demonstrates the Python SDK initialization pattern — a dedicated instrumentation.py module loaded before the FastAPI app.

Instrumentation:

Module: instrumentation.py — TracerProvider, MeterProvider, LoggerProvider initialized before app
Auto: FastAPI HTTP spans, outbound httpx spans, MongoDB operations (PyMongo)
Manual: submit_claim, get_claim, list_claims spans with business attributes (claim.type, claim.amount, claim.status, policy.coverage_limit)
Metrics: claims.submitted.total, claims.approved.total, claims.rejected.total, claims.processing.duration
Logs: Python logging bridge — every log record includes traceId and spanId in format string
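The log-correlation idea can be shown with the standard library alone. This is a sketch under stated assumptions: the current_ids context variable is a hypothetical stand-in for the active span context that the OTel SDK would supply, and the trace and span IDs are made-up sample values.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the active span context (trace ID, span ID).
current_ids = contextvars.ContextVar("current_ids", default=("0" * 32, "0" * 16))


class TraceContextFilter(logging.Filter):
    """Copy the current trace and span IDs onto every log record."""

    def filter(self, record):
        record.traceId, record.spanId = current_ids.get()
        return True


stream = io.StringIO()  # captured in memory so the output is easy to inspect
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s traceId=%(traceId)s spanId=%(spanId)s %(message)s")
)
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("claims-demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Sample IDs; in a real service these come from the span that is currently active.
current_ids.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.info("claim submitted")
```

With the IDs in every record, a log line found in Loki can be matched to the span that was active when it was written.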

Sample customers: CUST001 – CUST005 (seeded on startup)

Policy Service — Java / Spring Boot

Port: 8080 | Language: Java 21 | Framework: Spring Boot 3.2 + Spring Data MongoDB

Stores and serves insurance policy data. Queried by Claims Service for coverage validation and by the API Gateway for direct policy display.

Routes:

GET /policy/:customerId          → all policies for customer
GET /policy/:customerId/coverage → active policy + coverage limits/deductibles
GET /health

Coverage structure: Each active policy has medical, emergency, and dental sub-coverages with limits and deductibles. The Claims Service uses coverage.limit from this endpoint to decide auto-approval eligibility.

Why it matters for OTel: Policy Service demonstrates the Java instrumentation approach — the OpenTelemetry Java Agent. Unlike Python and Node.js where you write SDK initialization code, the Java agent instruments the JVM at startup via -javaagent:opentelemetry-javaagent.jar. Spring, MongoDB, HTTP clients — all auto-instrumented without a line of OTel code. Manual spans are added via GlobalOpenTelemetry.getTracer() for the business operations.

Instrumentation:

Agent: opentelemetry-javaagent.jar (auto-instrumented: Spring MVC, MongoDB, HTTP clients)
Manual: get_policy, get_coverage spans via GlobalOpenTelemetry.getTracer()
Metrics: policy.lookups.total, policy.errors.total
Logs: SLF4J (default Spring Boot logging — the Java agent bridges this to OTel)

Sample customers: 5 customers with different policy types (health, auto, property, life)

Investment Service — Node.js / Express

Port: 3002 | Language: Node.js 20 | Framework: Express + Mongoose (MongoDB)

Portfolio tracking service. Stores holdings (AAPL, MSFT, GOOGL, BRK.B) per customer and simulates live price volatility (±0.5%) on each fetch.
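A rough sketch of the price simulation, assuming each fetch applies an independent uniform move of at most 0.5% to every holding. The quantities and base prices are placeholder values invented for the example, and the function names are hypothetical.

```python
import random

# Placeholder holdings: symbol -> (quantity, base price). Not real data.
HOLDINGS = {
    "AAPL": (10, 100.0),
    "MSFT": (5, 200.0),
    "GOOGL": (8, 50.0),
    "BRK.B": (2, 300.0),
}
MAX_MOVE = 0.005  # plus or minus 0.5% per fetch


def simulate_price(base_price, rng):
    """Apply one random move within the allowed band."""
    return base_price * (1 + rng.uniform(-MAX_MOVE, MAX_MOVE))


def portfolio_snapshot(holdings, rng):
    """Return the simulated total value and its change against the base value."""
    base_value = sum(qty * price for qty, price in holdings.values())
    live_value = sum(qty * simulate_price(price, rng) for qty, price in holdings.values())
    change_pct = (live_value - base_value) / base_value * 100
    return {"total_value": live_value, "change_pct": change_pct}
```

Because every fetch returns a slightly different total, repeated requests for the same customer produce spans whose portfolio.value attribute varies, which makes the traces easier to tell apart.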

Routes:

GET /investments/:customerId     → full portfolio (holdings, total value, change %)
GET /health

Why it matters for OTel: Demonstrates Node.js SDK setup with getNodeAutoInstrumentations() — a convenience function that registers all available auto-instrumentation packages at once. Also shows a real metric design choice: investment.portfolio.value as a histogram, appropriate for measuring a distribution of values across customers.
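To see why a histogram suits this measurement, here is a minimal stdlib sketch of explicit-bucket aggregation. It is a simplified illustration, not the SDK's implementation; the class name and the bucket boundaries are assumptions chosen for the example.

```python
import bisect


class ExplicitBucketHistogram:
    """Count recorded values into buckets with inclusive upper bounds."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        # One bucket per boundary plus an overflow bucket for larger values.
        self.counts = [0] * (len(self.boundaries) + 1)
        self.total = 0.0
        self.count = 0

    def record(self, value):
        # bisect_left places a value equal to a boundary in the lower bucket.
        self.counts[bisect.bisect_left(self.boundaries, value)] += 1
        self.total += value
        self.count += 1


# Hypothetical boundaries for portfolio values in dollars.
histogram = ExplicitBucketHistogram([1_000, 10_000, 100_000])
for portfolio_value in (500, 1_000, 25_000, 250_000):
    histogram.record(portfolio_value)
```

A counter would only say how many lookups happened; the bucket counts show how portfolio sizes are distributed across customers.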

Instrumentation:

Auto: HTTP spans, Mongoose/MongoDB spans (getNodeAutoInstrumentations())
Manual: get_investments span with customer.id and portfolio.value
Metrics: investment.portfolio.lookups, investment.portfolio.value

Notification Service — Python / FastAPI

Port: 3003 | Language: Python 3.11 | Framework: FastAPI + Motor (async MongoDB)

Receives event-driven notifications from the Claims Service and stores them to MongoDB. Renders message templates by event type. No actual email/SMS — simulated only.

Routes:

POST /notify                     → receive and store notification
GET  /notifications/:customerId  → list notifications (20 most recent)
GET  /health

Event types: claim_submitted, claim_approved, claim_rejected, policy_renewed

Why it matters for OTel: The async, fire-and-forget call from Claims Service to Notification Service is a common propagation pattern. Claims injects context into the outbound POST; Notification Service extracts it. The span appears as a child of the Claims span — demonstrating that even non-critical async paths can carry trace context correctly.
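The fire-and-forget pattern with a timeout can be sketched with asyncio alone. Everything here is an assumption for illustration: post_notification simulates the HTTP POST with a sleep, and the function names are hypothetical rather than taken from the Claims Service.

```python
import asyncio


async def post_notification(payload, network_delay):
    """Stand-in for the POST to /notify; sleeps instead of using the network."""
    await asyncio.sleep(network_delay)
    return "stored"


async def notify_best_effort(payload, network_delay, timeout=3.0):
    """Bound the call with a timeout and swallow failures so they stay non-fatal."""
    try:
        return await asyncio.wait_for(post_notification(payload, network_delay), timeout)
    except asyncio.TimeoutError:
        return "timed_out"
    except Exception:
        return "failed"


async def submit_claim(payload, network_delay=0.0, timeout=3.0):
    """Schedule the notification and respond without awaiting it."""
    task = asyncio.create_task(notify_best_effort(payload, network_delay, timeout))
    return {"status": "accepted"}, task
```

The caller gets its response before the notification finishes, and a slow or failing Notification Service only changes the task's result, never the claim response.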

Instrumentation:

Auto: FastAPI HTTP spans
Manual: send_notification, get_notifications spans
Metrics: notifications.sent.total, notifications.failed.total
Logs: Python logging bridge with trace context

Chaos Controller — Node.js / Express

Port: 3004 | Language: Node.js 20 | Framework: Express

The fault injection hub. Broadcasts chaos commands to all downstream services, triggering simulated failure conditions. Includes an embedded web UI at http://localhost:3004.

Fault types per service:

service_crash — service returns 503 for all requests
high_latency — adds 2–5 second delay to every response
db_failure — database operations fail with connection error
memory_spike — allocates ~50MB heap (unbounded growth)
cpu_spike — busy loop consuming CPU
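How a service might honour these fault flags can be sketched as follows. This is a hypothetical model, not the Chaos Controller's code: FaultState, handle_request, and guarded_db_call are invented names, and the sleep function is injected so the delay can be observed without waiting.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FaultState:
    """Faults currently enabled for one service (hypothetical structure)."""
    active: set = field(default_factory=set)


def handle_request(state, handler, sleep, rng=random):
    """Apply service_crash and high_latency before calling the real handler."""
    if "service_crash" in state.active:
        return 503, {"error": "service unavailable (chaos)"}
    if "high_latency" in state.active:
        sleep(rng.uniform(2.0, 5.0))  # the 2 to 5 second delay from the list above
    return 200, handler()


def guarded_db_call(state, operation):
    """Fail database work with a connection error while db_failure is active."""
    if "db_failure" in state.active:
        raise ConnectionError("simulated database connection failure")
    return operation()
```

In a running service you would pass time.sleep as the sleep argument; the added delay then shows up as extra duration on that service's SERVER span.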

Pre-built scenarios:

Cascading failure: Claims latency → Policy latency → Claims DB failure → Investment crash → Notification crash (staged over 8 seconds)
System-wide latency: All services simultaneously at high_latency
DB blackout: All services simultaneously at db_failure
Memory pressure: All services simultaneously at memory_spike

Routes:

GET  /chaos/status               → master state + live service health
POST /chaos/toggle               → enable/disable single fault on service(s)
POST /chaos/reset                → clear all faults
POST /chaos/scenario/cascading   → trigger staged multi-failure
GET  /                           → embedded HTML dashboard

Why it matters for OTel: This is where you prove that instrumentation actually works. Inject high_latency on the Policy Service and you should see the policy lookup child span under the Claims Service span grow to 2–5s. Inject db_failure and you should see ERROR status on MongoDB spans. If you can’t see these effects in your traces, your instrumentation is incomplete.

Request Lifecycle: Submit a Claim

Following a single request from browser to database gives you the full trace topology:

1. Frontend (React)
   └─► POST /api/claims
       traceparent: (injected by OTel Web SDK into fetch request)

2. API Gateway (Node.js)
   └─► receives request, creates SERVER span
       └─► forward /claims (CLIENT span)
           └─► POST http://claims-service:3001/claims
               traceparent: injected via propagation.inject()

3. Claims Service (Python)
   └─► receives request, creates SERVER span (parent = API Gateway CLIENT span)
       └─► submit_claim (manual span)
           ├─► GET /policy/{customerId}/coverage (httpx CLIENT span)
           │   └─► Policy Service (Java)
           │       └─► receives request, creates SERVER span
           │           └─► get_coverage (manual span)
           │               └─► MongoDB query (auto-instrumented)
           │
           ├─► MongoDB insert (PyMongo auto-instrumented span)
           │
           └─► POST /notify (httpx CLIENT span, non-blocking)
               └─► Notification Service (Python)
                   └─► send_notification (manual span)
                       └─► MongoDB insert (auto-instrumented)

Result: ONE trace_id, ~12 spans across 4 services and 3 languages

In Grafana Tempo, this renders as a single waterfall. The longest bar is usually the Policy Service lookup — the synchronous dependency that gates approval. This is exactly what you’d want to see in production: the critical path made visible.
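The waterfall is just a tree built from parent span IDs. The sketch below shows that reconstruction on a simplified, hypothetical subset of the spans in this lifecycle; the short IDs and the helper names are assumptions, and the MongoDB spans are omitted to keep it brief.

```python
from collections import defaultdict

# Hypothetical span records for one claim submission; IDs shortened for readability.
SPANS = [
    {"id": "s1", "parent": None, "service": "api-gateway", "name": "POST /api/claims"},
    {"id": "s2", "parent": "s1", "service": "api-gateway", "name": "forward /claims"},
    {"id": "s3", "parent": "s2", "service": "claims-service", "name": "POST /claims"},
    {"id": "s4", "parent": "s3", "service": "claims-service", "name": "submit_claim"},
    {"id": "s5", "parent": "s4", "service": "claims-service", "name": "GET /policy/{customerId}/coverage"},
    {"id": "s6", "parent": "s5", "service": "policy-service", "name": "get_coverage"},
    {"id": "s7", "parent": "s4", "service": "claims-service", "name": "POST /notify"},
    {"id": "s8", "parent": "s7", "service": "notification-service", "name": "send_notification"},
]


def render_waterfall(spans):
    """Return indented lines, one per span, following parent links from the root."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent"]].append(span)
    lines = []

    def walk(parent_id, depth):
        for span in children[parent_id]:
            lines.append("  " * depth + f"{span['service']}: {span['name']}")
            walk(span["id"], depth + 1)

    walk(None, 0)
    return lines


def count_roots(spans):
    """A healthy trace has one root; more than one means propagation broke somewhere."""
    return sum(1 for span in spans if span["parent"] is None)
```

Printing "\n".join(render_waterfall(SPANS)) gives a text version of the waterfall. Setting the parent of s3 to None simulates a lost traceparent header: count_roots then returns 2, and in a real backend the second root would also carry a different trace ID.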

OTel Instrumentation Across the Stack

What’s Already Instrumented

Every service ships with working OTel instrumentation out of the box. Here’s the pattern each language uses:

Node.js (API Gateway, Investment, Chaos Controller):

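// tracing.js: loaded before app code via the --require flag
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

// Base OTLP/HTTP endpoint; the localhost fallback is an assumption for local runs
const OTLP_ENDPOINT = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318';

const sdk = new NodeSDK({
  // OTLP/HTTP traces are posted to <endpoint>/v1/traces
  traceExporter: new OTLPTraceExporter({ url: `${OTLP_ENDPOINT}/v1/traces` }),
  // Registers every auto-instrumentation package bundled in the meta-package
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();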

Python (Claims, Notification):

# instrumentation.py: imported first in main.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.pymongo import PymongoInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

# Assumption: service.name comes from OTEL_SERVICE_NAME, which Resource.create() reads
resource = Resource.create()

provider = TracerProvider(resource=resource)
# Exporter arguments are elided here; spans are batched before export
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(...)))
trace.set_tracer_provider(provider)

# Patch the libraries the service uses: FastAPI, PyMongo, httpx
FastAPIInstrumentor().instrument()
PymongoInstrumentor().instrument()
HTTPXClientInstrumentor().instrument()

Java (Policy):

# In Dockerfile — no code changes required
CMD ["java", "-javaagent:/app/opentelemetry-javaagent.jar", "-jar", "app.jar"]

# Environment variables configure the agent
OTEL_SERVICE_NAME=insurewatch-policy-service
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318

What Manual Instrumentation Adds

Auto-instrumentation tells you what (HTTP routes, DB queries, framework internals). Manual instrumentation tells you why (business context).

Compare these two spans for the same claim submission:

Auto-instrumented only:

POST /claims — 145ms — OK
  └─ mongodb.insert — 12ms

With manual instrumentation:

POST /claims — 145ms — OK
  └─ submit_claim — 140ms
     ├─ claim.type = "medical"
     ├─ claim.amount = 1200.00
     ├─ claim.status = "pending"    ← amount exceeds the $1,000 auto-approval threshold
     ├─ policy.valid = true
     ├─ policy.coverage_limit = 5000.00
     └─ mongodb.insert — 12ms

The second span answers questions the first can’t: Why is this claim pending? What was the coverage limit? How many claims are auto-approved vs manually reviewed? This is what you add in Lab 2.

Local Lab Setup

All labs run on a single machine using Docker Compose. You need:

Docker Desktop (Mac/Windows) or Docker Engine + Compose plugin (Linux)
Git
4 GB RAM minimum (8 GB recommended — Java build is hungry)
Nothing else — no cloud accounts required

Clone the Monorepo

All seven services live in a single repository. Each lab is a branch:

git clone https://github.com/storl0rd/insurewatch.git
cd insurewatch

Branch	Lab
`main`	Fully working reference
`lab/1-propagation`	Lab 1 — broken context propagation
`lab/2-instrumentation`	Lab 2 — missing instrumentation
`lab/3-collector`	Lab 3 — incomplete collector config
`lab/4-chaos`	Lab 4 — all three problems combined

Start the Stack

Switch to the lab branch, then bring everything up:

git checkout lab/1-propagation   # or whichever lab you're doing
docker compose up --build

First run takes 3–5 minutes while Docker builds the Java service and pulls images. Subsequent runs are fast.

URL	What
`http://localhost:5173`	InsureWatch UI
`http://localhost:3100`	Grafana (no login required)
`http://localhost:3000`	API Gateway (direct)

Verify the Reference Stack

Before starting a lab, check that main is fully healthy:

git checkout main
docker compose up --build

Then submit a test claim:

curl -s -X POST http://localhost:3000/api/claims \
  -H "Content-Type: application/json" \
  -d '{
    "customer_id": "CUST001",
    "policy_number": "POL-001",
    "claim_type": "medical",
    "amount": 500,
    "description": "GP visit",
    "incident_date": "2026-03-01"
  }' | python3 -m json.tool

Open Grafana → Explore → Tempo. You should see a trace spanning api-gateway → claims-service → policy-service → notification-service. If you see a complete waterfall, you’re ready for the labs.

Beginner Path vs Intermediate Path

This guide is written for both. Here’s how to read the labs depending on your starting point:

If you’re new to OTel

The auto-instrumentation is already in place. Your job in the labs is to:

Read the existing telemetry — learn to navigate Grafana Tempo, understand the waterfall view, read span attributes
Recognize when something is missing (Lab 1: why are traces fragmented?)
Add manual instrumentation to an existing instrumented service (Lab 2: add spans to the quote function)
Configure a Collector pipeline from a skeleton (Lab 3)

You don’t need to understand every line of SDK initialization code before you start. The pattern becomes clear as you work through it.

If you have observability experience

The interesting questions are:

Why does the Python service use HTTPXClientInstrumentor but not RequestsInstrumentor? (Because the service uses httpx, not requests — instrumentation is library-specific)
What happens to the trace when the Notification Service call fails? (The Claims span continues — it’s non-fatal. But is the notification span correctly marked ERROR?)
How does the Java agent know to extract traceparent from incoming Spring MVC requests? (It instruments the DispatcherServlet — you don’t configure this)
What does the BatchSpanProcessor do if the Collector is unreachable? (It buffers spans up to the queue limit, then drops them — you’ll see this in Lab 3)

Look for these behaviours in the traces. The answers are in the data.
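The buffer-then-drop behaviour can be modelled in a few lines. This is a simplified sketch, not the SDK's BatchSpanProcessor: the class name is hypothetical, and the policy of discarding newly finished spans when the queue is full is an assumption made for the example (real SDKs differ in which spans they discard).

```python
from collections import deque


class BoundedSpanBuffer:
    """Toy model of a bounded export queue that counts what it drops."""

    def __init__(self, max_queue_size):
        self.max_queue_size = max_queue_size
        self.queue = deque()
        self.dropped = 0

    def on_end(self, span):
        """Queue a finished span, or count it as dropped when the queue is full."""
        if len(self.queue) >= self.max_queue_size:
            self.dropped += 1
            return False
        self.queue.append(span)
        return True

    def flush(self, export):
        """Try to export the queue; keep the spans buffered if the export fails."""
        batch = list(self.queue)
        if batch and export(batch):
            self.queue.clear()
            return len(batch)
        return 0
```

While the export callback keeps failing (Collector unreachable), the queue fills and the dropped counter climbs; once the Collector returns, only the spans still in the queue are delivered.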

Common Questions

Why a monorepo when real teams use separate repos? Real microservices teams often have separate repos per service with separate CI/CD pipelines. InsureWatch keeps all services in one repository so that each lab is a single clone and a branch checkout, but OTel configuration is still per-service: each service owns its SDK initialization, its resource attributes, and its exporter config.

Why not Kubernetes for the labs? Kubernetes adds operational complexity that is orthogonal to learning OTel instrumentation. The concepts (DaemonSet agents, resource detection, k8s.* semantic conventions) are covered in Module 6. The labs are designed to be runnable on a laptop with Docker Compose — minimum friction, maximum learning.

Why MongoDB for all services? Consistency: one database technology means one fewer thing to configure. It also means PymongoInstrumentor and Mongoose auto-instrumentation both produce spans with db.* attributes that you can compare directly.

Can I use a different observability backend instead of grafana/otel-lgtm? Yes. InsureWatch exports standard OTLP. Any OTLP-compatible backend works. grafana/otel-lgtm is the default because it requires zero signup and zero configuration. Lab 3 explicitly demonstrates switching backends.