What InsureWatch Is
InsureWatch is a fictional insurance operations platform built specifically as an OTel training environment. It does three things well:
- It’s polyglot. Six backend services, three languages (Node.js, Python, Java), four frameworks (Express, FastAPI, Spring Boot, React). This is not an accident: real production systems look like this. Learning OTel on a single-language app gives you half the picture.
- It has real business logic. Policies, claims, investments, notifications: each service has a distinct domain with actual data models and inter-service dependencies. When you instrument it, you’re adding telemetry to something that means something, not a toy counter.
- It has a chaos controller. A dedicated service injects live failure conditions into the running stack. Latency, database failures, memory pressure, service crashes: you can trigger them on demand and observe exactly what your instrumentation reveals (and doesn’t).
The purpose is not to build InsureWatch. It’s to operate it observably — to understand what’s happening inside a distributed system from the telemetry it produces.
Why Polyglot Matters
If you’ve instrumented one language, you understand OTel’s concepts. Polyglot teaches you OTel’s architecture.
The OTel project ships separate SDKs for each language — opentelemetry-sdk for Python, @opentelemetry/sdk-node for Node.js, the Java agent for JVM services. Each SDK has different initialization patterns, different auto-instrumentation packages, different ways of attaching to frameworks. But every SDK produces the same wire format (OTLP) and the same data model (spans, metrics, logs with shared context).
InsureWatch lets you see this in one running system:
- A Python service (Claims) using the OpenTelemetry Python SDK with `FastAPIInstrumentor` and `PymongoInstrumentor`
- A Node.js service (API Gateway) using `@opentelemetry/sdk-node` with `getNodeAutoInstrumentations()`
- A Java service (Policy) using the OpenTelemetry Java Agent, a JVM agent that instruments without any code changes
All three produce traces. All three attach to the same trace ID when a request flows through them. The fact that one is Python, one is JavaScript, and one is Java is invisible to Grafana — it shows you one unified waterfall.
That is the vendor-neutral, language-neutral promise of OTel. InsureWatch demonstrates it in practice.
Service Architecture
┌─────────────────────────────────────────────────────────────────┐
│                      InsureWatch Platform                       │
│                                                                 │
│  ┌─────────────┐    ┌──────────────────────────────────────┐    │
│  │  Frontend   │    │             API Gateway              │    │
│  │ React/Vite  │───►│          Node.js / Express           │    │
│  │ Port: 5173  │    │              Port: 3000              │    │
│  └─────────────┘    └──────┬──────────┬──────────┬─────────┘    │
│                            │          │          │              │
│              ┌─────────────┘    ┌─────┘     ┌────┘              │
│              ▼                  ▼           ▼                   │
│ ┌──────────────────┐  ┌──────────────┐  ┌──────────────────┐    │
│ │  Claims Service  │  │Policy Service│  │Investment Service│    │
│ │  Python/FastAPI  │  │ Java/Spring  │  │ Node.js/Express  │    │
│ │    Port: 3001    │  │  Port: 8080  │  │    Port: 3002    │    │
│ └────────┬─────────┘  └──────────────┘  └──────────────────┘    │
│          │ validates        ▲                                   │
│          ├──────────────────┘                                   │
│          │ notifies                                             │
│          ▼                                                      │
│ ┌──────────────────────┐    ┌──────────────────────┐            │
│ │ Notification Service │    │   Chaos Controller   │            │
│ │    Python/FastAPI    │    │   Node.js/Express    │            │
│ │      Port: 3003      │    │      Port: 3004      │            │
│ └──────────────────────┘    └──────────────────────┘            │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                          MongoDB                          │  │
│  │     claims, policies, investments, notifications DBs      │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                 │ OTLP
                                 ▼
                      ┌──────────────────────┐
                      │    OTel Collector    │
                      │   (or lgtm image)    │
                      │      gRPC: 4317      │
                      │      HTTP: 4318      │
                      └──────────┬───────────┘
                                 │
                                 ▼
                      ┌──────────────────────┐
                      │   Grafana UI :3000   │
                      │    Tempo (traces)    │
                      │ Prometheus (metrics) │
                      │     Loki (logs)      │
                      └──────────────────────┘
Service Reference
API Gateway — Node.js / Express
Port: 3000 | Language: Node.js 20 | Framework: Express 4
The single entry point for all external traffic. Stateless — no database. Routes requests to downstream services and injects trace context into every outgoing HTTP call.
Routes it exposes:
POST /api/claims → Claims Service
GET /api/claims/:id → Claims Service
GET /api/claims → Claims Service
GET /api/policy/:customerId → Policy Service
GET /api/investments/:customerId → Investment Service
GET /api/chaos/status → Chaos Controller
POST /api/chaos/toggle → Chaos Controller
GET /health
Why it matters for OTel: This is where you see context propagation in action. Every outbound axios call injects the traceparent header — connecting the gateway’s span to the downstream service’s span. Remove or misconfigure the propagator here and every downstream trace becomes a disconnected root span.
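To make the header mechanics concrete, here is a minimal sketch in plain Python of what a `traceparent` value carries and how a downstream service uses it. It is not the gateway's code (the OTel propagator does this for you), and the helper names `build_traceparent`, `parse_traceparent`, and `new_span_id` are hypothetical. It only assumes the W3C Trace Context version 00 layout: version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags.

```python
import re
import secrets

# version 00, 32-hex trace ID, 16-hex parent span ID, 2-hex trace flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a traceparent header value for an outgoing request."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent value into its fields; raise ValueError if malformed."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_span_id, flags = match.groups()
    # The lowest bit of the flags byte is the "sampled" flag
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": int(flags, 16) & 1 == 1,
    }


def new_span_id() -> str:
    """8 random bytes rendered as 16 hex characters."""
    return secrets.token_hex(8)


# Gateway side: the outgoing call carries the trace ID and the CLIENT span ID.
gateway_trace_id = secrets.token_hex(16)
gateway_span_id = new_span_id()
header = build_traceparent(gateway_trace_id, gateway_span_id)

# Downstream side: the SERVER span reuses the trace ID and records the caller as parent.
incoming = parse_traceparent(header)
server_span = {
    "trace_id": incoming["trace_id"],
    "span_id": new_span_id(),
    "parent_span_id": incoming["parent_span_id"],
}
```

If the header never arrives, the downstream service has nothing to parse and starts a fresh trace ID, which is how a disconnected root span comes about.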
Instrumentation:
- Auto: HTTP server spans (Express), outbound HTTP spans (axios)
- Manual: `forward <path>` span wrapping each downstream call
- Metrics: `gateway.requests.total`, `gateway.errors.total`, `gateway.request.duration`
- Logs: Winston JSON with `traceId`, `spanId`, `traceFlags` injected into every log record
Claims Service — Python / FastAPI
Port: 3001 | Language: Python 3.11 | Framework: FastAPI + Motor (async MongoDB)
The most complex service. Receives claim submissions, validates them against the Policy Service, auto-approves claims under $1,000, stores to MongoDB, and fires a notification asynchronously.
Routes:
POST /claims → submit claim
GET /claims/:claim_id → retrieve claim
GET /claims → list claims (filter by customer_id)
GET /health
Key business logic:
- Policy lookup before claim acceptance (`GET /policy/{customerId}/coverage`)
- Auto-approval threshold: claims < $1,000 approved immediately; larger claims set to “pending”
- Async notification fire-and-forget to Notification Service (3s timeout, non-fatal on failure)
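The approval rule is small enough to sketch. The following is an illustration under stated assumptions, not the service's actual code: the function names are hypothetical, and the attribute keys are the ones listed for the Claims manual spans. It shows why the decision is invisible to auto-instrumentation: nothing in an HTTP or MongoDB span records which branch was taken unless you attach it yourself.

```python
AUTO_APPROVAL_THRESHOLD = 1000.00  # dollars, per the rule described above


def decide_claim_status(amount: float) -> str:
    """Return "approved" for claims under the threshold, otherwise "pending"."""
    return "approved" if amount < AUTO_APPROVAL_THRESHOLD else "pending"


def claim_span_attributes(claim_type: str, amount: float, coverage_limit: float) -> dict:
    """Business attributes a manual span could carry for one submission."""
    return {
        "claim.type": claim_type,
        "claim.amount": amount,
        "claim.status": decide_claim_status(amount),
        "policy.coverage_limit": coverage_limit,
    }


small = claim_span_attributes("medical", 500.00, 5000.00)
large = claim_span_attributes("medical", 1200.00, 5000.00)
```

Here `small["claim.status"]` is `"approved"` and `large["claim.status"]` is `"pending"`, matching the two outcomes of the threshold.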
Why it matters for OTel: Claims is the hub of the critical path. It calls two downstream services, writes to MongoDB, and has business logic (the approval threshold) that has no visibility without manual instrumentation. It also demonstrates the Python SDK initialization pattern — a dedicated instrumentation.py module loaded before the FastAPI app.
Instrumentation:
- Module: `instrumentation.py`, where TracerProvider, MeterProvider, and LoggerProvider are initialized before the app
- Auto: FastAPI HTTP spans, outbound httpx spans, MongoDB operations (PyMongo)
- Manual: `submit_claim`, `get_claim`, `list_claims` spans with business attributes (`claim.type`, `claim.amount`, `claim.status`, `policy.coverage_limit`)
- Metrics: `claims.submitted.total`, `claims.approved.total`, `claims.rejected.total`, `claims.processing.duration`
- Logs: Python logging bridge; every log record includes `traceId` and `spanId` in the format string
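As a rough illustration of the log correlation idea, the sketch below uses only the standard `logging` module to stamp `traceId` and `spanId` onto every record. It is not the OTel logging bridge: the `current_ids` context variable is a hypothetical stand-in for the active span context, and the IDs are made-up sample values.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the active span context (real code would ask the OTel API)
current_ids = contextvars.ContextVar("current_ids", default=("0" * 32, "0" * 16))


class TraceContextFilter(logging.Filter):
    """Copy the current trace and span IDs onto each log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.traceId, record.spanId = current_ids.get()
        return True  # never suppress the record, only enrich it


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s traceId=%(traceId)s spanId=%(spanId)s %(message)s")
)
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("claims-sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the sketch's output out of the root logger

# Pretend a span is active, then log inside it
current_ids.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.info("claim stored")
output = stream.getvalue()
```

With the IDs in every line, a log search by `traceId` lands on the same request you are looking at in the trace view.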
Sample customers: CUST001 – CUST005 (seeded on startup)
Policy Service — Java / Spring Boot
Port: 8080 | Language: Java 21 | Framework: Spring Boot 3.2 + Spring Data MongoDB
Stores and serves insurance policy data. Queried by Claims Service for coverage validation and by the API Gateway for direct policy display.
Routes:
GET /policy/:customerId → all policies for customer
GET /policy/:customerId/coverage → active policy + coverage limits/deductibles
GET /health
Coverage structure: Each active policy has medical, emergency, and dental sub-coverages with limits and deductibles. The Claims Service uses coverage.limit from this endpoint to decide auto-approval eligibility.
Why it matters for OTel: Policy Service demonstrates the Java instrumentation approach — the OpenTelemetry Java Agent. Unlike Python and Node.js where you write SDK initialization code, the Java agent instruments the JVM at startup via -javaagent:opentelemetry-javaagent.jar. Spring, MongoDB, HTTP clients — all auto-instrumented without a line of OTel code. Manual spans are added via GlobalOpenTelemetry.getTracer() for the business operations.
Instrumentation:
- Agent: `opentelemetry-javaagent.jar` (auto-instrumented: Spring MVC, MongoDB, HTTP clients)
- Manual: `get_policy`, `get_coverage` spans via `GlobalOpenTelemetry.getTracer()`
- Metrics: `policy.lookups.total`, `policy.errors.total`
- Logs: SLF4J (default Spring Boot logging; the Java agent bridges this to OTel)
Sample customers: 5 customers with different policy types (health, auto, property, life)
Investment Service — Node.js / Express
Port: 3002 | Language: Node.js 20 | Framework: Express + Mongoose (MongoDB)
Portfolio tracking service. Stores holdings (AAPL, MSFT, GOOGL, BRK.B) per customer and simulates live price volatility (±0.5%) on each fetch.
Routes:
GET /investments/:customerId → full portfolio (holdings, total value, change %)
GET /health
Why it matters for OTel: Demonstrates Node.js SDK setup with getNodeAutoInstrumentations() — a convenience function that registers all available auto-instrumentation packages at once. Also shows a real metric design choice: investment.portfolio.value as a histogram, appropriate for measuring a distribution of values across customers.
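To see why a histogram suits this metric, here is a small sketch of explicit-bucket aggregation in plain Python. It is a simplification of what a metrics SDK does internally, and the bucket boundaries and sample portfolio values are invented for illustration.

```python
from bisect import bisect_left


def bucket_counts(values, boundaries):
    """Count values per bucket; bucket i holds values in (boundaries[i-1], boundaries[i]]."""
    counts = [0] * (len(boundaries) + 1)  # one extra bucket for values above the last boundary
    for value in values:
        counts[bisect_left(boundaries, value)] += 1
    return counts


# Hypothetical boundaries (dollars) and portfolio values from four lookups
boundaries = [10_000, 50_000, 100_000]
portfolio_values = [8_200, 42_000, 47_500, 250_000]
counts = bucket_counts(portfolio_values, boundaries)
```

A counter would only say that four lookups happened. The bucket counts show that most portfolios sit in the middle range with one large outlier, which is the distribution the text refers to.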
Instrumentation:
- Auto: HTTP spans, Mongoose/MongoDB spans (`getNodeAutoInstrumentations()`)
- Manual: `get_investments` span with `customer.id` and `portfolio.value`
- Metrics: `investment.portfolio.lookups`, `investment.portfolio.value`
Notification Service — Python / FastAPI
Port: 3003 | Language: Python 3.11 | Framework: FastAPI + Motor (async MongoDB)
Receives event-driven notifications from the Claims Service and stores them to MongoDB. Renders message templates by event type. No actual email/SMS — simulated only.
Routes:
POST /notify → receive and store notification
GET /notifications/:customerId → list notifications (20 most recent)
GET /health
Event types: claim_submitted, claim_approved, claim_rejected, policy_renewed
Why it matters for OTel: The async, fire-and-forget call from Claims Service to Notification Service is a common propagation pattern. Claims injects context into the outbound POST; Notification Service extracts it. The span appears as a child of the Claims span — demonstrating that even non-critical async paths can carry trace context correctly.
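The fire-and-forget shape can be sketched with `asyncio` alone. This is an assumption-laden illustration rather than the Claims Service code: `notify_safely`, `submit_claim`, and `broken_sender` are hypothetical names, the claim record is made up, and the HTTP call is replaced by a coroutine that fails on purpose.

```python
import asyncio

NOTIFY_TIMEOUT_SECONDS = 3.0  # mirrors the 3s timeout described for Claims


async def notify_safely(send, event: dict, failures: list) -> None:
    """Await the sender with a timeout; record any failure instead of raising."""
    try:
        await asyncio.wait_for(send(event), timeout=NOTIFY_TIMEOUT_SECONDS)
    except Exception as exc:  # timeouts and connection errors are non-fatal here
        failures.append(repr(exc))


async def submit_claim(send, failures: list) -> tuple:
    claim = {"claim_id": "CLM-1", "status": "approved"}  # hypothetical record
    # Schedule the notification without awaiting it: the response does not wait on it
    task = asyncio.create_task(notify_safely(send, {"type": "claim_approved"}, failures))
    return claim, task


async def broken_sender(event: dict) -> None:
    raise ConnectionError("notification service unreachable")


async def main() -> tuple:
    failures: list = []
    claim, task = await submit_claim(broken_sender, failures)
    await task  # only so the sketch can inspect the outcome before exiting
    return claim, failures


claim, failures = asyncio.run(main())
```

The claim comes back approved even though the sender raised, and the failure is recorded on the side. In a trace, that side record is what you would expect to find as an ERROR status on the notification span.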
Instrumentation:
- Auto: FastAPI HTTP spans
- Manual: `send_notification`, `get_notifications` spans
- Metrics: `notifications.sent.total`, `notifications.failed.total`
- Logs: Python logging bridge with trace context
Chaos Controller — Node.js / Express
Port: 3004 | Language: Node.js 20 | Framework: Express
The fault injection hub. Broadcasts chaos commands to all downstream services, triggering simulated failure conditions. Includes an embedded web UI at http://localhost:3004.
Fault types per service:
- `service_crash`: service returns 503 for all requests
- `high_latency`: adds 2–5 second delay to every response
- `db_failure`: database operations fail with connection error
- `memory_spike`: allocates ~50MB heap (unbounded growth)
- `cpu_spike`: busy loop consuming CPU
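As a rough model of the first two fault types, the sketch below maps an active fault to the response a caller would observe. It is a hypothetical illustration in Python, not the controller's Node.js implementation: `simulate_fault` is an invented name, and the remaining faults are left out because their observable effect depends on service internals.

```python
import random


def simulate_fault(fault, rng=random):
    """Return (http_status, delay_seconds) for one request while `fault` is active."""
    if fault == "service_crash":
        return 503, 0.0  # every request is rejected immediately
    if fault == "high_latency":
        return 200, rng.uniform(2.0, 5.0)  # request succeeds after a 2-5 second delay
    return 200, 0.0  # no fault, or a fault this sketch does not model


crash_status, crash_delay = simulate_fault("service_crash")
slow_status, slow_delay = simulate_fault("high_latency")
```

In trace terms, the first case should surface as an error status on the caller's CLIENT span and the second as a span whose duration jumps by the injected delay.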
Pre-built scenarios:
- Cascading failure: Claims latency → Policy latency → Claims DB failure → Investment crash → Notification crash (staged over 8 seconds)
- System-wide latency: All services simultaneously at high_latency
- DB blackout: All services simultaneously at db_failure
- Memory pressure: All services simultaneously at memory_spike
Routes:
GET /chaos/status → master state + live service health
POST /chaos/toggle → enable/disable single fault on service(s)
POST /chaos/reset → clear all faults
POST /chaos/scenario/cascading → trigger staged multi-failure
GET / → embedded HTML dashboard
Why it matters for OTel: This is where you prove that instrumentation actually works. Inject high_latency on the Policy Service and you should see the policy lookup child span under the Claims Service span grow to 2–5s. Inject db_failure and you should see ERROR status on MongoDB spans. If you can’t see these effects in your traces, your instrumentation is incomplete.
Request Lifecycle: Submit a Claim
Following a single request from browser to database gives you the full trace topology:
1. Frontend (React)
   └─► POST /api/claims
       traceparent: (injected by OTel Web SDK into fetch request)
2. API Gateway (Node.js)
   └─► receives request, creates SERVER span
       └─► forward /claims (CLIENT span)
           └─► POST http://claims-service:3001/claims
               traceparent: injected via propagation.inject()
3. Claims Service (Python)
   └─► receives request, creates SERVER span (parent = API Gateway CLIENT span)
       └─► submit_claim (manual span)
           ├─► GET /policy/{customerId}/coverage (httpx CLIENT span)
           │   └─► Policy Service (Java)
           │       └─► receives request, creates SERVER span
           │           └─► get_coverage (manual span)
           │               └─► MongoDB query (auto-instrumented)
           │
           ├─► MongoDB insert (PyMongo auto-instrumented span)
           │
           └─► POST /notify (httpx CLIENT span, non-blocking)
               └─► Notification Service (Python)
                   └─► send_notification (manual span)
                       └─► MongoDB insert (auto-instrumented)
Result: ONE trace_id, roughly a dozen spans across 4 services and 3 languages
In Grafana Tempo, this renders as a single waterfall. The longest bar is usually the Policy Service lookup — the synchronous dependency that gates approval. This is exactly what you’d want to see in production: the critical path made visible.
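One way to reason about whether a trace is whole is to count its root spans. The sketch below is a simplified model with hypothetical span IDs and a hypothetical `roots_per_trace` helper, not a query against Tempo. A connected request has one trace ID with one root, while broken propagation shows up as several trace IDs, each with its own root.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str


def roots_per_trace(spans):
    """Map each trace_id to the names of spans whose parent is not in that trace."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span.trace_id].append(span)
    result = {}
    for trace_id, group in by_trace.items():
        known_ids = {span.span_id for span in group}
        result[trace_id] = [s.name for s in group if s.parent_id not in known_ids]
    return result


# Healthy propagation: the Claims SERVER span is a child of the gateway's CLIENT span.
connected = [
    Span("t1", "a", None, "POST /api/claims"),
    Span("t1", "b", "a", "forward /claims"),
    Span("t1", "c", "b", "POST /claims"),
    Span("t1", "d", "c", "submit_claim"),
]

# Broken propagation: Claims never saw traceparent, so it started its own trace.
fragmented = [
    Span("t1", "a", None, "POST /api/claims"),
    Span("t1", "b", "a", "forward /claims"),
    Span("t2", "c", None, "POST /claims"),
    Span("t2", "d", "c", "submit_claim"),
]

connected_roots = roots_per_trace(connected)
fragmented_roots = roots_per_trace(fragmented)
```

The connected case yields a single trace with a single root. The fragmented case yields two traces, and the second root is the Claims SERVER span that should have been a child.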
OTel Instrumentation Across the Stack
What’s Already Instrumented
Every service ships with working OTel instrumentation out of the box. Here’s the pattern each language uses:
Node.js (API Gateway, Investment, Chaos Controller):
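// tracing.js: loaded before app code via the --require flag
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

// Base OTLP/HTTP endpoint; the localhost fallback is an assumption for local runs
const OTLP_ENDPOINT = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318';

const sdk = new NodeSDK({
  // OTLP/HTTP trace exports go to the /v1/traces path
  traceExporter: new OTLPTraceExporter({ url: `${OTLP_ENDPOINT}/v1/traces` }),
  // Registers all bundled auto-instrumentation packages at once
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();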
Python (Claims, Notification):
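# instrumentation.py: imported first in main.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.pymongo import PymongoInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

# The service.name value is an assumption, mirroring the Policy Service naming
resource = Resource.create({"service.name": "insurewatch-claims-service"})
provider = TracerProvider(resource=resource)
# Exporter arguments (endpoint and so on) are elided here
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(...)))
trace.set_tracer_provider(provider)

# Patch FastAPI, PyMongo, and httpx so their calls produce spans automatically
FastAPIInstrumentor().instrument()
PymongoInstrumentor().instrument()
HTTPXClientInstrumentor().instrument()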
Java (Policy):
# In Dockerfile — no code changes required
CMD ["java", "-javaagent:/app/opentelemetry-javaagent.jar", "-jar", "app.jar"]
# Environment variables configure the agent
OTEL_SERVICE_NAME=insurewatch-policy-service
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318
What Manual Instrumentation Adds
Auto-instrumentation tells you what (HTTP routes, DB queries, framework internals). Manual instrumentation tells you why (business context).
Compare these two spans for the same claim submission:
Auto-instrumented only:
POST /claims - 145ms - OK
└─ mongodb.insert - 12ms
With manual instrumentation:
POST /claims - 145ms - OK
└─ submit_claim - 140ms
   ├─ claim.type = "medical"
   ├─ claim.amount = 1200.00
   ├─ claim.status = "pending"   ← amount is over the auto-approval threshold
   ├─ policy.valid = true
   ├─ policy.coverage_limit = 5000.00
   └─ mongodb.insert - 12ms
The second span answers questions the first can’t: Why is this claim pending? What was the coverage limit? How many claims are auto-approved vs manually reviewed? This is what you add in Lab 2.
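Once spans carry business attributes, questions like "how many claims were auto-approved?" become simple aggregations. The sketch below runs that aggregation over a few invented attribute dictionaries in plain Python. In practice you would ask the tracing backend the same question, and `status_breakdown` is a hypothetical helper.

```python
from collections import Counter


def status_breakdown(span_attributes):
    """Count spans per claim.status, ignoring spans that lack the attribute."""
    return Counter(
        attrs["claim.status"] for attrs in span_attributes if "claim.status" in attrs
    )


# Hypothetical attribute sets: three submit_claim spans and one auto-instrumented span
exported = [
    {"claim.type": "medical", "claim.amount": 500.00, "claim.status": "approved"},
    {"claim.type": "dental", "claim.amount": 120.00, "claim.status": "approved"},
    {"claim.type": "medical", "claim.amount": 1200.00, "claim.status": "pending"},
    {"http.method": "POST"},  # no business context, so it cannot answer the question
]

breakdown = status_breakdown(exported)
```

The last entry is the point of the comparison: a span without `claim.status` contributes nothing to the answer, however accurate its timing is.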
Local Lab Setup
All labs run on a single machine using Docker Compose. You need:
- Docker Desktop (Mac/Windows) or Docker Engine + Compose plugin (Linux)
- Git
- 4 GB RAM minimum (8 GB recommended — Java build is hungry)
- Nothing else — no cloud accounts required
Clone the Monorepo
All seven services live in a single repository. Each lab is a branch:
git clone https://github.com/storl0rd/insurewatch.git
cd insurewatch
| Branch | Lab |
|---|---|
| `main` | Fully working reference |
| `lab/1-propagation` | Lab 1: broken context propagation |
| `lab/2-instrumentation` | Lab 2: missing instrumentation |
| `lab/3-collector` | Lab 3: incomplete collector config |
| `lab/4-chaos` | Lab 4: all three problems combined |
Start the Stack
Switch to the lab branch, then bring everything up:
git checkout lab/1-propagation # or whichever lab you're doing
docker compose up --build
First run takes 3–5 minutes while Docker builds the Java service and pulls images. Subsequent runs are fast.
| URL | What |
|---|---|
| http://localhost:5173 | InsureWatch UI |
| http://localhost:3100 | Grafana (no login required) |
| http://localhost:3000 | API Gateway (direct) |
Verify the Reference Stack
Before starting a lab, check that main is fully healthy:
git checkout main
docker compose up --build
Then submit a test claim:
curl -s -X POST http://localhost:3000/api/claims \
-H "Content-Type: application/json" \
-d '{
"customer_id": "CUST001",
"policy_number": "POL-001",
"claim_type": "medical",
"amount": 500,
"description": "GP visit",
"incident_date": "2026-03-01"
}' | python3 -m json.tool
Open Grafana → Explore → Tempo. You should see a trace spanning api-gateway → claims-service → policy-service → notification-service. If you see a complete waterfall, you’re ready for the labs.
Beginner Path vs Intermediate Path
This guide is written for both. Here’s how to read the labs depending on your starting point:
If you’re new to OTel
The auto-instrumentation is already in place. Your job in the labs is to:
- Read the existing telemetry — learn to navigate Grafana Tempo, understand the waterfall view, read span attributes
- Recognize when something is missing (Lab 1: why are traces fragmented?)
- Add manual instrumentation to an existing instrumented service (Lab 2: add business spans and attributes to the claim submission path)
- Configure a Collector pipeline from a skeleton (Lab 3)
You don’t need to understand every line of SDK initialization code before you start. The pattern becomes clear as you work through it.
If you have observability experience
The interesting questions are:
- Why does the Python service use `HTTPXClientInstrumentor` but not `RequestsInstrumentor`? (Because the service uses `httpx`, not `requests`; instrumentation is library-specific.)
- What happens to the trace when the Notification Service call fails? (The Claims span continues because the failure is non-fatal. But is the notification span correctly marked ERROR?)
- How does the Java agent know to extract `traceparent` from incoming Spring MVC requests? (It instruments the `DispatcherServlet`; you don’t configure this.)
- What does the `BatchSpanProcessor` do if the Collector is unreachable? (It buffers spans up to the queue limit, then drops them; you’ll see this in Lab 3.)
Look for these behaviours in the traces. The answers are in the data.
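The buffering question is easier to picture with a toy model. The class below is a deliberately simplified sketch of a bounded span queue, not the SDK's `BatchSpanProcessor`: the real processor also exports in batches on a timer, and which spans it discards when full is an implementation detail. `BoundedSpanBuffer` and the queue size of 3 are hypothetical.

```python
from collections import deque


class BoundedSpanBuffer:
    """Hold finished spans until export; count the ones that do not fit."""

    def __init__(self, max_queue_size: int):
        self.max_queue_size = max_queue_size
        self.queue = deque()
        self.dropped = 0

    def on_end(self, span) -> None:
        """Called when a span finishes; drop it if the queue is already full."""
        if len(self.queue) >= self.max_queue_size:
            self.dropped += 1
            return
        self.queue.append(span)

    def drain(self) -> list:
        """Hand everything queued to an exporter and empty the queue."""
        batch = list(self.queue)
        self.queue.clear()
        return batch


# Collector unreachable: nothing drains the queue while five spans finish
buffer = BoundedSpanBuffer(max_queue_size=3)
for i in range(5):
    buffer.on_end(f"span-{i}")

queued = len(buffer.queue)
dropped = buffer.dropped
```

With a limit of 3 and no draining, three spans stay queued and two are lost. Those two are the gaps you would later see in the trace view.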
Common Questions
Why a monorepo instead of separate repos? Real microservices teams often keep each service in its own repo with its own CI/CD pipeline. The labs use a single repository so that one clone and one Docker Compose command bring up the whole stack. OTel configuration is still per-service: each service owns its SDK initialization, its resource attributes, and its exporter config.
Why not Kubernetes for the labs? Kubernetes adds operational complexity that is orthogonal to learning OTel instrumentation. The concepts (DaemonSet agents, resource detection, k8s.* semantic conventions) are covered in Module 6. The labs are designed to be runnable on a laptop with Docker Compose — minimum friction, maximum learning.
Why MongoDB for all services? Consistency: one database technology means one fewer thing to configure. It also means PymongoInstrumentor and Mongoose auto-instrumentation both produce db.* spans you can compare directly.
Can I use a different observability backend instead of grafana/otel-lgtm? Yes. InsureWatch exports standard OTLP. Any OTLP-compatible backend works. grafana/otel-lgtm is the default because it requires zero signup and zero configuration. Lab 3 explicitly demonstrates switching backends.