A practical lesson on how to distinguish authoritative data from copied, derived, and query-optimized data so duplication can be judged correctly.
Reference data, derived data, and read models are important distinctions because teams often label all duplication as either “bad redundancy” or “completely fine.” Neither reaction is precise enough. In distributed systems, some duplication is necessary and healthy. The real question is what kind of duplication is happening and whether the authoritative boundary is still clear.
The most useful categories are:
- authoritative data: the source of truth
- reference data: copied for lookup or enrichment
- derived data: computed from authoritative facts or events
- read models: projections optimized for query or reporting

These categories help teams discuss duplication without blurring authority.
flowchart TD
A["Authoritative service"] --> B["Reference copies"]
A --> C["Derived calculations"]
A --> D["Read models and projections"]
B --> E["Not authoritative"]
C --> E
D --> E
Reference data is copied because another service needs local context or enrichment but does not own the underlying truth. Common examples include tenant display names, product titles, or currency metadata copied into another system for local use.
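As a rough illustration, a reference copy can carry its owner and its refresh time so that nobody mistakes it for the truth. The sketch below is an assumption-laden example, not a prescribed design: the `ReferenceCopy` class, the `tenant-directory` owner name, and the `Acme Ltd` value are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)  # frozen: the local copy cannot be edited in place
class ReferenceCopy:
    """A local, read-only copy of a value owned by another service."""

    key: str
    value: str
    authoritative_owner: str  # hypothetical name of the owning service
    copied_at: datetime       # when this copy was last refreshed

    def is_stale(self, now: datetime, max_age: timedelta) -> bool:
        # A stale copy should be refreshed from the owner, not patched locally.
        return now - self.copied_at > max_age


tenant_name = ReferenceCopy(
    key="t_17",
    value="Acme Ltd",                        # hypothetical display name
    authoritative_owner="tenant-directory",  # hypothetical owner
    copied_at=datetime(2026, 3, 22, 14, 0, tzinfo=timezone.utc),
)
```

Making the copy immutable and recording `authoritative_owner` next to the value keeps the question "who do I ask to change this?" answerable from the data itself.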
Reference data is usually safe when:
Derived data is computed from authoritative facts. Totals, summary views, risk scores, and analytics metrics often fall into this category. Derived data can be extremely useful, but teams should remember that it is downstream truth, not upstream authority.
If the derivation needs correction, the review question should usually be, “Was the source event or source state wrong?” not “Should we patch the derived table directly?”
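A minimal sketch of that review habit, assuming a hypothetical in-memory `source_orders` store standing in for the authoritative service: the derived total is never edited, only recomputed after the source is corrected. The second order and its amounts are invented for illustration.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical authoritative order state; amounts are strings to avoid float error.
source_orders = {
    "ord_1042": {"placedAt": "2026-03-22T14:12:00Z", "totalAmount": "199.50"},
    "ord_1043": {"placedAt": "2026-03-22T16:40:00Z", "totalAmount": "500.00"},
}


def derive_daily_revenue(orders):
    """Recompute daily revenue totals entirely from authoritative order state."""
    totals = defaultdict(Decimal)  # Decimal() is zero
    for order in orders.values():
        day = order["placedAt"][:10]  # "YYYY-MM-DD" prefix of the ISO timestamp
        totals[day] += Decimal(order["totalAmount"])
    return dict(totals)


before = derive_daily_revenue(source_orders)

# Suppose ord_1043 was recorded incorrectly. The fix happens at the source...
source_orders["ord_1043"]["totalAmount"] = "50.00"

# ...and the derived value is rebuilt, never patched by hand.
after = derive_daily_revenue(source_orders)
```

Because the derived table is a pure function of the source, a rebuild always agrees with the authority, whereas a manual patch would be silently overwritten or would silently diverge.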
Read models are projections built to serve queries well. They are common in event-driven and service-oriented systems because they reduce cross-service query chains and let teams shape query storage for specific use cases. A read model can be local to one service or shared for reporting.
The important warning is that read models should not silently become write paths for canonical state.
The event below is a safe kind of input for a read model.
{
  "event": "OrderPlaced",
  "orderId": "ord_1042",
  "tenantId": "t_17",
  "totalAmount": 199.50,
  "currency": "USD",
  "placedAt": "2026-03-22T14:12:00Z"
}
From this event, a reporting system can build timeline views, revenue summaries, and tenant dashboards. It should not redefine whether the order was validly placed.
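One way to picture such a reporting projection is a small consumer that folds `OrderPlaced` events into a per-tenant revenue summary. This is a sketch under assumptions: the `TenantRevenueProjection` class and its method names are hypothetical, and duplicate delivery is handled by remembering order IDs in memory.

```python
from collections import defaultdict
from decimal import Decimal


class TenantRevenueProjection:
    """Read model: revenue per (tenant, currency), built only from events."""

    def __init__(self):
        self._revenue = defaultdict(Decimal)
        self._seen_orders = set()

    def apply(self, event):
        # The projection consumes facts; it does not decide whether they are valid.
        if event["event"] != "OrderPlaced":
            return
        if event["orderId"] in self._seen_orders:
            return  # ignore redelivered events so totals stay correct
        self._seen_orders.add(event["orderId"])
        key = (event["tenantId"], event["currency"])
        # str() first so a JSON float such as 199.5 converts to an exact Decimal
        self._revenue[key] += Decimal(str(event["totalAmount"]))

    def revenue_for(self, tenant_id, currency):
        return self._revenue.get((tenant_id, currency), Decimal("0"))


order_placed = {
    "event": "OrderPlaced",
    "orderId": "ord_1042",
    "tenantId": "t_17",
    "totalAmount": 199.50,
    "currency": "USD",
    "placedAt": "2026-03-22T14:12:00Z",
}

projection = TenantRevenueProjection()
projection.apply(order_placed)
projection.apply(order_placed)  # redelivery has no effect
```

The class exposes `apply` and a query method but no setter for order state, which is the structural version of the rule that a read model reports on orders without redefining them.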
data_classification:
  shipping_address_copy:
    type: reference_data
    authoritative_owner: customer-profile
  daily_revenue_total:
    type: derived_data
    authoritative_owner: billing-events
  order_reporting_projection:
    type: read_model
    authoritative_owner: checkout-and-billing-events
What this demonstrates:
The point is not to avoid all copies. The point is to stop vague duplication from becoming vague ownership.
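A classification like the one above can also be made executable. The sketch below mirrors it as a plain Python dict and adds two hypothetical helpers, `correction_target` and `guard_business_write`, to show how a registry might route corrections to the owner and refuse writes to copies. It illustrates the idea only and is not a reference implementation.

```python
# Mirrors the classification above as a dict so the sketch needs no YAML parser.
DATA_CLASSIFICATION = {
    "shipping_address_copy": {
        "type": "reference_data",
        "authoritative_owner": "customer-profile",
    },
    "daily_revenue_total": {
        "type": "derived_data",
        "authoritative_owner": "billing-events",
    },
    "order_reporting_projection": {
        "type": "read_model",
        "authoritative_owner": "checkout-and-billing-events",
    },
}

NON_AUTHORITATIVE_TYPES = {"reference_data", "derived_data", "read_model"}


def correction_target(dataset: str) -> str:
    """Return where a correction for this dataset should actually be made."""
    entry = DATA_CLASSIFICATION[dataset]
    if entry["type"] in NON_AUTHORITATIVE_TYPES:
        return entry["authoritative_owner"]
    return dataset


def guard_business_write(dataset: str) -> None:
    """Refuse business-state writes to anything that is not authoritative."""
    entry = DATA_CLASSIFICATION[dataset]
    if entry["type"] in NON_AUTHORITATIVE_TYPES:
        raise PermissionError(
            f"{dataset} is {entry['type']}; "
            f"correct it via {entry['authoritative_owner']}"
        )
```

For example, `correction_target("daily_revenue_total")` returns `"billing-events"`, and `guard_business_write("order_reporting_projection")` raises `PermissionError`, which turns vague ownership into an explicit failure.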
A team says its reporting database is “just a read model,” but operations staff regularly fix customer-visible order states there because it is the fastest place to update dashboards. Is it still just a read model?
No. The stronger answer is that the reporting store has become a hidden operational write path. Once that happens, the architecture has lost its distinction between projection and authority.