Bronze-Silver-Gold Architecture Questions

Please answer the following questions to clarify the Bronze-Silver-Gold (Medallion) architecture approach for the inRiver ETL pipeline.

Instructions:

  • Answer each question by filling in the letter choice after the Answer: tag
  • If none of the options match your needs, choose the last option (Other) and describe your preference
  • Let me know when you're done

Architecture Overview

Bronze Layer (Raw/Landing)

  • Raw JSON files from inRiver S3 exports
  • Append-only, immutable
  • Audit metadata: load_timestamp, file_name, business_unit, source_system
  • Preserves complete lineage and reprocessing capability

Gold Layer (Business-Ready)

  • Current Blower table format (backward compatible)
  • Full product dataset (stateful, not just deltas)
  • Downstream systems consume directly
  • Updated via merge/upsert from Bronze deltas
  • Matches existing schema exactly

Question 1: Layer Structure

Do you need a Silver (intermediate) layer, or go directly Bronze → Gold?

A) Bronze → Gold only - Direct transformation, simpler architecture
B) Bronze → Silver → Gold - Silver for normalized/cleaned data, Gold for business aggregations
C) Bronze → Silver → Gold - Silver matches inRiver model (Product/Item), Gold flattens to Item-centric
D) Other (please describe after Answer: tag below)

Answer: C - but with properly shaped tabular data in Silver, as well as leveraging dbt historical versioning

Detailed Architecture:

Bronze Layer:

  • Raw JSON from inRiver (immutable landing zone)
  • Append-only, preserves complete source data

Silver Layer:

  • Properly shaped tabular data (normalized relational tables)
  • Matches inRiver model: Product, Item, ChannelNode, Resource tables with proper relationships
  • dbt snapshots for historical versioning - Track changes over time using Type 2 Slowly Changing Dimensions (SCD); a sketch follows this list
  • Preserves the Product → Item hierarchy as separate tables
  • Enables historical analysis and audit trail
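
To make the snapshot approach concrete, a minimal dbt snapshot over a Silver Item table might look like the sketch below; the model and column names (silver_items, item_id, last_modified) are assumptions to be adapted to the actual schema.

sql
{% snapshot silver_items_snapshot %}

{{
    config(
      target_schema='silver_history',
      unique_key='item_id',
      strategy='timestamp',
      updated_at='last_modified'
    )
}}

-- dbt adds dbt_valid_from / dbt_valid_to columns on each run,
-- producing Type 2 SCD history for every Item
select * from {{ ref('silver_items') }}

{% endsnapshot %}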

Gold Layer:

  • Flattened to Item-centric model (backward compatible with current Blower schema)
  • Business-ready for downstream consumption (Magento, Shopware, Algolia)
  • Full dataset maintained (not just deltas)
  • Optimized for downstream query patterns
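
A hedged sketch of that flattening as a dbt incremental model follows; table and column names are illustrative, and the real model would carry the full Blower column set.

sql
{{
    config(
      materialized='incremental',
      unique_key='item_id'
    )
}}

-- Flatten the Product -> Item hierarchy into one Item-centric row
select
    i.item_id,
    i.display_name,
    p.product_id,
    p.product_name,
    i.last_modified as last_updated_timestamp
from {{ ref('silver_items') }} i
join {{ ref('silver_products') }} p on p.product_id = i.product_id

{% if is_incremental() %}
  -- on incremental runs, only merge rows changed since the last build
  where i.last_modified > (select max(last_updated_timestamp) from {{ this }})
{% endif %}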

Key Benefits:

  • Historical tracking: dbt snapshots capture every change to Products/Items over time
  • Flexibility: Silver preserves inRiver's data model for future use cases
  • Backward compatibility: Gold maintains current Blower format
  • Audit trail: Can see how any product/item changed over time
  • Reprocessability: Can rebuild Gold from Silver if business logic changes

Question 2: Bronze Storage Format

How should Bronze layer store raw JSON?

A) Aurora MySQL JSON columns - Store entire JSON payload in single column per entity
B) S3 + Aurora metadata - Keep JSON in S3, Aurora tracks file metadata only
C) Aurora JSONB-like - Parse and store as structured JSON in Aurora
D) Other (please describe after Answer: tag below)

Rationale:

Why S3 + Aurora Metadata is Best:

  • Cost-effective: S3 storage (~$0.023/GB-month) vs Aurora storage costs
  • Scalability: S3 handles large JSON files without database bloat
  • Separation of concerns: Raw files in S3, processing metadata in Aurora
  • dbt compatibility: dbt can read from S3 using external tables or Python models
  • Reprocessability: Original files preserved in S3 for full reprocessing
  • Airflow integration: Natural fit - Airflow detects S3 events, logs metadata to Aurora, triggers dbt
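
Where external tables are not available on the Aurora MySQL side, one practical staging route is Aurora's LOAD DATA FROM S3: land newline-delimited JSON into a one-column table that dbt then shreds. A minimal sketch, assuming the cluster has an IAM role permitting S3 reads (bucket, path, and table names are illustrative):

sql
-- Stage one JSON document per row; downstream dbt models parse raw_json
LOAD DATA FROM S3 's3://inriver-exports/bak/items_batch_001.json'
INTO TABLE bronze_raw_items
LINES TERMINATED BY '\n'
(raw_json);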

Bronze Aurora Table Structure:

sql
CREATE TABLE bronze_file_metadata (
  file_id           BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  file_name         VARCHAR(255) NOT NULL,   -- original filename from inRiver
  s3_path           VARCHAR(500) NOT NULL,   -- full S3 URI of the raw file
  business_unit     VARCHAR(16)  NOT NULL,   -- BAK, CEFL, ECM, GEM
  entity_type       VARCHAR(32)  NOT NULL,   -- Product, Item, ChannelNode, ...
  load_timestamp    TIMESTAMP    NOT NULL,   -- when the file was ingested
  file_size         BIGINT,                  -- bytes, for data quality checks
  record_count      INT,                     -- number of entities in the file
  processing_status VARCHAR(16)  NOT NULL DEFAULT 'pending'  -- pending/processing/completed/failed
);

Workflow:

  1. inRiver writes JSON to S3
  2. S3 event triggers Airflow DAG
  3. Airflow logs file metadata to Aurora bronze table
  4. dbt reads JSON from S3, transforms to Silver tables
  5. dbt snapshots track changes in Silver
  6. dbt transforms Silver → Gold
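
Step 4 of this workflow could be a Silver dbt model that shreds the staged JSON with MySQL's JSON functions while carrying lineage forward; the JSON paths and the file_id tag on staged rows are assumptions.

sql
-- Shred raw Item JSON into typed columns and attach lineage metadata
select
    json_unquote(json_extract(r.raw_json, '$.id'))          as item_id,
    json_unquote(json_extract(r.raw_json, '$.productId'))   as product_id,
    json_unquote(json_extract(r.raw_json, '$.displayName')) as display_name,
    m.file_id        as bronze_file_id,
    m.s3_path        as source_s3_uri,
    m.load_timestamp as source_load_timestamp
from {{ source('bronze', 'raw_items') }} r
join {{ source('bronze', 'file_metadata') }} m on m.file_id = r.file_id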

Why Not Other Options:

  • Option A (JSON columns): Database bloat, higher costs, backup overhead, limited scalability
  • Option C (JSONB-like): Aurora MySQL has no true JSONB equivalent, so up-front parsing adds overhead without corresponding benefits

Question 3: Bronze Audit Metadata

What audit metadata should Bronze capture?

A) Minimal - load_timestamp, file_name only
B) Standard - load_timestamp, file_name, business_unit, entity_type, entity_id
C) Comprehensive - Above + source_file_size, record_count, checksum, processing_status
D) Other (please describe after Answer: tag below)

Rationale:

Comprehensive metadata provides significant operational and debugging benefits with minimal implementation overhead:

Core Metadata:

  • load_timestamp - When file was ingested
  • file_name - Original filename from inRiver
  • business_unit - BAK, CEFL, ECM, GEM
  • entity_type - Product, Item, ChannelNode, AttributeFile
  • entity_id - Primary entity ID from file (if applicable)

Extended Metadata:

  • source_file_size - File size in bytes (data quality check)
  • record_count - Number of entities in file (validation)
  • checksum - MD5/SHA256 hash (detect corruption, prevent duplicates)
  • processing_status - pending, processing, completed, failed (workflow tracking)

Additional Useful Fields:

  • s3_path - Full S3 URI
  • s3_bucket - Bucket name
  • s3_key - Object key
  • created_at - Record creation timestamp
  • updated_at - Last update timestamp
  • processed_at - When processing completed
  • error_message - Capture failures for debugging

Benefits:

  • Data quality: Validate record counts match expectations
  • Debugging: Quickly identify problematic files
  • Idempotency: Checksum prevents duplicate processing
  • Monitoring: Track processing status and duration
  • Audit trail: Complete lineage for compliance
  • Reprocessing: Identify which files to reprocess

Implementation Note: This metadata is easy to capture during S3 ingestion - Airflow can read most of it from S3 object properties and compute record counts and checksums while parsing the file, all with minimal code.
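
One way to make that capture idempotent is a unique index on the checksum, so a re-delivered file becomes a no-op at registration time. A sketch, assuming MD5 hex digests computed by Airflow before the insert:

sql
-- A unique key on the checksum makes duplicate registration a no-op
ALTER TABLE bronze_file_metadata
  ADD COLUMN checksum CHAR(32),
  ADD UNIQUE KEY uq_bronze_checksum (checksum);

-- INSERT IGNORE skips files whose checksum is already registered
INSERT IGNORE INTO bronze_file_metadata
  (file_name, s3_path, business_unit, entity_type,
   load_timestamp, checksum, processing_status)
VALUES
  ('items_batch_001.json', 's3://inriver-exports/bak/items_batch_001.json',
   'BAK', 'Item', NOW(), '9e107d9d372bb6826bd81d3542a419d6', 'pending');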


Question 4: Initial Gold Load (Bootstrap)

How to handle initial Gold load to populate the full dataset?

A) Request full export from inRiver - One-time full load to populate Gold
B) Migrate existing Blower data - Copy current Blower tables to Gold as starting point
C) Accumulate deltas - Start empty, build up from delta files over time
D) Other (please describe after Answer: tag below)

Rationale:

Accumulate Deltas Approach:

  • Start with empty Gold tables
  • Process delta files from inRiver to build up complete dataset
  • Each delta updates/inserts records in Gold via merge/upsert
  • Over time, Gold accumulates to full product catalog
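
In Aurora MySQL, that merge/upsert is typically expressed as INSERT ... ON DUPLICATE KEY UPDATE. A minimal sketch, assuming gold_items keys on item_id and silver_items_delta holds one batch:

sql
-- Apply one delta batch to Gold: insert new items, update existing ones
INSERT INTO gold_items (item_id, product_id, display_name, last_updated_timestamp)
SELECT item_id, product_id, display_name, NOW()
FROM silver_items_delta
ON DUPLICATE KEY UPDATE
  product_id             = VALUES(product_id),
  display_name           = VALUES(display_name),
  last_updated_timestamp = VALUES(last_updated_timestamp);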

Periodic Full Rebuilds:

  • Frequency: Quarterly (or as needed)
  • Process: Drop and rebuild Gold from Bronze/Silver history
  • Purpose: Data quality validation, schema migrations, bug fixes
  • Requirement: System must support full rebuild capability

Full Rebuild Process:

  1. Trigger full rebuild (manual or scheduled)
  2. Request inRiver to send complete dataset as delta batches
  3. Clear Gold tables (or create new version)
  4. Process all delta files through Bronze → Silver → Gold
  5. Validate Gold matches expected state
  6. Cutover to new Gold tables
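
The cutover in step 6 can be atomic in MySQL: build into a shadow table, then swap both names in a single RENAME TABLE statement (table names illustrative; the old table can be kept briefly for rollback).

sql
-- Atomic swap: readers never observe a partially rebuilt Gold table
RENAME TABLE gold_items     TO gold_items_old,
             gold_items_new TO gold_items;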

Lineage Tracking in Gold:

Gold tables should include metadata columns for traceability:

sql
-- Lineage columns in Gold tables
source_s3_uri VARCHAR(500)      -- Original S3 file path
source_file_name VARCHAR(255)   -- Original filename
source_load_timestamp TIMESTAMP -- When file was ingested to Bronze
last_updated_timestamp TIMESTAMP -- When record was last updated in Gold
bronze_file_id BIGINT           -- FK to bronze_file_metadata

Benefits:

  • Auditability: Trace any Gold record back to source S3 file
  • Debugging: Identify which file caused data issues
  • Reprocessing: Know which files to reprocess for specific records
  • Data quality: Validate Gold against Bronze source files
  • Compliance: Complete lineage for regulatory requirements

Implementation Notes:

  • dbt models can propagate lineage metadata from Bronze → Silver → Gold
  • Silver snapshots preserve historical lineage
  • Gold maintains current lineage for active records
  • Full rebuilds validate entire pipeline end-to-end
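
With those lineage columns in place, tracing any Gold record back to its source file is a single join; the identifier value is hypothetical.

sql
-- Which source file produced this Gold record, and when was it ingested?
SELECT g.item_id, b.file_name, b.s3_path, b.load_timestamp
FROM gold_items g
JOIN bronze_file_metadata b ON b.file_id = g.bronze_file_id
WHERE g.item_id = 'ITEM-12345';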

Question 5: Bronze Retention Policy

How long should Bronze layer data be retained?

A) Keep forever - Complete audit trail, reprocessing capability
B) Rolling window - Keep last 90 days, archive older to S3
C) After Gold sync - Delete Bronze after successful Gold update
D) Other (please describe after Answer: tag below)

Answer:


Question 6: dbt Integration

How does dbt fit into this architecture?

A) dbt for Bronze → Gold - dbt models transform Bronze JSON to Gold tables
B) Python for Bronze, dbt for Silver → Gold - Python loads JSON, dbt transforms
C) dbt for all layers - dbt sources from S3, builds Bronze/Silver/Gold
D) Other (please describe after Answer: tag below)

Answer:


Question 7: Silver Layer Normalization

Should Silver layer normalize the Product → Item hierarchy?

A) Yes - Silver has Product and Item tables (normalized), Gold flattens to Items
B) No - Skip Silver, Bronze → Gold directly flattens Products to Items
C) Hybrid - Silver for complex entities, direct Bronze → Gold for simple ones
D) Other (please describe after Answer: tag below)

Answer:


Question 8: Deleted Entities in Gold

How to handle deleted entities in Gold layer?

A) Hard delete - Remove from Gold when Bronze indicates deletion
B) Soft delete - Add deleted_at timestamp in Gold (change from earlier requirement)
C) Archive table - Move deleted records to separate archive table
D) Other (please describe after Answer: tag below)

Answer:


Benefits of Medallion Architecture

  • Reprocessability: Bronze preserves raw data for reprocessing if transformations change
  • Auditability: Complete lineage from source to consumption
  • Flexibility: Can rebuild Gold from Bronze if schema changes
  • Downstream simplicity: Gold maintains full dataset, downstream systems unchanged
  • Data quality: Each layer can have validation and quality checks
  • Performance: Gold optimized for query performance, Bronze for ingestion

Next Steps: After answering these questions, we'll update the requirements document to reflect the Bronze-Silver-Gold architecture.