Bronze-Silver-Gold Architecture Questions
Please answer the following questions to clarify the Bronze-Silver-Gold (Medallion) architecture approach for the inRiver ETL pipeline.
Instructions:
- Answer each question by filling in the letter choice after the Answer: tag
- If none of the options match your needs, choose the last option (Other) and describe your preference
- Let me know when you're done
Architecture Overview
Bronze Layer (Raw/Landing)
- Raw JSON files from inRiver S3 exports
- Append-only, immutable
- Audit metadata:
load_timestamp, file_name, business_unit, source_system
- Preserves complete lineage and reprocessing capability
Gold Layer (Business-Ready)
- Current Blower table format (backward compatible)
- Full product dataset (stateful, not just deltas)
- Downstream systems consume directly
- Updated via merge/upsert from Bronze deltas
- Matches existing schema exactly
Question 1: Layer Structure
Do you need a Silver (intermediate) layer, or should the pipeline go directly Bronze → Gold?
A) Bronze → Gold only - Direct transformation, simpler architecture
B) Bronze → Silver → Gold - Silver for normalized/cleaned data, Gold for business aggregations
C) Bronze → Silver → Gold - Silver matches inRiver model (Product/Item), Gold flattens to Item-centric
D) Other (please describe after Answer: tag below)
Answer: C - but with properly shaped tabular data in silver as well as leveraging dbt historical versioning
Detailed Architecture:
Bronze Layer:
- Raw JSON from inRiver (immutable landing zone)
- Append-only, preserves complete source data
Silver Layer:
- Properly shaped tabular data (normalized relational tables)
- Matches inRiver model: Product, Item, ChannelNode, Resource tables with proper relationships
- dbt snapshots for historical versioning - Track changes over time using Type 2 Slowly Changing Dimensions (SCD)
- Preserves the Product → Item hierarchy as separate tables
- Enables historical analysis and audit trail
Gold Layer:
- Flattened to Item-centric model (backward compatible with current Blower schema)
- Business-ready for downstream consumption (Magento, Shopware, Algolia)
- Full dataset maintained (not just deltas)
- Optimized for downstream query patterns
Key Benefits:
- Historical tracking: dbt snapshots capture every change to Products/Items over time
- Flexibility: Silver preserves inRiver's data model for future use cases
- Backward compatibility: Gold maintains current Blower format
- Audit trail: Can see how any product/item changed over time
- Reprocessability: Can rebuild Gold from Silver if business logic changes
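The Silver-layer versioning above relies on dbt snapshots, which implement Type 2 SCD tracking. A minimal Python sketch of that logic, for intuition only (column names valid_from/valid_to are illustrative; dbt's actual snapshot columns are dbt_valid_from/dbt_valid_to):

```python
from datetime import datetime, timezone

def apply_scd2_change(history, new_row, key="item_id"):
    """Type 2 SCD: close the current version of a record if it changed,
    then append the new version. `history` is a list of dicts mimicking
    a dbt snapshot table."""
    now = datetime.now(timezone.utc)
    current = next(
        (r for r in history if r[key] == new_row[key] and r["valid_to"] is None),
        None,
    )
    if current is not None:
        if all(current.get(k) == v for k, v in new_row.items()):
            return history  # no change: keep the open version as-is
        current["valid_to"] = now  # close the superseded version
    history.append({**new_row, "valid_from": now, "valid_to": None})
    return history
```

In dbt this would be a snapshot block using the `check` strategy on the Silver columns, so every Product/Item change lands as a new versioned row.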
Question 2: Bronze Storage Format
How should the Bronze layer store raw JSON?
A) Aurora MySQL JSON columns - Store entire JSON payload in single column per entity
B) S3 + Aurora metadata - Keep JSON in S3, Aurora tracks file metadata only
C) Aurora JSONB-like - Parse and store as structured JSON in Aurora
D) Other (please describe after Answer: tag below)
Rationale:
Why S3 + Aurora Metadata is Best:
- Cost-effective: S3 storage (~$0.023/GB) vs Aurora storage costs
- Scalability: S3 handles large JSON files without database bloat
- Separation of concerns: Raw files in S3, processing metadata in Aurora
- dbt compatibility: dbt can read from S3 using external tables or Python models
- Reprocessability: Original files preserved in S3 for full reprocessing
- Airflow integration: Natural fit - Airflow detects S3 events, logs metadata to Aurora, triggers dbt
Bronze Aurora Table Structure:
bronze_file_metadata (
file_id,
file_name,
s3_path,
business_unit,
entity_type,
load_timestamp,
file_size,
record_count,
processing_status
)

Workflow:
- inRiver writes JSON to S3
- S3 event triggers Airflow DAG
- Airflow logs file metadata to Aurora bronze table
- dbt reads JSON from S3, transforms to Silver tables
- dbt snapshots track changes in Silver
- dbt transforms Silver → Gold
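Steps 2-3 of this workflow can be sketched as a small helper an Airflow task might call to build the bronze_file_metadata row. The S3 event shape follows the standard S3 notification format; the key layout `<business_unit>/<entity_type>/<file_name>` is an assumed convention, and file_id is left to Aurora's auto-increment:

```python
import json
from datetime import datetime, timezone

def build_bronze_metadata(s3_event_record, file_bytes):
    """Derive a bronze_file_metadata row from an S3 event notification
    plus the downloaded file body. Key layout is an assumed convention:
    <business_unit>/<entity_type>/<file_name>."""
    s3 = s3_event_record["s3"]
    bucket = s3["bucket"]["name"]
    key = s3["object"]["key"]
    business_unit, entity_type, file_name = key.split("/", 2)
    return {
        "file_name": file_name,
        "s3_path": f"s3://{bucket}/{key}",
        "business_unit": business_unit,
        "entity_type": entity_type,
        "load_timestamp": datetime.now(timezone.utc).isoformat(),
        "file_size": len(file_bytes),
        "record_count": len(json.loads(file_bytes)),
        "processing_status": "pending",
    }
```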
Why Not Other Options:
- Option A (JSON columns): Database bloat, higher costs, backup overhead, limited scalability
- Option C (JSONB-like): Aurora MySQL lacks true JSONB, parsing overhead without benefits
Question 3: Bronze Audit Metadata
What audit metadata should Bronze capture?
A) Minimal - load_timestamp, file_name only
B) Standard - load_timestamp, file_name, business_unit, entity_type, entity_id
C) Comprehensive - Above + source_file_size, record_count, checksum, processing_status
D) Other (please describe after Answer: tag below)
Rationale:
Comprehensive metadata provides significant operational and debugging benefits with minimal implementation overhead:
Core Metadata:
- load_timestamp - When file was ingested
- file_name - Original filename from inRiver
- business_unit - BAK, CEFL, ECM, GEM
- entity_type - Product, Item, ChannelNode, AttributeFile
- entity_id - Primary entity ID from file (if applicable)
Extended Metadata:
- source_file_size - File size in bytes (data quality check)
- record_count - Number of entities in file (validation)
- checksum - MD5/SHA256 hash (detect corruption, prevent duplicates)
- processing_status - pending, processing, completed, failed (workflow tracking)
Additional Useful Fields:
- s3_path - Full S3 URI
- s3_bucket - Bucket name
- s3_key - Object key
- created_at - Record creation timestamp
- updated_at - Last update timestamp
- processed_at - When processing completed
- error_message - Capture failures for debugging
Benefits:
- Data quality: Validate record counts match expectations
- Debugging: Quickly identify problematic files
- Idempotency: Checksum prevents duplicate processing
- Monitoring: Track processing status and duration
- Audit trail: Complete lineage for compliance
- Reprocessing: Identify which files to reprocess
Implementation Note: This data is easy to capture during S3 ingestion - Airflow can extract all metadata from S3 object properties and file parsing with minimal code.
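The idempotency benefit of the checksum column can be sketched as a gate an ingestion task runs before inserting metadata. Here `seen_checksums` stands in for a lookup against the checksum column in bronze_file_metadata:

```python
import hashlib

def should_process(file_bytes, seen_checksums):
    """Return (process?, digest). Skips files whose content was already
    ingested, so re-delivered S3 events don't create duplicate Bronze rows.
    `seen_checksums` stands in for a query on bronze_file_metadata.checksum."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    if digest in seen_checksums:
        return False, digest
    seen_checksums.add(digest)
    return True, digest
```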
Question 4: Initial Gold Load (Bootstrap)
How should the initial Gold load be handled to populate the full dataset?
A) Request full export from inRiver - One-time full load to populate Gold
B) Migrate existing Blower data - Copy current Blower tables to Gold as starting point
C) Accumulate deltas - Start empty, build up from delta files over time
D) Other (please describe after Answer: tag below)
Rationale:
Accumulate Deltas Approach:
- Start with empty Gold tables
- Process delta files from inRiver to build up complete dataset
- Each delta updates/inserts records in Gold via merge/upsert
- Over time, Gold accumulates to full product catalog
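In miniature, the accumulation step behaves like the sketch below, with Gold modeled as a dict keyed by an assumed item_id; in Aurora this would be a SQL MERGE or INSERT ... ON DUPLICATE KEY UPDATE:

```python
def upsert_deltas(gold, deltas, key="item_id"):
    """Apply a batch of delta records to Gold: update existing rows,
    insert new ones. Over successive batches, `gold` converges to the
    full product catalog."""
    for row in deltas:
        existing = gold.get(row[key], {})
        gold[row[key]] = {**existing, **row}  # delta fields win over old values
    return gold
```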
Periodic Full Rebuilds:
- Frequency: Quarterly (or as needed)
- Process: Drop and rebuild Gold from Bronze/Silver history
- Purpose: Data quality validation, schema migrations, bug fixes
- Requirement: System must support full rebuild capability
Full Rebuild Process:
- Trigger full rebuild (manual or scheduled)
- Request inRiver to send complete dataset as delta batches
- Clear Gold tables (or create new version)
- Process all delta files through Bronze → Silver → Gold
- Validate Gold matches expected state
- Cutover to new Gold tables
Lineage Tracking in Gold:
Gold tables should include metadata columns for traceability:
-- Lineage columns in Gold tables
source_s3_uri VARCHAR(500) -- Original S3 file path
source_file_name VARCHAR(255) -- Original filename
source_load_timestamp TIMESTAMP -- When file was ingested to Bronze
last_updated_timestamp TIMESTAMP -- When record was last updated in Gold
bronze_file_id BIGINT -- FK to bronze_file_metadata

Benefits:
- Auditability: Trace any Gold record back to source S3 file
- Debugging: Identify which file caused data issues
- Reprocessing: Know which files to reprocess for specific records
- Data quality: Validate Gold against Bronze source files
- Compliance: Complete lineage for regulatory requirements
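Stamping these lineage columns onto a Gold row could look like the following sketch, where `bronze_meta` is assumed to be a row from bronze_file_metadata and the output column names follow the lineage list above:

```python
from datetime import datetime, timezone

def with_lineage(gold_row, bronze_meta):
    """Copy lineage columns from a bronze_file_metadata row onto a Gold
    record so the record can be traced back to its source S3 file."""
    return {
        **gold_row,
        "source_s3_uri": bronze_meta["s3_path"],
        "source_file_name": bronze_meta["file_name"],
        "source_load_timestamp": bronze_meta["load_timestamp"],
        "bronze_file_id": bronze_meta["file_id"],
        "last_updated_timestamp": datetime.now(timezone.utc).isoformat(),
    }
```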
Implementation Notes:
- dbt models can propagate lineage metadata from Bronze → Silver → Gold
- Silver snapshots preserve historical lineage
- Gold maintains current lineage for active records
- Full rebuilds validate entire pipeline end-to-end
Question 5: Bronze Retention Policy
How long should Bronze layer data be retained?
A) Keep forever - Complete audit trail, reprocessing capability
B) Rolling window - Keep last 90 days, archive older to S3
C) After Gold sync - Delete Bronze after successful Gold update
D) Other (please describe after Answer: tag below)
Question 6: dbt Integration
How does dbt fit into this architecture?
A) dbt for Bronze → Gold - dbt models transform Bronze JSON to Gold tables
B) Python for Bronze, dbt for Silver → Gold - Python loads JSON, dbt transforms
C) dbt for all layers - dbt sources from S3, builds Bronze/Silver/Gold
D) Other (please describe after Answer: tag below)
Question 7: Silver Layer Normalization
Should Silver layer normalize the Product → Item hierarchy?
A) Yes - Silver has Product and Item tables (normalized), Gold flattens to Items
B) No - Skip Silver, Bronze → Gold directly flattens Products to Items
C) Hybrid - Silver for complex entities, direct Bronze → Gold for simple ones
D) Other (please describe after Answer: tag below)
Question 8: Deleted Entities in Gold
How should deleted entities be handled in the Gold layer?
A) Hard delete - Remove from Gold when Bronze indicates deletion
B) Soft delete - Add deleted_at timestamp in Gold (change from earlier requirement)
C) Archive table - Move deleted records to separate archive table
D) Other (please describe after Answer: tag below)
Benefits of Medallion Architecture
- Reprocessability: Bronze preserves raw data for reprocessing if transformations change
- Auditability: Complete lineage from source to consumption
- Flexibility: Can rebuild Gold from Bronze if schema changes
- Downstream simplicity: Gold maintains full dataset, downstream systems unchanged
- Data quality: Each layer can have validation and quality checks
- Performance: Gold optimized for query performance, Bronze for ingestion
Next Steps: After answering these questions, we'll update the requirements document to reflect the Bronze-Silver-Gold architecture.