
Requirements Verification Questions

Please answer the following questions to help clarify the requirements for the inRiver ETL pipeline project.

Instructions:

  • Answer each question by filling in the letter choice after the Answer: tag
  • If none of the options match your needs, choose the last option (Other) and describe your preference
  • Let me know when you're done

Question 1: Current Blower Database Schema

Do you have documentation or can you provide the current Blower database schema (tables, columns, data types)?

A) Yes, I can provide the schema documentation
B) Yes, I can provide access to the database for schema extraction
C) Partial - I have some documentation but not complete
D) No, but I can work with the team to document it
E) Other (please describe after Answer: tag below)

Answer:


Question 2: Downstream Systems

What downstream systems consume the Blower database outputs, and what format do they expect?

A) E-commerce platforms (Magento, Shopware) - they expect specific table structures
B) Business Intelligence/Analytics tools - they expect a dimensional model
C) Multiple systems with different requirements
D) Not sure - need to identify all consumers
E) Other (please describe after Answer: tag below)

Answer: A - plus Algolia indexes, and possibly other systems that currently run their own ETL


Question 3: Business Unit Handling

How should the pipeline handle multiple business units (BAK, CEFL, ECM, GEM)?

A) Separate pipelines per business unit (parallel processing)
B) Single pipeline with business unit filtering/routing
C) Single pipeline, all business units in same tables with BU column
D) Other (please describe after Answer: tag below)

Answer:


Question 4: Data Volume and Performance

What are the expected data volumes and performance requirements?

A) Small scale: <100K products, hourly updates acceptable
B) Medium scale: 100K-1M products, updates within 15 minutes
C) Large scale: >1M products, near real-time (<5 minutes)
D) Not sure - need to analyze current volumes
E) Other (please describe after Answer: tag below)

Answer:


Question 5: Delta Update Strategy

How should the pipeline handle delta updates (inserts, updates, deletes)?

A) Merge/upsert based on entity ID with soft deletes (deleted_at timestamp)
B) Merge/upsert with hard deletes (remove records)
C) Append-only with versioning (keep history of all changes)
D) Not sure - need guidance on best practice
E) Other (please describe after Answer: tag below)

Answer:


Question 6: Error Handling and Data Quality

What error handling and data quality requirements are needed?

A) Basic: Log errors, continue processing, manual review
B) Standard: Validation rules, quarantine bad records, alerting
C) Comprehensive: Schema validation, business rule validation, automated retry, dead letter queue
D) Not sure - need recommendations
E) Other (please describe after Answer: tag below)

Answer:


Question 7: Existing Airflow Infrastructure

What is the current Airflow setup?

A) Existing Airflow cluster - can add new DAGs
B) Airflow exists but needs upgrade/reconfiguration
C) Need to provision new Airflow infrastructure
D) Open to alternatives (e.g., AWS Step Functions, Prefect)
E) Other (please describe after Answer: tag below)

Answer: A - we will likely put this in a new MWAA environment that has already been created
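
Note: a minimal sketch (not a final design) of what such a DAG could look like in MWAA, assuming the export lands in S3 and is followed by a dbt run, and assuming Airflow 2.4+ (older versions use schedule_interval). The DAG id, bucket name, key pattern, and dbt project path are placeholders, not confirmed details.

```python
# Sketch only: an MWAA DAG that waits for an inRiver export in S3 and
# then runs the dbt models. Names and paths below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="inriver_incremental_load",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered externally, e.g. from the S3 event flow in Q8
    catchup=False,
) as dag:
    wait_for_export = S3KeySensor(
        task_id="wait_for_export",
        bucket_name="inriver-exports",    # placeholder bucket
        bucket_key="incremental/*.json",  # placeholder key pattern
        wildcard_match=True,
    )

    run_dbt = BashOperator(
        task_id="run_dbt_models",
        bash_command="dbt run --project-dir /usr/local/airflow/dags/dbt",  # placeholder path
    )

    wait_for_export >> run_dbt
```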


Question 8: S3 Bucket Access

How will the pipeline access inRiver exports in S3?

A) S3 bucket already configured with access credentials
B) Need to set up S3 bucket and IAM roles
C) Aperture Labs will provide S3 event notifications
D) Not sure - need to coordinate with Aperture Labs
E) Other (please describe after Answer: tag below)

Answer: A - we control the S3 bucket, so we can add event notifications on file upload
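
Note: since we control the bucket, adding the notification could look roughly like this, assuming events are routed to an SQS queue that kicks off the pipeline; the bucket name, queue ARN, and key prefix are placeholders.

```python
# Sketch: configure an object-created notification on the bucket we
# control. Assumes an SQS queue (placeholder ARN) receives the events.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="inriver-exports",  # placeholder bucket name
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:eu-west-1:123456789012:inriver-uploads",  # placeholder
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "prefix", "Value": "incremental/"}]
                    }
                },
            }
        ]
    },
)
```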


Question 9: PostgreSQL Database

What is the PostgreSQL database setup?

A) Existing PostgreSQL database - can create new schemas
B) Need to provision new PostgreSQL instance
C) Using managed service (RDS, Aurora, Cloud SQL)
D) Not decided yet - need recommendations
E) Other (please describe after Answer: tag below)

Answer: C - my mistake: I think we actually use MariaDB on RDS; we can consider Aurora if that works better


Question 10: dbt Transformation Complexity

What level of transformation is needed in dbt to match current Blower output?

A) Simple: Mostly 1:1 mapping with minor transformations
B) Moderate: Some denormalization, calculated fields, lookups
C) Complex: Significant business logic, aggregations, multi-table joins
D) Not sure - need to analyze current transformations
E) Other (please describe after Answer: tag below)

Answer:


Question 11: Nested Data Handling

How should nested data (Items, Resources, Links) be handled in PostgreSQL?

A) Flatten to separate tables with foreign keys (normalized)
B) Store as JSONB columns (semi-structured)
C) Hybrid: Flatten key entities, JSONB for complex nested data
D) Not sure - need recommendations
E) Other (please describe after Answer: tag below)

Answer: A - one additional note: our current process deals with items, whereas our new files contain products, and items are children of products
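
Note: a rough illustration of option A applied to the new product/item hierarchy described above; the JSON field names are assumptions about the export format, not confirmed.

```python
# Illustrative only: flatten one nested product record into a product
# row plus normalized item rows that reference the parent product.
def flatten_product(product: dict) -> tuple[dict, list[dict]]:
    product_row = {
        "product_id": product["id"],  # assumed field name
        "business_unit": product.get("businessUnit"),
    }
    item_rows = [
        {
            "item_id": item["id"],        # assumed field name
            "product_id": product["id"],  # foreign key to parent product
        }
        for item in product.get("items", [])
    ]
    return product_row, item_rows
```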


Question 12: Monitoring and Alerting

What monitoring and alerting requirements are needed?

A) Basic: Airflow task success/failure notifications
B) Standard: Data quality metrics, processing time, record counts
C) Comprehensive: Full observability with metrics, logs, traces, SLAs
D) Not sure - need recommendations
E) Other (please describe after Answer: tag below)

Answer:


Question 13: Testing Strategy

What testing approach should be used?

A) Unit tests for dbt models and Python code
B) Integration tests with sample data
C) End-to-end tests with production-like data
D) All of the above
E) Other (please describe after Answer: tag below)

Answer: E - we want the created tables to be identical in format and data to our existing tables; e.g., if item 1234 has 5 attributes and 2 media, the same item produced by the new process should look exactly the same
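
Note: one way this parity requirement could be expressed as a test, assuming both databases are reachable and the old and new tables share names; hosts, credentials, and table names below are hypothetical.

```python
# Sketch of a row-level parity check between the legacy Blower tables
# and the new pipeline's tables for a single item.
import pymysql

# Placeholder connections - hosts, credentials, and schema names are assumptions
old_db = pymysql.connect(host="legacy-host", user="etl", password="secret", database="blower")
new_db = pymysql.connect(host="new-host", user="etl", password="secret", database="pipeline")

def rows_for_item(conn, table: str, item_id: int) -> list[tuple]:
    with conn.cursor() as cur:
        cur.execute(f"SELECT * FROM {table} WHERE item_id = %s", (item_id,))
        return sorted(cur.fetchall())

def assert_item_parity(item_id: int, table: str) -> None:
    expected = rows_for_item(old_db, table, item_id)
    actual = rows_for_item(new_db, table, item_id)
    assert expected == actual, f"{table} mismatch for item {item_id}"

# e.g. item 1234: its 5 attribute rows and 2 media rows should match exactly
for table in ("item_attributes", "item_media"):  # placeholder table names
    assert_item_parity(1234, table)
```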


Question 14: Deployment and CI/CD

What deployment approach should be used?

A) Manual deployment with documentation
B) CI/CD pipeline with automated testing
C) Infrastructure as Code (Terraform) + CI/CD
D) Not sure - need recommendations
E) Other (please describe after Answer: tag below)

Answer:


Question 15: Incremental vs Full Refresh

Should the pipeline support both incremental and full refresh modes?

A) Incremental only (delta updates)
B) Full refresh only (rebuild from scratch)
C) Both modes - incremental daily, full refresh on-demand
D) Not sure - need recommendations
E) Other (please describe after Answer: tag below)

Answer: A - inRiver can only send incremental files; if we need a full rebuild, we would drop the database and have inRiver send batches of incrementals covering every product for a business unit
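
Note: this replay-based rebuild implies each incremental file must be applied idempotently, so that processing the same product more than once converges to the same state. A sketch assuming MariaDB with a unique key on item_id; table and column names are placeholders.

```python
# Sketch of an idempotent upsert: replaying incrementals (including a
# full drop-and-replay rebuild) always converges to the same table state.
import pymysql

def upsert_items(conn: pymysql.connections.Connection, rows: list[dict]) -> None:
    sql = """
        INSERT INTO items (item_id, product_id, name)
        VALUES (%(item_id)s, %(product_id)s, %(name)s)
        ON DUPLICATE KEY UPDATE
            product_id = VALUES(product_id),
            name = VALUES(name)
    """
    with conn.cursor() as cur:
        cur.executemany(sql, rows)
    conn.commit()
```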


Question 16: AttributeFile Handling

How should the daily AttributeFile be processed?

A) Load to metadata table, use for validation
B) Load to metadata table, use for dynamic schema generation
C) Store as reference only, don't process
D) Not sure - need recommendations
E) Other (please describe after Answer: tag below)

Answer: E - the attribute file should be loaded daily into a table that the incremental item data joins against
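
Note: a sketch of that daily load plus the downstream join; the file path, table names, and column names are assumptions, and the DSN is a placeholder.

```python
# Sketch: replace the attribute metadata table each day, then let the
# incremental item models join against it.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://etl:secret@new-host/pipeline")  # placeholder DSN

# Assumes the daily AttributeFile has already been fetched from S3
attributes = pd.read_json("AttributeFile.json")
attributes.to_sql("attribute_metadata", engine, if_exists="replace", index=False)

# Incremental item data then joins to the fresh metadata, e.g.:
#   SELECT v.item_id, a.attribute_name, v.value
#   FROM item_attribute_values v
#   JOIN attribute_metadata a ON a.attribute_id = v.attribute_id;
```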


Question 17: ChannelNode Scalability

Given the concern about large ChannelNode OutboundLinks arrays, how should this be handled?

A) Load all links, optimize with indexing
B) Paginate/chunk large link arrays
C) Only process InboundLinks from Products (skip ChannelNode OutboundLinks)
D) Not sure - need recommendations
E) Other (please describe after Answer: tag below)

Answer: E - we are going to ask that outbound links to products be excluded


Question 18: Project Timeline

What is the expected timeline for this project?

A) Urgent: 2-4 weeks
B) Standard: 1-2 months
C) Flexible: 3+ months
D) Not sure - depends on complexity
E) Other (please describe after Answer: tag below)

Answer:


Question 19: Team and Resources

What team resources are available for this project?

A) Data engineers familiar with Airflow and dbt
B) Backend developers who can learn Airflow/dbt
C) Need to hire/contract resources
D) Mixed team with varying experience
E) Other (please describe after Answer: tag below)

Answer: D - we have engineers who use Airflow and dbt, but they have never used them in this way


Question 20: Success Criteria

How will success be measured for this migration?

A) Functional parity with current Blower outputs
B) Functional parity + improved performance (faster updates)
C) Functional parity + improved data quality + observability
D) Not sure - need to define success criteria
E) Other (please describe after Answer: tag below)

Answer: