E-commerce Product Catalog: System Design for Millions of SKUs
Problem Statement
E-commerce platforms require scalable product catalog systems capable of managing millions of SKUs with varying attributes, supporting complex search queries, and delivering sub-100ms response times. This is a common technical interview question at companies like Shopify, Amazon, and Wayfair. It challenges candidates to design flexible, high-performance catalog systems that support rapid product discovery and merchandising.
Solution Overview
A well-designed product catalog system uses a hybrid data storage approach combining relational databases for structured data integrity with search engines for query performance, complemented by robust caching and content delivery networks for scalability.
The system architecture separates read and write paths to optimize for different access patterns and scaling requirements. Real-time updates flow through an event-driven pipeline to maintain search index freshness while ensuring data consistency.
Implementation Details
1. Data Modeling Strategy
A flexible product data model must handle varying product types while maintaining query efficiency. This is a core Shopify interview topic that tests data modeling expertise.
The schema supports:
- Flexible attributes: Custom fields for different product types
- Hierarchical categories: Multi-level taxonomy with efficient traversal
- Variant support: For products with multiple options (size, color, etc.)
- Rich media: Images, videos, and other assets
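The four schema capabilities above can be sketched with plain dataclasses. This is a minimal illustration, not a prescribed schema: the class names, field names, and the `category_path` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Category:
    id: str
    name: str
    parent_id: Optional[str] = None  # None marks a root category


@dataclass
class Variant:
    sku: str
    options: dict[str, str]  # e.g. {"size": "M", "color": "blue"}
    price_cents: int


@dataclass
class Product:
    id: str
    title: str
    category_id: str
    # Flexible, per-product-type fields live in a free-form mapping.
    attributes: dict[str, Any] = field(default_factory=dict)
    variants: list[Variant] = field(default_factory=list)
    # Media is stored elsewhere; the product only keeps pointers to it.
    media_urls: list[str] = field(default_factory=list)


def category_path(categories: dict[str, Category], category_id: str) -> list[str]:
    """Walk parent links up to the root and return category names root-first."""
    path = []
    current = categories.get(category_id)
    while current is not None:
        path.append(current.name)
        current = categories.get(current.parent_id) if current.parent_id else None
    return list(reversed(path))
```

Keeping `attributes` as a mapping is what lets a sofa and a lamp share one product type, at the cost of pushing validation into application code.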
2. Storage Architecture
Wayfair's interviews often focus on the storage architecture needed for large catalogs.
The hybrid approach leverages:
- Relational Database: For structured core product data with ACID properties
- Document Database: For flexible product attributes and specifications
- Search Engine: For text search and faceted filtering
- Object Storage: For product images and other media
- Distributed Cache: For frequently accessed product data
3. Search Architecture
Amazon interviews frequently address how to implement efficient search functionality.
Key search components include:
- Query Understanding: Parse and enhance user queries
- Inverted Index: For fast full-text search
- Faceted Search: For refining results by attributes
- Relevance Tuning: Balance between text match and business metrics
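To make the inverted index and faceted search components concrete, here is a toy in-memory version. It is a sketch only: `TinySearchIndex` and its methods are hypothetical names, tokenization is a plain whitespace split, and relevance ranking is left out entirely. A production system would delegate all of this to a search engine.

```python
from collections import Counter, defaultdict


class TinySearchIndex:
    """In-memory inverted index with a facet-count helper (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of product ids
        self.docs = {}                    # product id -> stored document

    def add(self, product_id, title, attributes):
        self.docs[product_id] = {"title": title, "attributes": attributes}
        for token in title.lower().split():
            self.postings[token].add(product_id)

    def search(self, query, filters=None):
        """Return ids matching every query token and every attribute filter."""
        tokens = query.lower().split()
        if not tokens:
            return set()
        ids = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        for key, value in (filters or {}).items():
            ids = {i for i in ids if self.docs[i]["attributes"].get(key) == value}
        return ids

    def facet_counts(self, ids, attribute):
        """Count how many of the given results carry each value of an attribute."""
        return Counter(
            self.docs[i]["attributes"][attribute]
            for i in ids
            if attribute in self.docs[i]["attributes"]
        )
```

A query looks up one posting set per token and intersects them, which is why full-text lookups stay fast as the catalog grows. Facet counts are computed over the result set, not the whole catalog.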
4. Write Pipeline
Shopify interviews often ask about safely updating product data across all systems.
The event-driven architecture ensures:
- Data Consistency: Primary write to database first
- Eventual Consistency: Secondary systems updated asynchronously
- Failure Isolation: Issues in one system don't block others
- Retry Mechanisms: For handling transient failures
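The four guarantees above can be sketched in a few lines. This is a simplified, in-process stand-in for a real event bus: `update_product`, `with_retries`, and `TransientError` are hypothetical names, the "database" is a dict, and subscribers are called directly where a production system would publish asynchronously to a message broker.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn, retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


def update_product(db, subscribers, product_id, fields):
    """Write to the primary store first, then fan the change out to secondary systems.

    Returns the names of subscribers that still failed after retries.
    """
    db[product_id] = {**db.get(product_id, {}), **fields}  # primary write first
    event = {"type": "product.updated", "product_id": product_id, "fields": fields}
    failed = []
    for name, handler in subscribers.items():
        try:
            with_retries(lambda: handler(event))
        except Exception:
            # Failure isolation: one broken consumer must not block the others.
            failed.append(name)
    return failed
```

The returned list of failed subscribers is what a dead-letter queue or reconciliation job would later pick up.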
5. Read Path Optimization
Amazon and Wayfair interviews often focus on optimizing the read path.
Optimization techniques include:
- CDN Caching: For static content (images, product descriptions)
- API Caching: For frequently requested products
- Read Replicas: For database query scaling
- Materialized Views: For complex aggregations
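The API caching layer can be illustrated with a small read-through cache with a TTL. It is a sketch under stated assumptions: `ReadThroughCache` is a hypothetical class, the cache lives in process memory where a real deployment would likely use a distributed cache, and `loader` stands for whatever fetches the product from a database or read replica.

```python
import time


class ReadThroughCache:
    """Check the cache, fall back to the loader on a miss, store with a TTL."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self.loader = loader
        self.ttl = ttl_seconds
        self.clock = clock     # injectable so expiry can be tested without sleeping
        self.entries = {}      # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = self.loader(key)
        self.entries[key] = (self.clock() + self.ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key, for example when a product update event arrives."""
        self.entries.pop(key, None)
```

The `hits` and `misses` counters give the cache hit rate directly. The TTL bounds staleness, and explicit invalidation shortens it for products that just changed.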
6. Catalog Consistency Challenges
A common interview question at Shopify and Etsy: "How do you ensure catalog consistency across search, inventory, and pricing systems?"
Consistency is maintained through:
- Event-Driven Updates: Propagate changes via events
- Idempotent Handlers: Safely process duplicate events
- Reconciliation Jobs: Periodically verify and fix inconsistencies
- Version Tracking: Detect and resolve conflicts
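Idempotent handlers, version tracking, and reconciliation fit together as in the sketch below. It assumes each product carries a monotonically increasing `version` assigned by the primary database. The function names and the dict-based "index" are illustrative, not part of any particular system.

```python
def apply_event(index, event):
    """Apply a product event only if it is newer than what the index holds.

    Duplicate and out-of-order deliveries carry a version that is not greater
    than the stored one, so applying them is a no-op (idempotent).
    """
    current = index.get(event["product_id"])
    if current is not None and current["version"] >= event["version"]:
        return False
    index[event["product_id"]] = {"version": event["version"], "data": event["data"]}
    return True


def reconcile(source, index):
    """Compare the source of truth with a secondary index and repair any drift."""
    repaired = []
    for product_id, record in source.items():
        indexed = index.get(product_id)
        if indexed is None or indexed["version"] < record["version"]:
            index[product_id] = {"version": record["version"], "data": record["data"]}
            repaired.append(product_id)
    for product_id in set(index) - set(source):
        del index[product_id]  # remove entries the source no longer has
        repaired.append(product_id)
    return sorted(repaired)
```

The event path handles the common case quickly. The reconciliation job is the safety net for events that were lost or never published.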
7. Scaling for Millions of SKUs
Wayfair and Amazon frequently ask about scaling strategies for massive catalogs.
Key scaling techniques:
- Horizontal Sharding: Partition data by product category or ID range
- Vertical Splitting: Separate frequently accessed fields
- Data Tiering: Active vs. archived products
- Asynchronous Processing: For bulk operations
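As one illustration of horizontal sharding, the sketch below routes products to shards by hashing the product ID. Note that this is hash-based partitioning, a common alternative to the category or ID-range partitioning listed above. It trades range locality for a more even spread. The function names and the choice of SHA-256 are assumptions for the example.

```python
import hashlib


def shard_for(product_id: str, shard_count: int) -> int:
    """Map a product id to a shard using a stable hash.

    Python's built-in hash() is salted per process for strings, so it is
    unsuitable for routing that must agree across machines and restarts.
    """
    digest = hashlib.sha256(product_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def group_by_shard(product_ids, shard_count):
    """Bucket ids per shard so a bulk operation can be fanned out shard by shard."""
    buckets = {}
    for product_id in product_ids:
        buckets.setdefault(shard_for(product_id, shard_count), []).append(product_id)
    return buckets
```

Plain modulo routing remaps most keys when `shard_count` changes. Systems that expect to reshard often use consistent hashing or a lookup table instead.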
8. Multi-Region Deployment
Global e-commerce platforms need multi-region deployments.
The multi-region approach provides:
- Reduced Latency: Serve users from nearby regions
- Disaster Recovery: Survive regional outages
- Compliance: Meet data residency requirements
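The three benefits above interact when routing a request: latency picks the nearest region, health checks drive failover, and residency rules restrict the candidates. A minimal sketch, with a hypothetical `pick_region` function and made-up region names in the usage note:

```python
def pick_region(latency_ms, healthy, allowed=None):
    """Choose the lowest-latency healthy region, optionally restricted for data residency.

    latency_ms: region name -> measured latency from the user
    healthy:    set of regions currently passing health checks
    allowed:    optional set of regions permitted for this user's data
    """
    candidates = [
        region
        for region in latency_ms
        if region in healthy and (allowed is None or region in allowed)
    ]
    if not candidates:
        raise LookupError("no healthy region satisfies the constraints")
    return min(candidates, key=lambda region: latency_ms[region])
```

For example, with latencies `{"us-east": 20, "eu-west": 90}` and both regions healthy, the function returns `"us-east"`. If `us-east` fails its health check, or if `allowed` is `{"eu-west"}`, it returns `"eu-west"`.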
Results & Validation
Performance Metrics
A well-designed product catalog system should achieve:
- Read Latency: < 100ms P95 for product detail views
- Search Latency: < 200ms P95 for complex queries
- Write Latency: < 500ms P95 for product updates
- Availability: 99.99% uptime
- Cache Hit Rate: > 95% for active products
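To check targets like these against measured data, a P95 can be computed from raw latency samples. The sketch below uses the nearest-rank definition of a percentile. Other definitions interpolate between samples and give slightly different values. Both function names are illustrative.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def cache_hit_rate(hits, misses):
    """Fraction of lookups served from cache; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0
```

A read path meets the first target when `percentile(read_latencies_ms, 95) < 100`, where `read_latencies_ms` is whatever sample list the monitoring system provides.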
Real-World Implementation Challenges
During interviews at Amazon, Shopify, and Wayfair, candidates are often asked about these common challenges:
- Catalog Size Impact: Systems behave differently at 1K vs. 1M vs. 10M+ products
- Attribute Explosion: Products with hundreds of attributes and specifications
- Seasonal Traffic: Black Friday/Cyber Monday traffic spikes (10-20x normal volume)
- Data Quality: Handling incomplete or inconsistent product information
- Merchant Performance: In multi-tenant systems, one merchant's activity can degrade performance for others (the noisy-neighbor problem)
Key Takeaways
- Hybrid Storage approach balances structured data integrity with search performance
- Read/Write Separation optimizes for different access patterns and scaling needs
- Event-Driven Architecture maintains consistency across multiple systems
- Multi-Layer Caching is essential for performance at scale
- Search Optimization requires both technical and business-rule considerations
Common Interview Questions & Solutions
Shopify: "Design a product catalog system that can handle millions of SKUs"
Solution: Implement a hybrid architecture with:
- Relational database for core product data (PostgreSQL/MySQL)
- Document store for flexible attributes (MongoDB/Firestore)
- Search engine for queries and facets (Elasticsearch)
- Event bus for cross-system synchronization (Kafka/SQS)
- Multi-level caching for performance (Redis + CDN)
Wayfair: "How would you model furniture products with hundreds of attributes?"
Solution: Use an Entity-Attribute-Value model with optimizations:
- Group common attributes in normalized tables
- Store variable attributes in a flexible schema
- Pre-compute frequently filtered attributes
- Create specialized indexes for search-critical fields
- Implement denormalized views for read efficiency
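A small runnable sketch of this layout, using SQLite from the Python standard library: common attributes sit in a normalized `product` table, variable attributes sit in an EAV table, and an index on `(name, value)` serves attribute filters. Table names, column names, and the sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id TEXT PRIMARY KEY, title TEXT NOT NULL, brand TEXT);
    CREATE TABLE product_attribute (
        product_id TEXT NOT NULL REFERENCES product(id),
        name       TEXT NOT NULL,
        value      TEXT NOT NULL,
        PRIMARY KEY (product_id, name)
    );
    -- Index for search-critical filters such as material = 'oak'.
    CREATE INDEX idx_attr_name_value ON product_attribute (name, value);
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)", [
    ("p1", "Oak Dining Table", "ExampleCo"),
    ("p2", "Pine Bookshelf", "ExampleCo"),
])
conn.executemany("INSERT INTO product_attribute VALUES (?, ?, ?)", [
    ("p1", "material", "oak"), ("p1", "seats", "6"),
    ("p2", "material", "pine"), ("p2", "shelves", "5"),
])


def find_by_attribute(name, value):
    """Return ids of products whose EAV rows contain the given name/value pair."""
    rows = conn.execute(
        "SELECT product_id FROM product_attribute "
        "WHERE name = ? AND value = ? ORDER BY product_id",
        (name, value),
    )
    return [row[0] for row in rows]


def attributes_of(product_id):
    """Pivot a product's EAV rows back into one dict, as a denormalized read view would."""
    rows = conn.execute(
        "SELECT name, value FROM product_attribute WHERE product_id = ?",
        (product_id,),
    )
    return dict(rows)
```

Storing every value as text is the usual EAV weakness: numeric range filters need casts or typed value columns, which is one reason frequently filtered attributes are pre-computed elsewhere.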
Amazon: "Design a system for product variations like size and color"
Solution: Implement a parent-child product model:
- Parent products represent the base item
- Child products (variants) for specific configurations
- Shared attributes at parent level
- Variant-specific attributes at child level
- Specialized indexing for variant searches
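The parent-child model can be sketched as attribute inheritance: a variant's effective attributes are the parent's shared attributes overlaid with its own. The dict layout, function names, and sample T-shirt below are assumptions for the example.

```python
def resolve_variant(parent, variant):
    """Merge shared parent attributes with variant-specific ones (the variant wins on conflict)."""
    merged = {**parent["attributes"], **variant["attributes"]}
    return {"sku": variant["sku"], "parent_id": parent["id"], "attributes": merged}


def find_variants(parent, **options):
    """Return SKUs of child variants whose attributes match every requested option."""
    return [
        variant["sku"]
        for variant in parent["variants"]
        if all(variant["attributes"].get(k) == v for k, v in options.items())
    ]


# Made-up sample data: one parent T-shirt with two child variants.
tshirt = {
    "id": "tee-100",
    "attributes": {"brand": "ExampleCo", "material": "cotton"},
    "variants": [
        {"sku": "tee-100-s-red", "attributes": {"size": "S", "color": "red"}},
        {"sku": "tee-100-m-red",
         "attributes": {"size": "M", "color": "red", "material": "organic cotton"}},
    ],
}
```

Here `find_variants(tshirt, size="M", color="red")` selects the second variant, and resolving it yields `material` of `"organic cotton"` because the child overrides the parent's value.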
Etsy: "How would you handle product categorization for millions of products?"
Solution: Create a flexible category system:
- Hierarchical structure with parent-child relationships
- Materialized path for efficient tree traversal
- Multiple category assignments per product
- Category-specific attributes
- Automatic categorization suggestions via ML
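With a materialized path, each category stores its full path from the root, so subtree and ancestor queries become string operations instead of recursive walks. A minimal sketch, assuming `/`-separated paths and hypothetical function names:

```python
def ancestors(path):
    """'home/furniture/sofas' -> ['home', 'home/furniture'], with no tree walk needed."""
    parts = path.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]


def descendants(all_paths, root):
    """Every category strictly below root, found with a single prefix match."""
    prefix = root.rstrip("/") + "/"
    return sorted(p for p in all_paths if p.startswith(prefix))


def products_under(assignments, root):
    """Products assigned to root or any descendant.

    assignments maps product id -> list of category paths, since a product
    may belong to several categories.
    """
    prefix = root.rstrip("/") + "/"
    return sorted(
        product_id
        for product_id, paths in assignments.items()
        if any(p == root or p.startswith(prefix) for p in paths)
    )
```

In a relational database the same prefix match is typically a `LIKE 'home/%'` query on an indexed path column. The trade-off is that moving a category means rewriting the paths of its whole subtree.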
Scaling Considerations
For systems handling millions of SKUs:
- Database Strategy:
  - Vertical partitioning for wide tables
  - Horizontal sharding by category/seller
  - Denormalization for read performance
  - Mixed storage types for different data patterns
- Search Scaling:
  - Index partitioning by product segment
  - Specialized indexes for common queries
  - Aggregation result caching
  - Query throttling and budgeting
- Content Delivery:
  - Global CDN for product images
  - Edge caching for product data
  - Dynamic image resizing
  - Content preloading for popular products
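Query throttling and budgeting, listed under search scaling, is often implemented with a token bucket per client or merchant. The sketch below is one such implementation under stated assumptions: `TokenBucket` is a hypothetical class, and `cost` lets expensive queries (for example, deep aggregations) spend more of the budget than cheap ones.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` while refilling at `rate_per_second`."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock          # injectable for deterministic tests
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        """Spend `cost` tokens if available; return False to signal throttling."""
        now = self.clock()
        # Refill based on elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Keeping one bucket per merchant also limits how much one tenant's query load can affect the others.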
Integration with Other E-commerce Systems
A product catalog integrates with multiple e-commerce components.
Key integration points:
- Pricing Engine: Access current product pricing and promotions
- Inventory Management: Check product availability
- Order Management: Validate orderable products
- Recommendation Engine: Get related products
- Content Management: Retrieve marketing content
Real-World Example: Wayfair's Catalog Architecture
Wayfair interviews often discuss the company's catalog architecture, which handles:
- 22+ million SKUs
- 100,000+ suppliers
- Complex product attributes (dimensions, materials, styles)
- Rich visual content (images, 3D models, videos)
Their architecture leverages:
- Multi-model database approach
- Machine learning for product categorization
- Visual search capabilities
- Near real-time supplier updates
- Advanced faceted search