🎯 Ultimate cheatsheet for the Data Engineering System Design interview.
Because 'It Depends' Is a Valid Answer (But You Still Need to Know What It Depends On)
The 45-Minute Challenge
Imagine you're sitting in a System Design interview. The interviewer just gave you a couple of sentences about requirements, and now you have around 45 minutes to dive deep into the architecture of a data engineering system. Your whiteboard is blank. The clock is ticking. ⏰
Sound familiar?
Whether you're prepping for an interview or just want to level up your architectural thinking, this guide will give you a framework to tackle any data engineering system design challenge with confidence.
Let's break it down.
🗺️ The Master Architecture
Before we dive into the details, here's the high-level view of what we're building. A solid data engineering system has six essential layers:
Think of this as your mental model. Every system design conversation should touch on these layers, though you'll spend more time on some than others depending on the requirements.
📊 Sources Layer
↓
⚙️ Processing Layer
↓
💾 Storage Layer
↓
🤖 ML/AI Layer (Optional)
↓
👥 Consumers Layer
↓
🛠️ Tools Layer (Cross-cutting)
🎬 Where Do You Start?
First, decide where to start: top to bottom or bottom to top?
Here's the truth: It doesn't really matter for data engineering architecture. You can start with the sources and work your way down, or start with the consumers and work backwards. Both approaches work.
However, there's one exception: data modelling interviews. If you're specifically being asked about data modelling, start from the consumer requirements and work backwards. This shows you're thinking about the business use case first.
💡 Pro Tip: Tell a Story, Don't Just Draw Boxes
Use a diagramming tool like draw.io, Lucidchart, or at least Google Slides. Tell the story of how data flows through your system. Don't just list components; explain why each piece exists and how they connect.
The best candidates I've interviewed don't just draw architecture; they narrate the data journey. "First, we capture events from the application, then we..." That storytelling makes all the difference.
🔍 The Six Essential Layers (Deep Dive)
1️⃣ Source Layer: Where It All Begins
This is where you show you understand how businesses actually generate data. Start by listing the available sources based on what the interviewer told you.
Common source types:
🗄️ OLTP Databases - PostgreSQL, MySQL, SQL Server (transactional systems)
📁 Files - CSV, JSON, Parquet on SFTP, S3, Azure Blob
🌐 APIs - REST APIs, webhooks, third-party SaaS tools
📨 Event Streams - Kafka, RabbitMQ, AWS Kinesis, Azure Event Hubs
📱 IoT Devices - Sensor data, mobile apps, edge devices
Questions you MUST ask:
What's the data volume? Are we talking about gigabytes or petabytes? This drives your storage and processing decisions.
What are the latency requirements? Do users need data in seconds, minutes, or hours? This determines batch vs stream processing.
What data formats are we dealing with? Structured (tables), semi-structured (JSON), or unstructured (images, documents)?
How is the business generating this data? Understanding the business context is crucial. Is this customer data? Transaction logs? Clickstream events?
Do we have CDC (Change Data Capture)? If the source is a database, can we capture incremental changes, or do we need full table dumps?
Why this matters: The source layer defines your constraints. If you have real-time event streams, you can't just batch process them once a day. If you're pulling from a legacy SFTP server with nightly file drops, streaming might be overkill.
Talk through the sources. Ask questions. Show curiosity. This demonstrates you're not just memorizing patterns; you're thinking about real systems.
2️⃣ Storage Layer: Where You Show Your Knowledge 💾
This is your moment to shine. Storage decisions are your primary responsibility as a data engineer. This is where you separate the junior engineers from the senior ones.
The big question: Where do we store the data?
The Three Main Options:
🏢 Data Warehouse
Examples: Snowflake, BigQuery, Redshift, Synapse
Best for: Structured data, SQL analytics, BI reporting
Pros: Optimized for queries, easy to use, great for business users
Cons: Can be expensive, less flexible for ML workloads, vendor lock-in
🏞️ Data Lake
Examples: Storage in S3, Azure Data Lake Storage, or Google Cloud Storage, paired with a separate compute engine such as Spark, Trino, Athena, EMR, Dataproc, or Synapse
Best for: Large-scale raw data, unstructured data, cost-effective storage
Pros: Cheap storage, flexible, supports any format
Cons: Can become a "data swamp," requires more engineering effort
🏰 Lakehouse
Examples: Databricks (Delta Lake), Snowflake (Iceberg), Dremio, Trino
Best for: Combined analytics + ML workloads, flexible architecture
Pros: Best of both worlds, ACID transactions on data lake, unified platform
Cons: Newer paradigm, some tools still maturing
How to Choose?
Ask these questions:
What public cloud are you using? Azure, AWS, or GCP? Each has native options that integrate well.
Do you already have a data warehouse? Don't reinvent the wheel if something is working.
What are the query patterns?
Mostly SQL reporting → Warehouse
Mostly ML/data science → Lake/Lakehouse
Both → Lakehouse
What's the data volume and growth rate? Petabyte-scale often favors lake/lakehouse for cost reasons.
⚠️ Important: There is no wrong answer. Each approach has trade-offs. What matters is that you can justify your choice based on the requirements.
Architecture Patterns to Mention
Medallion Architecture (Bronze → Silver → Gold) - This is especially popular in lakehouse architectures:
Bronze (Raw): Land data exactly as received. Immutable, append-only.
Silver (Cleaned): Deduplicated, validated, standardized formats.
Gold (Business-level): Aggregated metrics, dimensional models, ready for consumption.
Why it's great: Clear separation of concerns, easy to troubleshoot, supports both batch and streaming.
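To make this concrete, here's a minimal Bronze-to-Silver-to-Gold sketch in PySpark with Delta Lake (one common lakehouse combination). The bucket paths, table names, and columns are purely illustrative, and a Spark session with the Delta connector is assumed:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion_demo").getOrCreate()

# Bronze: land raw events exactly as received, append-only.
raw = spark.read.json("s3://my-bucket/raw/orders/")
raw.write.format("delta").mode("append").save("s3://my-bucket/bronze/orders")

# Silver: deduplicate, enforce types, drop invalid rows.
bronze = spark.read.format("delta").load("s3://my-bucket/bronze/orders")
silver = (
    bronze.dropDuplicates(["order_id"])
          .withColumn("order_ts", F.to_timestamp("order_ts"))
          .filter(F.col("order_id").isNotNull())
)
silver.write.format("delta").mode("overwrite").save("s3://my-bucket/silver/orders")

# Gold: business-level aggregate, ready for BI consumption.
gold = (
    silver.groupBy(F.to_date("order_ts").alias("order_date"))
          .agg(F.sum("amount").alias("daily_revenue"))
)
gold.write.format("delta").mode("overwrite").save("s3://my-bucket/gold/daily_revenue")

The same three hops work just as well as dbt models in a warehouse; the layering matters more than the engine.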
Dimensional Modeling (Star/Snowflake Schema) - Still relevant for traditional data warehouse approaches:
Fact tables (events, transactions)
Dimension tables (customers, products, dates)
Optimized for BI tools and reporting
Don't dismiss this as "old school." Many business users still need simple, well-defined dimensional models.
Data Transformation Strategy
The question: How will you transform the data?
The default answer: dbt (SQL-based transformation) 🎯
Why dbt is often the right choice:
✅ SQL is accessible to more team members
✅ Built-in testing and documentation
✅ Version control and CI/CD friendly
✅ Strong community and ecosystem
✅ Works with all major warehouses and lakehouses
Alternatives to mention:
Spark (PySpark/Scala) - For large-scale transformations, complex logic
Python/Pandas - For ML feature engineering, custom logic
Stored Procedures - Legacy systems (mention you'd migrate away from these)
Pro tip: Start with SQL/dbt. Only reach for Spark when you truly need it. Complexity should be justified.
3️⃣ Consumers Layer: Start With the End in Mind 👥
This is how the data insights will be used. Understanding your consumers drives everything else.
Who are the target audiences?
📊 Business Analysts
Tools: Tableau, Power BI, Looker, Mode
Needs: Pre-aggregated tables, dimensional models, fast queries
Frequency: Daily dashboards, ad-hoc queries
🔬 Data Scientists
Tools: Jupyter notebooks, Python, R
Needs: Raw and feature-engineered datasets, ML-ready formats
Frequency: Varies by project phase
🤖 Applications
Tools: REST APIs, GraphQL, Reverse ETL tools
Needs: Low-latency access, real-time or near-real-time data
Frequency: Continuous, high QPS (queries per second)
🤝 External Partners
Tools: Data shares (Snowflake), secure APIs, file exports
Needs: Governed, secure access to specific datasets
Frequency: Scheduled or on-demand
🧠 GenAI Applications
Tools: Vector databases, RAG pipelines, LLM APIs
Needs: Embeddings, semantic search, context retrieval
Frequency: Real-time or batch depending on use case
Critical questions to ask:
How frequently do they need data? Every 15 minutes? Once an hour? Once a day? This determines your SLA and processing cadence.
What's the acceptable data latency? There's a big difference between "yesterday's data" and "last 5 minutes of data."
What format do they prefer? SQL-accessible tables? JSON via API? CSV files? Parquet on S3?
What's the scale? How many users? How many queries per minute?
Why this matters: Consumer requirements drive your architecture decisions. If your BI team needs data once a day, you can batch process overnight. If your fraud detection app needs real-time data, you're building a streaming pipeline.
The best engineers I've worked with always start by understanding the consumer needs. That's how you avoid over-engineering (or under-engineering).
4️⃣ Processing Layer: Batch vs Stream ⚙️
This is how we move data from the Source Layer to the Storage Layer.
There are fundamentally only two options: Batch and Stream.
🔄 Batch Processing (The Default Choice)
Why batch is great:
✅ Simple to understand and maintain
✅ Easy to debug and replay
✅ Cost-effective (run when you need to)
✅ Well-established patterns and tools
✅ Handles large historical data well
Common tools:
Airflow (orchestration)
dbt (transformation)
Spark (heavy lifting)
Fivetran/Airbyte (managed ingestion)
Typical patterns:
Full refresh: Replace entire table (simple but inefficient)
Incremental loading: Only process new/changed records
Time-based partitioning: Process data by date ranges
When to use batch:
Daily reporting and analytics
Historical data analysis
When near-real-time isn't required
When simplicity is more important than latency
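To ground those batch patterns, here's a minimal daily pipeline sketch using Airflow's TaskFlow API (a recent Airflow 2.x release is assumed); the DAG name, S3 path, and task bodies are placeholders you'd swap for your own dbt run, Spark job, or COPY command:

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_orders_pipeline():

    @task
    def extract(ds=None):
        # ds (the logical date) is injected by Airflow when declared as a parameter,
        # which gives us time-based partitioning and incremental loads for free.
        return f"s3://my-bucket/raw/orders/{ds}/"

    @task
    def transform(raw_path: str):
        # Clean the raw files and load them into the warehouse.
        print(f"transforming {raw_path}")

    transform(extract())


daily_orders_pipeline()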
🌊 Stream Processing (When You Need It)
When streaming is required:
Real-time dashboards and alerting
Fraud detection (sub-second response)
IoT sensor data (constant flow)
Application events (clickstream, user activity)
Financial transactions (immediate reconciliation)
Common tools:
Kafka / Confluent (event streaming)
Spark Streaming / Structured Streaming
Flink (low-latency complex event processing)
AWS Kinesis / Azure Event Hubs (managed services)
The hybrid approach:
This is actually very common in real systems:
Stream events into cloud storage (S3, ADLS)
Process from storage in micro-batches
Get near-real-time results without full streaming complexity
Why this works: You get durability and replay capability from storage, without the operational complexity of true streaming.
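A quick sketch of that hybrid pattern with Spark Structured Streaming: read files as they land in object storage, then write micro-batches to a Delta table every few minutes. The paths and the event schema are illustrative, and a Spark session with the Delta connector is assumed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("microbatch_demo").getOrCreate()

# Read files as they land in cloud storage; file streams need an explicit schema.
events = (
    spark.readStream
         .format("json")
         .schema("user_id STRING, event_type STRING, event_ts TIMESTAMP")
         .load("s3://my-bucket/landing/events/")
)

# Micro-batch every five minutes: near-real-time, with checkpointed replay.
query = (
    events.writeStream
          .format("delta")
          .option("checkpointLocation", "s3://my-bucket/checkpoints/events/")
          .trigger(processingTime="5 minutes")
          .start("s3://my-bucket/bronze/events/")
)
query.awaitTermination()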
🎯 Critical Concepts to Mention (Show You Know Production)
1. Idempotency
Your pipeline should produce the same result if you run it multiple times. Running the same job twice shouldn't create duplicate records.
Example: Use MERGE/UPSERT instead of INSERT. Key on unique business keys.
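If you're on a lakehouse, one way to get those MERGE semantics from Python is the Delta Lake merge API. This is a sketch only; the paths and the order_id key are illustrative:

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent_merge").getOrCreate()

updates = spark.read.parquet("s3://my-bucket/staging/orders/")
target = DeltaTable.forPath(spark, "s3://my-bucket/silver/orders")

(
    target.alias("t")
          .merge(updates.alias("s"), "t.order_id = s.order_id")  # unique business key
          .whenMatchedUpdateAll()
          .whenNotMatchedInsertAll()
          .execute()
)
# Re-running this job with the same input leaves the target unchanged:
# no duplicate rows, no double counting.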
2. CDC (Change Data Capture)
Capture only what changed in the source system instead of pulling full tables every time.
Tools: Debezium, AWS DMS, Fivetran, Airbyte
3. Incremental Loading
Only process new or changed data. Track watermarks, timestamps, or sequence IDs.
-- Example: load only records modified since the last run
-- (source_table and target_table are placeholder names)
SELECT * FROM source_table
WHERE updated_at > (SELECT MAX(updated_at) FROM target_table);
4. Data Quality Checks
Validate data at ingestion and transformation stages:
Schema validation (expected columns present?)
Null checks (required fields populated?)
Range checks (values within expected bounds?)
Referential integrity (foreign keys valid?)
Freshness checks (data arriving on time?)
Tools: Great Expectations, dbt tests, custom validation
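If you want to show what a check actually looks like without leaning on a particular framework, a hand-rolled sketch like this works (the column names and thresholds are made up; in practice dbt tests or Great Expectations cover most of these cases):

import pandas as pd

REQUIRED_COLUMNS = {"order_id", "customer_id", "amount", "updated_at"}

def validate_orders(df: pd.DataFrame) -> list[str]:
    errors = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:                              # schema check: expected columns present?
        return [f"missing columns: {missing}"]
    if df["order_id"].isnull().any():        # null check: required fields populated?
        errors.append("order_id contains nulls")
    if (df["amount"] < 0).any():             # range check: values within bounds?
        errors.append("negative order amounts found")
    latest = pd.to_datetime(df["updated_at"], utc=True).max()
    if pd.Timestamp.now(tz="UTC") - latest > pd.Timedelta(hours=24):  # freshness check
        errors.append("data is older than 24 hours")
    return errors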
5. Schema Evolution
How do you handle when source systems add/remove/rename columns?
Strategies:
Schema-on-read (flexible but risky)
Schema validation (strict but safe)
Graceful degradation (log issues, donât fail)
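Here's what graceful degradation can look like as a small Python sketch (the expected schema is illustrative): missing columns are filled with nulls and logged, unexpected columns are logged and dropped, and the pipeline keeps running:

import logging

import pandas as pd

EXPECTED = ["order_id", "customer_id", "amount"]

def align_schema(df: pd.DataFrame) -> pd.DataFrame:
    extra = set(df.columns) - set(EXPECTED)
    if extra:
        logging.warning("Dropping unexpected columns: %s", extra)
    for col in EXPECTED:
        if col not in df.columns:
            logging.warning("Column %s missing from source; filling with nulls", col)
            df[col] = None
    return df[EXPECTED]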
💡 Pro tip: Batch is a great default choice. Lead with that. Then say, "But for use cases like [specific example from requirements], we'd need streaming."
Don't over-engineer. Streaming adds complexity. Only use it when latency truly matters to the business.
5️⃣ ML/AI Layer: The Optional But Impressive Section 🤖
Here's the secret: No one expects you to cover this, but if you do, you'll stand out.
This shows you understand modern data platforms aren't just for reporting; they're powering machine learning and AI applications.
Components to Mention
🗄️ Feature Store
A centralized repository for ML features that serves both training (batch) and inference (real-time).
Offline features: Historical data for model training
Online features: Low-latency access for real-time predictions
Tools: Feast, Tecton, SageMaker Feature Store, Databricks Feature Store
Why it matters: Prevents feature engineering duplication. Data scientists define features once, use them everywhere.
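You don't need to name-drop a specific feature store to show you get the idea. A conceptual sketch is enough: compute the feature once, then publish it to an offline table for training and an online key-value store for serving. The pandas code and paths below are illustrative stand-ins, with a plain dict standing in for Redis or DynamoDB:

import pandas as pd

def build_customer_features(orders: pd.DataFrame) -> pd.DataFrame:
    # One definition of the feature, reused for both training and serving.
    return (
        orders.groupby("customer_id")
              .agg(order_count=("order_id", "count"),
                   total_spend=("amount", "sum"))
              .reset_index()
    )

features = build_customer_features(pd.read_parquet("s3://my-bucket/silver/orders/"))
features.to_parquet("s3://my-bucket/features/customer_offline.parquet")   # offline: model training
online_store = features.set_index("customer_id").to_dict("index")         # online: low-latency lookups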
🧪 ML Models
Where do models live in your architecture?
Training pipelines: Scheduled jobs to retrain models
Model registry: Version control for models
Model serving: Batch predictions vs real-time API endpoints
Tools: MLflow, Kubeflow, SageMaker, Azure ML
🧠 GenAI Initiatives
This is hot right now. Show you're aware of the trends:
Vector databases: Store embeddings for semantic search (Pinecone, Weaviate, Chroma)
RAG (Retrieval Augmented Generation): Combine LLMs with your data
LLM integration: How do your data pipelines feed GenAI apps?
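The data engineering view of RAG fits in a few lines. In this sketch, embed_text, vector_db, and llm are deliberately hypothetical stand-ins for whatever embedding model, vector database, and LLM API the platform actually uses:

def answer_question(question: str, embed_text, vector_db, llm) -> str:
    query_vector = embed_text(question)              # 1. embed the user question
    docs = vector_db.search(query_vector, top_k=5)   # 2. retrieve relevant chunks
    context = "\n\n".join(d.text for d in docs)      # 3. build the prompt context
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)                               # 4. generate a grounded answer

Your pipelines own steps 1 and 2: chunking documents, refreshing embeddings, and keeping the vector store in sync with the source data.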
Real-World Examples to Share
When you mention ML/AI, ground it in real experience:
✅ "In my last role, we built a feature store to support the fraud detection team. We reduced their feature engineering time from weeks to days."
✅ "I worked with the ML team to create a data pipeline that refreshed model training data daily, which improved model performance by 15%."
✅ "We implemented a RAG system that allowed our customer support team to query internal docs using natural language."
✅ "I helped set up MLflow for model versioning and tracking, which gave us reproducibility and audit trails."
How to Position This in the Interview
Don't go deep unless the interviewer asks
Show collaborative mindset: "supporting the ML team," "enabling data scientists"
Keep it high-level architecture, not deep ML theory
Focus on the data engineering challenges (freshness, quality, scale)
💡 Pro tip: If the job description mentions ML or AI, definitely bring this up. If not, mention it briefly to show breadth, then move on.
6️⃣ Tools Layer: The Professional Touch 🛠️
This is a complementary section where you show you're not just an ETL developer; you're a complete data engineer who thinks about the full lifecycle.
This layer is cross-cutting; it applies to all the other layers.
Infrastructure as Code (IaC)
Tools: Terraform, CloudFormation, ARM templates, Pulumi
Why mention this:
Shows you understand infrastructure should be version-controlled
Demonstrates reproducibility thinking
Critical for disaster recovery and multi-environment setup
What to say:
"I'd use Terraform to provision all infrastructure: databases, compute clusters, networking, storage. This gives us version control, automated deployments, and the ability to recreate environments on demand."
What to provision with IaC:
Data warehouse / database instances
Storage accounts / buckets
Compute clusters (Spark, Kubernetes)
Networking (VPCs, subnets, firewalls)
IAM roles and permissions
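Terraform's HCL is the most common choice here, but since Pulumi lets you express the same thing in Python, here's a tiny sketch of provisioning a tagged raw-data bucket (the resource name and tags are illustrative):

import pulumi
import pulumi_aws as aws

# A landing bucket for raw data, tagged so FinOps can attribute the cost.
raw_bucket = aws.s3.Bucket(
    "raw-data",
    tags={"team": "data-eng", "env": "dev", "managed-by": "pulumi"},
)

pulumi.export("raw_bucket_name", raw_bucket.id)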
Version Control
Tools: Git (GitHub, GitLab, Bitbucket)
Why mention this:
"All code (SQL, Python, infrastructure) goes into Git. We follow trunk-based development with feature branches and PR reviews."
What goes in version control:
ETL/ELT code
dbt models
Infrastructure definitions
Documentation
Data quality tests
CI/CD Pipelines
Tools: GitHub Actions, GitLab CI, Jenkins, Azure DevOps
Why mention this:
Shows you think about automation and deployment safety.
What to include in CI/CD:
Automated testing (unit tests, integration tests, data quality tests)
Code linting and formatting
Deployment to dev → staging → production
Automated rollback on failure
Notifications to team on failures
What to say:
"Every PR triggers automated tests. Once merged, changes deploy to dev automatically, then staging with approval, then production. This ensures we catch issues early and deploy safely."
Data Catalog
Tools: DataHub, Collibra, Alation, AWS Glue Catalog, Azure Purview
Why mention this:
Shows you care about discoverability and documentation.
Key features:
Metadata management: What data exists and where
Data lineage: How data flows through pipelines
Business glossary: What do the metrics mean?
Discovery: Help users find relevant datasets
Access control: Who can see what
What to say:
"We'd implement a data catalog so business users can discover datasets, understand lineage, and know who to contact with questions. This reduces tribal knowledge and accelerates analytics."
Monitoring, Alerts & On-Call 🚨
Tools: Datadog, PagerDuty, Grafana, CloudWatch, Azure Monitor
Why mention this:
Shows you understand data engineering is an operational discipline, not just building pipelines.
What to monitor:
Data freshness: Is data arriving on schedule?
Data quality: Are validation checks passing?
Pipeline failures: Which jobs failed and why?
Resource utilization: Are we over/under-provisioned?
Cost: Are we staying within budget?
SLA compliance: Are we meeting our commitments?
What to say:
"I'd set up monitoring for data freshness, quality, and pipeline health. We'd have alerts that page on-call engineers for critical failures, but also Slack notifications for warnings. The key is avoiding alert fatigue: only page for things that need immediate action."
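A small sketch of that routing logic (send_slack and page_oncall are hypothetical stand-ins for a Slack webhook and a PagerDuty integration; the thresholds are illustrative):

from datetime import datetime, timedelta, timezone

def check_freshness(table: str, last_loaded_at: datetime, send_slack, page_oncall):
    lag = datetime.now(timezone.utc) - last_loaded_at
    if lag > timedelta(hours=6):        # critical: breaches SLA, page the on-call engineer
        page_oncall(f"{table} is {lag} behind SLA")
    elif lag > timedelta(hours=2):      # warning: Slack only, to avoid alert fatigue
        send_slack(f"{table} is running late ({lag})")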
On-call practices:
Rotation schedule (don't burn people out)
Runbooks for common issues
Post-mortems for major incidents
Continuous improvement of alert quality
Data Quality Framework
Tools: Great Expectations, dbt tests, Soda, custom validation
Pro tip: dbt can handle all these tests out of the box!
Why mention this:
Data quality is one of the biggest pain points in data engineering. Show you take it seriously.
Types of tests:
Schema tests: Expected columns present and correct type
Null checks: Required fields arenât null
Uniqueness: Primary keys are actually unique
Referential integrity: Foreign keys are valid
Range checks: Values within expected bounds
Custom business logic: Revenue shouldn't be negative, etc.
Where to test:
At ingestion (catch bad data early)
After transformation (validate business logic)
Before pushing to production (gate deployments)
What to say:
"I'd implement data quality checks at every stage using dbt tests and Great Expectations. Failed tests block deployment and trigger alerts. We'd track data quality metrics over time to identify degradation."
FinOps (Cost Management) 💰
Why mention this:
Shows you understand data platforms can get expensive, and cost awareness is part of your job.
What to cover:
Cost monitoring: Track spending by project/team/environment
Resource tagging: Tag all resources for chargeback/showback
Query optimization: Tune expensive queries
Storage lifecycle policies: Move old data to cheaper tiers
Right-sizing: Don't over-provision compute
Spot/preemptible instances: Use for non-critical workloads
What to say:
"I'd implement tagging for all resources so we can track costs by team. We'd set up alerts for unexpected spend spikes, and regularly review expensive queries and storage to optimize costs."
Common cost optimizations:
Partition tables by date (scan less data)
Cluster tables by commonly filtered columns
Use materialized views for expensive aggregations
Compress data (Parquet, ORC, Delta)
Auto-pause warehouses when not in use
Use storage tiers (hot, cool, archive)
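As a concrete example of the first optimization (partitioning by date), here's a PySpark sketch that writes a Delta table partitioned by order date; the paths and columns are illustrative. Queries that filter on order_date then scan only the matching partitions, which directly cuts query cost:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition_demo").getOrCreate()

orders = (
    spark.read.format("delta")
         .load("s3://my-bucket/silver/orders")
         .withColumn("order_date", F.to_date("order_ts"))
)
(
    orders.write.format("delta")
          .mode("overwrite")
          .partitionBy("order_date")   # date filters prune partitions instead of full scans
          .save("s3://my-bucket/gold/orders_by_day")
)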
🎓 Interview Strategy: How to Actually Use This Guide
Before You Start Drawing
1. Clarify Requirements (5 minutes)
Don't jump straight into architecture. Ask questions first:
What's the core business use case?
What's the data volume?
What are the latency requirements?
What's the budget? (Build vs buy)
What cloud platform?
What's already in place?
Who are the end users?
Who are the end users?
2. State Your Assumptions
"Based on what you've told me, I'm assuming..."
Daily batch processing will meet requirements
Weâre starting from scratch (no legacy systems)
We have a team of 5 data engineers
Budget is moderate (not unlimited, not shoestring)
This shows structured thinking and gives the interviewer a chance to correct you.
While You're Designing
1. Go Breadth-First, Then Depth
Cover all six layers at a high level first (10-15 minutes). Then ask: "Where would you like me to go deeper?"
This shows you can see the whole system while also knowing when to dive into details.
2. Think Out Loud
"I'm considering two approaches here... Option A would be simpler but less scalable. Option B would handle growth better but adds complexity. Given the requirements, I'd lean toward Option A because..."
This is exactly what they want to hear. There are rarely "right" answers, only trade-offs.
3. Draw as You Talk
Your diagram should evolve as you narrate. Don't draw everything silently and then explain.
"So first, we have these source systems... they feed into... which lands data here... then we transform using..."
Questions to Expect (Be Ready)
Scalability:
"What if data volume increases 10x?"
"How does this handle peak load?"
Failure Scenarios:
"What happens if the source API goes down?"
"How do you recover from a failed pipeline?"
Security:
"How do you handle PII/sensitive data?"
"What about access control?"
Operations:
"How do you deploy changes?"
"How do you monitor this?"
"Who's on-call when it breaks?"
Trade-offs:
"Why did you choose X over Y?"
"What are the downsides of your approach?"
Do's ✅
Ask clarifying questions before diving in
Explain your reasoning for every choice
Acknowledge trade-offs (nothing is perfect)
Mention multiple options even if you recommend one
Show production thinking (monitoring, costs, operations)
Adapt based on feedback (don't be rigid)
Use real examples from your experience
Keep track of time (breadth first, then depth)
Don'ts ❌
Don't assume requirements - ask!
Don't jump to a specific tool without justifying it
Don't over-engineer - start simple, add complexity only when needed
Don't ignore operational concerns - it's not done when the code works
Don't claim there's one right answer - it's all trade-offs
Don't be silent - think out loud
Don't go too deep too early - cover breadth first
🚀 Your Action Plan
Now that you have the framework, here's how to prepare:
📚 Week 1: Internalize the Framework
Review the six layers until you can draw them from memory
Practice explaining each layer in 2-3 sentences
Write down your own experiences for each layer
🎨 Week 2: Practice Drawing
Pick 3-5 different scenarios (e-commerce, fintech, healthcare, etc.)
Draw the architecture for each from scratch
Time yourself (aim for 30 minutes for a full design)
🗣️ Week 3: Mock Interviews
Find a peer or mentor to practice with
Record yourself and watch it back
Get feedback on communication, not just technical accuracy
🔍 Week 4: Deep Dive Your Weak Areas
Streaming? Read up on Kafka and Spark Streaming
Data modeling? Study dimensional modeling and normalization
ML systems? Learn about feature stores and MLOps
🔄 Continuous Improvement
After every real interview, write down what you struggled with
Update your own mental model
Keep learning from real systems you build
🎁 Bonus: Common Scenarios to Practice
Want to test your skills? Try designing these systems:
📱 Scenario 1: E-commerce Analytics
Requirements: Track user clickstream, orders, inventory. BI team needs daily dashboards. Marketing wants near-real-time campaign performance.
💳 Scenario 2: Fraud Detection
Requirements: Credit card transactions, need to flag suspicious activity within seconds. Also need historical analysis for model training.
🏥 Scenario 3: Healthcare Data Platform
Requirements: Patient records from multiple hospitals, strict compliance (HIPAA), clinical researchers need de-identified data.
🎮 Scenario 4: Gaming Analytics
Requirements: Player events, millions of users, need to detect cheating in real-time, also power personalized recommendations.
🚚 Scenario 5: IoT Fleet Management
Requirements: GPS and telemetry from 10,000 vehicles, predict maintenance needs, optimize routes, provide real-time dashboard for dispatchers.
For each scenario:
Draw the architecture
Justify your choices
Identify the biggest technical challenges
Estimate rough costs
Plan for failure scenarios
📚 Additional Resources
Books:
Designing Data-Intensive Applications by Martin Kleppmann
Fundamentals of Data Engineering by Joe Reis and Matt Housley
The Data Warehouse Toolkit by Ralph Kimball
Helpful articles: