Knowledge + Practice

CCNA Design Implement Data Storage Questions

75 of 191 questions · Page 1/3 · Design Implement Data Storage topic · Answers revealed

Practice these questions Exam hub All questions

1

Multi-Selectmedium

Which THREE security features are available for Azure SQL Database? (Choose three.)

Select 3 answers

A.Always On availability groups

B.Auditing

C.Azure Active Directory authentication

D.Firewall rules

E.Transparent Data Encryption (TDE)

AnswersB, C, E

Correct. Auditing tracks database events.

Why this answer

Auditing is a built-in security feature for Azure SQL Database that tracks database events and writes them to an audit log in your Azure Storage account, Log Analytics workspace, or Event Hubs. It helps maintain regulatory compliance, understand database activity, and gain insights into discrepancies and anomalies that could indicate security concerns.

Exam trap

The trap here is that candidates often confuse high-availability features like Always On availability groups with security features, or they mistakenly think that firewall rules are a database-level security feature rather than a server-level network access control.

Practice this question →

2

Multi-Selecthard

You are designing a delta lake architecture in Azure Synapse Analytics. Which TWO practices should you follow to ensure ACID transactions and data consistency?

Select 2 answers

A.Disable schema evolution to prevent accidental schema changes

B.Run OPTIMIZE commands frequently to compact small files

C.Use the Delta Lake transaction log for all write operations

D.Store data in CSV format to simplify schema enforcement

E.Enable write-ahead logging to support concurrent reads and writes

AnswersC, E

The transaction log ensures ACID compliance by recording all operations.

Why this answer

Option C is correct because the Delta Lake transaction log is the core mechanism that enables ACID transactions. Every write operation (insert, update, delete, merge) is recorded as an atomic commit in the transaction log, ensuring that concurrent readers see a consistent snapshot and that partial writes are never visible. Without this log, Delta Lake cannot guarantee atomicity or isolation.

Exam trap

The trap here is that candidates often confuse performance optimization (OPTIMIZE) or file format choice (CSV) with ACID transaction guarantees, but the exam specifically tests the understanding that the transaction log is the fundamental enabler of atomicity, consistency, isolation, and durability in Delta Lake.

Practice this question →

3

MCQhard

You are migrating a large on-premises SQL Server database to Azure Synapse Analytics. The database includes tables with up to 500 million rows and frequent updates. You need to minimize data movement during the migration while ensuring optimal query performance in the dedicated SQL pool. Which table design strategy should you use?

A.Use hash-distributed tables for all tables and clustered columnstore indexes.

B.Use replicated tables for all fact tables and hash-distributed tables for dimension tables.

C.Use round-robin tables for all tables to simplify the migration.

D.Use round-robin tables for staging tables and hash-distributed tables for large fact tables on a key column.

AnswerD

Round-robin minimizes data movement; hash distribution optimizes joins.

Why this answer

Option D is correct because it uses round-robin tables for staging to minimize data movement during the initial load, then hash-distributes large fact tables on a key column to optimize query performance by collocating rows with the same distribution key on the same compute node. This balances the need for fast ingestion with efficient parallel query execution in Azure Synapse Analytics dedicated SQL pools.

Exam trap

The trap here is that candidates often assume hash-distributed tables are always the best choice for all tables, overlooking the fact that round-robin tables reduce data movement during migration and that hash distribution should be reserved for large fact tables to avoid skew and unnecessary shuffling.

How to eliminate wrong answers

Option A is wrong because using hash-distributed tables for all tables, including small dimension tables, can cause unnecessary data shuffling and skew, and clustered columnstore indexes are not optimal for tables with frequent updates due to high overhead in maintaining columnstore segments. Option B is wrong because replicated tables are designed for small dimension tables (typically < 2 GB), not for large fact tables with up to 500 million rows, as replicating such large tables would cause excessive storage and data movement. Option C is wrong because round-robin tables distribute data randomly across distributions, leading to poor query performance due to data movement during joins and aggregations, and are not suitable for production fact tables in a dedicated SQL pool.

Practice this question →

4

Matchingmedium

Match each data storage format to its characteristic.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Columnar storage format optimized for analytics

Row-based format with schema embedded

Columnar format with high compression

ACID transactions on data lakes

Why these pairings

These formats are commonly used in Azure data engineering.

Practice this question →

5

MCQhard

A healthcare organization stores patient data in Azure SQL Database. They need to encrypt sensitive columns (e.g., SSN) such that only authorized users can decrypt the data at query time. Which feature should they use?

A.Transparent Data Encryption (TDE)

B.Azure Information Protection labels

C.Always Encrypted (with secure enclaves)

D.Dynamic Data Masking

AnswerC

Column-level encryption, decrypted on the client.

Why this answer

Always Encrypted with secure enclaves is the correct choice because it allows sensitive columns (e.g., SSN) to be encrypted at rest and in memory, with decryption occurring only on the client side using keys that are never exposed to the database engine. This ensures that only authorized users with the proper encryption keys can decrypt the data at query time, meeting the requirement for column-level encryption with client-side key management.

Exam trap

The trap here is that candidates often confuse Transparent Data Encryption (TDE) with column-level encryption, mistakenly thinking TDE protects individual columns at query time, when in fact TDE only encrypts data at rest and does not control who can decrypt data during queries.

How to eliminate wrong answers

Option A is wrong because Transparent Data Encryption (TDE) encrypts the entire database at rest (pages and backups) but does not protect data in memory or during query execution, and it does not allow column-level granularity or client-side key control. Option B is wrong because Azure Information Protection labels are used for classifying and protecting documents and emails via sensitivity labels, not for encrypting specific database columns at query time. Option D is wrong because Dynamic Data Masking obfuscates data in query results for unauthorized users but does not encrypt the underlying data; it can be bypassed by users with direct database access or by using certain query patterns.

Practice this question →

6

MCQhard

You have an Azure Synapse Analytics dedicated SQL pool with a large fact table partitioned by date. As data grows, query performance on recent data degrades. You need to improve performance for queries filtering on the current month without affecting queries on older data. What should you do?

A.Implement partition switching to move older partitions to a different filegroup with slower storage

B.Increase the service level objective (SLO) of the dedicated SQL pool

C.Change the table distribution to round-robin

D.Update statistics on the table for the current month partition

AnswerA

D is correct because partition switching isolates recent data and allows targeted performance tuning.

Why this answer

Partition switching allows you to efficiently move older partitions to a different filegroup (e.g., slower or compressed storage) without affecting queries on recent data. This reduces the active data volume in the fact table, improving scan performance for queries filtering on the current month. The operation is metadata-only and does not require data movement, so it does not impact ongoing queries on the remaining partitions.

Exam trap

The trap here is that candidates often confuse partition switching with partition elimination or index maintenance, thinking that updating statistics or scaling resources will solve the performance issue, when the real bottleneck is the sheer volume of data in the table that must be scanned for queries on recent data.

How to eliminate wrong answers

Option B is wrong because increasing the SLO (service level objective) scales up resources for the entire dedicated SQL pool, which is costly and does not specifically target the performance degradation on recent data; it also affects queries on older data unnecessarily. Option C is wrong because changing the table distribution to round-robin would distribute data evenly across distributions, which can improve load performance but does not address the performance issue caused by scanning a large fact table partitioned by date; it may even worsen query performance for date-filtered queries by spreading data across all distributions. Option D is wrong because updating statistics on the current month partition alone does not reduce the amount of data scanned; while fresh statistics help the optimizer choose better plans, the fundamental problem is the volume of data, not the statistics quality.

Practice this question →

7

MCQhard

You are a data engineer for a gaming company that uses Azure Data Lake Storage Gen2. The data lake stores player event data in JSON format. The data is organized by date and event type. The analytics team frequently runs queries that filter by player ID to analyze individual player behavior. These queries are slow because they scan entire daily partitions. You need to improve the performance of queries that filter by player ID without restructuring the entire data lake. The data is stored as JSON files. What should you do?

A.Compress the JSON files using gzip to reduce I/O

B.Convert the data from JSON to Parquet format and partition by player ID

C.Create indexes on the player ID field in the data lake

D.Repartition the data by hour to reduce the data scanned per partition

AnswerB

Parquet allows predicate pushdown and column pruning, speeding up player ID queries.

Why this answer

Option B is correct because converting JSON to Parquet enables columnar storage, which significantly reduces I/O by reading only the columns needed for queries. Partitioning by player ID further improves performance by allowing partition elimination, so queries filter only the relevant partitions instead of scanning entire daily partitions. This approach directly addresses the slow queries without restructuring the entire data lake.

Exam trap

The trap here is that candidates often think compression alone (Option A) or indexing (Option C) can solve performance issues in a data lake, but Azure Data Lake Storage Gen2 does not support file-level indexes, and compression does not change the fundamental row-scanning nature of JSON queries.

How to eliminate wrong answers

Option A is wrong because compressing JSON with gzip reduces file size but does not change the row-oriented storage format; queries still must read and decompress entire files, and filtering by player ID still requires scanning all rows in each partition. Option C is wrong because Azure Data Lake Storage Gen2 does not support indexing on data files; indexes are a database concept and cannot be applied to files in a data lake. Option D is wrong because repartitioning by hour would create more partitions but does not help queries that filter by player ID; the queries would still scan all partitions unless they also filter by hour, and the player ID is the primary filter, not the time granularity.

Practice this question →

8

MCQmedium

Your team is migrating an on-premises SQL Server data warehouse to Azure Synapse Analytics. The source has a fact table with 500 million rows and several dimension tables. You need to choose the best distribution strategy for the fact table to minimize data movement during joins. Which distribution type should you use?

A.Hash distribution on the foreign key column used in joins

B.No distribution (single distribution)

C.Replicated distribution

D.Round-robin distribution

AnswerA

Hash distribution on the join key ensures rows with the same key are on the same distribution.

Why this answer

Hash distribution on the foreign key column used in joins ensures that rows with the same join key are co-located on the same distribution node. This minimizes data movement because the join can be performed locally on each node without shuffling data across the compute nodes, which is critical for a 500-million-row fact table.

Exam trap

The trap here is that candidates often confuse replicated distribution as a general performance booster, but they fail to recognize that replicating a large fact table is impractical and that hash distribution on the join key is the correct strategy to minimize data movement for large fact tables.

How to eliminate wrong answers

Option B is wrong because single distribution (no distribution) places all data on one node, causing a bottleneck and eliminating the parallelism benefits of Azure Synapse Analytics, leading to poor performance for large fact tables. Option C is wrong because replicated distribution copies the entire table to each node, which is impractical for a 500-million-row fact table due to excessive storage and maintenance overhead; it is suitable only for smaller dimension tables. Option D is wrong because round-robin distribution distributes rows evenly without considering join keys, so joins require data to be shuffled across nodes, causing significant data movement and slower query performance.

Practice this question →

9

Multi-Selecthard

Which THREE of the following are best practices for designing tables in a dedicated SQL pool in Azure Synapse Analytics?

Select 3 answers

A.Avoid using clustered columnstore indexes on large tables.

B.Avoid data skew by choosing a good distribution key.

C.Use round-robin distribution for all large fact tables.

D.Use replicated tables for small dimension tables (less than 1 GB).

E.Use hash distribution on a column with high cardinality for large fact tables.

AnswersB, D, E

Data skew can degrade performance.

Why this answer

Option B is correct because a good distribution key minimizes data skew, ensuring that data is evenly distributed across all distributions. This prevents performance bottlenecks where some distributions handle a disproportionate amount of data or queries, which is critical for parallel processing in a dedicated SQL pool.

Exam trap

The trap here is that candidates often assume clustered columnstore indexes are unsuitable for large tables due to memory constraints, but they are actually the default and recommended index type for fact tables in Synapse dedicated SQL pools.

Practice this question →

10

MCQmedium

A data engineer needs to store JSON documents that are frequently updated by multiple users concurrently. The solution must support optimistic concurrency control and have built-in indexing on all fields. Which Azure data store should be used?

A.Azure Cosmos DB (SQL API)

B.Azure Blob Storage

C.Azure SQL Database

D.Azure Table Storage

AnswerA

Cosmos DB supports JSON documents, optimistic concurrency, and automatic indexing.

Why this answer

Azure Cosmos DB (SQL API) is the correct choice because it natively supports optimistic concurrency control via ETags (HTTP entity tags) and provides automatic indexing of all fields without requiring manual index management. This makes it ideal for storing JSON documents that are frequently updated by multiple concurrent users, as it ensures conflict detection and resolution while maintaining high performance.

Exam trap

The trap here is that candidates often choose Azure SQL Database because they associate concurrency control with relational databases, overlooking that Cosmos DB is purpose-built for JSON documents with automatic indexing and native optimistic concurrency via ETags, which is more aligned with the requirements than a relational store.

How to eliminate wrong answers

Option B (Azure Blob Storage) is wrong because it does not support optimistic concurrency control; it uses lease-based locking for blobs, which is not designed for fine-grained concurrent updates on JSON documents and lacks built-in indexing on all fields. Option C (Azure SQL Database) is wrong because while it supports optimistic concurrency via snapshot isolation or row versioning, it requires manual index creation and is not optimized for storing and querying JSON documents natively; it is a relational store, not a document store. Option D (Azure Table Storage) is wrong because it does not support optimistic concurrency control (it uses ETags but only for individual entities, not for complex JSON documents) and its indexing is limited to partition and row keys, not all fields.

Practice this question →

11

Multi-Selecthard

You are designing a data storage solution for a financial services company. The solution must meet the following requirements: store transaction data for 7 years for regulatory compliance, support point-in-time restore (PITR) for the last 30 days, and minimize storage costs for historical data. Which THREE actions should you take?

Select 3 answers

A.Store historical data in Azure Data Lake Storage Gen2

B.Use Azure Blob Storage with cool access tier for data older than 30 days

C.Use Azure SQL Database with automated backups and PITR retention of 30 days

D.Implement a data export process using Azure Data Factory to move data older than 30 days to Blob Storage

E.Use Azure Table Storage for archival

AnswersB, C, D

Cool tier is cost-effective for infrequently accessed data.

Why this answer

Option B is correct because Azure Blob Storage's cool access tier is designed for infrequently accessed data with lower storage costs, making it ideal for historical transaction data older than 30 days. This tier provides cost-effective storage while still allowing retrieval when needed, aligning with the requirement to minimize storage costs for historical data.

Exam trap

The trap here is that candidates often confuse Azure Data Lake Storage Gen2 with a cost-effective archival solution, when in fact it is designed for high-throughput analytics and lacks the tiered pricing that Blob Storage offers for long-term retention.

Practice this question →

12

MCQhard

A data engineer runs the Azure CLI command shown in the exhibit. The blob is stored in Azure Blob Storage. The team previously set a lifecycle management rule to move blobs to the Archive tier after 30 days. The blob was created 45 days ago. What is the most likely reason the blob is still in the Cool tier?

A.The lifecycle management rule only applies to blobs in the Hot tier.

B.The lifecycle management rule was configured after the blob was created, and it can take up to 24 hours for the rule to be evaluated.

C.The blob is in a container that is excluded from the lifecycle rule.

D.The blob must be in the Hot tier for the rule to move it to Archive.

AnswerB

Correct. Lifecycle rules are evaluated once per day, so there may be a delay.

Why this answer

The lifecycle management rule is evaluated by Azure Storage once per day. If the rule was configured after the blob was created, it may not have been evaluated yet, and it can take up to 24 hours for the rule to apply. Since the blob is 45 days old and still in Cool tier, the most likely reason is that the rule has not yet been evaluated after its configuration.

Exam trap

The trap here is that candidates assume lifecycle rules are evaluated immediately or that blobs must be in Hot tier to be moved to Archive, but Azure's daily evaluation cycle and the ability to move from Cool to Archive are the key nuances tested.

How to eliminate wrong answers

Option A is wrong because lifecycle management rules can apply to blobs in any tier (Hot, Cool, or Archive) unless explicitly filtered by tier in the rule definition. Option C is wrong because there is no indication in the scenario that the container is excluded; the question states a rule was set, and exclusion would require explicit configuration. Option D is wrong because lifecycle rules can move blobs from Cool to Archive directly; blobs do not need to be in Hot tier first.

Practice this question →

13

MCQeasy

You are designing a data storage solution for a marketing analytics platform. The platform collects clickstream data from websites and needs to store it for both real-time dashboards and historical analysis. The data is semi-structured (JSON) and arrives at a rate of 10,000 events per second. You need to choose an Azure storage solution that can handle the ingestion rate, support schema-on-read, and integrate with Azure Databricks for advanced analytics. The solution must also be cost-effective for long-term storage. What should you use?

A.Azure Table Storage

B.Azure Data Lake Storage Gen2

C.Azure Cosmos DB

D.Azure SQL Database

AnswerB

ADLS Gen2 meets all requirements: high throughput, schema-on-read, Databricks integration, cost-effective.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it combines a hierarchical namespace with Azure Blob Storage, providing high-throughput ingestion (up to 60 GB/s per account) to handle 10,000 events per second of semi-structured JSON data. It supports schema-on-read natively, allowing Azure Databricks to query the data directly using Spark without prior schema definition, and its tiered storage (hot, cool, archive) makes it cost-effective for long-term historical analysis.

Exam trap

The trap here is that candidates often choose Azure Cosmos DB for its real-time capabilities, overlooking that the question emphasizes cost-effective long-term storage and schema-on-read for historical analysis, which ADLS Gen2 handles far more efficiently and cheaply than Cosmos DB's per-request-unit pricing model.

How to eliminate wrong answers

Option A is wrong because Azure Table Storage is a NoSQL key-value store designed for structured data with a fixed schema, not for semi-structured JSON clickstream data, and it lacks the hierarchical namespace and high-throughput ingestion needed for 10,000 events per second. Option C is wrong because Azure Cosmos DB is optimized for low-latency real-time access with its multi-model API, but it is significantly more expensive for long-term storage of high-volume historical data and does not natively integrate with Azure Databricks for schema-on-read analytics as efficiently as ADLS Gen2. Option D is wrong because Azure SQL Database is a relational database requiring a predefined schema (schema-on-write), which conflicts with the schema-on-read requirement, and its ingestion rate and cost model are not designed for high-velocity semi-structured data at 10,000 events per second.

Practice this question →

14

MCQmedium

A company uses Azure Synapse Analytics dedicated SQL pool. They notice that some queries are slow due to high data movement. What should you do to minimize data movement for queries that join large fact tables?

A.Use round-robin distribution for all tables.

B.Partition both tables on the join keys.

C.Hash-distribute the fact tables on the join keys.

D.Use replicated tables for all large fact tables.

AnswerC

Hash distribution on join keys colocates rows and minimizes data movement.

Why this answer

Hash-distributing the fact tables on the join keys ensures that rows with the same join key value are placed on the same distribution node. This eliminates the need to shuffle data across nodes during the join, minimizing data movement and improving query performance in Azure Synapse dedicated SQL pool.

Exam trap

The trap here is that candidates confuse partitioning with distribution, thinking that partitioning on join keys reduces data movement, when in fact only hash distribution on the join key ensures collocation across nodes.

How to eliminate wrong answers

Option A is wrong because round-robin distribution distributes rows evenly without considering join keys, which does not reduce data movement for joins and can actually increase it. Option B is wrong because partitioning on join keys organizes data within a distribution but does not control data placement across distributions; data movement still occurs when joining across partitions. Option D is wrong because replicated tables are suitable for small dimension tables, not large fact tables, as replicating large tables would consume excessive storage and negate the benefits of scale-out.

Practice this question →

15

MCQmedium

You are designing a data storage solution for a real-time analytics application that ingests IoT sensor data. The data must be stored in a format that supports both streaming ingestion and batch processing with low latency for queries. Which Azure storage solution should you use?

A.Azure Blob Storage with hot access tier

B.Azure SQL Database with change data capture

C.Azure Cosmos DB with analytical store

D.Azure Data Lake Storage Gen2

AnswerD

Supports hierarchical namespace, streaming and batch ingestion, and low-latency queries through Azure Synapse.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it combines a hierarchical namespace with Azure Blob Storage, enabling both streaming ingestion (via Event Hubs, Kafka, or Spark Structured Streaming) and batch processing (via PolyBase, Azure Synapse, or Databricks) while supporting low-latency queries through its POSIX-like file system and optimized columnar formats like Parquet. It directly addresses the requirement for a unified storage layer that handles real-time and batch workloads without data movement.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage (option A) with ADLS Gen2, assuming the hot access tier supports streaming ingestion and low-latency queries, but they overlook the critical need for a hierarchical namespace and native batch processing capabilities that only ADLS Gen2 provides.

How to eliminate wrong answers

Option A is wrong because Azure Blob Storage with hot access tier lacks a hierarchical namespace, making it inefficient for low-latency queries and batch processing patterns that rely on directory structures and atomic rename operations; it is designed for object storage, not for real-time analytics with streaming ingestion. Option B is wrong because Azure SQL Database with change data capture is optimized for transactional workloads and relational queries, not for high-volume streaming ingestion of IoT sensor data, and it introduces latency and cost overhead for large-scale batch processing. Option C is wrong because Azure Cosmos DB with analytical store is designed for globally distributed, multi-model data with automatic indexing, but it is not optimized for batch processing with columnar formats like Parquet and incurs higher latency for large-scale analytical queries compared to ADLS Gen2's native integration with Spark and Synapse.

Practice this question →

16

MCQhard

You have an Azure Synapse Analytics dedicated SQL pool with a table that uses hash distribution on CustomerID. You notice that queries joining this table with another table on OrderDate are slow. What is the most likely cause?

A.The table is not partitioned by OrderDate

B.Statistics on the join columns are outdated

C.The table should use round-robin distribution instead

D.The join columns are not aligned; data must be shuffled across distributions

AnswerD

A is correct because hash distribution on CustomerID means OrderDate is not the distribution key, causing unnecessary data movement during join.

Why this answer

D is correct because in Azure Synapse Analytics dedicated SQL pools, hash distribution distributes rows across distributions based on a hash of the distribution key (CustomerID). When joining on OrderDate, which is not the distribution key, the join columns are not aligned across distributions. This forces data movement (shuffling) where rows from one or both tables must be redistributed to match the join key, causing significant performance degradation.

Exam trap

The trap here is that candidates often confuse partitioning with distribution, thinking that partitioning on the join column solves the data movement issue, when in fact distribution alignment is the critical factor for collocated joins in a distributed MPP system.

How to eliminate wrong answers

Option A is wrong because partitioning by OrderDate would help with partition elimination for scans or maintenance, but it does not address the fundamental issue of data movement required when join columns are not aligned with the distribution key. Option B is wrong because outdated statistics can cause suboptimal query plans, but the primary performance bottleneck here is the physical data movement across distributions, not statistics. Option C is wrong because round-robin distribution distributes rows evenly without any key, which would still require full data movement for any join, making performance even worse than hash distribution on a non-join column.

Practice this question →

17

MCQhard

You are designing a solution to store semi-structured JSON logs from a web application in Azure Cosmos DB. The logs are written once and rarely read. The application writes up to 10,000 documents per second, and each document is about 2 KB. You need to minimize RU/s cost. Which API and indexing policy should you choose?

A.Azure Cosmos DB for Cassandra with default indexing policy.

B.Azure Cosmos DB for Table with default indexing policy.

C.Azure Cosmos DB for NoSQL with default indexing policy (automatic indexing of all fields).

D.Azure Cosmos DB for MongoDB with a custom indexing policy that only includes a wildcard index on _id and disables automatic indexing.

AnswerD

Disabling automatic indexing and indexing only _id minimizes RU consumption for writes.

Why this answer

Option D is correct because Azure Cosmos DB for MongoDB with a custom indexing policy that disables automatic indexing and only includes a wildcard index on _id minimizes RU/s cost for write-heavy, rarely read workloads. Each write operation consumes RUs proportional to the number of indexed paths; by eliminating automatic indexing of all fields, you drastically reduce the per-document write RU charge. The MongoDB API supports this fine-grained indexing control, making it ideal for high-throughput ingestion of semi-structured JSON logs.

Exam trap

The trap here is that candidates often assume the NoSQL API (Option C) is always the best choice for JSON data, overlooking that its default indexing policy incurs significant RU overhead for write-heavy workloads, whereas the MongoDB API allows disabling indexing to minimize cost.

How to eliminate wrong answers

Option A is wrong because Azure Cosmos DB for Cassandra uses a default indexing policy that indexes all columns, which would incur high RU/s costs for the 10,000 writes per second. Option B is wrong because Azure Cosmos DB for Table also indexes all properties by default, leading to unnecessary RU consumption for write-heavy, rarely read logs. Option C is wrong because Azure Cosmos DB for NoSQL with default automatic indexing of all fields would maximize RU/s cost per write, which directly contradicts the requirement to minimize cost.

Practice this question →

18

MCQhard

Match each Azure storage service to its primary use case.

A.Big data analytics

B.Globally distributed, low-latency applications

C.Relational OLTP

D.NoSQL key-value store

E.Azure Data Lake Storage Gen2

F.Azure Cosmos DB

G.Azure SQL Database

H.Azure Table Storage

Why this answer

Azure Data Lake Storage Gen2 is for big data analytics. Azure Cosmos DB is for globally distributed, low-latency applications. Azure SQL Database is for relational OLTP.

Azure Table Storage is for NoSQL key-value store.

Exam trap

Candidates often confuse Azure Data Lake Storage with Blob Storage; Data Lake Storage is Blob Storage with hierarchical namespace, optimized for analytics.

Practice this question →

19

MCQhard

You are a data engineer for a financial services company. The company stores sensitive transaction data in Azure Data Lake Storage Gen2. The data is partitioned by date and loaded daily via Azure Data Factory. Recently, an audit found that the storage account allows public network access, and some containers have anonymous read access enabled. You need to secure the storage account according to the principle of least privilege while ensuring that Azure Data Factory can still load data. You must also ensure that data can be accessed by Azure Databricks for analytics. The solution must minimize administrative overhead. Which course of action should you take?

A.Keep public network access enabled but restrict it to specific IP addresses. Use storage account keys for Azure Data Factory and Azure Databricks.

B.Disable public network access. Create private endpoints for the storage account and configure Azure Data Factory and Azure Databricks to use the private endpoints. Use RBAC to assign 'Storage Blob Data Contributor' to the managed identities.

C.Enable public network access with a firewall rule to allow only the Azure Data Factory and Azure Databricks IP ranges. Keep anonymous access enabled but set the containers to private.

D.Disable public network access. Set the storage account firewall to allow access only from Azure services. Configure Azure Data Factory and Azure Databricks to use managed identities. Grant the managed identities the 'Storage Blob Data Contributor' role at the container level. Remove any anonymous access.

AnswerD

This meets security requirements, uses managed identities for authentication, and applies least privilege via RBAC.

Why this answer

Option D is correct because disabling public network access and using managed identities with RBAC (Storage Blob Data Contributor) aligns with the principle of least privilege while minimizing administrative overhead. Azure Data Factory and Azure Databricks can authenticate via managed identities without managing keys or IP ranges, and removing anonymous access eliminates the security gap. The firewall rule allowing access only from Azure services ensures that only Azure-internal traffic can reach the storage account, which is sufficient for these services when they are in the same region.

Exam trap

The trap here is that candidates often overcomplicate the solution by choosing private endpoints (Option B) or IP-based firewalls (Option A/C), not realizing that the 'Allow Azure services' firewall rule combined with managed identities provides a simpler, least-privilege-compliant path for first-party Azure services.

How to eliminate wrong answers

Option A is wrong because keeping public network access enabled and using storage account keys violates least privilege—keys are shared secrets that are hard to rotate and audit, and IP restriction alone does not prevent access from other Azure services or compromised endpoints. Option B is wrong because creating private endpoints adds significant administrative overhead (VNet configuration, DNS resolution, peering) and is not necessary when the services can use managed identities over the Azure backbone network; it also does not address the anonymous access issue directly. Option C is wrong because enabling public network access with firewall rules for IP ranges is brittle—Azure Data Factory and Azure Databricks IP ranges can change without notice, leading to failures, and keeping anonymous access enabled (even with private containers) still allows enumeration of container names and potential misconfiguration.

Practice this question →

20

MCQmedium

You are designing a data storage solution for a healthcare organization that stores patient records. The solution must comply with HIPAA and support point-in-time restore (PITR) for the last 35 days. The data is frequently accessed for reporting. Which Azure data service should you use?

A.Azure Cosmos DB

B.Azure Blob Storage with point-in-time restore

C.Azure SQL Database

D.Azure Data Lake Storage Gen2

AnswerC

Azure SQL Database supports PITR up to 35 days and is HIPAA-eligible when configured properly.

Why this answer

Azure SQL Database is the correct choice because it natively supports point-in-time restore (PITR) for up to 35 days (configurable from 7 to 35 days) and is a HIPAA-eligible service when configured with encryption, auditing, and network security. It provides transactional consistency required for patient records and supports frequent reporting workloads with features like columnstore indexes and read replicas.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage's point-in-time restore (which only applies to container-level recovery of blobs) with the transactional point-in-time restore needed for relational patient records, or they assume Cosmos DB's multi-model capabilities make it suitable for structured healthcare data despite its lack of ACID compliance and limited backup retention.

How to eliminate wrong answers

Option A is wrong because Azure Cosmos DB does not support point-in-time restore for the last 35 days; its backup retention is limited to 8 hours for continuous backups or 30 days for periodic backups, and it is not a relational database, making it unsuitable for structured patient records requiring ACID transactions. Option B is wrong because Azure Blob Storage with point-in-time restore is designed for object storage, not transactional data; it lacks relational query capabilities, foreign key constraints, and native support for HIPAA-compliant auditing and row-level security required for patient records. Option D is wrong because Azure Data Lake Storage Gen2 is optimized for big data analytics and hierarchical namespace storage, not for transactional workloads with point-in-time restore; it does not provide built-in PITR for individual records and is not designed for frequent reporting on structured patient data.

Practice this question →

21

MCQhard

You are designing a data storage solution for a global e-commerce company. The company's analytics team uses Azure Synapse Serverless SQL to query Parquet files in ADLS Gen2. The data is partitioned by year, month, and day. The team frequently runs queries that aggregate sales by product category across the last 30 days. However, the queries are slow and scanning too much data. What should you do to improve performance?

A.Increase the number of files per partition to improve parallelism.

B.Convert the Parquet files to CSV format to reduce file size.

C.Create a view that uses explicit partition pruning by filtering on year, month, and day in the WHERE clause.

D.Add a new partition by product category to the folder structure.

AnswerC

Serverless SQL can eliminate partitions when filters are on partition columns.

Why this answer

Option C is correct because creating a view with explicit partition pruning by filtering on year, month, and day in the WHERE clause forces Synapse Serverless SQL to read only the relevant partitions (the last 30 days) instead of scanning the entire dataset. This reduces the amount of data scanned, directly improving query performance and lowering cost.

Exam trap

The trap here is that candidates often think adding more partitions (Option D) or changing file formats (Option B) will help, but they fail to recognize that the real bottleneck is the lack of partition pruning in the query itself, not the storage layout.

How to eliminate wrong answers

Option A is wrong because increasing the number of files per partition does not reduce the amount of data scanned; it can actually increase metadata overhead and slow down queries due to more file open/close operations. Option B is wrong because converting Parquet to CSV would increase file size (Parquet is columnar and compressed, CSV is row-based and uncompressed), leading to more data scanned and slower performance. Option D is wrong because adding a partition by product category would create a deep folder hierarchy that does not help with the existing partition pruning on date columns; the queries already filter by date, so the date-based partition structure is sufficient, and adding another partition level would not reduce the scan for the last 30 days.

Practice this question →

22

MCQhard

You are a data engineer for a multinational e-commerce company. The company uses Azure Synapse Analytics as its data warehouse. The current fact table, SalesFact, is distributed using hash distribution on the CustomerID column. It has 2 billion rows and is 2 TB in size. Recently, the business team has been running many queries that aggregate sales by product category and date, and these queries are experiencing high data movement and long execution times. The product dimension table (ProductDim) has 100,000 rows and is 100 MB. The date dimension table (DateDim) has 5,000 rows and is 5 MB. You need to redesign the storage to minimize data movement for these aggregation queries. You cannot change the fact table distribution key to ProductID because of other critical queries that rely on CustomerID. What should you do?

A.Create materialized views on the fact table that aggregate by product category and date

B.Replicate the ProductDim and DateDim tables to all compute nodes

C.Partition the fact table by date and keep the same distribution

D.Change the fact table distribution to round-robin and create non-clustered indexes on ProductID and DateID

AnswerB

Replication eliminates data movement for small dimension tables.

Why this answer

Option B is correct because replicating small dimension tables (ProductDim at 100 MB and DateDim at 5 MB) to all compute nodes eliminates the need to shuffle these tables across nodes during joins. In Azure Synapse Analytics, replicated tables are copied to each distribution, so when the fact table (hash-distributed on CustomerID) joins with ProductDim and DateDim on ProductID and DateID, no data movement occurs for the dimension tables. This directly reduces the high data movement and long execution times for aggregation queries by product category and date.

Exam trap

The trap here is that candidates often choose materialized views (Option A) thinking they solve all aggregation performance issues, but they overlook that data movement from joins with non-replicated dimension tables remains the bottleneck, whereas table replication directly addresses the shuffle cost for small dimension tables.

How to eliminate wrong answers

Option A is wrong because materialized views in Azure Synapse Analytics pre-aggregate data but still require the underlying fact table's distribution; they do not eliminate data movement when joining with non-replicated dimension tables, and the queries would still suffer from shuffling ProductDim and DateDim. Option C is wrong because partitioning the fact table by date improves partition elimination for date-range filters but does not reduce data movement during joins; the hash distribution on CustomerID remains, so joins on ProductID and DateID still require redistributing the fact table or dimension tables. Option D is wrong because changing to round-robin distribution would distribute fact table rows randomly, causing even more data movement for all joins and aggregations, and non-clustered indexes do not address the fundamental distribution issue for large-scale aggregation queries.

Practice this question →

23

MCQeasy

A data engineer needs to store semi-structured JSON log files from a web application. Each log entry is about 1 KB. The logs are rarely queried (once a month) and must be retained for 7 years for compliance. The solution must minimize storage cost. Which storage option should be used?

A.Store the logs in Azure SQL Database as a table.

B.Store the logs in Azure Files share.

C.Store the logs in Azure Blob Storage with cool access tier.

D.Store the logs in Azure Cosmos DB with a JSON container.

AnswerC

Blob Storage cool tier is low-cost for infrequent access, suitable for logs.

Why this answer

Azure Blob Storage with the cool access tier is the correct choice because it is optimized for storing large amounts of semi-structured data (like JSON logs) at low cost, with infrequent access (once a month) and long retention (7 years). The cool tier offers lower storage costs than hot or premium tiers, while still providing high durability and the ability to query logs using tools like Azure Data Lake Storage or serverless SQL. This meets the compliance requirement without the high compute or transaction costs of a database solution.

Exam trap

The trap here is that candidates often choose Azure Cosmos DB (D) because it natively supports JSON, but they overlook the extreme cost of storing and rarely querying 7 years of data in a globally distributed, high-throughput NoSQL database, which is optimized for frequent, low-latency access, not archival.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database is a relational database designed for structured, transactional workloads with frequent queries, and it incurs high storage and compute costs for 7 years of 1 KB log entries, making it far more expensive than blob storage for rarely accessed data. Option B is wrong because Azure Files provides SMB/NFS file shares primarily for shared file access in VMs or on-premises apps, not for cost-effective, long-term archival of semi-structured logs, and it lacks the tiered pricing and lifecycle management of blob storage. Option D is wrong because Azure Cosmos DB is a NoSQL database optimized for low-latency, globally distributed, and frequently queried data; storing 7 years of rarely accessed logs in Cosmos DB would incur prohibitive costs due to its per-request unit (RU) pricing and storage charges, far exceeding blob storage costs.

Practice this question →

24

Multi-Selecthard

You are designing a data storage solution that must support both operational and analytical workloads using a single copy of data. Which THREE technologies should you consider?

Select 3 answers

A.Azure SQL Database with materialized views

B.Azure SQL Database with change data capture (CDC)

C.Azure SQL Database with PolyBase

D.Azure Synapse Link for Cosmos DB

E.Azure Cosmos DB with analytical store (HTAP)

AnswersC, D, E

PolyBase allows querying external data sources, enabling hybrid workloads.

Why this answer

Azure Synapse Link for Cosmos DB and Azure Cosmos DB with analytical store (HTAP) enable hybrid transactional/analytical processing by automatically replicating operational data into a column-oriented analytical store, allowing you to run near-real-time analytical queries on the same copy of data without impacting transactional performance. PolyBase in Azure SQL Database allows querying external data sources (like Azure Blob Storage or Hadoop) using T-SQL, but it does not support both operational and analytical workloads on a single copy of data—it is primarily for data virtualization and loading.

Exam trap

The trap here is that candidates often confuse PolyBase's ability to query external data with the concept of supporting both operational and analytical workloads on a single copy of data, but PolyBase is a data virtualization tool, not an HTAP solution that maintains a single copy for both workloads.

Practice this question →

25

MCQmedium

You are a data engineer for a financial services company. The company uses Azure Data Lake Storage Gen2 as its data lake. You have a directory structure where each customer has a folder containing transaction files in CSV format. The security team requires that each customer's data be accessible only to that customer's users. You need to implement fine-grained access control using Azure Data Lake Storage Gen2's POSIX-like ACLs. However, you have thousands of customers, and managing ACLs individually is not feasible. What should you do?

A.Create a shared access signature (SAS) token for each customer and distribute it securely

B.Use POSIX ACLs on each customer folder, assigning permissions to individual user identities

C.Use row-level security in Azure Data Lake Storage Gen2

D.Create an Azure AD group for each customer, add users to the group, and assign ACLs to the group on the customer folder

AnswerD

Group-based ACL assignment is scalable and manageable.

Why this answer

Option D is correct because Azure Data Lake Storage Gen2 supports POSIX-like ACLs that can be assigned to Azure AD security groups. By creating one Azure AD group per customer, adding the customer's users to that group, and then assigning the group the appropriate read/execute ACLs on the customer's folder, you achieve scalable, fine-grained access control without managing thousands of individual user ACLs. This approach aligns with the principle of least privilege and simplifies administration.

Exam trap

The trap here is that candidates often confuse row-level security (a SQL-based feature) with file-system access control in Azure Data Lake Storage Gen2, or they mistakenly believe that SAS tokens can provide granular directory-level isolation, when in fact SAS tokens operate at the container or storage account level and cannot enforce per-folder ACLs.

How to eliminate wrong answers

Option A is wrong because shared access signature (SAS) tokens provide delegated access at the storage account or container level, not at the directory or file level, and managing thousands of SAS tokens securely is operationally complex and does not integrate with Azure AD identity-based access control. Option B is wrong because assigning POSIX ACLs to individual user identities for thousands of customers is not feasible due to the Azure AD limit of 32 ACL entries per file or directory and the administrative overhead of managing individual user permissions at scale. Option C is wrong because row-level security is a feature of Azure SQL Database and Azure Synapse Analytics dedicated SQL pools, not of Azure Data Lake Storage Gen2, which uses POSIX ACLs and RBAC for access control.

Practice this question →

26

Multi-Selectmedium

You are implementing a data lake using Azure Data Lake Storage Gen2. Which THREE actions should you take to secure the data at rest and in transit?

Select 3 answers

A.Enable TLS 1.0 for compatibility with legacy clients

B.Enable Azure Storage Service Encryption (SSE) for data at rest

C.Configure firewall rules to allow only trusted IPs

D.Use Azure RBAC and ACLs to control access to data

E.Require HTTPS for all data transfers

AnswersB, D, E

SSE encrypts data at rest by default.

Why this answer

Azure Storage Service Encryption (SSE) automatically encrypts data at rest using 256-bit AES encryption, which is transparent to applications and meets compliance requirements. This is a fundamental security control for protecting data stored in Azure Data Lake Storage Gen2.

Exam trap

The trap here is that candidates may confuse network security controls (firewalls) or legacy protocol compatibility (TLS 1.0) with actual data encryption mechanisms, leading them to select options that address access or connectivity rather than encryption of data at rest and in transit.

Practice this question →

27

MCQmedium

A data engineer needs to store semi-structured JSON logs for analysis using Azure Synapse Serverless SQL. Which file format should be used for optimal query performance?

A.Avro

B.Parquet

C.CSV

AnswerB

Columnar, compressed, optimized for query performance.

Why this answer

Parquet is correct because it is a columnar storage format that enables predicate pushdown and compression, significantly reducing the amount of data scanned by Azure Synapse Serverless SQL for analytical queries on semi-structured JSON logs. This format aligns with the engine's design for high-performance read operations on large datasets, unlike row-oriented formats that require full file scans.

Exam trap

The trap here is that candidates often assume semi-structured data must stay in its native JSON format for simplicity, overlooking that columnar formats like Parquet can natively store nested JSON structures via repeated fields and maps, while providing massive performance gains in serverless SQL engines.

How to eliminate wrong answers

Option A is wrong because Avro is a row-oriented format that, while efficient for write-heavy and schema-evolving scenarios, does not support column pruning or predicate pushdown as effectively as Parquet, leading to higher I/O and slower query performance in Synapse Serverless SQL. Option C is wrong because CSV is a plain-text, row-oriented format with no built-in compression or indexing, forcing full table scans and increasing data transfer costs, which degrades query performance. Option D is wrong because storing logs as raw JSON files results in verbose, uncompressed data that lacks schema enforcement and columnar optimization, causing Synapse Serverless SQL to parse entire files without the benefits of predicate pushdown or efficient compression.

Practice this question →

28

MCQeasy

Your company uses Azure Cosmos DB for NoSQL to store user profiles. The application frequently reads profiles by user ID (the partition key). Occasionally, the application needs to query by email address, which is not part of the partition key. What should you do to optimize the occasional queries by email?

A.Create a secondary (composite) index on the email field.

B.Change the partition key to the email field.

C.Denormalize the data by storing a copy of the email in the partition key.

D.Use the Azure Cosmos DB change feed to maintain a separate container keyed by email.

AnswerA

A secondary index allows efficient queries on non-partition key fields.

Why this answer

Creating a secondary index on the email field allows Azure Cosmos DB for NoSQL to efficiently serve queries filtering by email without scanning all partitions. Since email is not the partition key, a secondary index (specifically a composite index if needed for multi-field queries, or a single-field index) enables index-based lookup across all physical partitions, optimizing the occasional query without redesigning the data model.

Exam trap

The trap here is that candidates often assume a secondary index is unnecessary or that changing the partition key is the only way to optimize non-key queries, but Azure Cosmos DB supports secondary indexes for non-partition key fields, and altering the partition key would disrupt the primary access pattern.

How to eliminate wrong answers

Option B is wrong because changing the partition key to email would break the primary access pattern (reads by user ID), causing cross-partition queries for the frequent user ID lookups and likely exceeding request unit (RU) costs. Option C is wrong because denormalizing by storing a copy of the email in the partition key does not change the partition key itself; the partition key remains user ID, so queries by email would still require a cross-partition scan unless a secondary index is used. Option D is wrong because using the change feed to maintain a separate container keyed by email introduces operational complexity and eventual consistency, and is overkill for occasional queries; a secondary index is simpler and directly supported.

Practice this question →

29

MCQmedium

Match each Azure data storage service to its primary use case.

A.Azure Blob Storage

B.Azure Cosmos DB

C.Azure Data Lake Storage Gen2

D.Azure SQL Database

Why this answer

Azure Blob Storage is for unstructured object storage. Azure Cosmos DB is for globally distributed, multi-model NoSQL data. Azure Data Lake Storage Gen2 is for big data analytics.

Azure SQL Database is for relational OLTP.

Exam trap

Candidates might confuse Azure Data Lake Storage Gen2 with Blob Storage, but ADLS Gen2 has hierarchical namespace for analytics.

Practice this question →

30

MCQhard

You are examining a T-SQL script that creates an external table in Azure Synapse serverless SQL pool. The query SELECT * FROM dbo.Sales returns zero rows, but the folder /year=2024/ in ADLS Gen2 contains Parquet files. What is the most likely cause?

A.The credential used to access ADLS Gen2 does not have sufficient permissions.

B.The serverless SQL pool does not support reading Parquet files.

C.The external table definition is missing the SCHEMA_NAME parameter.

D.The DATA_COMPRESSION setting is incompatible with Parquet files.

AnswerA

Insufficient permissions (e.g., missing Storage Blob Data Reader role) cause zero rows.

Why this answer

Option A is correct because the most common reason for SELECT * FROM an external table returning zero rows despite data existing in the underlying ADLS Gen2 folder is that the serverless SQL pool lacks the necessary permissions to read the Parquet files. The credential used in the external data source must have at least 'Storage Blob Data Reader' role on the storage account or the container, and the identity (e.g., SAS token, service principal, or managed identity) must be correctly configured. Without this, the query executes successfully but returns no rows because the pool cannot access the data.

Exam trap

The trap here is that candidates assume a missing or misconfigured schema parameter (like SCHEMA_NAME) would cause zero rows, but in reality, permission issues are the primary cause of empty results when the data path is correct and the file format is supported.

How to eliminate wrong answers

Option B is wrong because Azure Synapse serverless SQL pool fully supports reading Parquet files, including partitioned data like /year=2024/. Option C is wrong because the SCHEMA_NAME parameter is optional in CREATE EXTERNAL TABLE and is used for schema binding, not for data access or row retrieval. Option D is wrong because DATA_COMPRESSION is not applicable to Parquet files; Parquet has its own internal compression (e.g., Snappy, gzip) and the setting is ignored or causes an error, not silent zero rows.

Practice this question →

31

MCQhard

A financial services company needs to store transaction data for audit purposes. The data must be immutable and cannot be modified or deleted for 7 years. Which Azure storage feature should be used?

A.Azure Blob Storage immutability policy (time-based retention).

B.Azure Blob Storage soft delete.

C.Azure Blob Storage versioning.

D.Azure Files share snapshots.

AnswerA

Immutability policies enforce WORM (Write Once, Read Many) for a specified duration.

Why this answer

Azure Blob Storage immutability policy with time-based retention ensures that blobs cannot be modified or deleted for a specified retention period (e.g., 7 years). This meets the audit requirement for immutable storage by locking the data at the storage level, preventing any writes or deletes until the retention interval expires. The policy is enforced at the container level and applies to all blobs within, making it the correct choice for regulatory compliance.

Exam trap

The trap here is that candidates confuse soft delete or versioning with immutability, not realizing that only a locked time-based retention policy provides the strict write-once, read-many guarantee required for audit data that cannot be modified or deleted for a fixed duration.

How to eliminate wrong answers

Option B is wrong because soft delete only protects against accidental deletion by retaining deleted blobs for a configurable period, but it does not prevent modification or provide true immutability; data can still be overwritten. Option C is wrong because versioning preserves previous versions of a blob when overwritten or deleted, but it does not block writes or deletes—new versions can be created, and the current version can be modified, violating immutability. Option D is wrong because Azure Files share snapshots are point-in-time read-only copies of a file share, but they do not enforce a write-once-read-many (WORM) state on the live share; the original files can still be modified or deleted.

Practice this question →

32

MCQhard

You are a data engineer for a large e-commerce company. The company uses Azure Synapse Analytics dedicated SQL pool as its enterprise data warehouse. A new business requirement mandates that the Sales fact table, which contains 2 billion rows, must support real-time analytics with a maximum query latency of 1 second for aggregations on the most recent 24 hours of data. The table is currently hash-distributed on CustomerID and partitioned monthly by SaleDate. The current query performance for recent data is slow due to full partition scans. The data is ingested via Azure Event Hubs and processed by Azure Stream Analytics, which writes to staging tables every minute. You need to redesign the storage to meet the latency requirement while minimizing cost and maintaining data integrity. Which approach should you take?

A.Increase partition granularity to hourly partitions for the Sales table.

B.Change the Sales table to use a clustered index on SaleDate and a round-robin distribution.

C.Create a separate staging table for the last 24 hours of data with round-robin distribution and a clustered columnstore index. After each batch load, merge the staging table into the main partitioned Sales table. Queries for recent data should target the staging table.

D.Provision a second dedicated SQL pool optimized for real-time queries and replicate the recent data there.

AnswerC

The staging table is small, enabling fast queries for recent data; merging maintains the historical archive.

Why this answer

Option C is correct because it isolates the hot (recent 24-hour) data into a separate staging table with a clustered columnstore index, which is optimized for fast aggregations and high compression. This avoids full partition scans on the 2-billion-row main table, and merging into the partitioned table after each batch load maintains data integrity without requiring expensive repartitioning or additional dedicated SQL pools.

Exam trap

The trap here is that candidates assume finer partitioning (Option A) always improves query performance, but they overlook that partition elimination still requires scanning the entire partition, and that a separate hot table with columnstore indexing is more efficient for real-time aggregations on a sliding window of recent data.

How to eliminate wrong answers

Option A is wrong because increasing partition granularity to hourly partitions would increase metadata overhead and partition management complexity, and full partition scans would still occur on the most recent partition, failing to meet the 1-second latency requirement. Option B is wrong because changing to a clustered index on SaleDate with round-robin distribution eliminates data locality for joins on CustomerID and introduces full data movement for aggregations, while round-robin distribution is unsuitable for large fact tables due to lack of partition elimination. Option D is wrong because provisioning a second dedicated SQL pool duplicates storage and compute costs unnecessarily, and replicating data introduces latency and complexity in maintaining consistency, violating the cost-minimization requirement.

Practice this question →

33

MCQhard

You are designing a data storage solution for a global IoT application that ingests millions of events per second. The data is write-heavy with occasional reads for real-time dashboards. Which Azure storage option and configuration would provide the lowest latency writes with high throughput?

A.Azure Cosmos DB with multi-region writes and eventual consistency

B.Azure Cosmos DB with single-region writes and strong consistency

C.Azure Data Lake Storage Gen2 with hierarchical namespace

D.Azure Blob Storage with hot tier and append blobs

AnswerA

Why this answer

Azure Cosmos DB with multi-region writes and eventual consistency provides the lowest latency writes for a global IoT application because it allows each region to accept writes independently without cross-region coordination, and eventual consistency removes the need for quorum confirmations, reducing write latency. This configuration also offers high throughput by distributing write load across multiple regions, making it ideal for write-heavy, high-volume IoT scenarios where occasional reads for dashboards can tolerate stale data.

Exam trap

The trap here is that candidates often assume strong consistency is required for real-time dashboards, but eventual consistency is sufficient for write-heavy IoT scenarios where occasional stale reads are acceptable, and multi-region writes drastically reduce latency compared to single-region writes.

Why the other options are wrong

B

Strong consistency increases write latency and single-region limits throughput.

C

Data Lake Storage is optimized for analytics, not low-latency writes.

D

Blob storage has higher write latency and append blobs are not ideal for high-throughput ingestion.

Practice this question →

34

MCQmedium

Your company uses Azure Data Lake Storage Gen2 for a data lake. You need to implement a folder structure that separates data by sensitivity level. Which access control method should you use?

A.Use a storage account firewall and virtual network service endpoints

B.Use storage account keys for all access

C.Use Azure RBAC at the resource group level and ACLs on directories

D.Use shared access signatures (SAS) for each folder

AnswerC

C is correct because RBAC provides coarse control and ACLs provide fine-grained folder-level permissions.

Why this answer

Option C is correct because Azure Data Lake Storage Gen2 supports both Azure RBAC at the resource group level for coarse-grained control and POSIX-like ACLs on directories for fine-grained, sensitivity-based access. This combination allows you to assign read/write/execute permissions per directory, enabling a folder structure that separates data by sensitivity level without compromising security.

Exam trap

The trap here is that candidates often confuse network-level controls (firewall, VNet) or shared access signatures with the directory-level ACLs required for sensitivity-based folder separation, overlooking that only ACLs provide the POSIX-style granularity needed for hierarchical data lakes.

How to eliminate wrong answers

Option A is wrong because a storage account firewall and virtual network service endpoints control network-level access to the entire storage account, not granular folder-level permissions by sensitivity. Option B is wrong because storage account keys provide full administrative access to the entire account, bypassing any folder-level sensitivity controls and violating the principle of least privilege. Option D is wrong because shared access signatures (SAS) grant time-limited, delegated access to specific containers or blobs, but they cannot enforce POSIX ACLs on directories and are not designed for persistent, sensitivity-based folder structures.

Practice this question →

35

MCQmedium

A company is designing a data storage solution for streaming IoT telemetry data. The data is JSON-formatted, arrives at up to 10,000 events per second, and must be stored for at least 30 days for real-time dashboards and ad-hoc querying. The solution must minimize operational overhead and query latency. Which Azure service should they use?

A.Azure Blob Storage with Azure Data Lake Storage Gen2

B.Azure Data Explorer (ADX)

C.Azure Cosmos DB with analytical store

D.Azure SQL Database with elastic query

AnswerB

ADX is built for high-speed ingestion of streaming data, supports JSON, and provides sub-second query performance for dashboards.

Why this answer

Azure Data Explorer (ADX) is purpose-built for high-velocity telemetry and time-series data, ingesting up to 10,000 events per second with low latency. Its columnar storage and indexing enable sub-second queries on JSON data for real-time dashboards, while the 30-day retention is natively configurable via caching and soft-delete policies. This minimizes operational overhead by eliminating the need for manual partitioning or index tuning.

Exam trap

The trap here is that candidates confuse Azure Data Explorer with Azure Data Lake Storage, assuming that a data lake can serve real-time dashboards, but ADLS Gen2 lacks the indexing and query engine needed for sub-second latency on streaming data.

How to eliminate wrong answers

Option A is wrong because Azure Blob Storage with ADLS Gen2 is optimized for large-scale batch analytics and data lakes, not for real-time, sub-second queries on streaming data; querying JSON blobs directly incurs high latency and requires additional compute (e.g., Azure Synapse or Databricks). Option C is wrong because Azure Cosmos DB with analytical store is designed for transactional workloads with real-time analytics on operational data, but its ingestion throughput for 10,000 events/second of pure telemetry would be costly and over-provisioned, and the analytical store is better suited for hybrid transactional/analytical processing (HTAP) rather than pure streaming telemetry. Option D is wrong because Azure SQL Database with elastic query is a relational OLTP system not optimized for high-velocity JSON ingestion or time-series queries; it would require extensive schema design, indexing, and sharding to handle 10,000 events/second, and query latency would be higher due to row-based storage.

Practice this question →

36

MCQeasy

A retail company uses Azure Synapse Analytics dedicated SQL pool to store sales data. The data is loaded nightly from Azure Data Lake Storage Gen2 using PolyBase. Recently, the load process started failing with the error 'External table 'sales' is not accessible because the location does not exist or is used by another process.' You verify that the storage account, container, and file path are correct. The file is a CSV file named 'sales_20250301.csv' and it exists. Other files in the same container load successfully. What is the most likely cause of the error?

A.The network connectivity between Synapse and the storage account is intermittent.

B.The CSV file has an incorrect number of columns or contains a header row that mismatches the schema.

C.The storage account key used in the external data source has expired.

D.The file is being written to or is locked by another process during the PolyBase read.

AnswerD

The error 'used by another process' indicates a file lock, typically because the file is still being written or another reader has an exclusive lock.

Why this answer

The error 'location does not exist or is used by another process' specifically indicates that the file is locked by another process. In Azure Data Lake Storage Gen2, when a file is being written or modified, it can be locked by the writing process (e.g., a data ingestion pipeline or another ETL job). PolyBase attempts to read the file while it is still being written, causing the error.

Option D correctly identifies this concurrency issue.

Exam trap

The trap here is that candidates often confuse file-locking errors with authentication or schema issues, but the specific wording 'used by another process' directly points to a concurrency/lock conflict rather than connectivity or data format problems.

How to eliminate wrong answers

Option A is wrong because intermittent network connectivity would typically cause timeout or connection reset errors, not a 'location does not exist or is used by another process' error. Option B is wrong because schema mismatches (incorrect column count or header row) would produce data conversion or parsing errors, not a location accessibility error. Option C is wrong because an expired storage account key would result in an authentication failure (e.g., 403 Forbidden), not a location-not-found or file-locked error.

Practice this question →

37

MCQeasy

You need to store JSON files from an external partner in Azure Blob Storage. The files contain sensitive financial data. Which access method provides the highest security while allowing the partner to upload files?

A.Share the storage account access key with the partner

B.Configure a firewall to allow only the partner's IP address

C.Generate a shared access signature (SAS) with write-only permission and expiry time

D.Enable anonymous public access to a container

AnswerC

B is correct because SAS provides scoped, time-limited access.

Why this answer

Option C is correct because a Shared Access Signature (SAS) with write-only permission and an expiry time provides delegated, time-limited access to Azure Blob Storage without exposing the storage account key. This ensures the partner can upload JSON files but cannot read, modify, or list existing blobs, and access automatically revokes after the expiry, meeting the highest security requirement for sensitive financial data.

Exam trap

The trap here is that candidates often confuse IP-based firewalls (Option B) as a security method, but firewalls do not authenticate users or control data-plane permissions, and they can be bypassed if the partner's IP changes or if the partner uses a shared network.

How to eliminate wrong answers

Option A is wrong because sharing the storage account access key grants full administrative control (read, write, delete, list) over all blobs and containers, violating the principle of least privilege and exposing the key to potential compromise. Option B is wrong because a firewall restricting to the partner's IP address does not authenticate the partner's identity or control permissions; it only limits network-level access, and the partner's IP may change, leading to access failures or security gaps. Option D is wrong because enabling anonymous public access allows anyone on the internet to read and list blobs without any authentication, completely exposing sensitive financial data.

Practice this question →

38

MCQmedium

You are designing a change data capture (CDC) pipeline to ingest incremental changes from an on-premises SQL Server database into Azure Data Lake Storage Gen2. The pipeline must run every 5 minutes and handle high-volume DML changes. Which Azure service should you use to capture the changes with low latency?

A.Azure Data Share to share the SQL Server data and capture changes.

B.Azure Data Factory with a change data capture (CDC) source in the mapping data flow.

C.Azure Synapse Pipelines with a copy activity that uses a query to capture changes.

D.Azure Databricks with Auto Loader and Delta Live Tables to capture changes.

AnswerB

ADF supports CDC from SQL Server with low latency via mapping data flows.

Why this answer

Azure Data Factory's mapping data flow includes a native CDC source that can connect to SQL Server and capture incremental DML changes (inserts, updates, deletes) with low latency. This approach uses change tracking or change data capture features in SQL Server to identify changes, and the pipeline can run every 5 minutes to meet the high-volume requirement without custom coding.

Exam trap

The trap here is that candidates often confuse Azure Data Factory's copy activity (which requires manual change tracking) with the mapping data flow's native CDC source, leading them to choose option C or D, which are better suited for different ingestion patterns.

How to eliminate wrong answers

Option A is wrong because Azure Data Share is designed for sharing snapshots or incremental data between organizations, not for capturing low-latency CDC from an on-premises SQL Server into Azure Data Lake Storage Gen2. Option C is wrong because Azure Synapse Pipelines' copy activity with a query-based approach typically requires manual watermarking and cannot natively capture DML changes with low latency; it lacks the built-in CDC source that mapping data flow provides. Option D is wrong because Azure Databricks with Auto Loader and Delta Live Tables is optimized for streaming file ingestion (e.g., from cloud storage), not for directly capturing CDC from an on-premises SQL Server database.

Practice this question →

39

Multi-Selecteasy

You need to design a storage solution for IoT device telemetry data that will be queried by time range. The data is append-only and arrives at high velocity. Which TWO features should you use to optimize query performance and reduce costs?

Select 2 answers

A.Store data in columnar format (e.g., Parquet)

B.Create indexes on all columns

C.Enable row-level security

D.Partition the data by date

E.Enable geo-redundant storage

AnswersA, D

Columnar format reduces I/O and improves compression.

Why this answer

Columnar formats like Parquet store data by column rather than by row, which significantly reduces I/O when querying only a subset of columns (common in time-range queries). This compression and column pruning directly lowers storage costs and speeds up scan-heavy analytical queries on append-only IoT telemetry data.

Exam trap

The trap here is that candidates often confuse indexing (B) with partitioning, but for append-only analytical workloads, indexes add write overhead and cost without benefit, while date partitioning directly enables partition elimination for time-range queries.

Practice this question →

40

MCQhard

You are tuning a dedicated SQL pool in Azure Synapse Analytics. A query that joins two large tables (fact_sales and dim_product) is slow. The fact_sales table is hash-distributed on product_id, and dim_product is replicated. You notice that the query plan shows a shuffle move. What is the most likely cause?

A.The fact_sales table uses clustered columnstore index.

B.The dim_product table is replicated, causing a broadcast join.

C.Statistics are out of date on both tables.

D.The join condition does not include the distribution key for fact_sales.

AnswerD

Joins on non-distribution keys require data movement.

Why this answer

Option D is correct because when the join condition does not include the distribution key (product_id) of the hash-distributed fact_sales table, the SQL engine cannot perform a collocated join. Instead, it must shuffle data across distributions to satisfy the join, which introduces expensive data movement. The shuffle move in the query plan directly indicates this redistribution.

Exam trap

The trap here is that candidates often confuse a shuffle move with a broadcast join or blame indexing, but the root cause is the mismatch between the join key and the distribution key, which forces data movement regardless of other optimizations.

How to eliminate wrong answers

Option A is wrong because a clustered columnstore index is optimized for large fact tables and typically improves query performance; it would not cause a shuffle move. Option B is wrong because a replicated table (dim_product) is designed to avoid shuffles by having a copy on each distribution, enabling a broadcast join without data movement. Option C is wrong because out-of-date statistics can lead to suboptimal plans but do not directly force a shuffle move; the shuffle is a structural requirement based on the join key not matching the distribution key.

Practice this question →

41

MCQeasy

You need to store semi-structured JSON data from a web application. The data schema may change over time. The solution must support low-latency queries and be globally distributed. Which Azure data service should you use?

A.Azure Table Storage

B.Azure Cosmos DB

C.Azure Data Lake Storage Gen2

D.Azure SQL Database

AnswerB

Cosmos DB natively supports JSON documents with automatic indexing and global distribution.

Why this answer

Azure Cosmos DB is the correct choice because it natively supports semi-structured JSON documents with a flexible schema, offers single-digit-millisecond latency for queries, and provides global distribution with turnkey multi-region replication. Its API for MongoDB or SQL API allows direct ingestion of JSON data, and its schema-agnostic indexing adapts automatically to schema changes over time.

Exam trap

The trap here is that candidates often confuse Azure Table Storage's key-value model with document storage, overlooking that it lacks native JSON support and global low-latency query capabilities, while Azure Cosmos DB is specifically designed for these requirements.

How to eliminate wrong answers

Option A is wrong because Azure Table Storage is a key-value store that does not natively support JSON documents or complex nested structures; it requires manual serialization and lacks global distribution with low-latency guarantees. Option C is wrong because Azure Data Lake Storage Gen2 is optimized for large-scale batch analytics and data lakes, not for low-latency transactional queries on semi-structured data. Option D is wrong because Azure SQL Database requires a fixed relational schema and does not handle dynamic schema changes without manual migrations, nor does it offer the same turnkey global distribution as Cosmos DB.

Practice this question →

42

MCQeasy

You are tasked with designing a data storage solution for a social media analytics company. They need to store user profile data (JSON) and social media posts (text and images). The data is used for machine learning models that require fast random access to individual user profiles and the ability to run analytical queries over posts. The solution must provide low-latency reads for user profiles (milliseconds) and support for large-scale analytics on posts. Which combination of Azure data services should you recommend?

A.Azure Cosmos DB for user profiles and Azure Data Lake Storage Gen2 for posts

B.Azure Cosmos DB for both user profiles and posts

C.Azure SQL Database for both user profiles and posts

D.Azure Table Storage for user profiles and Azure Blob Storage for posts

AnswerA

Cosmos DB gives low-latency reads; ADLS Gen2 supports analytics.

Why this answer

Azure Cosmos DB provides low-latency (millisecond) reads for user profiles via its indexing and partitioning capabilities, ideal for fast random access. Azure Data Lake Storage Gen2 (ADLS Gen2) combines a hierarchical namespace with Blob Storage, enabling large-scale analytical queries on posts using tools like Azure Synapse Analytics or Spark, while efficiently storing text and images.

Exam trap

The trap here is that candidates often choose Azure Cosmos DB for both workloads (Option B) because they assume its multi-model support handles analytics, but they overlook that Cosmos DB is a transactional database not designed for large-scale analytical queries, while ADLS Gen2 is purpose-built for data lakes and analytics.

How to eliminate wrong answers

Option B is wrong because using Azure Cosmos DB for both profiles and posts would be cost-prohibitive for large-scale analytics on posts, as Cosmos DB is optimized for transactional workloads, not petabyte-scale analytical queries, and lacks native support for hierarchical file storage. Option C is wrong because Azure SQL Database is a relational store that struggles with semi-structured JSON profiles and large binary images, and it cannot scale to handle massive analytical workloads on posts without significant performance degradation and cost. Option D is wrong because Azure Table Storage is a NoSQL key-value store that does not support complex queries or indexing for fast random access to JSON profiles, and Azure Blob Storage lacks a hierarchical namespace and native analytical integration, making large-scale analytics inefficient.

Practice this question →

43

MCQhard

You are designing a data lake on Azure Data Lake Storage Gen2. The data will be used by both batch processing (Spark) and interactive querying (Azure Synapse Serverless SQL). The data is partitioned by date and stored as Parquet. What is the optimal folder structure to minimize cross-partition scans for both workloads?

A.All files in a single folder

B./year/month/day/ (e.g., /2023/12/25/)

C./yyyy-mm-dd/ (e.g., /2023-12-25/)

D.Files named by date (e.g., data_20231225.parquet)

AnswerB

Why this answer

Option B (/year/month/day/) is optimal because it aligns with Hive-style partitioning, which both Spark and Azure Synapse Serverless SQL can leverage for partition pruning. Spark uses partition discovery to read only relevant directories, and Synapse Serverless SQL uses the file path metadata to filter partitions, minimizing cross-partition scans and reducing data read overhead.

Exam trap

The trap here is that candidates often assume a flat date-based folder or filename pattern is sufficient for partitioning, but both Spark and Synapse Serverless SQL require hierarchical folder structures to enable automatic partition pruning and avoid full scans.

Why the other options are wrong

A

No partitioning at all, causing full scans.

C

Single-level partitioning does not allow efficient pruning for yearly or monthly queries.

D

Partition pruning requires folder hierarchy, not file names.

Practice this question →

44

MCQeasy

Your company is migrating an on-premises SQL Server database to Azure SQL Database. The database includes a large fact table with hourly updates. You need to minimize downtime during migration. Which Azure service should you use to replicate data continuously?

A.Use Azure SQL Managed Instance as a target

B.Use Azure Data Factory with a copy activity

C.Use Azure Database Migration Service with online migration mode

D.Use Azure Synapse Link for SQL Server

AnswerC

B is correct because DMS online migration uses CDC to replicate changes continuously.

Why this answer

Azure Database Migration Service (DMS) with online migration mode uses continuous change data capture (CDC) to replicate ongoing changes from the source SQL Server to Azure SQL Database with minimal downtime. This is the only option that supports near-zero downtime migration by synchronizing data continuously until the final cutover, which is critical for a large fact table with hourly updates.

Exam trap

The trap here is that candidates confuse Azure Data Factory's copy activity (a batch tool) with continuous replication, or assume Azure SQL Managed Instance inherently supports migration, when in fact DMS online mode is the specific service designed for minimal-downtime migrations.

How to eliminate wrong answers

Option A is wrong because Azure SQL Managed Instance is a target platform, not a migration service; it does not provide continuous data replication from an on-premises source. Option B is wrong because Azure Data Factory with a copy activity is a batch-oriented ETL tool that performs periodic snapshots, not continuous real-time replication, and would require downtime for the final data sync. Option D is wrong because Azure Synapse Link for SQL Server is designed for real-time analytics on operational data in Azure Synapse, not for migrating databases to Azure SQL Database with continuous replication.

Practice this question →

45

MCQeasy

A company is designing a data storage solution for IoT device telemetry data. The data is append-only, needs to be stored cost-effectively for long-term analytics, and must support querying by device ID and timestamp. Which Azure storage solution should they use?

A.Azure Data Lake Storage Gen2

B.Azure Cosmos DB

C.Azure SQL Database

D.Azure Blob Storage with hot access tier

AnswerA

ADLS Gen2 is designed for big data analytics, supports hierarchical namespace, and is cost-effective for long-term storage.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it combines the cost-effective, append-only blob storage of Azure Blob Storage with a hierarchical namespace that enables directory-level operations and POSIX-like access control. This makes it ideal for storing large volumes of IoT telemetry data at low cost while supporting efficient querying by device ID and timestamp through partition pruning in tools like Azure Synapse Analytics or Apache Spark.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage with ADLS Gen2, assuming that Blob Storage alone supports hierarchical namespace and efficient querying, when in fact ADLS Gen2 is required for the hierarchical namespace and POSIX-like directory structure that enables partition pruning and cost-effective analytics on append-only data.

How to eliminate wrong answers

Option B (Azure Cosmos DB) is wrong because it is a NoSQL database optimized for low-latency, transactional workloads with flexible schemas, not for cost-effective long-term storage of append-only telemetry data; its RU-based pricing model becomes prohibitively expensive for high-volume, append-only IoT data. Option C (Azure SQL Database) is wrong because it is a relational database designed for OLTP workloads with strong consistency and indexing, but its per-core licensing and storage costs are too high for storing petabytes of append-only telemetry data, and it does not natively support the hierarchical namespace needed for efficient partition pruning by device ID and timestamp. Option D (Azure Blob Storage with hot access tier) is wrong because while it provides cost-effective storage, the hot access tier incurs higher storage costs than the cool or archive tiers, and without the hierarchical namespace of ADLS Gen2, querying by device ID and timestamp requires full scans or external indexing, making it less efficient for analytics workloads.

Practice this question →

46

MCQmedium

A data engineering team is designing a batch processing pipeline that reads from Azure Data Lake Storage Gen2, transforms data using Azure Databricks, and writes to Azure Synapse Analytics. The pipeline must process data incrementally and handle late-arriving data up to 2 hours. Which approach should they use to track processed files?

A.Use Blob Storage event triggers to invoke Azure Functions

B.Use Azure Synapse Pipelines with a schedule and full load each time

C.Use Azure Data Factory with watermark columns in the source

D.Store processed file names in a Delta table and compare with source folder listing

AnswerD

Delta table provides a reliable way to track processed files and can be updated incrementally.

Why this answer

Storing processed file names in a Delta table allows the pipeline to track which files have been processed and compare against the source folder to identify new or late-arriving files. This is a common pattern for incremental processing.

Practice this question →

47

MCQeasy

A data engineer needs to store CSV files containing customer data in Azure Blob Storage. The files must be encrypted at rest using a customer-managed key stored in Azure Key Vault. What should they configure?

A.Azure Disk Encryption

B.Azure Storage Firewall

C.Azure Storage Service Encryption (SSE) with customer-managed keys

D.Azure Information Protection

AnswerC

Correct. SSE encrypts data at rest and can use CMK from Key Vault.

Why this answer

Azure Storage Service Encryption (SSE) for Blob Storage encrypts data at rest automatically. By choosing customer-managed keys (CMK) stored in Azure Key Vault, the customer retains control over the encryption keys, meeting the requirement for customer-managed key encryption at rest. SSE with CMK is the correct service for encrypting blobs with a key the customer manages.

Exam trap

The trap here is confusing Azure Disk Encryption (which encrypts VM disks) with Azure Storage Service Encryption (which encrypts blob data), leading candidates to select Option A when the requirement is for blob-level encryption with customer-managed keys.

How to eliminate wrong answers

Option A is wrong because Azure Disk Encryption uses BitLocker or DM-Crypt to encrypt OS and data disks of virtual machines, not the data stored in Azure Blob Storage. Option B is wrong because Azure Storage Firewall controls network access to the storage account via IP rules and virtual network rules, it does not provide encryption at rest. Option D is wrong because Azure Information Protection is a classification and labeling solution for documents and emails, not an encryption mechanism for data at rest in Azure Blob Storage.

Practice this question →

48

MCQmedium

You are designing a solution for a social media company that needs to store user profile data with strong consistency and low latency (under 10 ms) for reads and writes. The data model is simple key-value with occasional queries on secondary attributes. Which Azure data store meets these requirements?

A.Azure Table Storage

B.Azure SQL Database with clustered index on user ID

C.Azure Cache for Redis with persistence

D.Azure Cosmos DB with session consistency and secondary indexes

AnswerD

Cosmos DB provides <10 ms latency, strong consistency, and indexing on any attribute.

Why this answer

Azure Cosmos DB with session consistency and secondary indexes meets the requirements because it provides single-digit-millisecond latency for both reads and writes, supports strong consistency (session consistency offers monotonic reads and writes), and allows efficient queries on secondary attributes via indexing. This makes it ideal for a key-value store with occasional secondary attribute queries, unlike simpler stores that lack indexing or consistency guarantees.

Exam trap

The trap here is that candidates often choose Azure Cache for Redis (Option C) for low latency, overlooking that it lacks secondary indexes for attribute queries and does not provide strong consistency for writes, which are critical for profile data with occasional secondary lookups.

How to eliminate wrong answers

Option A is wrong because Azure Table Storage does not support secondary indexes, making queries on non-key attributes inefficient (full table scans), and its consistency model is only eventual by default, not strong. Option B is wrong because Azure SQL Database, while supporting strong consistency and secondary indexes, typically has higher latency (often 10-30 ms) for simple key-value operations due to relational overhead and network round-trips, failing the under-10-ms requirement. Option C is wrong because Azure Cache for Redis with persistence is an in-memory cache designed for low latency but lacks built-in secondary indexes for querying non-key attributes, and its persistence model (RDB/AOF) does not guarantee strong consistency for writes (e.g., data loss on failover).

Practice this question →

49

MCQhard

A company is migrating an on-premises Hadoop cluster to Azure. The cluster uses Hive tables stored as Parquet files on HDFS. They want to minimize changes to existing Hive queries and continue using HiveQL. Which Azure storage solution should they choose?

A.Azure HDInsight with Hive and Azure Data Lake Storage Gen2

B.Azure SQL Database with PolyBase

C.Azure Databricks with Delta Lake

D.Azure Synapse Analytics with external tables

AnswerA

HDInsight provides Hive-compatible environment; ADLS Gen2 replaces HDFS seamlessly.

Why this answer

Azure HDInsight with Hive and Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it provides a fully managed Hadoop service that supports HiveQL with minimal changes. ADLS Gen2 offers a hierarchical namespace and is optimized for Hadoop workloads, allowing the existing Parquet files on HDFS to be directly mounted and queried without data movement or schema changes.

Exam trap

The trap here is that candidates often confuse 'supporting HiveQL' with 'running Hive on any Azure service,' but only HDInsight provides the native Hive runtime and HDFS-compatible storage (ADLS Gen2) needed to run existing HiveQL queries unchanged.

How to eliminate wrong answers

Option B is wrong because Azure SQL Database with PolyBase is a relational database engine that does not natively support HiveQL or the Hadoop file system; it would require rewriting queries and converting Parquet files to a relational format. Option C is wrong because Azure Databricks with Delta Lake is an Apache Spark-based platform that, while supporting HiveQL via Spark SQL, introduces Delta Lake's transactional layer, which changes the storage format and requires query modifications; it is not a direct Hive-on-HDFS replacement. Option D is wrong because Azure Synapse Analytics with external tables uses PolyBase to query external data, but it does not run HiveQL natively and requires creating external table definitions, altering the query interface and adding complexity.

Practice this question →

50

MCQhard

Your organization uses Azure Synapse Analytics dedicated SQL pool. You need to implement a solution that reduces storage costs for historical data that is rarely accessed but must be available for querying within minutes. The solution should not require application changes. What should you do?

A.Create external tables pointing to data stored in Azure Data Lake Storage Gen2 with appropriate tiering

B.Use Azure Blob Storage with PolyBase and configure lifecycle management to archive data

C.Drop older partitions and reload data when needed

D.Change the distribution to round-robin for the fact table

AnswerA

External tables allow querying data in Data Lake Storage without moving it, and tiering reduces cost.

Why this answer

Option A is correct because creating external tables in Azure Synapse Analytics dedicated SQL pool that point to data stored in Azure Data Lake Storage Gen2 (ADLS Gen2) allows you to query historical data directly from low-cost storage tiers (e.g., cool or archive) without moving it into the pool. This reduces storage costs for rarely accessed data while keeping it queryable within minutes, and requires no application changes since the external tables are accessed via standard T-SQL queries.

Exam trap

The trap here is that candidates often confuse PolyBase with external tables, assuming PolyBase requires data to be in hot tier or that lifecycle management alone solves the query latency requirement, but they overlook the rehydration delay of archive tier and the need for zero application changes.

How to eliminate wrong answers

Option B is wrong because Azure Blob Storage with PolyBase and lifecycle management to archive data would require application changes to switch query targets and does not natively support querying archived data within minutes (archive tier has a rehydration delay of up to 15 hours). Option C is wrong because dropping older partitions and reloading data when needed is a manual, time-consuming process that violates the requirement of being available for querying within minutes and would require application changes to manage the reload logic. Option D is wrong because changing the distribution to round-robin for the fact table improves query performance for certain workloads but does not reduce storage costs for historical data or address the requirement of tiering rarely accessed data.

Practice this question →

51

MCQmedium

Your company has an Azure Synapse Analytics dedicated SQL pool. You need to implement a solution that automatically moves data between the 'PRIMARY' filegroup and a secondary filegroup based on data age, without manual intervention. Which feature should you use?

A.Materialized views

B.Azure Data Lake Storage tiering

C.PolyBase

D.Partition switching

AnswerD

Partition switching enables efficient data movement between partitions, which can be automated with partition management.

Why this answer

Partition switching in Azure Synapse Analytics dedicated SQL pool allows you to efficiently move data between filegroups by switching partitions between tables. By aligning partitions with data age, you can automatically transfer older data to a secondary filegroup without manual intervention, using a scheduled process like a stored procedure or Azure Data Factory.

Exam trap

The trap here is that candidates confuse data movement between filegroups with storage tiering or external data access, leading them to choose PolyBase or Azure Data Lake Storage tiering instead of recognizing that partition switching is the native mechanism for intra-database data relocation in Synapse dedicated SQL pools.

How to eliminate wrong answers

Option A is wrong because materialized views improve query performance by pre-computing and storing results, but they do not move data between filegroups or manage data lifecycle. Option B is wrong because Azure Data Lake Storage tiering manages data across hot, cool, and archive storage tiers at the storage account level, not within a dedicated SQL pool's filegroups. Option C is wrong because PolyBase is used for querying external data sources (e.g., Azure Blob Storage) using T-SQL, not for moving data between filegroups within the same SQL pool.

Practice this question →

52

MCQmedium

Your team needs to provide near-real-time analytics on IoT sensor data streaming into Azure Event Hubs. The data must be stored in Azure Data Lake Storage Gen2 in Parquet format, partitioned by date and device ID. Which architecture should you implement?

A.Use Azure Stream Analytics with output to Data Lake Storage Gen2, using partitioning by date and device ID.

B.Use Azure Data Factory with a tumbling window trigger to copy data from Event Hubs to Data Lake Storage.

C.Use Azure Functions to read from Event Hubs and write to Data Lake Storage.

D.Use Azure Databricks with Structured Streaming to read from Event Hubs and write to Data Lake Storage.

AnswerA

Stream Analytics provides native partitioning and Parquet output.

Why this answer

Azure Stream Analytics provides native, low-latency processing of streaming data from Event Hubs with direct output to Azure Data Lake Storage Gen2. It supports automatic partitioning by specifying a partition key (e.g., date and device ID) in the output configuration, enabling efficient, near-real-time writes in Parquet format without additional orchestration.

Exam trap

Microsoft often tests the misconception that any service capable of reading from Event Hubs is suitable for near-real-time analytics, ignoring the critical requirements for native partitioning, low latency, and managed checkpointing that Stream Analytics uniquely provides.

How to eliminate wrong answers

Option B is wrong because Azure Data Factory with a tumbling window trigger operates on a batch schedule (minimum 1 minute), not near-real-time, and cannot natively read from Event Hubs as a streaming source. Option C is wrong because Azure Functions, while capable of event-driven processing, lack native support for checkpointing and exactly-once semantics for high-throughput streaming, leading to potential data loss or duplication. Option D is wrong because Azure Databricks Structured Streaming can achieve near-real-time processing but introduces significant overhead (cluster startup, cost, complexity) compared to the simpler, fully managed Stream Analytics solution for this specific use case.

Practice this question →

53

MCQhard

You are troubleshooting slow COPY INTO performance in Azure Synapse Analytics dedicated SQL pool when loading Parquet files from Azure Data Lake Storage Gen2. The files are 1 GB each. What should you do to improve performance?

A.Reduce the file size to 100 MB to increase parallelism

B.Use PolyBase instead of COPY INTO

C.Increase the number of files to match the number of distributions

D.Disable parallel processing in the COPY command

AnswerC

D is correct because each distribution can read a file in parallel, maximizing throughput.

Why this answer

Option C is correct because COPY INTO in Azure Synapse dedicated SQL pool distributes data across 60 distributions. To maximize parallelism, the number of input files should match or exceed the number of distributions. With 1 GB files, you have too few files to fully utilize all distributions, causing some distributions to remain idle.

Increasing the number of files to at least 60 ensures each distribution gets work, improving throughput.

Exam trap

The trap here is that candidates focus on file size reduction (Option A) as a general optimization, but the specific requirement in Synapse dedicated SQL pool is to match the number of files to the number of distributions (60) to avoid distribution skew and maximize parallelism.

How to eliminate wrong answers

Option A is wrong because reducing file size to 100 MB increases the number of files but does not guarantee they match the distribution count; the key is file count, not size, and 100 MB files may still result in fewer than 60 files. Option B is wrong because PolyBase is an older technology that uses external tables and has additional overhead; COPY INTO is the recommended, optimized method for loading Parquet files and is generally faster. Option D is wrong because disabling parallel processing would force sequential loading, drastically reducing performance; COPY INTO inherently uses parallel processing to leverage all distributions.

Practice this question →

54

MCQmedium

A data engineer is designing a solution that uses Azure Data Factory to copy data from an on-premises SQL Server to Azure Synapse Analytics. The data transfer must be encrypted in transit. Which property should be configured in the linked service?

A.ConnectionString with Integrated Security

B.AuthenticateVia

C.EncryptedConnection

D.UseSystemTrustStore

AnswerC

Enables TLS encryption for data transfer.

Why this answer

Option C is correct because the EncryptedConnection property in an Azure Data Factory linked service enforces encryption for data in transit between the on-premises SQL Server and Azure Synapse Analytics. When set to true, it uses TLS/SSL to encrypt the connection, ensuring that data transferred over the network is protected from interception or tampering.

Exam trap

The trap here is that candidates often confuse 'encryption in transit' with authentication methods (like Integrated Security) or certificate validation settings (like UseSystemTrustStore), leading them to select options that address identity or trust rather than the actual encryption of the data channel.

How to eliminate wrong answers

Option A is wrong because ConnectionString with Integrated Security specifies Windows authentication credentials but does not control encryption of the data in transit; it is unrelated to TLS/SSL enforcement. Option B is wrong because AuthenticateVia defines the authentication method (e.g., Managed Identity, Service Principal) for the linked service, not the encryption of the data channel. Option D is wrong because UseSystemTrustStore determines whether to use the system's certificate trust store for validating the server's TLS certificate, but it does not enable or disable encryption itself; encryption must be explicitly set via EncryptedConnection.

Practice this question →

55

Multi-Selecthard

Which TWO strategies can be used to optimize storage costs for historical data in Azure Data Lake Storage Gen2?

Select 2 answers

A.Enable soft delete to recover data

B.Use geo-redundant storage (GRS) for durability

C.Implement lifecycle management policies to move data to archive tier

D.Store data in compressed columnar format like Parquet

E.Encrypt data with Azure Storage Service Encryption

AnswersC, D

Archive tier is cheapest.

Why this answer

Option C is correct because Azure Blob Storage lifecycle management policies allow you to automatically transition data from hot to cool to archive tiers based on age or last modification time. Moving historical data to the archive tier significantly reduces storage costs, as archive is the lowest-cost storage tier, though it incurs higher retrieval latency and costs.

Exam trap

The trap here is that candidates confuse data protection features (soft delete, encryption, replication) with cost optimization strategies, but only tiering and compression directly reduce the amount or cost of stored data.

Practice this question →

56

Multi-Selecthard

Which THREE are best practices for optimizing query performance in Azure Synapse Analytics dedicated SQL pool?

Select 3 answers

A.Use materialized views for complex aggregations

B.Use the largest resource class for all queries

C.Create clustered columnstore indexes

D.Use hash distribution on columns used in JOINs

E.Use round-robin distribution for large fact tables

AnswersA, C, D

Pre-computes results.

Why this answer

Materialized views precompute and store the results of complex aggregations, such as SUM, COUNT, AVG, or GROUP BY operations, in Azure Synapse dedicated SQL pool. When a query references the same aggregation pattern, the optimizer can automatically substitute the materialized view, significantly reducing compute and I/O overhead by avoiding full table scans and recomputation. This is a best practice for improving performance on repetitive analytical workloads.

Exam trap

The trap here is that candidates often confuse resource class with performance optimization, assuming larger resource classes always speed up queries, when in fact they reduce concurrency and can cause resource contention, making them a poor general-purpose best practice.

Practice this question →

57

MCQmedium

You are designing a solution to store streaming data from multiple sources into Azure Data Lake Storage Gen2. The data must be organized by ingestion time and source system. Each source system produces data in a different format: CSV, JSON, and Parquet. The solution must allow efficient querying using Azure Synapse Serverless SQL and must support partitioning on ingestion date. What is the recommended folder structure?

A./data/{date}/{source_system}/ (e.g., /data/2023-01-01/SourceA/)

B./data/{source_system}/{date}/ (e.g., /data/SourceA/2023-01-01/)

C./data/{source_system}/ with files named {timestamp}.csv/.json/.parquet

D./data/{source_system}/{year}/{month}/{day}/ (e.g., /data/SourceA/2023/01/01/)

AnswerB

This structure separates sources and dates, enabling efficient query pruning.

Why this answer

Option B is correct because it places the source system partition first, which aligns with Azure Synapse Serverless SQL's partition elimination behavior when querying by source system. The date partition at the end allows efficient pruning on ingestion date, and the hierarchical folder structure maps directly to Hive-style partitioning, which Synapse Serverless SQL natively supports for CSV, JSON, and Parquet formats.

Exam trap

The trap here is that candidates often choose Option D because they assume year/month/day granularity is required for performance, but Azure Synapse Serverless SQL treats each folder level as a separate partition key, and a single date folder is sufficient for efficient pruning without unnecessary complexity.

How to eliminate wrong answers

Option A is wrong because placing the date before the source system forces Synapse Serverless SQL to scan all source system folders for a given date, preventing efficient partition elimination when filtering by source system. Option C is wrong because flat file naming without folder-level partitioning prevents Synapse Serverless SQL from using partition pruning at all, leading to full scans even for date- or source-filtered queries. Option D is wrong because while it uses Hive-style partitioning, the year/month/day hierarchy is redundant when a single date partition is sufficient; it adds unnecessary folder depth without performance benefit and complicates partition management.

Practice this question →

58

MCQhard

You are a data engineer at a healthcare analytics company. The company stores patient records in an Azure Data Lake Storage Gen2 account organized by /patient/{patientId}/year={yyyy}/month={MM}/day={dd}/*.parquet. There are 10,000 patients, and each patient has about 1 GB of data per year. The data is used by data scientists who run ad-hoc queries using Azure Synapse Serverless SQL. They complain that queries scanning multiple patients over the last year take too long and consume too much data. They often need to filter by patientId and a date range. You need to improve query performance and reduce the amount of data scanned. You cannot change the folder structure because it is used by other processes. What should you do?

A.Reorganize the folder structure to /year={yyyy}/month={MM}/day={dd}/patientId={patientId}/*.parquet.

B.Convert the Parquet files to CSV format to improve compression and reduce file size.

C.Create views that aggregate data by patient and date, and instruct data scientists to query the views.

D.Create external tables in Synapse Serverless SQL that use the folder structure as partitions, and ensure queries filter on year, month, and day.

AnswerD

External tables with partition elimination reduce scanned data.

Why this answer

Option D is correct because creating external tables in Synapse Serverless SQL with the existing folder structure as partitions allows the query engine to perform partition elimination. When queries filter on year, month, and day, Synapse Serverless SQL will only scan the relevant folders, drastically reducing data scanned and improving performance. This approach does not require changing the folder structure, which is used by other processes.

Exam trap

The trap here is that candidates may think views can improve performance, but in Synapse Serverless SQL, views are non-materialized and do not reduce data scanned unless the underlying data is partitioned and queries filter on partition columns.

How to eliminate wrong answers

Option A is wrong because it suggests reorganizing the folder structure, which is explicitly prohibited by the requirement that the folder structure cannot be changed. Option B is wrong because converting Parquet to CSV would increase file size (CSV is not compressed by default and lacks columnar storage benefits), worsening performance and data scanned. Option C is wrong because creating views does not change the underlying data layout or partition elimination; views are just saved queries and do not reduce the amount of data scanned unless they are materialized, which Synapse Serverless SQL does not support.

Practice this question →

59

Multi-Selecteasy

Which TWO features are available in Azure Data Lake Storage Gen2 but not in Azure Blob Storage? (Choose two.)

Select 2 answers

A.Hierarchical namespace

B.Immutable storage

C.Soft delete for blobs

D.Lifecycle management policies

E.POSIX-compliant access control lists

AnswersA, E

B is correct because hierarchical namespace enables directory structures.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) extends Azure Blob Storage by adding a hierarchical namespace, which organizes objects into a directory structure similar to a file system. This enables efficient directory-level operations (e.g., renaming or deleting a directory in O(1) time) and supports POSIX-compliant access control lists (ACLs) for fine-grained permissions. These two features are not available in standard Azure Blob Storage, which uses a flat namespace and only supports container-level access policies.

Exam trap

The trap here is that candidates often assume features like soft delete or lifecycle management are exclusive to ADLS Gen2, when in fact they are shared with Blob Storage, while the hierarchical namespace and POSIX ACLs are the true differentiators.

Practice this question →

60

Multi-Selectmedium

Which TWO of the following are valid methods to load data into a dedicated SQL pool in Azure Synapse Analytics?

Select 2 answers

A.BULK INSERT

B.Azure Data Factory Copy Activity

C.COPY statement

D.Azure Import/Export service

E.PolyBase

AnswersC, E

COPY is a highly recommended loading method.

Why this answer

The COPY statement is a first-class, high-throughput ingestion method for dedicated SQL pools in Azure Synapse Analytics. It provides a simple and flexible way to load data from Azure Data Lake Storage Gen2 or Azure Blob Storage with built-in error handling, automatic schema inference, and support for various file formats (CSV, Parquet, ORC). Unlike PolyBase, the COPY statement does not require external tables and is optimized for performance with parallel loading.

Exam trap

The trap here is that candidates often confuse the COPY statement with BULK INSERT, assuming both are valid for dedicated SQL pools, but BULK INSERT is only supported in SQL Server and Azure SQL Database, not in Synapse dedicated SQL pools.

Practice this question →

61

Multi-Selecthard

Which THREE of the following are required to configure a managed private endpoint for Azure Data Factory when connecting to an Azure SQL Database that has a private endpoint?

Select 3 answers

A.The managed identity of Azure Data Factory must be granted access to the private endpoint.

B.The private endpoint must be in the same subscription as the Azure Data Factory.

C.A private DNS zone for Azure SQL Database must be linked to the virtual network.

D.The Azure SQL Database firewall must have 'Allow Azure services and resources to access this server' enabled.

E.The private endpoint must be in the same region as the Azure Data Factory.

AnswersC, D, E

Required for name resolution.

Why this answer

Option C is correct because when using a managed private endpoint for Azure Data Factory to connect to an Azure SQL Database that has a private endpoint, a private DNS zone (privatelink.database.windows.net) must be linked to the virtual network. This ensures that DNS resolution for the Azure SQL Database private endpoint resolves to the private IP address within the virtual network, enabling connectivity over the private link without relying on public endpoints.

Exam trap

The trap here is that candidates often confuse granting the managed identity access to the private endpoint (which is unnecessary) with granting it access to the Azure SQL Database itself, or assume that the private endpoint must be in the same subscription or region as the Data Factory, when in fact only the virtual network region matters and cross-subscription private endpoints are supported with proper approvals.

Practice this question →

62

Multi-Selecthard

Your company uses Azure Synapse Analytics for a data warehouse. The fact table is 500 GB and distributed by hash on CustomerID. You notice that queries joining the fact table with the Customer dimension table are slow due to data movement. The Customer dimension table is 10 GB. Which THREE actions should you take to improve query performance?

Select 3 answers

A.Use heap index on the fact table

B.Change the fact table distribution to round-robin

C.Change the fact table distribution to hash on CustomerID

D.Implement partitioning on the fact table by date

E.Replicate the Customer dimension table to all compute nodes

AnswersC, D, E

Hash distribution on the join key colocates data with the dimension.

Why this answer

Option C is correct because the fact table is already distributed by hash on CustomerID, which is the join key with the Customer dimension table. This co-locates matching rows on the same compute node, eliminating data movement during joins. Keeping this distribution is essential for performance; changing it would break co-location and increase shuffle overhead.

Exam trap

The trap here is that candidates might think changing the distribution method (e.g., to round-robin) would help, but they overlook that the fact table is already correctly hash-distributed on the join key, and the real issue is the dimension table not being replicated, causing unnecessary data movement.

Practice this question →

63

MCQhard

You are designing a near-real-time analytics pipeline for a retail company. Transaction data is generated in Azure SQL Database and must be replicated to Azure Synapse Analytics (dedicated SQL pool) with less than 5 minutes latency. The source table has 50 million rows and 200 columns, but only 30 columns are needed for analytics. Which approach should you recommend?

A.Use Azure SQL Database Change Tracking and push changes to Azure Event Hubs, then use Azure Stream Analytics to write to Synapse.

B.Enable Change Data Capture (CDC) on the source table and use Azure Data Factory with a 1-minute tumbling window to copy changes into Synapse.

C.Use Azure Synapse PolyBase to directly query the source SQL database every 5 minutes.

D.Schedule a full copy of the entire table every 5 minutes using Azure Data Factory.

AnswerB

CDC captures only changed rows, and ADF can run frequently to meet latency target.

Why this answer

Option B is correct because Azure Data Factory (ADF) with Change Data Capture (CDC) on the source SQL database can incrementally copy only changed rows (inserts, updates, deletes) into Azure Synapse Analytics using a 1-minute tumbling window, meeting the sub-5-minute latency requirement while minimizing data volume. This approach efficiently handles 50 million rows by transferring only the 30 needed columns, avoiding full table scans and reducing network load.

Exam trap

The trap here is that candidates often confuse Change Tracking (which only tracks that a row changed, not the actual changes) with Change Data Capture (which captures the before-and-after values), leading them to choose Option A without realizing the missing push mechanism and the need for additional services to achieve near-real-time replication.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database Change Tracking does not natively push changes to Event Hubs; it requires custom logic or additional services (e.g., Azure Functions) to bridge the gap, adding complexity and potential latency that may not guarantee sub-5-minute replication. Option C is wrong because PolyBase in Synapse is designed for batch querying of external data sources, not for near-real-time incremental replication; querying the source SQL database every 5 minutes would perform full table scans on 50 million rows, causing high source database load and failing to meet latency requirements. Option D is wrong because scheduling a full copy of the entire 50-million-row table every 5 minutes is extremely inefficient, consumes excessive bandwidth and Synapse storage resources, and would likely exceed the latency window due to the time required for a full data transfer.

Practice this question →

64

Multi-Selectmedium

You are designing a data storage solution for a manufacturing company that collects sensor data from machines. The data is stored in Azure Data Lake Storage Gen2. You need to ensure that the solution can handle large volumes of streaming data (up to 100 MB/s) and provide real-time dashboards. Which TWO services should you include?

Select 2 answers

A.Azure Analysis Services

B.Azure Data Factory

C.Azure Databricks

D.Azure Stream Analytics

E.Azure Event Hubs

AnswersD, E

Stream Analytics processes streaming data and can output to real-time dashboards.

Why this answer

Azure Stream Analytics is correct because it is a real-time analytics service designed to process high-velocity streaming data (up to 100 MB/s) from sources like Event Hubs and output to dashboards and storage. It provides low-latency, SQL-based querying for real-time dashboards, making it ideal for manufacturing sensor data scenarios.

Exam trap

Microsoft often tests the misconception that Azure Databricks is a real-time dashboard service, but it is primarily a processing engine that requires additional integration for dashboard output, whereas Stream Analytics is purpose-built for direct, low-latency dashboarding.

Practice this question →

65

MCQmedium

A data engineer is designing a solution to store historical sales data for a retail company. The data is append-only and accessed infrequently for compliance reports. The solution must minimize storage costs while allowing retrieval within 24 hours. Which storage tier should be used for the data?

A.Hot tier

B.Archive tier

C.Cool tier

D.Premium tier

AnswerC

Cost-effective for infrequently accessed data with retrieval within hours.

Why this answer

The Cool tier is the correct choice because it is optimized for data that is infrequently accessed and stored for at least 30 days, offering low storage costs with retrieval times in the range of seconds to hours, which meets the 24-hour retrieval requirement. The data is append-only and used for compliance, so the Cool tier balances cost and accessibility without the high retrieval costs or long rehydration delays of the Archive tier.

Exam trap

The trap here is that candidates often choose the Archive tier because it has the lowest storage cost, overlooking the rehydration latency and the fact that retrieval within 24 hours is not guaranteed with standard priority rehydration, especially under heavy demand.

How to eliminate wrong answers

Option A is wrong because the Hot tier is designed for frequently accessed data and has higher storage costs, which would unnecessarily increase expenses for infrequently accessed compliance data. Option B is wrong because the Archive tier has the lowest storage cost but requires a rehydration process that can take up to 15 hours (and often longer), which may not guarantee retrieval within 24 hours and incurs significant read and data retrieval costs. Option D is wrong because the Premium tier is for high-performance, low-latency access (e.g., for transactional or real-time workloads) and is the most expensive option, making it unsuitable for cost-minimized, infrequently accessed historical data.

Practice this question →

66

MCQmedium

You are reviewing a copy job configuration in Azure Data Factory that copies Parquet files from Azure Data Lake Storage Gen2 to Azure Synapse Analytics. The exhibit shows the job settings. If the source folder contains a file that is not in Parquet format (e.g., a CSV file), what will happen?

A.The copy job will skip the CSV file and stop.

B.The copy job will fail with an error.

C.The copy job will skip the CSV file and continue copying other Parquet files.

D.The copy job will attempt to read the CSV file as Parquet and may produce corrupt data.

AnswerC

skipIncompatibleFiles=true causes skipping non-Parquet files.

Why this answer

When using Azure Data Factory's Copy Activity with a wildcard file path or a dataset that filters for Parquet files (e.g., *.parquet), the service evaluates the file pattern before attempting to read the file. If a CSV file is present in the same folder but does not match the Parquet filter, ADF simply ignores it and continues processing only the matching Parquet files. This behavior is by design to allow flexible file selection without causing failures.

Exam trap

The trap here is that candidates assume ADF will attempt to read all files in a folder regardless of extension, leading them to choose Option D (corrupt data) or Option B (failure), when in fact ADF respects the file pattern filter and silently skips non-matching files.

How to eliminate wrong answers

Option A is wrong because the copy job does not stop after skipping a non-matching file; it continues processing remaining files that match the filter. Option B is wrong because the copy job does not fail with an error when encountering a non-Parquet file; it only fails if the file matches the pattern but cannot be parsed as Parquet. Option D is wrong because ADF does not attempt to read a CSV file as Parquet when the file pattern explicitly excludes it; the file is simply not processed.

Practice this question →

67

MCQhard

You are designing a solution to store telemetry data from millions of devices. Each device sends a JSON payload every 5 seconds. The data must be partitioned by device ID and time for efficient querying and must support real-time streaming ingestion. Which Azure storage solution should you recommend?

A.Azure SQL Database

B.Azure Blob Storage with Azure Event Hubs

C.Azure Cosmos DB

D.Azure Table Storage

AnswerC

Correct. Cosmos DB supports real-time ingestion, automatic partitioning, and low-latency queries.

Why this answer

Azure Cosmos DB is the correct choice because it offers a multi-model, globally distributed database service with native support for real-time streaming ingestion via the Change Feed and automatic indexing. Its partition key design (device ID + time) enables efficient querying across millions of devices, and it guarantees single-digit millisecond read/write latencies essential for telemetry data arriving every 5 seconds.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage with Event Hubs as a streaming solution, but Blob Storage is not designed for real-time, low-latency writes from millions of devices, and the combination adds unnecessary complexity and latency compared to Cosmos DB's native streaming support.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database is a relational database that does not natively support real-time streaming ingestion at the scale of millions of devices every 5 seconds, and its partitioning capabilities are limited compared to Cosmos DB's horizontal scaling. Option B is wrong because Azure Blob Storage is an object store optimized for large, unstructured data, not for low-latency, high-frequency writes from millions of devices, and while Event Hubs can ingest streams, the combination requires additional processing to write to Blob Storage, adding latency and complexity. Option D is wrong because Azure Table Storage is a NoSQL key-value store that lacks native support for real-time streaming ingestion, automatic indexing, and the flexible querying capabilities needed for time-based and device-ID-based queries at this scale.

Practice this question →

68

MCQmedium

You need to partition a large Azure SQL Database table by date to improve query performance and manageability. Which partitioning strategy should you use?

A.CREATE PARTITION FUNCTION myDateRange (datetime2) AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', ...)

B.CREATE PARTITION FUNCTION myDateRange (datetime2) AS RANGE LEFT FOR VALUES ('2023-01-01', '2023-02-01', ...)

C.Use Azure SQL Database automatic partitioning feature

D.Apply a clustered columnstore index with partitioning

AnswerA

Why this answer

Option A is correct because `RANGE RIGHT` ensures that each boundary value belongs to the right partition, which is the standard approach for date-based partitioning. This means values less than '2023-01-01' go into the first partition, values >= '2023-01-01' and < '2023-02-01' go into the second, and so on, aligning with typical date range queries and simplifying partition management (e.g., switching out old partitions).

Exam trap

The trap here is that candidates often confuse `RANGE LEFT` and `RANGE RIGHT`, mistakenly thinking `LEFT` is the default or more natural for date ranges, when in fact `RANGE RIGHT` is the standard for non-overlapping, sliding-window date partitions.

Why the other options are wrong

B

RANGE LEFT includes the boundary value in the left partition, which can lead to uneven data distribution.

C

Azure SQL Database does not have automatic partitioning; you must create it manually.

D

Columnstore indexes are for analytics, not the primary partitioning mechanism.

Practice this question →

69

MCQhard

You are responsible for securing an Azure Synapse Analytics workspace. The workspace contains dedicated SQL pools and serverless SQL pools. You need to ensure that only users with specific Microsoft Entra ID roles can query serverless SQL pools, while dedicated SQL pools use SQL authentication. What should you do?

A.Create SQL logins for serverless SQL pools and assign permissions to Microsoft Entra ID groups.

B.Configure serverless SQL pools to use Microsoft Entra ID authentication only, and dedicated SQL pools to allow SQL authentication.

C.Use a managed identity for serverless SQL pools and assign it to Microsoft Entra ID roles.

D.Disable SQL authentication for all pools and enforce Microsoft Entra ID authentication only.

AnswerB

Serverless SQL pools support only Entra ID; dedicated SQL pools can use both.

Why this answer

Option B is correct because serverless SQL pools in Azure Synapse Analytics natively support Microsoft Entra ID authentication, allowing you to restrict access to specific Entra ID roles by disabling SQL authentication for those pools. Dedicated SQL pools can independently be configured to allow SQL authentication, enabling coexistence of both authentication methods as required. This separation ensures that only users with the designated Entra ID roles can query serverless SQL pools, while dedicated SQL pools remain accessible via SQL logins.

Exam trap

The trap here is that candidates may assume SQL logins can be created for serverless SQL pools (Option A) or that managed identities can be used to control user access (Option C), when in fact serverless pools only support Microsoft Entra ID authentication and managed identities are for service principals, not user permissions.

How to eliminate wrong answers

Option A is wrong because serverless SQL pools do not support SQL logins; they rely exclusively on Microsoft Entra ID authentication, so creating SQL logins for them is not possible. Option C is wrong because managed identities are used for service-to-service authentication (e.g., connecting to Azure Storage), not for granting user access to serverless SQL pools via Entra ID roles. Option D is wrong because it would disable SQL authentication entirely, contradicting the requirement that dedicated SQL pools must use SQL authentication.

Practice this question →

70

Multi-Selecteasy

You are designing a data storage solution for a real-time dashboard that displays streaming data from Azure Event Hubs. The data must be stored in a format that supports both real-time and batch analytics with minimal latency. Which TWO technologies should you use?

Select 2 answers

A.Azure Stream Analytics

B.Azure Data Factory

C.Azure Synapse Analytics

D.Azure Analysis Services

E.Azure SQL Database

AnswersA, C

Stream Analytics processes streaming data in real-time.

Why this answer

Azure Stream Analytics is correct because it provides real-time stream processing with low latency, directly ingesting data from Event Hubs and outputting to storage or analytics services. It enables both real-time dashboard queries and batch analytics by writing to a staging store like Azure Data Lake Storage, which can then be queried by Azure Synapse Analytics for historical analysis.

Exam trap

The trap here is that candidates often confuse Azure Data Factory as a real-time processing tool, but it is strictly a batch orchestration service with no native stream processing capability.

Practice this question →

71

MCQhard

You are migrating an on-premises SQL Server database to Azure. The database has a large fact table (500 GB) and several dimension tables (10 GB total). Reporting queries join the fact table with dimension tables and aggregate by date. Which Azure service and table design should you recommend to minimize query latency?

A.Azure Synapse SQL Pool with replicated tables for both fact and dimension tables

B.Azure SQL Database Hyperscale with columnstore indexes

C.Azure Synapse SQL Pool with hash distribution on the fact table's foreign key and round-robin for dimension tables

D.Azure SQL Database with rowstore indexes and a single database

AnswerC

Hash distribution enables co-location joins, improving query performance.

Why this answer

Option C is correct because Azure Synapse SQL Pool with hash distribution on the fact table's foreign key ensures that related rows from the fact and dimension tables are co-located on the same compute node, minimizing data movement during joins. Round-robin distribution for the small dimension tables is appropriate since they are under 1 GB each and can be broadcast to all nodes, further reducing shuffle overhead. This design optimizes parallel query execution for large fact table aggregations by date.

Exam trap

The trap here is that candidates often confuse replicated tables as a universal performance booster, not realizing that replicating a large fact table is impractical and that hash distribution on the join key is the correct pattern for large fact tables in a distributed MPP environment.

How to eliminate wrong answers

Option A is wrong because replicated tables are designed for small dimension tables (typically under 1 GB), but replicating a 500 GB fact table would consume excessive storage and cause high replication overhead, defeating the purpose of minimizing query latency. Option B is wrong because Azure SQL Database Hyperscale is optimized for transactional workloads with high concurrency and large database sizes, not for large-scale analytical aggregations across a massive fact table; columnstore indexes help but the single-node architecture cannot match the distributed parallel processing of Synapse SQL Pool. Option D is wrong because rowstore indexes and a single database instance lack the distributed compute and storage needed to efficiently join and aggregate a 500 GB fact table, leading to high I/O and long query times.

Practice this question →

72

Multi-Selectmedium

Which TWO Azure services can be used to implement a data lakehouse architecture with Delta Lake?

Select 2 answers

A.Azure Databricks

B.Azure Data Factory

C.Azure Cosmos DB

D.Azure Synapse Analytics serverless SQL pool

E.Azure SQL Database

AnswersA, D

Databricks is the primary platform for Delta Lake.

Why this answer

Azure Databricks is correct because it provides a unified analytics platform with native Delta Lake support, enabling ACID transactions, schema enforcement, and time travel on data lakes. Delta Lake is an open-source storage layer that brings reliability to data lakes, and Azure Databricks is the primary service for running Delta Lake workloads at scale.

Exam trap

The trap here is that candidates often assume Azure Data Factory can implement a data lakehouse because it can copy data to ADLS, but they miss that a data lakehouse requires a compute engine (like Spark or serverless SQL) and a transactional storage layer (Delta Lake), not just data movement.

Practice this question →

73

Drag & Dropmedium

Drag and drop the steps to implement Azure Data Lake Storage Gen2 lifecycle management to move data to cool and archive tiers into the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps

Order

Why this order

After creating the account, define a lifecycle rule with conditions and actions, then apply it.

Practice this question →

74

MCQeasy

You are an administrator for an Azure Synapse Analytics dedicated SQL pool. You execute the T-SQL statements shown in the exhibit. The external table 'dbo.Orders' is created. Which statement about querying this external table is true?

A.Querying the external table automatically imports data into a round-robin distribution.

B.The table cannot be queried until the data is imported into the dedicated SQL pool.

C.You can query the external table using standard T-SQL SELECT statements.

D.You must first create a PolyBase external table before querying.

AnswerC

External tables support SELECT queries.

Why this answer

Option C is correct because an external table in Azure Synapse Analytics dedicated SQL pool is a read-only abstraction over data stored externally (e.g., in Azure Blob Storage or Azure Data Lake Store). You can query it directly using standard T-SQL SELECT statements without importing data into the pool, leveraging PolyBase to push down predicate filtering and read only the required data.

Exam trap

The trap here is that candidates often assume external tables require an explicit import step before querying, but in reality, PolyBase allows direct querying of external data without any data movement into the dedicated SQL pool.

How to eliminate wrong answers

Option A is wrong because querying an external table does not automatically import data into a round-robin distribution; external tables remain external and data is not stored in the pool unless you explicitly use CREATE TABLE AS SELECT (CTAS) to import it. Option B is wrong because the external table can be queried immediately after creation without importing data; the data stays in external storage and is accessed on-the-fly by PolyBase. Option D is wrong because the T-SQL statements in the exhibit already create a PolyBase external table (using CREATE EXTERNAL TABLE with an external data source and file format), so no additional PolyBase external table creation is needed before querying.

Practice this question →

75

Multi-Selectmedium

You are designing a data storage solution for a healthcare company that must comply with HIPAA. The solution needs to store structured patient records and unstructured medical images. Data must be encrypted at rest and in transit. Which TWO storage solutions meet these requirements?

Select 2 answers

A.Azure Redis Cache

B.Azure Blob Storage

C.Azure Table Storage

D.Azure Cosmos DB

E.Azure SQL Database

AnswersB, E

Blob Storage supports encryption at rest and in transit, and is suitable for images.

Why this answer

Azure Blob Storage supports storing unstructured data like medical images and offers encryption at rest via Storage Service Encryption (SSE) and in transit via HTTPS/TLS. It is HIPAA-eligible when configured with appropriate access controls and logging, making it suitable for the unstructured image component of the solution.

Exam trap

The trap here is that candidates may choose Cosmos DB or Table Storage for structured data, overlooking that Azure SQL Database is the preferred HIPAA-compliant relational store for structured patient records, while Blob Storage is the correct choice for large unstructured images.

Practice this question →

Page 1 of 3 · 191 questions totalNext →

Ready to test yourself?

Try a timed practice session using only Design Implement Data Storage questions.

Start 20-question session