CCNA Design and implement data storage Questions — Page 2 of 3

MCQeasy

You are a data engineer at a financial services company. The company uses Azure Cosmos DB for NoSQL to store customer transaction data. The data is partitioned by customerId. The application team needs to run analytical queries that aggregate transactions by date across all customers. These queries are currently slow and consume high RUs. You need to enable faster analytical queries without impacting the transactional workload. What should you do?

A.Increase the provisioned RU/s on the container to handle both transactional and analytical queries.

B.Change the partition key to /date to optimize for analytical queries.

C.Create a materialized view using the change feed and store aggregated data in a separate container.

D.Enable the Azure Cosmos DB analytical store (Synapse Link) and query the data using Azure Synapse Serverless SQL.

AnswerD

Analytical store provides a separate columnar store for efficient analytical queries.

Why this answer

Option D is correct because enabling the Azure Cosmos DB analytical store (Synapse Link) creates a separate column-oriented store optimized for large-scale analytical queries without consuming RUs from the transactional workload. By querying this analytical store using Azure Synapse Serverless SQL, you can run fast aggregations across all customers by date while the transactional container remains unaffected.

Exam trap

The trap here is that candidates may think increasing RU/s or changing the partition key is a simpler fix, but the DP-203 exam specifically tests the understanding that analytical workloads must be isolated from transactional workloads using a dedicated analytical store like Synapse Link.

How to eliminate wrong answers

Option A is wrong because simply increasing provisioned RU/s does not separate the analytical workload from the transactional workload; both would still compete for the same throughput, and analytical queries would continue to consume high RUs, potentially throttling transactions. Option B is wrong because changing the partition key to /date would destroy the existing container's partitioning strategy, causing hot partitions for high-volume dates and severely degrading transactional performance for customer-based lookups; partition keys cannot be changed after creation without data migration. Option C is wrong because creating a materialized view using the change feed requires custom code to maintain the view, adds operational complexity, and still consumes RUs on the source container for change feed processing, whereas the analytical store provides a fully managed, zero-ETL solution.

Practice this question →

MCQmedium

You are designing a data storage solution for real-time analytics on IoT telemetry. The system must ingest 10,000 events per second and support sub-second query latency. Which Azure data store should you use?

A.Azure Table Storage.

B.Azure SQL Database with in-memory OLTP.

C.Azure Cosmos DB with analytical store.

D.Azure Data Explorer (ADX).

AnswerD

ADX is designed for real-time analytics on large volumes of streaming data with sub-second query latency.

Why this answer

Azure Data Explorer (ADX) is purpose-built for high-velocity telemetry and log analytics, ingesting 10,000+ events per second with sub-second query latency via its columnar storage and distributed query engine. It supports real-time analytics on streaming IoT data without requiring pre-defined schemas or indexing, making it the optimal choice for this scenario.

Exam trap

The trap here is that candidates often confuse Azure Cosmos DB's analytical store with a real-time analytics solution, but it is designed for hybrid transactional/analytical processing (HTAP) on operational data, not for high-velocity streaming telemetry analytics where ADX excels.

How to eliminate wrong answers

Option A is wrong because Azure Table Storage is a NoSQL key-value store designed for simple lookups and batch operations, not for real-time analytics with sub-second query latency on high-velocity streaming data. Option B is wrong because Azure SQL Database with in-memory OLTP is optimized for transactional workloads (OLTP) with high concurrency, not for analytical queries on high-volume telemetry; it cannot sustain 10,000 events per second with sub-second analytical latency. Option C is wrong because Azure Cosmos DB with analytical store is designed for globally distributed operational data with eventual consistency for analytics, but its analytical store uses a separate columnar format optimized for read-heavy workloads, not real-time sub-second queries on streaming IoT telemetry; it introduces higher latency for ingestion and query compared to ADX.

Practice this question →

MCQmedium

You are designing a data storage solution for a retail company that expects high volumes of small, time-series sensor data from thousands of IoT devices. The data must be stored cost-effectively and queried by time range with low latency. Which Azure data store should you recommend?

A.Azure Cosmos DB with analytical store

B.Azure SQL Database with columnstore indexes

C.Azure Blob Storage with Azure Data Lake Storage Gen2

D.Azure Data Explorer (ADX)

AnswerD

ADX is designed for high-speed ingestion and real-time analytics on time-series data with built-in optimization.

Why this answer

Azure Data Explorer (ADX) is optimized for high-volume, time-series data ingestion and low-latency queries over time ranges. It uses a columnar storage engine and automatic indexing, making it cost-effective for sensor data from thousands of IoT devices.

Exam trap

The trap here is that candidates often choose Azure Blob Storage with Data Lake Storage Gen2 because it is cheap for storage, but they overlook that it lacks native query capabilities for low-latency time-series queries, requiring additional compute layers.

How to eliminate wrong answers

Option A is wrong because Azure Cosmos DB with analytical store is designed for globally distributed, multi-model data with operational and analytical workloads, not specifically for high-throughput time-series ingestion and time-range queries at low cost. Option B is wrong because Azure SQL Database with columnstore indexes is a relational database optimized for transactional workloads and analytical queries on structured data, but it cannot efficiently handle the high ingestion rates and time-series query patterns of thousands of IoT devices without significant cost and performance overhead. Option C is wrong because Azure Blob Storage with Azure Data Lake Storage Gen2 is a hierarchical file storage for big data analytics, not a query engine; it requires additional compute services like Azure Synapse or Spark to query time-series data, adding latency and complexity.

Practice this question →

Multi-Selecthard

Which THREE factors should be considered when designing a partitioning strategy for a large fact table in Azure Synapse Analytics dedicated SQL pool?

Select 3 answers

A.Index type

B.Data compression

C.Distribution key

D.Partition column

E.Number of partitions

AnswersB, D, E

Partition size affects compression efficiency.

Why this answer

Data compression is a key factor in partitioning strategy because it directly impacts storage efficiency and query performance. In Azure Synapse Analytics dedicated SQL pool, columnstore compression is applied per partition, and choosing the right compression level (e.g., COLUMNSTORE vs. COLUMNSTORE_ARCHIVE) can significantly reduce I/O and improve scan speeds for large fact tables.

Exam trap

The trap here is that candidates confuse distribution key (which controls data placement across nodes) with partition column (which controls data organization within a node), leading them to incorrectly select distribution key as a partitioning factor.

Practice this question →

Drag & Dropmedium

Drag and drop the steps to configure Azure Databricks auto-scaling cluster for ETL workloads into the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps

Order

Why this order

First create the workspace, then the cluster with auto-scaling settings, choose runtime, attach libraries, and set policies.

Practice this question →

MCQmedium

Refer to the exhibit. An ARM template deploys an Azure Synapse Analytics workspace. What is the purpose of the 'managedVirtualNetwork' property set to 'default'?

A.It configures the workspace to use a user-assigned managed identity

B.It disables public network access to the workspace

C.It creates the workspace in a private endpoint configuration

D.It enables the workspace to use a managed virtual network for network isolation

AnswerD

Managed VNet provides isolation.

Why this answer

Setting 'managedVirtualNetwork' to 'default' in an ARM template for Azure Synapse Analytics enables a managed virtual network that provides network isolation for the workspace. This allows the workspace to use private endpoints and managed private endpoints for secure data integration without exposing traffic to the public internet. It is a foundational setting for implementing a secure, network-isolated Synapse environment.

Exam trap

The trap here is that candidates confuse 'managedVirtualNetwork' with directly creating private endpoints or disabling public access, when in fact it is the prerequisite that enables the workspace to use a managed virtual network for network isolation, with private endpoints and public network access controls being separate configurations.

How to eliminate wrong answers

Option A is wrong because the 'managedVirtualNetwork' property controls network isolation, not identity configuration; user-assigned managed identities are configured via the 'identity' property in the ARM template. Option B is wrong because disabling public network access is a separate setting (e.g., 'publicNetworkAccess' property), not the purpose of 'managedVirtualNetwork'. Option C is wrong because setting 'managedVirtualNetwork' to 'default' does not directly create the workspace in a private endpoint configuration; it enables the managed virtual network, and private endpoints must be explicitly created within that network for specific resources.

Practice this question →

MCQhard

You are reviewing an Azure Data Factory dataset JSON definition for a data lake. The dataset is used in a copy activity that loads sales data into Azure Data Lake Storage Gen2. The pipeline runs successfully, but you notice that the output file always overwrites the previous file with the name 'sales.parquet' regardless of the folderPath parameter. What is the most likely cause?

A.The folderPath parameter is not being evaluated correctly

B.The linkedServiceName is not correctly configured

C.The fileName property is hardcoded and not parameterized

D.The format type is incorrect; it should be 'Parquet' instead of 'ParquetFormat'

AnswerC

The fileName is fixed to 'sales.parquet', causing overwrites.

Why this answer

Option C is correct because the dataset's fileName property is hardcoded to 'sales.parquet', which means every pipeline run writes to the same file, overwriting it regardless of the folderPath parameter. In Azure Data Factory, if fileName is static and not parameterized, the copy activity will always target that exact file name, even if folderPath changes dynamically.

Exam trap

The trap here is that candidates often focus on folderPath or linked service configuration, overlooking that the fileName property is hardcoded, which is the direct cause of the overwrite behavior.

How to eliminate wrong answers

Option A is wrong because if the folderPath parameter were not evaluated correctly, the pipeline would likely fail or write to an unexpected folder, but the observed behavior is consistent overwriting of the same file, indicating the folderPath is working but the fileName is fixed. Option B is wrong because an incorrectly configured linkedServiceName would cause authentication or connection failures, not a consistent overwrite behavior. Option D is wrong because 'ParquetFormat' is a valid format type in Azure Data Factory datasets for Parquet files; the issue is not about format type but about the fileName property being static.

Practice this question →

MCQhard

A multinational bank needs to store customer transaction records for 10 years to meet regulatory compliance. The data is rarely accessed after the first year. The solution must minimize storage costs while allowing queries on recent data with low latency. Which tiering strategy should you implement?

A.Store all data in Azure SQL Database with partitioning and drop older partitions

B.Use Azure Data Lake Storage Gen2 with a single storage tier

C.Store data in Azure Cosmos DB with time-to-live (TTL) and use Azure Blob Storage for backups

D.Use Azure Blob Storage with lifecycle management to transition from Hot to Cool to Archive tiers

AnswerD

Lifecycle management automates tier transitions, minimizing cost while retaining data.

Why this answer

Option D is correct because Azure Blob Storage lifecycle management automatically transitions blobs from Hot to Cool to Archive tiers based on age, minimizing storage costs for rarely accessed data after the first year while keeping recent data in Hot tier for low-latency queries. This aligns with the 10-year retention requirement and cost optimization goal without manual intervention.

Exam trap

The trap here is that candidates may choose Option C thinking TTL in Cosmos DB can handle retention, but TTL deletes data automatically, which violates regulatory retention requirements, not just cost optimization.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database with partitioning and dropping older partitions permanently deletes data, violating the 10-year regulatory retention requirement. Option B is wrong because Azure Data Lake Storage Gen2 with a single storage tier (e.g., Hot) does not provide automatic cost optimization for rarely accessed data over 10 years, leading to higher costs. Option C is wrong because Azure Cosmos DB with TTL automatically deletes expired data, which cannot be used for long-term retention, and Azure Blob Storage for backups does not replace a tiering strategy for the primary data store.

Practice this question →

MCQeasy

A company wants to ingest streaming data from IoT devices into Azure for real-time analytics. The data must be available for immediate querying and also stored long-term in a cost-effective format. Which Azure service should be used as the primary ingestion endpoint?

A.Azure SQL Database

B.Azure Event Hubs

C.Azure Data Lake Storage Gen2

D.Azure Blob Storage

AnswerB

Optimized for high-throughput streaming ingestion.

Why this answer

Azure Event Hubs is the correct primary ingestion endpoint for streaming IoT data because it is a fully managed, real-time data ingestion service optimized for high-throughput, low-latency event streaming. It can ingest millions of events per second from IoT devices and integrates natively with Azure Stream Analytics and other analytics services for immediate querying, while also supporting long-term retention via Event Hubs Capture to cost-effective storage like Azure Blob Storage or Data Lake Storage.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage or Data Lake Storage as ingestion endpoints because they are cost-effective for storage, but they lack the real-time streaming capabilities and event-ordering guarantees that Event Hubs provides for immediate querying.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database is a relational database designed for OLTP workloads, not for high-volume streaming ingestion; it lacks native support for event streaming protocols like AMQP or Kafka and would create bottlenecks and high costs for real-time IoT data. Option C is wrong because Azure Data Lake Storage Gen2 is a hierarchical file storage optimized for big data analytics and batch processing, not a real-time ingestion endpoint; it cannot natively accept streaming events or provide sub-second querying without an intermediary ingestion service. Option D is wrong because Azure Blob Storage is an object storage service for unstructured data, not designed for real-time event ingestion; it does not support streaming protocols or provide the low-latency, ordered event delivery required for immediate querying.

Practice this question →

MCQeasy

A company stores IoT sensor data in Azure Blob Storage. The data is appended every minute and must be queried in near real-time using a SQL interface. Which Azure service should be used to enable this?

A.Azure Cosmos DB

B.Azure SQL Database

C.Azure Synapse SQL Pool

D.Azure Data Lake Storage Gen2

AnswerD

Correct. Data Lake Storage Gen2 enables SQL queries via Azure Synapse Serverless SQL.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it combines Blob Storage's scalable, append-friendly architecture with a hierarchical namespace and full POSIX-like ACLs, enabling SQL-based querying via Azure Synapse Serverless SQL or PolyBase. The append-blob pattern (every-minute writes) is natively supported, and the SQL interface (e.g., OPENROWSET) can query the data in near real-time without moving it.

Exam trap

The trap here is that candidates confuse 'SQL interface' with a traditional relational database (like Azure SQL Database) or a NoSQL store (like Cosmos DB), missing that ADLS Gen2 paired with Synapse Serverless SQL provides a schema-on-read SQL layer directly over blob data.

How to eliminate wrong answers

Option A is wrong because Azure Cosmos DB is a NoSQL database optimized for low-latency, globally distributed key-value or document workloads, not for SQL-based ad-hoc querying of append-only blob data. Option B is wrong because Azure SQL Database is a relational OLTP engine that requires schema-defined tables and transactional ingestion, making it unsuitable for direct, schema-on-read queries over raw append-blob files. Option C is wrong because Azure Synapse SQL Pool (dedicated) is a massively parallel processing (MPP) data warehouse designed for large-scale batch analytics, not for near real-time queries on continuously appended blob data without complex ingestion pipelines.

Practice this question →

Multi-Selecthard

Which THREE components are required to implement a modern data warehouse architecture on Microsoft Azure using Azure Synapse Analytics?

Select 3 answers

A.Microsoft Purview for data governance and lineage.

B.Power BI as the data visualization layer.

C.Azure Data Lake Storage Gen2 as the data lake.

D.Azure Analysis Services for semantic modeling.

E.Azure Synapse dedicated SQL pool for data warehousing.

AnswersA, C, E

Purview provides metadata management and data discovery.

Why this answer

Options A, C, and D are correct because a modern data warehouse typically uses a data lake for storage (ADLS Gen2), a compute engine for analytics (Synapse SQL pool), and a metadata catalog (Purview). Option B is wrong because Power BI is a visualization tool, not required. Option E is wrong because Azure Analysis Services is optional for semantic modeling.

Practice this question →

MCQmedium

A company is designing a data lake on Azure Data Lake Storage Gen2. Data comes from multiple sources with varying schemas. The team must minimize storage costs while keeping all data available for future processing. Which storage tier should they use for the raw ingested data?

A.Premium tier

B.Archive tier

C.Cool tier

D.Hot tier

AnswerC

Correct. Cool tier balances cost and availability for infrequently accessed data.

Why this answer

The Cool tier is the optimal choice for raw ingested data in a data lake because it offers low storage costs while maintaining low-latency access for future processing. Unlike the Archive tier, Cool tier data is immediately available without the multi-hour rehydration delay, and it is significantly cheaper than the Hot tier for data that is infrequently accessed but must remain online for ETL or batch processing.

Exam trap

The trap here is that candidates often confuse 'minimize storage costs' with 'cheapest tier possible' and select Archive, forgetting that raw data must be immediately accessible for future processing, which Archive cannot provide without significant delay.

How to eliminate wrong answers

Option A is wrong because the Premium tier is designed for high-transaction workloads with sub-millisecond latency, not for cost-efficient storage of raw data; it would incur unnecessary expense. Option B is wrong because the Archive tier requires data to be rehydrated (taking up to 15 hours) before it can be read, making it unsuitable for raw data that must be available for future processing. Option D is wrong because the Hot tier is optimized for frequent access and has the highest storage cost, which contradicts the requirement to minimize storage costs for data that is not accessed often.

Practice this question →

MCQmedium

You execute the above T-SQL in a serverless SQL pool in Azure Synapse Analytics. The external table creation succeeds, but when you query the table, it returns zero rows. The folder 'sales/products/' exists in the container and contains multiple .parquet files. What is the most likely cause?

A.The file format is incorrect; should be DELIMITEDTEXT instead of PARQUET.

B.The LOCATION path in the external table does not match the actual file path.

C.The external data source location uses the wrong endpoint; should use .blob.core.windows.net instead.

D.The credential used in the external data source does not have read permissions.

AnswerB

If the files are in a subfolder or the path is incorrect, no files are read.

Why this answer

Option B is correct because the external table's LOCATION parameter specifies a path relative to the external data source's root. Even though the folder 'sales/products/' exists, the LOCATION must exactly match the subfolder path within the container. A mismatch (e.g., missing a trailing slash, case sensitivity, or an extra prefix) causes the serverless SQL pool to scan no files, returning zero rows.

Exam trap

The trap here is that candidates assume the LOCATION must include the full container path, but it is relative to the external data source's root, so a mismatch in the relative subfolder (e.g., missing a slash or using an absolute path) leads to zero rows without an error.

How to eliminate wrong answers

Option A is wrong because the files are .parquet, so PARQUET is the correct file format; using DELIMITEDTEXT would fail to parse the binary Parquet data. Option C is wrong because serverless SQL pools in Azure Synapse use the .dfs.core.windows.net endpoint (Azure Data Lake Storage Gen2) by default; .blob.core.windows.net is for legacy Blob Storage and would cause a connection error, not zero rows. Option D is wrong because if the credential lacked read permissions, the query would throw an authorization error (e.g., 'Access denied'), not silently return zero rows.

Practice this question →

MCQhard

A company stores sensitive customer data in Azure Data Lake Storage Gen2. They need to implement a data retention policy where data older than 90 days is automatically moved to the 'cold' access tier, and data older than 365 days is deleted. Which Azure feature should be used to automate this?

A.Blob Storage lifecycle management

B.Azure Automation

C.Azure Data Factory

D.Azure Policy

AnswerA

Lifecycle management policies can automatically move blobs between tiers and delete based on age.

Why this answer

Azure Blob Storage lifecycle management is the correct feature because it allows you to define rules that automatically transition blobs to cooler access tiers (like the 'cold' tier) after a specified number of days and delete them after a further period. This directly meets the requirement to move data older than 90 days to the cold tier and delete data older than 365 days, all without custom code or manual intervention.

Exam trap

The trap here is that candidates often confuse Azure Policy (which enforces resource-level compliance) with data lifecycle management, or they assume a general automation tool like Azure Automation is needed when a native, policy-driven feature already exists.

How to eliminate wrong answers

Option B (Azure Automation) is wrong because it is a general-purpose automation service for running PowerShell or Python runbooks, not a native data lifecycle management feature; it would require custom scripting to enumerate blobs, check ages, and perform tier changes or deletions, adding complexity and maintenance overhead. Option C (Azure Data Factory) is wrong because it is an orchestration and data integration service for moving and transforming data between stores, not a policy-based lifecycle management tool; it could be used to copy or delete data but lacks the declarative, rule-based tiering and deletion capabilities of lifecycle management. Option D (Azure Policy) is wrong because it enforces compliance rules on Azure resource configurations (e.g., requiring encryption or specific SKUs) and cannot directly manage blob tier transitions or deletions based on data age.

Practice this question →

Multi-Selecteasy

Which TWO are valid methods to load data into Azure Synapse Analytics dedicated SQL pool?

Select 2 answers

A.BCp utility

B.Azure Data Factory Copy Activity

C.PolyBase

D.BULK INSERT

E.COPY statement

AnswersC, E

Loads from Azure Storage.

Why this answer

PolyBase is a valid method for loading data into Azure Synapse Analytics dedicated SQL pool because it uses the T-SQL language to access and combine data from external sources like Azure Blob Storage or Azure Data Lake Store without needing to move the data first. The COPY statement is also valid as it provides a high-throughput, flexible ingestion mechanism that supports various file formats and error handling options directly via T-SQL.

Exam trap

The trap here is that candidates often confuse BULK INSERT (which is SQL Server-specific) with the COPY statement (which is Synapse-specific), or they mistakenly think Azure Data Factory is a direct loading method rather than an orchestration tool.

Practice this question →

MCQeasy

Which Azure storage solution is best suited for storing large volumes of unstructured data, such as log files and media files, and supports both hierarchical namespace and POSIX-like access control lists?

A.Azure Blob Storage

B.Azure Data Lake Storage Gen2

C.Azure Files

D.Azure SQL Database

AnswerB

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) combines a hierarchical namespace with POSIX-like access control lists (ACLs) on top of Azure Blob Storage. This makes it ideal for storing large volumes of unstructured data (e.g., log files, media files) while supporting fine-grained, POSIX-compliant permissions and directory-level operations that are essential for big data analytics workloads.

Exam trap

The trap here is that candidates often choose Azure Blob Storage because it is the underlying storage for ADLS Gen2, but they overlook the key differentiators—hierarchical namespace and POSIX ACLs—that are exclusive to ADLS Gen2 and not available in standard Blob Storage.

Why the other options are wrong

Blob Storage supports unstructured data but does not provide a hierarchical namespace or POSIX ACLs by default.

Azure Files is for SMB file shares, not optimized for large-scale unstructured data.

Azure SQL Database is a relational database for structured data, not for unstructured data.

Practice this question →

Multi-Selectmedium

Which TWO of the following are recommended practices for designing a data storage solution using Azure Data Lake Storage Gen2?

Select 2 answers

A.Enable soft delete to protect against accidental deletion

B.Use Kerberos authentication for the storage account

C.Use a partition strategy that groups related data together

D.Store all files in a single directory for simplicity

E.Enable anonymous public access for ease of use

AnswersA, C

Soft delete provides a recovery window for deleted data.

Why this answer

Option A is correct because enabling soft delete on Azure Data Lake Storage Gen2 protects against accidental deletion by retaining deleted data for a specified retention period. This allows recovery of blobs or snapshots that were deleted, overwritten, or modified, which is a critical data protection practice for enterprise storage solutions.

Exam trap

Microsoft often tests the misconception that Kerberos is the primary authentication method for Azure Data Lake Storage Gen2, but the correct protocol is Azure AD OAuth 2.0, and candidates may confuse Gen2 with Gen1 or on-premises Hadoop.

Practice this question →

Multi-Selectmedium

Which TWO Azure services can be used to implement a polyglot persistence architecture for an e-commerce application that requires both a relational database for orders and a document database for product catalogs?

Select 2 answers

A.Azure Cache for Redis

B.Azure SQL Database

C.Azure Cosmos DB

D.Azure Table Storage

E.Azure Data Lake Storage Gen2

AnswersB, C

Suitable for relational data like orders.

Why this answer

Azure SQL Database is a relational database service that supports ACID transactions and structured querying, making it ideal for storing and managing e-commerce order data with strong consistency and referential integrity. Azure Cosmos DB is a multi-model NoSQL database that provides document database capabilities with flexible schemas and low-latency access, perfectly suited for product catalogs that require high read throughput and schema evolution. Together, they enable a polyglot persistence architecture by using the best storage model for each workload.

Exam trap

The trap here is that candidates often confuse Azure Table Storage with a document database, but Table Storage is a key-value store without native JSON support or rich querying, whereas Cosmos DB provides a true document database with SQL API and indexing.

Practice this question →

MCQeasy

You run the above query on a table named 'visits' in a dedicated SQL pool. The table has 1 billion rows and is hash-distributed on user_id. The query takes a long time. What is the most likely reason?

A.The query uses a date filter which cannot be pushed down to the distribution.

B.The table is hash-distributed on user_id, but the query uses a different column for aggregation.

C.The table should use a replicated distribution instead of hash distribution.

D.COUNT(DISTINCT) operations are expensive because they require data movement across distributions.

AnswerD

COUNT(DISTINCT) needs to combine distinct values from all distributions.

Why this answer

In a dedicated SQL pool, COUNT(DISTINCT) is inherently expensive because it requires all distinct values to be gathered across distributions before counting. Since the table is hash-distributed on user_id, the distinct count on a different column (likely visit_date or another attribute) forces data shuffling across all distributions to ensure uniqueness, causing significant performance degradation.

Exam trap

The trap here is that candidates often blame the distribution key mismatch (Option B) or filter pushdown (Option A), overlooking the fact that COUNT(DISTINCT) forces a global data movement step regardless of distribution strategy.

How to eliminate wrong answers

Option A is wrong because date filters can be pushed down to distributions in dedicated SQL pool via partition elimination or predicate pushdown, so this is not the primary bottleneck. Option B is wrong because aggregation on a different column than the distribution key does not inherently cause slowness; hash distribution supports aggregation on any column, though it may require partial aggregation per distribution. Option C is wrong because replicated distribution is typically beneficial for small dimension tables (under 2 GB), not for a 1-billion-row fact table, and would cause massive storage overhead and maintenance issues.

Practice this question →

MCQeasy

A logistics company needs to store delivery tracking data that is updated frequently by multiple services. The solution must support transactions across multiple documents and provide real-time analytics. Which Azure service should you recommend?

A.Azure Table Storage

B.Azure Cosmos DB with SQL API

C.Azure Data Lake Storage Gen2

D.Azure Service Bus

AnswerB

Cosmos DB supports transactions via stored procedures and analytical store for analytics.

Why this answer

Azure Cosmos DB with SQL API is the correct choice because it provides multi-document transaction support (ACID within a logical partition) and real-time analytics via its change feed and integrated analytical store. This meets the requirement for frequent updates from multiple services while enabling low-latency reads for analytics, unlike other Azure storage options that lack transactional guarantees across documents or real-time query capabilities.

Exam trap

The trap here is that candidates often confuse Azure Table Storage's single-entity transactions with multi-document support, or mistakenly think Azure Data Lake Storage Gen2 can handle transactional updates, when it is designed for append-heavy, analytical workloads.

How to eliminate wrong answers

Option A is wrong because Azure Table Storage does not support multi-document transactions; it only offers single-entity transactions and lacks the ability to perform ACID operations across multiple documents. Option C is wrong because Azure Data Lake Storage Gen2 is optimized for large-scale batch analytics and data lake workloads, not for transactional updates or real-time analytics on frequently updated data. Option D is wrong because Azure Service Bus is a message broker for decoupling services and asynchronous communication, not a data store for transactional or analytical workloads.

Practice this question →

MCQhard

You have an Azure Data Factory pipeline that loads data from an on-premises SQL Server to an Azure Synapse Analytics dedicated SQL pool. The pipeline uses a staging Azure Blob Storage account. Recently, the pipeline has been failing with timeout errors. You need to ensure the pipeline completes successfully within the scheduled window. What should you do?

A.Increase the staging blob's copy timeout and enable parallel copy in the copy activity

B.Move the staging storage to Azure Data Lake Storage Gen2

C.Increase the dedicated SQL pool's data warehouse units (DWU)

D.Use PolyBase instead of staging for data loading

AnswerA

Increasing timeout allows more time for large data transfers, and parallel copy improves throughput.

Why this answer

The timeout errors occur because the default copy timeout for staging blob storage is insufficient for large data volumes or slow network conditions. Increasing the staging blob's copy timeout and enabling parallel copy in the copy activity directly addresses this by allowing more time for data transfer and leveraging multiple concurrent connections to improve throughput, ensuring the pipeline completes within the scheduled window.

Exam trap

The trap here is that candidates may confuse scaling the destination (Synapse DWUs) or changing storage tiers with fixing the root cause of timeout errors, which is often a configuration issue in the copy activity's staging settings rather than a capacity problem.

How to eliminate wrong answers

Option B is wrong because moving to Azure Data Lake Storage Gen2 does not inherently resolve timeout issues; it provides hierarchical namespace and POSIX permissions but does not change the copy timeout or parallelism settings. Option C is wrong because increasing DWUs scales the dedicated SQL pool's compute resources for query performance, not for data ingestion timeout issues during staging. Option D is wrong because PolyBase is an alternative loading method that bypasses staging but does not address timeout errors in the existing staging-based pipeline; it would require architectural changes and may introduce different performance characteristics.

Practice this question →

Multi-Selectmedium

Which TWO Azure services can be used to implement a data lake architecture for storing structured, semi-structured, and unstructured data?

Select 2 answers

A.Azure Blob Storage with hierarchical namespace

B.Azure SQL Database

C.Azure Synapse Analytics

D.Azure Cosmos DB

E.Azure Data Lake Storage Gen2

AnswersA, E

Can function as a data lake.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is built on top of Azure Blob Storage and adds a hierarchical namespace, enabling a true data lake architecture. It supports storing structured, semi-structured, and unstructured data at scale, with POSIX-like access control lists and directory-level operations. Azure Blob Storage with hierarchical namespace enabled is essentially the same service as ADLS Gen2, providing the same data lake capabilities.

Exam trap

The trap here is that candidates often confuse Azure Synapse Analytics as a storage service for data lakes, when in fact it is a query and analytics engine that relies on ADLS Gen2 or other storage for the underlying data.

Practice this question →

MCQmedium

You are designing a data lake on Azure Data Lake Storage Gen2. The data includes customer PII that must be encrypted at rest using customer-managed keys. Which feature should you enable?

A.Enable double encryption for Azure Storage

B.Configure Azure Storage encryption with customer-managed keys in Azure Key Vault

C.Enable infrastructure encryption

D.Use Azure Customer Lockbox for access control

AnswerB

C is correct because customer-managed keys in Key Vault allow you to control encryption keys.

Why this answer

Option B is correct because Azure Storage encryption with customer-managed keys (CMK) in Azure Key Vault allows you to control the encryption keys used to encrypt data at rest in Azure Data Lake Storage Gen2. This meets the requirement for encrypting PII with customer-managed keys, as CMK provides an extra layer of security by letting you manage key rotation, revocation, and access policies.

Exam trap

The trap here is that candidates confuse 'double encryption' or 'infrastructure encryption' with customer-managed key control, but only CMK gives you direct ownership of the encryption keys for data at rest.

How to eliminate wrong answers

Option A is wrong because double encryption for Azure Storage encrypts data twice (once at the service level and once at the infrastructure level) but does not inherently use customer-managed keys; it can use Microsoft-managed keys. Option C is wrong because infrastructure encryption uses platform-managed keys to encrypt the storage infrastructure, not customer-managed keys for the data itself. Option D is wrong because Azure Customer Lockbox provides access control for Microsoft support engineers to access your data, not encryption at rest with customer-managed keys.

Practice this question →

MCQeasy

You need to design a data storage solution for a batch processing pipeline that processes petabytes of data daily. The data is stored in Parquet format and must be accessible by both Azure Databricks and Azure Synapse Analytics. Which storage solution should you recommend?

A.Azure Data Lake Storage Gen2

B.Azure Files

C.Azure SQL Database

D.Azure Blob Storage

AnswerA

Provides hierarchical namespace, high throughput, and is compatible with Databricks and Synapse.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it combines a hierarchical namespace with Azure Blob Storage's scalable object storage, providing native POSIX-like access control and high throughput for petabyte-scale batch processing. Both Azure Databricks and Azure Synapse Analytics have optimized connectors for ADLS Gen2 that leverage the hierarchical namespace for efficient partition pruning and file listing, which is critical for Parquet-based analytics at this scale.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage (flat namespace) with ADLS Gen2 (hierarchical namespace), assuming both are equivalent for big data analytics, but the hierarchical namespace is a critical differentiator for performance at petabyte scale in batch pipelines.

How to eliminate wrong answers

Option B (Azure Files) is wrong because it is designed for SMB-based file sharing with low-latency access for small-to-medium workloads, not for petabyte-scale batch analytics, and lacks the throughput and native integration with Spark and Synapse pipelines required for Parquet data. Option C (Azure SQL Database) is wrong because it is a relational OLTP store optimized for transactional queries and small row-based operations, not for storing and processing petabytes of Parquet files in a batch pipeline. Option D (Azure Blob Storage) is wrong because while it can store Parquet files, it lacks a hierarchical namespace, which forces Azure Databricks and Synapse to use slower flat namespace listing operations (e.g., ListBlobs) that degrade performance at petabyte scale compared to ADLS Gen2's directory-aware operations.

Practice this question →

100

MCQeasy

Your company stores sensitive customer data in Azure Data Lake Storage Gen2. You need to implement a security solution that prevents unauthorized access from the public internet while allowing access from specific Azure services and on-premises networks. Which feature should you configure?

A.Use private endpoints for all storage accounts.

B.Enable Microsoft Entra ID authentication only.

C.Configure firewall rules and virtual network service endpoints.

D.Use shared access signatures (SAS) with IP restrictions.

AnswerC

Firewall rules and service endpoints restrict access from public internet while allowing trusted services and VNets.

Why this answer

Option C is correct because configuring firewall rules and virtual network service endpoints allows you to restrict access to your Azure Data Lake Storage Gen2 account to only traffic originating from specific virtual networks and trusted Azure services, while blocking all public internet traffic. This provides a network-level security boundary that meets the requirement of preventing unauthorized public access while permitting access from on-premises networks (via VPN/ExpressRoute) and specific Azure services.

Exam trap

The trap here is that candidates often confuse network-level access controls (firewall and service endpoints) with identity-based controls (Entra ID) or delegation mechanisms (SAS), mistakenly thinking authentication alone can block public internet traffic.

How to eliminate wrong answers

Option A is wrong because private endpoints use Azure Private Link to assign a private IP from a virtual network to the storage account, but they do not inherently block public internet access; you must also disable public network access separately, and they do not directly allow access from on-premises networks without additional VPN/ExpressRoute configuration. Option B is wrong because enabling Microsoft Entra ID authentication only controls identity-based access (who can authenticate), not network-level access (where traffic can originate from); it does not prevent public internet traffic from reaching the storage endpoint. Option D is wrong because shared access signatures (SAS) with IP restrictions provide time-limited, delegated access with source IP constraints, but they are designed for granular per-request authorization, not as a permanent network security boundary to block all public internet traffic while allowing Azure services and on-premises networks.

Practice this question →

101

Multi-Selectmedium

Which TWO factors should you consider when choosing between Azure SQL Database and Azure SQL Managed Instance for migrating a legacy application? (Choose two.)

Select 2 answers

A.Support for active geo-replication

B.Authentication using Microsoft Entra ID

C.Need for SQL Server Agent jobs

D.Requirement for cross-database queries

E.Integration with Azure VNet

AnswersC, D

A is correct because SQL Database does not have SQL Agent; Managed Instance does.

Why this answer

Option C is correct because SQL Managed Instance includes full support for SQL Server Agent, which is essential for legacy applications that rely on scheduled jobs, alerts, or automation. Azure SQL Database does not support SQL Server Agent, making Managed Instance the appropriate choice when this feature is required.

Exam trap

The trap here is that candidates often assume VNet integration (Option E) is exclusive to Managed Instance, but Azure SQL Database also supports VNet integration via private endpoints, making it a non-differentiating factor for this migration decision.

Practice this question →

102

MCQeasy

You need to store data that is rarely accessed but must be retained for 10 years for compliance. The data will be accessed occasionally for audits. Which Azure storage tier is the most cost-effective?

A.Cool storage tier

B.Hot storage tier

C.Archive storage tier

D.Premium storage tier

AnswerA

Cool tier is designed for infrequently accessed data with lower storage cost.

Why this answer

The Cool storage tier is designed for data that is infrequently accessed and stored for at least 30 days, making it a cost-effective choice for compliance data accessed occasionally over a 10-year period. It offers lower storage costs than Hot tier while still providing low-latency access for audits, unlike Archive which requires hours to rehydrate.

Exam trap

The trap here is that candidates often choose Archive tier for long-term retention without considering the access latency requirement for audits, assuming 'rarely accessed' means never needing quick access.

How to eliminate wrong answers

Option B (Hot storage tier) is wrong because it is optimized for frequent access and has higher storage costs, making it unnecessarily expensive for rarely accessed compliance data. Option C (Archive storage tier) is wrong because although it has the lowest storage cost, it requires a rehydration process that can take up to 15 hours to access data, which is impractical for occasional audit access. Option D (Premium storage tier) is wrong because it is designed for high-performance, low-latency workloads (e.g., VMs, databases) and has the highest cost, making it unsuitable for rarely accessed archival data.

Practice this question →

103

MCQhard

You need to assign permissions to a service principal so that it can write data to a specific container in Azure Data Lake Storage Gen2, but not delete blobs. The above JSON shows the built-in role 'Storage Blob Data Contributor'. The role includes delete permission in DataActions. What should you do?

A.Create a custom role that includes read and write DataActions but excludes the delete DataAction, then assign that custom role.

B.Assign the Storage Blob Data Contributor role and create a deny assignment that denies delete.

C.Assign the Storage Blob Data Contributor role and use ACLs to deny delete on the container.

D.Assign the Storage Blob Data Contributor role and remove the delete permission at the role assignment scope.

AnswerA

Custom roles allow fine-grained control over permissions.

Why this answer

Option A is correct because Azure RBAC roles are all-or-nothing at the permission level; you cannot selectively remove a single DataAction from a built-in role at assignment time. The only way to grant write access without delete is to create a custom role that explicitly includes the required read and write DataActions (e.g., Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write) and omits the delete DataAction (Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete). This custom role is then assigned to the service principal at the container scope, ensuring it can write data but never delete blobs.

Exam trap

The trap here is that candidates mistakenly believe you can modify a built-in role's permissions at assignment time (Option D) or that ACLs can override RBAC permissions (Option C), when in reality Azure requires a custom role for such granular control.

How to eliminate wrong answers

Option B is wrong because deny assignments cannot be used to selectively block a specific DataAction within a role assignment; deny assignments are designed to block all assignments of a role at a higher scope, not to carve out individual permissions. Option C is wrong because ACLs in Azure Data Lake Storage Gen2 are applied to the data plane for user/group identities, but they cannot override an RBAC role assignment that explicitly grants delete permission; RBAC takes precedence over ACLs when both are present. Option D is wrong because built-in roles like Storage Blob Data Contributor have fixed DataActions that cannot be modified at the role assignment scope; you cannot 'remove' a permission from a built-in role during assignment.

Practice this question →

104

MCQhard

A company uses Azure Synapse Analytics dedicated SQL pool to store sales data. The sales table is partitioned by month and has a clustered columnstore index. Over time, the performance of queries filtering on a specific month has degraded. The data engineer suspects high rowgroup elimination. Which action should be taken to improve performance?

A.Change the table distribution to hash-distributed on the partition key.

B.Reorganize or rebuild the columnstore index on the table.

C.Drop and recreate the affected partitions.

D.Update statistics on the partitioned column.

AnswerB

Reorganizing or rebuilding consolidates small rowgroups, improving partition elimination.

Why this answer

Option B is correct because reorganizing or rebuilding the columnstore index compresses fragmented rowgroups and merges small rowgroups into optimal sizes (typically 102,400 rows per rowgroup). This directly addresses the degraded rowgroup elimination: when rowgroups are too small or fragmented, the engine cannot efficiently skip entire rowgroups during partition-level scans, causing more I/O and slower performance.

Exam trap

The trap here is that candidates confuse rowgroup elimination (a columnstore physical storage concept) with partition elimination (a table design concept), and incorrectly choose partition-related actions like dropping partitions or updating statistics instead of addressing the columnstore index fragmentation directly.

How to eliminate wrong answers

Option A is wrong because changing the distribution to hash-distributed on the partition key does not fix rowgroup fragmentation; distribution affects data movement across distributions, not the internal rowgroup structure of columnstore indexes. Option C is wrong because dropping and recreating affected partitions is an overly aggressive operation that drops data and requires reloading; it does not target the root cause of fragmented rowgroups within the columnstore index. Option D is wrong because updating statistics on the partitioned column improves cardinality estimates for the query optimizer but does not repair the physical rowgroup layout that causes poor rowgroup elimination.

Practice this question →

105

MCQeasy

Which Azure service provides fully managed, serverless relational database capabilities for transactional workloads in a data storage solution?

A.Azure Cosmos DB

B.Azure SQL Database

C.Azure Synapse Analytics

D.Azure Data Lake Storage Gen2

AnswerB

Why this answer

Azure SQL Database is a fully managed, serverless relational database service designed for transactional workloads. It provides built-in high availability, automatic scaling, and pay-per-use billing, making it ideal for OLTP scenarios without the need to manage underlying infrastructure.

Exam trap

The trap here is that candidates confuse 'fully managed serverless relational database' with Azure Cosmos DB (which is serverless but not relational) or Azure Synapse Analytics (which is relational but designed for analytics, not transactions), leading them to overlook the specific OLTP focus of Azure SQL Database.

Why the other options are wrong

Cosmos DB is a NoSQL database, not relational.

Synapse Analytics is for large-scale analytics, not transactional workloads.

Data Lake Storage is for big data analytics, not relational OLTP.

Practice this question →

106

MCQeasy

You need to store log files from multiple applications in a central location for long-term retention and occasional analysis. The data is rarely accessed after 30 days. Which storage solution should you use to minimize cost?

A.Azure Files

B.Azure Data Lake Storage Gen2

C.Azure Cosmos DB

D.Azure Blob Storage (cool or archive tier)

AnswerD

Blob Storage with cool/archive tiers provides low-cost storage for infrequently accessed data.

Why this answer

Azure Blob Storage with cool or archive tier is the most cost-effective solution for storing log files that are rarely accessed after 30 days. The cool tier offers low storage costs with higher access costs, while the archive tier provides the lowest storage cost but requires rehydration for access, making both ideal for long-term retention and occasional analysis.

Exam trap

The trap here is that candidates may choose Azure Data Lake Storage Gen2 for its analytics capabilities, overlooking that blob storage tiers are specifically designed for cost-efficient long-term retention of infrequently accessed data.

How to eliminate wrong answers

Option A is wrong because Azure Files provides fully managed file shares using SMB protocol, which is designed for shared access and active workloads, not for cost-optimized long-term archival storage. Option B is wrong because Azure Data Lake Storage Gen2 is optimized for big data analytics with hierarchical namespace and high-throughput access, incurring higher storage costs than blob storage tiers for infrequently accessed data. Option C is wrong because Azure Cosmos DB is a NoSQL database with low-latency access and global distribution, designed for transactional workloads, not for cost-efficient long-term retention of log files.

Practice this question →

107

MCQeasy

You need to design a storage solution for streaming data from IoT devices. The solution must support real-time analytics and long-term storage for historical analysis. Which combination of Azure services should you use?

A.Azure Queue Storage and Azure Cosmos DB

B.Azure Event Hubs and Azure Blob Storage

C.Azure IoT Hub and Azure SQL Database

D.Azure Event Hubs and Azure Data Lake Storage Gen2

AnswerD

Event Hubs ingests streaming data, and Data Lake Storage stores it for analytics.

Why this answer

Azure Event Hubs is designed for high-throughput, low-latency ingestion of streaming data from IoT devices, supporting real-time analytics via integration with Azure Stream Analytics. Azure Data Lake Storage Gen2 provides hierarchical namespace and POSIX-compliant access for long-term storage, enabling efficient historical analysis with tools like Azure Synapse Analytics or Spark. This combination meets both real-time and historical requirements without the limitations of other options.

Exam trap

The trap here is that candidates often confuse Azure IoT Hub with Event Hubs, assuming IoT Hub is the primary streaming ingestion service, but IoT Hub is for device management and control, not high-throughput event streaming, making Event Hubs the correct choice for this scenario.

How to eliminate wrong answers

Option A is wrong because Azure Queue Storage is a message queuing service for decoupling applications, not optimized for high-throughput streaming ingestion, and Azure Cosmos DB is a NoSQL database for transactional workloads, not designed for long-term analytical storage with cost-effective tiering. Option B is wrong because while Azure Event Hubs handles streaming ingestion, Azure Blob Storage lacks the hierarchical namespace and optimized analytics capabilities of Data Lake Storage Gen2, making it less suitable for large-scale historical analysis with tools like Spark or Hive. Option C is wrong because Azure IoT Hub is primarily for device management and bidirectional communication, not a high-throughput event ingestion service, and Azure SQL Database is a relational OLTP store not designed for streaming ingestion or cost-effective long-term storage of massive historical data.

Practice this question →

108

MCQmedium

Refer to the exhibit. A data engineer creates an external table in Azure Synapse Serverless SQL. Which statement about this table is correct?

A.The table supports indexing for performance

B.The external data source TYPE must be 'HADOOP' for Azure Data Lake Storage Gen2

C.The table references a single Parquet file named 'Sales.parquet'

D.The table is read-only

AnswerD

External tables are read-only; modifications must be done to underlying files.

Why this answer

External tables in Azure Synapse Serverless SQL are read-only because they provide a relational abstraction over data stored externally (e.g., in Azure Data Lake Storage Gen2 or Blob Storage). You cannot perform INSERT, UPDATE, DELETE, or DDL modifications on the underlying data through the external table; it is designed solely for querying with T-SQL. This is a fundamental constraint of the serverless SQL pool architecture, which uses the OPENROWSET or CREATE EXTERNAL TABLE syntax to read files in place without a storage engine.

Exam trap

The trap here is that candidates confuse external tables in Synapse Serverless SQL with external tables in dedicated SQL pools (which also support PolyBase with Hadoop connectors) and mistakenly think they can write to or index the table, or they misremember the required data source TYPE for ADLS Gen2.

How to eliminate wrong answers

Option A is wrong because external tables in Synapse Serverless SQL do not support indexing; indexing is a feature of dedicated SQL pools where you can create clustered columnstore indexes, but serverless SQL pool relies on file-level statistics and predicate pushdown to Parquet/CSV files. Option B is wrong because for Azure Data Lake Storage Gen2, the external data source TYPE must be 'HDFS' (not 'HADOOP') when using the abfss:// protocol; 'HADOOP' is used for legacy WASB or on-premises HDFS, and Synapse Serverless SQL requires the 'HDFS' type for ADLS Gen2. Option C is wrong because the CREATE EXTERNAL TABLE statement can reference a folder path containing multiple Parquet files, a glob pattern (e.g., 'Sales*.parquet'), or a single file, but the statement in the exhibit (not shown) typically points to a folder or pattern, and the question does not specify a single file; the table definition uses a LOCATION that can include wildcards, so it is not limited to one file.

Practice this question →

109

MCQmedium

You are designing a data storage solution for a global e-commerce company. The company needs to store clickstream data from millions of users with high write throughput and low-latency reads for real-time analytics. The data is semi-structured and includes nested JSON objects. Which Azure data store should you recommend?

A.Azure Cosmos DB

B.Azure Table Storage

C.Azure SQL Database

D.Azure Blob Storage

AnswerA

Azure Cosmos DB provides high throughput, low-latency, and native JSON support.

Why this answer

Azure Cosmos DB is the correct choice because it provides a multi-model, globally distributed database service with guaranteed single-digit-millisecond read and write latencies at the 99th percentile, making it ideal for high-throughput clickstream ingestion and real-time analytics. Its native support for semi-structured data and nested JSON objects via the SQL API (or MongoDB API) allows direct storage and querying of complex event payloads without schema flattening. Additionally, Cosmos DB offers automatic indexing and tunable consistency levels to balance performance and data freshness for global e-commerce scenarios.

Exam trap

The trap here is that candidates often choose Azure Table Storage because it is a NoSQL store, but they overlook its lack of native JSON support and sub-10ms latency guarantees, confusing its simple key-value model with the richer document capabilities of Cosmos DB.

How to eliminate wrong answers

Option B (Azure Table Storage) is wrong because it is a NoSQL key-value store that does not natively support nested JSON objects; it requires flattening complex structures into flat key-value pairs, which adds overhead and complicates real-time analytics on clickstream data. Option C (Azure SQL Database) is wrong because it is a relational database that enforces a fixed schema, making it poorly suited for semi-structured, schema-on-read clickstream data with varying nested JSON fields; it also cannot match Cosmos DB's sub-10ms write throughput at scale. Option D (Azure Blob Storage) is wrong because it is an object store designed for large, unstructured binary data (e.g., images, logs) and does not provide low-latency, indexed query capabilities for real-time analytics on individual clickstream events; it lacks native support for querying nested JSON without additional compute layers like Azure Data Lake or Synapse.

Practice this question →

110

MCQmedium

A company is designing a data lake solution on Azure Data Lake Storage Gen2. Data will be ingested from IoT devices at high frequency (every 5 seconds). Each device sends a JSON payload of 2 KB. The data must be stored in a hierarchical namespace and partitioned by date and device ID to optimize query performance. Which partition strategy should be used?

A.Use Azure SQL Database with clustered columnstore index on date and device ID.

B.Organize folders as /YYYY/MM/DD/DeviceID/ in ADLS Gen2 and use file naming that includes timestamp.

C.Use Azure Table Storage with PartitionKey set to date and RowKey set to device ID.

D.Use Azure Cosmos DB with partition key on (date, device ID) and TTL for data retention.

AnswerB

This folder structure enables efficient partition pruning based on date and device ID.

Why this answer

Option B is correct because ADLS Gen2 with a hierarchical namespace allows folder-based partitioning by date and device ID (e.g., /YYYY/MM/DD/DeviceID/), which directly maps to the query optimization requirement. This structure enables efficient partition pruning for time-range and device-specific queries, and the high-frequency 2 KB JSON payloads are well-suited for append-friendly file naming with timestamps.

Exam trap

The trap here is that candidates confuse storage services (ADLS Gen2) with database or NoSQL solutions (SQL Database, Table Storage, Cosmos DB), failing to recognize that the question explicitly requires a data lake with a hierarchical namespace, which only ADLS Gen2 provides.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database with a clustered columnstore index is a relational store, not a data lake solution, and it does not support a hierarchical namespace or folder-based partitioning as required. Option C is wrong because Azure Table Storage is a NoSQL key-value store that lacks a hierarchical namespace and folder organization; its PartitionKey/RowKey model does not provide the folder-based partitioning by date and device ID needed for ADLS Gen2. Option D is wrong because Azure Cosmos DB is a globally distributed NoSQL database, not a data lake storage service, and its partition key on (date, device ID) does not create a hierarchical folder structure in ADLS Gen2.

Practice this question →

111

MCQhard

Your team is migrating an on-premises SQL Server data warehouse to Azure Synapse Analytics. The source data includes fact tables and dimension tables with complex relationships. You need to design the storage in Azure Synapse to minimize query latency for star schema queries. Which distribution and index strategy should you use for the fact table?

A.Hash distribution on the most joined dimension key with clustered columnstore index

B.Round-robin distribution with clustered columnstore index

C.Replicated distribution with clustered columnstore index

D.Hash distribution on a dimension key with heap index

AnswerA

Hash distribution colocates join data and columnstore index optimizes analytics.

Why this answer

Hash distribution on the most joined dimension key ensures that rows with the same key value are co-located on the same distribution, minimizing data movement during star schema joins. A clustered columnstore index provides high compression and batch-mode processing, which significantly reduces query latency for analytical workloads in Azure Synapse.

Exam trap

The trap here is that candidates often choose round-robin distribution thinking it balances load evenly, but they overlook the severe join performance penalty caused by data movement across distributions in star schema queries.

How to eliminate wrong answers

Option B is wrong because round-robin distribution distributes rows evenly without considering join keys, causing excessive data shuffling across distributions during joins, which increases query latency. Option C is wrong because replicated distribution copies the entire table to each distribution node, which is impractical for large fact tables due to storage overhead and data movement during updates. Option D is wrong because a heap index lacks ordering and compression, leading to full table scans and poor query performance for star schema queries.

Practice this question →

112

MCQeasy

You need to store historical sales data for 10 years with infrequent queries. The storage cost must be minimized while retaining the ability to query using Azure Synapse serverless SQL pool. Which storage tier should you use?

A.Azure Storage Archive tier.

B.Azure Storage Premium tier.

C.Azure Storage Hot tier.

D.Azure Storage Cool tier.

AnswerD

Cool tier provides low-cost storage for infrequent access and is online for queries.

Why this answer

The Cool tier is the correct choice because it provides low-cost storage for data that is infrequently accessed (e.g., historical sales data spanning 10 years) while still supporting immediate read access via Azure Synapse serverless SQL pool. Unlike the Archive tier, Cool tier data is online and can be queried without the need for time-consuming rehydration, making it suitable for infrequent but on-demand analytical queries.

Exam trap

The trap here is that candidates often confuse 'infrequent queries' with 'no queries' and incorrectly choose the Archive tier, forgetting that Azure Synapse serverless SQL pool cannot directly query archived data without a time-consuming rehydration process.

How to eliminate wrong answers

Option A is wrong because the Archive tier is designed for long-term backup and rarely accessed data, requiring a rehydration step (which can take up to 15 hours) before data can be queried by Azure Synapse serverless SQL pool, making it unsuitable for even infrequent queries. Option B is wrong because the Premium tier is optimized for low-latency, high-transaction workloads and is significantly more expensive, which contradicts the requirement to minimize storage cost. Option C is wrong because the Hot tier is intended for frequently accessed data and has higher storage costs than the Cool tier, so it does not meet the cost-minimization goal for infrequently queried historical data.

Practice this question →

113

MCQmedium

You are reviewing an ARM template snippet for an Azure Blob Storage container. What is the effect of this configuration?

A.It enables legal hold on the container for 2555 days.

B.It blocks append writes to blobs in the container.

C.It enforces a 7-year immutable retention policy allowing appends.

D.It sets a 3-year retention policy blocking any modifications.

AnswerC

2555 days ≈ 7 years; allowProtectedAppendWrites allows log appends.

Why this answer

The ARM template snippet configures a time-based retention policy on the container with a retention period of 2555 days (7 years) and sets `allowProtectedAppendWrites` to true. This enables an immutable storage policy that prevents deletion or modification of existing blobs but allows new append blocks to be added to append blobs, making option C correct.

Exam trap

The trap here is that candidates confuse the 2555-day value with a 3-year period (1095 days) or misinterpret `allowProtectedAppendWrites` as blocking appends, when in fact it enables them under immutable storage.

How to eliminate wrong answers

Option A is wrong because legal hold is a separate immutable policy that does not have a time limit; it remains in effect until explicitly removed, and the snippet specifies a time-based retention period of 2555 days, not legal hold. Option B is wrong because the snippet sets `allowProtectedAppendWrites` to true, which explicitly permits append writes to append blobs, not blocks them. Option D is wrong because the retention period is 2555 days (7 years), not 3 years, and the policy allows appends rather than blocking all modifications.

Practice this question →

114

MCQhard

A company uses Azure Data Lake Storage Gen2 with a hierarchical namespace. They need to secure access to specific directories using RBAC roles. Which RBAC role should be assigned to a user to grant read and write access to a specific folder without giving access to other folders in the same container?

A.Storage Account Contributor

B.Storage Blob Data Contributor with ACLs at folder level

C.Storage Blob Data Reader

D.Storage Blob Data Owner

AnswerB

Grants read/write to specific folder.

Why this answer

Option B is correct because Azure Data Lake Storage Gen2 supports POSIX-like access control lists (ACLs) at the directory and file level. Assigning the Storage Blob Data Contributor role at the storage account level grants broad data-plane access, but when combined with ACL entries on a specific folder, you can restrict read and write permissions to only that folder. This allows granular security without affecting other folders in the same container.

Exam trap

The trap here is that candidates often assume RBAC roles alone can be scoped to a folder level, but in Azure Data Lake Storage Gen2, RBAC roles apply to the entire storage account or container, and folder-level security requires ACLs.

How to eliminate wrong answers

Option A is wrong because Storage Account Contributor is an Azure RBAC role that grants management-plane access (e.g., creating storage accounts, managing keys) but does not grant any data-plane permissions to read or write blob data. Option C is wrong because Storage Blob Data Reader provides read-only access to all blob data in the storage account; it cannot be scoped to a specific folder and does not grant write access. Option D is wrong because Storage Blob Data Owner grants full data-plane control (read, write, delete, and ACL management) over all blobs in the storage account, which is far broader than the required folder-level restriction.

Practice this question →

115

Multi-Selecteasy

Which TWO Azure services can be used to ingest streaming data into Azure Synapse Analytics?

Select 2 answers

A.Azure Databricks Auto Loader.

B.Azure Stream Analytics.

C.Azure Data Factory.

D.Azure Event Hubs.

E.Azure Logic Apps.

AnswersB, D

Stream Analytics can output to Synapse SQL pool or ADLS Gen2.

Why this answer

Azure Stream Analytics is correct because it is a fully managed stream processing engine that can ingest streaming data from sources like Azure Event Hubs and output directly to Azure Synapse Analytics (via SQL pool or dedicated SQL pool). It enables real-time analytics on streaming data before landing it in Synapse for further analysis.

Exam trap

The trap here is that candidates often confuse batch ingestion tools (like Azure Data Factory or Auto Loader) with real-time streaming services, or mistakenly think Event Hubs alone can ingest into Synapse without a processing layer like Stream Analytics.

Practice this question →

116

Multi-Selecthard

Which THREE considerations are important when designing a table distribution strategy for an Azure Synapse Analytics dedicated SQL pool? (Choose three.)

Select 3 answers

A.Align distribution keys on tables that are frequently joined together

B.Minimize data skew by choosing a distribution key with many unique values

C.Use round-robin distribution for large fact tables to distribute data evenly

D.Use replicated tables for large tables to avoid data movement

E.Consider the size of the table and the frequency of joins

AnswersA, B, E

B is correct because colocated joins avoid data movement.

Why this answer

Option A is correct because aligning distribution keys on tables that are frequently joined together ensures that the join columns are hash-distributed on the same key, enabling collocated joins. This avoids data movement across distributions during query execution, which significantly improves performance in Azure Synapse Analytics dedicated SQL pools.

Exam trap

The trap here is that candidates often confuse round-robin distribution as a good choice for large fact tables because it distributes data evenly, but they overlook the severe performance penalty from data movement during joins and aggregations.

Practice this question →

117

MCQeasy

A data engineer needs to store log data from multiple applications in Azure. The data is append-only, heavily compressed, and queried infrequently. Cost minimization is critical. Which storage solution is best?

A.Azure Table Storage

B.Azure Cosmos DB with analytical store

C.Azure Blob Storage with cool or archive access tier

D.Azure Data Lake Storage Gen2 with hot tier

AnswerC

Low cost for infrequent access.

Why this answer

Azure Blob Storage with cool or archive access tier is the best choice because the data is append-only, heavily compressed, and infrequently queried, making cost minimization the top priority. The cool tier offers low storage costs with higher access charges, while the archive tier provides the lowest storage cost for data that is rarely accessed and can tolerate hours of retrieval latency. This aligns perfectly with the append-only, infrequently queried nature of the log data.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage's access tiers with Data Lake Storage Gen2's tiers, assuming the hot tier is always the default for log data, but the question's emphasis on 'cost minimization' and 'infrequently queried' explicitly points to cool or archive tiers, not the hot tier.

How to eliminate wrong answers

Option A is wrong because Azure Table Storage is a NoSQL key-value store designed for structured, semi-structured data with frequent point queries, not for large, append-only, compressed log blobs; it lacks the cost-optimized access tiers needed for infrequent access. Option B is wrong because Azure Cosmos DB with analytical store is a globally distributed, multi-model database optimized for low-latency transactional and analytical workloads, which is over-engineered and costly for append-only, infrequently queried log data; its analytical store is designed for near-real-time analytics, not cold storage. Option D is wrong because Azure Data Lake Storage Gen2 with hot tier is optimized for high-frequency access and big data analytics, with higher storage costs than cool or archive tiers, making it unsuitable for cost minimization when data is infrequently queried.

Practice this question →

118

MCQmedium

You are using Azure Synapse Analytics serverless SQL pool to query data in Parquet files stored in Azure Data Lake Storage Gen2. The queries are slow when filtering on a date column. You need to improve query performance without changing the data structure. What should you do?

A.Create materialized views in the serverless SQL pool

B.Increase the service level objective (SLO) of the serverless SQL pool

C.Convert the Parquet files to CSV format

D.Partition the Parquet files into folders by date

AnswerD

Partition elimination allows the serverless SQL pool to read only relevant folders, improving performance.

Why this answer

D is correct because partitioning Parquet files into folders by date enables partition elimination in Azure Synapse serverless SQL pool. When queries filter on the date column, the engine can prune entire folders from the scan, reading only the relevant Parquet files. This reduces I/O and improves performance without altering the data structure or format.

Exam trap

The trap here is that candidates may assume serverless SQL pool supports materialized views or SLO adjustments like dedicated SQL pool, but serverless SQL pool lacks these features and relies on data layout optimizations such as partitioning for performance.

How to eliminate wrong answers

Option A is wrong because materialized views in serverless SQL pool are not supported; they are only available in dedicated SQL pool. Option B is wrong because serverless SQL pool does not have a configurable service level objective (SLO); it scales automatically based on workload and cannot be manually increased. Option C is wrong because converting Parquet to CSV would increase file size and degrade performance due to the lack of columnar compression and predicate pushdown capabilities.

Practice this question →

119

MCQeasy

You are designing a data storage solution for a retail company. The data includes transactional data that requires low-latency queries (under 10 milliseconds) and large historical data for analytics. The solution must minimize storage costs. Which approach should you recommend?

A.Use Azure Data Lake Storage Gen2 for both transactional and historical data

B.Use Azure Cache for Redis for transactional data and Azure SQL Database for historical data

C.Use Azure Cosmos DB for transactional data and Azure Blob Storage for historical data

D.Use Azure SQL Database with Hyperscale tier for both transactional and historical data

AnswerC

Cosmos DB offers low-latency reads/writes, and Blob Storage is cheap for bulk historical data.

Why this answer

Azure Cosmos DB provides single-digit millisecond latency for transactional workloads, meeting the under-10ms requirement, while Azure Blob Storage offers low-cost storage for large historical data. This combination minimizes storage costs by using the most cost-effective service for each workload type.

Exam trap

The trap here is that candidates may assume a single service like Azure SQL Database or Data Lake Storage can handle both transactional and analytical workloads efficiently, overlooking the cost and performance trade-offs that make a hybrid approach optimal.

How to eliminate wrong answers

Option A is wrong because Azure Data Lake Storage Gen2 is optimized for big data analytics, not low-latency transactional queries, and cannot guarantee under 10ms response times. Option B is wrong because Azure Cache for Redis is an in-memory cache, not a durable transactional store, and Azure SQL Database for historical data incurs higher storage costs compared to Blob Storage. Option D is wrong because Azure SQL Database Hyperscale, while scalable, is more expensive for large historical data storage and does not minimize costs as effectively as Blob Storage.

Practice this question →

120

Multi-Selecteasy

Which TWO of the following Azure services can be used to orchestrate data pipelines that include data transformation?

Select 2 answers

A.Azure Synapse Pipelines

B.Azure Data Factory

C.Azure Logic Apps

D.Azure Databricks

E.Azure Functions

AnswersA, B

Synapse Pipelines are built on ADF and provide similar capabilities.

Why this answer

Azure Synapse Pipelines (option A) is correct because it provides a native orchestration service within Azure Synapse Analytics for building, scheduling, and monitoring data pipelines that include data transformation activities, such as Data Flow or Stored Procedure tasks. It leverages the same integration runtime as Azure Data Factory and supports both code-free and code-based transformation steps.

Exam trap

The trap here is that candidates often confuse compute services (like Databricks or Functions) with orchestration services, mistakenly thinking they can replace Azure Data Factory or Synapse Pipelines for end-to-end pipeline management, when in fact they are typically used as activities within an orchestrated pipeline.

Practice this question →

121

MCQeasy

You need to store semi-structured JSON data from a web application that requires low-latency reads and writes at a global scale. The data must be indexed automatically and support SQL-like queries. Which Azure data store should you use?

A.Azure SQL Database

B.Azure Blob Storage

C.Azure Cosmos DB (NoSQL API)

D.Azure Table Storage

AnswerC

Cosmos DB provides global distribution, auto-indexing, and SQL API for JSON.

Why this answer

Azure Cosmos DB with the NoSQL API is the correct choice because it natively stores semi-structured JSON documents, provides automatic indexing of all properties, supports SQL-like queries via its query engine, and offers low-latency reads and writes at global scale through multi-region replication and configurable consistency levels. This combination directly matches the requirements for a globally distributed web application needing fast, queryable JSON storage.

Exam trap

The trap here is that candidates often confuse Azure Table Storage's key-value capabilities with Cosmos DB's document model, mistakenly thinking Table Storage supports SQL-like queries and automatic indexing, when in fact it only supports OData queries and requires explicit partition and row keys for efficient access.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database is a relational database that requires a fixed schema and is not optimized for semi-structured JSON data without manual schema management or JSON functions, nor does it provide automatic indexing of all JSON properties. Option B is wrong because Azure Blob Storage is an object store for unstructured data that does not support SQL-like queries or automatic indexing; it requires a separate compute layer (e.g., Azure Data Lake Analytics) for querying. Option D is wrong because Azure Table Storage is a NoSQL key-value store that does not support SQL-like queries, automatic indexing of all fields, or native JSON document storage; it uses OData queries and a flat schema.

Practice this question →

122

MCQeasy

Refer to the exhibit. An Azure Policy is defined to enforce network security on storage accounts. What does this policy do?

A.Denies storage accounts that do not have any IP rules defined

B.Denies storage accounts that have firewall rules configured

C.Denies storage accounts that have public network access disabled

D.Denies storage accounts that allow public network access from all networks

AnswerD

defaultAction=Allow means public access from all networks.

Why this answer

The Azure Policy in the exhibit uses the 'Deny' effect with a condition that checks if the 'networkAcls.defaultAction' property is set to 'Allow'. When 'defaultAction' is 'Allow', the storage account permits traffic from all networks, including the internet. The policy denies such configurations to enforce network security by requiring that public network access be restricted.

Exam trap

The trap here is that candidates confuse the 'defaultAction' property with the presence of IP rules or firewall settings, leading them to think the policy denies accounts with any firewall rules rather than those that allow all networks.

How to eliminate wrong answers

Option A is wrong because the policy does not evaluate the presence or absence of IP rules; it only checks the 'defaultAction' property. Option B is wrong because the policy denies accounts that allow all networks, not those with firewall rules configured; firewall rules are a separate mechanism. Option C is wrong because the policy denies accounts where public network access is enabled (defaultAction = 'Allow'), not disabled; disabling public access would set defaultAction to 'Deny', which the policy does not target.

Practice this question →

123

MCQhard

A company uses Azure Synapse Analytics dedicated SQL pool for data warehousing. They notice that queries against a large fact table are slow. The table is hash-distributed on ProductID, but many queries filter on OrderDate. What should the data engineer do to improve query performance?

A.Change the distribution to round-robin

B.Replicate the table to all distributions

C.Create a columnstore index on OrderDate

D.Change the distribution to hash on OrderDate

AnswerD

Aligns distribution with filter column, minimizing data movement.

Why this answer

Option D is correct because changing the distribution key to OrderDate aligns the physical data layout with the most common query filter predicate. In a dedicated SQL pool, hash distribution distributes rows across distributions based on the hash of the distribution column. When queries filter on OrderDate, a hash on OrderDate enables partition elimination and distribution-level pruning, reducing data movement and improving scan performance.

Exam trap

The trap here is that candidates often confuse indexing (columnstore) with distribution strategy, assuming a non-clustered index on the filter column is sufficient, when in fact the distribution key must match the most frequent filter predicate to avoid full distribution scans.

How to eliminate wrong answers

Option A is wrong because round-robin distribution distributes rows evenly without any logical grouping, which forces full table scans and data shuffling for all queries, making performance worse for filtered queries. Option B is wrong because replicating a large fact table to all distributions would consume excessive storage and cause significant overhead during data loading, and is typically reserved for small dimension tables. Option C is wrong because a columnstore index on OrderDate improves compression and scan efficiency but does not address the distribution mismatch; queries would still need to scan all distributions, missing the benefit of distribution elimination.

Practice this question →

124

MCQmedium

You are designing a near real-time analytics solution for a retail company. The company has a transactional database in Azure SQL Database that records sales transactions. The data must be available in Azure Synapse Analytics dedicated SQL pool for reporting with less than 15 minutes of latency. The data volume is about 10 GB per day. You need to design the data ingestion pipeline. You also need to ensure that the pipeline can handle schema changes (e.g., new columns added to the source table) without manual intervention. Which approach should you use?

A.Use Azure Databricks with Auto Loader to read from SQL Database and write to Synapse using Delta Lake.

B.Use Azure Data Factory with a change data capture (CDC) mapping data flow to read changes from SQL Database and write to Synapse dedicated SQL pool. Enable schema drift to handle new columns.

C.Use Azure Data Share to share the SQL Database tables with Synapse and schedule snapshots every 15 minutes.

D.Use Azure Synapse Pipelines with a copy activity to perform a full load every 15 minutes.

AnswerB

CDC provides incremental changes; schema drift allows automatic handling of new columns.

Why this answer

Option B is correct because Azure Data Factory's mapping data flows support Change Data Capture (CDC) to incrementally load only changed rows from Azure SQL Database, meeting the <15-minute latency requirement for 10 GB/day. The 'Enable schema drift' option in mapping data flows automatically handles new columns added to the source table without manual intervention, which is essential for schema evolution.

Exam trap

The trap here is that candidates may choose Azure Databricks with Auto Loader (Option A) because it is associated with handling schema evolution, but they overlook that Auto Loader is file-based and not designed for direct CDC from Azure SQL Database, making it unsuitable for this transactional source.

How to eliminate wrong answers

Option A is wrong because Azure Databricks Auto Loader is designed for ingesting files (e.g., from cloud storage) and does not natively connect to Azure SQL Database for CDC; it would require additional connectors and manual schema handling, adding complexity without meeting the near-real-time requirement efficiently. Option C is wrong because Azure Data Share provides snapshot-based sharing with scheduled refreshes, but it does not support CDC or schema drift; it would require full or incremental snapshots that may not handle new columns automatically and could exceed latency if schema changes occur. Option D is wrong because using a full load every 15 minutes for 10 GB/day is inefficient and could cause high resource consumption and potential timeouts; it does not handle schema changes automatically and is not a best practice for near-real-time ingestion.

Practice this question →

125

Multi-Selectmedium

Which of the following are valid methods to secure data at rest in Azure Data Lake Storage Gen2? (Choose two.)

Select 2 answers

A.Azure Storage Service Encryption (SSE) with Microsoft-managed keys

B.Azure Active Directory (Azure AD) authentication for storage accounts

C.Customer-managed keys stored in Azure Key Vault

D.Configure firewall rules to restrict IP access

AnswersA, C

Why this answer

Azure Storage Service Encryption (SSE) with Microsoft-managed keys encrypts data at rest automatically for Azure Data Lake Storage Gen2 using 256-bit AES encryption. This is enabled by default for all storage accounts, ensuring data written to disk is encrypted before being persisted, with no additional configuration required.

Exam trap

The trap here is confusing network security controls (like firewalls or Azure AD authentication) with data-at-rest encryption methods, leading candidates to select options that protect access rather than the stored data itself.

Why the other options are wrong

Azure AD authentication controls access, not encryption at rest.

Firewall rules control network access, not encryption at rest.

Practice this question →

126

Drag & Dropmedium

Drag and drop the steps to configure Azure Synapse Analytics serverless SQL pool to query data in Azure Data Lake Storage Gen2 into the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps

Order

Why this order

First, set up the storage account and data. Then create the serverless SQL pool, a database, external data source referencing the storage, and finally external file format and external table to query.

Practice this question →

127

Multi-Selectmedium

Which TWO actions should you take to optimize query performance in Azure Synapse Analytics dedicated SQL pool when working with large fact tables?

Select 2 answers

A.Use replicated distribution for the fact table.

B.Use round-robin distribution to evenly distribute data.

C.Create statistics on columns used in WHERE clauses.

D.Use clustered index instead of columnstore index.

E.Implement table partitioning on a date column.

AnswersC, E

Statistics help the optimizer choose efficient query plans.

Why this answer

Option C is correct because creating statistics on columns used in WHERE clauses enables the Azure Synapse Analytics dedicated SQL pool query optimizer to generate more accurate cardinality estimates, leading to better join strategies and index selections. Without up-to-date statistics, the optimizer may choose suboptimal plans, especially for large fact tables where data distribution skew is common.

Exam trap

The trap here is that candidates often confuse distribution methods (replicated, round-robin, hash) with performance tuning for large fact tables, overlooking that statistics maintenance is a critical and separate optimization step that directly impacts query plan quality.

Practice this question →

128

Multi-Selectmedium

You are designing a hybrid data storage architecture for a global e-commerce platform. Which two Azure services should you combine to achieve low-latency read access for users worldwide and durable archival storage for compliance?

Select 2 answers

A.Azure Cosmos DB

B.Azure Table Storage

C.Azure Blob Storage

D.Azure SQL Database

E.Azure Files

AnswersA, C

Why this answer

Azure Cosmos DB is correct because it provides globally distributed, multi-region write and read capabilities with turnkey global distribution, enabling low-latency access for users worldwide via its multi-homing API. Azure Blob Storage is correct because it offers durable, tiered archival storage (e.g., Cool, Archive access tiers) at low cost, meeting compliance requirements for long-term retention of data such as transaction logs or invoices.

Exam trap

The trap here is that candidates often confuse Azure Table Storage with Cosmos DB's Table API, assuming Table Storage supports global distribution, when in fact only Cosmos DB's Table API (a different service) provides multi-region replication and low-latency reads.

Why the other options are wrong

Table Storage is a single-region service with higher latency.

Global distribution requires complex sharding and is not as seamless as Cosmos DB.

Not designed for archival or global low-latency reads.

Practice this question →

129

MCQeasy

You need to store streaming data from Azure Event Hubs into Azure Data Lake Storage Gen2 in near real-time. The data should be stored in Avro format with a folder structure: /raw/{eventhub}/{yyyy}/{MM}/{dd}/{HH}/{mm}. Which Azure service should you use to ingest the data?

A.Event Hubs Capture feature to automatically capture events to ADLS Gen2.

B.Azure Stream Analytics with a job that reads from Event Hubs and writes to ADLS Gen2.

C.Azure Data Factory with a tumbling window trigger to copy data from Event Hubs every 5 minutes.

D.Azure Databricks with Auto Loader to read from Event Hubs and write to ADLS Gen2.

AnswerB

Stream Analytics supports Avro output and custom partition path patterns.

Why this answer

Azure Stream Analytics is the correct choice because it natively supports reading from Event Hubs and writing to ADLS Gen2 with built-in time-based partitioning into the exact folder structure /raw/{eventhub}/{yyyy}/{MM}/{dd}/{HH}/{mm}. It provides near real-time processing with sub-minute latency and can output data in Avro format directly, meeting all requirements without additional code or orchestration.

Exam trap

The trap here is that candidates often choose Event Hubs Capture because it seems like a simple 'capture to storage' feature, but they overlook the requirement for near real-time per-minute partitioning, which Capture cannot achieve due to its fixed 5-minute minimum window.

How to eliminate wrong answers

Option A is wrong because Event Hubs Capture writes data in fixed 5-minute or 300 MB windows, not in near real-time per minute, and its folder structure is /{EventHub}/{Namespace}/{YYYY}/{MM}/{DD}/{HH}/{mm} but cannot dynamically include the event hub name as a folder variable in the path. Option C is wrong because Azure Data Factory with a tumbling window trigger introduces at least 5 minutes of latency and is designed for batch processing, not near real-time streaming, and cannot read from Event Hubs directly without a staging layer. Option D is wrong because Azure Databricks with Auto Loader is optimized for incremental file ingestion from cloud storage, not for streaming from Event Hubs; it would require additional structured streaming code and incurs cluster startup and runtime overhead, making it less suitable for simple near real-time ingestion.

Practice this question →

130

Matchingmedium

Match each Azure service to its primary purpose in a data engineering pipeline.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Scalable data lake for analytics workloads

Unified analytics platform with SQL and Spark

Cloud-based ETL and data integration service

Real-time stream processing service

Apache Spark-based analytics platform

Why these pairings

These are core Azure data services commonly used in DP-203.

Practice this question →

131

Multi-Selecthard

Which THREE of the following are valid methods to load data into Azure Synapse Analytics?

Select 3 answers

A.Azure Data Factory pipeline

B.COPY INTO T-SQL command

C.BULK INSERT with a data source from Azure Blob Storage

D.SQL Server Integration Services (SSIS) package

E.PolyBase from Azure Blob Storage

AnswersA, B, E

Data Factory can copy data into Synapse using the Copy activity.

Why this answer

Azure Data Factory (ADF) is a fully managed cloud-based ETL service that provides native connectors to Azure Synapse Analytics. It supports high-throughput data movement using the PolyBase engine or staged copy, making it a valid and recommended method for loading data at scale into Synapse dedicated SQL pools.

Exam trap

The trap here is that candidates often confuse BULK INSERT (which works only for SQL Server on-premises or IaaS VMs) with the COPY INTO command (which is the Synapse-specific equivalent for Azure Blob Storage), leading them to incorrectly select Option C as valid.

Practice this question →

132

MCQmedium

A company uses Azure SQL Database for an OLTP application. They need to run complex analytical queries without impacting OLTP performance. Which solution should they implement?

A.Enable automatic tuning

B.Enable geo-replication

C.Create a readable secondary replica

D.Create a failover group

AnswerC

Correct. A readable secondary replica allows read-only queries without affecting the primary.

Why this answer

Creating a readable secondary replica (Option C) offloads read-only analytical queries to a synchronized copy of the database, isolating them from the primary OLTP workload. Azure SQL Database supports this via Active Geo-Replication or Hyperscale named replicas, ensuring the primary remains unaffected by heavy analytical processing.

Exam trap

The trap here is that candidates confuse high-availability features (failover groups, geo-replication) with workload isolation, assuming any replication solves the performance impact, whereas only a dedicated readable secondary explicitly separates read-only analytical traffic from the primary OLTP workload.

How to eliminate wrong answers

Option A is wrong because automatic tuning focuses on index and query plan optimization for the primary database, not on isolating analytical workloads. Option B is wrong because geo-replication provides disaster recovery and read-scale capabilities, but its primary purpose is not to offload complex analytical queries without impacting OLTP; it can be used for read-only workloads but is not the optimal solution for analytical isolation. Option D is wrong because a failover group manages geo-replication and failover for high availability, not for distributing analytical queries away from the primary.

Practice this question →

133

Multi-Selecthard

Which TWO actions should be taken to secure data at rest in Azure Data Lake Storage Gen2?

Select 2 answers

A.Enable soft delete for blobs

B.Configure firewall and virtual network rules

C.Enable customer-managed keys in Azure Key Vault

D.Assign storage blob data contributor roles to users

E.Use client-side encryption with Azure Key Vault

AnswersB, C

Restricts network access to storage.

Why this answer

Option B is correct because configuring firewall and virtual network rules restricts network-level access to the storage account, preventing unauthorized traffic from reaching the data at rest. Option C is correct because enabling customer-managed keys (CMK) in Azure Key Vault provides an additional encryption layer for data at rest, allowing you to control and rotate encryption keys independently of Azure-managed keys.

Exam trap

The trap here is confusing data protection mechanisms (soft delete, RBAC, client-side encryption) with the specific Azure-native controls for securing data at rest (network isolation and encryption key management).

Practice this question →

134

Multi-Selecthard

Which THREE factors should you consider when choosing a shard key for Azure Cosmos DB to ensure even distribution and optimal performance?

Select 3 answers

A.Stable key values that do not change

B.Frequent updates to the key value

C.High cardinality of the key

D.Low cardinality of the key

E.Even distribution of request volume across partitions

AnswersA, C, E

Stable keys avoid partition splits and data movement.

Why this answer

Stable key values that do not change (Option A) are correct because Azure Cosmos DB uses the shard key to determine the physical partition for each document. If the key value changes after insertion, the document cannot be moved to a different partition, leading to cross-partition queries and degraded performance. A stable key ensures that the partition assignment remains consistent throughout the document's lifetime, which is essential for optimal query routing and throughput management.

Exam trap

The trap here is that candidates often confuse 'low cardinality' with 'high cardinality' or assume that frequently updated keys are acceptable because they think Cosmos DB can repartition dynamically, but in reality, the shard key is immutable after document creation.

Practice this question →

135

MCQeasy

You need to store log data from multiple Azure services in a single location for long-term retention and cost-effective querying. The data is append-only and rarely modified. Which storage solution should you use?

A.Azure SQL Database

B.Azure Data Lake Storage Gen2

C.Azure Table Storage

D.Azure Cosmos DB

AnswerB

ADLS Gen2 provides scalable, cost-effective storage for log data with analytics capabilities.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is optimized for append-only, rarely modified data like logs, offering hierarchical namespace, POSIX-like ACLs, and cost-effective tiered storage (hot/cool/archive). It supports high-throughput querying via Azure Synapse, Athena, or Spark, making it ideal for long-term retention and analytics on immutable log data.

Exam trap

The trap here is that candidates confuse 'append-only' with a database requirement and pick Azure SQL Database or Cosmos DB, overlooking that log storage prioritizes cost and schema-on-read over transactional consistency or low-latency writes.

How to eliminate wrong answers

Option A is wrong because Azure SQL Database is a relational OLTP store with high transaction costs and schema rigidity, not designed for append-only log blobs or cost-effective long-term retention. Option C is wrong because Azure Table Storage is a NoSQL key-value store for semi-structured data with limited query capabilities (only on partition/row key) and no native support for hierarchical namespaces or large-scale analytics. Option D is wrong because Azure Cosmos DB is a globally distributed, multi-model database optimized for low-latency reads/writes and real-time applications, not for cost-effective archival of append-only logs; its RU-based pricing makes long-term storage expensive.

Practice this question →

136

MCQmedium

You are designing a data storage solution for a retail company that needs to store transaction data that is frequently updated and requires strong consistency. The solution must support complex queries and joins across multiple tables. Which Azure data service should you recommend?

A.Azure Cosmos DB

B.Azure SQL Database

C.Azure Synapse Analytics

D.Azure Table Storage

AnswerB

Why this answer

Azure SQL Database is a fully managed relational database service that provides strong consistency, supports complex queries and joins across multiple tables, and is optimized for frequently updated transaction data. It offers ACID compliance and built-in high availability, making it the ideal choice for this retail scenario.

Exam trap

The trap here is that candidates often choose Azure Cosmos DB for its low-latency and global distribution capabilities, overlooking that it does not provide native relational joins or the strong consistency required for transactional workloads, which Azure SQL Database is specifically designed for.

Why the other options are wrong

Cosmos DB is NoSQL and while it can be configured for strong consistency, it does not natively support complex joins across multiple tables as efficiently as a relational database.

Synapse is a data warehouse for analytics, not designed for transactional workloads with frequent updates.

Table Storage is a NoSQL key-value store with limited query capabilities and no support for complex joins.

Practice this question →

137

MCQhard

You are a data engineer at a financial services company. The company uses Azure Synapse Analytics with a dedicated SQL pool for its data warehouse. The current table 'FactTransactions' is 2 TB and uses round-robin distribution. Query performance is poor for queries that frequently filter on 'CustomerID' and join with a 'DimCustomer' table (10 GB, replicated). You need to redesign the table to improve query performance while minimizing data movement during queries. The solution must also support incremental data loading with minimal overhead. You cannot change the storage size limit or add more DWU. What should you do?

A.Replicate the FactTransactions table to all distributions.

B.Keep round-robin distribution but add indexes on CustomerID and TransactionDate.

C.Change distribution to hash on TransactionDate and partition by month.

D.Recreate the table using hash distribution on CustomerID and use CTAS for incremental loads.

AnswerD

Collocates data on CustomerID, reducing data movement; CTAS handles incremental loads.

Why this answer

Option D is correct because hash distribution on CustomerID ensures that rows with the same CustomerID are co-located on the same distribution, which eliminates data movement during joins with the replicated DimCustomer table. Using CTAS (CREATE TABLE AS SELECT) for incremental loads allows you to efficiently rebuild the table with minimal overhead by loading only new data into a staging table, then swapping partitions or using CTAS to replace the target table without blocking reads.

Exam trap

The trap here is that candidates often choose hash distribution on a date column (Option C) thinking it helps with time-based queries, but the question specifically requires improving join performance on CustomerID, so the distribution key must match the join key to avoid data movement.

How to eliminate wrong answers

Option A is wrong because replicating a 2 TB FactTransactions table to all distributions would exceed the storage capacity (each distribution would hold a full copy, multiplying storage by 60 distributions) and is not feasible given the storage size limit. Option B is wrong because round-robin distribution distributes rows randomly across distributions, so queries filtering on CustomerID still require data shuffling across nodes to gather matching rows, and indexes on a columnstore table are not effective for point lookups in a distributed environment. Option C is wrong because hash distribution on TransactionDate does not align with the join key (CustomerID), so joins with DimCustomer would still cause data movement; partitioning by month adds maintenance overhead and does not reduce data shuffling for the join.

Practice this question →

138

Multi-Selecteasy

You are designing a data storage solution for a retail company that needs to store semi-structured IoT sensor data from thousands of devices. The data is ingested in near real-time, and queries will involve filtering by device ID and timestamp. The solution must minimize storage costs while supporting interactive queries. Which TWO Azure data storage options are most appropriate?

Select 2 answers

A.Azure Table Storage

B.Azure Cosmos DB

C.Azure Blob Storage

D.Azure Data Lake Storage Gen2

E.Azure SQL Database

AnswersC, D

Cost-effective for large volumes of semi-structured data.

Why this answer

Azure Blob Storage (C) is correct because it provides a cost-effective, scalable object store for semi-structured IoT data, supporting near real-time ingestion via REST APIs or SDKs and enabling interactive queries through Azure Data Lake Storage Gen2's hierarchical namespace and integration with query engines like Azure Synapse Serverless SQL or PolyBase. Azure Data Lake Storage Gen2 (D) is correct as it builds on Blob Storage with a hierarchical namespace, optimized for analytics workloads and interactive queries using tools like Azure Synapse or Databricks, while minimizing costs through tiered storage and lifecycle management.

Exam trap

The trap here is that candidates often choose Azure Cosmos DB for its low-latency querying capabilities, overlooking that the question emphasizes minimizing storage costs for large volumes of semi-structured IoT data, where object storage (Blob/Data Lake) is far cheaper and still supports interactive queries via analytics engines.

Practice this question →

139

MCQmedium

Your company stores sensitive customer data in Azure Data Lake Storage Gen2. You need to ensure that only authorized users can access the data, and that access is audited. Which approach should you use to control access to the data lake?

A.Use Azure managed identities for all user access

B.Use Azure RBAC roles and ACLs on directories/files, and enable storage analytics logging

C.Use shared access signatures (SAS) with stored access policies

D.Configure a virtual network service endpoint and firewall rules

AnswerB

RBAC provides coarse access, ACLs provide fine-grained, and logging audits access.

Why this answer

Option A is correct because RBAC combined with ACLs provides fine-grained access control and Azure Storage analytics logs enable auditing. Option B is wrong because SAS tokens are less granular and harder to manage for many users. Option C is wrong because firewall rules control network access, not user authorization.

Option D is wrong because managed identities are for service-to-service authentication, not user access.

Practice this question →

140

Multi-Selecthard

Which TWO options are valid ways to load data into Azure Synapse SQL Pool? (Choose two.)

Select 2 answers

A.Using INSERT INTO VALUES

B.Using BCP utility

C.Using the COPY statement

D.Using SQL Server Integration Services (SSIS)

E.Using PolyBase to load from Azure Blob Storage

AnswersC, E

Correct. COPY is the preferred bulk load method.

Why this answer

The COPY statement (Option C) is a valid and recommended method for loading data into Azure Synapse SQL Pool because it provides a high-throughput, fully managed ingestion path from Azure Data Lake Storage or Azure Blob Storage. It supports parallel loading, automatic schema inference, and built-in error handling, making it the preferred choice over older methods like PolyBase for many scenarios.

Exam trap

The trap here is that candidates may incorrectly assume INSERT INTO VALUES or BCP are valid for Synapse SQL Pool due to their familiarity with traditional SQL Server, but Synapse's MPP architecture requires bulk loading methods like COPY or PolyBase to achieve acceptable performance.

Practice this question →

141

MCQhard

You are using Azure Synapse Analytics serverless SQL pool to query Parquet files in Azure Data Lake Storage Gen2. The query is slow and you suspect that the file layout is not optimized. You examine the files and find that each file is 50 MB. What should you do to improve query performance?

A.Partition the data into folders by date

B.Compress the files with Gzip

C.Convert the files to CSV format to reduce overhead

D.Merge the small files into larger files of at least 100 MB each

AnswerD

Larger files reduce metadata overhead and improve query performance.

Why this answer

Option D is correct because Azure Synapse serverless SQL pool performs best when reading files of at least 100 MB each. Small files (50 MB) cause excessive metadata operations, partition discovery, and I/O overhead, leading to slower queries. Merging them into larger files reduces the number of file open/close operations and improves parallelism efficiency.

Exam trap

The trap here is that candidates often focus on compression or format conversion to improve performance, but the real bottleneck in serverless SQL pool with many small files is the metadata and scheduling overhead, not the data size itself.

How to eliminate wrong answers

Option A is wrong because partitioning by date does not address the core issue of small file size; while partitioning can help with predicate pushdown, it does not reduce the overhead of reading many small files. Option B is wrong because compressing with Gzip can actually degrade performance in serverless SQL pool, as Gzip is not splittable and forces sequential reads, whereas the pool prefers splittable formats like Snappy or uncompressed Parquet. Option C is wrong because converting to CSV format increases overhead due to lack of columnar storage, schema inference, and predicate pushdown, making queries slower than with optimized Parquet files.

Practice this question →

142

MCQhard

You are designing a data lake architecture for a healthcare company. The solution must support fine-grained access control at the file level, encryption at rest and in transit, and integration with Microsoft Purview for data lineage. Which storage solution should you recommend?

A.Azure NetApp Files.

B.Azure Files.

C.Azure Data Lake Storage Gen2 (ADLS Gen2).

D.Azure Blob Storage.

AnswerC

ADLS Gen2 provides ACLs, encryption, and Purview integration.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it combines a hierarchical namespace with POSIX-like access control lists (ACLs) for fine-grained file-level permissions, supports encryption at rest (Azure Storage Service Encryption) and in transit (TLS 1.2+), and natively integrates with Microsoft Purview for automated data lineage and cataloging. This makes it ideal for healthcare scenarios requiring strict compliance and auditability.

Exam trap

The trap here is that candidates often confuse Azure Blob Storage with ADLS Gen2, assuming blob storage's container-level permissions are sufficient for file-level control, but the hierarchical namespace and POSIX ACLs are exclusive to ADLS Gen2 and required for the fine-grained access described.

How to eliminate wrong answers

Option A is wrong because Azure NetApp Files provides NFS/SMB file shares with ACLs but lacks native integration with Microsoft Purview for data lineage and is not optimized for large-scale analytics workloads like data lakes. Option B is wrong because Azure Files offers SMB file shares with ACLs but does not support the hierarchical namespace or POSIX ACLs needed for fine-grained file-level control in a data lake, and its Purview integration is limited compared to ADLS Gen2. Option D is wrong because Azure Blob Storage provides encryption and Purview integration but lacks a hierarchical namespace and POSIX ACLs, making it impossible to enforce fine-grained access control at the individual file level.

Practice this question →

143

MCQhard

You are troubleshooting a slow-running query in Azure Synapse Analytics dedicated SQL pool. The query joins a large fact table (hash-distributed on ProductID) with a small dimension table (replicated). Upon reviewing the query plan, you see a 'ShuffleMove' operation. What is the most likely cause of the slow performance?

A.The dimension table is not actually replicated due to its size exceeding the replication threshold

B.The query is running with a low resource class

C.Result set caching is enabled

D.Statistics are outdated on the dimension table

AnswerA

If the dimension table is too large to be replicated, it will be distributed and cause shuffle.

Why this answer

The 'ShuffleMove' operation in a Synapse dedicated SQL pool query plan indicates that data is being moved between distributions to complete the join. If the small dimension table is supposed to be replicated but exceeds the replication threshold (default 60 GB compressed), it will not be replicated and instead remains hash-distributed. This forces a shuffle of the large fact table's data to align with the dimension table's distribution, causing significant data movement and slow performance.

Exam trap

The trap here is that candidates often assume a 'ShuffleMove' is always caused by a join key mismatch or poor statistics, but the specific scenario of a dimension table failing to replicate due to size is a common and subtle cause that directly triggers data movement.

How to eliminate wrong answers

Option B is wrong because a low resource class affects concurrency and memory allocation, but it would not introduce a 'ShuffleMove' operation; it would manifest as spilling to tempdb or timeouts. Option C is wrong because result set caching improves performance for repeated queries by storing results, and it does not cause a 'ShuffleMove' operation; in fact, it would reduce the need for shuffles. Option D is wrong because outdated statistics can lead to suboptimal cardinality estimates and poor join choices, but they do not directly cause a 'ShuffleMove' operation; the shuffle is a physical data movement decision based on distribution type, not statistics.

Practice this question →

144

MCQeasy

Your company uses Azure Synapse Analytics dedicated SQL pool to store a fact table with 2 billion rows. You need to improve query performance for a workload that frequently aggregates sales by date and product category. Which distribution and index type should you use?

A.Hash-distribute on product_category and use a clustered columnstore index.

B.Replicate the table and use a clustered index.

C.Round-robin distribution and a clustered columnstore index.

D.Hash-distribute on date and use a clustered index.

AnswerA

Hash distribution on the join/aggregation key improves performance; columnstore is ideal for large data volumes.

Why this answer

Hash-distributing on product_category ensures that rows with the same product category are co-located on the same distribution, enabling local aggregation without data movement. A clustered columnstore index provides high compression and batch-mode processing, which is ideal for large fact tables and analytical workloads that aggregate millions of rows by columns like date and product_category.

Exam trap

The trap here is that candidates often choose round-robin distribution (Option C) thinking it balances data evenly, but they overlook that it causes data shuffling for any aggregation on a non-distribution column, while hash distribution on the grouping column avoids that overhead entirely.

How to eliminate wrong answers

Option B is wrong because replicating a 2-billion-row table is impractical due to storage overhead and the 60-distribution replication limit, and a clustered index lacks the columnar compression and batch-mode execution needed for large-scale aggregation. Option C is wrong because round-robin distribution distributes rows randomly, causing data movement during joins and aggregations, which degrades performance for frequent grouping by product_category. Option D is wrong because hash-distributing on date scatters rows with the same product_category across distributions, forcing expensive shuffle operations during aggregation, and a clustered index is less efficient than columnstore for analytical queries.

Practice this question →

145

MCQhard

You are designing a data storage solution for an Azure Data Lake Storage Gen2 account that will store sensitive customer data. The solution must enforce that all data is encrypted at rest using customer-managed keys (CMK) stored in Azure Key Vault. Additionally, you need to prevent data from being accessed by any Azure service except Azure Synapse Analytics. Which combination of configurations should you implement?

A.Enable Azure Storage encryption with customer-managed keys stored in Azure Key Vault and configure a firewall with a service endpoint for Azure Synapse Analytics

B.Enable Azure Storage encryption with customer-managed keys and use a shared access signature (SAS) token for Azure Synapse Analytics

C.Enable Azure Storage encryption with customer-managed keys and assign an Azure Policy denying public network access

D.Enable Azure Storage encryption with Microsoft-managed keys and configure a private endpoint

AnswerA

This combination provides CMK encryption and restricts network access to Azure Synapse Analytics only.

Why this answer

Option A is correct because it combines two essential controls: Azure Storage encryption with customer-managed keys (CMK) stored in Azure Key Vault ensures data is encrypted at rest using keys you control, and a firewall with a service endpoint for Azure Synapse Analytics restricts network access so that only traffic from Azure Synapse Analytics can reach the storage account. This meets both the encryption and access restriction requirements.

Exam trap

The trap here is that candidates often confuse service endpoints with private endpoints or SAS tokens, thinking any network restriction or key management approach will suffice, but the question specifically requires both CMK and service-specific access control, which only a service endpoint for Azure Synapse Analytics combined with a firewall provides.

How to eliminate wrong answers

Option B is wrong because a shared access signature (SAS) token grants time-limited, delegated access to specific resources but does not restrict access to only Azure Synapse Analytics; it can be used by any client with the token, and it does not enforce network-level isolation. Option C is wrong because assigning an Azure Policy denying public network access only blocks public endpoints but does not explicitly allow only Azure Synapse Analytics; it would block all access unless a private endpoint or service endpoint is configured, and it does not address the encryption requirement (CMK is already enabled, but the policy alone does not enforce the service-specific restriction). Option D is wrong because it uses Microsoft-managed keys instead of customer-managed keys, which fails the explicit requirement for CMK stored in Azure Key Vault, even though a private endpoint provides network isolation.

Practice this question →

146

MCQmedium

You are designing a data storage solution for a retail company that needs to store semi-structured JSON data from IoT sensors. The data is ingested continuously and must support both real-time analytics and batch processing. Which Azure data store should you recommend?

A.Azure Data Lake Storage Gen2

B.Azure Blob Storage

C.Azure SQL Database

D.Azure Cosmos DB

AnswerA

ADLS Gen2 combines Blob Storage with a hierarchical namespace and is designed for big data analytics.

Why this answer

Azure Data Lake Storage Gen2 (ADLS Gen2) is the correct choice because it combines a hierarchical file system with the scalability and low cost of Azure Blob Storage, making it ideal for storing semi-structured JSON data from IoT sensors. It supports both real-time analytics (via services like Azure Stream Analytics or Apache Spark) and batch processing (via tools like Azure Data Factory or PolyBase) without data movement, and it natively handles JSON files with schema-on-read capabilities.

Exam trap

The trap here is that candidates often choose Azure Cosmos DB (Option D) because it natively supports JSON and real-time access, but they overlook the requirement for batch processing and cost-effective storage at scale, which ADLS Gen2 is designed for as a data lake solution.

How to eliminate wrong answers

Option B (Azure Blob Storage) is wrong because while it can store JSON data, it lacks a hierarchical namespace, making it less efficient for directory-based operations and batch processing patterns that require folder-level security or rename operations, which ADLS Gen2 provides. Option C (Azure SQL Database) is wrong because it is a relational database optimized for structured data and transactional workloads, not for storing semi-structured JSON at scale with schema-on-read; it would require schema definition and indexing, adding latency and cost for high-volume IoT ingestion. Option D (Azure Cosmos DB) is wrong because it is a NoSQL database designed for low-latency, transactional access to semi-structured data, but it is not optimized for batch processing or large-scale analytical queries on raw JSON files; it is better suited for operational workloads rather than the combined real-time and batch analytics scenario described.

Practice this question →

147

Multi-Selectmedium

Which THREE considerations should be evaluated when designing a partitioning strategy for a large fact table in Azure Synapse Dedicated SQL Pool?

Select 3 answers

A.Align partition boundaries with distribution key

B.Use hash distribution for all fact tables

C.Ensure partition elimination is possible in queries

D.Limit the number of partitions to fewer than 60

E.Choose a partition column with high cardinality

AnswersA, C, D

Prevents data movement.

Why this answer

Option A is correct because aligning partition boundaries with the distribution key in Azure Synapse Dedicated SQL Pool ensures that data movement during partition switching or merging is minimized, as each distribution contains its own set of partitions. This alignment avoids cross-distribution data transfers, which can significantly improve performance and maintainability of large fact tables.

Exam trap

The trap here is that candidates often confuse partition column cardinality with distribution key cardinality, assuming high cardinality is always beneficial, but for partitioning, low cardinality is required to avoid creating too many small partitions that hurt performance.

Practice this question →

148

MCQhard

A healthcare company stores patient records in Azure Blob Storage. The compliance team requires that all data be encrypted at rest using customer-managed keys (CMK) stored in Azure Key Vault. Additionally, the storage account must be accessible only from a specific virtual network (VNet) and must support versioning to protect against accidental deletion. The storage account is currently using Microsoft-managed keys and has public network access enabled. You need to implement the required changes with minimal downtime. Which course of action should you take?

A.Create a new storage account in the same region with infrastructure encryption enabled and customer-managed keys configured. Enable versioning and VNet access. Use AzCopy to copy data from the old account to the new one. Update applications to use the new storage account. Delete the old account.

B.Modify the existing storage account to use customer-managed keys by updating the encryption settings in the Azure portal. Then, enable versioning and configure firewall rules to allow access from the VNet.

C.Delete the existing storage account and recreate it with the same name, enabling customer-managed keys, versioning, and VNet access. Restore data from backups.

D.Create a new storage account with the same name in a different region. Enable infrastructure encryption and customer-managed keys. Use Azure Data Factory to copy data from the old account to the new one. Then, delete the old account and update applications to point to the new account.

AnswerA

This meets all requirements. The new account is created with CMK support, versioning, and VNet access. Data is migrated with AzCopy, and applications are updated.

Why this answer

Option A is correct because creating a new storage account with infrastructure encryption and customer-managed keys (CMK) is the only way to enable infrastructure encryption, which cannot be enabled on an existing account. Using AzCopy ensures minimal downtime by copying data in the background while applications continue to use the old account until the cutover. Enabling versioning and VNet access on the new account meets all compliance requirements without disrupting the existing environment.

Exam trap

The trap here is that candidates assume infrastructure encryption can be enabled on an existing storage account, but it is a creation-time-only setting, making a new account the only viable path to meet the compliance requirement with minimal downtime.

How to eliminate wrong answers

Option B is wrong because you cannot enable infrastructure encryption on an existing storage account; it must be set at creation time. Additionally, changing encryption settings from Microsoft-managed to customer-managed keys on an existing account does not enable infrastructure encryption, which is required by the compliance team. Option C is wrong because deleting and recreating the account with the same name causes significant downtime, and restoring from backups is not a minimal-downtime approach; also, infrastructure encryption cannot be enabled on a recreated account with the same name if the original was created without it.

Option D is wrong because creating the new account in a different region violates the requirement to keep data in the same region, and Azure Data Factory is not the optimal tool for a one-time bulk copy; AzCopy is more efficient for this scenario.

Practice this question →

149

MCQmedium

Your company is building a real-time analytics solution for monitoring manufacturing equipment. Sensors send JSON data every second to an Azure Event Hubs instance. The data must be stored in Azure Data Lake Storage Gen2 in Parquet format, partitioned by date and hour. You use Azure Stream Analytics to read from Event Hubs and write to ADLS Gen2. Currently, the output is writing many small Parquet files (under 1 MB each), which is causing performance issues when reading the data. You need to optimize the output to produce fewer, larger files while maintaining low latency. What should you do?

A.Partition the output by minute instead of hour to distribute data more

B.Change the output format to Avro to improve compression

C.Increase the 'Maximum events per batch' setting in the Stream Analytics output to ADLS Gen2

D.Decrease the 'Maximum events per batch' setting to reduce latency

AnswerC

Buffering more events per batch produces larger files.

Why this answer

Option C is correct because increasing the 'Maximum events per batch' setting in the Stream Analytics output to ADLS Gen2 allows more events to be accumulated before writing a file, resulting in fewer, larger Parquet files. This directly addresses the small-file problem while maintaining low latency, as the batching is time-bound and does not introduce excessive delay.

Exam trap

The trap here is that candidates may confuse partitioning granularity with file sizing, incorrectly assuming finer partitioning (Option A) or format changes (Option B) will solve the small-file problem, when the actual solution is to adjust the batching threshold in the output sink.

How to eliminate wrong answers

Option A is wrong because partitioning by minute instead of hour would create even more partitions, leading to more small files and worsening the performance issue. Option B is wrong because changing the output format to Avro does not inherently reduce the number of files; it only changes the serialization format, and Avro may not provide better compression than Parquet for this scenario. Option D is wrong because decreasing the 'Maximum events per batch' setting would cause more frequent writes with fewer events, increasing the number of small files and degrading read performance.

Practice this question →

150

MCQhard

A data engineering team uses Azure Data Factory to load data from Azure SQL Database to Azure Data Lake Storage Gen2. They notice that the pipeline runs fail intermittently due to transient errors. They need to implement a retry policy with exponential backoff. What is the most efficient way to achieve this?

A.Use a 'Validation' activity before the copy to check source availability

B.Create a custom .NET activity to handle retries

C.Add a 'Until' loop with a wait activity in the pipeline

D.Configure the 'Retry' property on the copy activity with a count and exponential backoff interval

AnswerD

Built-in retry with exponential backoff.

Why this answer

Option D is correct because Azure Data Factory natively supports configuring a 'Retry' property on activities, including Copy activities, with an exponential backoff interval. This built-in mechanism automatically retries the activity upon transient failures without requiring custom logic, making it the most efficient and maintainable approach for handling intermittent errors.

Exam trap

The trap here is that candidates may overcomplicate the solution by choosing a custom loop or validation activity, overlooking that Azure Data Factory's native 'Retry' property with exponential backoff is the simplest and most efficient built-in mechanism for handling transient errors.

How to eliminate wrong answers

Option A is wrong because a 'Validation' activity only checks source availability before the copy starts; it does not retry the copy operation itself if a transient error occurs during data transfer. Option B is wrong because creating a custom .NET activity introduces unnecessary complexity, development overhead, and maintenance burden when Azure Data Factory already provides a native retry feature. Option C is wrong because an 'Until' loop with a wait activity requires manual implementation of retry logic and exponential backoff, which is less efficient and more error-prone than using the built-in 'Retry' property.

Practice this question →