DP-900Chapter 96 of 101Objective 3.1

PolyBase and External Tables in Synapse

This chapter covers PolyBase and external tables in Azure Synapse Analytics, a critical feature for querying data stored outside the dedicated SQL pool without moving it. For the DP-900 exam, this topic appears in roughly 10-15% of questions under objective 3.1 (Analyze data in a relational data warehouse). You will need to understand how PolyBase enables T-SQL queries against external data sources, the types of external data sources supported, and how external tables are created and used. Mastering this will help you answer questions about data virtualization and hybrid querying scenarios.

25 min read
Intermediate
Updated May 31, 2026

PolyBase as a Multilingual Librarian

Imagine a large library with a central catalog system that can query books stored in different wings. The main hall (Synapse SQL pool) has a single reference desk (PolyBase engine). A patron (user query) asks for information from a book stored in the rare books wing (Azure Blob Storage) and a book from the digital archives (Azure Data Lake Storage). The librarian doesn't physically bring every book into the main hall; instead, she sends a request to the rare books wing, which scans its index and returns only the relevant pages. Similarly, the digital archives system processes the query locally and returns results. The librarian then combines these results at the reference desk and presents them to the patron. This is efficient because she never moves entire books—only the needed data. If the rare books wing uses a different cataloging system (like Hadoop), the librarian uses a special protocol (external table definition) to communicate. The key is that the librarian knows exactly where each book is stored and how to access it without moving it to the main hall. This mirrors PolyBase's ability to query external data in Azure Blob Storage, Azure Data Lake Storage, or Hadoop via external tables, using T-SQL statements that push down computation where possible.

How It Actually Works

What is PolyBase and Why Does It Exist?

PolyBase is a technology that enables Azure Synapse Analytics (formerly SQL Data Warehouse) to query data from external sources using standard T-SQL statements. It was originally developed by Microsoft for SQL Server 2016 and later integrated into Azure Synapse. The primary purpose is to allow data virtualization—querying data without moving or copying it into the dedicated SQL pool. This is critical for scenarios where data resides in Azure Blob Storage, Azure Data Lake Storage (Gen1 or Gen2), or Hadoop (via Hortonworks Data Platform or Cloudera). PolyBase avoids expensive data movement and enables hybrid architectures.

How PolyBase Works Internally

When you issue a T-SQL query that references an external table, the PolyBase engine in Synapse SQL pool does not pull the entire external dataset into the pool. Instead, it: 1. Parses the query and identifies which parts involve external tables. 2. Contacts the external data source using the connection information defined in the external data source object. 3. If possible, pushes down parts of the query (filtering, projection) to the external source (e.g., Hadoop or Azure Storage) to reduce data movement. 4. Retrieves only the necessary data from the external source, typically in a compressed format (e.g., ORC, Parquet, or delimited text). 5. Processes the data in the SQL pool and returns results.

PolyBase supports two modes for reading external data: - Direct read: For data in Azure Blob Storage or Azure Data Lake Storage, PolyBase reads the files directly. - Pushdown: For Hadoop or Azure HDInsight, PolyBase can push down filter and join operations to the MapReduce or Spark layer.

Key Components

To use PolyBase, you must create several objects in the Synapse SQL pool:

- External Data Source: Defines the location and connection string for the external data. Supported types: - HADOOP for Hadoop or Azure HDInsight. - SHARD_MAP_MANAGER for Azure SQL Database sharding (not commonly tested). - RDBMS for remote SQL Server (not in Synapse, only in SQL Server 2019+). Example syntax:

CREATE EXTERNAL DATA SOURCE MyDataSource WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://container@storageaccount.blob.core.windows.net',
    CREDENTIAL = MyCredential
);

- External File Format: Defines the format of the external data (delimited text, Parquet, ORC, RCFile, etc.). Example:

CREATE EXTERNAL FILE FORMAT MyFileFormat WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

- External Table: Similar to a regular table but points to data in the external source. It includes the schema and references the external data source and file format. Example:

CREATE EXTERNAL TABLE dbo.ExternalSales (
    SaleID INT,
    SaleDate DATE,
    Amount DECIMAL(10,2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = MyFileFormat
);

Supported Data Formats and Locations

PolyBase supports reading from: - Azure Blob Storage: via wasbs:// or abfss:// (for ADLS Gen2). - Azure Data Lake Storage Gen1: via adl://. - Azure Data Lake Storage Gen2: via abfss://. - Hadoop (HDFS): via hdfs://. - Azure HDInsight: as a Hadoop cluster.

File formats:

Delimited text (CSV, TSV) – with optional header row.

ORC (Optimized Row Columnar).

Parquet.

RCFile.

Defaults and Considerations

By default, PolyBase attempts to read all files in the specified location and its subdirectories.

For delimited text, if the first row contains column names, you must specify FIRST_ROW = 2 in the external file format or use a rejection rule.

PolyBase can handle up to 200 external tables per database.

The maximum number of external data sources is 10.

The maximum number of external file formats is 50.

PolyBase queries have a timeout of 10 minutes by default, configurable via QUERY_TIMEOUT.

Performance and Pushdown

PolyBase can push down filter predicates and projections to the external source if the source supports it (e.g., Hadoop with ORC/Parquet). For Azure Blob Storage, pushdown is not supported because Blob Storage is a flat file system; all data must be read and processed in the SQL pool. However, using columnar formats like Parquet reduces I/O because only relevant columns are read.

Interaction with Related Technologies

Azure Synapse Pipelines: Can orchestrate PolyBase queries.

COPY INTO: A newer, simpler method for loading data into dedicated SQL pool, but PolyBase is still used for external tables.

Azure Data Factory: Can use PolyBase as a sink or source.

Security

PolyBase uses database-scoped credentials to authenticate to external storage. These credentials can be based on:

Storage account key (shared key).

Shared Access Signature (SAS) token.

Managed identity (recommended for security).

Service principal (for ADLS Gen2).

Example credential creation:

CREATE DATABASE SCOPED CREDENTIAL MyCredential WITH
    IDENTITY = 'SHARED ACCESS SIGNATURE',
    SECRET = 'sv=2018-03-28&ss=b&...';

Querying External Tables

Once created, you can query external tables like regular tables:

SELECT * FROM dbo.ExternalSales WHERE SaleDate > '2023-01-01';

You can also join external tables with local tables:

SELECT l.Region, SUM(e.Amount) AS TotalSales
FROM dbo.LocalRegions l
JOIN dbo.ExternalSales e ON l.RegionID = e.RegionID
GROUP BY l.Region;

Limitations

External tables are read-only (no INSERT, UPDATE, DELETE).

Cannot create indexes on external tables.

Statistics are not automatically created; you must manually create them using CREATE STATISTICS on external table columns for optimal performance.

PolyBase does not support all T-SQL functions; for example, GETDATE() is supported, but RAND() is not.

Verification Commands

To check existing external objects:

SELECT * FROM sys.external_data_sources;
SELECT * FROM sys.external_file_formats;
SELECT * FROM sys.external_tables;

To view execution plan for a PolyBase query:

EXPLAIN WITH_RECOMMENDATIONS SELECT * FROM dbo.ExternalSales;

Walk-Through

1

Define External Data Source

Create an external data source object that points to the location of your external data. This includes the type (HADOOP for Azure Blob/ADLS/Hadoop), the location URL (e.g., 'wasbs://container@storageaccount.blob.core.windows.net' for Blob Storage), and optionally a credential for authentication. The credential must be created beforehand using CREATE DATABASE SCOPED CREDENTIAL. The external data source tells PolyBase where to find the data and how to connect. This step is mandatory before creating external tables.

2

Define External File Format

Create an external file format object that describes the structure of the data files. Specify FORMAT_TYPE (e.g., PARQUET, ORC, DELIMITEDTEXT), and for delimited text, set FIELD_TERMINATOR, STRING_DELIMITER, FIRST_ROW if header exists, and optionally DATA_COMPRESSION (e.g., 'org.apache.hadoop.io.compress.SnappyCodec'). For columnar formats like Parquet, compression is often built-in. This object ensures PolyBase correctly interprets the file contents.

3

Create External Table

Define the external table schema with column names and data types. Use the CREATE EXTERNAL TABLE statement, specifying LOCATION (the folder or file path relative to the data source root), DATA_SOURCE, and FILE_FORMAT. For example, LOCATION='/sales/' will read all files in the 'sales' folder. The external table is a metadata-only object; no data is moved. After creation, you can query it like a regular table. Note that external tables are read-only.

4

Query External Table with T-SQL

Write SELECT statements against the external table. PolyBase reads the data from the external location and processes it in the SQL pool. You can apply filters, projections, joins with local tables, aggregations, etc. For optimal performance, use columnar formats (Parquet, ORC) and filter early. You can also create statistics on external table columns to help the optimizer. Use EXPLAIN to see if pushdown occurs.

5

Manage and Monitor External Objects

Monitor external tables using system views like sys.external_tables, sys.external_data_sources, and sys.external_file_formats. To drop external objects, use DROP EXTERNAL TABLE, DROP EXTERNAL DATA SOURCE, DROP EXTERNAL FILE FORMAT. Note that dropping an external table does not delete the underlying data. For troubleshooting, check error logs; common issues include incorrect credentials, wrong file format, or missing files.

What This Looks Like on the Job

Enterprise Scenario 1: Data Lake Querying for Reporting

A retail company stores historical sales data in Azure Data Lake Storage Gen2 as Parquet files, partitioned by year and month. The data warehouse team uses Synapse dedicated SQL pool for daily reporting. Rather than copying terabytes of data into the pool, they create external tables pointing to the Parquet files. Analysts can then query sales data from 2020-2023 using standard T-SQL, joining with dimension tables in the pool. Performance is acceptable because Parquet is columnar and filters push down to file-level pruning. The team set up a managed identity for the Synapse workspace to authenticate to ADLS Gen2. One common issue: if the external files are not in a consistent schema, queries fail with 'File format error'. They enforce schema-on-read by ensuring all files in the location have the same columns.

Enterprise Scenario 2: Hybrid Hadoop Migration

A financial services firm has an on-premises Hadoop cluster with historical trade data. They are migrating to Azure and want to use Synapse for analytics. They set up an Azure HDInsight cluster as a staging environment and create external tables in Synapse pointing to HDFS via the HADOOP data source type. PolyBase pushes down filter operations to the HDInsight cluster, reducing data transfer. The migration team gradually moves data to ADLS Gen2 and updates the external data source URL. They discovered that PolyBase pushdown works only with ORC and Parquet files on HDInsight; CSV files do not support pushdown, causing slower queries. They also had to configure firewall rules to allow Synapse to communicate with HDInsight.

Enterprise Scenario 3: Real-time Ingestion with PolyBase

A logistics company ingests IoT device data into Azure Blob Storage as CSV files every 5 minutes. They use PolyBase external tables to query the latest data for dashboards. However, they found that PolyBase does not automatically detect new files; they must recreate or refresh metadata. They solved this by using a stored procedure that drops and recreates the external table daily. They also set up alerts for when PolyBase queries fail due to missing files or authentication issues. A misconfiguration they encountered: using a SAS token that expired, causing all external queries to fail until the credential was updated.

How DP-900 Actually Tests This

What DP-900 Tests on PolyBase and External Tables

The exam objective 3.1 includes: 'Describe how to query data using PolyBase and external tables in Azure Synapse Analytics.' Specifically, you need to know:

The purpose of PolyBase (query external data without moving it).

Supported external data sources: Azure Blob Storage, Azure Data Lake Storage (Gen1 and Gen2), Hadoop/HDInsight.

Supported file formats: Delimited text, ORC, Parquet, RCFile.

The three objects required: external data source, external file format, external table.

That external tables are read-only.

That PolyBase can push down computation to Hadoop/HDInsight.

How to create credentials (database-scoped credentials).

Common Wrong Answers and Why

1.

'PolyBase can write data to external sources.' – Wrong. External tables are read-only. PolyBase cannot INSERT, UPDATE, or DELETE external data.

2.

'PolyBase requires data to be loaded into the SQL pool first.' – Wrong. The whole point is to query data in place.

3.

'PolyBase supports Azure SQL Database as an external data source in Synapse.' – Wrong. In Synapse, PolyBase only supports Hadoop-type sources (Blob, ADLS, Hadoop). Remote SQL Server is supported only in SQL Server 2019+ PolyBase, not in Synapse.

4.

'You can create indexes on external tables.' – Wrong. No indexes on external tables.

Exact Values and Terms on the Exam

External data source types: HADOOP, SHARD_MAP_MANAGER (rarely tested).

File format types: DELIMITEDTEXT, PARQUET, ORC, RCFILE.

Location prefix: wasbs:// for Blob Storage, abfss:// for ADLS Gen2, adl:// for ADLS Gen1.

Maximum external tables per database: 200.

Maximum external data sources: 10.

PolyBase is used in Azure Synapse Analytics dedicated SQL pool (not serverless SQL pool, which uses OPENROWSET).

Edge Cases and Exceptions

If the external data is in CSV with header row, you must set FIRST_ROW = 2 in the external file format or use a rejection rule.

PolyBase does not support nested data types like arrays or maps; you must flatten them.

For authentication, managed identity is recommended over storage account keys or SAS tokens.

PolyBase queries can timeout; default is 10 minutes.

How to Eliminate Wrong Answers

If an answer says 'load data into the warehouse' or 'copy data', it is likely wrong because PolyBase queries in place.

If an answer mentions 'write' or 'modify' external data, it is wrong.

If an answer mentions 'Azure SQL Database' as a supported external source in Synapse, it is wrong (only Hadoop sources).

If an answer says 'no authentication needed', it is wrong; credentials are required.

Key Takeaways

PolyBase enables T-SQL queries against external data in Azure Blob Storage, ADLS, or Hadoop without moving data.

Three objects required: external data source, external file format, external table.

External tables are read-only and cannot be indexed.

Supported file formats: Delimited text, ORC, Parquet, RCFile.

Maximum 200 external tables per database, 10 external data sources.

Authentication via database-scoped credentials: storage key, SAS, managed identity, or service principal.

PolyBase can push down filters and projections to Hadoop/HDInsight for columnar formats.

Use 'wasbs://' for Blob Storage, 'abfss://' for ADLS Gen2, 'adl://' for ADLS Gen1.

For CSV with header row, set FIRST_ROW = 2 in file format.

PolyBase is specific to dedicated SQL pool in Synapse; serverless uses OPENROWSET.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

PolyBase External Tables

Used in dedicated SQL pool only.

Requires pre-creation of external data source, file format, and table.

Supports querying data in Blob Storage, ADLS, and Hadoop.

Read-only; no DML operations.

Supports pushdown to Hadoop/HDInsight.

OPENROWSET (Serverless SQL Pool)

Used in serverless SQL pool (Synapse serverless).

No pre-creation needed; query directly using OPENROWSET with BULK option.

Supports Blob Storage and ADLS (Gen1/Gen2).

Read-only; used for ad-hoc exploration.

No pushdown; data is read and processed in serverless pool.

Watch Out for These

Mistake

PolyBase can be used to insert data into external files.

Correct

PolyBase external tables are read-only. You cannot perform INSERT, UPDATE, or DELETE operations on them. To write data, you must use other tools like Azure Data Factory, COPY INTO, or PolyBase in a different context (e.g., SQL Server 2019 PolyBase can insert into external tables, but not in Synapse).

Mistake

PolyBase works with any file format.

Correct

PolyBase supports only delimited text (CSV, TSV), ORC, Parquet, and RCFile. Other formats like JSON, Avro, or Excel are not supported. You must convert data to a supported format before using PolyBase.

Mistake

External tables automatically update when new files are added to the storage location.

Correct

External tables are metadata snapshots; they do not automatically detect new files. If you add new files, the external table still sees only the files that existed at creation time. You need to refresh the metadata, e.g., by dropping and recreating the external table, or using a partitioned table with partition switching.

Mistake

PolyBase can query data from Azure SQL Database directly in Synapse.

Correct

In Azure Synapse Analytics, PolyBase only supports external data sources of type HADOOP (Blob, ADLS, Hadoop). For querying Azure SQL Database, you must use other methods like Linked Servers (not available in Synapse) or Azure Data Factory. PolyBase in SQL Server 2019 supports RDBMS sources, but that is not part of DP-900.

Mistake

You can create indexes on external tables to improve performance.

Correct

External tables do not support indexes. Performance optimization relies on file format (columnar), query pushdown, and creating statistics on external table columns. You can create statistics using CREATE STATISTICS, but not indexes.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is PolyBase in Azure Synapse Analytics?

PolyBase is a technology that allows you to query external data stored in Azure Blob Storage, Azure Data Lake Storage, or Hadoop using T-SQL without moving the data. You define external data sources, file formats, and external tables. It is used in the dedicated SQL pool of Synapse Analytics. The main benefit is data virtualization—you can join external data with local tables as if it were in the database.

How do I create an external table in Synapse?

First, create a database-scoped credential for authentication. Then create an external data source with the location and credential. Next, create an external file format specifying the format (e.g., PARQUET). Finally, create the external table with CREATE EXTERNAL TABLE, specifying LOCATION, DATA_SOURCE, and FILE_FORMAT. Example: CREATE EXTERNAL TABLE dbo.Sales (SaleID INT) WITH (LOCATION='/sales/', DATA_SOURCE=MyDS, FILE_FORMAT=MyFF);

Can PolyBase write data to external files?

No. In Azure Synapse Analytics, PolyBase external tables are read-only. You cannot perform INSERT, UPDATE, or DELETE. To write data externally, use the COPY INTO command or Azure Data Factory. Note that SQL Server 2019 PolyBase can insert into external tables, but that is not relevant for DP-900.

What file formats does PolyBase support?

PolyBase supports delimited text (CSV, TSV), ORC (Optimized Row Columnar), Parquet, and RCFile. For delimited text, you can specify field and string delimiters, and whether the first row is a header. For columnar formats, compression is often built-in. JSON, Avro, and Excel are not supported.

What is the difference between PolyBase and OPENROWSET in Synapse?

PolyBase is used in the dedicated SQL pool and requires pre-defined external objects. OPENROWSET is used in the serverless SQL pool and allows ad-hoc querying of files directly without creating external objects. PolyBase supports pushdown to Hadoop; OPENROWSET does not. Both are read-only. For DP-900, know that PolyBase is for dedicated pool, OPENROWSET for serverless.

How does PolyBase authenticate to Azure Storage?

PolyBase uses database-scoped credentials. You create a credential with an identity and secret. Supported identity types include 'Storage Account Key' (shared key), 'Shared Access Signature' (SAS token), 'Managed Identity' (recommended), and 'Service Principal' (for ADLS Gen2). The credential is referenced in the external data source definition.

Can I query external tables with joins to local tables?

Yes. You can join external tables with regular tables in the dedicated SQL pool. For example: SELECT * FROM LocalTable L JOIN ExternalTable E ON L.ID = E.ID. PolyBase will read the external data and perform the join in the SQL pool. For optimal performance, filter and aggregate early to reduce data movement.

Terms Worth Knowing

Ready to put this to the test?

You've just covered PolyBase and External Tables in Synapse — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.

Done with this chapter?