SC-200Chapter 29 of 101Objective 2.2

Advanced KQL: Joins, Summarize, and Aggregations

This chapter covers advanced KQL operations essential for the SC-200 exam, focusing on joins, summarize, and aggregations. These topics appear in approximately 15-20% of exam questions, especially in the 'Create and manage analytics rules' and 'Perform threat hunting' objectives. Mastering these operators allows you to correlate data across tables, detect patterns, and generate security insights efficiently. The chapter will explain the mechanics of each operator, provide syntax examples, and highlight common pitfalls tested on the exam.

25 min read
Intermediate
Updated May 31, 2026

KQL Joins Like Library Card Catalogs

Imagine a library with two separate card catalogs: one for books (title, author, genre) and one for borrowing records (member ID, book ID, due date). A 'join' is like a librarian who wants to find out which books are overdue. She takes the borrowing records catalog and, for each borrowing record, walks to the books catalog, looks up the book ID, and writes the book title and author next to the borrowing record. If she uses an 'inner join', she only writes down titles for book IDs that exist in both catalogs—if a borrowing record has a book ID that doesn't match any book card, she discards that record. If she uses a 'left outer join', she writes down the title when it exists, but if the book ID is missing, she writes 'Unknown' and still keeps the borrowing record. The 'summarize' operation is like a different librarian who takes the borrowing records, groups them by member ID, and counts how many books each member has borrowed. She produces a new card: one per member with the count. 'Aggregations' are the math she does—counting, summing due dates, averaging loan duration. KQL's 'summarize' is exactly that: it takes a table, splits it into groups based on one or more columns (the 'by' clause), and for each group computes aggregate functions like count(), sum(), avg(), min(), max(), dcount(), and percentiles(). The result is a new table with one row per group. Understanding this analogy helps you see that joins combine rows from two tables based on a key, while summarize reduces many rows into summary rows per group.

How It Actually Works

What are Joins in KQL?

Joins in KQL combine rows from two tables based on a related column (key). The SC-200 exam tests your ability to choose the correct join type for a given scenario. The basic syntax is:

Table1 | join kind=<kind> Table2 on <key>

KQL supports the following join kinds: inner, innerunique, leftouter, rightouter, fullouter, leftanti, rightanti, leftsemi, rightsemi, and cross. The exam focuses on innerunique, leftouter, and anti joins.

How Joins Work Internally

When you execute a join, KQL performs a hash-match algorithm by default. It builds a hash table from the right-side table (the 'inner' table) based on the join key. Then it scans the left-side table (the 'outer' table) and for each row, hashes the key and looks up matching rows in the hash table. The result is a new table with columns from both sides. If there are multiple matches, the row from the left side is duplicated for each match. The join kind determines which rows are kept:

innerunique: Only one row from the right side is matched to each left row, even if multiple matches exist. The first match is used. This is the default join kind. It's efficient but can lose data if you expect multiple matches.

inner: All matching rows from both sides are included. If a left row matches multiple right rows, all combinations appear.

leftouter: All rows from the left table are kept. If no match, right-side columns are null.

rightouter: All rows from the right table are kept. Left-side columns are null if no match.

fullouter: All rows from both tables are kept, with nulls for non-matching sides.

leftanti: Returns only rows from the left table that have no match in the right table.

rightanti: Returns only rows from the right table that have no match in the left table.

leftsemi: Returns rows from the left table that have at least one match in the right table (like inner but only left columns).

rightsemi: Returns rows from the right table that have at least one match in the left table.

cross: Cartesian product of both tables (every row from left combined with every row from right). This is expensive and rarely used in security scenarios.

Key Components and Defaults

Join key: The column(s) used to match rows. You can use a single column or a comma-separated list. If column names differ between tables, use the syntax $left.ColumnName == $right.ColumnName.

Default join kind: innerunique. If you omit kind=, KQL uses innerunique.

Performance: Joins are memory-intensive. For large tables, use hint.strategy=broadcast to broadcast a small table to all nodes, or hint.strategy=shuffle to redistribute data. The exam may test that broadcast is best when one table is much smaller than the other.

Time window: For time-based joins, you can use a time window to reduce memory. Example: on $left.Timestamp between($right.Timestamp-1h .. $right.Timestamp+1h).

Configuration and Verification

Joins are not 'configured' but used in queries. To verify join behavior, you can use the | count operator before and after to see row count changes. The | getschema operator can show column structure. Example:

SecurityEvent
| where TimeGenerated > ago(1d)
| join kind=innerunique (
    SigninLogs
    | where TimeGenerated > ago(1d)
) on AccountUpn
| take 10

How Joins Interact with Other Technologies

Joins are often combined with where, project, extend, and summarize. For example, you might join SecurityEvent with SigninLogs to correlate sign-in events with security alerts. The exam frequently presents scenarios where you need to join two tables and then summarize results.

What is Summarize?

The summarize operator groups rows by one or more columns (the 'by' clause) and computes aggregate functions for each group. It is the KQL equivalent of SQL's GROUP BY. Syntax:

Table | summarize [AggregateFunction(Column) [as Alias], ...] by GroupColumn [, GroupColumn2, ...]

How Summarize Works Internally

KQL partitions the data by the 'by' columns, creating a group for each unique combination of values. For each group, it evaluates the aggregate functions. The result is a new table with one row per group and columns for the 'by' keys and the aggregated values. The order of groups is arbitrary unless you use | order by or | sort by.

Key Aggregate Functions for SC-200

count(): Number of rows in the group. count() returns an integer. countif(predicate) counts rows where predicate is true.

sum(): Sum of a numeric column. Returns a decimal. Works only on numeric types.

avg(): Average of a numeric column. Returns a double.

min() / max(): Minimum or maximum value of a column. Works on any comparable type (numbers, dates, strings).

dcount(): Estimated distinct count using HyperLogLog algorithm. Faster than distinct | count but approximate (error margin ~1%). Use for high-cardinality columns.

make_list(): Creates a JSON array of all values in the group. Useful for building lists of IPs or accounts.

make_set(): Creates a JSON array of distinct values in the group. Similar to make_list but removes duplicates.

percentiles(): Calculates percentile values. Syntax: percentiles(Column, 50, 95, 99) returns an array or multiple columns.

arg_min() / arg_max(): Returns the row that has the min/max value of a column. Syntax: arg_max(Column, OtherColumn1, OtherColumn2) returns the other columns for the row with max Column.

Defaults and Limitations

summarize does not guarantee order. Use | sort by after summarize if needed.

If no 'by' clause is specified, the entire table becomes one group. This is useful for overall counts.

summarize can produce a huge number of groups if the 'by' columns have high cardinality. This can consume memory. Use hint.shufflekey=<key> to distribute the load for high-cardinality keys.

The bin() function is often used with summarize to group time into buckets. Example: summarize count() by bin(TimeGenerated, 1h).

Configuration and Verification

Use | count before and after summarize to see the reduction in rows. Use | render timechart to visualize aggregated time series. Example:

SecurityEvent
| where TimeGenerated > ago(7d)
| summarize EventCount = count() by EventID, bin(TimeGenerated, 1d)
| sort by TimeGenerated asc
| render timechart

How Summarize Interacts with Other Technologies

Summarize is often used after joins to reduce joined results. For example, after joining SecurityEvent with SigninLogs, you might summarize by AccountUpn to get a count of security events per user. Summarize is also used in analytics rules to detect anomalies (e.g., count of failed logins per user per hour exceeding a threshold).

Aggregations in Detail

Aggregations are the functions used within summarize. The SC-200 exam expects you to know which aggregation to use for a given question. Common patterns:

Count of events: count()

Sum of bytes: sum(Size)

Average latency: avg(Latency)

Earliest/latest event: min(TimeGenerated) / max(TimeGenerated)

Distinct IPs: dcount(IPAddress) or make_set(IPAddress)

Top N: Use summarize with top-nest or combine with | top operator.

Advanced: Using summarize with join

A common exam scenario: first join two tables, then summarize. Example:

SecurityEvent
| join kind=leftouter (
    SigninLogs
    | where TimeGenerated > ago(1d)
) on AccountUpn
| summarize FailedLogins = countif(EventID == 4625), SuccessfulLogins = countif(EventID == 4624) by AccountUpn
| where FailedLogins > 10

This query joins security events with sign-in logs, then summarizes per user to find users with many failed logins.

Performance Considerations

Join order: Put the smaller table on the right side for better performance (hash table is built from the right side).

Use `where` before join: Filter both tables as much as possible before joining to reduce rows.

Use `project` to select only needed columns: Reduces memory footprint.

Avoid cross joins: They produce massive results.

Use `hint.strategy=broadcast` when one table is very small (e.g., a lookup table with <1000 rows).

Use `hint.shufflekey=<key>` for summarize on high-cardinality keys to parallelize.

Common Exam Traps

Forgetting that `innerunique` is the default: If you need all matches, you must specify kind=inner.

Using `leftouter` when you mean `leftanti`: leftouter keeps all left rows with nulls for non-matches; leftanti keeps only non-matching left rows.

Not using `bin()` with time in summarize: Without bin(), each unique timestamp becomes its own group, which is usually not intended.

Confusing `dcount()` with `count(distinct ...)`: dcount() is approximate but fast; distinct | count is exact but slower.

Assuming summarize preserves row order: It does not; use | sort by.

Interaction with KQL Functions

Joins and summarize can be combined with let statements to create reusable queries. For example:

let Signins = SigninLogs | where TimeGenerated > ago(1d);
let Alerts = SecurityAlert | where TimeGenerated > ago(1d);
Alerts
| join kind=innerunique Signins on AccountUpn
| summarize AlertCount = count() by AccountUpn

This modular approach is common in hunting queries and analytics rules.

Summary of Exam-Relevant Syntax

Join: Table1 | join kind=<kind> Table2 on <key>

Summarize: Table | summarize <Aggregate> by <GroupColumn>

Bin: bin(TimeGenerated, 1h)

Countif: countif(EventID == 4625)

Make_set: make_set(IPAddress)

Arg_max: arg_max(TimeGenerated, *) (returns the entire row with the latest timestamp)

Master these patterns to answer join and summarize questions quickly on the exam.

Walk-Through

1

Identify Tables and Keys

First, determine which tables contain the data you need. For example, SecurityEvent for Windows events and SigninLogs for Azure AD sign-ins. Identify the common key that links them, such as AccountUpn (user principal name) or IPAddress. Ensure the key exists in both tables and has the same data type. If column names differ, use the $left/$right syntax. This step is critical because a mismatch in key names or types causes an error or empty results. On the exam, you may be given two tables and asked to choose the correct key.

2

Choose Join Kind

Select the appropriate join kind based on the desired output. If you need only rows that exist in both tables, use innerunique (default) or inner. If you need all rows from the left table with nulls for non-matches, use leftouter. To find rows in the left table with no match in the right, use leftanti. The exam often tests your ability to choose between leftouter and leftanti. For example, to find users who signed in but had no security events, you would use leftanti on SecurityEvent with SigninLogs.

3

Filter and Project Before Join

Apply where clauses to both tables to reduce the number of rows. Use project to select only the columns you need (including the join key). This minimizes memory usage and speeds up the join. For example: `SecurityEvent | where TimeGenerated > ago(1d) | project AccountUpn, EventID, TimeGenerated`. On the exam, you might be asked to optimize a slow query; filtering before join is a common answer.

4

Execute the Join

Write the join statement with the correct syntax. Place the smaller table on the right side for better performance. Use hint.strategy=broadcast if one table is very small. Example: `SecurityEvent | join kind=leftouter (SigninLogs | where TimeGenerated > ago(1d)) on AccountUpn`. The output is a new table with columns from both sides. If there are duplicate column names, KQL appends a suffix (e.g., AccountUpn1). Use project-away to remove duplicates.

5

Summarize the Joined Data

After joining, use summarize to aggregate results. Group by a relevant column (e.g., AccountUpn) and apply aggregate functions like count(), sum(), or dcount(). Use bin() to group time into buckets. Example: `summarize FailedLogins = countif(EventID == 4625) by AccountUpn, bin(TimeGenerated, 1h)`. This step reduces many rows into summary rows. The exam may ask you to calculate a specific metric, such as number of failed logins per user per hour.

6

Sort and Filter Results

Finally, sort the results using `| sort by` or `| order by` to make them readable. Apply additional filters using `| where` to focus on anomalies, such as users with more than 10 failed logins. Example: `| where FailedLogins > 10 | sort by FailedLogins desc`. This step is often required to answer the exam question (e.g., 'Which user had the most failed logins?'). Remember that summarize does not guarantee order, so sorting is essential.

What This Looks Like on the Job

Scenario 1: Correlating Azure AD Sign-ins with Security Events

A SOC analyst needs to investigate a potential brute-force attack. They have two data sources: Azure AD SigninLogs (contains successful and failed sign-ins) and Windows Security Events (Event ID 4625 for failed logons, 4624 for successful). The goal is to find users who had many failed sign-ins from different IPs but no successful sign-in. The analyst writes a query that joins SigninLogs with SecurityEvent on AccountUpn, then summarizes count of failed logins per user and uses leftanti to find users with no successful sign-in. In production, these tables can have millions of rows per day, so filtering to the last hour and using project to select only needed columns is crucial. A common misconfiguration is using innerunique instead of inner, which may miss multiple failed sign-ins from the same user if there are multiple matches. The correct approach uses kind=inner to capture all failed events.

Scenario 2: Detecting Lateral Movement Using Network Logs

A security engineer monitors network traffic logs (CommonSecurityLog) and Windows Event Logs (SecurityEvent) to detect lateral movement. They join CommonSecurityLog (source IP, destination IP) with SecurityEvent (AccountUpn, IPAddress) to find which user logged into a machine after network traffic from that machine to another. The join key is IPAddress (or DestinationIP). They use a time window of 5 minutes to correlate events. The summarize step groups by user and counts distinct destination IPs. A common issue is that the join key may have different formats (e.g., IPv4 vs IPv6, or with/without port). The engineer must normalize keys using extend with parse_ipv4 or substring. In a large enterprise, this query can run for hours if not optimized; using hint.shufflekey on the IP column improves performance.

Scenario 3: Aggregating Alert Trends for Reporting

A manager wants a weekly report of security alerts by severity. The analyst uses SecurityAlert table, summarizes count by bin(TimeGenerated, 1d) and AlertSeverity, then renders a timechart. They use make_set(AlertName) to list alert names per day. A pitfall is that summarize without bin on time creates a group per unique timestamp, which is every millisecond, resulting in millions of groups. Always use bin() with a meaningful interval. Also, dcount(AlertName) is used for distinct alert types, but the exam may test that dcount is approximate; for exact counts, use distinct | count. The report is scheduled using a Kusto function and exported to a Power BI dashboard.

How SC-200 Actually Tests This

What SC-200 Tests

The SC-200 exam covers joins, summarize, and aggregations under objective 2.2 (Create and manage analytics rules) and objective 4.1 (Perform threat hunting). You are expected to:

Choose the correct join kind for a given scenario.

Write summarize queries with appropriate aggregate functions.

Use bin() for time-based grouping.

Optimize queries using filters and projections.

Interpret results from joined/summarized data.

Common Wrong Answers and Why Candidates Choose Them

1.

Using innerunique when inner is needed: Candidates see that innerunique is the default and assume it works for all cases. However, innerunique only keeps the first match from the right side. If a user has multiple failed logins, innerunique will count only one. The exam tests this by asking for the number of failed logins per user; many choose innerunique and get a lower count. The correct answer is inner.

2.

Using leftouter instead of leftanti to find non-matching rows: Candidates think leftouter returns all left rows, but they forget that leftanti explicitly returns only non-matching rows. If the question asks 'which users have no sign-ins?', leftouter would include users with sign-ins (with nulls). The exam expects leftanti.

3.

Forgetting bin() in summarize: When grouping by time, candidates write summarize count() by Timestamp. This creates one group per unique timestamp (milliseconds), not per hour. The exam includes a trap where the result has too many rows; the correct fix is to add bin(Timestamp, 1h).

4.

Confusing dcount() with count(distinct ...): dcount() is approximate and fast. If the question asks for an exact count of distinct IPs, dcount() may be wrong. The exam may specify 'exact' or 'approximate' in the stem.

Specific Numbers and Terms That Appear on the Exam

Default join kind: innerunique.

Time bin intervals: 1h, 1d, 5m are common.

Aggregate functions: count, sum, avg, min, max, dcount, make_list, make_set, arg_min, arg_max, percentiles.

Common Event IDs: 4624 (successful logon), 4625 (failed logon), 4688 (process creation).

Hints: hint.strategy=broadcast, hint.shufflekey.

Edge Cases and Exceptions

Null keys: If the join key contains nulls, they do not match other nulls. Use isnotempty() to filter nulls.

Multiple keys: Use on Key1, Key2 for composite keys. The order matters for performance.

Time window in join: Use on $left.Time between($right.Time-1h .. $right.Time+1h). This is often tested for correlation queries.

Summarize with no by clause: Returns a single row with aggregates for the entire table. Useful for overall counts.

How to Eliminate Wrong Answers

Read the question carefully: Look for keywords like 'all matching', 'only non-matching', 'exact count', 'approximate', 'per hour'.

Check join kind: If the question says 'include all rows from the left table even if no match', it's leftouter. If it says 'only rows that have a match in both tables', it's inner.

Check for bin(): If grouping by time, bin() must be present.

Check aggregate: If counting distinct values, is exactness required? If yes, use distinct | count; if approximate is fine, use dcount().

Performance: If the question mentions slow query, look for missing filters before join or missing project.

Key Takeaways

The default join kind in KQL is innerunique, not inner. Use kind=inner to include all matching rows.

Use leftanti to find rows in the left table that have no match in the right table.

Always use bin() with summarize when grouping by time to avoid one group per millisecond.

dcount() provides an approximate distinct count; use distinct | count for exact counts.

Filter and project tables before joining to improve performance.

Use hint.strategy=broadcast when one table is much smaller than the other.

summarize without a by clause aggregates the entire table into a single row.

Multiple aggregate functions can be used in a single summarize, separated by commas.

Time windows in joins are specified using the syntax: on $left.Time between($right.Time-1h .. $right.Time+1h).

Null values in join keys do not match each other; filter nulls using isnotempty().

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

innerunique join

Default join kind in KQL

Only one match from the right table is kept per left row

Faster because it stops after first match

Used when you only need to know if a match exists, not all matches

Can lose data if multiple matches are expected

inner join

Must specify kind=inner explicitly

All matching rows from both tables are included

Slower because it finds all matches

Used when you need all combinations, e.g., multiple failed logins per user

Produces more rows; may need deduplication

dcount()

Approximate count of distinct values

Fast and memory-efficient

Uses HyperLogLog algorithm

Error margin ~1%

Best for high-cardinality columns with large data

distinct | count

Exact count of distinct values

Slower and more memory-intensive

First retrieves all distinct values, then counts

No approximation error

Best when exactness is required and data is small

Watch Out for These

Mistake

The default join kind is inner, so it includes all matching rows from both tables.

Correct

The default join kind is innerunique, which only includes the first match from the right side for each left row. To include all matches, you must specify kind=inner.

Mistake

leftouter and leftanti are the same because both return rows from the left table.

Correct

leftouter returns all left rows, with nulls for non-matches. leftanti returns only left rows that have no match in the right table. They are opposite in behavior.

Mistake

summarize without a by clause is invalid.

Correct

summarize without a by clause is valid and aggregates the entire table into a single row. For example, `Table | summarize count()` returns the total row count.

Mistake

dcount() returns an exact count of distinct values.

Correct

dcount() uses HyperLogLog algorithm and returns an approximate count. The error margin is typically around 1%. For exact counts, use `| distinct | count`.

Mistake

You cannot use multiple aggregate functions in a single summarize.

Correct

You can use multiple aggregate functions separated by commas. Example: `summarize Total = count(), DistinctIPs = dcount(IP) by User`.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between innerunique and inner join in KQL?

innerunique is the default join kind in KQL. It matches each row from the left table to only the first matching row from the right table. If there are multiple matches, only the first is used. inner join, on the other hand, matches all rows: if a left row matches multiple right rows, all combinations appear in the result. Use inner when you need all matches (e.g., multiple failed logins per user). Use innerunique when you just need to know if a match exists (e.g., did the user have any sign-in?). On the exam, if the question asks for 'all failed logins', choose inner.

How do I find rows in one table that have no match in another table?

Use a leftanti join. For example, to find users in SecurityEvent who have no corresponding record in SigninLogs: `SecurityEvent | join kind=leftanti SigninLogs on AccountUpn`. This returns only rows from the left table (SecurityEvent) that have no match in the right table. Alternatively, you could use a leftouter join and then filter where right-side columns are null, but leftanti is more efficient and clearer.

Why does my summarize query return millions of rows when I group by time?

You likely forgot to use the bin() function on the time column. Without bin(), summarize creates a group for each unique timestamp value, which can be every millisecond. For example, `summarize count() by Timestamp` will produce a row per millisecond. Instead, use `summarize count() by bin(Timestamp, 1h)` to group by hour. Common intervals are 1h, 1d, and 5m. Always bin time columns in summarize.

What is the difference between dcount() and count(distinct ...) in KQL?

dcount() uses the HyperLogLog algorithm to estimate the number of distinct values. It is fast and memory-efficient but approximate (error ~1%). `Table | distinct Column | count` gives an exact count but is slower and uses more memory. Use dcount() for large datasets where approximation is acceptable. Use distinct | count when you need exact numbers, such as for compliance reporting. On the exam, look for keywords like 'approximate' or 'exact' in the question.

How can I improve the performance of a join query?

First, filter both tables using where clauses to reduce rows before the join. Second, use project to select only the columns you need (including the join key). Third, place the smaller table on the right side of the join (KQL builds a hash table from the right side). Fourth, consider using hint.strategy=broadcast if the right table is very small (e.g., a lookup table). Finally, use hint.shufflekey if the join key has high cardinality to distribute the load. On the exam, performance optimization questions often focus on filtering before join.

Can I use multiple aggregate functions in a single summarize?

Yes, you can specify multiple aggregate functions separated by commas. For example: `summarize TotalEvents = count(), DistinctIPs = dcount(IPAddress), FirstEvent = min(TimeGenerated) by AccountUpn`. This returns one row per AccountUpn with three aggregated values. This is a common pattern on the exam to compute multiple metrics in one query.

What is the purpose of the bin() function in summarize?

bin() rounds a value down to the nearest multiple of a specified bin size. It is used to group continuous values (like timestamps) into discrete buckets. For example, bin(TimeGenerated, 1h) rounds each timestamp to the start of its hour. This ensures that all events within the same hour are grouped together. Without bin(), each unique timestamp becomes its own group, which is usually not intended. The exam often tests that bin() is required for time-based summarize.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Advanced KQL: Joins, Summarize, and Aggregations — now see how well it sticks with free SC-200 practice questions. Full explanations included, no account needed.

Done with this chapter?