CCNA Knowledge Mining Doc Intelligence Questions

MCQ (easy)

A company uses this skillset in an Azure AI Search enrichment pipeline. They notice that the enrichment pipeline fails when processing a document larger than 5000 characters. What is the most likely cause?

A. The maximum page length is too small

B. The default language code is not supported

C. The text split mode should be 'sentences'

D. The output field mapping is missing or incorrect

Answer: D

The output 'pages' must be mapped to a collection field in the index.

Why this answer

The most likely cause is a missing or incorrect output field mapping. Enriched output reaches the index only through the output field mappings in the indexer definition, and the 'pages' output of the text split skill is a collection of strings, so its target must be a collection field. A document longer than the page length is split into several pages, which is when a missing mapping, or a mapping to a non-collection field, surfaces as a failure.

Exam trap

The trap here is that candidates often attribute pipeline failures to text splitting or language settings, but the real issue is the missing output field mapping, which is a common misconfiguration in skillset definitions.

How to eliminate wrong answers

Option A is wrong because the maximum page length setting in the text split skill controls chunk size, not the overall document size limit; a 5000-character document is well within typical limits. Option B is wrong because an unsupported language code would cause a language detection or translation skill error, not a generic pipeline failure tied to document size. Option C is wrong because the text split mode (e.g., 'pages' vs 'sentences') affects how text is chunked, but does not cause a pipeline failure solely due to document size; the failure is related to output mapping, not splitting logic.
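
The mapping described above can be sketched as plain Python dictionaries that mirror the JSON sent to the service. This is a minimal sketch, not the exhibit from the question: the index, indexer, data source, and skillset names are hypothetical, and it assumes the text split skill writes its chunks to '/document/pages'. The helper only illustrates the check "does every mapping target exist as a collection field".

```python
import json

# Hypothetical index field: a collection type, because the skill emits a list of chunks.
index_fields = [
    {"name": "pages", "type": "Collection(Edm.String)", "searchable": True},
]

# Hypothetical indexer definition; only outputFieldMappings matters for this question.
indexer = {
    "name": "docs-indexer",
    "dataSourceName": "docs-datasource",
    "targetIndexName": "docs-index",
    "skillsetName": "docs-skillset",
    "outputFieldMappings": [
        {"sourceFieldName": "/document/pages/*", "targetFieldName": "pages"},
    ],
}


def unmapped_targets(indexer_def, fields):
    """Return mapping targets that are not collection fields in the index."""
    collections = {f["name"] for f in fields if f["type"].startswith("Collection(")}
    return [
        m["targetFieldName"]
        for m in indexer_def.get("outputFieldMappings", [])
        if m["targetFieldName"] not in collections
    ]


print(json.dumps(indexer["outputFieldMappings"], indent=2))
print(unmapped_targets(indexer, index_fields))
```

An empty list from the helper means every mapped output has a collection field to land in.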

Drag & Drop (medium)

Drag and drop the steps to set up Azure AI Content Safety for content moderation into the correct order.

Why this order

Create the resource first, retrieve its endpoint and key, call the API with the content, analyze the response, and then act on the result.
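
The last two steps (analyze, then act) can be sketched without a network call. This is an illustrative sketch only: it assumes the response has already been parsed into a dictionary with a 'categoriesAnalysis' list of category and severity entries, and the 'reject_at' threshold and the action names are hypothetical policy choices, not service defaults.

```python
def decide_action(analysis, reject_at=4):
    """Map a parsed text-analysis response to a moderation action.

    reject_at is a hypothetical severity threshold chosen by the application.
    """
    severities = [c.get("severity", 0) for c in analysis.get("categoriesAnalysis", [])]
    worst = max(severities, default=0)
    if worst >= reject_at:
        return "reject"
    if worst > 0:
        return "review"
    return "allow"


# Hand-written sample response (not real service output).
sample = {
    "categoriesAnalysis": [
        {"category": "Hate", "severity": 0},
        {"category": "Violence", "severity": 4},
    ]
}
print(decide_action(sample))
```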

MCQ (medium)

A law firm uses Azure Document Intelligence to extract clauses from legal contracts. They have a custom model trained on 15 labeled contracts. The model extracts clauses with high confidence on similar documents but fails to extract correct clauses from a new batch of contracts that have a different font and layout. The firm needs to improve extraction accuracy without retraining the model from scratch. The solution must minimize manual effort and cost. What should they do?

A. Use the prebuilt-layout model to extract clauses instead

B. Increase the OCR confidence threshold in the analysis request

C. Label 15 more contracts with the original layout and retrain the model

D. Create a composed model that includes the existing model and a new model trained on 5 contracts with the new layout

Answer: D

A composed model can handle multiple layouts by combining models.

Why this answer

Option D is correct because creating a composed model in Azure Document Intelligence allows you to combine the existing model (trained on the original layout) with a new model trained on just 5 labeled contracts from the new layout. This approach improves accuracy on the new layout without retraining from scratch, minimizing manual effort and cost by leveraging the composed model's ability to route documents to the appropriate sub-model based on layout similarity.

Exam trap

The trap here is that candidates often assume retraining with more data (Option C) is always the best solution, but they overlook the composed model feature which is specifically designed to handle layout variations with minimal additional labeling and cost.

How to eliminate wrong answers

Option A is wrong because the prebuilt-layout model is designed for extracting text and structure (like tables and selection marks), not for custom clause extraction from legal contracts, and it would not leverage the firm's existing labeled data. Option B is wrong because increasing the OCR confidence threshold only filters out low-confidence text recognition results; it does not improve the model's ability to correctly classify or extract clauses from a different font and layout. Option C is wrong because labeling 15 more contracts with the original layout and retraining the model would not address the new layout variation; it would only reinforce the existing model's performance on the original layout, wasting effort and cost.
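
As a rough illustration of option D, the sketch below builds the body of a compose request from two component model IDs. It assumes the request shape used by the v3.x REST API (a 'modelId' plus a 'componentModels' list); newer API versions may expect a different shape, so treat this as an assumption to verify. All model IDs are hypothetical.

```python
import json


def compose_request(composed_id, component_ids, description=""):
    """Build a compose-model request body from existing custom model IDs."""
    if len(component_ids) < 2:
        raise ValueError("a composed model needs at least two component models")
    return {
        "modelId": composed_id,
        "description": description,
        "componentModels": [{"modelId": model_id} for model_id in component_ids],
    }


# Hypothetical IDs: the existing model plus one trained on 5 new-layout contracts.
body = compose_request(
    "contracts-composed",
    ["contracts-layout-a", "contracts-layout-b"],
    description="Clause extraction across both contract layouts",
)
print(json.dumps(body, indent=2))
```

The existing model is reused unchanged; only the small new-layout model has to be labeled and trained.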

MCQ (medium)

A company is building a knowledge mining solution using Azure AI Search. They need to extract key phrases from a large set of documents in multiple languages. Which skill should they add to the skillset?

A. Key Phrase Extraction skill

B. Sentiment Analysis skill

C. Language Detection skill

D. Entity Recognition skill

Answer: A

Key Phrase Extraction is designed to extract key phrases from text.

Why this answer

The Key Phrase Extraction skill is the correct choice because it is specifically designed to identify and extract the most important phrases from text, which directly supports the requirement to extract key phrases from documents. Azure AI Search's built-in Key Phrase Extraction skill leverages natural language processing to analyze text and return a list of key phrases, making it the appropriate skill for this knowledge mining solution.

Exam trap

The trap here is that candidates may confuse Entity Recognition (which identifies and categorizes named entities such as 'Microsoft') with Key Phrase Extraction (which returns the main talking points of the text, such as 'Azure AI Search'), leading them to choose Option D instead of A.

How to eliminate wrong answers

Option B (Sentiment Analysis skill) is wrong because it evaluates the emotional tone or sentiment (positive, negative, neutral) of text, not the extraction of key phrases. Option C (Language Detection skill) is wrong because it identifies the language of the text but does not extract key phrases from the content. Option D (Entity Recognition skill) is wrong because it identifies and categorizes named entities (e.g., people, organizations, locations) rather than extracting multi-word key phrases that summarize the document's main topics.
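
For multilingual content, the Key Phrase Extraction skill is commonly paired with a Language Detection skill that supplies the language code per document. The sketch below shows that arrangement as Python dictionaries mirroring the skillset JSON; the skillset name is hypothetical, and it assumes the document text is available at '/document/content'.

```python
import json

# Detects the language of each document and writes it to /document/languageCode.
language_skill = {
    "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
    "context": "/document",
    "inputs": [{"name": "text", "source": "/document/content"}],
    "outputs": [{"name": "languageCode", "targetName": "languageCode"}],
}

# Extracts key phrases, using the detected language instead of a fixed default.
key_phrase_skill = {
    "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
    "context": "/document",
    "inputs": [
        {"name": "text", "source": "/document/content"},
        {"name": "languageCode", "source": "/document/languageCode"},
    ],
    "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
}

# Hypothetical skillset name.
skillset = {"name": "multilingual-keyphrases", "skills": [language_skill, key_phrase_skill]}
print(json.dumps(skillset, indent=2))
```

Language Detection alone does not answer the question; it only feeds the skill that does the extraction.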

Matching (medium)

Match each Azure AI tool to its purpose.

Matches

Drag-and-drop ML model building

Interactive code development

Command-line management of Azure resources

Programmatic access to Azure services

Run AI services on-premises

Why these pairings

These are tools used in Azure AI development.

Multi-Select (hard)

A company uses Azure Document Intelligence to extract data from tax forms. They need to improve accuracy for a specific field. Which TWO actions should they take?

Select 2 answers

A. Label more examples of the specific field in the training set

B. Increase the batch size in the analysis request

C. Reduce the image resolution to 200 DPI

D. Use the prebuilt-tax.us model

E. Train a custom model using 10 similar forms

Answers: A, E

More labeled examples improve model accuracy for that field.

Why this answer

Option A is correct because labeling more examples of the specific field in the training set directly provides the custom model with additional ground-truth annotations for that field. This increases the model's ability to learn the variations in handwriting, formatting, and layout for that field, which is the most effective way to improve extraction accuracy for a targeted field in Azure Document Intelligence custom models.

Exam trap

The trap here is that candidates often confuse prebuilt models with custom models, assuming that prebuilt models can be retrained or fine-tuned, when in fact they are static and cannot be customized for specific field accuracy improvements.

Multi-Select (hard)

A company uses Azure Cognitive Search to index customer support emails. They need to implement a custom skill that extracts the sentiment of the email body and also identifies the primary product mentioned. The custom skill is a Python function deployed as an Azure Function. They want to ensure the skill can process multiple documents concurrently and handle errors gracefully. Which THREE configurations should they apply?

Select 3 answers

A. Set continueOnError to false to ensure that if the skill fails for one document, the entire indexer run stops.

B. Set the inputs of the skill to include the 'text' field from the enriched document and the 'product' field from a previous skill.

C. Set the uri of the skill to the HTTP endpoint of the Azure Function, including the function key.

D. Set the context property to '/document/pages/*' to process each page of the email individually.

E. Set the batchSize of the custom skill definition to 10 to allow parallel processing of multiple documents.

Answers: B, C, E

Inputs define the data passed to the skill; they should reference fields from the enrichment pipeline.

Why this answer

Option B is correct because the custom skill must receive the email body text for sentiment analysis and the product field from a prior skill (e.g., a key phrase extraction or entity recognition skill) to identify the primary product. This ensures the Azure Function has all required inputs to produce the desired outputs (sentiment score and product name).

Exam trap

The trap here is that candidates often confuse the 'context' property with the 'inputs' property, incorrectly assuming that setting context to '/document/pages/*' is necessary for page-level processing, when in fact the context should match the granularity at which the skill should operate—here, the entire document for overall sentiment.
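
To make the batching and error handling concrete, here is a minimal sketch of the function body behind such a custom skill. It follows the custom skill contract in which the request carries a 'values' array of records (each with a 'recordId' and 'data') and the response returns one entry per record with 'data', 'errors', and 'warnings'. The keyword-based scorer is a hypothetical placeholder, not a real sentiment model, and the Azure Functions HTTP wrapper is omitted.

```python
def score_sentiment(text):
    """Placeholder scorer (hypothetical); a real skill would call a model or service."""
    positives = {"great", "thanks", "love"}
    negatives = {"broken", "refund", "angry"}
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in positives for w in words) - sum(w in negatives for w in words)


def run_skill(body):
    """Process every record in the batch; a bad record yields an error entry, not a crash."""
    results = []
    for record in body.get("values", []):
        out = {"recordId": record.get("recordId"), "data": {}, "errors": [], "warnings": []}
        try:
            data = record["data"]
            out["data"] = {
                "sentiment": score_sentiment(data["text"]),
                "product": data.get("product"),
            }
        except (KeyError, TypeError, AttributeError) as exc:
            out["errors"].append({"message": f"Could not process record: {exc!r}"})
        results.append(out)
    return {"values": results}


# Hand-written batch: one valid record and one missing its 'text' input.
batch = {
    "values": [
        {"recordId": "1", "data": {"text": "Love it, thanks!", "product": "Router X"}},
        {"recordId": "2", "data": {}},
    ]
}
response = run_skill(batch)
print(response)
```

Returning a per-record error keeps the rest of the batch usable, which is what lets the indexer continue past a single failing document.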

MCQ (medium)

You are a developer for a legal firm. The firm uses Azure Cognitive Search to index legal documents. They have a custom skill that performs OCR on scanned PDFs using Azure Form Recognizer. The skill is implemented as an Azure Function. Recently, the indexer has been failing with the error: "The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing." The documents are large (up to 200 pages each). The skill calls the Form Recognizer API asynchronously. You need to resolve the timeout issue without losing the ability to process large documents. Current configuration: batchSize = 1, maxPageSize = 100, timeout = 100 seconds. You cannot change the execution time of the Form Recognizer API. What should you do?

A. Increase the batchSize to 10 to process more documents per invocation, reducing the number of total invocations.

B. Increase the maxPageSize to 500 to reduce the number of API calls to Form Recognizer.

C. Decrease the batchSize to 1 and increase the timeout property in the skillset skill definition to PT300S.

D. Remove the custom skill and use the built-in OCR skill in the skillset.

Answer: C

Increasing the skill timeout to 300 seconds allows the function more time to process the document. The function's timeout must also be increased in host.json.

Why this answer

Option C is correct because the error indicates the HttpClient timeout (100 seconds) is too short for large documents processed by the custom skill. Increasing the timeout to PT300S (300 seconds) in the skill definition lets the indexer wait longer for the Azure Function, which is itself waiting on the asynchronous Form Recognizer API. Keeping batchSize=1 ensures each invocation handles one document at a time, preventing overload and maintaining reliability for large PDFs.

Exam trap

The trap here is that candidates may think increasing batchSize or maxPageSize will reduce the number of invocations and thus fix the timeout, but these parameters do not affect the per-invocation timeout; the correct approach is to increase the timeout duration while keeping batchSize small to avoid overloading the skill.

How to eliminate wrong answers

Option A is wrong because increasing batchSize to 10 would cause the custom skill to process multiple large documents per invocation, increasing the total processing time per call and likely exacerbating the timeout issue, not resolving it. Option B is wrong because maxPageSize controls how many documents are returned per page in indexer results, not the size of data sent to the custom skill; it does not affect the timeout or the Form Recognizer API call duration. Option D is wrong because the built-in OCR skill in Azure Cognitive Search cannot handle scanned PDFs with the same accuracy or layout analysis as Form Recognizer, and removing the custom skill would lose the specialized OCR capability required for legal documents.
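
The two settings involved can be sketched side by side: the 'timeout' on the Web API skill, expressed as an ISO 8601 duration, and the 'functionTimeout' in the function app's host.json. This is a sketch under assumptions: the function URI, the input name 'fileUrl', and the output name 'text' are hypothetical. Check the Web API skill reference for the maximum timeout the service accepts before relying on a specific value.

```python
import json


def iso_seconds(seconds):
    """Format a whole number of seconds as an ISO 8601 duration, e.g. 300 -> 'PT300S'."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return f"PT{int(seconds)}S"


# Custom skill definition: one document per call, longer timeout.
ocr_skill = {
    "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
    "uri": "https://example.azurewebsites.net/api/ocr",  # hypothetical endpoint
    "httpMethod": "POST",
    "timeout": iso_seconds(300),
    "batchSize": 1,
    "context": "/document",
    "inputs": [{"name": "fileUrl", "source": "/document/metadata_storage_path"}],  # hypothetical input
    "outputs": [{"name": "text", "targetName": "ocrText"}],  # hypothetical output
}

# host.json for the function app: the function must be allowed to run as long as the skill waits.
host_json = {"version": "2.0", "functionTimeout": "00:05:00"}

print(json.dumps(ocr_skill, indent=2))
print(json.dumps(host_json, indent=2))
```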

MCQ (easy)

A healthcare organization uses Azure Document Intelligence to process patient intake forms. They notice that the confidence scores for field extraction are low. What is the most likely cause?

A. The document resolution is too low

B. The document layout is not analyzed

C. The custom model was trained with only 10 labeled forms

D. The batch processing size is too large

Answer: C

Custom models require at least 5 labeled forms; more samples improve confidence.

Why this answer

Custom models in Azure Document Intelligence require a minimum of five labeled forms for training, but low confidence scores typically indicate insufficient training data. With only 10 labeled forms, the model lacks enough examples to generalize well across variations in handwriting, formatting, and field values, leading to poor extraction confidence.

Exam trap

The trap here is that candidates often confuse low confidence with OCR or resolution issues, but the exam tests the specific requirement for sufficient labeled training data in custom models, not generic document quality problems.

How to eliminate wrong answers

Option A is wrong because low resolution can reduce OCR accuracy, but Azure Document Intelligence handles a wide range of resolutions and the question specifically points to field extraction confidence, not OCR failure. Option B is wrong because layout analysis is automatically performed by the prebuilt layout model and is not a prerequisite for custom extraction models; the issue is with training data quantity, not layout processing. Option D is wrong because batch processing size affects throughput and latency, not the confidence scores of individual field extractions; confidence is determined by the model's training and the input document quality, not batch size.
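
One practical way to see which fields need more labeled examples is to list the fields whose confidence falls below a chosen threshold. The sketch below assumes the analysis result has been serialized to a dictionary with 'documents', each holding 'fields' that carry a 'confidence' value; the field names, sample values, and the 0.8 threshold are hypothetical.

```python
def low_confidence_fields(result, threshold=0.8):
    """Return {field name: confidence} for fields scored below the threshold."""
    flagged = {}
    for document in result.get("documents", []):
        for name, field in document.get("fields", {}).items():
            confidence = field.get("confidence")
            if confidence is not None and confidence < threshold:
                flagged[name] = confidence
    return flagged


# Hand-written sample result (not real service output).
sample_result = {
    "documents": [
        {
            "fields": {
                "PatientName": {"content": "Ana Ruiz", "confidence": 0.95},
                "DateOfBirth": {"content": "1990-01-02", "confidence": 0.41},
            }
        }
    ]
}
print(low_confidence_fields(sample_result))
```

Fields that show up repeatedly in this list are the ones to prioritize when labeling additional training forms.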

MCQ (medium)

A company builds a knowledge mining solution using Azure AI Search with a custom skillset that includes an OCR skill. They want to ensure that images embedded in PDFs are processed. What should they configure?

A. Set the 'defaultLanguageCode' to 'en'

B. Set the 'textExtractionAlgorithm' to 'printed'

C. Set the 'imageAction' parameter to 'generateNormalizedImages'

D. Set the 'lineEnding' parameter to 'space'

Answer: C

This parameter enables extraction of images from documents.

Why this answer

Option C is correct because 'imageAction' controls whether images embedded in documents (including PDFs) are extracted when the indexer cracks the document. It is set in the indexer's configuration parameters rather than on the OCR skill itself. Setting it to 'generateNormalizedImages' makes the indexer emit normalized images, which the OCR skill then takes as input for text extraction.

Exam trap

The trap here is that candidates may confuse parameters that affect OCR output formatting (like 'lineEnding' or 'defaultLanguageCode') with the parameter that actually enables image extraction from PDFs, leading them to overlook the 'imageAction' setting.

How to eliminate wrong answers

Option A is wrong because 'defaultLanguageCode' specifies the language for text recognition, not whether images are extracted from PDFs; it does not enable image processing. Option B is wrong because 'textExtractionAlgorithm' determines the OCR algorithm (e.g., 'printed' or 'handwritten') but does not control the extraction of images from PDFs; it only affects how text is recognized once images are available. Option D is wrong because 'lineEnding' parameter controls the line break character in OCR output (e.g., 'space', 'carriageReturn'), which is irrelevant to enabling image extraction from PDFs.
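
The relationship between the two pieces can be sketched as follows: the indexer configuration turns on image extraction, and the OCR skill reads the resulting '/document/normalized_images/*' nodes. The dictionaries mirror the JSON definitions; the indexer, data source, index, and skillset names are hypothetical.

```python
import json

# Indexer: imageAction lives under parameters.configuration.
indexer = {
    "name": "pdf-indexer",  # hypothetical names throughout
    "dataSourceName": "pdf-datasource",
    "targetIndexName": "pdf-index",
    "skillsetName": "pdf-skillset",
    "parameters": {
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            "imageAction": "generateNormalizedImages",
        }
    },
}

# OCR skill: runs once per normalized image produced by the indexer.
ocr_skill = {
    "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
    "context": "/document/normalized_images/*",
    "defaultLanguageCode": "en",
    "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
    "outputs": [{"name": "text", "targetName": "text"}],
}

print(json.dumps(indexer["parameters"], indent=2))
print(json.dumps(ocr_skill, indent=2))
```

Without the indexer setting, the '/document/normalized_images' collection is never created, so the OCR skill has nothing to process.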

MCQ (easy)

Refer to the exhibit. You have a skillset with two skills. You run the indexer and find that the output field 'organizations' is empty for documents that clearly contain organization names. The 'keyPhrases' output is populated correctly. What is the most likely cause of the issue?

A. The skill's 'name' property is set to '#1', which is invalid.

B. The skill is not configured with a 'defaultLanguageCode' and the documents are not in English.

C. The 'categories' property is misspelled; it should be 'entityCategories'.

D. The input source '/document/content' is incorrect; it should be '/document/text'.

Answer: B

EntityRecognitionSkill needs a language hint to perform correctly for non-English languages.

Why this answer

The 'keyPhrases' output is populated correctly, indicating the text extraction and overall pipeline are functional. The Entity Recognition skill requires a 'defaultLanguageCode' to correctly identify entities; if it is not set and the documents are not in English, the skill may fail to extract organizations, resulting in an empty 'organizations' field. This is a known behavior where the skill defaults to English and cannot process other languages without explicit configuration.

Exam trap

The trap here is that candidates assume the 'keyPhrases' skill working correctly implies all skills are fine, overlooking that Entity Recognition is language-dependent and requires explicit 'defaultLanguageCode' configuration, while 'keyPhrases' is more robust across languages.

How to eliminate wrong answers

Option A is wrong because the skill's 'name' property being set to '#1' is not invalid; skill names can include special characters and are only used for identification within the skillset, not for functionality. Option C is wrong because the 'categories' property is correctly spelled for the Entity Recognition skill (it uses 'categories' to specify which entity types to extract, such as 'organization'); there is no property named 'entityCategories'. Option D is wrong because '/document/content' is the correct default input path for content extracted from most data sources (e.g., Azure Blob Storage), while '/document/text' is not a standard field in the enriched document tree unless explicitly mapped.
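
Since the exhibit is not reproduced here, the following is only a sketch of what a corrected skill might look like, written as a Python dictionary mirroring the skillset JSON. It assumes the documents are in French ('fr' is a hypothetical choice) and that the text lives at '/document/content'.

```python
import json

entity_skill = {
    "@odata.type": "#Microsoft.Skills.Text.V3.EntityRecognitionSkill",
    "context": "/document",
    "categories": ["Organization"],  # 'categories' is the documented property name
    "defaultLanguageCode": "fr",     # hypothetical: set to the language of the documents
    "inputs": [{"name": "text", "source": "/document/content"}],
    "outputs": [{"name": "organizations", "targetName": "organizations"}],
}

print(json.dumps(entity_skill, indent=2))
```

When a corpus mixes languages, an alternative is to pass a per-document 'languageCode' input from a language detection skill instead of fixing one default.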

MCQ (hard)

A company uses Azure Document Intelligence to extract data from invoices. They deploy the model to a container for on-premises processing. After deployment, they notice that the container consumes more memory than expected. What should they do to optimize memory usage?

A. Set the 'Memory' environment variable to a lower value in the container configuration

B. Use the 'Read' model instead of the 'Layout' model

C. Use the cloud API instead of the container

D. Reduce the batch size in the client application

Answer: A

The container's memory usage can be controlled via the 'Memory' setting.

Why this answer

Option A is correct because Azure Document Intelligence containers expose a 'Memory' environment variable that allows you to limit the container's memory allocation. By setting this variable to a lower value, you constrain the container's memory usage, which directly addresses the issue of higher-than-expected consumption. This is the recommended approach for optimizing memory in containerized deployments.

Exam trap

The trap here is that candidates may think memory optimization is achieved by changing the model or client-side settings, when in fact container memory is controlled by environment variables or runtime resource limits specific to the container orchestrator.

How to eliminate wrong answers

Option B is wrong because switching from the 'Layout' model to the 'Read' model changes the extraction capabilities (e.g., losing table and structure extraction) but does not directly control or reduce the container's memory allocation; memory usage is governed by container runtime settings, not model selection. Option C is wrong because moving to the cloud API would change the deployment model entirely, not optimize memory usage of the existing container; the question specifically asks about optimizing the container deployment. Option D is wrong because reducing the batch size in the client application affects throughput and request size, not the memory footprint of the container itself; the container's memory consumption is primarily determined by its internal processes and allocated resources, not client-side batching.
