CCNA Data For Ai Questions

75 of 163 questions · Page 1/3 · Data For Ai topic · Answers revealed

1
Multi-Select · hard

Which THREE are key considerations for data privacy when using AI models that process customer data? (Choose three.)

Select 3 answers
A. Store data indefinitely for future use
B. Limit data access to authorized personnel only
C. Obtain user consent for data usage
D. Encrypt data in transit and at rest
E. Anonymize personally identifiable information (PII)
Answers: B, C, E

Access controls reduce privacy risks.

Why this answer

Options B, C, and E are correct. Limiting data access reduces exposure risk, obtaining consent supports legal compliance, and anonymizing PII protects individual identity. Option A is wrong because storing data indefinitely violates privacy principles.

Option D, while good practice, is more about security than privacy specifically, and is not always mandatory.

2
Multi-Select · hard

Which three practices help maintain data quality for AI models in Salesforce? (Choose three.)

Select 3 answers
A. Monitor data freshness with Data Check
B. Disable duplicate matching rules for faster load
C. Use Excel for manual data updates
D. Schedule regular data audits
E. Implement validation rules on critical fields
Answers: A, D, E

Data Check alerts on stale or outdated data that could affect model accuracy.

Why this answer

Option A is correct because Data Check in Salesforce monitors data freshness by tracking when records were last updated, ensuring that AI models use current data. Stale data can degrade model accuracy, so this practice directly supports data quality for AI.

Exam trap

The trap here is that candidates may think disabling duplicate rules speeds up data loading, but they overlook that duplicate records severely degrade AI model performance by introducing bias and noise.

3
MCQ · medium

A company uses Einstein Prediction Builder to predict customer churn. The data includes account creation date, number of support cases, and average payment delay. After training, the model shows low confidence scores. What is the most likely cause?

A. The training dataset includes fewer than 500 records.
B. The data contains many missing values or outliers for the selected fields.
C. The prediction field is set to a numeric type instead of a picklist.
D. The model was trained on data refreshed daily instead of weekly.
Answer: B

Missing values and outliers degrade model performance, leading to low confidence scores.

Why this answer

Option B is correct because low confidence scores in Einstein Prediction Builder often stem from data quality issues such as missing values or outliers. These anomalies distort the model's ability to learn meaningful patterns, leading to uncertain predictions. Clean, complete data is essential for the model to produce high-confidence scores.

Exam trap

Salesforce often tests the misconception that low confidence is caused by dataset size or refresh frequency, when in reality data quality issues like missing values or outliers are the primary culprit in Einstein Prediction Builder.

How to eliminate wrong answers

Option A is wrong because Einstein Prediction Builder does not require a minimum of 500 records; it can work with smaller datasets, though more data generally improves accuracy. Option C is wrong because the prediction field type (numeric vs. picklist) affects the type of prediction (regression vs. classification), not the confidence score directly. Option D is wrong because the refresh frequency (daily vs. weekly) impacts timeliness, not the inherent confidence of the trained model.

4
MCQ · medium

A company is deploying an AI model that recommends next best actions for sales reps. They notice that the model's recommendations are biased towards high-revenue opportunities. Which data-related action can help reduce this bias?

A. Use a larger neural network model
B. Encrypt the data before training
C. Oversample the underrepresented segments in the training data
D. Remove all low-revenue opportunities from the training data
Answer: C

Oversampling helps balance the representation.

Why this answer

Oversampling underrepresented segments in the training data directly addresses the class imbalance that causes the model to favor high-revenue opportunities. By increasing the frequency of low-revenue examples, the model learns to treat all segments more equally, reducing bias in its recommendations. This is a standard data-level technique for mitigating bias in AI models.

Exam trap

Salesforce often tests the misconception that model architecture changes (like larger networks) can fix data bias, when in fact the root cause is often data imbalance that must be addressed at the data level.

How to eliminate wrong answers

Option A is wrong because using a larger neural network model does not fix data imbalance; it may even amplify bias if the majority class dominates training. Option B is wrong because encrypting data protects privacy but has no effect on model bias or data distribution. Option D is wrong because removing all low-revenue opportunities would worsen the imbalance, making the model even more biased toward high-revenue opportunities.
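As a rough illustration of the oversampling idea in option C, the sketch below duplicates rows from smaller segments at random until every segment is as large as the biggest one. The field names (`segment`, `amount`) and the sample rows are made up for the example, and this is plain Python rather than anything specific to a Salesforce product.

```python
import random
from collections import Counter

def oversample(rows, segment_key, seed=0):
    """Randomly duplicate rows from smaller segments until every segment
    matches the size of the largest one."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[segment_key], []).append(row)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap for smaller segments.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical training rows: 8 high-revenue and 2 low-revenue opportunities.
opportunities = (
    [{"segment": "high_revenue", "amount": 90_000}] * 8
    + [{"segment": "low_revenue", "amount": 4_000}] * 2
)
balanced = oversample(opportunities, "segment")
counts = Counter(row["segment"] for row in balanced)
print(counts)
```

Random duplication is only the simplest variant; it balances segment counts but adds no new information, so it is usually paired with evaluation on an untouched validation set.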

5
MCQ · hard

A company uses Salesforce Data Cloud to unify customer data from multiple sources. After connecting a data stream, they notice that records are missing from the unified profile. What is the most likely cause?

A. The data stream object is not a standard Salesforce object.
B. The data stream is not activated for identity resolution.
C. The data source is not from Salesforce, so it cannot be unified.
D. The reconciliation rule is not configured for the data source.
Answer: D

Reconciliation rules are needed to match records across sources.

Why this answer

Option D is correct because reconciliation rules in Salesforce Data Cloud define how records from different data sources are matched and merged into a unified profile. If a reconciliation rule is not configured for a data source, records from that source may not be properly linked to existing profiles, leading to missing records in the unified view. This is a common configuration step that must be completed after connecting a data stream.

Exam trap

The trap here is that candidates may confuse identity resolution (matching) with reconciliation (merging), assuming that activating identity resolution alone is sufficient to unify profiles, when in fact a reconciliation rule is required to complete the merge process.

How to eliminate wrong answers

Option A is wrong because Data Cloud supports both standard and custom objects as data stream objects; the object type does not inherently cause records to be missing from unified profiles. Option B is wrong because identity resolution activation is required for matching records across sources, but missing records are more directly caused by the lack of a reconciliation rule that defines how to merge matched records. Option C is wrong because Data Cloud is designed to unify data from any source, including non-Salesforce sources, as long as the data stream is properly configured.

6
MCQ · easy

Refer to the exhibit. A dataflow is set up to prepare data for a prediction model. The model is expected to predict close probability for all open opportunities. What is wrong with this dataflow?

A. The output target should be a dataset, not a model.
B. The filter on StageName is too restrictive; it excludes non-won opportunities needed for training.
C. The source should be Lead, not Opportunity.
D. The dataflow is missing a transform node to remove null values.
Answer: B

To predict close probability, the model needs examples of both won and lost deals.

Why this answer

The filter excludes all opportunities that are not 'Closed Won'. The model should be trained on both won and lost opportunities to predict close probability. The filter should be removed or include all stages.

7
MCQ · easy

A Salesforce admin is training an Einstein Bot to answer customer questions. Which data source should the bot use to provide accurate responses?

A. Chatter posts from the product team.
B. Knowledge articles with a published status.
C. Case records from the last 30 days.
D. Lead and contact reports.
Answer: B

Knowledge articles are designed for self-service.

Why this answer

Knowledge articles with a published status are the correct data source because they contain curated, approved, and structured information that Einstein Bot can reliably use to generate accurate responses. The bot leverages natural language processing to match customer questions against these articles, ensuring answers are based on verified content rather than unstructured or transient data.

Exam trap

Salesforce often tests the distinction between structured, authoritative data sources (like Knowledge articles) and unstructured or operational data (like Chatter or Cases), trapping candidates who assume any Salesforce data can be used for AI responses.

How to eliminate wrong answers

Option A is wrong because Chatter posts are informal, unstructured conversations that lack governance and may contain outdated or incorrect information, making them unsuitable for providing accurate, consistent responses. Option C is wrong because Case records from the last 30 days are transactional, often incomplete, and may include unresolved or duplicate issues, which would lead to unreliable answers. Option D is wrong because Lead and contact reports are designed for sales analytics and customer segmentation, not for answering product or service questions, and they lack the detailed, factual content needed for a knowledge base.

8
MCQ · easy

Refer to the exhibit. What is the most likely cause of the pipeline failure?

A. Data type mismatch between source and target
B. Insufficient permissions to access the field
C. Connection timeout during data transfer
D. Picklist value does not exist in the target picklist field
Answer: D

The error clearly states the value is not found in the picklist values.

Why this answer

Option D is correct because the error explicitly states "value not found in picklist values" for CustomLeadField__c, which indicates a picklist value mismatch. A data type mismatch (A) would show a conversion error, insufficient permissions (B) would generate a different error, and a connection timeout (C) would show a timeout.

9
Multi-Select · medium

A company is training a customer service chatbot using historical conversation logs. Which TWO data preparation practices should be followed to ensure data quality?

Select 2 answers
A. Exclude all user identifiers to protect privacy
B. Include answers with varied phrasing to enhance language variety
C. Include only successful interactions that were resolved
D. Filter only English conversations for consistency
E. Use conversation logs with complete transcripts
Answers: B, E

Varied phrasing improves model generalization.

Why this answer

Option B is correct because training a chatbot on varied phrasing (e.g., synonyms, different sentence structures) improves its ability to understand and generate natural language responses. This practice enhances the model's robustness and generalization, preventing overfitting to a narrow set of expressions and ensuring it can handle the diverse ways customers phrase their queries.

Exam trap

Salesforce often tests the distinction between data quality practices (e.g., completeness, diversity, accuracy) and data governance practices (e.g., privacy, security), so candidates mistakenly select privacy-related options like Option A when the question explicitly asks about data quality.

10
MCQ · medium

Refer to the exhibit. A data file for click-through model training has the above content. Which data quality issue is most critical to address before training?

A. The header row is missing a column name for the last field
B. Missing value in the Conversions column for the third row
C. Inconsistent date formats across rows
D. Clicks column is an integer but may need scaling
Answer: B

Missing target values cannot be used for supervised learning and must be handled.

Why this answer

Option B is correct because missing values in the Conversions column directly impact the supervised learning target variable. If the label (conversion) is missing for a training instance, the model cannot learn the correct mapping from features to outcome, leading to biased or incomplete training. This is a critical data quality issue that must be addressed before training, typically via imputation or row removal.

Exam trap

Salesforce often tests the distinction between data quality issues that prevent training (like missing target values) versus issues that are merely preprocessing concerns (like scaling or date formatting), leading candidates to overthink minor formatting problems.

How to eliminate wrong answers

Option A is wrong because the header row missing a column name for the last field is a metadata issue, not a data quality issue; the model can still parse the data as long as the values are present and correctly ordered. Option C is wrong because inconsistent date formats across rows, while potentially problematic for feature engineering, do not directly prevent model training; date parsing can be handled during preprocessing. Option D is wrong because the Clicks column being an integer does not inherently require scaling; scaling is a preprocessing step applied to features to improve convergence, not a data quality issue that must be addressed before training.
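A small sketch of the row-removal approach mentioned above: it separates rows that have a value in the label column from rows that do not, so that only labeled rows go into supervised training. The CSV content is a hypothetical stand-in, since the original exhibit is not reproduced here; only the column names `Clicks` and `Conversions` come from the question.

```python
import csv
import io

# Hypothetical sample data; the third data row has no Conversions value.
RAW = """Date,Clicks,Conversions
2024-01-01,120,6
2024-01-02,95,4
2024-01-03,130,
2024-01-04,110,5
"""

def split_by_label(text, label="Conversions"):
    """Return (labeled_rows, unlabeled_rows) based on whether the label is present."""
    labeled, unlabeled = [], []
    for row in csv.DictReader(io.StringIO(text)):
        value = (row.get(label) or "").strip()
        (labeled if value else unlabeled).append(row)
    return labeled, unlabeled

labeled, unlabeled = split_by_label(RAW)
print(len(labeled), "usable rows;", len(unlabeled), "rows missing the label")
```

Keeping the unlabeled rows in a separate list, rather than silently dropping them, makes it easy to report how much data was excluded.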

11
MCQ · hard

A company is using Einstein Discovery to predict customer churn. The model was created six months ago and has been making predictions. Recently, the model's accuracy has dropped significantly. The data scientist confirms that the data schema has not changed. What is the most likely reason for the drop in accuracy?

A. The data source is not being refreshed daily
B. The model's features have become irrelevant
C. The model is suffering from concept drift
D. The model needs to be retrained weekly instead of monthly
Answer: C

Concept drift happens when the statistical properties of the target variable change over time.

Why this answer

Concept drift occurs when the statistical properties of the target variable change over time, causing the model's predictions to become less accurate even though the data schema remains unchanged. In Einstein Discovery, models are trained on historical data, and if the underlying patterns of customer churn evolve (e.g., due to market shifts or new competitor behavior), the model's learned relationships become stale. Since the data schema is confirmed unchanged, concept drift is the most likely cause of the accuracy drop.

Exam trap

Salesforce often tests the distinction between data schema changes (which would affect feature availability) and concept drift (which affects the relationship between features and the target), leading candidates to incorrectly choose options about data freshness or feature relevance when the real issue is a shift in the underlying data distribution.

How to eliminate wrong answers

Option A is wrong because the data source not being refreshed daily would cause predictions to be based on outdated records, but the question states the model's accuracy dropped significantly and the schema hasn't changed—concept drift is a more fundamental issue than refresh frequency. Option B is wrong because features becoming irrelevant is a form of feature drift, but the question specifies the data schema hasn't changed, meaning the same features are still available; concept drift refers to the relationship between features and the target changing, not the features themselves. Option D is wrong because retraining weekly instead of monthly might help with drift, but the core reason for the drop is that the model's learned patterns no longer match current behavior—simply increasing retraining frequency without addressing the drift source is a band-aid, not the root cause.

12
MCQ · medium

A financial institution must ensure that customer data used for AI models does not expose personally identifiable information (PII) to unauthorized users. Which Data Cloud feature should be applied to the data model?

A. Delete PII fields from the data model
B. Use Calculated Insights to aggregate sensitive data only
C. Apply data masking and field-level security on sensitive fields
D. Rely on user permissions to restrict access to the entire object
Answer: C

Protects PII while preserving data utility.

Why this answer

Option C is correct because data masking and field-level security can obscure PII from unauthorized users while keeping the fields available to the model. Option A is wrong because deleting fields removes valuable predictors. Option B is wrong because aggregations don't hide the underlying raw data.

Option D is wrong because object-level user permissions alone are insufficient for field-level protection.

13
Multi-Select · medium

A data analyst is evaluating data quality for an Einstein model. Which TWO dimensions are most critical for model accuracy?

Select 2 answers
A. Uniqueness
B. Accuracy
C. Consistency
D. Completeness
E. Timeliness
Answers: B, D

Incorrect values directly degrade model predictions.

Why this answer

Completeness (no missing values) and accuracy (correct values) are fundamental to model performance.

14
Multi-Select · medium

Which THREE of the following are required when setting up a data stream from Salesforce to Data Cloud?

Select 3 answers
A. Data Stream object definition
B. Data Transform
C. Data Source connection
D. Data Model mapping
E. Data Action
Answers: A, C, D

Defines the stream's schema and source type.

Why this answer

A is correct because a Data Stream object definition is required to specify the schema and fields for the data being ingested from Salesforce into Data Cloud. Without this definition, Data Cloud cannot interpret the structure of the incoming records, making it impossible to map or transform the data.

Exam trap

Salesforce often tests the distinction between mandatory configuration steps and optional enhancements, so the trap here is that candidates mistake Data Transform or Data Action as required because they are commonly used in data pipelines, but they are not prerequisites for establishing the data stream itself.

15
MCQ · hard

A company uses Einstein Forecasting for revenue prediction. The historical data shows seasonal spikes every quarter. The model consistently underestimates peak periods. What is the best data preparation step to improve accuracy?

A. Increase the forecast horizon to 12 months.
B. Add a 'quarter' index field (1-4) to the dataset.
C. Remove the spike data points as outliers.
D. Use only the last 6 months of data to reduce noise.
Answer: B

Providing explicit seasonality indicators helps the model learn periodic behavior.

Why this answer

Einstein Forecasting can detect seasonality if the data contains enough history and a seasonality marker. Adding a 'quarter' feature explicitly helps the model capture recurring patterns.
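One way to picture the 'quarter' index field from option B is a derived column computed from each record's date. The sketch below assumes calendar quarters and uses hypothetical field names (`close_date`, `revenue`); a company on a fiscal calendar would need a different mapping.

```python
from datetime import date

def quarter_of(d):
    """Map a date to its calendar quarter (1-4)."""
    return (d.month - 1) // 3 + 1

# Hypothetical revenue records.
records = [
    {"close_date": date(2024, 3, 31), "revenue": 120.0},
    {"close_date": date(2024, 4, 1), "revenue": 80.0},
    {"close_date": date(2024, 12, 15), "revenue": 150.0},
]
for record in records:
    record["quarter"] = quarter_of(record["close_date"])

print([record["quarter"] for record in records])
```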

16
MCQ · hard

Refer to the exhibit. A data analyst receives an error when trying to use this model configuration for Einstein AI predictions. Which issue is most likely causing the error?

A. The feature field "Usage__c" does not exist in the data source.
B. The split ratio of 0.8 is not allowed for classification.
C. The prediction window is shorter than the training window.
D. The target field "Churn__c" is a text field instead of an integer.
Answer: A

Referencing a non-existent field causes a configuration error.

Why this answer

Option A is correct because the feature field "Usage__c" does not exist in the data source, causing a configuration error. Option B is incorrect because a split ratio of 0.8 is standard for classification. Option C is incorrect because a prediction window shorter than the training window is normal.

Option D is incorrect because classification targets can be text labels, though numeric is common; this alone would not cause an error.

17
MCQ · hard

A data integration specialist is using Data Pipelines to combine Salesforce data with an external CSV file. The CSV has a header row but some rows have extra commas, causing parsing errors. What should the specialist do?

A. Use a Data Transform recipe to clean the data before ingestion
B. Edit the CSV manually
C. Increase the pipeline timeout
D. Reject the entire file and request a corrected version
Answer: A

Data Transform recipes can standardize rows, handle extra delimiters, and log errors.

Why this answer

Option A is correct because a Data Transform recipe can handle malformed rows by stripping extra commas or by parsing in a way that respects quoted fields. Editing the CSV manually (B) is inefficient, increasing the timeout (C) does not fix parsing, and rejecting the whole file (D) loses data.
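To make the parsing problem concrete, here is a minimal Python sketch (not Data Transform recipe syntax) that reads a CSV with a quote-aware parser and sets aside rows whose field count does not match the header. The sample file content and column names are hypothetical.

```python
import csv
import io

# Hypothetical file: row 3 has an unquoted extra comma, row 4 is quoted correctly.
RAW = (
    "Id,Name,Amount\n"
    "1,Acme,100\n"
    "2,Globex, Inc.,250\n"
    '3,"Initech, LLC",75\n'
)

def partition_rows(text):
    """Split rows into well-formed dicts and (line_number, fields) rejects."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    good, bad = [], []
    for line_no, fields in enumerate(reader, start=2):
        if len(fields) == len(header):
            good.append(dict(zip(header, fields)))
        else:
            bad.append((line_no, fields))
    return good, bad

good, bad = partition_rows(RAW)
print(len(good), "rows accepted;", len(bad), "rows set aside for repair")
```

Setting malformed rows aside with their line numbers keeps the valid rows flowing while leaving a log of what needs repair, which is the same trade-off the explanation describes.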

18
MCQ · easy

What is being performed in this command?

A. Feature engineering
B. Batch prediction
C. Model training
D. Data validation
Answer: B

The command predicts on new CSV data.

Why this answer

Option B is correct because the command uses the 'predict' argument to generate predictions on new data using an existing model. Option A is wrong because feature engineering would produce features, not predictions. Option C is wrong because model training would use 'train' instead of 'predict'.

Option D is wrong because data validation is not indicated.

19
MCQ · easy

Refer to the exhibit. A developer runs a SOQL query. What does the output indicate?

A. The query returned 10 records in total.
B. The query is still processing.
C. The output is incomplete.
D. The query failed.
Answer: A

totalSize shows the number of records returned, and done=true means the query finished.

Why this answer

The SOQL query output shows '10 records returned' with no error or partial result indicator, confirming that the query completed successfully and returned exactly 10 records. In Salesforce SOQL, the query result includes a 'totalSize' field that reflects the total number of records matching the query criteria, and here it matches the number of records displayed, indicating a complete and successful retrieval.

Exam trap

Salesforce often tests the misconception that a small result set might be incomplete or that the query is still running, but the presence of a record count matching the displayed records and no error or pagination indicator confirms a complete and successful query.

How to eliminate wrong answers

Option B is wrong because SOQL queries are synchronous and either complete or fail; there is no 'still processing' state in the output—if processing were ongoing, the query would not return a result set. Option C is wrong because the output explicitly states '10 records returned' and shows all records, with no truncation or 'more records available' indicator; SOQL uses query locators for large result sets, but here the count matches the displayed records, so the output is complete. Option D is wrong because a failed query would return an error message or exception, not a list of records; the presence of a result set with a record count confirms success.
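The check described above can be sketched as a small helper that reads the `totalSize` and `done` fields of a query result. The payload below is a hypothetical stand-in for the missing exhibit, and `nextRecordsUrl` is assumed here as the key that points to the next batch when a result is not complete.

```python
def describe_query_result(result):
    """Summarize a query result dict that carries totalSize, done, and records."""
    returned = len(result["records"])
    if result["done"]:
        return f"complete: {result['totalSize']} records"
    next_url = result.get("nextRecordsUrl")
    return f"partial: {returned} of {result['totalSize']} records; next batch at {next_url}"

# Hypothetical payload shaped like the output the question describes.
sample = {
    "totalSize": 10,
    "done": True,
    "records": [{"Id": f"placeholder-{i}"} for i in range(10)],
}
print(describe_query_result(sample))
```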

20
Multi-Select · easy

Which TWO of the following are valid methods to improve data quality in Data Cloud before training an AI model?

Select 2 answers
A. Increase data stream ingestion frequency
B. Use a Data Transform to filter out invalid records
C. Implement data retention policies
D. Use Calculated Insights to detect anomalies
E. Enable data profiling on data streams
Answers: B, E

Directly removes poor-quality data.

Why this answer

Options B and E are correct. Data transforms can filter out invalid records, and data profiling helps identify data quality issues. Option A is wrong because increasing ingestion frequency does not fix existing quality issues.

Option C is wrong because retention policies manage data lifecycle, not quality. Option D is wrong because Calculated Insights can detect anomalies but don't directly improve quality.

21
MCQ · medium

A large enterprise needs to integrate data from Salesforce CRM, an external ERP, and marketing automation to train an AI model for cross-sell recommendations. Which data storage strategy is most aligned with Salesforce's AI capabilities?

A. Use only Salesforce CRM data and ignore external sources
B. Store each source separately in Data Cloud and train models on each
C. Export all data to an external data lake and build a custom model
D. Use Salesforce Data Cloud to unify the datasets
Answer: D

Data Cloud provides harmonization, governance, and native Einstein integration.

Why this answer

Salesforce Data Cloud is designed to unify data from multiple sources into a single platform for AI and analytics. Using only Salesforce CRM data (A) limits data scope, storing each source separately (B) forgoes the unified view that cross-sell recommendations need, and exporting to a data lake (C) adds complexity.

22
MCQ · medium

A dataset contains a 'date' column. Which feature engineering technique would best capture both long-term trends and seasonal patterns?

A. Extract year, month, day as separate features.
B. Use only the day of week.
C. Create cyclic features (sin/cos of month, day).
D. Drop the date column.
Answer: C

Cyclic encoding preserves the periodic nature of time.

Why this answer

Option C is correct because cyclic features using sine and cosine transformations preserve the circular nature of temporal data (e.g., month 12 and month 1 are adjacent, not far apart). This allows a model to learn both long-term trends (via the year component) and seasonal patterns (via the cyclic encoding of month and day) without imposing a false linear ordering. In contrast, simple numeric extraction treats time as linear, which can misrepresent seasonal cycles.

Exam trap

Salesforce often tests whether candidates recognize that simple numeric extraction (e.g., month as 1–12) fails to model cyclical continuity, leading them to mistakenly choose Option A over the correct cyclic encoding.

How to eliminate wrong answers

Option A is wrong because extracting year, month, and day as separate numeric features introduces a linear ordering that fails to capture the cyclical relationship between months (e.g., December and January are treated as far apart). Option B is wrong because using only the day of week ignores long-term trends and seasonal patterns across months or years, capturing only weekly periodicity. Option D is wrong because dropping the date column discards all temporal information, making it impossible for the model to learn any time-based patterns.
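The December/January point can be checked with a short sketch of sine/cosine encoding for the month. The function names are illustrative only; the idea is that each month becomes a point on a unit circle, so neighbouring months stay close together even across the year boundary.

```python
import math

def cyclic_month(month):
    """Encode a month (1-12) as a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)

def encoded_distance(month_a, month_b):
    """Euclidean distance between two months in the encoded space."""
    return math.dist(cyclic_month(month_a), cyclic_month(month_b))

# December to January is as close as January to February,
# while January to July (half a year apart) is the farthest pair.
print(encoded_distance(12, 1), encoded_distance(1, 2), encoded_distance(1, 7))
```

With a plain 1-12 integer, December and January would be 11 units apart; in the encoded space they are exactly as close as any other pair of adjacent months.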

23
MCQ · medium

A data engineer is troubleshooting a predictive model that stopped updating. The data flow from Data Cloud shows 'Data Transform Failed' with error: 'Field Amount cannot be null'. What is the most likely cause?

A. The data transform includes a filter that removes records with null Amount.
B. The source object has a validation rule.
C. The data flow schedule is incorrect.
D. The target field in the model requires a non-null value but source data has nulls.
Answer: D

This directly matches the error: the transform requires non-null input.

Why this answer

The error 'Field Amount cannot be null' indicates that the target field in the predictive model is configured to require a non-null value. When the data flow attempts to write records with null Amount values into that field, the transform fails. This is a common schema constraint mismatch where the source data contains nulls that violate the target field's nullability requirement.

Exam trap

Salesforce often tests the distinction between source-side constraints (validation rules) and target-side constraints (field nullability in the model schema), leading candidates to incorrectly choose Option B when the error actually originates from the target field requirement.

How to eliminate wrong answers

Option A is wrong because a filter that removes records with null Amount would prevent nulls from reaching the target, not cause a 'cannot be null' error. Option B is wrong because validation rules apply at the source object level during record creation or update, not during a data flow transform that reads data. Option C is wrong because an incorrect schedule would cause the data flow to run at the wrong time or not at all, not produce a specific transform error about a null field.

24
MCQ · hard

A retail company uses Einstein Next Best Action with customer data from Data Cloud. The recommendations are not personalized. The admin checks the data quality dashboard and finds that the 'Customer_Profile' object has 40% records with missing 'PreferredChannel' field. What is the best course of action?

A. Remove the field from the model.
B. Impute the missing values using the mode of the field.
C. Increase the data refresh frequency.
D. Train the model with only records that have non-null PreferredChannel.
Answer: B

Imputation is a standard data cleaning technique that maintains dataset size and field utility.

Why this answer

Option B is correct because imputing missing values using the mode (most frequent value) of the 'PreferredChannel' field is a standard data preprocessing technique that preserves the dataset size and statistical distribution. In Einstein Next Best Action, missing categorical data can degrade model personalization, and mode imputation is a simple, effective way to handle this without losing records or altering the model structure.

Exam trap

The trap here is that candidates might think removing the field or filtering out incomplete records is simpler, but Salesforce often tests the understanding that imputation is a standard, non-destructive method to handle missing data in AI models, especially when the missing rate is high.

How to eliminate wrong answers

Option A is wrong because removing the field entirely discards potentially valuable signal from the 'PreferredChannel' feature, which could reduce model accuracy and personalization. Option C is wrong because increasing data refresh frequency does not address the root cause of missing data; it only updates the data more often without fixing the quality issue. Option D is wrong because training the model only on records with non-null 'PreferredChannel' reduces the training dataset size by 40%, which can lead to biased or less robust models and loss of valuable customer information.
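Mode imputation as described in option B can be sketched in a few lines of plain Python. The record layout and sample values are hypothetical; only the `PreferredChannel` field name comes from the question.

```python
from collections import Counter

def impute_with_mode(rows, field):
    """Return copies of rows with empty `field` values replaced by the mode."""
    def is_missing(value):
        return value is None or value == ""

    observed = [row[field] for row in rows if not is_missing(row.get(field))]
    mode = Counter(observed).most_common(1)[0][0]
    return [
        {**row, field: mode} if is_missing(row.get(field)) else dict(row)
        for row in rows
    ]

# Hypothetical customer profiles; two of five are missing the field.
profiles = [
    {"id": 1, "PreferredChannel": "Email"},
    {"id": 2, "PreferredChannel": None},
    {"id": 3, "PreferredChannel": "SMS"},
    {"id": 4, "PreferredChannel": "Email"},
    {"id": 5, "PreferredChannel": ""},
]
filled = impute_with_mode(profiles, "PreferredChannel")
print([row["PreferredChannel"] for row in filled])
```

The function returns new dictionaries instead of editing the input, so the original records remain available for comparing results with and without imputation.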

25
Multi-Select · hard

Which THREE factors should be considered when evaluating the quality of a dataset for an AI model?

Select 3 answers
A. Total number of records available for training.
B. Presence of outliers that may skew the model.
C. Number of distinct labels in the outcome field.
D. Percentage of missing values in key fields.
E. Number of duplicate records in the dataset.
Answers: B, D, E

Outliers can distort the model's understanding.

Why this answer

Option B is correct because outliers can disproportionately influence model training, especially in algorithms like linear regression or k-means clustering, leading to biased predictions. Evaluating the presence and impact of outliers is critical for ensuring the model generalizes well to unseen data.

Exam trap

Salesforce often tests the misconception that dataset size (option A) is a primary quality metric, whereas the exam emphasizes that completeness, consistency, and absence of bias (e.g., missing values, duplicates, outliers) are more critical for model reliability.
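The three factors in the answer (missing values, duplicates, outliers) can be measured with a short report function. This is a sketch under stated assumptions: the field names and rows are invented, exact-match rows count as duplicates, and the 1.5 x IQR rule is used as one common outlier heuristic among several.

```python
import statistics

def quality_report(rows, key_field, numeric_field):
    """Report missing-value percentage, exact duplicates, and IQR outliers."""
    missing = sum(1 for row in rows if row.get(key_field) in (None, ""))

    seen, duplicates = set(), 0
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            duplicates += 1
        seen.add(fingerprint)

    values = [row[numeric_field] for row in rows if row.get(numeric_field) is not None]
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    outliers = [v for v in values if v < q1 - 1.5 * spread or v > q3 + 1.5 * spread]

    return {
        "missing_pct": 100 * missing / len(rows),
        "duplicates": duplicates,
        "outliers": outliers,
    }

# Hypothetical dataset: one missing email, one repeated row, one extreme amount.
rows = [
    {"email": "a@example.com", "amount": 10},
    {"email": "b@example.com", "amount": 12},
    {"email": None, "amount": 11},
    {"email": "c@example.com", "amount": 13},
    {"email": "b@example.com", "amount": 12},
    {"email": "d@example.com", "amount": 14},
    {"email": "e@example.com", "amount": 11},
    {"email": "f@example.com", "amount": 500},
]
report = quality_report(rows, "email", "amount")
print(report)
```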

26
MCQ · medium

An admin is configuring Einstein Vision and wants to train a model to identify product defects from images. The admin has uploaded 500 images of defective products and 500 images of non-defective products. However, the model training fails with an error about data quality. What is the most likely cause?

A. The images are in JPEG format
B. The dataset has only one label per category
C. The images are larger than 10 MB each
D. The dataset does not have enough images
Answer: B

Einstein requires at least 2 unique labels per category to avoid overfitting.

Why this answer

The error about data quality in Einstein Vision typically occurs when the dataset has only one label per category. For binary classification (defective vs. non-defective), each category must contain at least two distinct labels to allow the model to learn meaningful patterns. With only one label per category, the model cannot differentiate between variations within a class, leading to a data quality error.

Exam trap

Salesforce often tests the misconception that more images automatically solve training failures, when the real issue is insufficient label diversity within categories.

How to eliminate wrong answers

Option A is wrong because JPEG is a supported image format in Einstein Vision, and the format itself does not cause data quality errors. Option C is wrong because while image size limits exist, Einstein Vision accepts images up to 10 MB, and the error message specifically mentions data quality, not file size. Option D is wrong because 500 images per category meets the minimum requirement (typically 10-50 images per label), so insufficient quantity is not the issue here.

27
MCQhard

You are a data scientist at a retail company. The company uses Einstein Discovery to analyze customer purchase patterns. The model is built on a dataset of 50,000 transactions. The model's R-squared is 0.85, but the predictions for new customers are consistently off by a large margin. The data includes features like 'Customer Age', 'Income', 'Previous Purchases', and 'Product Category'. The model was trained on data from the past two years. However, six months ago, the company launched a new loyalty program that significantly changed purchasing behavior. You suspect the model is not generalizing to new customers. What should you do to validate your hypothesis?

A.Create a holdout set of transactions from the last six months and compare model performance on it vs. older data
B.Exclude new customers from the dataset entirely
C.Increase the training data size to include older transactions
D.Remove the 'Product Category' feature to simplify the model
AnswerA

If performance is worse on recent data, concept drift is confirmed.

Why this answer

Option A is correct because creating a holdout set of transactions from the last six months directly tests whether the model's performance has degraded due to the loyalty program's impact on purchasing behavior. By comparing the R-squared or other metrics on this recent holdout set versus older data, you can quantify the drop in predictive accuracy and confirm that the model fails to generalize to the new data distribution. This approach is a standard method for detecting concept drift in machine learning models, especially when external changes (like a loyalty program) alter the underlying patterns.

Exam trap

Salesforce often tests the misconception that improving model performance (e.g., by adding more data or simplifying features) is the correct response to poor generalization, rather than first validating the hypothesis of concept drift through a time-based holdout evaluation.

How to eliminate wrong answers

Option B is wrong because excluding new customers entirely would remove the very data needed to detect the generalization failure, and it does not validate the hypothesis about model performance on new customers. Option C is wrong because increasing training data with older transactions would only reinforce the model's bias toward pre-loyalty-program patterns, making it even less adaptable to the new behavior. Option D is wrong because removing the 'Product Category' feature simplifies the model but does not address the root cause of concept drift; it may reduce accuracy further and does not test whether the loyalty program caused the shift.
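
The time-based holdout comparison described in option A can be sketched as follows. This is an illustration only: `predict` is a hypothetical stand-in for the trained model, the records are invented, and the cutoff date is an assumed loyalty-program launch date. Einstein Discovery reports its own metrics; the code simply shows the idea of scoring the same model on two time windows.

```python
from datetime import date

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def holdout_comparison(records, cutoff, predict):
    """Score the same model separately on records before and after `cutoff`."""
    older = [r for r in records if r["date"] < cutoff]
    recent = [r for r in records if r["date"] >= cutoff]

    def score(subset):
        return r_squared([r["spend"] for r in subset], [predict(r) for r in subset])

    return score(older), score(recent)

def predict(record):
    # Hypothetical model learned from pre-launch data: spend = 2 * purchases.
    return 2 * record["previous_purchases"]

# Invented data: older records follow the learned pattern, recent ones do not.
records = (
    [{"date": date(2023, 1, d), "previous_purchases": d, "spend": 2 * d}
     for d in range(1, 6)]
    + [{"date": date(2024, 1, d), "previous_purchases": d, "spend": s}
       for d, s in zip(range(1, 6), [10, 3, 12, 5, 9])]
)
older_r2, recent_r2 = holdout_comparison(records, date(2023, 7, 1), predict)
print(older_r2, recent_r2)
```

A clearly lower score on the recent window than on the older one supports the drift hypothesis; similar scores would point to a different cause.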

28
Multi-Selectmedium

A Salesforce admin is reviewing data sources for Einstein Recommendation Builder. Which two data types are required for training? (Choose two.)

Select 2 answers
A.User profile data
B.User-item interactions
C.Sales reports
D.External product prices
E.Item metadata
AnswersB, E

Interaction data (e.g., clicks, purchases) forms the basis of recommendations.

Why this answer

Options B and E are correct. User-item interactions (B) and item metadata (E) are essential for recommendation models. User profile data (A) is optional; sales reports (C) are not a standard input; external product prices (D) are not required.

29
MCQhard

A company is building a text classification model for customer support tickets. They have a dataset of 10,000 tickets. The team decides to use active learning for labeling. Which approach best aligns with active learning principles?

A.Randomly select 2,000 tickets and label them manually.
B.Train a preliminary model and prioritize labeling tickets with low prediction confidence.
C.Use a pre-trained model to label all tickets automatically.
D.Have subject matter experts label all 10,000 tickets.
AnswerB

Active learning focuses on uncertain samples.

Why this answer

Active learning iteratively selects the most informative unlabeled data points for labeling, typically those with low prediction confidence from a preliminary model. This minimizes labeling effort while maximizing model performance, which is the core principle of active learning.

Exam trap

Salesforce often tests the distinction between active learning and passive learning (random sampling) or semi-supervised learning, and the trap here is assuming that any automated labeling (like using a pre-trained model) qualifies as active learning, when in fact active learning requires iterative human feedback based on model uncertainty.

How to eliminate wrong answers

Option A is wrong because random selection ignores model uncertainty, wasting labeling effort on data that may not improve the model. Option C is wrong because using a pre-trained model to auto-label all tickets bypasses the human-in-the-loop feedback essential for active learning and may propagate errors. Option D is wrong because labeling all 10,000 tickets defeats the purpose of active learning, which is to reduce labeling cost by focusing only on informative samples.
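
The selection step in option B is often called uncertainty sampling. A minimal sketch, assuming the preliminary model returns class probabilities for each unlabeled ticket; the function name, ticket ids, and probabilities are all hypothetical.

```python
def select_for_labeling(predictions, batch_size):
    """Return ids of the tickets whose top predicted probability is lowest."""
    ranked = sorted(predictions, key=lambda p: max(p["probs"]))
    return [p["id"] for p in ranked[:batch_size]]

# Hypothetical output of a preliminary two-class model on unlabeled tickets.
predictions = [
    {"id": "T1", "probs": [0.95, 0.05]},
    {"id": "T2", "probs": [0.55, 0.45]},
    {"id": "T3", "probs": [0.51, 0.49]},
    {"id": "T4", "probs": [0.80, 0.20]},
]
to_label = select_for_labeling(predictions, batch_size=2)
print(to_label)
```

In a full loop, the selected tickets would be labeled by a person, added to the training set, and the model retrained before the next selection round.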

30
MCQeasy

A marketer wants to use Einstein Segment Creation to build a segment for a campaign. Which data source can be used?

A.Standard report snapshots.
B.Data Cloud unified profile data.
C.External web analytics.
D.Einstein Activity Capture data.
AnswerB

Unified profiles contain the data needed for segmentation.

Why this answer

Einstein Segment Creation works with Data Cloud unified profiles.

31
MCQhard

Refer to the exhibit. What is the most likely cause of this error?

A.The source data is missing required fields.
B.The data transform has a recursive formula.
C.The target field expects an integer but source provides null.
D.A division formula in the data transform is dividing by a field that contains zero values.
AnswerD

Directly matches the ArithmeticException: / by zero.

Why this answer

The error shown in the exhibit is a division-by-zero runtime error, which occurs when a formula in a data transform attempts to divide a value by a field that contains zero. Data transforms execute field-level calculations, and if a divisor field holds a zero, the system throws a 'Divide by zero' exception. Option D correctly identifies this as the most likely cause because the error message explicitly indicates a division operation failed due to a zero divisor.

Exam trap

Salesforce often tests the distinction between null values and zero values in arithmetic operations, so candidates mistakenly choose Option C (null) when the actual error is caused by a zero divisor, not a missing value.

How to eliminate wrong answers

Option A is wrong because missing required fields typically produce validation or commit errors, not a division-by-zero runtime error. Option B is wrong because a recursive formula would cause a stack overflow or infinite loop error, not a division-by-zero exception. Option C is wrong because a target field expecting an integer but receiving null would result in a 'null value' or 'type mismatch' error, not a division-by-zero error.
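
The null-versus-zero distinction can be seen in most languages, because the two cases fail in different ways. The sketch below uses Python as an analogy, not the platform's own formula engine: a zero divisor raises `ZeroDivisionError`, while a missing (`None`) divisor raises `TypeError`. Both function names are hypothetical.

```python
def failure_kind(numerator, divisor):
    """Report which failure an unguarded division would produce, if any."""
    try:
        numerator / divisor
    except ZeroDivisionError:
        return "zero divisor"
    except TypeError:
        return "null divisor"
    return "ok"

def ratio(numerator, divisor, default=None):
    """Guarded division: return `default` when the divisor is missing or zero."""
    if divisor is None or divisor == 0:
        return default
    return numerator / divisor

print(failure_kind(10, 0))     # zero divisor
print(failure_kind(10, None))  # null divisor
print(ratio(10, 4))            # 2.5
```

A guard such as `ratio` is one way to keep a transform from failing on zero-valued rows; whether to substitute a default or reject the row is a design decision for the pipeline owner.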

32
MCQhard

A global company uses Salesforce Einstein Discovery to predict customer churn. They have a dataset with fields: Customer_Since__c (date), Last_Interaction_Date__c (date), Support_Cases__c (number), Product_Usage__c (percentage), Region__c (picklist), and Churned__c (boolean target). The model was trained and deployed, but predictions show bias against customers in the "EMEA" region. The data scientist notices that in the training data, 80% of EMEA customers are labeled as churned, while only 20% of other regions. Additionally, the Product_Usage__c field has many missing values for EMEA customers. The company wants to retrain the model to reduce bias. What is the best course of action?

A.Oversample EMEA churned customers and undersample non-churned from other regions
B.Increase the sample size of EMEA customers by adding synthetic data
C.Remove the Region__c field from the model and retrain
D.Preprocess the data to impute missing Product_Usage__c values using region-specific averages, and then rebalance the dataset using stratified sampling
AnswerD

Region-specific imputation preserves regional characteristics, and stratified sampling ensures each region is proportionally represented in training, reducing bias.

Why this answer

Option D is correct because it addresses both the missing data (impute using region-specific averages to preserve regional patterns) and the class imbalance (stratified sampling ensures balanced representation across regions during training). Option A only rebalances but does not fix the missing data, which could still bias the model; Option B uses synthetic data, which may introduce artificial patterns; Option C removes Region, losing valuable information.
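
The two preprocessing steps in option D can be sketched as follows. This is a simplified illustration with invented rows and hypothetical helper names; the sampler draws the same fraction from every stratum, which is one common reading of "stratified sampling" and is an assumption here.

```python
import random
from collections import defaultdict

def impute_by_group(rows, group_field, value_field):
    """Fill missing values with the mean of the observed values in the same group."""
    observed = defaultdict(list)
    for r in rows:
        if r[value_field] is not None:
            observed[r[group_field]].append(r[value_field])
    means = {g: sum(v) / len(v) for g, v in observed.items()}
    # Groups with no observed values keep None (means.get returns None).
    return [
        {**r, value_field: means.get(r[group_field])} if r[value_field] is None else dict(r)
        for r in rows
    ]

def stratified_sample(rows, strata_fields, fraction, seed=0):
    """Draw the same fraction of records from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in rows:
        strata[tuple(r[f] for f in strata_fields)].append(r)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Invented rows: EMEA has more missing usage values than AMER.
rows = [
    {"region": "EMEA", "usage": 40.0}, {"region": "EMEA", "usage": 60.0},
    {"region": "EMEA", "usage": None}, {"region": "EMEA", "usage": None},
    {"region": "AMER", "usage": 80.0}, {"region": "AMER", "usage": None},
]
imputed = impute_by_group(rows, "region", "usage")
sample = stratified_sample(imputed, ["region"], fraction=0.5)
print(imputed[2]["usage"], imputed[5]["usage"], len(sample))
```

In practice the strata would usually combine region with the target field (for example `["region", "churned"]`) so that both dimensions are controlled.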

33
Multi-Selecteasy

Which TWO are common data quality issues that can negatively impact AI model performance?

Select 2 answers
A.Missing values in critical fields
B.Low model accuracy during validation
C.Insufficient storage space for data
D.Inconsistent data governance policies
E.Duplicate records in the dataset
AnswersA, E

Missing data is a common quality issue.

Why this answer

Missing values in critical fields (Option A) are a common data quality issue because many AI models, particularly those relying on statistical or gradient-based optimization, cannot handle null or NaN inputs without imputation or removal. If missing values are not addressed, the model may learn biased patterns or fail to converge, leading to degraded predictive performance.

Exam trap

Salesforce often tests the distinction between data quality issues (problems with the data itself) and model performance issues or infrastructure constraints, so candidates mistakenly select options like low accuracy or insufficient storage as data quality problems.

34
MCQmedium

A data scientist needs to prepare data for Einstein Discovery. The dataset includes a field 'Customer_Status__c' with values 'Active', 'Inactive', and 'Churned'. How should this field be treated?

A.Create separate boolean fields for each value to improve model accuracy.
B.Remove the field because text fields cannot be used in Einstein Discovery.
C.Keep as a text field and let Einstein Discovery handle it as a categorical predictor.
D.Convert to numeric values 1, 2, 3 to preserve order.
AnswerC

Einstein Discovery automatically treats text fields as categorical predictors.

Why this answer

Option C is correct because Einstein Discovery natively supports text fields as categorical predictors, automatically encoding them for model training. The platform handles string values like 'Active', 'Inactive', and 'Churned' without requiring manual transformation, preserving the semantic meaning and cardinality of the data.

Exam trap

The trap here is that candidates assume text fields must be converted to numbers or one-hot encoded for machine learning, but Einstein Discovery abstracts this preprocessing, and manual conversion can introduce ordinal bias or unnecessary complexity.

How to eliminate wrong answers

Option A is wrong because creating separate boolean fields for each value (one-hot encoding) is unnecessary and can introduce multicollinearity or increase feature dimensionality without benefit, as Einstein Discovery's internal preprocessing already handles categorical encoding optimally. Option B is wrong because text fields are fully supported in Einstein Discovery as categorical predictors; the platform does not require numeric-only inputs and can process string values directly. Option D is wrong because converting to numeric values 1, 2, 3 implies an ordinal relationship that does not exist among 'Active', 'Inactive', and 'Churned', which would mislead the model into treating the categories as ordered, degrading prediction accuracy.

35
MCQeasy

A team is building a pipeline to train a model daily. The source data arrives in CSV files but needs to be converted to Parquet for efficiency. Which pipeline step should perform this conversion?

A.Feature engineering step
B.Model deployment step
C.Data validation step
D.Data ingestion step
AnswerD

Ingestion can transform data into a more efficient format.

Why this answer

Option D is correct because the data ingestion step is responsible for bringing raw data into the pipeline, including format conversions like CSV to Parquet. Converting to Parquet at ingestion improves storage efficiency and query performance for downstream processing, as Parquet uses columnar storage and compression.

Exam trap

Salesforce often tests the distinction between data ingestion (raw data handling) and data validation (quality checks), leading candidates to confuse format conversion with validation steps.

How to eliminate wrong answers

Option A is wrong because feature engineering transforms existing data into features for model training, not raw format conversion. Option B is wrong because model deployment serves the trained model for inference, not data preprocessing. Option C is wrong because data validation checks data quality and schema compliance, but does not perform format conversion.

36
Multi-Selecthard

Which THREE actions are recommended when preparing data for Einstein Next Best Action? (Choose 3)

Select 3 answers
A.Provide data on which actions were offered and whether they were accepted
B.Include at least 10 different action types per strategy
C.Record rejections (actions not taken) as negative examples
D.Use only historical data from the last 30 days
E.Retrain the model weekly with fresh interaction data
AnswersA, C, E

This is essential for reinforcement learning.

Why this answer

Option A is correct because Einstein Next Best Action (NBA) requires historical interaction data showing which actions were offered and whether they were accepted to train the predictive model. This feedback loop enables the AI to learn which actions are most effective for specific customer contexts, directly improving recommendation accuracy.

Exam trap

Salesforce often tests the misconception that more action types or recent data alone improve model performance, when in fact the key requirements are balanced positive/negative examples, sufficient historical depth, and regular retraining with fresh interaction data.

37
Multi-Selectmedium

A data analyst is troubleshooting Einstein Article Recommendations that are not showing up on the site. Which TWO checks should be performed first? (Choose 2)

Select 2 answers
A.Ensure at least 100 articles are in the knowledge base
B.Confirm that article authors have the correct profile permissions
C.Check that article view events are being captured in the data
D.Increase the recommendation frequency from daily to hourly
E.Verify that the recommendation model is published and active
AnswersC, E

Without view data, the model has no basis to recommend.

Why this answer

Option C is correct because Einstein Article Recommendations rely on user interaction data, specifically article view events, to generate personalized recommendations. If these events are not being captured, the model has no input to learn from, and recommendations will not appear. Checking event capture is a fundamental first step in troubleshooting data pipeline issues.

Exam trap

Salesforce often tests the misconception that increasing data volume or frequency (Options A and D) will fix recommendation issues, when in fact the core problem is usually missing event data or an inactive model.

38
MCQeasy

You are a Salesforce admin at a nonprofit organization. The organization uses Einstein Engagement Scoring to prioritize donors for outreach. The model is based on donation history and event attendance. Recently, the model stopped generating new scores for recently added donors. You check the data source and see that the model's data includes the 'Contact' and 'Opportunity' objects. The data refresh is scheduled daily. The model status is 'Active'. What should you investigate first to resolve the issue?

A.Check if the model has reached its scoring capacity and needs retraining
B.Add the 'Lead' object to the data source
C.Increase the data refresh frequency to hourly
D.Check if the model was deactivated automatically
AnswerA

Engagement Scoring models have a limit on scored records; after reaching it, new records are not scored until retraining.

Why this answer

Option A is correct because Einstein Engagement Scoring models have a maximum scoring capacity (e.g., 2 million scored records per model). When new donors are added but the model stops generating scores, the most likely cause is that the model has reached this capacity and requires retraining to incorporate new records. Retraining resets the scoring queue and allows the model to score newly added donors.

Exam trap

The trap here is that candidates assume the issue is data freshness or object configuration, but Salesforce tests the specific behavior that Einstein models have a scoring capacity limit that requires retraining, not just data refresh or object inclusion.

How to eliminate wrong answers

Option B is wrong because the model is already based on 'Contact' and 'Opportunity' objects, which are the correct objects for donor scoring; adding the 'Lead' object is irrelevant since leads are not donors and would not resolve the scoring stoppage. Option C is wrong because increasing the data refresh frequency addresses data latency, not the model's inability to score new records due to capacity limits; the model is already refreshing daily and the issue is scoring capacity, not data freshness. Option D is wrong because the model status is explicitly stated as 'Active', so deactivation is not the cause; automatic deactivation would change the status to 'Inactive' or 'Error', which is not the case here.

39
Multi-Selecteasy

Which THREE types of data sources are commonly integrated into Salesforce Data Cloud for AI use cases?

Select 3 answers
A.Third-party demographic data
B.Web and mobile app engagement data
C.CRM transaction records
D.Model training logs
E.Data transformation scripts
AnswersA, B, C

External data enhances AI models.

Why this answer

Option A is correct because Salesforce Data Cloud can ingest third-party demographic data from external sources (e.g., data enrichment providers) to enrich customer profiles. This data, when combined with first-party data, enables AI models to generate more accurate predictions and segmentations. Data Cloud’s Data Streams and Data Lake objects support structured ingestion of such external datasets.

Exam trap

Salesforce often tests the distinction between data sources (raw inputs) and data processing artifacts (logs, scripts), leading candidates to mistakenly select model training logs or transformation scripts as valid data sources.

40
Multi-Selecthard

Data quality is critical for AI model performance. Which three data quality dimensions should be monitored? (Choose three.)

Select 3 answers
A.Completeness
B.Consistency
C.Uniqueness
D.Timeliness
E.Volume
AnswersA, B, D

Ensures no missing values that could bias the model.

Why this answer

Completeness, timeliness, and consistency are fundamental data quality dimensions. Volume is not a quality dimension; uniqueness is related to consistency but not always required.

41
MCQeasy

An administrator is configuring a Salesforce AI model that uses historical sales data. The data includes fields like 'Amount', 'Close_Date', and 'Lead_Source'. What is the primary purpose of data preprocessing in this context?

A.To generate visualizations for business stakeholders
B.To increase the storage capacity of the database
C.To enforce data access permissions for different user roles
D.To clean and transform data into a format suitable for model training
AnswerD

Preprocessing ensures data quality and format.

Why this answer

Data preprocessing is essential for AI models because raw historical sales data often contains missing values, inconsistent formats, and noise. Cleaning (e.g., handling nulls in 'Amount') and transforming (e.g., encoding 'Lead_Source' into numerical features) ensure the model can learn patterns effectively, directly impacting training accuracy and convergence.

Exam trap

Salesforce often tests the distinction between data preprocessing and other data management tasks; the trap here is that candidates confuse preprocessing with reporting (visualizations) or security (permissions), when the core goal is to prepare data for model ingestion.

How to eliminate wrong answers

Option A is wrong because generating visualizations is a downstream analytics task, not the primary purpose of preprocessing for model training. Option B is wrong because preprocessing does not increase storage capacity; it may reduce data size through cleaning but does not affect database storage limits. Option C is wrong because enforcing data access permissions is a security and governance concern, handled by Salesforce's sharing and permission settings, not by data preprocessing steps.

42
MCQhard

Refer to the exhibit. The data pipeline is failing. What is the most likely cause?

A.Network timeout.
B.Missing required field in source data.
C.Insufficient memory.
D.Schema mismatch between source and target.
AnswerD

The field 'account_id' is expected in the schema but is not found, indicating a schema mismatch.

Why this answer

Option D is correct because a schema mismatch between source and target is the most common cause of pipeline failures in data integration workflows. When the source data structure (e.g., column names, data types, or nested fields) does not match the target schema, the pipeline cannot map or transform the data correctly, leading to errors during ingestion or transformation stages. This is especially relevant in tools like Apache NiFi, AWS Glue, or Azure Data Factory, where schema validation is enforced at runtime.

Exam trap

Salesforce often tests the misconception that pipeline failures are always due to network or resource issues, but the trap here is that schema mismatch is a subtle, configuration-level error that is frequently overlooked in favor of more obvious causes like timeouts or memory limits.

How to eliminate wrong answers

Option A is wrong because a network timeout would typically produce a connection error or retry failure, not a schema-related failure, and most pipelines have retry mechanisms to handle transient network issues. Option B is wrong because a missing required field in source data would cause a validation error or null constraint violation, but the question describes a pipeline failure that is more likely due to structural incompatibility rather than missing data. Option C is wrong because insufficient memory would manifest as an out-of-memory error or performance degradation, not a schema mismatch error, and modern pipelines are designed to handle memory constraints gracefully.

43
Multi-Selecteasy

When preparing data for Einstein Next Best Action, which two aspects must be considered for compliance with data privacy regulations? (Choose two.)

Select 2 answers
A.Data compression
B.Color coding fields in the dataset
C.Indexing speed for real-time recommendations
D.Consent management
E.Data masking of personally identifiable information (PII)
AnswersD, E

Obtaining and tracking consent is a fundamental privacy requirement.

Why this answer

Options D and E are correct. Consent management (D) ensures a legal basis for processing; data masking of PII (E) protects sensitive data. Data compression (A) and color coding (B) are not privacy measures; indexing speed (C) is performance related.

44
MCQhard

A financial services company uses Salesforce AI to detect fraudulent transactions. The dataset has 1 million legitimate transactions and only 1,000 fraudulent ones. The model trained with default parameters achieves 99.9% accuracy but identifies no fraud (precision and recall of 0). The data scientist wants to maximize fraud detection (recall) while minimizing false positives. Which approach is most effective?

A.Increase the weight of the majority class in the loss function.
B.Use SMOTE to generate synthetic fraud samples to balance the dataset.
C.Train multiple models on different random subsets and average predictions.
D.Use a simpler model to avoid overfitting on the majority class.
AnswerB

SMOTE creates synthetic instances of the minority class, allowing the model to learn fraud patterns effectively and improve recall.

Why this answer

With extreme imbalance, oversampling the minority class (e.g., SMOTE) generates synthetic fraud examples, helping the model learn fraud patterns and improve recall without discarding legitimate data.
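
The numbers in the question can be checked by hand: a model that labels every transaction as legitimate is right on 1,000,000 of 1,001,000 records yet finds no fraud. The sketch below reproduces that arithmetic and then shows the interpolation idea at the core of SMOTE. The `interpolate` helper and the two fraud feature vectors are hypothetical, and real SMOTE also selects the neighbor from the k nearest minority samples, which is omitted here.

```python
import random

legit, fraud = 1_000_000, 1_000

# A model that always predicts "legitimate".
true_pos, false_neg = 0, fraud
true_neg, false_pos = legit, 0

accuracy = (true_pos + true_neg) / (legit + fraud)
recall = true_pos / (true_pos + false_neg)
print(round(accuracy, 3), recall)

def interpolate(sample, neighbor, rng):
    """Create a synthetic point on the line segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(0)
fraud_a = [120.0, 3.0]  # hypothetical features: amount, attempts
fraud_b = [180.0, 5.0]
synthetic = interpolate(fraud_a, fraud_b, rng)
print(synthetic)
```

Because accuracy is dominated by the majority class, recall and precision on the fraud class are the metrics to watch after rebalancing.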

45
MCQeasy

A Salesforce admin wants to use Einstein Prediction Builder to predict case resolution time. What type of data is most critical for training this model?

A.Customer satisfaction survey responses
B.Historical case records including resolution time
C.Product inventory levels
D.Employee work schedules
AnswerB

Historical data is essential for training.

Why this answer

Einstein Prediction Builder requires historical data with known outcomes to train a supervised machine learning model. Historical case records containing actual resolution times provide the labeled examples needed for the model to learn patterns and predict future case resolution times. Without this ground truth data, the model cannot be trained to make accurate predictions.

Exam trap

The trap here is that candidates may confuse factors that influence resolution time (like employee schedules or inventory) with the actual labeled outcome data required to train a supervised prediction model.

How to eliminate wrong answers

Option A is wrong because customer satisfaction survey responses measure post-resolution sentiment, not the actual resolution time, and they lack the precise timestamp data required for regression-based time prediction. Option C is wrong because product inventory levels are unrelated to case resolution time; they might be relevant for supply chain predictions but not for service case duration. Option D is wrong because employee work schedules, while potentially influencing resolution time, are not the historical outcome data needed to train the model — the model needs actual resolution times from past cases, not staffing inputs.

46
MCQmedium

While building a prediction model in Einstein Studio, the system warns about "high cardinality" for a categorical field. What should the admin do?

A.Convert the field to a numeric type
B.Use frequency encoding or binning to reduce cardinality
C.Remove the field from the model
D.Increase the model complexity by adding more trees
AnswerB

Frequency encoding (replace with count) or binning groups rare values into categories, reducing cardinality while preserving signal.

Why this answer

Option B is correct because high cardinality (many unique values) can hurt model performance. Frequency encoding or binning reduces cardinality while retaining information. Removing the field (Option C) or converting it to a numeric type (Option A) may lose information, and increasing model complexity (Option D) is not recommended.
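
Both techniques are short to express. A minimal sketch with hypothetical helper names and invented city values; the `min_count` threshold and the `"Other"` bucket label are assumptions for illustration.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each category with the number of times it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]

def bin_rare(values, min_count, other="Other"):
    """Group categories seen fewer than `min_count` times into one bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

cities = ["Paris", "Paris", "Paris", "Lyon", "Lyon", "Nice", "Metz"]
encoded = frequency_encode(cities)
binned = bin_rare(cities, min_count=2)
print(encoded)
print(binned)
```

Here binning cuts four distinct values down to three; on a field with thousands of rare values the reduction is far larger.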

47
MCQmedium

A company is preparing customer data for a predictive model. They notice that many records have missing values for the 'annual income' field. Which approach is best to handle this issue while minimizing bias?

A.Remove all records with missing values.
B.Use model-based imputation considering other features.
C.Replace missing values with the mean.
D.Set missing values to zero.
AnswerB

Model-based imputation leverages other features to predict missing values, preserving relationships and minimizing bias.

Why this answer

Model-based imputation (Option B) is best because it uses relationships between features (e.g., education, job role) to predict missing 'annual income' values, preserving data distribution and minimizing bias. This approach avoids the distortion caused by simple mean/zero imputation and retains sample size better than deletion.

Exam trap

Salesforce often tests the misconception that mean imputation is a safe default, but the trap here is that it ignores feature dependencies and can artificially shrink variance, leading to overconfident model predictions and biased coefficients.

How to eliminate wrong answers

Option A is wrong because removing all records with missing values can introduce selection bias and reduce sample size, potentially discarding valuable patterns in the data. Option C is wrong because replacing missing values with the mean ignores feature correlations, artificially compresses variance, and can bias relationships in the predictive model. Option D is wrong because setting missing values to zero is arbitrary and unrealistic for income data, likely creating a skewed distribution and misleading model coefficients.
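
The claim that mean imputation compresses variance can be verified with a small worked example. The income figures below are invented; four observed values are joined by four missing values that are all filled with the observed mean.

```python
from statistics import mean, pvariance

observed = [30, 50, 70, 90]           # known annual incomes (in thousands)
fill = mean(observed)                 # 60
imputed = observed + [fill] * 4       # four missing records filled with the mean

var_before = pvariance(observed)
var_after = pvariance(imputed)
print(var_before, var_after)
```

The mean is unchanged, but the variance is halved because every imputed record sits exactly at the center of the distribution. Model-based imputation avoids this by predicting a different value for each record from its other features.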

48
MCQhard

A data pipeline fails intermittently when processing large CSV files. The error log shows 'OutOfMemoryError'. Which configuration change is most likely to resolve this?

A.Use a smaller file size limit.
B.Increase the number of worker threads.
C.Switch to XML format.
D.Increase the heap memory for the processing application.
AnswerD

Increasing heap memory provides more space for large file processing.

Why this answer

The OutOfMemoryError indicates that the Java Virtual Machine (JVM) heap space is exhausted while processing large CSV files. Increasing the heap memory (e.g., using -Xmx flag) allocates more memory to the application, allowing it to handle larger datasets without crashing. This directly addresses the root cause of insufficient memory for the data pipeline's processing workload.

Exam trap

Salesforce often tests the misconception that increasing parallelism (worker threads) solves memory issues, but in reality, more threads increase memory pressure and can trigger OutOfMemoryError faster.

How to eliminate wrong answers

Option A is wrong because using a smaller file size limit is a workaround that avoids the problem rather than solving it, and it may not be feasible if large files are required by the business. Option B is wrong because increasing worker threads typically increases memory consumption and contention, which would worsen the OutOfMemoryError, not resolve it. Option C is wrong because switching to XML format would likely increase memory usage due to verbose markup and parsing overhead, making the error more likely, not less.

49
MCQhard

A company has international customers and wants Einstein Prediction Builder to forecast deal closure probability. The data includes fields like 'region', 'product line', and 'deal amount'. What is a best practice to ensure the model works for all regions?

A.One-hot encode the region field using 50+ dummy variables.
B.Remove the region field to avoid bias.
C.Use region as a numeric rank based on past conversion rates.
D.Group regions into broader categories like 'Americas', 'EMEA', 'APAC'.
AnswerD

Grouping reduces noise and improves generalizability while maintaining regional distinction.

Why this answer

Option D is correct because grouping regions into broader categories like 'Americas', 'EMEA', and 'APAC' reduces high cardinality and sparsity in categorical features, which improves model stability and prevents overfitting in Einstein Prediction Builder. This approach ensures each region group has sufficient training data to learn meaningful patterns, enabling the model to generalize better across all regions without introducing bias from rare categories.

Exam trap

Salesforce often tests the misconception that more granular data (like one-hot encoding with many categories) always improves model accuracy, when in fact it can harm performance due to sparsity and overfitting in prediction builder tools.

How to eliminate wrong answers

Option A is wrong because one-hot encoding a region field with 50+ dummy variables introduces high cardinality and sparsity, which can cause the model to overfit to rare categories and degrade prediction performance in Einstein Prediction Builder. Option B is wrong because removing the region field entirely discards valuable geographic information that can significantly influence deal closure probability, leading to a less accurate model. Option C is wrong because using region as a numeric rank based on past conversion rates introduces ordinal bias and assumes a linear relationship that may not exist, which can misrepresent the true categorical nature of the data and reduce model interpretability.
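As a rough illustration of option D, the sketch below collapses granular region values into three broad groups before the data reaches a model. The mapping table, the `Other` fallback, and the sample deals are hypothetical and are not part of Einstein Prediction Builder.

```python
# Hypothetical mapping from granular region values to broad groups.
REGION_GROUPS = {
    "United States": "Americas",
    "Brazil": "Americas",
    "Germany": "EMEA",
    "South Africa": "EMEA",
    "Japan": "APAC",
    "Australia": "APAC",
}


def group_region(region: str) -> str:
    """Map a granular region to a broad group; unknown values fall back to 'Other'."""
    return REGION_GROUPS.get(region.strip(), "Other")


# Illustrative records only; field names mirror the question text.
deals = [
    {"region": "Germany", "deal_amount": 12000},
    {"region": "Japan", "deal_amount": 8000},
    {"region": "Brazil", "deal_amount": 15000},
]

# Add the low-cardinality feature alongside the original field.
grouped = [{**d, "region_group": group_region(d["region"])} for d in deals]
```

The model then sees a handful of well-populated categories instead of 50+ sparse ones, which is the effect the explanation describes.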

50
MCQeasy

A company wants to build a sentiment analysis model using customer feedback. What is the best practice for labeling the training data?

A.Ignore labeling and use unsupervised learning
B.Have a single domain expert label all data
C.Employ a diverse set of human labelers with clear guidelines
D.Use automated keyword matching to assign sentiment
AnswerC

Human labeling with guidelines provides accurate, consistent labels.

Why this answer

Using diverse human labelers with clear guidelines ensures label consistency and reduces bias. Automated keyword matching is error-prone, a single expert may introduce personal bias, and skipping labeling in favor of unsupervised learning gives the model no sentiment labels to learn from.

51
MCQeasy

A company wants to train an AI model to predict customer churn using historical data that contains many missing values. What is the best practice for handling missing data?

A.Use only features without missing values.
B.Ignore missing values as they do not affect AI training.
C.Impute missing values using mean or median.
D.Remove all records with missing values.
AnswerC

Imputation preserves data and reduces bias.

Why this answer

Option C is correct because imputing missing values using mean or median is a standard practice that preserves the dataset size and statistical properties, allowing the AI model to learn from all available features without introducing bias from data removal. This approach is particularly effective for numerical features in customer churn prediction, where missing values are often random and imputation maintains the distribution for algorithms like logistic regression or gradient boosting.

Exam trap

Salesforce often tests the misconception that removing missing data is safe, but the trap here is that candidates overlook how data removal can shrink the dataset and introduce bias, while imputation is a more balanced and widely accepted practice in AI workflows.

How to eliminate wrong answers

Option A is wrong because discarding features with missing values can remove valuable predictors, reducing model accuracy and ignoring the fact that missingness itself may carry predictive signal. Option B is wrong because ignoring missing values causes most AI algorithms to fail or produce incorrect results, as they cannot process null or NaN entries, leading to runtime errors or biased learning. Option D is wrong because removing all records with missing values can drastically reduce the dataset size, introduce selection bias, and discard potentially useful patterns in the remaining data.
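A minimal sketch of median imputation, assuming records are plain dictionaries and missing values are stored as `None`. The field name `payment_delay_days` and the `_was_missing` flag are illustrative choices; the flag is one way to keep the signal that missingness itself may carry.

```python
from statistics import median


def impute_median(rows, field):
    """Return copies of rows with None in `field` replaced by the observed median.

    A companion '<field>_was_missing' flag records which values were filled.
    """
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [
        {
            **r,
            field: fill if r[field] is None else r[field],
            f"{field}_was_missing": r[field] is None,
        }
        for r in rows
    ]


# Illustrative churn records; one value is missing.
customers = [
    {"id": 1, "payment_delay_days": 2},
    {"id": 2, "payment_delay_days": None},
    {"id": 3, "payment_delay_days": 10},
    {"id": 4, "payment_delay_days": 4},
]

imputed = impute_median(customers, "payment_delay_days")
```

No rows are dropped, so the dataset keeps its size, and the original list is left unchanged.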

52
MCQmedium

A retailer's AI model for recommendation is producing poor results. Analysis shows that the customer entity has many duplicate records with slight variations. Which Data Cloud feature should be used to address this?

A.Create a Data Transformation to merge duplicates using rules
B.Increase the Data Stream frequency to get fresher data
C.Ignore the duplicates and use all records as-is
D.Set up a Data Action to deduplicate at the source
AnswerA

Data Transformations can deduplicate records effectively.

Why this answer

Option A is correct because Data Transformations can apply fuzzy matching to merge records. Option B is wrong because increasing stream frequency does not fix existing duplicates. Option C is wrong because ignoring duplicates leaves poor data.

Option D is wrong because Data Actions trigger external actions, not deduplication.

53
MCQhard

After deploying an AI model in Salesforce, the data scientist notices high accuracy on the training set but poor accuracy on new incoming data. What is this phenomenon called?

A.Data leakage
B.Overfitting
C.Underfitting
D.Concept drift
AnswerB

Overfitting causes high training accuracy but low test accuracy.

Why this answer

Option B is correct because overfitting occurs when the model learns noise instead of the underlying pattern, performing well on training data but poorly on new data. Option C is wrong because underfitting would show poor performance on both. Option A is wrong because data leakage means information about the target leaks into the training features and inflates measured performance; it is not the name for the gap described here.

Option D is wrong because concept drift refers to changing data distribution over time, not immediate poor generalization.

54
MCQeasy

In Salesforce CRM Analytics (formerly Einstein Analytics), what is the primary purpose of a dataset?

A.To prepare data for AI and analytics
B.To run SQL queries directly
C.To store raw, unprocessed records
D.To create dashboards only
AnswerA

Datasets are the building blocks for AI modeling, dashboards, and analytical queries.

Why this answer

In Salesforce CRM Analytics, a dataset is the foundational data structure that transforms raw data into an optimized, columnar format for analytics and AI features like Einstein Discovery. It is created by extracting, cleaning, and aggregating data from sources such as Salesforce objects or external connectors, enabling efficient querying, dashboarding, and machine learning model training. This makes option A correct because the primary purpose is to prepare data specifically for AI and analytics workloads.

Exam trap

Salesforce often tests the misconception that datasets are simply raw storage containers, but the trap here is that candidates overlook the 'preparation for AI' aspect and choose 'store raw records' because they confuse datasets with database tables or data lakes.

How to eliminate wrong answers

Option B is wrong because datasets do not support direct SQL query execution; instead, they use SAQL (Salesforce Analytics Query Language) or lens-based exploration for querying. Option C is wrong because datasets store processed, flattened, and indexed data, not raw, unprocessed records—raw data is typically held in dataflows or external systems before transformation. Option D is wrong because while datasets can be used to build dashboards, their primary purpose is broader, encompassing AI, analytics, and data preparation, not just dashboard creation.

55
MCQmedium

What is the most likely cause of the error?

A.Authentication failure
B.Data quality threshold violation
C.Data schema mismatch
D.Network timeout
AnswerB

Null values exceed acceptable threshold.

Why this answer

Option B is correct because the error mentions a high percentage of null values in a critical field, which violates a data quality threshold. Option C is wrong because a schema mismatch would show field type inconsistencies. Option A is wrong because an authentication failure would show a different error.

Option D is wrong because a network timeout would mention connection issues.

56
Multi-Selecthard

Which TWO techniques are commonly used to handle missing values in a dataset for AI training?

Select 2 answers
A.L1 regularization
B.Deletion of rows with missing values
C.One-hot encoding
D.Min-max normalization
E.Imputation with mean or median
AnswersB, E

Simple but valid method.

Why this answer

Option B is correct because deleting rows with missing values is a straightforward technique to handle missing data, especially when the missingness is random and the dataset is large enough that removing a few rows does not significantly impact model performance. This approach avoids introducing bias from imputation methods but can lead to loss of valuable information if too many rows are removed.

Exam trap

Salesforce often tests the distinction between data preprocessing techniques (like handling missing values) and model regularization or feature engineering, so candidates may confuse L1 regularization or one-hot encoding as methods for missing data when they serve entirely different purposes.

57
MCQeasy

When training an Einstein Discovery model, which data type is not supported as a predictor field?

A.Multi-select picklist
B.Numeric
C.Picklist
D.Date
AnswerA

Multi-select picklists have multiple values per record and cannot be used directly as predictors.

Why this answer

Option A is correct because multi-select picklists are not supported as predictors in Einstein Discovery. Numeric, picklist, and date fields are supported.

58
Multi-Selectmedium

A company is preparing their Salesforce Data Cloud for Einstein AI predictions. They need to ensure data quality and governance. Which TWO actions should they take? (Choose two.)

Select 2 answers
A.Declare uniqueness rules on calculated insights.
B.Create profiling and auditing dashboards to monitor data health.
C.Set role-based access controls on data model objects.
D.Use Data Cloud's data model to establish relationships between objects.
E.Enable automatic field mapping for all data sources.
AnswersB, D

Monitoring data health is essential for ongoing data quality and governance.

Why this answer

Option B is correct because profiling and auditing dashboards help monitor data health and governance. Option D is correct because establishing relationships in the data model is fundamental for accurate predictions. Option A is incorrect because uniqueness rules on calculated insights are not a standard data quality practice.

Option E is incorrect because automatic field mapping may introduce errors without validation. Option C is incorrect because, although role-based access contributes to governance, it is not the primary action for data quality.

59
Multi-Selecteasy

Which TWO are common data quality issues that negatively impact AI model performance? (Choose two.)

Select 2 answers
A.Multicollinearity
B.Missing values
C.Outliers
D.Data volume
E.High dimensionality
AnswersB, C

Missing values can lead to biased or incomplete training.

Why this answer

Options B and C are correct. Missing values and outliers can skew model training. Option E is wrong because high dimensionality is more about feature count than quality.

Option A is wrong because multicollinearity affects interpretability but not necessarily quality. Option D is wrong because data volume alone is not a quality issue.

60
MCQeasy

A marketing team wants to use Einstein Recommendations to personalize product offers on their e-commerce site. They have a dataset of 50,000 customers with purchase history. However, 40% of customers have no purchase history (new registrations). The model performs well for returning customers but gives generic recommendations for new ones. The team wants to improve recommendations for new customers. What data preparation step should they take?

A.Remove all customers with missing purchase history from the training set.
B.Assign a random purchase frequency to each new customer to add variety.
C.Impute missing purchase history with the average purchase frequency across all customers.
D.Use only customers with complete purchase history to train a more accurate model.
AnswerC

Imputation provides a baseline signal for new customers, enabling the model to make reasonable recommendations.

Why this answer

Imputing missing purchase data with a sensible default (e.g., average purchase frequency) gives the model signal for new customers, improving recommendations without discarding data.

61
MCQmedium

A company uses Einstein Discovery to identify factors that increase case resolution time. After training, the model shows that 'Case_Origin__c' has high importance. What action should the company take?

A.Remove the field from the model to reduce complexity.
B.Create interaction terms between Case_Origin and other fields.
C.Increase the data quality threshold for Case_Origin records.
D.Investigate the categories within Case_Origin to understand their impact.
AnswerD

Understanding which origins cause delays helps in process improvement.

Why this answer

Option D is correct because the model identifies 'Case_Origin__c' as important; analyzing its categories can reveal which origins cause delays. Option A is wrong because removing the field loses information. Option B is wrong because the model already accounts for interactions.

Option C is wrong because the origin is not necessarily a data quality issue.

62
MCQmedium

A large retail company uses Data Cloud to consolidate customer data from e-commerce, POS, and loyalty programs. They plan to use Einstein Studio to build a churn prediction model. The data architect notices that the churn model's accuracy is below expectations. Upon investigation, they find that the customer entity in Data Cloud has multiple records for the same customer with slightly different spellings and addresses. The data comes from different streams. What should the data architect do to improve the model?

A.Create a Data Transform to merge duplicate records based on fuzzy matching on name and address fields
B.Increase the data stream frequency to get more recent data
C.Change the primary key in the data model to use a different identifier
D.Use a Calculated Insight to aggregate customer behavior over time
AnswerA

Directly addresses the duplicate issue and creates a unified view.

Why this answer

Option A is the best course of action because creating a Data Transform with fuzzy matching merges duplicates into a single clean record, improving data quality for the model. Option B is flawed because increasing frequency does not fix existing duplicates. Option C changes the primary key but duplicates remain.

Option D aggregates but doesn't resolve the duplication.
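To make the idea of fuzzy matching on name and address concrete, here is a small sketch using Python's standard `difflib`. It is not how Data Cloud implements Data Transforms; the 0.75 threshold, the greedy keep-the-first strategy, and the sample records are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_duplicate(r1, r2, threshold=0.75):
    """Treat two records as duplicates when both name and address are similar."""
    return (
        similarity(r1["name"], r2["name"]) >= threshold
        and similarity(r1["address"], r2["address"]) >= threshold
    )


def merge_duplicates(records, threshold=0.75):
    """Greedy pass: keep the first record of each group of near-duplicates."""
    unique = []
    for rec in records:
        if not any(is_duplicate(rec, kept, threshold) for kept in unique):
            unique.append(rec)
    return unique


# Hypothetical customer records arriving from different streams.
records = [
    {"name": "Jonathan Smith", "address": "12 Main Street"},
    {"name": "Jonathon Smith", "address": "12 Main St."},
    {"name": "Maria Garcia", "address": "98 Oak Avenue"},
]

merged = merge_duplicates(records)
```

A real merge would usually also reconcile field values across the matched records rather than simply keeping the first one; the sketch only shows the matching step.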

63
MCQmedium

Refer to the exhibit. A data scientist tries to query the dataset but receives an error. Which of the following is the most likely cause?

A.The requested fields are not included in the policy.
B.The condition filters out records with amount=5000.
C.The data scientist is not listed in the allowedUsers array.
D.The policy format is invalid JSON.
AnswerA

If the query requests a field not listed (e.g., customer_name), it would be denied.

Why this answer

Option A is correct because a query that requests a field the policy does not list (e.g., customer_name) is denied. Option C is wrong because 'data_scientist' is in allowedUsers, so the user is allowed. Option B is wrong because the policy's condition keeps amounts greater than 0 and less than 10000, so 5000 is included.

64
Multi-Selecthard

Before training an Einstein Prediction model, a data analyst must perform data quality checks. Which THREE checks are most critical?

Select 3 answers
A.Confirm that label distribution matches the target baseline
B.Remove duplicate records that could cause data leakage
C.Verify consistent data types across records (e.g., all dates as Date)
D.Ensure all features follow a normal distribution
E.Check for missing values in key fields
AnswersB, C, E

Duplicates can over-represent certain patterns.

Why this answer

Option B is correct because duplicate records can cause data leakage by allowing the model to see the same or highly similar data in both training and validation splits, leading to overfitting and inflated performance metrics. Removing duplicates ensures that the model generalizes to unseen data rather than memorizing repeated instances.

Exam trap

Salesforce often tests the misconception that all features must be normally distributed, which is a requirement for some statistical tests but not for machine learning models like those in Einstein Prediction Builder, which can handle non-normal data via tree-based or ensemble methods.
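The three checks in the answer (duplicates, consistent types, missing values) can be sketched as a small report function. This is a generic illustration over dictionaries, not an Einstein Prediction Builder API; the function name, field names, and sample rows are hypothetical.

```python
from datetime import date


def quality_report(rows, key_fields, expected_types):
    """Count missing key fields, type mismatches, and exact duplicate rows."""
    missing = {f: sum(1 for r in rows if r.get(f) is None) for f in key_fields}

    # A non-null value counts as a type error if it is not of the expected type.
    type_errors = {
        f: sum(1 for r in rows if r.get(f) is not None and not isinstance(r[f], t))
        for f, t in expected_types.items()
    }

    # Exact duplicates: rows whose full set of (field, value) pairs was seen before.
    seen, duplicates = set(), 0
    for r in rows:
        fingerprint = tuple(sorted((k, repr(v)) for k, v in r.items()))
        if fingerprint in seen:
            duplicates += 1
        seen.add(fingerprint)

    return {"missing": missing, "type_errors": type_errors, "duplicates": duplicates}


rows = [
    {"amount": 500.0, "close_date": date(2024, 1, 15), "stage": "Won"},
    {"amount": 500.0, "close_date": date(2024, 1, 15), "stage": "Won"},  # duplicate
    {"amount": None, "close_date": "2024-02-01", "stage": "Lost"},  # missing, bad type
]

report = quality_report(
    rows,
    key_fields=["amount", "stage"],
    expected_types={"close_date": date, "amount": float},
)
```

Note that nothing here tests for normality, which matches the exam trap: a normal distribution is not one of the critical checks.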

65
MCQeasy

For a real-time AI application that requires low-latency access to customer interaction data, which storage solution is most appropriate?

A.Flat files on a network drive.
B.In-memory data store.
C.Relational database with complex joins.
D.Data lake with batch processing.
AnswerB

In-memory storage offers microsecond latency, ideal for real-time AI.

Why this answer

In-memory data stores (e.g., Redis, Memcached) store data in RAM rather than on disk, providing sub-millisecond read/write latencies essential for real-time AI applications that need immediate access to customer interaction data. This eliminates disk I/O bottlenecks and enables high-throughput, low-latency data retrieval for time-sensitive inference or decision-making.

Exam trap

Salesforce often tests the misconception that relational databases are always the best for structured data, but the trap here is that candidates overlook the strict latency requirement and choose a relational database (Option C) without considering that complex joins and disk-based storage make it too slow for real-time AI workloads.

How to eliminate wrong answers

Option A is wrong because flat files on a network drive introduce high latency due to network overhead and disk I/O, and they lack the indexing and concurrency control needed for real-time access. Option C is wrong because relational databases with complex joins incur significant query processing overhead and disk-based storage, making them unsuitable for low-latency requirements despite ACID compliance. Option D is wrong because data lakes with batch processing are designed for high-throughput, periodic analytics (e.g., hourly/daily) and cannot provide the sub-second response times required for real-time AI interactions.

66
Multi-Selectmedium

Which TWO data preparation steps are required before using Einstein Discovery for sales forecasting? (Choose 2)

Select 2 answers
A.Convert all text fields to numeric using one-hot encoding
B.Remove duplicate records
C.Include a date or timestamp field for time series analysis
D.Ensure all predictor fields have no missing values
E.Normalize numeric fields to a 0-1 scale
AnswersC, D

For forecasting, a date field is needed to order records.

Why this answer

Einstein Discovery requires a date or timestamp field to perform time series analysis, which is essential for identifying trends, seasonality, and patterns in historical sales data. Without this field, the model cannot properly order observations or forecast future values based on temporal dependencies.

Exam trap

Salesforce often tests the misconception that manual data preprocessing steps like normalization or one-hot encoding are required, when in fact Einstein Discovery automates these steps, and the key prerequisite is ensuring a proper date/timestamp field exists for time-based analysis.

67
MCQmedium

Refer to the exhibit. In the JSON configuration above, which data preparation step could introduce bias?

A.Excluding rows with missing Stage
B.Ignoring missing Description
C.Filling missing Amount with median
D.Using default for missing CreatedDate
AnswerA

Excluding rows can systematically remove cases if missing is not random, especially if Stage is related to the target.

Why this answer

Option A is correct because excluding rows with missing Stage (a picklist that may correlate with outcome) can introduce selection bias. Filling with median (C) or default (D) are common imputation methods; ignoring Description (B) is generally safe as it treats missing as information.

68
Multi-Selectmedium

A company is implementing Einstein Prediction Builder to predict whether a support case will escalate. Which TWO data preparation steps should the admin take to improve model accuracy?

Select 2 answers
A.Include as many fields as possible to provide more context
B.Ensure missing values are handled appropriately (e.g., imputed or excluded)
C.Encrypt all fields containing personally identifiable information
D.Exclude cases that were closed without escalation
E.Remove fields that have a one-to-one relationship with the outcome
AnswersB, E

Missing values can bias the model; proper handling improves accuracy.

Why this answer

Options B and E are correct: handling missing values and removing fields that have a one-to-one relationship with the outcome (they leak the answer rather than predict it) are crucial for model accuracy. Option A is wrong because more fields can introduce noise. Option C is wrong because data encryption is about security, not accuracy.

Option D is wrong because all cases should be included to represent the full pattern.

69
Multi-Selectmedium

Which TWO of the following are common dimensions of data quality that must be addressed for AI training?

Select 2 answers
A.Storage efficiency
B.Accuracy of values
C.Encryption strength
D.Consistency with external benchmarks
E.Completeness of records
AnswersB, E

Accuracy ensures data correctly represents real-world entities.

Why this answer

Accuracy of values (Option B) is a fundamental dimension of data quality because AI models learn patterns from training data; if the data contains incorrect values, the model will learn and propagate those errors, leading to unreliable predictions. For example, in a dataset of customer ages, a single erroneous entry of '200' can skew the model's understanding of age distributions, directly impacting model performance.

Exam trap

Salesforce often tests the distinction between data quality dimensions (accuracy, completeness, consistency) and operational or security attributes (storage efficiency, encryption strength), tricking candidates into selecting options that sound technical but are irrelevant to data quality for AI training.

70
MCQhard

Refer to the exhibit. A developer runs this SOQL query to prepare data for Einstein Lead Scoring. The query returns an error. What is the most likely issue?

A.The alias 'TotalAmount' is not allowed in the HAVING clause.
B.The query misses a GROUP BY clause.
C.The SUM(Amount) cannot be used in the HAVING clause.
D.The WHERE clause condition is invalid.
AnswerA

In SOQL, HAVING must use the full aggregate expression, not an alias.

Why this answer

The HAVING clause references alias TotalAmount, but SOQL does not allow aliases in HAVING; the aggregated expression must be repeated.

71
MCQhard

A company has set up Einstein Next Best Action with a recommendation strategy. They want to ensure that recommendations are personalized based on the customer's recent behavior. What data should be used?

A.Event data from the website tracked via Google Analytics.
B.Streaming data from Data Cloud that includes recent website interactions.
C.Static profile fields like customer age and location.
D.Historical data from a data warehouse updated daily.
AnswerB

Data Cloud can ingest streaming events and make them available for real-time decisions.

Why this answer

Option B is correct because Einstein Next Best Action requires real-time or near-real-time data to personalize recommendations based on recent customer behavior. Streaming data from Data Cloud captures website interactions as they happen, enabling the recommendation engine to use the most current signals (e.g., page views, clicks) to adjust offers dynamically.

Exam trap

The trap here is that candidates often treat 'historical data' or 'static profile data' as sufficient for personalization, but Salesforce tests the understanding that real-time behavior requires streaming data, not batch or static sources.

How to eliminate wrong answers

Option A is wrong because Google Analytics event data is not natively integrated with Einstein Next Best Action; the platform requires data through Salesforce Data Cloud or connected Salesforce sources, not third-party analytics tools. Option C is wrong because static profile fields like age and location do not reflect recent behavior, which is essential for real-time personalization; they are useful for segmentation but not for dynamic, behavior-driven recommendations. Option D is wrong because historical data updated daily is too stale for real-time personalization; Einstein Next Best Action needs streaming or near-real-time data to respond to a customer's latest actions, not batch-loaded historical records.

72
MCQhard

A large enterprise uses Data Cloud to power an Einstein model for lead scoring. The model's feature pipeline includes dozens of fields from multiple data streams. Performance has degraded, and the team suspects slow feature retrieval. What is the most efficient way to speed up feature computation in Data Cloud?

A.Increase the parallelism of the data streams
B.Implement external caching in the application layer
C.Use Calculated Insights to pre-compute and cache common features
D.Move all data to a single data lake object
AnswerC

Reduces on-the-fly computation by storing results.

Why this answer

Option C is correct because Calculated Insights can pre-aggregate and store frequently used features, reducing computation. Option A is wrong because parallelism isn't always the bottleneck. Option D is wrong because storage location affects latency but not computation.

Option B is wrong because caching at the application layer bypasses Data Cloud optimizations.

73
MCQmedium

A multinational corporation uses Salesforce AI to analyze customer feedback across multiple languages. They have 10,000 English reviews, 2,000 Spanish reviews, and 500 French reviews. The sentiment model performs well on English (F1=0.85) but poorly on French (F1=0.40). The data scientist wants to improve French sentiment performance without collecting new data. What should they do?

A.Translate all French reviews to English and train only on English data.
B.Use a multilingual pre-trained model without any additional French data.
C.Remove French data and use only English and Spanish to avoid imbalance.
D.Apply data augmentation to the French reviews using back-translation (translate to another language and back) to create more training examples.
AnswerD

Back-translation generates realistic paraphrases, augmenting the French dataset and improving model performance.

Why this answer

Data augmentation techniques like back-translation generate synthetic French samples, effectively increasing the minority language's representation and helping the model learn better.

74
Multi-Selectmedium

Which THREE factors should be considered when selecting features for a predictive model in Salesforce?

Select 3 answers
A.Volume of data available for each feature
B.Correlation between features to avoid multicollinearity
C.Relevance of the feature to the target variable
D.Compliance with data privacy regulations
E.Business interpretability of the feature
AnswersB, C, E

Multicollinearity can harm model stability.

Why this answer

Option B is correct because multicollinearity occurs when two or more features are highly correlated, which can destabilize model coefficients and reduce interpretability. In Salesforce's predictive models, such as those built with Einstein Discovery, correlated features can inflate variance and lead to unreliable predictions. Avoiding multicollinearity ensures that the model's feature importance estimates are trustworthy and that the model generalizes well to new data.

Exam trap

Salesforce often tests the distinction between feature selection criteria (predictive power, correlation, interpretability) and broader data management concerns (privacy, volume), leading candidates to mistakenly include compliance or data volume as direct feature selection factors.
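One simple way to spot candidate multicollinearity is to compute pairwise Pearson correlations and flag pairs above a cutoff. The sketch below does this in plain Python; the 0.9 threshold and the feature names are assumptions, and it is not a description of what Einstein Discovery does internally.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlated_pairs(features, threshold=0.9):
    """Return feature-name pairs whose absolute correlation meets the threshold."""
    names = list(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(features[a], features[b])) >= threshold:
                pairs.append((a, b))
    return pairs


# Hypothetical features: the second is just the first in another currency.
features = {
    "deal_amount": [10, 20, 30, 40, 50],
    "deal_amount_eur": [9, 18, 27, 36, 45],
    "support_cases": [3, 1, 4, 1, 5],
}

flagged = correlated_pairs(features)
```

Here the two deal-amount columns carry the same information, so keeping only one of them is the kind of decision option B points to.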

75
MCQeasy

A company is building a chatbot using Einstein Bot's AI capabilities. They want to train intent recognition using historical chat transcripts. The transcripts contain many typos (e.g., 'hellp' instead of 'help') and slang (e.g., 'gonna' instead of 'going to'). The initial model performs poorly, misclassifying many intents. What data cleaning step is most important?

A.Use a spell-checker only for words that appear infrequently.
B.Keep the raw text as is because it reflects real user behavior.
C.Normalize text by applying spell-correction and replacing slang with standard terms.
D.Remove all messages that contain typos or slang to clean the dataset.
AnswerC

Normalization reduces noise and variability, enabling the model to focus on meaningful patterns.

Why this answer

Normalizing text by correcting common typos and expanding slang reduces vocabulary sparsity and helps the model learn consistent word associations, improving intent recognition.
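A minimal sketch of that normalization step, assuming small hand-built lookup tables for typos and slang. The dictionaries and the tokenization rule are hypothetical; a production pipeline would more likely use a proper spell-checker, and this is not part of Einstein Bots itself.

```python
import re

# Hypothetical lookup tables; real ones would be built from the transcripts.
TYPO_FIXES = {"hellp": "help", "pasword": "password"}
SLANG = {"gonna": "going to", "wanna": "want to"}


def normalize(message: str) -> str:
    """Lowercase, tokenize, fix known typos, then expand known slang."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    cleaned = []
    for token in tokens:
        token = TYPO_FIXES.get(token, token)
        cleaned.append(SLANG.get(token, token))
    return " ".join(cleaned)


example = normalize("I'm gonna need hellp with my pasword!")
```

With these tables, `example` becomes `i'm going to need help with my password`, so 'hellp' and 'help' map to the same token and the model sees one vocabulary entry instead of two.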
