Knowledge + Practice

CCNA Data for AI Questions

75 of 163 questions · Page 2/3 · Data for AI · Answers revealed

76

MCQ · easy

A machine learning team is preparing a dataset for a supervised learning task. They have 100,000 labeled samples. Which data preparation step is essential before splitting into train/test sets?

A. Normalize all features to the same scale.

B. Remove all outliers from the dataset.

C. Shuffle the dataset randomly.

D. Visualize the data distribution for each feature.

Answer: C

Shuffling prevents biased splits.

Why this answer

Option C is correct because shuffling the dataset randomly before splitting helps ensure that the train and test sets follow a similar distribution. Without shuffling, a dataset stored in sorted or grouped order (for example, by class label or collection batch) can be split so that one subset contains patterns the other never sees, which biases model evaluation. Shuffling therefore makes the test set representative of the overall population. Time-series data is the usual exception: there the split should be chronological rather than random.

Exam trap

Salesforce often tests the misconception that normalization or outlier removal must be done before splitting, but the trap here is that candidates overlook the fundamental need to randomize the data order to avoid temporal or structural bias in the train/test split.

How to eliminate wrong answers

Option A is wrong because normalizing features to the same scale is a preprocessing step typically applied after splitting the data, using statistics (e.g., mean and standard deviation) computed only from the training set to avoid data leakage into the test set. Option B is wrong because removing all outliers before splitting can introduce bias and reduce the dataset's representativeness; outlier handling should be done with care, often after splitting, and may be domain-specific. Option D is wrong because visualizing data distributions is an exploratory step that helps understand the data but is not essential before splitting; it can be performed after splitting to avoid influencing the split decisions.
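
The effect of skipping the shuffle can be seen in a minimal sketch. The dataset, the `split` helper, and the 80/20 ratio below are illustrative assumptions, not part of the question; the data is deliberately stored in label order.

```python
import random

def split(rows, test_fraction=0.2, shuffle=True, seed=42):
    """Split rows into (train, test); optionally shuffle a copy first."""
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[:len(rows) - n_test], rows[len(rows) - n_test:]

def positives(rows):
    """Count rows whose label is 1."""
    return sum(label for _, label in rows)

# Hypothetical dataset stored in label order: 90 negatives, then 10 positives.
data = [(float(i), 0) for i in range(90)] + [(float(i), 1) for i in range(90, 100)]

ordered_train, ordered_test = split(data, shuffle=False)
shuffled_train, shuffled_test = split(data, shuffle=True)

# Without shuffling, every positive lands in the test set.
print(positives(ordered_train), positives(ordered_test))
# With shuffling, positives are spread across both subsets.
print(positives(shuffled_train), positives(shuffled_test))
```

In the unshuffled split the model would never see a positive example during training, which is the biased split the explanation warns about.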

77

MCQ · easy

Which method is most suitable for ingesting streaming data from IoT sensors into a data lake?

A. Copying data via FTP.

B. Batch ingestion every 24 hours.

C. Manual upload via web interface.

D. Real-time streaming with Apache Kafka.

Answer: D

Kafka provides high-throughput, fault-tolerant streaming for IoT data.

Why this answer

Apache Kafka is the most suitable option because it is a distributed streaming platform designed for high-throughput, fault-tolerant, real-time data ingestion. IoT sensors generate continuous, high-velocity data streams, and Kafka's publish-subscribe model allows data to be ingested into a data lake with low latency, ensuring near-real-time availability for analytics.

Exam trap

Salesforce often tests the distinction between batch and real-time processing, and the trap here is that candidates may choose batch ingestion (Option B) thinking it is simpler or sufficient, overlooking the fundamental requirement for low-latency streaming in IoT sensor data ingestion.

How to eliminate wrong answers

Option A is wrong because FTP (File Transfer Protocol) is a batch-oriented file transfer protocol that lacks real-time streaming capabilities, introduces latency, and does not handle continuous data streams from IoT sensors efficiently. Option B is wrong because batch ingestion every 24 hours introduces unacceptable latency for streaming IoT data, which often requires immediate processing for time-sensitive applications like anomaly detection or predictive maintenance. Option C is wrong because manual upload via a web interface is impractical for high-frequency sensor data, as it requires human intervention, cannot scale, and introduces significant delays and errors.

78

MCQ · medium

A data scientist notices that the model accuracy drops significantly after retraining with new data. Upon inspection, they find that many records have missing values for a key feature. Which data quality improvement should be prioritized first?

A. Implement imputation for missing feature values.

B. Normalize the feature range.

C. Reduce the number of features.

D. Remove duplicate records.

Answer: A

Imputation addresses missing data, a common cause of accuracy drop.

Why this answer

The core issue is that missing values in a key feature introduce noise and bias, directly degrading model performance. Imputation (option A) is the most direct and impactful first step because it preserves the dataset size and feature set, allowing the model to learn from complete patterns. Without addressing missing data first, other quality improvements like normalization or feature reduction would be applied to corrupted data, failing to resolve the root cause.

Exam trap

Salesforce often tests the misconception that data quality improvements like normalization or feature reduction are universal fixes. When missing values in a key feature are the identified problem, handling them is the most urgent step, because they directly undermine model training and inference.

How to eliminate wrong answers

Option B is wrong because normalizing the feature range (e.g., scaling to 0-1) does not address missing values; it only adjusts the distribution of existing values, leaving the model to train on incomplete records. Option C is wrong because reducing the number of features may discard the key feature entirely, which could be critical for prediction, and does not fix the missing data problem in the remaining features. Option D is wrong because removing duplicate records addresses redundancy, not missing values; duplicates are not the cause of the accuracy drop, and removing them could even reduce valuable training data.
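
As a rough illustration of imputation, the sketch below fills missing values with the median of the observed training values. The feature values and helper names are hypothetical, and median imputation is only one of several possible strategies.

```python
import statistics

def fit_median(values):
    """Compute the median of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    return statistics.median(observed)

def impute(values, fill):
    """Replace missing entries (None) with the given fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical key feature with gaps, in training data and in newly arrived data.
train_feature = [42.0, None, 58.0, 61.0, None, 39.0]
new_feature = [None, 70.0]

fill_value = fit_median(train_feature)        # learned from training data only
train_filled = impute(train_feature, fill_value)
new_filled = impute(new_feature, fill_value)  # same fill value reused on new data

print(fill_value)
print(new_filled)
```

Reusing the training-derived fill value on new data keeps the transformation consistent between training and inference.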

79

Multi-Select · hard

Which THREE of the following are best practices for feature engineering in Einstein Studio?

Select 3 answers

A. Remove all records with missing values

B. Apply normalization to numerical features

C. Use raw data directly without any transformation

D. Use domain knowledge to create derived features

E. Use one-hot encoding for categorical variables

Answers: B, D, E

Normalization ensures features are on a similar scale.

Why this answer

Options B, D, and E are correct. Normalization scales features, domain knowledge creates meaningful derived features, and one-hot encoding handles categorical variables. Option A is wrong because removing all records with missing values can lose information; imputation is often better.

Option C is wrong because raw data often needs processing before a model can use it.

80

Multi-Select · easy

A company is ingesting data from multiple sources into Data Cloud for Einstein. Which THREE data preparation steps should be performed?

Select 3 answers

A. Normalization

B. Field mapping

C. Encryption

D. Data labeling

E. Deduplication

Answers: A, B, E

Ensures consistent data formats across sources.

Why this answer

Normalization is correct because Data Cloud requires data from multiple sources to be transformed into a consistent format, such as standardizing date formats, units, or naming conventions, to ensure the data can be unified and analyzed effectively. This step is critical for Einstein AI models to process data without inconsistencies that could skew predictions or insights.

Exam trap

Salesforce often tests the distinction between data preparation steps (normalization, field mapping, deduplication) and data security or ML-specific tasks (encryption, data labeling) to see if candidates confuse operational data engineering with security or model training processes.

81

Multi-Select · easy

A company is preparing customer data to train a custom AI model for sentiment analysis. Which two data preparation best practices should they follow? (Choose two.)

Select 2 answers

A. Use only data from the last month.

B. Ensure data is representative of all customer demographics.

C. Remove all records with missing values.

D. Label data manually by a single annotator.

E. Anonymize personally identifiable information (PII) before training.

Answers: B, E

Representative data prevents model bias and improves generalization across customer segments.

Why this answer

Ensuring representative data and anonymizing PII are critical for model fairness and privacy. Removing all records with missing values can discard useful information; using only recent data may introduce bias; single-annotator labeling can cause subjective bias.

82

MCQ · easy

A marketing agency needs to ingest real-time social media mentions for a sentiment analysis AI model. Which Data Cloud object type should they use to set up the ingestion?

A. Data Lake Object

B. Calculated Insight

C. Data Stream Object with Event type

D. Data Transform

Answer: C

Specifically designed for real-time streaming ingest.

Why this answer

Option C is correct because Data Stream objects with type 'Event' are designed for real-time data ingestion. Option A is wrong because Data Lake Objects are for batch. Option B is wrong because Calculated Insights aggregate data.

Option D is wrong because Data Transforms process existing data.

83

MCQ · hard

A data scientist notices that an Einstein model for predicting customer churn has unusually high accuracy on training data but performs poorly on validation data. Which data issue is the most likely cause?

A. The dataset has an imbalanced class distribution

B. The dataset contains many missing values

C. The model was trained on stale data from a different season

D. A field containing future information (e.g., 'churn_date') was included in features

Answer: D

Data leakage from a field that reveals the outcome causes overfitting and high train accuracy.

Why this answer

Option D is correct because including a field like 'churn_date' in the feature set introduces target leakage, where the model has access to information that would not be available at prediction time. This causes the model to appear highly accurate on training data (since it can directly 'see' the outcome) but fails to generalize to validation data where such future information is absent. In Salesforce Einstein, features must be strictly historical or static to avoid this data leakage issue.

Exam trap

Salesforce often tests the concept of data leakage by presenting it as a scenario where the model performs well on training data but poorly on validation data, and the trap is that candidates may confuse this with overfitting or class imbalance, rather than recognizing the inclusion of a future or target-related field as the root cause.

How to eliminate wrong answers

Option A is wrong because class imbalance typically pushes the model toward predicting the majority class, which inflates accuracy on both training and validation data rather than producing a large gap between them; the 'unusually high accuracy' on training data alone is more characteristic of leakage. Option B is wrong because missing values generally degrade model performance across both training and validation sets, not causing a stark contrast between high training accuracy and low validation accuracy. Option C is wrong because stale data from a different season would hurt performance on any data drawn from a different period, but it would not explain unusually high training accuracy.

84

MCQ · medium

A sales operations team is training an AI model to forecast quarterly revenue. They have five years of historical data, which includes a strong seasonal pattern but also a significant outlier: during the pandemic year, revenue dropped by 70% from typical values. The model trains with high accuracy on historical data but fails to predict future quarters accurately, consistently overestimating revenue. What should the data scientist do to improve forecast accuracy?

A. Add a binary feature indicating whether each quarter was during the pandemic.

B. Remove the data points corresponding to the pandemic year from the training set.

C. Normalize the entire dataset using Z-scores to reduce the impact of the outlier.

D. Include the outlier data and increase the model capacity to capture the anomaly.

Answer: B

Removing the outlier helps the model focus on typical patterns, improving generalization to future non-pandemic quarters.

Why this answer

Option B is correct because removing the pandemic year data eliminates an extreme outlier that distorts the seasonal pattern the model learns. The 70% revenue drop is not representative of future quarters, so a model fitted to it learns a trend and seasonality that do not carry over to normal conditions. By training only on typical data, the model can learn the true seasonal pattern and generalize better to future quarters.

Exam trap

Salesforce often tests the misconception that you should keep all data and adjust the model (e.g., via normalization or capacity increase) rather than removing non-representative outliers, leading candidates to pick options like C or D.

How to eliminate wrong answers

Option A is wrong because adding a binary pandemic feature does not remove the outlier's influence; the model may still overfit to the anomalous drop and fail to generalize, as the feature only labels the outlier without correcting the skewed distribution. Option C is wrong because Z-score normalization scales the data but does not eliminate the outlier's impact on the model's learned weights; the extreme value still distorts the mean and variance, leading to biased forecasts. Option D is wrong because increasing model capacity to capture the anomaly encourages overfitting to the pandemic year's unique pattern, which will not recur, thus worsening generalization and maintaining the overestimation error.

85

MCQ · easy

Which data transformation is most appropriate for converting categorical variables into numerical format for a machine learning model?

A. Normalization.

B. One-hot encoding.

C. Principal component analysis.

D. Standardization.

Answer: B

One-hot encoding creates binary columns for each category, making them usable in models.

Why this answer

One-hot encoding is the correct transformation because it converts categorical variables into a binary vector representation, where each category becomes a separate column with a 1 or 0. This allows machine learning models to interpret categorical data without implying any ordinal relationship, which is essential for algorithms that rely on numerical distances or linear algebra.

Exam trap

Salesforce often tests the distinction between data preprocessing techniques (normalization, standardization) and encoding methods, trapping candidates who confuse scaling with categorical conversion.

How to eliminate wrong answers

Option A is wrong because normalization scales numerical features to a range (e.g., 0 to 1) and is used for continuous data, not for converting categorical variables into numbers. Option C is wrong because principal component analysis (PCA) is a dimensionality reduction technique that transforms existing numerical features into uncorrelated components, not a method for encoding categorical data. Option D is wrong because standardization centers data around a mean of 0 and standard deviation of 1, which is applied to numerical features and would not create meaningful representations for categorical variables.
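
One-hot encoding as described above can be sketched in a few lines. The `one_hot` helper and the color values are illustrative assumptions; columns are ordered alphabetically by category here, which is an arbitrary choice.

```python
def one_hot(values):
    """Return (categories, rows) where each row has a single 1 in its category's column."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical categorical feature.
categories, encoded = one_hot(["red", "green", "blue", "green"])

print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

No column is numerically "larger" than another, so the encoding does not imply an order among the categories.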

86

MCQ · hard

Refer to the exhibit. A data pipeline fails during the DataTransformation stage. What is the most likely root cause?

A. The pipeline has a network connectivity issue.

B. The data type for 'income' is incorrect.

C. A transformation step references the 'age' column, but it is not present in the input data.

D. The 'age' column contains null values.

Answer: C

The error clearly states 'age' column not found.

Why this answer

Option C is correct because the error occurs during the DataTransformation stage, which processes data after it has been successfully ingested. If a transformation step references the 'age' column but that column is missing from the input data, the pipeline will fail with a column-not-found error. This is a common schema mismatch issue in data pipelines, distinct from connectivity or data quality problems.

Exam trap

Salesforce often tests the distinction between pipeline stages (ingestion vs. transformation) and the specific type of error (missing column vs. data quality issue) to see if candidates understand that a missing column causes an immediate failure, while nulls or type mismatches may be handled differently depending on the pipeline configuration.

How to eliminate wrong answers

Option A is wrong because a network connectivity issue would typically cause the pipeline to fail during the data ingestion or extraction stage, not during the DataTransformation stage. Option B is wrong because an incorrect data type for 'income' would cause a type conversion error, but the question specifically states the failure is during transformation, and the error would be related to type mismatch, not a missing column. Option D is wrong because null values in the 'age' column would not cause a pipeline failure during transformation unless the transformation logic explicitly fails on nulls; most pipelines handle nulls gracefully or can be configured to skip or impute them.

87

MCQ · easy

A fraud detection model is being trained on transaction data where only 1% of transactions are fraudulent. The current model predicts 'non-fraud' for all transactions, achieving 99% accuracy. Which technique should be applied to improve model performance?

A. Remove the minority class to have balanced data

B. Set a lower classification threshold for fraud

C. Add more features like transaction location

D. Oversample the minority class or undersample the majority class

Answer: D

Resampling techniques create a more balanced training set, improving recall for fraud.

Why this answer

Oversampling or undersampling addresses class imbalance, allowing the model to learn minority patterns. Adding more features alone doesn't fix the imbalance; lowering the classification threshold may raise recall but does not change what the model learns from the skewed training data; and removing the minority class is counterproductive.
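
The scenario can be reproduced with a small sketch: an all-'non-fraud' predictor scores 99% accuracy with zero recall, and random oversampling rebalances the training rows. The helper names and the synthetic 1,000-row dataset are assumptions for illustration, and random duplication is only the simplest form of oversampling.

```python
import random

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

def random_oversample(rows, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[-1], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical labels: 1% fraud (1), 99% non-fraud (0).
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000                      # model that always predicts non-fraud

print(accuracy(y_true, y_pred))          # 0.99
print(recall(y_true, y_pred))            # 0.0

rows = [(i, label) for i, label in enumerate(y_true)]
balanced = random_oversample(rows)
print(sum(1 for _, label in balanced if label == 1))  # 990
```

Resampling is normally applied to the training split only, so that evaluation still reflects the real class distribution.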

88

MCQ · medium

A healthcare provider implements Data Cloud to predict patient readmission rates. They have HIPAA compliance requirements. The data includes sensitive patient health information (PHI). The AI model must be trained without exposing PHI to unauthorized users. The data architect uses Data Cloud's data masking on PHI fields. However, model performance drops significantly after masking because the masked values lose predictive value. What additional step should the architect consider to maintain model performance while protecting PHI?

A. Use tokenization for highly predictive fields like diagnosis codes instead of masking

B. Implement differential privacy within Einstein Studio

C. Remove masking and rely on user permissions to restrict access

D. Increase the volume of training data to compensate for masking

Answer: A

Tokenization retains referential integrity while hiding actual values.

Why this answer

Option A is the best approach because tokenization preserves the relationship between values (e.g., diagnosis codes) while obscuring the actual PHI. This allows the model to learn patterns without exposing sensitive data. Option C removes the protection on PHI and risks violating HIPAA.

Option B is not directly available in Einstein Studio as a built-in feature; differential privacy might be complex to implement. Option D does not address the masking issue.
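
The difference between masking and tokenization can be illustrated with a toy sketch. The `TokenVault` class and `mask` function are hypothetical and are not Data Cloud features; a production system would keep the mapping in a secured store.

```python
import secrets

class TokenVault:
    """Toy token vault: each distinct value gets one stable, random token."""

    def __init__(self):
        self._tokens = {}

    def tokenize(self, value):
        if value not in self._tokens:
            self._tokens[value] = "tok_" + secrets.token_hex(8)
        return self._tokens[value]

def mask(value):
    """Masking replaces every value with the same placeholder."""
    return "****"

# Hypothetical diagnosis codes for three patient records.
codes = ["E11.9", "I10", "E11.9"]

vault = TokenVault()
tokenized = [vault.tokenize(code) for code in codes]
masked = [mask(code) for code in codes]

# Tokenization: equal codes still match, different codes stay distinct.
print(tokenized[0] == tokenized[2], tokenized[0] == tokenized[1])
# Masking: all values collapse to one placeholder, so the signal is gone.
print(len(set(masked)))
```

Because equal inputs map to equal tokens, a model can still treat the field as a categorical feature without seeing the underlying codes.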

89

MCQ · easy

For an AI project, data must be stored in a way that supports both training and real-time inference. Which storage solution meets this requirement?

A. Data warehouse (e.g., Snowflake)

B. Relational database (e.g., PostgreSQL)

C. Data lake (e.g., Amazon S3 or Azure Data Lake)

D. In-memory cache (e.g., Redis)

Answer: C

Data lakes store raw and processed data for various purposes.

Why this answer

A data lake (e.g., Amazon S3 or Azure Data Lake) is the correct choice because it can store vast amounts of raw, unstructured, and structured data in its native format, making it ideal for training AI models on diverse datasets. At the same time, a data lake can feed real-time inference: serving code (for example, serverless functions such as AWS Lambda or Azure Functions) can read data directly through object-storage APIs, without first transforming it into a schema-on-write structure. This dual capability, handling both batch processing for training and reads for inference, is a key requirement that the other storage solutions listed do not fulfill as effectively.

Exam trap

Salesforce often tests the misconception that a data warehouse or relational database is sufficient for AI workloads because candidates overlook the need for raw, unstructured data storage and the flexibility of schema-on-read, instead focusing only on structured query performance.

How to eliminate wrong answers

Option A is wrong because a data warehouse (e.g., Snowflake) is optimized for structured, aggregated data and analytical queries, not for storing raw, unstructured data needed for AI training, and its schema-on-write approach introduces latency unsuitable for real-time inference. Option B is wrong because a relational database (e.g., PostgreSQL) enforces strict schemas and ACID transactions, which limit the flexibility to store diverse data types (e.g., images, text) required for AI training, and its row-based storage is inefficient for high-throughput, low-latency inference workloads. Option D is wrong because an in-memory cache (e.g., Redis) is designed for ephemeral, high-speed data access but lacks persistent storage and the capacity to hold large-scale training datasets, making it unsuitable for long-term data storage required for AI model training.

90

MCQ · hard

A data scientist is building a predictive model for customer churn using Salesforce data. The dataset has 20 features, and the target variable is highly imbalanced (5% churn, 95% non-churn). Which technique should be applied to handle the class imbalance before training?

A. Apply Principal Component Analysis (PCA) for dimensionality reduction

B. Create interaction features between existing variables

C. Use accuracy as the evaluation metric

D. Use Synthetic Minority Over-sampling Technique (SMOTE)

Answer: D

SMOTE creates synthetic examples of the minority class.

Why this answer

SMOTE (Synthetic Minority Over-sampling Technique) is the correct choice because it generates synthetic samples for the minority class (churn) by interpolating between existing minority instances, effectively balancing the dataset without simply duplicating data. This prevents the model from being biased toward the majority class (non-churn) and improves recall for the churn class, which is critical in imbalanced classification problems.

Exam trap

Salesforce often tests the misconception that any data preprocessing technique (like PCA or feature engineering) can fix class imbalance, when in fact only resampling methods (SMOTE, ADASYN) or cost-sensitive learning directly address the skewed target distribution.

How to eliminate wrong answers

Option A is wrong because PCA is a dimensionality reduction technique that does not address class imbalance; it reduces feature space but does not alter the distribution of the target variable. Option B is wrong because creating interaction features may capture non-linear relationships but does not solve the imbalance problem; it can even exacerbate overfitting if the minority class remains underrepresented. Option C is wrong because accuracy is a misleading metric for imbalanced datasets—a model predicting all non-churn would achieve 95% accuracy but fail to identify any churn cases; metrics like precision, recall, F1-score, or AUC-ROC are appropriate instead.
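
The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration, not the reference algorithm or any library's implementation; the `smote_like` helper, its parameters, and the sample points are all assumptions.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class (churn) samples with two numeric features.
churn_samples = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 3.0)]
new_samples = smote_like(churn_samples, n_new=6)

print(len(new_samples))
```

Each synthetic point lies between two real minority points, which is why the result is more varied than plain duplication.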

91

MCQ · medium

After applying a log transformation to a numeric feature, an Einstein model’s performance dropped significantly. What is the most likely cause?

A. The data volume was reduced by the transformation

B. The feature was normally distributed after transformation

C. The feature contained zero or negative values

D. The transformation introduced multicollinearity with other features

Answer: C

Log of non-positive values is undefined, causing missing or infinity values.

Why this answer

Log transformation is undefined for zero or negative values: log(x) tends to negative infinity as x approaches 0, and the log of a negative number is not a real number. In Salesforce Einstein, numeric features with such invalid transformed values can cause the model to fail or produce erratic results, leading to a significant drop in performance. This is the most likely cause given the symptom described.

Exam trap

Salesforce often tests the misconception that log transformation always improves model performance, but the trap here is that candidates overlook the mathematical constraint that log is undefined for non-positive values, causing them to choose a less relevant option like data volume reduction or multicollinearity.

How to eliminate wrong answers

Option A is wrong because log transformation does not reduce data volume; it merely applies a mathematical function to each value, preserving the number of records. Option B is wrong because making a feature normally distributed is typically beneficial for many models, not detrimental; a normal distribution after transformation would likely improve, not degrade, performance. Option D is wrong because log transformation is applied to a single feature and does not introduce multicollinearity, which is a relationship between two or more independent variables; it cannot create collinearity with other features on its own.
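
The constraint is easy to check with Python's standard `math` module. The helper names below are hypothetical, and using `log1p` (which computes log(1 + x)) is one common workaround for zeros, offered here as an assumption rather than something the question prescribes.

```python
import math

def try_log(x):
    """Return log(x), or None when x is zero or negative."""
    try:
        return math.log(x)
    except ValueError:  # math.log rejects non-positive input
        return None

def log1p_feature(values):
    """Apply log(1 + x), which is defined at zero; negatives are rejected explicitly."""
    if any(v < 0 for v in values):
        raise ValueError("log1p_feature expects non-negative values")
    return [math.log1p(v) for v in values]

print(try_log(0))     # None
print(try_log(-5))    # None
print(log1p_feature([0.0, 9.0]))
```

Checking the minimum of a feature before applying a log transform avoids silently producing missing or infinite values.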

92

MCQ · hard

A company is using customer support tickets to train a model for auto-classifying issues. The dataset includes fields like 'Case Title', 'Description', 'Product', and 'Customer Name'. Which privacy concern is most critical to address before training?

A. Anonymize personally identifiable information (PII) in Description and Title

B. Encrypt session tokens used in the support system

C. Remove all case numbers to prevent data leakage

D. Ensure all customers have opted in before using their data

Answer: A

PII in text must be removed to comply with privacy regulations and prevent exposure of customer information.

Why this answer

Anonymizing PII in the text fields is critical to avoid exposing customer information in model artifacts or predictions. Session tokens are irrelevant, and case numbers are not PII. Opt-in is a legal requirement but not directly about data preparation for AI.

93

MCQ · medium

A Salesforce admin is preparing a dataset for Einstein Prediction Builder. The dataset contains a field "Income" with many missing values. The admin wants to minimize bias in the model. What is the best practice?

A. Delete all rows where Income is missing

B. Review the pattern of missingness and document reasons, then decide on imputation

C. Fill missing values with the mean of Income

D. Replace missing values with 0

Answer: B

Understanding why data is missing prevents bias from systematic exclusion or imputation.

Why this answer

Option B is correct because reviewing missingness patterns and documenting reasons helps uncover systemic biases. Option A (delete rows) reduces the sample and may introduce selection bias; Option C (fill with mean) may distort relationships; Option D (replace with 0) is arbitrary and can skew results.

94

MCQ · easy

Refer to the exhibit. A data analyst runs a profile on a dataset and sees these statistics. Based on best practices, which action should be taken first?

A. Impute the 500 missing values with the mean

B. Remove the 200 duplicate records

C. Remove the 50 outliers in the Amount field

D. Skip all preprocessing and train the model directly

Answer: B

Duplicates can artificially inflate certain patterns and cause data leakage.

Why this answer

Option B is correct because duplicate records introduce bias and redundancy, leading to overfitting or skewed model performance. Removing duplicates is a standard first step in data preprocessing to ensure data integrity before handling missing values or outliers. In the context of the AI Associate exam, best practices prioritize deduplication early in the data cleaning pipeline.

Exam trap

Salesforce often tests the order of preprocessing steps, trapping candidates who jump to imputation or outlier removal without first cleaning duplicates, which is the foundational step in data preparation.

How to eliminate wrong answers

Option A is wrong because imputing missing values with the mean should only be considered after duplicates are removed, as duplicates can inflate the mean and distort imputation. Option C is wrong because removing outliers should be done after addressing duplicates and missing values, and only after understanding the domain context; premature outlier removal can discard legitimate data. Option D is wrong because skipping all preprocessing ignores fundamental data quality issues (missing values, duplicates, outliers) that degrade model accuracy and reliability, violating best practices for AI workflows.
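
The ordering argument (deduplicate before imputing) can be shown with a small sketch. The records and helper names are hypothetical; the exhibit's actual figures are not reproduced here.

```python
import statistics

def deduplicate(rows):
    """Drop exact duplicate rows while preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def mean_amount(rows):
    """Mean of the Amount field over rows where it is present."""
    return statistics.fmean(amount for _, amount in rows if amount is not None)

# Hypothetical (customer_id, amount) records; one record appears three times.
rows = [("c1", 100.0), ("c2", 200.0), ("c3", 900.0),
        ("c3", 900.0), ("c3", 900.0), ("c4", None)]

print(mean_amount(rows))               # 600.0, inflated by the repeated record
print(mean_amount(deduplicate(rows)))  # 400.0
```

Imputing the missing amount before deduplication would fill it with 600.0 rather than 400.0, carrying the duplicates' distortion into the imputed value.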

95

Multi-Select · easy

A data scientist is preparing numeric features for a regression model. Which TWO transformations are commonly applied to improve model performance?

Select 2 answers

A. Normalize to a 0-1 range

B. Remove outliers beyond 3 standard deviations

C. Convert numbers to string labels

D. Apply one-hot encoding

E. Standardize to mean 0 and variance 1

Answers: A, E

Scales features to a common range, helpful for distance-based models.

Why this answer

Normalizing features to a 0-1 range (min-max scaling) puts all numeric features on a comparable scale, preventing features with larger magnitudes from dominating. This matters for distance-based algorithms like k-nearest neighbors, where scale directly affects the distance calculation, and for models trained with gradient descent, such as neural networks, where it affects convergence speed. Standardizing to mean 0 and variance 1 serves the same purpose by centering each feature and expressing it in units of its standard deviation.

Exam trap

Salesforce often tests the distinction between data cleaning (e.g., outlier removal) and feature transformation (e.g., scaling), leading candidates to mistakenly select outlier removal as a transformation that improves model performance.
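
The two scaling transformations can be written out directly. The helper names and sample amounts are illustrative assumptions; in practice the statistics would be computed on the training split and reused on other data.

```python
import statistics

def min_max(values):
    """Rescale values linearly to the 0-1 range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against a constant feature
    return [(v - lo) / span for v in values]

def standardize(values):
    """Rescale values to mean 0 and variance 1 (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against a constant feature
    return [(v - mean) / std for v in values]

# Hypothetical numeric feature.
amounts = [10.0, 20.0, 30.0, 40.0, 50.0]

print(min_max(amounts))      # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(amounts))
```

Both transformations are linear, so they change the scale of a feature without changing the ordering of its values.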

96

MCQ · medium

An admin created a data stream to bring external customer data into Data Cloud for Einstein. The data stream fails with error 'Schema mismatch: expected 10 fields, got 8'. What is the likely cause?

A. The data flow has a filter that drops fields.

B. The target object has validation rules.

C. The source file has extra columns.

D. The data stream definition expects more fields than the source provides.

Answer: D

Directly matches the error: expected 10, got 8.

Why this answer

The error 'Schema mismatch: expected 10 fields, got 8' indicates that the data stream definition in Data Cloud is configured to map 10 fields from the source, but the actual source file or API response only provides 8 fields. This mismatch occurs when the schema defined in the data stream does not match the source schema, typically because the source has fewer columns than expected. Option D correctly identifies this as the likely cause.

Exam trap

Salesforce often tests the distinction between schema-level errors (field count mismatch) and data-level errors (validation rules, filters), leading candidates to confuse data flow operations with data stream schema definitions.

How to eliminate wrong answers

Option A is wrong because a filter in a data flow drops rows (records), not fields (columns), and the error explicitly mentions a field count mismatch, not a row count issue. Option B is wrong because validation rules on the target object would cause record-level failures during data insertion, not a schema mismatch error during the data stream definition or ingestion phase. Option C is wrong because extra columns in the source file would cause the error to report more fields than expected (e.g., 'expected 10, got 12'), not fewer.
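
The kind of mismatch described here can be caught before ingestion with a simple header check. The sketch below is a generic pre-flight check in Python, not a Data Cloud API; the field names in `EXPECTED_FIELDS` and the sample CSV are hypothetical.

```python
import csv
import io

# Hypothetical schema: the ten fields the data stream definition expects.
EXPECTED_FIELDS = [
    "id", "first_name", "last_name", "email", "phone",
    "street", "city", "state", "postal_code", "country",
]

def check_header(csv_text, expected_fields):
    """Compare a CSV header against the expected field list.

    Returns (number of fields found, missing fields, unexpected fields).
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    missing = [f for f in expected_fields if f not in header]
    extra = [f for f in header if f not in expected_fields]
    return len(header), missing, extra

# Hypothetical source file that provides only eight of the ten fields.
source_csv = (
    "id,first_name,last_name,email,street,city,state,postal_code\n"
    "1,Ada,Lovelace,ada@example.com,1 Main St,Springfield,IL,62701\n"
)

found, missing, extra = check_header(source_csv, EXPECTED_FIELDS)
message = f"expected {len(EXPECTED_FIELDS)} fields, got {found}"
```

Listing the missing field names, rather than only the counts, makes it clearer whether to fix the source file or relax the stream definition.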

97

MCQmedium

During data transformation, a data scientist applies one-hot encoding to a categorical feature with 50 unique values. The resulting dataset has 50 new columns. What is a potential drawback of this transformation?

A.Reduction in training time

B.Increased interpretability of the model

C.High cardinality leading to sparse data and overfitting

D.Loss of ordinal information in categories

AnswerC

High cardinality creates many sparse columns, risking overfitting.

Why this answer

One-hot encoding a categorical feature with 50 unique values creates 50 binary columns, each representing one category. This high cardinality leads to a very sparse matrix (most entries are 0), which can cause the model to overfit by learning noise from rare categories, especially when the dataset is not large enough to support such dimensionality.

Exam trap

Salesforce often tests the misconception that one-hot encoding always improves model performance by preserving all information, when in fact high cardinality introduces sparsity and overfitting risks that can degrade model accuracy.

How to eliminate wrong answers

Option A is wrong because one-hot encoding increases the number of features, which typically increases training time due to higher dimensionality rather than reducing it. Option B is wrong because adding 50 new binary columns reduces interpretability; the model becomes more complex and harder to explain, especially with many dummy variables. Option D is wrong because one-hot encoding is intended for nominal (unordered) categories, and nothing in the question indicates the feature is ordinal, so there is no ordering to lose.
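
To make the sparsity concrete, the following sketch one-hot encodes a synthetic 50-category feature in plain Python and measures the share of zero cells. The category names and row count are made up for illustration.

```python
def one_hot(values, categories):
    """Encode each value as a row with a single 1 in its category's column."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows

# Synthetic feature: 50 distinct categories spread over 200 records.
categories = [f"cat_{i}" for i in range(50)]
values = [categories[i % 50] for i in range(200)]

encoded = one_hot(values, categories)
n_cols = len(encoded[0])
n_cells = len(encoded) * n_cols
zero_fraction = sum(cell == 0 for row in encoded for cell in row) / n_cells
```

With 50 categories, each row holds one 1 and 49 zeros, so 98% of the encoded matrix is zero regardless of the number of rows.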

98

MCQhard

A global company needs to ensure that customer data used for AI models complies with multiple regional regulations (GDPR, CCPA, LGPD). Which data governance practice is most effective?

A.Apply the strictest regulation globally.

B.Use a unified data catalog with tagging and classification.

C.Store all data in a single data warehouse.

D.Allow each region to manage its own data separately.

AnswerB

A data catalog helps track data lineage, apply policies, and ensure compliance per region.

Why this answer

Option B is correct because a unified data catalog with tagging and classification allows the organization to manage data governance and compliance across regions consistently.

99

MCQeasy

To ensure AI model fairness and avoid biased outcomes, which practice is most critical when preparing training data?

A.Use only recent data

B.Increase model complexity

C.Use balanced training data

D.Use more features

AnswerC

Balanced data reduces bias towards any group.

Why this answer

Option C is correct because using balanced training data across different groups helps prevent bias. Option A is wrong because using only recent data may not represent all demographics. Option B is wrong because increasing model complexity may overfit.

Option D is wrong because adding more features can introduce bias.

100

Multi-Selectmedium

Which TWO data sources can be used with Einstein Prediction Builder?

Select 2 answers

A.Files uploaded to Salesforce Files.

B.Data Cloud objects using the harmonized data model.

C.Standard Salesforce objects like Account and Opportunity.

D.Chatter feed posts.

E.Dashboard and report snapshots.

AnswersB, C

Data Cloud objects are supported.

Why this answer

Einstein Prediction Builder requires structured data that can be mapped to a prediction objective. Data Cloud objects using the harmonized data model provide a unified, standardized schema that Prediction Builder can consume directly, enabling predictions across multiple Salesforce and external data sources. Standard Salesforce objects like Account and Opportunity are also supported because they contain the fields and relationships needed to train predictive models.

Exam trap

Salesforce often tests the misconception that any data in Salesforce (like files or Chatter posts) can be used directly with Einstein Prediction Builder, when in fact only structured, field-level data from objects or harmonized Data Cloud objects is supported.

101

MCQmedium

A company uses Einstein Prediction Builder to predict customer churn. The model's accuracy is low. The admin reviews the training data and notices that only 2% of records are churned. What should the admin do to improve the model?

A.Remove the churned records.

B.Increase the amount of training data.

C.Use oversampling techniques.

D.Change the prediction field.

AnswerC

Oversampling balances the classes and improves model sensitivity.

Why this answer

Option C is correct because when a dataset has severe class imbalance (only 2% churned records), the model becomes biased toward predicting the majority class (non-churned) and rarely identifies actual churners. Oversampling techniques, such as SMOTE or random oversampling, artificially increase the number of churned records in the training set to balance the classes, so the model can learn patterns for the minority class more effectively.

Exam trap

Salesforce often tests the misconception that adding more data always improves model performance, but here the trap is that candidates overlook class imbalance and choose 'Increase the amount of training data' (Option B) without realizing that more data with the same imbalance does not solve the problem.

How to eliminate wrong answers

Option A is wrong because removing the churned records would eliminate the minority class entirely, making it impossible for the model to learn to predict churn, and would result in a model that always predicts non-churn. Option B is wrong because simply increasing the amount of training data without addressing the class imbalance will likely maintain the same 2% churn ratio, providing more majority-class examples but not improving minority-class learning. Option D is wrong because changing the prediction field would alter the target variable itself, which does not fix the underlying class imbalance issue and would require redefining the business problem.
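
Random oversampling, the simpler of the two techniques named above, can be sketched in a few lines of plain Python. This is a generic illustration applied to data prepared outside the platform, not a feature of Einstein Prediction Builder; the record layout and the `random_oversample` helper are assumptions.

```python
import random

def random_oversample(records, label_key, minority_label, seed=0):
    """Duplicate randomly chosen minority records until both classes match."""
    rng = random.Random(seed)
    minority = [r for r in records if r[label_key] == minority_label]
    majority = [r for r in records if r[label_key] != minority_label]
    if not minority:
        raise ValueError("no minority records to sample from")
    # Draw with replacement to fill the gap between the two class sizes.
    shortfall = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(shortfall)]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Synthetic training set: 1,000 records, 2% of them churned.
records = [{"id": i, "churned": i < 20} for i in range(1000)]

balanced = random_oversample(records, "churned", True)
churn_share = sum(r["churned"] for r in balanced) / len(balanced)
```

Oversampling should be applied to the training split only, so that evaluation still reflects the real 2% churn rate.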

102

MCQmedium

You are an admin at a financial services firm. The firm wants to use Einstein Next Best Action to offer personalized product recommendations to customers on its service portal. The data includes customer profiles, transaction history, and support case history. The Einstein Next Best Action strategy is configured with a recommendation that shows a 'Savings Account' offer to customers who have a checking account. However, the recommendation is not appearing for any customers. You check the Data Flow and see that the 'Account' object data is flowing correctly. The recommendation's filter condition is: AND( Has_Checking_Account__c = true, Age__c > 18 ). You verify that many customers meet these conditions. What is the most likely reason the recommendation is not appearing?

A.The 'Account' object is not supported by Einstein Next Best Action

B.The recommendation is not activated or published

C.The customer data is not being refreshed in real time

D.The filter condition is syntactically incorrect

AnswerB

Recommendations must be activated and published to be served to customers.

Why this answer

The most likely reason the recommendation is not appearing is that it has not been activated or published. In Einstein Next Best Action, recommendations must be explicitly activated or published to become available for serving to customers; configuration alone does not make them live. Since the data flow is correct and the filter conditions are valid, the missing activation step is the typical cause of a recommendation not showing.

Exam trap

The trap here is that candidates may focus on data flow or filter syntax issues, but the real test is understanding that activation is a required step in Einstein Next Best Action to make recommendations live.

How to eliminate wrong answers

Option A is wrong because the 'Account' object is fully supported by Einstein Next Best Action, as it is a standard Salesforce object that can be used in recommendation strategies. Option C is wrong because Einstein Next Best Action does not require real-time data refresh; it works with batch-synced data, and the Data Flow showing correct data indicates the data is available. Option D is wrong because the filter condition AND( Has_Checking_Account__c = true, Age__c > 18 ) is syntactically correct in Salesforce formula syntax and would not cause the recommendation to fail silently.

103

Multi-Selecteasy

Which TWO considerations are important when labeling data for a supervised learning model?

Select 2 answers

A.Maintaining consistent guidelines.

B.Labeler expertise.

C.Using automated labeling for all tasks.

D.Ignoring inter-labeler agreement.

E.Labeling only a small sample.

AnswersA, B

Clear guidelines ensure labelers apply the same criteria, reducing variability.

Why this answer

Maintaining consistent guidelines (A) is critical because supervised learning models learn patterns from labeled data; inconsistent labels introduce noise and confuse the model, degrading its accuracy. Labeler expertise (B) ensures that domain-specific nuances are correctly captured, which is especially important for tasks like medical imaging or legal document classification where errors have high cost.

Exam trap

Salesforce often tests the misconception that automated labeling is a complete substitute for human labeling, when in reality it requires careful validation and is typically used to augment, not replace, human effort.

104

MCQmedium

A data scientist is preparing data for Einstein Discovery. The dataset has 10,000 records with 5 predictors and one outcome. The outcome is binary (1/0). What is the minimum number of positive outcomes typically required for a reliable model?

A.250

B.500

C.100

D.50

AnswerA

50 per predictor * 5 predictors = 250 positive outcomes.

Why this answer

Option A is correct because the answer applies a rule of thumb of roughly 50 positive outcomes per predictor: with 5 predictors, that gives 5 × 50 = 250. The better-known heuristic of 10 events per predictor variable (EPV) would set a floor of only 50 positive outcomes, which is widely treated as a bare minimum rather than a comfortable target. Aiming for 250 positive outcomes (2.5% of the 10,000 records) leaves more margin against overfitting and unstable estimates.

Exam trap

Salesforce often tests the 10 events per predictor variable (EPV) rule, but the trap here is that candidates mistakenly apply the EPV rule directly (50 for 5 predictors) without considering the additional requirement for a minimum of 250 positive outcomes to ensure model reliability in Einstein Discovery.

How to eliminate wrong answers

Option B (500) is wrong because it overestimates the minimum; 500 positive outcomes would certainly be sufficient, but the question asks for the minimum typically required, which is lower at 250. Option C (100) is wrong because it provides only 20 events per predictor: above the 10 EPV floor, but short of the 50 per predictor that yields 250. Option D (50) is wrong because it is the bare minimum for 5 predictors under the 10 EPV rule, and in practice a higher minimum is recommended to ensure model convergence and avoid instability.
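
The arithmetic behind the options can be checked directly. The sketch below compares the 10 EPV floor with the 50-per-predictor figure from the answer summary; the helper name `required_positives` is an illustrative assumption.

```python
def required_positives(n_predictors, events_per_predictor):
    """Minimum positive outcomes implied by an events-per-predictor rule."""
    return n_predictors * events_per_predictor

n_records = 10_000
n_predictors = 5

floor_10_epv = required_positives(n_predictors, 10)    # bare-minimum heuristic
target_50_epv = required_positives(n_predictors, 50)   # figure used in the answer
share_of_records = target_50_epv / n_records

# Events per predictor implied by each answer option.
epv_by_option = {n: n / n_predictors for n in (50, 100, 250, 500)}
```

Both figures are heuristics; the number of positive outcomes actually needed depends on the data and the model.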

105

MCQeasy

A news outlet wants to build an AI model that predicts article popularity using real-time social media mentions. Which data source type should they use to ingest tweets?

A.Calculated Insight

B.Data Transform

C.Data Lake Object

D.Data Stream with Ingestion API connector

AnswerD

Enables real-time streaming from APIs.

Why this answer

Option D is correct because a Data Stream with the Ingestion API connector allows real-time streaming. Option C is wrong because Data Lake Objects are for batch. Option A is wrong because Calculated Insights are for aggregates.

Option B is wrong because Data Transforms process existing data.

106

MCQeasy

A company wants to use Einstein Article Recommendations to surface relevant knowledge articles to its support agents. What two data components are required to set up this feature?

A.Email-to-case logs and Knowledge Article feedback

B.Knowledge Article View event data and Case records

C.Knowledge Article categories and Case priority

D.Community user activity and Knowledge Article ratings

AnswerB

Article views show which articles were read; Cases provide context for recommendations.

Why this answer

Einstein Article Recommendations uses historical Knowledge Article View event data to understand which articles agents have found useful in the past, and Case records to provide context about the current issue. By analyzing patterns between case attributes and article views, the AI can predict and surface the most relevant articles for a given case. Without both data components, the recommendation engine cannot learn the association between case details and article usefulness.

Exam trap

Salesforce often tests the distinction between optional enhancement data (like ratings or categories) and the mandatory data sources (view events and case records) required to train the recommendation model.

How to eliminate wrong answers

Option A is wrong because Email-to-case logs are used for email-to-case routing and parsing, not for training article recommendations, and Knowledge Article feedback is a secondary signal, not a required data component. Option C is wrong because Knowledge Article categories and Case priority are metadata fields that can influence recommendations but are not the two required data components; the feature specifically needs view event data and case records. Option D is wrong because Community user activity is unrelated to agent-facing article recommendations, and Knowledge Article ratings are optional feedback, not a core requirement.

107

MCQeasy

A company wants to use Einstein Reply Recommendations in Service Cloud. What data is required to train the model?

A.Knowledge articles only.

B.Email templates.

C.Case comments and chat transcripts.

D.Historical email replies and customer satisfaction ratings.

AnswerC

These capture actual agent-customer interactions used for training.

Why this answer

Einstein Reply Recommendations in Service Cloud uses historical service interactions to suggest relevant replies to agents. The model is trained on case comments and chat transcripts because these contain the natural language patterns and resolutions that agents actually use in real-time service conversations, enabling the AI to learn effective response strategies.

Exam trap

Salesforce often tests the misconception that Einstein Reply Recommendations uses Knowledge articles or email templates, when in fact it relies on unstructured service conversation data like case comments and chat transcripts to learn agent-specific reply patterns.

How to eliminate wrong answers

Option A is wrong because Knowledge articles are structured content used for knowledge-based answers, not the conversational reply patterns needed for Einstein Reply Recommendations. Option B is wrong because email templates are pre-written, static responses that lack the dynamic, contextual language variations found in actual service interactions. Option D is wrong because historical email replies and customer satisfaction ratings are not the primary training data; Einstein Reply Recommendations specifically requires case comments and chat transcripts to learn from direct agent-customer exchanges.

108

MCQmedium

A manufacturer wants to improve demand forecasting by enriching its CRM orders with external demographic data. The external data is available via a SOAP API. How should the data architect implement this?

A.Configure a Data Action to call the API on a schedule or trigger

B.Use a Data Transform to pull the external data

C.Set up a Data Stream to continuously ingest the external API

D.Create a Calculated Insight to reference the API

AnswerA

Data Actions are built for external API integration.

Why this answer

Option A is correct because a Data Action can call an external API and store the response. Option B is wrong because a Data Transform only works on data already in Data Cloud. Option C is wrong because Data Streams are for continuous ingest, not on-demand enrichment.

Option D is wrong because a Calculated Insight cannot fetch external data.

109

MCQmedium

A telecom company uses Einstein Discovery to predict customer churn. The training dataset contains 100,000 records, but only 5% represent churned customers. The model achieves 95% accuracy on a holdout test set, but the recall for churn is only 20%. The business wants to proactively retain at-risk customers, so they need to identify as many churners as possible. What action should the data scientist take to improve churn recall?

A.Increase the regularization parameter to prevent overfitting.

B.Collect more data, especially of churned customers.

C.Oversample the minority class using SMOTE to create synthetic churn examples.

D.Undersample the majority class to match the minority class size.

AnswerC

SMOTE generates synthetic instances of the minority class, balancing the dataset and improving recall without losing information.

Why this answer

Class imbalance causes the model to favor the majority class. Oversampling the minority class (e.g., using SMOTE) balances the dataset, helping the model learn churn patterns better and improve recall.
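
The numbers in the scenario show why 95% accuracy is misleading here. The following worked calculation reconstructs the confusion matrix from the stated totals; only the variable names are assumptions.

```python
total = 100_000

# Figures stated in the question, kept as integer percentages to avoid
# floating-point rounding.
churn_pct, recall_pct, accuracy_pct = 5, 20, 95

positives = total * churn_pct // 100          # actual churners
negatives = total - positives                 # actual non-churners

true_pos = positives * recall_pct // 100      # churners the model catches
false_neg = positives - true_pos              # churners the model misses

correct = total * accuracy_pct // 100         # all correct predictions
true_neg = correct - true_pos
false_pos = negatives - true_neg

precision = true_pos / (true_pos + false_pos)

# Accuracy of a trivial model that predicts "no churn" for every customer.
baseline_accuracy = negatives / total
```

The model misses 4,000 of the 5,000 churners, and a model that never predicts churn would reach the same 95% accuracy, which is why recall is the metric to improve.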

110

Multi-Selectmedium

Which TWO are best practices for data labeling in AI projects? (Choose two.)

Select 2 answers

A.Have multiple labelers cross-check annotations

B.Label all available data immediately

C.Label only training data

D.Use automated labeling tools exclusively

E.Use clear labeling guidelines

AnswersA, E

Cross-checking improves label accuracy.

Why this answer

Options A and E are correct. Multiple labelers cross-checking reduces errors, and clear guidelines ensure consistency. Option D is wrong because relying solely on automated tools can introduce inaccuracies.

Option B is wrong because labeling all data immediately may not be feasible and puts speed ahead of quality. Option C is wrong because labeling only training data ignores validation/testing needs.
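
One common way to quantify how well two labelers agree when cross-checking is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The source does not name a specific metric, so the choice of kappa and the sample labels below are assumptions for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers annotating the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b)
    )
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of eight images by two labelers.
labeler_1 = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
labeler_2 = ["cat", "cat", "dog", "cat", "cat", "dog", "dog", "dog"]

raw_agreement = sum(a == b for a, b in zip(labeler_1, labeler_2)) / len(labeler_1)
kappa = cohens_kappa(labeler_1, labeler_2)
```

Here the labelers agree on 6 of 8 items (75%), but half of that agreement would be expected by chance, so kappa is 0.5. A low kappa often signals that the labeling guidelines need tightening.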

111

MCQhard

You are a Salesforce AI Specialist at a mid-sized manufacturing company. The company uses Einstein Lead Scoring to prioritize leads. The model was trained on historical lead data and has been in production for three months. Recently, the sales team reports that high-scoring leads are not converting as expected. You investigate and find that the model's data source includes leads from the past 18 months. However, six months ago, the company changed its lead qualification process: they started requiring a demo before scoring leads as 'qualified.' As a result, the definition of a converted lead changed. What is the best course of action to improve model performance?

A.Manually adjust the model's prediction threshold to account for the new process

B.Retrain the model using only leads from the last six months after the process change

C.Remove the 'Demo Scheduled' field from the model to avoid bias

D.Add more historical leads from before the process change to increase data volume

AnswerB

This ensures the model learns from data that reflects the current conversion criteria.

Why this answer

Option B is correct because the change in lead qualification process six months ago introduced a data distribution shift (concept drift), making older leads no longer representative of the current conversion behavior. Retraining the model on only the last six months of data aligns the training set with the new definition of a 'converted lead,' allowing Einstein Lead Scoring to learn the updated patterns and improve prediction accuracy.

Exam trap

The trap here is that candidates may think adjusting the threshold (Option A) is sufficient, but they fail to recognize that a change in the definition of the target variable requires retraining on a representative dataset, not just tuning a post-processing parameter.

How to eliminate wrong answers

Option A is wrong because manually adjusting the prediction threshold does not address the underlying change in the definition of a converted lead; it only shifts the cutoff for scoring, which cannot compensate for a fundamentally different target variable. Option C is wrong because removing the 'Demo Scheduled' field does not solve the problem—the issue is the change in the conversion definition, not bias from that field; in fact, the field may now be more predictive under the new process. Option D is wrong because adding more historical leads from before the process change would exacerbate the data mismatch, as those leads follow the old qualification rules and would dilute the model's ability to learn the current conversion patterns.

112

MCQmedium

A data scientist needs to feed customer interaction data into Einstein Discovery for predictive analysis. Which data format is required?

A.CSV

B.XML

C.Parquet

D.JSON

AnswerA

CSV is the required format for Einstein Discovery.

Why this answer

Option A is correct because Einstein Discovery typically requires data in CSV format. Option D is wrong because JSON is not the standard input format for Einstein Discovery. Option B is wrong because XML is not commonly used.

Option C is wrong because Parquet is a columnar storage format not directly supported.

113

Multi-Selecthard

Which THREE are key dimensions of data quality that directly impact AI model performance?

Select 3 answers

A.Consistency.

B.Timeliness.

C.Accuracy.

D.Data volume.

E.Completeness.

AnswersA, C, E

Inconsistent data (e.g., different formats) confuses models and degrades accuracy.

Why this answer

Consistency is a key dimension of data quality because AI models rely on stable patterns in the data. If the same entity is represented differently across records (e.g., 'NY' vs 'New York'), the model may learn incorrect correlations, leading to degraded prediction accuracy and unreliable outputs.

Exam trap

Salesforce often tests the distinction between data quality dimensions and data quantity metrics, so candidates mistakenly select 'data volume' thinking more data always improves AI performance, when in fact the exam focuses on accuracy, completeness, and consistency as the three critical quality dimensions.

114

MCQhard

A data architect is designing a data model for Einstein Discovery. The data includes categorical variables with high cardinality (e.g., postal codes). What is the best practice to handle such features?

A.Encode them as one-hot vectors.

B.Exclude them from the model.

C.Use the raw values without transformation.

D.Group them into higher-level categories (e.g., region).

AnswerD

Reduces cardinality while preserving signal.

Why this answer

Grouping high-cardinality categories into broader categories reduces overfitting and improves model stability.
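
A minimal sketch of this grouping, assuming a lookup keyed on the first digit of the postal code: the `REGION_BY_PREFIX` table and the region names are hypothetical, and a real project would use an authoritative postal-code reference.

```python
from collections import Counter

# Hypothetical mapping from the first digit of a postal code to a region.
REGION_BY_PREFIX = {
    "0": "Northeast",
    "1": "Northeast",
    "2": "Southeast",
    "3": "Southeast",
    "9": "West",
}

def to_region(postal_code, default="Other"):
    """Collapse a postal code into a coarse region; unknown prefixes map to default."""
    return REGION_BY_PREFIX.get(postal_code[:1], default)

postal_codes = ["10001", "94105", "30301", "60601", "10002", "94110"]
regions = [to_region(code) for code in postal_codes]

# Six distinct postal codes collapse into four region values.
region_counts = Counter(regions)
```

The catch-all "Other" bucket keeps rare or unmapped codes from each becoming their own category.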

115

Multi-Selecthard

Which TWO considerations are critical when planning data labeling for a computer vision project in a regulated industry?

Select 2 answers

A.Data storage location for label files

B.Mitigating labeler bias to ensure fairness

C.Compliance with data privacy regulations (e.g., GDPR)

D.Labeling timeline and budget constraints

E.Choosing between bounding boxes and segmentation masks

AnswersB, C

Bias can affect model fairness and regulatory requirements.

Why this answer

Option B is correct because labeler bias can introduce systematic errors into the training data, leading to models that perform unfairly or inaccurately across different demographic groups. In regulated industries, such bias can violate anti-discrimination laws and regulatory standards, making its mitigation a critical planning consideration.

Exam trap

Salesforce often tests the distinction between operational details (like storage location or annotation type) and critical regulatory or ethical considerations, leading candidates to choose technically valid but non-critical options like A or E.

116

MCQhard

A financial services firm uses Data Cloud to enrich sales data with external credit scores via an API. They set up a Data Action to call the credit bureau API for each new lead. Over time, API costs are rising, and the action is slowing down lead processing. They only need credit scores for leads with a high probability of conversion. What is the best approach to reduce costs and improve performance?

A.Remove the Data Action and manually verify credit scores for top leads

B.Apply a Data Transform to filter leads that have incomplete data before the Data Action

C.Schedule the Data Action to run daily in batch instead of real-time

D.Use a Calculated Insight to score leads based on internal data and only invoke the Data Action for leads with a high probability of conversion

AnswerD

Selectively calls API only for promising leads, reducing costs and load.

Why this answer

Option D is the best solution because it uses a Calculated Insight to compute a conversion probability score from internal data, then triggers the Data Action only for leads above a threshold. This reduces API calls significantly. Option B filters out leads with incomplete data before the Data Action, but completeness says nothing about conversion likelihood, so the API would still be called for many low-probability leads.

Option C runs the action in batch, but still calls the API for all leads. Option A removes the action entirely, losing the automated enrichment.
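
The gating logic can be sketched generically as a score threshold applied before any external call. This is plain Python, not Data Cloud configuration; the `conversion_score` field, the 0.7 threshold, and the sample leads are hypothetical.

```python
def select_for_enrichment(leads, threshold=0.7):
    """Return only the leads whose internal conversion score meets the threshold."""
    return [lead for lead in leads if lead["conversion_score"] >= threshold]

# Hypothetical leads already scored from internal data.
leads = [
    {"id": "L1", "conversion_score": 0.91},
    {"id": "L2", "conversion_score": 0.35},
    {"id": "L3", "conversion_score": 0.72},
    {"id": "L4", "conversion_score": 0.10},
]

selected = select_for_enrichment(leads)
selected_ids = [lead["id"] for lead in selected]

# Each lead left out is one credit bureau API call that is never made.
calls_saved = len(leads) - len(selected)
```

The threshold trades cost against coverage: raising it saves more calls but leaves more leads without a credit score.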

117

MCQmedium

A data engineer needs to create a feature that represents the average purchase amount per customer over the last 30 days. The transactional data is timestamped. Which feature engineering technique is most appropriate?

A.Sum of all purchase amounts per customer

B.Rolling average of purchase amounts over a 30-day window

C.Count of purchases per customer

D.Minimum purchase amount per customer

AnswerB

Rolling average matches the requirement.

Why this answer

Option B is correct because a rolling average over a 30-day window directly computes the average purchase amount per customer for only the most recent 30 days of transactions, which matches the requirement of a time-sensitive feature. This technique uses a sliding window function (e.g., AVG() with a ROWS or RANGE frame in SQL, or rolling().mean() in pandas) that respects the timestamp order, ensuring only relevant data contributes to the feature.
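
A minimal standard-library sketch of the windowed average follows, computed for a single reference date. The transaction tuples and the helper name `avg_purchase_last_n_days` are assumptions; in pandas, a time-based rolling window would produce the same feature for every date at once.

```python
from collections import defaultdict
from datetime import date, timedelta

def avg_purchase_last_n_days(transactions, as_of, days=30):
    """Return {customer_id: mean amount} for purchases in (as_of - days, as_of]."""
    window_start = as_of - timedelta(days=days)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for customer_id, purchased_on, amount in transactions:
        # Keep only purchases that fall inside the window.
        if window_start < purchased_on <= as_of:
            totals[customer_id] += amount
            counts[customer_id] += 1
    return {c: totals[c] / counts[c] for c in counts}

# Hypothetical timestamped transactions: (customer, date, amount).
transactions = [
    ("c1", date(2024, 5, 1), 100.0),   # older than 30 days, excluded
    ("c1", date(2024, 6, 10), 40.0),
    ("c1", date(2024, 6, 25), 60.0),
    ("c2", date(2024, 6, 20), 20.0),
]

features = avg_purchase_last_n_days(transactions, as_of=date(2024, 6, 30))
```

Customers with no purchases in the window are absent from the result, so a downstream step has to decide whether to treat them as zero or as missing.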

Exam trap

Salesforce often tests the distinction between simple aggregation (like sum or count) and time-windowed aggregation, trapping candidates who overlook the 'over the last 30 days' temporal constraint and choose a static aggregate instead.

How to eliminate wrong answers

Option A is wrong because summing all purchase amounts per customer ignores the 30-day time constraint and would include historical data outside the window, producing a feature that does not reflect recent behavior. Option C is wrong because counting purchases per customer measures frequency, not the average amount spent, and also lacks the time window restriction. Option D is wrong because the minimum purchase amount per customer is a different aggregate (minimum) that does not capture the central tendency of spending and similarly ignores the 30-day window.

118

MCQeasy

A company is preparing data for Einstein Article Recommendation. Which data source is most appropriate for training the model?

A.Historical article view and click data.

B.Org metadata.

C.System debug logs.

D.User profile data only.

AnswerA

This captures user preferences directly.

Why this answer

Einstein Article Recommendation uses supervised machine learning to predict which articles users are likely to find relevant. The model must be trained on historical user engagement signals—specifically article view and click data—to learn patterns of relevance. Without this behavioral data, the model cannot establish a correlation between user actions and article content.

Exam trap

Salesforce often tests the misconception that static data like user profiles or org metadata can substitute for behavioral training data, but the model fundamentally requires historical interaction signals to learn relevance.

How to eliminate wrong answers

Option B is wrong because org metadata (e.g., company name, industry) provides only static contextual information and lacks the user-article interaction signals required for training a recommendation model. Option C is wrong because system debug logs contain low-level technical events (e.g., errors, stack traces) that are irrelevant to user content preferences and would introduce noise rather than meaningful training features. Option D is wrong because user profile data alone (e.g., role, department) does not capture which articles users actually viewed or clicked, so the model cannot learn relevance from user behavior.

119

MCQeasy

A company plans to use Einstein Discovery to analyze sales data. Which data preparation step is essential for time-series forecasting?

A.Remove all outliers in sales amounts

B.Ensure date fields are properly formatted and contain sufficient historical range

C.Remove duplicate records

D.Scale all numeric fields to a 0-1 range

AnswerB

Einstein Discovery relies on date fields for trend detection.

Why this answer

For time-series forecasting in Einstein Discovery, the date field must be properly formatted (e.g., as a date or datetime data type) and contain a sufficient historical range to identify patterns like seasonality and trends. Without adequate historical data, the model cannot learn temporal dependencies, making this step essential.

Exam trap

Salesforce often tests the misconception that data normalization (scaling) is always required for AI models, but for tree-based algorithms like those in Einstein Discovery, scaling is irrelevant, and the trap is that candidates pick Option D thinking it is a universal preprocessing step.

How to eliminate wrong answers

Option A is wrong because removing all outliers in sales amounts can discard legitimate extreme values that represent real-world events (e.g., holiday spikes, promotions), which are critical for accurate time-series forecasting; Einstein Discovery handles outliers through model tuning rather than blanket removal. Option C is wrong because while removing duplicate records is a general data cleaning best practice, it is not specifically essential for time-series forecasting; duplicates in date-indexed data are typically handled by aggregation or deduplication, but this step is not a prerequisite for the forecasting algorithm. Option D is wrong because scaling numeric fields to a 0-1 range is unnecessary for time-series forecasting in Einstein Discovery, as tree-based models (like Gradient Boosted Trees) used internally are invariant to monotonic transformations and do not require normalization.

120

MCQmedium

During the data preparation phase for an AI model, a data engineer discovers that the 'AnnualRevenue' field contains some negative values. What is the best course of action?

A.Delete all records with negative revenue

B.Replace negative values with the mean of positive values

C.Keep negative values as they might represent returns or refunds

D.Investigate the data source to correct the negative values

AnswerD

Correcting at source ensures data integrity.

Why this answer

Option D is correct because negative revenue values typically indicate data entry errors, system bugs, or incorrect data transformations. The best practice in data preparation is to investigate the source system to understand why negative values were generated and correct them at the origin, ensuring data integrity before any imputation or deletion. Simply deleting or imputing without root-cause analysis can introduce bias or mask underlying data quality issues.

Exam trap

Salesforce often tests the misconception that imputation (e.g., mean replacement) is a safe default for handling invalid data, when in fact the correct first step is always to trace and fix the root cause at the data source.

How to eliminate wrong answers

Option A is wrong because deleting records with negative revenue can introduce selection bias and reduce the dataset size, potentially discarding valid data if negative values represent legitimate business events like refunds. Option B is wrong because replacing negative values with the mean of positive values artificially inflates the central tendency and distorts the distribution, which can degrade model performance, especially for regression tasks. Option C is wrong because keeping negative values as-is without investigation assumes they are valid, but in most financial datasets, revenue is non-negative by definition, and unverified negative values will mislead the model during training.
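
One way to act on this without deleting or imputing is to set the suspect records aside for investigation. The sketch below assumes records are plain dictionaries; the function name `split_suspect_revenue` and the sample accounts are hypothetical.

```python
def split_suspect_revenue(records, field="AnnualRevenue"):
    """Separate records with negative revenue for review instead of deleting or imputing them."""
    clean, suspect = [], []
    for record in records:
        value = record.get(field)
        if value is not None and value < 0:
            suspect.append(record)  # route to source-system investigation
        else:
            clean.append(record)
    return clean, suspect

# Hypothetical sample data
accounts = [
    {"Id": "A1", "AnnualRevenue": 120000},
    {"Id": "A2", "AnnualRevenue": -5000},
    {"Id": "A3", "AnnualRevenue": None},
]
clean, suspect = split_suspect_revenue(accounts)
```

The `suspect` list preserves the original values, so the root cause can be traced before any correction is applied.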

121

MCQhard

What is the primary purpose of this policy?

A.Data governance for AI

B.Data integration

C.Data transformation

D.Data backup

AnswerA

Policy controls access, masking, and retention.

Why this answer

Option A is correct because the policy defines data access rules, field allowances, masking, and retention, which are all components of data governance for AI. Option B is wrong because data integration focuses on combining data, not controlling access. Option C is wrong because data transformation changes data format. Option D is wrong because data backup is about recovery.

122

MCQhard

A data architect notices that a Data Stream from an external ERP system is failing intermittently with schema mismatch errors. The ERP team says the schema changes occasionally. What is the most effective long-term solution?

A.Implement Data Stream schema validation and flexible mapping

B.Use a different primary key for the data model

C.Increase the number of retries for the Data Stream

D.Ask the ERP team to manually notify of changes

AnswerA

Automates handling of schema changes.

Why this answer

Option A is correct because schema validation combined with flexible mapping detects schema changes and handles them gracefully instead of failing. Option B is wrong because using a different primary key won't address schema changes. Option C is wrong because increasing retries doesn't fix the root cause. Option D is wrong because relying on manual notification is not sustainable.
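
To make "schema validation and flexible mapping" concrete, here is a minimal sketch in plain Python. It is not the Data Cloud API: the field names, the alias table, and `normalize_record` are all hypothetical and only illustrate the idea of mapping renamed fields before validating.

```python
# Canonical schema the downstream model expects (hypothetical)
EXPECTED_FIELDS = {"order_id": str, "amount": float}
# Hypothetical aliases the source system might switch to after a schema change
FIELD_ALIASES = {"orderId": "order_id", "total_amount": "amount"}

def normalize_record(record):
    """Map known aliases to canonical names, then check required fields and types."""
    mapped = {FIELD_ALIASES.get(key, key): value for key, value in record.items()}
    errors = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in mapped:
            errors.append(f"missing field: {name}")
        elif not isinstance(mapped[name], expected_type):
            errors.append(f"wrong type for {name}")
    return mapped, errors

ok_record, ok_errors = normalize_record({"orderId": "SO-1", "total_amount": 99.5})
bad_record, bad_errors = normalize_record({"orderId": "SO-2"})
```

Renamed fields are absorbed by the alias table, while genuinely missing fields surface as explicit errors rather than intermittent failures.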

123

MCQeasy

An admin is setting up Einstein Article Recommendations. Which type of data is essential for the model to learn which articles are relevant?

A.Article publication dates

B.Article view events from users

C.User job titles

D.Article author names

AnswerB

View events are the primary input for collaborative filtering.

Why this answer

Einstein Article Recommendations uses a collaborative filtering model that learns article relevance from user interaction signals, specifically article view events. The model analyzes patterns of which articles users view together to identify related content, making view events the essential training data for generating recommendations.

Exam trap

Salesforce often tests the distinction between essential training data (user behavior signals like view events) and optional metadata (like publication dates or author names), leading candidates to mistakenly choose metadata that seems relevant but is not required for the collaborative filtering model to learn article relevance.

How to eliminate wrong answers

Option A is wrong because article publication dates are metadata that influence recency but are not used as primary training signals for collaborative filtering; the model learns relevance from user behavior, not timestamps. Option C is wrong because user job titles are demographic attributes that could be used for personalization but are not essential for the core recommendation model, which relies on interaction data like views. Option D is wrong because article author names are content metadata that do not provide the behavioral signals needed for the model to learn which articles are relevant to users.

124

MCQeasy

A company wants to use Einstein Prediction Builder to predict customer churn. Which data preparation step is essential before building the model?

A.Ensure the data is in a Salesforce connected data source like Data Cloud.

B.Define the prediction objective and the target date field.

C.Create a formula field to calculate the churn probability.

D.Create a new custom object to store the prediction results.

AnswerB

The prediction objective (e.g., churn) is required to train the model.

Why this answer

Option B is correct because Einstein Prediction Builder requires you to define the prediction objective (e.g., 'Will this customer churn?') and specify the target date field that marks the event. This step is essential as it tells the model what to predict and over what time window, enabling the automated feature engineering and model training process.

Exam trap

Salesforce often tests the misconception that data must come from Data Cloud or that you need to pre-create storage objects, when in fact the core prerequisite is simply defining the prediction objective and target date field.

How to eliminate wrong answers

Option A is wrong because while Data Cloud is a supported data source, it is not mandatory; Einstein Prediction Builder can also use standard or custom objects directly in Salesforce. Option C is wrong because formula fields cannot be used to calculate churn probability; the model generates probability scores automatically after training, and you do not pre-compute them. Option D is wrong because prediction results are stored automatically in a standard Salesforce object (PredictionResult) or can be written to a field on the record; you do not need to create a custom object for storage.

125

MCQhard

A bank uses Einstein Discovery to generate insights about loan approval decisions. After deployment, they notice the model denies loans to a higher percentage of applicants from a certain postal code. Which action should be taken to ensure responsible AI?

A.Ignore the discrepancy because postal code is not a protected attribute

B.Retrain the model using only recent loan data

C.Audit model outcomes for fairness across demographic groups and retrain if needed

D.Remove the postal code field from the model

AnswerC

Bias audit and mitigation is a standard responsible AI practice.

Why this answer

Option C is correct because responsible AI requires auditing model outcomes for fairness across demographic groups, even when the disparity correlates with a non-protected attribute like postal code. In Einstein Discovery, postal code can act as a proxy for protected attributes such as race or socioeconomic status, and ignoring this could lead to discriminatory lending practices. Auditing allows the team to detect and mitigate bias, and retraining with fairness constraints ensures the model aligns with ethical AI principles.

Exam trap

Salesforce often tests the misconception that removing a sensitive feature (like postal code) automatically eliminates bias, when in reality proxy features and correlated variables can still cause unfair outcomes.

How to eliminate wrong answers

Option A is wrong because ignoring the discrepancy is irresponsible; postal code can be a proxy for protected attributes (e.g., race or income), and model fairness must be evaluated even if the field itself is not protected. Option B is wrong because retraining on only recent loan data does not address the root cause of bias; it may even amplify existing disparities if recent data still reflects historical biases or sampling issues. Option D is wrong because simply removing the postal code field does not guarantee fairness; other correlated features (e.g., income, credit history) can still encode the same bias, and the model may still discriminate indirectly through proxy variables.
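
A first step in such an audit is simply comparing outcome rates across groups. The sketch below assumes decisions are available as dictionaries with an `outcome` field; the function name `denial_rates` and the sample rows are hypothetical, and a real audit would use proper demographic groupings and fairness metrics rather than this single gap.

```python
from collections import defaultdict

def denial_rates(decisions, group_key):
    """Return the share of denied applications for each value of group_key."""
    totals, denied = defaultdict(int), defaultdict(int)
    for row in decisions:
        group = row[group_key]
        totals[group] += 1
        if row["outcome"] == "denied":
            denied[group] += 1
    return {group: denied[group] / totals[group] for group in totals}

# Hypothetical sample decisions
decisions = [
    {"postal_code": "10001", "outcome": "approved"},
    {"postal_code": "10001", "outcome": "denied"},
    {"postal_code": "20002", "outcome": "denied"},
    {"postal_code": "20002", "outcome": "denied"},
]
rates = denial_rates(decisions, "postal_code")
gap = max(rates.values()) - min(rates.values())  # spread between groups
```

A large `gap` does not prove unfairness on its own, but it signals where a deeper review and possible retraining are warranted.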

126

MCQmedium

To integrate external data into Salesforce for AI, which tool is recommended by Salesforce for building data pipelines?

A.Salesforce Connect

B.Data Export Service

C.Salesforce Data Pipelines

D.Apex Data Loader

AnswerC

Data Pipelines provides a visual interface to build and schedule data transformations for AI.

Why this answer

Option C is correct because Salesforce Data Pipelines is the recommended tool for creating and managing data integration workflows for AI and analytics. The other options are not designed for this purpose.

127

MCQmedium

A company uses Salesforce Data Platform to store customer data. They want to use this data to train an AI model for lead scoring, but they are concerned about data quality. Which step should they take first to ensure the data is suitable for AI?

A.Profile the data to identify missing values, outliers, and inconsistencies

B.Immediately normalize all numerical features

C.Create a labeled dataset using historical lead outcomes

D.Set up a data pipeline to stream data in real-time

AnswerA

Profiling is the first step to assess data quality.

Why this answer

Profiling the data is the essential first step because it systematically identifies missing values, outliers, and inconsistencies that degrade model performance. Without this baseline assessment, any subsequent normalization or labeling would be applied to flawed data, leading to unreliable lead scoring predictions. Salesforce Data Platform supports profiling via tools like Einstein Analytics or Data Prep, which scan fields for nulls, range violations, and format errors.

Exam trap

Salesforce often tests the misconception that data preparation begins with feature engineering (like normalization) or pipeline setup, rather than with foundational data quality assessment through profiling.

How to eliminate wrong answers

Option B is wrong because normalizing numerical features is a preprocessing step that should only occur after data quality issues (like missing values or outliers) have been identified and resolved; applying normalization prematurely can amplify the impact of corrupt data. Option C is wrong because creating a labeled dataset is a critical step for supervised learning, but it assumes the raw data is already clean and consistent, which is not the case when data quality is a concern. Option D is wrong because setting up a real-time data pipeline addresses data velocity and freshness, not data quality; streaming dirty data into the pipeline would only propagate errors faster.
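
For intuition, profiling a single numeric field can be as simple as the sketch below. It is plain Python, not a Salesforce tool: `profile_field`, the allowed range of 0 to 100, and the sample scores are assumptions chosen for illustration.

```python
def profile_field(values, min_allowed=None, max_allowed=None):
    """Summarize missing values and range violations for one numeric field."""
    present = [v for v in values if v is not None]
    out_of_range = [
        v for v in present
        if (min_allowed is not None and v < min_allowed)
        or (max_allowed is not None and v > max_allowed)
    ]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "out_of_range": len(out_of_range),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

# Hypothetical lead scores expected to fall between 0 and 100
lead_scores = [42, None, 87, 150, -3, 65, None]
report = profile_field(lead_scores, min_allowed=0, max_allowed=100)
```

A report like this shows what needs fixing before normalization or labeling is attempted.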

128

MCQhard

An organization uses Salesforce Data Cloud to unify customer data from multiple sources. They want to ensure that data lineage is tracked for AI models. Which practice supports data lineage?

A.Use data partitioning to improve query performance.

B.Implement role-based access control on datasets.

C.Maintain metadata that records source, transformations, and dependencies.

D.Regularly run data profiling to check completeness.

AnswerC

Metadata enables lineage tracking.

Why this answer

Maintaining metadata that records source, transformations, and dependencies is the correct practice because data lineage for AI models requires a complete audit trail of where data originated, how it was transformed, and its dependencies. In Salesforce Data Cloud, this metadata is captured through the Data Catalog and Data Lineage feature, which tracks the flow of data from source objects through calculated insights and segments to AI model inputs, ensuring transparency and reproducibility.

Exam trap

Salesforce often tests the distinction between data management practices that improve performance or security versus those that specifically support auditability and traceability, leading candidates to confuse data partitioning or access control with lineage tracking.

How to eliminate wrong answers

Option A is wrong because data partitioning improves query performance by dividing data into smaller segments, but it does not track the origin, transformation steps, or dependencies of data, which are essential for lineage. Option B is wrong because role-based access control (RBAC) governs who can view or modify datasets, but it provides no record of data provenance or transformation history. Option D is wrong because data profiling checks completeness, accuracy, and consistency of data, but it does not capture the sequence of transformations or source-to-target mappings required for lineage.
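
As a rough picture of what lineage metadata holds, the sketch below models one record with its sources, transformations, and dependencies. The `LineageRecord` class and all dataset names are hypothetical; this is not how Data Cloud stores lineage internally.

```python
from dataclasses import dataclass, field

@dataclass
class LineageRecord:
    """Metadata describing where a dataset came from and how it was derived."""
    dataset: str
    sources: list
    transformations: list = field(default_factory=list)
    depends_on: list = field(default_factory=list)

    def add_step(self, description):
        """Append one transformation step in the order it was applied."""
        self.transformations.append(description)

# Hypothetical dataset built from two hypothetical sources
record = LineageRecord(dataset="unified_customer", sources=["crm_contacts", "web_events"])
record.add_step("deduplicate on email")
record.add_step("aggregate events per customer")
```

Keeping the steps in order is what lets someone later reproduce or audit the data that fed a model.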

129

MCQhard

A team is labeling text data for a sentiment analysis model. To ensure consistency and quality, which practice should they prioritize?

A.Use a single expert labeler for all data.

B.Use majority voting among multiple labelers.

C.Label all data by a single expert labeler.

D.Allow each labeler to interpret guidelines freely.

AnswerB

Majority voting aggregates judgments, improving accuracy and consistency.

Why this answer

Majority voting among multiple labelers reduces individual bias and errors, improving label consistency and quality for training data. This approach is standard in supervised learning for sentiment analysis because it aggregates diverse judgments, leading to more reliable ground truth labels.

Exam trap

Salesforce often tests the misconception that a single expert labeler guarantees higher quality, when in fact multiple labelers with majority voting reduce bias and improve reliability for training data.

How to eliminate wrong answers

Option A is wrong because using a single expert labeler introduces individual bias and lacks error checking, which can degrade model performance due to inconsistent or subjective labels. Option C is wrong because labeling all data by a single expert labeler is identical to Option A and suffers from the same lack of consensus and quality assurance. Option D is wrong because allowing each labeler to interpret guidelines freely leads to high inter-labeler variability, undermining consistency and making the dataset unreliable for training a robust model.
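
Majority voting itself is easy to sketch. The example below assumes each item has a list of labels from different labelers; the tie-handling rule (returning `None` so the item can be adjudicated) and the sample votes are assumptions added for illustration.

```python
from collections import Counter

def majority_label(labels):
    """Return the most common label, or None when there is no clear winner."""
    if not labels:
        return None
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: send the item back for adjudication
    return counts[0][0]

# Hypothetical votes from multiple labelers
votes = {
    "review-1": ["positive", "positive", "negative"],
    "review-2": ["negative", "neutral", "negative"],
    "review-3": ["positive", "negative"],
}
consensus = {item: majority_label(labels) for item, labels in votes.items()}
```

Items that come back as `None` are the ones where guidelines may be ambiguous and worth revisiting.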

130

MCQmedium

Refer to the exhibit. What effect does this masking policy have on the data used for training an Einstein model?

A.Only SSN is masked.

B.SSN and CreditCard fields are encrypted.

C.SSN and CreditCard fields are completely removed from training data.

D.SSN and CreditCard fields are partially masked, showing only the last four characters.

AnswerD

Explicitly defined by showLastFour and maskingType partial.

Why this answer

The masking policy in Einstein applies a partial mask to sensitive fields like SSN and CreditCard, showing only the last four characters while obscuring the rest. This ensures that the data used for training retains its structural utility for model learning without exposing full sensitive values, which is why option D is correct.

Exam trap

Salesforce often tests the distinction between masking, encryption, and removal, where candidates mistakenly think masking is equivalent to encryption or complete deletion, but masking specifically preserves partial data for model training while hiding sensitive details.

How to eliminate wrong answers

Option A is wrong because the masking policy applies to both SSN and CreditCard fields, not just SSN, as indicated by the exhibit showing both fields being masked. Option B is wrong because masking is not encryption; encryption transforms data into a ciphertext that requires a key to reverse, whereas masking irreversibly obscures parts of the data for privacy. Option C is wrong because the policy does not completely remove the fields; it partially masks them, leaving the last four characters visible for training purposes.
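
The effect of a partial, last-four mask can be illustrated as follows. This is a generic sketch, not the exhibit's policy syntax: `mask_last_four`, the `*` mask character, and the sample values are assumptions.

```python
def mask_last_four(value, mask_char="*"):
    """Hide every character except the last four."""
    if len(value) <= 4:
        return value  # simplification: short values are left unchanged in this sketch
    return mask_char * (len(value) - 4) + value[-4:]

# Hypothetical sample values
masked_ssn = mask_last_four("123-45-6789")
masked_card = mask_last_four("4111111111111111")
```

Unlike encryption, nothing here can be reversed with a key, and unlike removal, the field still exists with its length and trailing digits intact.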

131

Multi-Selectmedium

Which TWO data preparation steps are critical for ensuring high-quality training data?

Select 2 answers

A.Increasing dataset size by adding noise.

B.Removing duplicate records.

C.Normalizing all features.

D.Handling missing values appropriately.

E.Using only labeled data.

AnswersB, D

Duplicates can overrepresent certain patterns and skew model training.

Why this answer

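Option B is correct because duplicate records in a dataset can cause the model to overfit to repeated patterns, biasing the learned distribution and reducing generalization. Removing duplicates ensures each data point contributes equally to training, which is essential for robust model performance. Option D is correct because missing values that are ignored, dropped indiscriminately, or filled carelessly can distort feature distributions or shrink the dataset, so they must be handled in a way that fits the data.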

Exam trap

Salesforce often tests the distinction between data preparation steps that ensure data quality (like removing duplicates and handling missing values) versus optional preprocessing or augmentation techniques, leading candidates to mistakenly select normalization or noise addition as critical steps.
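
The two critical steps, removing duplicates and handling missing values, can be sketched together. The sketch assumes rows are dictionaries and uses a fixed default as the fill strategy, which is only one of several reasonable choices; `prepare_rows` and the sample rows are hypothetical.

```python
def prepare_rows(rows, key_fields, defaults):
    """Drop rows that repeat the same key, then fill missing values from defaults."""
    seen, prepared = set(), []
    for row in rows:
        key = tuple(row.get(name) for name in key_fields)
        if key in seen:
            continue  # skip a repeated record
        seen.add(key)
        filled = {
            name: (defaults.get(name) if value is None else value)
            for name, value in row.items()
        }
        prepared.append(filled)
    return prepared

# Hypothetical sample rows: one duplicate, one missing value
rows = [
    {"email": "a@example.com", "industry": "Retail"},
    {"email": "a@example.com", "industry": "Retail"},
    {"email": "b@example.com", "industry": None},
]
prepared = prepare_rows(rows, key_fields=["email"], defaults={"industry": "Unknown"})
```

Whether to fill, flag, or drop a missing value depends on the field, so the `defaults` mapping here stands in for a decision made per column.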

132

MCQeasy

A company plans to train an AI model using data from Salesforce CRM and an external marketing automation platform. What is the first step to unify these data sources in Data Cloud?

A.Define a Data Model that maps fields from both sources to a unified customer object

B.Create two separate Data Streams to bring data in

C.Build a Calculated Insight to merge the data

D.Set up a Data Transformation to blend the sources

AnswerA

Unifies the schema before ingestion.

Why this answer

Option A is correct because creating a data model that maps fields from both sources to a common object ensures consistency. Option B is wrong because data streams come after the model. Option C is wrong because Calculated Insights are for aggregations.

Option D is wrong because data transformations are applied later.

133

MCQhard

A system administrator receives an error when running a Data Cloud data transform: 'Row-level security settings are preventing access to the source data.' The admin has appropriate permissions. What is the most likely cause?

A.The target object has field-level security.

B.The data stream is scheduled during maintenance.

C.The data transform is set to run as the admin's default user.

D.The source object has sharing rules that restrict access for the data transform's running user.

AnswerD

Row-level security is about sharing; the running user may not see all rows.

Why this answer

Option D is correct because Data Cloud data transforms run under a specific running user context, and row-level security (RLS) settings on the source object can restrict that user's access to rows, even if the admin has broad permissions. The error indicates that the running user lacks visibility to certain source data rows due to sharing rules or RLS policies, which is a common cause of this specific error message.

Exam trap

Salesforce often tests the distinction between row-level security and field-level security, and candidates mistakenly choose field-level security (Option A) because they confuse the two concepts, not realizing the error message explicitly points to row-level restrictions.

How to eliminate wrong answers

Option A is wrong because field-level security (FLS) controls access to fields, not rows, and the error explicitly mentions 'row-level security,' not field-level. Option B is wrong because scheduled maintenance would typically cause a different error (e.g., 'service unavailable' or timeout), not a row-level security access denial. Option C is wrong because running as the admin's default user would inherit the admin's permissions, which should have access; the error states the admin has appropriate permissions, so the issue is with the running user context, not the default user setting.

134

MCQhard

An organization is preparing data for Einstein Next Best Action. They have multiple action types (discounts, product suggestions, content). Which data model approach best ensures accurate recommendations?

A.Train a separate model per customer segment and then merge.

B.Create a separate model for each action type and combine results manually.

C.Build an ensemble of models and average their outputs.

D.Use a single model that includes all action types in the training data.

AnswerD

A unified model captures interactions between actions, leading to better optimization of the next best action.

Why this answer

Option D is correct because Einstein Next Best Action is designed to learn from all action types simultaneously within a single model. By including all action types (discounts, product suggestions, content) in the training data, the model can capture cross-action patterns and relative effectiveness, leading to more accurate and contextually relevant recommendations. A unified model avoids fragmentation and ensures consistent scoring across actions.

Exam trap

Salesforce often tests the misconception that separate models per action or segment improve accuracy, when in fact Einstein Next Best Action requires a single unified model to learn cross-action patterns and deliver coherent recommendations.

How to eliminate wrong answers

Option A is wrong because training a separate model per customer segment and merging results introduces fragmentation and ignores cross-segment patterns, reducing the model's ability to generalize and leading to inconsistent recommendations. Option B is wrong because creating a separate model for each action type and manually combining results loses the interdependencies between actions, such as when a discount makes a product suggestion more effective, and adds unnecessary complexity without leveraging Einstein's built-in multi-action support. Option C is wrong because building an ensemble of models and averaging outputs does not align with Einstein Next Best Action's architecture, which expects a single unified model to optimize across all actions; averaging can dilute the signal from specific action types and reduce recommendation precision.

135

MCQhard

A company uses Salesforce Data Cloud to unify customer data from multiple sources for AI model training. After adding a new data source, model performance degrades significantly. What is the most likely cause?

A.Insufficient compute resources

B.Data labeling errors

C.Data schema mismatch

D.Data duplication from overlapping sources

AnswerD

Duplication introduces bias and degrades performance.

Why this answer

Option D is correct because data duplication due to overlapping records from multiple sources can bias the model. Option A is wrong because compute issues would affect all models. Option B is wrong because data labeling errors would affect the training process, not the data unification step. Option C is wrong because schema mismatch would cause load errors, not just performance degradation.

136

MCQeasy

A company wants to use Einstein Article Recommendations to suggest knowledge articles to support agents. What is a prerequisite for this feature?

A.Articles must be of a specific type, such as FAQ.

B.The org must be enabled for Einstein features.

C.A case must be open for the recommendation to appear.

D.Knowledge articles must be created and published.

AnswerD

Articles must exist to be recommended.

Why this answer

Einstein Article Recommendations requires that knowledge articles are created and published in the Salesforce Knowledge base. The feature uses natural language processing (NLP) to match the context of a case or conversation with published articles, so unpublished or draft articles cannot be recommended. Without published articles, the AI model has no content to analyze or suggest.

Exam trap

Salesforce often tests the distinction between general Einstein enablement and feature-specific prerequisites, so candidates mistakenly select Option B thinking that enabling Einstein is the only requirement, when in fact published articles are the critical prerequisite for Article Recommendations to function.

How to eliminate wrong answers

Option A is wrong because Einstein Article Recommendations does not require articles to be of a specific type like FAQ; it works with any standard or custom article type defined in Salesforce Knowledge. Option B is wrong because while Einstein features generally require the org to be enabled for Einstein, this is a platform-level prerequisite for all Einstein services, not a specific prerequisite for Article Recommendations—the question asks for a prerequisite specific to this feature. Option C is wrong because a case does not need to be open for recommendations to appear; Einstein can also suggest articles in other contexts such as Chat, Email-to-Case, or even in the Knowledge tab without an open case.

137

MCQeasy

Refer to the exhibit. A data transformation configuration is shown. Which of the following describes the outcome of applying this transformation?

A.The transformation is invalid because one-hot encoding cannot be combined with scaling.

B.Only 'color' is transformed; 'price' and 'weight' are unchanged.

C.'color' is one-hot encoded into multiple binary columns; 'price' and 'weight' are standardized to have mean 0 and variance 1.

D.'color' is scaled to [0,1] and 'price', 'weight' are one-hot encoded.

AnswerC

Correct interpretation of the config.

Why this answer

Option C is correct because the transformation configuration applies a one-hot encoder to the 'color' categorical column, creating multiple binary columns, and applies a standard scaler to the 'price' and 'weight' numerical columns, centering them to mean 0 and scaling to unit variance. This is a common preprocessing pipeline that handles mixed data types appropriately.

Exam trap

Salesforce often tests the ability to distinguish which transformation applies to which column type, trapping candidates who confuse scaling with encoding or assume that different transformations cannot coexist in a single pipeline.

How to eliminate wrong answers

Option A is wrong because one-hot encoding and scaling can be combined in a single transformation pipeline; they are applied to different columns (categorical vs. numerical) and are not mutually exclusive. Option B is wrong because the transformation explicitly applies a standard scaler to 'price' and 'weight', so they are not unchanged; they are standardized. Option D is wrong because it reverses the operations: 'color' is one-hot encoded, not scaled to [0,1], and 'price' and 'weight' are standardized, not one-hot encoded.
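
Since the exhibit is not reproduced here, the sketch below shows the same two operations on made-up columns: one-hot encoding for a categorical field and standardization for a numeric one. The helper names and sample values are assumptions, and population variance is used for the scaling.

```python
from statistics import mean, pstdev

def one_hot(values):
    """Encode a categorical column as one binary column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return rows, categories

def standardize(values):
    """Rescale a numeric column to mean 0 and variance 1 (population variance)."""
    mu, sigma = mean(values), pstdev(values)  # sigma is 0 for a constant column
    return [(value - mu) / sigma for value in values]

# Hypothetical sample columns
colors = ["red", "blue", "red"]
prices = [10.0, 20.0, 30.0]
encoded, categories = one_hot(colors)
scaled = standardize(prices)
```

The two functions operate on different columns, which is why encoding and scaling coexist in one pipeline without conflict.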

138

MCQeasy

Refer to the exhibit. What data quality issue does the exhibit reveal?

A.The Summer Sale campaign has duplicate records.

B.The Fall Clearance campaign has no response data.

C.The query syntax is incorrect.

D.The data is not normalized.

AnswerB

NonNullResponse is 0, meaning all responses are null.

Why this answer

The Fall Clearance campaign has zero non-null responses, indicating all response data is missing for that campaign.
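
The kind of check behind a NonNullResponse count can be sketched as below. The exhibit's actual query is not shown here, so the row layout, field names, and `non_null_counts` helper are hypothetical.

```python
from collections import defaultdict

def non_null_counts(rows, group_field, value_field):
    """Count non-null values of value_field for each group."""
    counts = defaultdict(int)
    for row in rows:
        # Adding 0 still registers the group, so all-null groups show up as 0
        counts[row[group_field]] += 0 if row[value_field] is None else 1
    return dict(counts)

# Hypothetical campaign member rows
members = [
    {"campaign": "Summer Sale", "response": "Clicked"},
    {"campaign": "Summer Sale", "response": "Purchased"},
    {"campaign": "Fall Clearance", "response": None},
    {"campaign": "Fall Clearance", "response": None},
]
counts = non_null_counts(members, "campaign", "response")
```

A group with a count of 0 has no usable response data at all, which is the issue the exhibit points to.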

139

MCQmedium

An admin is troubleshooting Einstein Sentiment. The model returns high confidence but wrong sentiment (e.g., positive reviews labeled negative). What is the most likely issue?

A.The model was not retrained after the last data load.

B.The training data contains predominantly neutral examples.

C.The training data has incorrect labels for sentiment.

D.The field mapping for the sentiment field is incorrect.

AnswerC

Garbage in, garbage out: mislabeled training data leads to confident but incorrect classifications.

Why this answer

Option C is correct because if the training data contains incorrect labels for sentiment, the model learns from erroneous ground truth, leading to high confidence in wrong predictions. In Einstein Sentiment, the model's accuracy depends directly on the quality and correctness of the labeled training data; mislabeled examples cause the classifier to associate features with the wrong sentiment class, resulting in confident but incorrect outputs.

Exam trap

Salesforce often tests the concept that high confidence does not imply high accuracy; candidates mistakenly assume retraining or data volume issues are the root cause, rather than recognizing that garbage-in (incorrect labels) leads to garbage-out (confident wrong predictions).

How to eliminate wrong answers

Option A is wrong because retraining after a data load does not fix incorrect labels; it would only reinforce the existing mislabeled patterns. Option B is wrong because predominantly neutral examples would bias the model toward neutral predictions, not cause high-confidence wrong sentiment (e.g., positive labeled as negative). Option D is wrong because incorrect field mapping would typically result in missing or misaligned data, not high-confidence misclassification of sentiment; the model would fail to train or predict altogether.

140

MCQhard

A company wants to use Einstein Next Best Action but needs to ensure data privacy. What is the required step for anonymizing customer data in Data Pipelines?

A.Create a sandbox with scrambled data

B.Rely on Einstein's built-in anonymization

C.Use Permission Sets to hide fields from users

D.Use the Data Mask transformation in Data Pipelines

AnswerD

Data Mask can replace sensitive values with anonymized data at the pipeline level.

Why this answer

Option D is correct because Data Pipelines includes a Data Mask transformation that can anonymize PII fields. Permission Sets control access but do not anonymize; sandbox scrambling is for testing only; Einstein does not automatically anonymize data.

141

MCQeasy

Which data type is most commonly used for image recognition AI models?

A.Unstructured data

B.Structured data

C.Time-series data

D.Semi-structured data

AnswerA

Images are unstructured data.

Why this answer

Option A is correct because image recognition primarily uses unstructured data (pixel values). Option B is wrong because structured data (tables) is not suitable for images. Option C is wrong because time-series data is for sequential measurements. Option D is wrong because semi-structured data (like JSON) is not typical.

142

MCQmedium

A Salesforce admin is troubleshooting an Einstein Prediction Builder model that is not generating predictions. The model was created with a custom object 'Feedback__c'. The admin notices that the model's data source includes records with status 'In Progress' and 'Closed'. What is the most likely cause of the model not generating predictions?

A.The data source is not refreshed daily

B.The object is a custom object

C.The outcome field has more than two unique values

D.The object has fewer than 1000 records

AnswerC

Einstein Prediction Builder requires a binary outcome; a status field that includes values such as 'In Progress' alongside 'Closed' can contain more than two distinct values.

Why this answer

Option C is correct because Einstein Prediction Builder requires the outcome field to have exactly two unique values (binary classification). The presence of 'In Progress' and 'Closed' as statuses suggests the outcome field likely contains more than two values, which violates this requirement and prevents the model from generating predictions.

Exam trap

The trap here is that candidates often assume any data quality issue (like stale data or record count) is the cause, but the specific requirement for binary outcome fields in Einstein Prediction Builder is a precise constraint that directly blocks prediction generation.

How to eliminate wrong answers

Option A is wrong because the data source refresh frequency does not affect the model's ability to generate predictions; it only impacts data freshness. Option B is wrong because custom objects are fully supported by Einstein Prediction Builder, and using a custom object does not prevent prediction generation. Option D is wrong because while a minimum of 1000 records is recommended for model training, having fewer records would cause a training failure, not a prediction generation failure after the model is built.
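
The binary-outcome constraint described above can be checked on exported records before a model is built. The following Python sketch is one way to do that; the `Status__c` field name and the sample records are hypothetical, and the check runs on plain dictionaries rather than through any Salesforce API.

```python
from collections import Counter


def outcome_value_counts(records, outcome_field):
    """Count how often each distinct value of the outcome field occurs."""
    return Counter(r.get(outcome_field) for r in records)


def is_binary_outcome(records, outcome_field):
    """True only when the field has exactly two distinct non-null values."""
    counts = outcome_value_counts(records, outcome_field)
    counts.pop(None, None)  # ignore records where the field is missing
    return len(counts) == 2


# Hypothetical Feedback__c export with a three-valued status field.
feedback = [
    {"Id": "1", "Status__c": "Open"},
    {"Id": "2", "Status__c": "In Progress"},
    {"Id": "3", "Status__c": "Closed"},
]
```

Here `is_binary_outcome(feedback, "Status__c")` is false because three statuses are present; collapsing the field to two values (for example, closed versus not closed) would make it pass.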

143

MCQhard

A healthcare AI model uses patient data. The legal team requires that all data used for training be de-identified according to HIPAA Safe Harbor method. Which data handling process satisfies this?

A.Remove all 18 HIPAA identifiers from each record.

B.Generate synthetic data that mimics patient records.

C.Remove patient names and replace with IDs.

D.Anonymize data by aggregating into groups of 10 or more.

AnswerA

Safe Harbor method requires removal of all listed identifiers.

Why this answer

Option A is correct because the HIPAA Safe Harbor method specifically requires the removal of all 18 identifiers listed in the HIPAA Privacy Rule from each patient record. This includes direct identifiers like names, addresses, and Social Security numbers, as well as indirect identifiers such as dates and geographic subdivisions. Once these 18 identifiers are removed, and provided the organization has no actual knowledge that the remaining information could identify an individual, the data is considered de-identified and no longer subject to HIPAA restrictions, allowing it to be used for AI training.

Exam trap

The trap here is that candidates often confuse de-identification with anonymization or pseudonymization, assuming that removing just names or aggregating data is sufficient, but the exam tests the specific requirement of removing all 18 HIPAA identifiers under the Safe Harbor method.

How to eliminate wrong answers

Option B is wrong because generating synthetic data that mimics patient records does not satisfy the HIPAA Safe Harbor method; while synthetic data can be useful, it is not a recognized de-identification method under HIPAA Safe Harbor, which requires the removal of specific identifiers from actual data. Option C is wrong because removing patient names and replacing them with IDs alone does not meet the Safe Harbor standard, as it still leaves 17 other identifiers (e.g., dates, ZIP codes) that must be removed. Option D is wrong because anonymizing data by aggregating into groups of 10 or more is not a HIPAA Safe Harbor method; aggregation may reduce re-identification risk but does not guarantee removal of all 18 identifiers, and Safe Harbor requires explicit removal of those identifiers, not statistical aggregation.
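
To make the contrast with Option C concrete, here is a minimal Python sketch of identifier removal. It is an illustration only: the field names are hypothetical, the set covers just a few of the 18 Safe Harbor categories, and a real pipeline would need to map every category to the actual schema and also handle identifiers buried in free text.

```python
# Hypothetical field names covering only SOME of the 18 identifier categories.
IDENTIFIER_FIELDS = {
    "name",
    "street_address",
    "phone",
    "email",
    "ssn",
    "medical_record_number",
}


def deidentify(record):
    """Drop listed identifier fields and reduce the admission date to a year."""
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}
    # Safe Harbor treats date elements other than the year as identifiers,
    # so an ISO date string (YYYY-MM-DD) is cut down to its year here.
    if "admission_date" in cleaned:
        cleaned["admission_year"] = cleaned.pop("admission_date")[:4]
    return cleaned


patient = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "medical_record_number": "MRN-1",
    "admission_date": "2023-05-14",
    "diagnosis": "asthma",
}
safe = deidentify(patient)
```

Replacing only `name` with an ID would leave `ssn`, `medical_record_number`, and the full date in place, which is why Option C falls short.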

144

MCQmedium

A company wants to use customer purchase history to train a recommendation model. Which action is essential to comply with data privacy regulations?

A.Use only publicly available data.

B.Ignore regulations because data is internal.

C.Obtain explicit consent from customers.

D.Anonymize the data after training.

AnswerC

Explicit consent is legally required for processing personal data for AI training.

Why this answer

Option C is correct because data privacy regulations such as GDPR and CCPA require a lawful basis for processing personal data, and explicit consent is a primary lawful basis when using customer purchase history for training a recommendation model. Without obtaining explicit consent, the company would be processing personal data without a valid legal ground, violating regulations that mandate transparency and user control over their data.

Exam trap

The trap here is that candidates often assume internal data is exempt from privacy regulations or that anonymization after training retroactively fixes compliance, but regulations require a lawful basis before any processing begins.

How to eliminate wrong answers

Option A is wrong because using only publicly available data does not guarantee compliance; the data may still contain personal information subject to privacy regulations, and the model could inadvertently infer private attributes from public data. Option B is wrong because internal data is not exempt from data privacy regulations; laws like GDPR apply to any processing of personal data regardless of whether it is internal or external. Option D is wrong because anonymizing data after training does not address the requirement for a lawful basis at the time of collection and processing; the model may have already learned patterns from identifiable data, and retroactive anonymization does not cure the initial lack of consent.

145

MCQmedium

A company wants to integrate external customer behavior data into Salesforce to enhance AI predictions. Which Salesforce Data Cloud feature is specifically designed to ingest and map external data?

A.Apex triggers

B.Data Streams

C.Einstein Studio

D.Flow Builder

AnswerB

Data Streams ingest external data into Data Cloud.

Why this answer

Option B is correct because Data Streams in Data Cloud are used to bring in data from various external sources. Option A is wrong because Apex triggers are custom code. Option C is wrong because Einstein Studio is for building AI models, not data ingestion. Option D is wrong because Flow Builder is for automation.

146

MCQeasy

A Salesforce admin wants to use Einstein Recommendations to suggest products. What is a key requirement for the data used to train the recommendation model?

A.Product prices must be stored in a custom currency field.

B.User profiles must include demographic data.

C.A minimum of 1,000 user-product interactions must exist.

D.Product descriptions must be at least 100 characters long.

AnswerC

Einstein Recommendations typically requires at least 1,000 interactions to generate meaningful recommendations.

Why this answer

Einstein Recommendations requires a minimum of 1,000 user-product interactions (such as views, clicks, or purchases) to train a statistically significant collaborative filtering model. This threshold ensures the algorithm can identify meaningful patterns in user behavior and generate accurate product suggestions.

Exam trap

The trap here is that candidates often assume Einstein Recommendations requires product metadata (like prices or descriptions) or user demographics, but the core requirement is purely a minimum volume of user-product interaction data.

How to eliminate wrong answers

Option A is wrong because product prices are not required for the collaborative filtering algorithm used by Einstein Recommendations; the model focuses on user-product interactions, not monetary values. Option B is wrong because demographic data is optional and not a key requirement; Einstein Recommendations primarily relies on behavioral data (interactions) rather than user profile attributes. Option D is wrong because product descriptions are irrelevant to the training data; the model uses interaction events, not text content, to generate recommendations.
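
A volume requirement like this is easy to check before training. The Python sketch below counts interaction events against the 1,000 figure quoted in the explanation; the event structure and names are assumptions for illustration, not part of any Einstein API.

```python
MIN_INTERACTIONS = 1000  # threshold quoted in the explanation above


def interaction_shortfall(interactions, minimum=MIN_INTERACTIONS):
    """Return how many more interaction events are needed (0 if enough)."""
    return max(0, minimum - len(interactions))


# Hypothetical log of 250 view events; no prices or descriptions involved.
events = [
    {"user_id": u, "product_id": u % 7, "event": "view"} for u in range(250)
]
```

Note that the check looks only at interaction events, which matches the point that product metadata and demographics are not the gating requirement.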

147

MCQhard

A large enterprise is using Einstein Lead Scoring and notices that the model score is not updating for leads created via a web-to-lead form. The leads have all required fields populated. The admin has verified that the model is active and the data source includes the Lead object. What could be causing the score to remain static?

A.The data source excludes leads created by web-to-lead

B.Web-to-lead leads are not supported by Einstein AI

C.The model is not yet activated

D.The model has not scored enough leads to start scoring new ones

AnswerD

Einstein requires a critical mass of scored records to calibrate the model before scoring new leads.

Why this answer

Option D is correct because Einstein Lead Scoring requires a minimum number of scored leads (typically 500) before it begins scoring new leads. Until that threshold is met, the model remains in a 'training' or 'pending' state and will not update scores for any leads, including those from web-to-lead forms. The admin has confirmed the model is active and the Lead object is included, so the most likely cause is that the model has not yet processed enough leads to start scoring.

Exam trap

Salesforce often tests the concept that Einstein AI models require a minimum data threshold before they become operational, and candidates mistakenly assume that an 'active' model immediately scores all leads, ignoring the training/pending state requirement.

How to eliminate wrong answers

Option A is wrong because the data source for Einstein Lead Scoring includes all leads in the Lead object by default; there is no option to exclude leads based on creation method (e.g., web-to-lead). Option B is wrong because Einstein AI fully supports leads created via web-to-lead forms; there is no restriction on lead source. Option C is wrong because the admin has already verified that the model is active, so the model is not in an 'inactive' or 'not yet activated' state.

148

MCQeasy

A company wants to use its data from Salesforce to train an Einstein AI model. However, they need to exclude records where the customer has opted out of data use. Which field should they configure in the Data Manager?

A.A checkbox field named Data Use Opt-Out

B.The Record Type of the object

C.A picklist field on the object

D.A formula field that evaluates to true

AnswerA

Salesforce Data Manager uses a checkbox field to mark records that should be excluded from AI training.

Why this answer

Option A is correct because the Data Manager in Einstein AI uses a standard checkbox field named 'Data Use Opt-Out' to identify records that should be excluded from model training. When this checkbox is true, the system automatically filters out those records during data preparation, ensuring compliance with data privacy requirements.

Exam trap

Salesforce often tests the misconception that any field indicating consent can be used, but the Data Manager specifically requires a checkbox field named 'Data Use Opt-Out' to automatically filter records.

How to eliminate wrong answers

Option B is wrong because the Record Type field is used for business process differentiation and page layouts, not for indicating data usage consent; it has no built-in mechanism to exclude records from AI training. Option C is wrong because a picklist field, while it could theoretically store opt-out status, is not recognized by the Data Manager as a standard field for data exclusion; the system specifically looks for the checkbox field. Option D is wrong because a formula field that evaluates to true is not a standard field type that the Data Manager can interpret for opt-out filtering; the Data Manager requires a native checkbox field to trigger exclusion logic.
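
The filtering behaviour described above amounts to dropping every record whose opt-out checkbox is true. A minimal Python sketch, assuming a hypothetical API name `Data_Use_Opt_Out__c` for the checkbox and plain dictionaries in place of Salesforce records:

```python
def training_records(records, opt_out_field="Data_Use_Opt_Out__c"):
    """Keep only records whose opt-out checkbox is not set."""
    # A missing field is treated as "not opted out" in this sketch.
    return [r for r in records if not r.get(opt_out_field, False)]


customers = [
    {"Id": "001", "Data_Use_Opt_Out__c": False},
    {"Id": "002", "Data_Use_Opt_Out__c": True},
    {"Id": "003"},
]
eligible = training_records(customers)
```

Treating a missing value as consent is a design choice of this example; a stricter policy could exclude records where the field is absent.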

149

MCQhard

A healthcare organization uses Salesforce to develop an AI model for patient readmission prediction. They must comply with HIPAA regulations. The dataset includes patient names, addresses, medical record numbers, and detailed clinical notes. The data scientist plans to train a supervised model using historical readmission outcomes. What is the most important data governance step before model training?

A.Use only aggregated data that does not include any patient-level details.

B.De-identify all protected health information (PHI) by removing or masking identifiers.

C.Obtain written patient consent for every record used in training.

D.Store the data in a separate, encrypted environment with access controls.

AnswerB

De-identification ensures compliance with HIPAA and protects patient privacy, allowing safe use of data for AI.

Why this answer

Option B is correct because HIPAA mandates that protected health information (PHI) must be de-identified before it can be used for model training without patient authorization. Removing or masking identifiers such as names, addresses, and medical record numbers ensures the dataset no longer contains individually identifiable information, allowing the organization to comply with the HIPAA Privacy Rule while still using clinical notes for predictive modeling.

Exam trap

Salesforce often tests the misconception that security controls like encryption or access controls alone satisfy HIPAA compliance, when in fact de-identification is the primary requirement for using PHI in AI model training without patient consent.

How to eliminate wrong answers

Option A is wrong because using only aggregated data would remove the granular patient-level details needed to train a supervised model for readmission prediction, which requires individual outcomes to learn patterns. Option C is wrong because obtaining written patient consent for every record is impractical for large historical datasets and is not required under HIPAA if PHI is properly de-identified. Option D is wrong because while storing data in an encrypted environment with access controls is a good security practice, it does not address the core HIPAA requirement to de-identify PHI before using it for model training; encryption alone does not make data non-PHI.

150

MCQeasy

When using Einstein Lead Scoring, which data source is most critical for generating accurate lead scores?

A.Lead source (e.g., Webinar, Trade Show)

B.Lead field update timestamps

C.Email open rates from marketing campaigns

D.Converted lead records with attached opportunities

AnswerD

The model learns from past conversions; opportunities show which leads actually became customers.

Why this answer

Converted lead records with attached opportunities are the most critical data source because Einstein Lead Scoring uses supervised machine learning to analyze historical patterns in leads that successfully converted into opportunities. By training on these converted records, the model learns which lead attributes and behaviors are predictive of conversion, enabling it to assign accurate scores to new leads. Without this historical conversion data, the model lacks the ground truth needed to distinguish high-quality leads from low-quality ones.

Exam trap

Salesforce often tests the misconception that any single lead attribute (like source or email engagement) is the most critical input, when in fact the model's accuracy depends entirely on having historical conversion data to learn from.

How to eliminate wrong answers

Option A is wrong because lead source is just one of many features Einstein Lead Scoring evaluates, but it is not the most critical data source; the model requires historical conversion outcomes to weight such features properly. Option B is wrong because lead field update timestamps indicate recency of activity but do not provide the conversion outcome data needed to train the predictive model. Option C is wrong because email open rates from marketing campaigns are behavioral signals that can be used as features, but they are insufficient without converted lead records to establish which behaviors actually correlate with successful conversions.
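
The role of conversion history as ground truth can be sketched as label construction for a supervised model. The Python example below is a simplified illustration, not a description of how Einstein Lead Scoring works internally; it assumes exported Lead records carrying the standard `IsConverted` and `ConvertedOpportunityId` fields, and the helper names are hypothetical.

```python
def conversion_label(lead):
    """1 if the lead converted and has an attached opportunity, else 0."""
    converted = lead.get("IsConverted") and lead.get("ConvertedOpportunityId")
    return 1 if converted else 0


def build_training_set(leads, feature_fields):
    """Pair each lead's selected features with its conversion label."""
    rows = []
    for lead in leads:
        features = {f: lead.get(f) for f in feature_fields}
        rows.append((features, conversion_label(lead)))
    return rows


# Hypothetical export: one converted lead with an opportunity, one not.
leads = [
    {"LeadSource": "Webinar", "IsConverted": True,
     "ConvertedOpportunityId": "006A"},
    {"LeadSource": "Trade Show", "IsConverted": False,
     "ConvertedOpportunityId": None},
]
training = build_training_set(leads, ["LeadSource"])
```

Features such as lead source only become informative once they are paired with these labels, which is why the converted records matter most.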
