This chapter covers the foundational concepts of features, labels, and training data in machine learning, which are essential for the AI-900 exam. Understanding these concepts is critical because they form the basis of supervised learning, a major topic in Objective 2.1. Approximately 15-20% of exam questions touch on data preparation, feature engineering, and the distinction between features and labels. By mastering this chapter, you will be able to answer questions about how data is structured for training, the types of features, and the importance of data quality.
Imagine you are teaching a child to recognize different types of fruit. You have a set of flashcards, each showing a picture of a fruit. On the back of each card, you write the name of the fruit. When you show the card to the child, you cover the name and ask them to guess. If they guess correctly, you reward them; if not, you show the correct name. Over time, the child learns to associate visual features (color, shape, texture) with the correct label. Here, the pictures are the 'features' (input data), the names are the 'labels' (correct output), and the set of flashcards is the 'training data'. The child's learning process is analogous to training a machine learning model: the model processes features, makes predictions, compares them to labels, and adjusts its internal parameters to reduce errors. Just as a child needs many varied examples to generalize, a machine learning model needs a diverse, representative training dataset to perform well on new, unseen data. If you only show red apples, the child might conclude that all fruits are red and round: the examples were not representative, and memorizing the quirks of such a narrow set instead of the general pattern is what overfitting looks like. If the child learns only a rule that is too simple (for example, 'anything round is an apple'), even the flashcards themselves are guessed poorly; this is underfitting. The goal is to prepare the child (model) to correctly identify any fruit (new data) by learning the true relationship between features and labels.
What Are Features, Labels, and Training Data?
In machine learning, particularly supervised learning, the model learns a mapping from inputs to outputs. The inputs are called features (also known as independent variables, predictors, or attributes), and the outputs are called labels (also known as targets or dependent variables). The training data is a collection of feature-label pairs used to train the model.
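As a minimal sketch of this structure, the snippet below separates a few made-up housing records into feature vectors and labels. The column names (`square_feet`, `bedrooms`, `sale_price`), the values, and the helper `split_features_and_label` are all hypothetical and chosen only for illustration.

```python
# Hypothetical housing records: every value here is made up for illustration.
records = [
    {"square_feet": 1400, "bedrooms": 3, "sale_price": 250000},
    {"square_feet": 2000, "bedrooms": 4, "sale_price": 340000},
    {"square_feet": 900, "bedrooms": 2, "sale_price": 175000},
]


def split_features_and_label(rows, label_column):
    """Return (feature_names, feature_vectors, labels) for a list of dict rows."""
    feature_names = [name for name in rows[0] if name != label_column]
    # Each feature vector holds the input values for one row, in a fixed order.
    features = [[row[name] for name in feature_names] for row in rows]
    # The label is the single value the model should learn to predict.
    labels = [row[label_column] for row in rows]
    return feature_names, features, labels


feature_names, X, y = split_features_and_label(records, "sale_price")
```

Each entry of `X` paired with the matching entry of `y` is one feature-label pair of the training data.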
Why This Exists: Supervised learning aims to generalize from labeled examples to predict labels for new, unseen data. Without clear separation of features and labels, the model cannot learn the relationship. The training data must be representative of real-world scenarios to avoid bias.
How It Works Internally
The process involves several steps:
1. Data Collection: Gather raw data from sources (databases, sensors, logs).
2. Data Labeling: Assign ground-truth labels to each instance. For example, in a spam detection dataset, each email is labeled 'spam' or 'not spam'.
3. Feature Engineering: Select and transform raw data into meaningful features. For text data, features might be word frequencies; for images, pixel values or edges.
4. Splitting Data: Divide the dataset into training, validation, and test sets (commonly 60/20/20 or 80/10/10).
5. Training: The model iterates over training data, making predictions, computing loss (difference between prediction and label), and updating parameters via optimization (e.g., gradient descent).
6. Evaluation: Use validation set to tune hyperparameters; test set for final performance.
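The training step (predict, compute loss, update parameters) can be sketched with a one-feature linear model fitted by gradient descent. This is only an illustrative sketch: the function name `train_linear_model`, the learning rate, the epoch count, and the toy data are assumptions, and real projects would normally rely on a library rather than hand-written loops.

```python
def train_linear_model(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Predictions with the current parameters.
        preds = [w * x + b for x in xs]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Move the parameters a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy training data generated from y = 2x + 1 (an assumption for this demo).
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = train_linear_model(train_x, train_y)
```

After training, `w` and `b` should be very close to 2 and 1, the values used to generate the toy labels.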
Key Components
- Feature Types:
  - Numerical: Continuous (e.g., temperature, price) or discrete (e.g., count of items).
  - Categorical: Nominal (e.g., color: red, blue) or ordinal (e.g., rating: low, medium, high).
  - Text: Bag-of-words, TF-IDF, word embeddings.
  - Image: Pixel values, histograms, deep features.
  - Time Series: Lagged values, rolling statistics.
- Label Types:
  - Binary: Two classes (e.g., yes/no, spam/not spam).
  - Multiclass: More than two classes (e.g., digit recognition 0-9).
  - Regression: Continuous value (e.g., house price).
- Data Quality Dimensions:
  - Completeness: Missing values (handled by imputation or removal).
  - Consistency: No contradictions (e.g., same feature values leading to different labels).
  - Accuracy: Correct labels (label noise reduces performance).
  - Relevance: Features should be predictive; irrelevant features add noise.
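To make the completeness point concrete, here is a small sketch of mean and median imputation for a numerical column. The helper `impute` and the `ages` values are hypothetical; missing entries are represented as `None` purely for this example.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Made-up column with two missing entries.
ages = [25, None, 35, 40, None]
ages_median = impute(ages, strategy="median")
ages_mean = impute(ages, strategy="mean")
```

The median is less sensitive to outliers than the mean, which is one reason the choice of strategy depends on the data.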
Default Values and Best Practices
Train/Test Split: Common split is 80% training, 20% test. For large datasets, 90/10 is acceptable.
Validation Set: Often 10-20% of training data is held out for validation. Cross-validation (e.g., 5-fold) is used when data is scarce.
Feature Scaling: For distance-based algorithms (SVM, k-NN), features should be standardized (zero mean, unit variance) or normalized to the range [0,1].
Handling Missing Values: Options: remove rows, impute with mean/median/mode, or use model-based imputation.
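The k-fold cross-validation mentioned above can be sketched as a generator of index splits. This is an assumption-laden illustration: `k_fold_indices` is a hypothetical helper, and it does not shuffle, whereas in practice the rows are usually shuffled first (unless the data is time-ordered).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Ten samples, five folds: every sample is used for validation exactly once.
folds = list(k_fold_indices(n_samples=10, k=5))
```

Each model is trained on four folds and validated on the remaining one, and the five validation scores are then averaged.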
Configuration and Verification in Azure
In Azure Machine Learning, you can create datasets and define feature columns.
Creating a Dataset:
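from azureml.core import Dataset, Workspace

# Assumption: a config.json for your workspace is available locally
ws = Workspace.from_config()

# Create a tabular dataset from a CSV file in Blob Storage
dataset = Dataset.Tabular.from_delimited_files(path='https://mystorage.blob.core.windows.net/data/training.csv')
# Register the dataset so it can be retrieved by name later
dataset = dataset.register(workspace=ws, name='training_data', create_new_version=True)
Defining Features and Labels: In an AutoML run, you specify the label column:
automl_settings = {
    'task': 'classification',
    'primary_metric': 'accuracy',
    'training_data': dataset,
    'label_column_name': 'target',  # the label; the remaining columns are treated as features
    'n_cross_validations': 5
}
Verifying Data Quality:
Use DatasetProfile to generate statistics:
# Retrieves the profile from the latest profile run for this dataset
profile = dataset.get_profile(workspace=ws)
print(profile)
Interaction with Related Technologies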
Feature Engineering: Often performed with Azure Data Factory or Databricks for preprocessing.
Data Labeling: Azure Machine Learning data labeling tools assist with manual or active learning labeling.
Model Training: Features and labels are fed into algorithms like Logistic Regression, Decision Trees, or Neural Networks.
Evaluation: Metrics like accuracy, precision, recall, F1-score for classification; RMSE, MAE for regression.
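The evaluation metrics named above can be computed by hand on small examples, which helps when checking exam-style calculations. The sketch below uses hypothetical helper names and made-up predictions; it treats label `1` as the positive class by default.

```python
import math


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """Return mean absolute error and root mean squared error."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}


# Made-up labels and predictions for illustration.
clf = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
reg = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```

In this example there are 3 true positives, 1 false positive, and 1 false negative, so accuracy, precision, recall, and F1 all come out to 0.75.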
Common Pitfalls
Leakage: Using features that are not available at prediction time (e.g., future data).
Imbalanced Data: One label dominates; use resampling or weighted loss.
Overfitting: Model memorizes training data; use regularization, cross-validation, or simpler models.
Underfitting: Model too simple; add more features or increase model complexity.
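For the imbalanced-data pitfall, one common way to derive class weights for a weighted loss is an inverse-frequency heuristic. The sketch below is illustrative only: the helper name `balanced_class_weights` and the 95/5 split are assumptions, not values from any real dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up label column in which the 'fraud' class is rare.
labels = ["legitimate"] * 95 + ["fraud"] * 5
weights = balanced_class_weights(labels)
```

The rare class receives a weight of 10.0 and the common class roughly 0.53, so errors on rare examples count more during training.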
Exam-Relevant Details
AI-900 Objective 2.1: Identify features and labels in a dataset.
Common Question: Given a dataset, which column is the feature and which is the label?
Trap: Candidates often confuse features with labels. Remember: features are the inputs; the label is the output.
Key Terms: 'feature vector', 'ground truth', 'training set', 'test set'.
Collect Raw Data
Gather data from relevant sources. For example, in a housing price prediction scenario, collect data on house size, number of bedrooms, location, and sale price. Ensure data is representative of the problem domain. In Azure, data can be ingested from Blob Storage, SQL Database, or external sources via Azure Data Factory.
Label the Data
Assign ground-truth labels to each instance. This may require manual effort or automated labeling. For instance, in a sentiment analysis task, each review is labeled as positive, negative, or neutral. Labeling must be accurate; noisy labels degrade model performance. Azure Machine Learning provides a data labeling tool for this step.
Select and Engineer Features
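Choose relevant attributes from raw data. Numerical data can often be used as-is (scaled where the algorithm requires it); categorical data is encoded using one-hot encoding, or label encoding when the categories have a natural order. Create new features if needed, e.g., interaction terms or polynomial features. In Azure, you can use the data transformation components in Designer (for example, Normalize Data or Select Columns in Dataset) or custom Python scripts.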
Split Data into Training and Test Sets
Divide the dataset into training (e.g., 80%) and test (20%) sets. Optionally, create a validation set from training data. The split should be random, and stratified if classes are imbalanced so that each set keeps the same class proportions; time-ordered data is the exception and should be split chronologically. In Azure AutoML, you can specify cross-validation folds instead of a fixed split.
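A stratified 80/20 split can be sketched as follows. The helper `stratified_split`, the seed, and the 10/90 class counts are hypothetical; the idea is simply to sample the test fraction separately within each class.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        # Shuffle within the class, then carve off the test share of that class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Made-up labels: 10% of customers churned.
labels = ["churned"] * 10 + ["stayed"] * 90
train_idx, test_idx = stratified_split(labels)
```

Both resulting sets keep the original 10% churn rate, which a purely random split of a small dataset would not guarantee.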
Train the Model on Training Data
Feed the training features and labels into a machine learning algorithm. The model learns the mapping by minimizing a loss function. During training, the model updates its parameters iteratively. In Azure, you can use AutoML to automatically try multiple algorithms and hyperparameters.
Enterprise Scenario 1: Credit Card Fraud Detection
A financial institution wants to detect fraudulent transactions in real time. They collect historical transaction data including amount, merchant category, time, and location. The label is 'fraud' or 'legitimate'. Features include transaction amount, time since last transaction, and merchant type. The training data consists of millions of transactions, but fraud is rare (~0.1%). The challenge is class imbalance. They use resampling techniques (SMOTE) and cost-sensitive learning. Misconfiguration: if they train on raw data without scaling, models like SVM perform poorly. They must also avoid data leakage, such as using future information like 'is_flagged_later' as a feature. Production deployment requires low latency; they use Azure Machine Learning to deploy a model as a real-time endpoint.
Enterprise Scenario 2: Predictive Maintenance for Manufacturing
A factory wants to predict equipment failures before they occur. They collect sensor data (temperature, vibration, pressure) every second. The label is 'failure' or 'no failure' within a future window (e.g., next 24 hours). Features include rolling averages, standard deviations, and Fourier transforms. The training data spans months. A common mistake is deriving features and labels from the same sensor readings without a proper time-based split, which causes leakage. They use Azure Data Lake for storage and Azure Databricks for feature engineering. Model performance is measured by recall (catching failures) even if precision is lower. Misconfiguration: using a simple threshold on temperature without considering other features leads to many false positives.
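The rolling-average features and time-based split from this scenario can be sketched as below. The readings, window size, and helper names are hypothetical; the point is that each feature uses only readings from before the time step it describes, and that the test rows come strictly after the training rows.

```python
def rolling_mean_features(readings, window):
    """For each time step t >= window, average the previous `window` readings.

    Only readings strictly before t are used, so the feature never sees the
    value at t or anything later.
    """
    features = []
    for t in range(window, len(readings)):
        past = readings[t - window:t]
        features.append(sum(past) / window)
    return features


def time_based_split(rows, train_fraction=0.8):
    """Split chronologically ordered rows: the earliest part trains, the latest tests."""
    cutoff = int(len(rows) * train_fraction)
    return rows[:cutoff], rows[cutoff:]


# Made-up temperature readings in chronological order.
temperatures = [70, 71, 73, 76, 80, 85, 91, 98, 106, 115, 125, 136]
features = rolling_mean_features(temperatures, window=2)
train_rows, test_rows = time_based_split(features)
```

A random shuffle before splitting would let the model train on readings recorded after some of the test rows, which is the leakage this scenario warns about.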
Enterprise Scenario 3: Customer Churn Prediction
A telecom company wants to identify customers likely to cancel their subscription. They have customer demographics, usage patterns, support calls, and contract length. The label is 'churned' or 'not churned' within the next month. Features include average monthly bill, number of complaints, and tenure. The dataset has 500,000 customers with 10% churn. They use Azure Machine Learning to train a boosted decision tree. A common pitfall is including the contract end date as a feature: it directly indicates churn and causes leakage. They must also handle missing values for features like 'number of complaints' (impute with 0). Production deployment uses batch inference to generate a list of at-risk customers weekly.
AI-900 Exam Focus: Features, Labels, and Training Data
This topic is tested under Objective 2.1: 'Identify features and labels in a dataset for a given scenario.' Expect 2-3 questions directly on this. The exam may present a table or description of a dataset and ask which column is the label or which are features.
Common Wrong Answers and Why Candidates Choose Them:
1. Confusing features and labels. Candidates often think the label is an input feature. For example, in a dataset of house prices, they might select 'price' as a feature instead of the label. Remember: the label is what you want to predict; features are what you use to predict.
2. Selecting irrelevant columns as features. The exam may include an ID column (e.g., customer ID) and ask which are features. ID columns are not predictive and should be excluded. Candidates may include them, thinking more data is better.
3. Assuming all numerical columns are features. A column like 'transaction_date' stored as a raw timestamp is not useful without feature engineering (e.g., extracting day of week). The exam expects you to recognize that raw timestamps are often not used directly.
4. Thinking training and test data must have the same label distribution. While stratified splitting helps, it's not a requirement. The test set should be representative, but the model can still learn from imbalanced training data.
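To illustrate the timestamp point in item 3, the sketch below derives a few usable features from a raw timestamp string. The helper `timestamp_features`, the chosen output fields, and the example date are assumptions for illustration.

```python
from datetime import datetime


def timestamp_features(iso_timestamp):
    """Turn a raw ISO-8601 timestamp into simple engineered features."""
    moment = datetime.fromisoformat(iso_timestamp)
    return {
        "day_of_week": moment.weekday(),  # Monday is 0, Sunday is 6
        "hour": moment.hour,
        "is_weekend": moment.weekday() >= 5,
    }


# 16 March 2024 was a Saturday.
features = timestamp_features("2024-03-16T14:30:00")
```

A model can learn from 'purchases on weekend afternoons' far more easily than from a raw timestamp value.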
Specific Numbers and Terms on the Exam:
- Common split ratio: 80/20 for train/test.
- Terms: 'ground truth', 'feature vector', 'label', 'target variable'.
- Types of features: numerical, categorical, text, image.
- The exam may ask about 'feature engineering' as a step to create new features.
Edge Cases:
- Missing values: The exam may ask how to handle them. Common answers: remove rows, impute with mean/median/mode. The best approach depends on context.
- Data leakage: The exam will test scenarios where features contain information from the future (e.g., using 'total_charges' which is calculated after the prediction period).
Eliminating Wrong Answers:
- If a question asks for the label, look for the column that represents the outcome or target.
- If a question asks for features, exclude the label column and any identifier columns.
- For data quality, consider completeness, consistency, accuracy, and relevance.
Features are input variables; labels are target outputs in supervised learning.
Training data consists of feature-label pairs used to train the model.
Common train/test split: 80% training, 20% test.
ID columns are not features; they should be excluded.
Data quality: completeness, consistency, accuracy, relevance.
Feature engineering transforms raw data into useful features.
Data leakage occurs when features contain information from the future.
Imbalanced data requires special handling (resampling, class weights).
Azure ML provides tools for data labeling, feature engineering, and dataset management.
The test set must never be used for training or hyperparameter tuning.
These come up on the exam all the time. Here's how to tell them apart.
Numerical Features
Represent quantities (e.g., age, salary).
Can be used directly in most algorithms.
Require scaling for distance-based models.
Examples: temperature, price, count.
Outliers can skew model training.
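The two scaling approaches for numerical features can be sketched as follows. The function names and the `incomes` values are hypothetical, and the sketch assumes the column is not constant (otherwise the divisions below would fail).

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Made-up numerical feature with a wide range.
incomes = [30000, 45000, 60000, 90000]
z_scores = standardize(incomes)
scaled = min_max_scale(incomes)
```

In a real workflow the mean, standard deviation, minimum, and maximum would be computed on the training set only and then reused for the validation and test sets.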
Categorical Features
Represent categories (e.g., color, country).
Must be encoded (e.g., one-hot encoding).
Ordinal features preserve order (e.g., rating: low<medium<high).
High cardinality (many unique values) can cause issues.
Encoding increases dimensionality.
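As a rough illustration of the two encodings, the sketch below one-hot encodes a nominal feature and maps an ordinal feature to ranks. The helper names and example values are hypothetical.

```python
def one_hot_encode(values):
    """Map each nominal value to a 0/1 vector with one slot per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


def ordinal_encode(values, order):
    """Map each ordinal value to its rank in the given order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


# Nominal feature: no order between categories, so one column per category.
colors = ["red", "blue", "green", "red"]
categories, encoded_colors = one_hot_encode(colors)

# Ordinal feature: the order low < medium < high is preserved as 0 < 1 < 2.
ratings = ordinal_encode(["low", "high", "medium"], order=["low", "medium", "high"])
```

Three colors become three columns, which shows why one-hot encoding a high-cardinality feature can inflate dimensionality quickly.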
Mistake
More features always improve model accuracy.
Correct
Adding irrelevant or redundant features can introduce noise and cause overfitting. Feature selection or dimensionality reduction (e.g., PCA) is often needed. The curse of dimensionality means that with too many features, data becomes sparse and models fail to generalize.
Mistake
Labels must be numerical values.
Correct
Labels can be categorical (e.g., 'cat', 'dog') for classification tasks. They are often encoded as integers (0,1,2) but the original values are non-numerical. For regression, labels are continuous numbers.
Mistake
Training data must be perfectly balanced.
Correct
While balanced data helps, many real-world datasets are imbalanced. Techniques like resampling, class weights, or using appropriate metrics (e.g., F1-score) can mitigate the issue. The exam expects you to know that imbalanced data is common and handled specially.
Mistake
The test set is used to train the model.
Correct
The test set is only used for final evaluation, never for training. Using it for training would give an overly optimistic performance estimate. The validation set is used for tuning.
Mistake
Features and labels are the same thing in unsupervised learning.
Correct
In unsupervised learning, there are no labels. The model learns patterns from features alone (e.g., clustering). The AI-900 exam covers both supervised and unsupervised learning; you must be able to distinguish them.
A feature is an input variable used to make predictions, while a label is the output or target that the model predicts. In a housing price dataset, features include square footage and number of bedrooms; the label is the sale price. For the AI-900 exam, you must be able to identify which column is the label in a given dataset.
Common approaches: remove rows with missing values (if few), impute with mean/median/mode (for numerical), or use model-based imputation. The choice depends on the amount of missing data and the algorithm. In Azure ML, you can use the 'Clean Missing Data' module. The exam expects you to know that imputation is a common technique.
Data leakage occurs when features contain information that would not be available at prediction time. For example, using 'total_charges' that includes future months. This leads to overly optimistic performance during training but poor results in production. Prevent leakage by ensuring features are derived only from past data.
Not every algorithm needs feature scaling. Tree-based algorithms (decision trees, random forests) are not affected by feature scale. However, distance-based algorithms (k-NN, SVM) require scaling to prevent features with larger ranges from dominating, and neural networks generally train more reliably on scaled inputs. Common scaling methods: standardization (zero mean, unit variance) or normalization (min-max scaling).
A validation set is used to tune hyperparameters and select the best model during training, without touching the test set. It helps prevent overfitting to the training data. In k-fold cross-validation, the training data is split into k folds, each used as validation once. The exam may ask about splitting strategies.
The same data should not be used for both training and testing, because that gives a biased (overly optimistic) estimate of performance. Always hold out a separate test set. Common splits: 80/20 or 70/30. In Azure AutoML, you specify a validation set or cross-validation.
Numerical (continuous or discrete), categorical (nominal or ordinal), text, image, and time series. Feature engineering may create new features like polynomial terms or aggregates. The AI-900 exam expects you to recognize feature types and appropriate encoding.