A machine learning engineer is exploring a dataset with 500 features and 10,000 samples. To reduce dimensionality for visualization, which technique is most suitable if the goal is to preserve global data structure?
PCA preserves global variance (covariance structure).
Why this answer
PCA is the most suitable technique here because it is a linear method: it projects the data onto orthogonal principal components chosen to maximize the retained variance, so the projection reflects the overall covariance structure of the 500 features. Because the projection is linear, large-scale relationships such as the relative distances between clusters are preserved as well as two or three dimensions allow, whereas nonlinear neighbor-embedding methods can stretch or compress them. How faithful the picture is depends on how much of the total variance the first few components explain, but for seeing broad patterns in high-dimensional data PCA is the natural first choice.
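As a rough illustration of the projection described above, here is a minimal sketch of PCA via singular value decomposition using only NumPy. The data is a synthetic stand-in (three Gaussian clusters with the question's 10,000 x 500 shape), and the helper name `pca_project` is hypothetical, not part of any library.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X onto its top principal components (hypothetical helper)."""
    # PCA works on mean-centered data.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the orthogonal principal directions, ordered by variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Variance captured by each component, as a fraction of the total.
    explained_variance = S**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    Z = X_centered @ Vt[:n_components].T
    return Z, ratio[:n_components]

rng = np.random.default_rng(0)
# Assumed synthetic data: three clusters in 500 dimensions, 10,000 samples.
centers = rng.normal(scale=10.0, size=(3, 500))
labels = rng.integers(0, 3, size=10_000)
X = centers[labels] + rng.normal(size=(10_000, 500))

Z, ratio = pca_project(X)
print(Z.shape)   # 2-D coordinates ready for a scatter plot
print(ratio)     # share of total variance kept by each component
```

The explained-variance ratio is worth checking before trusting a 2-D plot: if the first two components capture only a small share of the variance, the plot shows only a small part of the global structure.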
Exam trap
Cisco often tests the misconception that nonlinear methods like t-SNE or UMAP are always better for visualization. Those methods trade global structure for local detail, so PCA is the correct choice when the question explicitly says 'preserve global data structure.'
How to eliminate wrong answers
Option A is wrong because t-SNE is a nonlinear technique that focuses on preserving local neighborhoods and pairwise similarities. It often distorts global structure (e.g., cluster sizes and the distances between clusters) to produce visually separable clusters.
Option B is wrong because LLE is a nonlinear manifold learning method that preserves local linear relationships between neighbors. It does not guarantee preservation of global structure, and its nearest-neighbor reconstructions can become unreliable in high-dimensional spaces such as these 500 features, where distances tend to concentrate.
Option D is wrong because UMAP, while typically faster than t-SNE, is also a nonlinear technique. It preserves local structure and some global structure but prioritizes topological connectivity over global variance, making it less suitable than PCA when the explicit goal is to maintain the overall covariance structure and global distances.
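One way to check the "global structure" claim for any embedding is to compare the distances between cluster centroids before and after the reduction. The sketch below is illustrative only: `centroid_distance_correlation` is a hypothetical helper, and the data is synthetic, with four cluster centers placed in a 2-D plane embedded in 50 dimensions, which is a favorable case for PCA by construction. The same function could be applied to the output of t-SNE, LLE, or UMAP for comparison.

```python
import numpy as np

def centroid_distance_correlation(X, Z, labels):
    """Pearson correlation between pairwise centroid distances in X and in Z."""
    groups = np.unique(labels)
    cx = np.stack([X[labels == g].mean(axis=0) for g in groups])
    cz = np.stack([Z[labels == g].mean(axis=0) for g in groups])
    # Upper triangle picks each unordered pair of clusters once.
    iu = np.triu_indices(len(groups), k=1)
    dx = np.linalg.norm(cx[:, None] - cx[None, :], axis=-1)[iu]
    dz = np.linalg.norm(cz[:, None] - cz[None, :], axis=-1)[iu]
    return np.corrcoef(dx, dz)[0, 1]

rng = np.random.default_rng(0)
# Assumed synthetic data: centers at unequal spacings in a 2-D plane.
plane_coords = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 30.0], [40.0, 40.0]])
basis, _ = np.linalg.qr(rng.normal(size=(50, 2)))  # orthonormal columns
centers = plane_coords @ basis.T
labels = rng.integers(0, 4, size=2_000)
X = centers[labels] + rng.normal(size=(2_000, 50))

# PCA to two components via SVD of the centered data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T

score = centroid_distance_correlation(X, Z, labels)
print(round(score, 3))  # close to 1 means inter-cluster distances kept their ordering and scale
```

A score near 1 indicates that clusters far apart in the original space remain far apart in the plot. A noticeably lower score for another embedding of the same data would suggest that its inter-cluster distances should not be read literally.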