MLA-C01 Data Preparation for Machine Learning • Complete Question Bank
Complete MLA-C01 Data Preparation for Machine Learning question bank with answers and detailed explanations.
Question 1 (easy, multiple choice)
Refer to the exhibit. A data scientist runs the following AWS Glue ETL job script (Spark) to prepare data for ML:
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/input/"]},
    format="csv",
    format_options={"withHeader": True},
)
applymapping = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("id", "int", "id", "int"),
        ("value", "string", "value", "double"),
    ],
)
...
```
The job fails with an error: "Job run failed: org.apache.spark.SparkException: Job aborted due to stage failure: Task failed while writing rows." What is the most likely cause of this error?

Refer to the exhibit. A data scientist is trying to run a SageMaker Processing job that reads data from an S3 bucket. The IAM role attached to the processing job has the following policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}
```
The job fails with an error: "Access Denied" when trying to list objects. What is the root cause?

Refer to the exhibit. A data scientist uses the following SageMaker Feature Store feature definition (using the Boto3 SDK) to create a feature group:
```python
import boto3
sagemaker = boto3.client('sagemaker', region_name='us-east-1')
response = sagemaker.create_feature_group(
    FeatureGroupName='my-feature-group',
    RecordIdentifierFeatureName='customer_id',
    EventTimeFeatureName='timestamp',
    FeatureDefinitions=[
        {'FeatureName': 'customer_id', 'FeatureType': 'String'},
        {'FeatureName': 'age', 'FeatureType': 'String'},
        {'FeatureName': 'income', 'FeatureType': 'Fractional'}
    ],
    OnlineStoreConfig={'EnableOnlineStore': True},
    RoleArn='arn:aws:iam::123456789012:role/SageMakerRole'
)
```
The data scientist later tries to ingest data with an 'age' column containing integer values. The ingestion fails. What is the most likely reason?

```json
{
  "ProcessingResources": {
    "ClusterConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.m5.large",
      "VolumeSizeInGB": 30
    }
  },
  "AppSpecification": {
    "ImageUri": "123456789012.dkr.ecr.us-west-2.amazonaws.com/my-custom-image:latest",
    "ContainerEntrypoint": ["python", "process.py"]
  },
  "RoleArn": "arn:aws:iam::123456789012:role/SageMakerProcessingRole",
  "ProcessingInputs": [
    {
      "InputName": "input-1",
      "S3Input": {
        "S3Uri": "s3://my-bucket/input/data.csv",
        "LocalPath": "/opt/ml/processing/input",
        "S3DataType": "S3Prefix",
        "S3InputMode": "File",
        "S3DataDistributionType": "FullyReplicated",
        "S3CompressionType": "None"
      }
    }
  ],
  "ProcessingOutputConfig": {
    "Outputs": [
      {
        "OutputName": "output-1",
        "S3Output": {
          "S3Uri": "s3://my-bucket/output/",
          "LocalPath": "/opt/ml/processing/output",
          "S3UploadMode": "EndOfJob"
        }
      }
    ]
  }
}
```

An AWS Glue job fails with the following error from the CloudWatch logs: "Conversion error: Unable to convert column 'price' from String to Double for some rows."
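The error above names the column but not the offending rows. The following is a minimal sketch, in plain Python rather than Glue, of how one might locate `price` values that cannot be parsed as a double before running the job. The function name `find_unconvertible`, the sample rows, and the decision to count empty strings as failures are illustrative assumptions, not part of the original job. Python's `float` parsing rules are also not identical to Spark's cast, so treat the result as a first-pass check only.

```python
import csv
import io


def find_unconvertible(csv_text, column):
    """Return (row_number, raw_value) pairs whose value cannot be parsed as a float.

    Row numbers are 1-based and count data rows only (the header is excluded).
    """
    bad_rows = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        raw = row[column]
        try:
            float(raw)
        except (TypeError, ValueError):
            # TypeError covers a missing field (None); ValueError covers bad text.
            bad_rows.append((row_number, raw))
    return bad_rows


# Hypothetical sample data; the real input would come from the job's S3 source.
SAMPLE = "id,price\n1,19.99\n2,N/A\n3,\n4,7\n"

bad = find_unconvertible(SAMPLE, "price")
print(bad)
```

Listing the failing rows first makes it easier to decide whether they should be cleaned at the source, mapped to nulls, or filtered out.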
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::ml-data-bucket/*"
    }
  ]
}
```

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
raw = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/input/year=2023/month=01/"]},
    format="json")
transformed = raw.select_fields(["col1", "col2"]).rename_field("col1", "new_col")
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://bucket/output/transformed"},
    format="parquet")
job.commit()
```

```text
ProcessingJobError: Execution failed
Error: Traceback (most recent call last):
  File "/opt/ml/processing/input/code/preprocess.py", line 45, in <module>
    df['age'] = df['age'].apply(float)
ValueError: could not convert string to float: 'twenty-five'
```
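
The traceback shows `float` raising on the word 'twenty-five'. One possible way to keep a preprocessing script from aborting on such values is to convert defensively and mark failures as missing. This is only a sketch: the helper name `to_float_or_none` is hypothetical, the sample values are made up, and returning `None` for unparseable input is an assumed policy. Whether to drop, impute, or reject such rows depends on the dataset.

```python
def to_float_or_none(value):
    """Convert value to float, returning None when it is missing or unparseable."""
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Hypothetical values standing in for the 'age' column from the traceback.
ages = ["31", "twenty-five", 42, None, " 18.5 "]
cleaned = [to_float_or_none(a) for a in ages]
print(cleaned)
```

In a pandas script like the one in the traceback, the same helper could be passed to `df['age'].apply(...)` in place of `float`. `pandas.to_numeric` with `errors="coerce"` is a built-in alternative that yields NaN for values it cannot parse.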