PDE Designing data processing systems • Complete Question Bank
Complete PDE Designing data processing systems question bank, with answers and detailed explanations.
Refer to the exhibit.
```
resource.type="dataflow_step"
resource.labels.job_id="2023-01-01_000000-12345678"
"worker pool exhausted"
```
Refer to the exhibit.
```
{
  "bindings": [
    {
      "role": "roles/bigquery.dataViewer",
      "members": [
        "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
      ]
    }
  ]
}
```
Refer to the exhibit.
```
# Dataflow pipeline error log:
Workflow failed. Causes: S02:ReadPubSub/Read+Transform/ParDo(ExtractTimestamps)+ ... (4b9c3d2e) The job failed because a worker experienced a "out of memory" error.
```
Pipeline configuration:
- Streaming engine: disabled
- Worker machine type: n1-standard-4 (4 vCPU, 15 GB memory)
- Number of workers: 2 (autoscaling enabled, max 10)
- Input: Pub/Sub topic with 1000 messages/sec, each message ~50 KB
- Transform: Parse JSON, enrich with external API call, window into 1-minute fixed windows, write to BigQuery
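As a rough sanity check on the pipeline configuration above, the input volume can be estimated from the stated message rate and message size. The sketch below is back-of-the-envelope arithmetic only. It assumes decimal units (1 MB = 1000 KB) and ignores JSON parsing overhead, the external API responses, and any internal buffering, so real memory pressure on a worker may look quite different.

```python
# Figures taken from the pipeline configuration in the exhibit.
MESSAGES_PER_SEC = 1000
MESSAGE_SIZE_KB = 50
WINDOW_SECONDS = 60
INITIAL_WORKERS = 2

# Assumption: decimal units (1 MB = 1000 KB, 1 GB = 1000 MB).
ingest_mb_per_sec = MESSAGES_PER_SEC * MESSAGE_SIZE_KB / 1000
window_gb = ingest_mb_per_sec * WINDOW_SECONDS / 1000
per_worker_window_gb = window_gb / INITIAL_WORKERS

print(f"Ingest rate: {ingest_mb_per_sec:.0f} MB/s")
print(f"Raw data per 1-minute window: {window_gb:.1f} GB")
print(f"Per worker at {INITIAL_WORKERS} workers: {per_worker_window_gb:.1f} GB")
```

These numbers cover raw payload bytes only; deserialized objects typically occupy more memory than their serialized form.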
Drag a concept onto its matching description — or click a concept then click the description.
Serverless data warehouse for analytics
Object storage for unstructured data
Globally distributed relational database
NoSQL wide-column database for low-latency workloads
Asynchronous messaging service for event-driven systems
Drag a concept onto its matching description — or click a concept then click the description.
Unified stream and batch processing (Apache Beam)
Managed Spark and Hadoop clusters
Workflow orchestration (Apache Airflow)
Visual data integration and pipeline builder
Drag a concept onto its matching description — or click a concept then click the description.
Metrics and alerting for cloud resources
Centralized log storage and analysis
Aggregates and analyzes application errors
Records administrative and data access activities
Refer to the exhibit.
Exhibit:
{
  "bindings": [
    {
      "role": "roles/storage.objectViewer",
      "members": ["serviceAccount:sa@project.iam.gserviceaccount.com"],
      "condition": {
        "title": "limit_time",
        "expression": "request.time < timestamp('2023-01-01T00:00:00Z')"
      }
    }
  ]
}
Refer to the exhibit.
Exhibit:
Error: Resources exceeded during query execution.
Query statement: SELECT * FROM `project.dataset.table` WHERE date >= '2023-01-01'
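The time-bound condition in the `roles/storage.objectViewer` exhibit above can be reasoned about with a plain timestamp comparison. The following is a minimal sketch that mimics the expression `request.time < timestamp('2023-01-01T00:00:00Z')` in Python; it does not call IAM, and the function name `binding_is_active` is a hypothetical helper introduced only for illustration.

```python
from datetime import datetime, timezone

# Expiry taken from the condition expression in the policy exhibit.
EXPIRY = datetime(2023, 1, 1, 0, 0, 0, tzinfo=timezone.utc)


def binding_is_active(request_time: datetime) -> bool:
    """Hypothetical helper: mirrors `request.time < timestamp(...)`."""
    # Strictly less than: a request at exactly the expiry instant fails.
    return request_time < EXPIRY


just_before = datetime(2022, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
at_expiry = datetime(2023, 1, 1, tzinfo=timezone.utc)

print(binding_is_active(just_before))  # True
print(binding_is_active(at_expiry))    # False
```

Under this reading, any request made on or after 2023-01-01T00:00:00Z would no longer satisfy the condition, so the binding would stop granting access.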
Refer to the exhibit.
Exhibit:
Pipeline description:
- Source: PubSubIO.read()
- Transform: ParDo(Process)
- Window: Window.into(FixedWindows of 1 minute)
- Transform: GroupByKey
- Sink: Write to BigQuery using StreamingInserts
- Estimated throughput: 10 MB/s
- Observed lag: increasing
Refer to the exhibit.
```sql
SELECT product_id, SUM(amount) AS total_sales
FROM sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY product_id
```
The job metadata shows: Input: 10 billion rows, Output: 500 million rows, Slot time: 20000 seconds, Elapsed time: 10 minutes, Shuffle: 100% locally, Joins: 0.
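The job metadata above can be turned into two derived figures that are often useful when reading such an exhibit: the average number of slots in use and the row reduction from input to output. This is a small arithmetic sketch using only the numbers given; it assumes the slot time and elapsed time cover the same execution interval.

```python
# Figures taken from the job metadata in the exhibit.
SLOT_TIME_SECONDS = 20_000
ELAPSED_SECONDS = 10 * 60
INPUT_ROWS = 10_000_000_000
OUTPUT_ROWS = 500_000_000

# Average concurrency: total slot-seconds spread over wall-clock seconds.
avg_slots = SLOT_TIME_SECONDS / ELAPSED_SECONDS

# How many input rows collapse into one output row, on average.
reduction_factor = INPUT_ROWS / OUTPUT_ROWS

print(f"Average slots in use: {avg_slots:.1f}")
print(f"Input rows per output row: {reduction_factor:.0f}")
```

With these inputs the query averaged roughly 33 slots and aggregated about 20 input rows into each output row.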
Refer to the exhibit.
```json
{
  "error": {
    "code": 403,
    "message": "The service account 'dataflow-sa@project.iam.gserviceaccount.com' does not have 'bigquery.tables.getData' permission on table 'project:dataset.table'.",
    "status": "PERMISSION_DENIED"
  }
}
```
This error occurs when a Dataflow pipeline tries to read from a BigQuery table.
Refer to the exhibit.
```
gcloud dataflow jobs run my-batch-job \
  --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
  --parameters inputFilePattern=gs://my-bucket/input/*.csv,outputTable=my-project:my_dataset.my_table
```
Refer to the exhibit.
```
# bigtable.gcloud
{
  "gc_rules": [
    {
      "column_family": "cf1",
      "max_age": "100s"
    }
  ]
}
# apply using gcloud bigtable app profiles create ...
```
Refer to the exhibit.
```
# Pub/Sub subscription config
projects/my-project/subscriptions/my-sub:
  topic: projects/my-project/topics/my-topic
  ackDeadlineSeconds: 10
  pushConfig:
    pushEndpoint: https://my-endpoint.example.com/push
    oidcToken:
      serviceAccountEmail: sa@my-project.iam.gserviceaccount.com
```
A data analyst frequently queries a BigQuery table that contains an array of structs representing product purchases. The query below runs slowly:
SELECT customer_id, COUNT(purchase) AS total_purchases
FROM sales, UNNEST(purchases) AS purchase
GROUP BY customer_id
What change would most improve query performance?
Refer to the exhibit. Error log from Dataflow pipeline: "java.lang.IllegalArgumentException: Unable to convert value '2024-08-15T10:23:45.123Z' for field 'timestamp' from type 'STRING' to type 'TIMESTAMP' at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.addRecordToBatch"
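The value in the error message above is an RFC 3339 style string with millisecond precision and a trailing `Z`. As an illustration of what an explicit conversion looks like, here is a minimal sketch in plain Python (not the Java Beam pipeline from the stack trace) that parses that exact string into a timezone-aware datetime. The helper name `parse_utc_timestamp` is hypothetical, and whether the pipeline should convert the value or the table schema should change is not stated in the exhibit.

```python
from datetime import datetime, timezone

# The literal value quoted in the error log.
RAW_VALUE = "2024-08-15T10:23:45.123Z"


def parse_utc_timestamp(value: str) -> datetime:
    """Hypothetical helper: parse 'YYYY-MM-DDTHH:MM:SS.fffZ' as UTC."""
    # %f accepts one to six fractional digits; the trailing 'Z' is matched
    # literally, so the UTC offset is attached explicitly afterwards.
    naive = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return naive.replace(tzinfo=timezone.utc)


parsed = parse_utc_timestamp(RAW_VALUE)
print(parsed.isoformat())
```

A string without fractional seconds would not match this format and would raise `ValueError`, so a production parser would need to handle that variant too.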
Refer to the exhibit.
Table details: orders (10 TB, daily partitioned by order_date, no clustering).
Query:
SELECT customer_id, COUNT(item) AS items_purchased
FROM orders, UNNEST(items) AS item
WHERE item.category = 'electronics'
GROUP BY customer_id
ORDER BY items_purchased DESC
Refer to the exhibit.
Created subscription 'my-sub' on topic 'my-topic' (project:my-project)
Push endpoint: https://myapp.example.com/events
Ack deadline: 10 seconds
Retry policy: Retry after 10 seconds, maximum 10 retries
No authentication configured.
Refer to the exhibit.
```
# Cloud Storage Object Lifecycle Rule (JSON)
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
        "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]}
      },
      {
        "action": {"type": "Delete"},
        "condition": {"age": 90}
      }
    ]
  }
}
```
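To make the effect of the two lifecycle rules above concrete, the sketch below models them as a small Python function that reports which action would be selected for a single object. It is an illustration only: the function name `lifecycle_action` is hypothetical, it considers only the `age` and `matchesStorageClass` conditions shown in the exhibit, and it does not model the fact that Cloud Storage applies lifecycle actions asynchronously.

```python
def lifecycle_action(age_days: int, storage_class: str) -> str:
    """Hypothetical helper: action the exhibit's rules select for one object."""
    # Rule 2: delete objects that are at least 90 days old, in any class.
    # Checked first because Cloud Storage gives Delete precedence over
    # SetStorageClass when both rules match the same object.
    if age_days >= 90:
        return "Delete"
    # Rule 1: move STANDARD objects that are at least 30 days old to NEARLINE.
    if age_days >= 30 and storage_class == "STANDARD":
        return "SetStorageClass:NEARLINE"
    # Neither rule's conditions are satisfied.
    return "None"


for age, cls in [(10, "STANDARD"), (45, "STANDARD"), (45, "NEARLINE"), (120, "STANDARD")]:
    print(age, cls, "->", lifecycle_action(age, cls))
```

Under these rules an object spends at most about 60 days in NEARLINE (from day 30 to day 90) before it becomes eligible for deletion.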