This chapter covers the PaLM API and Google AI APIs, which are foundational for building generative AI applications on Google Cloud. For the GCDL exam, this topic appears in roughly 10–15% of questions, focusing on understanding the capabilities, use cases, and how these APIs integrate with other Google Cloud services. You will learn the key differences between PaLM 2, Gemini, and other models, how to call them via the Vertex AI SDK, and best practices for prompt design and safety. Mastery of this content is essential for the 'Data Analytics and AI' domain.
Imagine a world-class chef (PaLM) with an extensive library of recipes (training data) and a team of sous-chefs (API endpoints). When you (the developer) request a specific dish (prompt), you write down your request on an order slip (API call). The chef reads your slip, understands the ingredients and techniques needed, and then consults the recipe library to generate a custom dish (response). The sous-chefs handle the logistics: one chops vegetables (text processing), another manages the oven (model inference), and a third plates the dish (output formatting). The chef does not have to reinvent cooking each time; instead, they leverage the library's patterns and adapt them to your specific request. Importantly, the chef's recipes are not shared with you (no model access), only the finished dish (API response). You can also specify dietary restrictions (parameters like temperature, top_p) to fine-tune the output. The entire kitchen operates under strict quality control (content filtering, safety checks) to ensure every dish meets the restaurant's standards. This analogy mirrors how the PaLM API provides a powerful, pre-trained language model accessible via simple API calls, abstracting away the complexity of model training and inference.
What is the PaLM API and Why Does It Exist?
The PaLM (Pathways Language Model) API is a managed service on Google Cloud that provides access to Google's large language models (LLMs) through a simple RESTful or gRPC interface. It exists to democratize access to state-of-the-art natural language processing (NLP) capabilities without requiring deep expertise in machine learning, model training, or infrastructure management. Before the PaLM API, building an LLM-powered application required training or fine-tuning a model, provisioning GPUs/TPUs, and handling scaling, monitoring, and updates. The PaLM API abstracts all that complexity, allowing developers to integrate advanced text generation, summarization, classification, and conversation capabilities into their applications with just a few lines of code.
How It Works Internally
When you send a request to the PaLM API, the following sequence occurs:
1. Authentication and Authorization: Your request includes an API key or OAuth 2.0 token. Google Cloud IAM validates that the calling identity (for example, a service account) has the required permissions (e.g., aiplatform.endpoints.predict).
2. Request Parsing: The API endpoint receives your prompt and parameters (e.g., temperature, maxOutputTokens, topP, topK). These are validated and formatted into an internal representation.
3. Prompt Engineering and Preprocessing: The prompt may be augmented with system instructions (if using chat models) or safety configurations. The text is tokenized using the model's tokenizer (e.g., SentencePiece for PaLM). Tokenization converts text into integer IDs that the model can process.
4. Model Inference: The tokenized input is fed into the transformer-based neural network. PaLM 2 uses a decoder-only architecture with an optimized attention mechanism. The model generates output tokens autoregressively: at each step, it computes a probability distribution over the next token, conditioned on the input and the previously generated tokens, and selects one token from that distribution. The selection is controlled by parameters:
   - temperature (0.0 to 1.0): Controls randomness. Lower values (e.g., 0.2) produce more deterministic and focused outputs; higher values (e.g., 0.8) produce more creative and diverse outputs.
   - topP (0.0 to 1.0, default 0.95): Nucleus sampling. The model considers only the most probable tokens whose cumulative probability reaches topP. A lower value (e.g., 0.5) means fewer tokens are considered, making output more focused.
   - topK (1 to 40, default 40): The model samples from the top K tokens by probability. Lower values (e.g., 10) make output more deterministic.
   - maxOutputTokens (1 to 8192, default 256): Maximum number of tokens in the response. The model stops when it reaches this limit or generates an end-of-sequence token.
   - candidateCount (1 to 8, default 1): Number of response variations to generate.
5. Safety and Content Filtering: After generation, the output passes through safety classifiers that detect harmful content (e.g., hate speech, violence, sexual content). Google uses a layered approach: a baseline safety filter (blocking certain categories) and customer-configurable thresholds via safetySettings. If the output is blocked, an empty response with a safety reason is returned.
6. Response Formatting: The generated token IDs are detokenized back into human-readable text. The response is packaged as a JSON object containing the generated text, safety ratings, and usage metadata (e.g., prompt token count, response token count).
7. Logging and Monitoring: All requests and responses are logged for auditing and billing. Usage is metered per character (for text) or per second (for audio/video).
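The way temperature, topK, and topP interact during model inference can be illustrated with a toy sampler. This is a minimal sketch in plain Python, not Google's implementation: the function name `sample_next_token`, the toy logits, and the exact order of the filters (top-K, then top-P, then temperature-scaled sampling) are assumptions chosen for illustration.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.95, rng=None):
    """Pick one token from a {token: logit} dict using top-K, top-P, and temperature."""
    rng = rng or random.Random()

    # Temperature 0 is treated as greedy decoding: always take the highest logit.
    if temperature == 0:
        return max(logits, key=logits.get)

    # Softmax over the raw logits gives a probability for each token.
    peak = max(logits.values())
    exp = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exp.values())
    probs = {tok: e / total for tok, e in exp.items()}

    # Top-K: keep only the K most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Top-P: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append(tok)
        cumulative += p
        if cumulative >= top_p:
            break

    # Temperature: rescale the surviving logits, then sample one token.
    weights = [math.exp((logits[tok] - peak) / temperature) for tok in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy logits, invented for this example.
toy_logits = {"cat": 2.0, "dog": 1.5, "fish": 0.2, "rock": -1.0}
print(sample_next_token(toy_logits, temperature=0))              # always the top token
print(sample_next_token(toy_logits, temperature=0.8, top_k=2))   # "cat" or "dog"
```

Lowering `top_k` or `top_p` shrinks the set of candidate tokens, while lowering `temperature` concentrates probability on the strongest candidates within that set.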
Key Components, Values, and Defaults
Models: text-bison@001 (text generation), chat-bison@001 (chat), embedding-gecko@001 (embeddings). Each has a specific context window: PaLM 2 text-bison has 4096 tokens; Gemini Pro has 32760 tokens.
API Endpoint: https://us-central1-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/us-central1/publishers/google/models/{MODEL_ID}:predict
SDK: Vertex AI SDK for Python (the google-cloud-aiplatform package, which provides the vertexai module). Example:
import vertexai
from vertexai.language_models import TextGenerationModel

# Point the SDK at your project and region.
vertexai.init(project='my-project', location='us-central1')

# Load the pre-trained text model by name and version.
model = TextGenerationModel.from_pretrained('text-bison@001')
response = model.predict(
    prompt='Explain quantum computing in simple terms.',
    temperature=0.7,
    max_output_tokens=512,
)
print(response.text)
Pricing: Text generation is billed per character (input and output). As of 2025, text-bison costs $0.001 per 1,000 input characters and $0.002 per 1,000 output characters. Gemini 1.0 Pro is $0.000125 per 1,000 input characters and $0.000375 per 1,000 output characters.
Quotas: Default quotas vary by region and model. For example, text-bison in us-central1 has a quota of 1500 requests per minute per project.
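To make the endpoint format and parameter names listed above concrete, the sketch below assembles the URL and JSON body for a predict call. It only builds the request and does not send it; a real call would also need an OAuth 2.0 bearer token in the Authorization header. The helper name `build_predict_request` is hypothetical, and the body layout follows the `instances` / `parameters` shape used elsewhere in this chapter.

```python
import json


def build_predict_request(project_id, model_id, prompt, location="us-central1",
                          temperature=0.0, max_output_tokens=256,
                          top_p=0.95, top_k=40):
    """Return the (url, body) pair for a publisher-model :predict call."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{location}/publishers/google/models/{model_id}:predict"
    )
    body = {
        "instances": [{"prompt": prompt}],
        "parameters": {
            "temperature": temperature,
            "maxOutputTokens": max_output_tokens,
            "topP": top_p,
            "topK": top_k,
        },
    }
    return url, body


url, body = build_predict_request("my-project", "text-bison@001", "Hello, world!")
print(url)
print(json.dumps(body, indent=2))
```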
Configuration and Verification Commands
To verify API access, use the following gcloud command (it lists models registered in your project, so Google-published models such as text-bison may not appear):
gcloud ai models list --region=us-central1 --filter=displayName:text-bison
To test a prediction against a deployed endpoint, save the request body to a file and pass its path:
gcloud ai endpoints predict 123456789 \
  --region=us-central1 \
  --json-request=request.json
where request.json contains:
{"instances": [{"prompt": "Hello, world!"}], "parameters": {"temperature": 0.5, "maxOutputTokens": 100}}
Interaction with Related Technologies
The PaLM API integrates tightly with:
- Vertex AI: Provides a unified UI and SDK for model management, evaluation, and deployment. You can use Vertex AI Studio to experiment with prompts without coding.
- Cloud Storage: Store large prompts or fine-tuning datasets in Cloud Storage buckets.
- BigQuery: Use BigQuery ML to create remote models that call the PaLM API directly from SQL.
- Cloud Functions / Cloud Run: Deploy serverless applications that invoke the PaLM API in response to events.
- IAM: Control access via roles like roles/aiplatform.user.
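- VPC Service Controls: Define a service perimeter around your Vertex AI resources so that API calls and data stay within trusted boundaries, reducing the risk of data exfiltration.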
Set up a GCP project
Create a new or use an existing Google Cloud project. Enable the Vertex AI API (aiplatform.googleapis.com) in the API Library. This step is required to authenticate and authorize API calls. Ensure billing is enabled, as the PaLM API is a paid service. Create a service account and download its JSON key for programmatic access, or set up OAuth 2.0 for user-based authentication.
Install and configure Vertex AI SDK
Install the Python SDK using pip: `pip install google-cloud-aiplatform`. Initialize the SDK with your project ID and region (default us-central1). Set environment variables for authentication: `export GOOGLE_APPLICATION_CREDENTIALS=path/to/key.json`. The SDK abstracts the REST API calls and provides high-level methods for model interaction.
Select and load a model
Choose a pre-trained model like `text-bison@001` for text generation or `chat-bison@001` for chat. Load it using `TextGenerationModel.from_pretrained('text-bison@001')`. The model name includes a version suffix (@001). Newer versions may be available; check the documentation. The SDK downloads the model's metadata but not the model itself.
Craft a prompt and set parameters
Write a prompt that clearly instructs the model. Include context, desired format, and constraints. Set generation parameters: temperature (0.0–1.0), max_output_tokens (1–8192), top_p (0.0–1.0), top_k (1–40), and candidate_count (1–8). For safety, configure safety_settings to block certain categories at specific thresholds (e.g., BLOCK_MEDIUM_AND_ABOVE for hate speech).
Call the predict method and handle response
Invoke `model.predict(prompt=..., **params)`. The method sends an HTTP POST request to the Vertex AI endpoint. The response includes `response.text` (generated text), `response.safety_attributes` (safety scores per category), and `response.is_blocked` (whether the output was filtered). Token counts are reported in the metadata of the underlying prediction response. Handle errors like quota exceeded, invalid arguments, or safety blocks. Always check `response.is_blocked` before using the text.
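The parameter and response handling described in these steps can be sketched as a small wrapper. To keep the example runnable without credentials, it takes any object with a `predict()` method and uses a `FakeModel` stand-in; `generate`, `validate_params`, `FakeModel`, and `FakeResponse` are hypothetical names, and the ranges are the ones quoted in this chapter rather than values read from the service.

```python
from dataclasses import dataclass

# Parameter ranges as stated in this chapter; check the current documentation
# before relying on them.
PARAM_RANGES = {
    "temperature": (0.0, 1.0),
    "max_output_tokens": (1, 8192),
    "top_p": (0.0, 1.0),
    "top_k": (1, 40),
}


def validate_params(params):
    """Raise ValueError for unknown or out-of-range generation parameters."""
    for name, value in params.items():
        if name not in PARAM_RANGES:
            raise ValueError(f"unknown parameter: {name}")
        low, high = PARAM_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")


def generate(model, prompt, **params):
    """Call model.predict and return the text, or None if blocked or empty."""
    validate_params(params)
    response = model.predict(prompt, **params)
    if getattr(response, "is_blocked", False) or not response.text:
        return None
    return response.text


@dataclass
class FakeResponse:
    """Hypothetical stand-in for the SDK's response object."""
    text: str
    is_blocked: bool = False


class FakeModel:
    """Hypothetical stand-in for a loaded model; it just echoes the prompt."""
    def predict(self, prompt, **params):
        return FakeResponse(text=f"echo: {prompt}")


print(generate(FakeModel(), "Hello", temperature=0.2, max_output_tokens=64))
```

With the real SDK, the object returned by `TextGenerationModel.from_pretrained(...)` would take the place of `FakeModel()`.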
Enterprise Scenario 1: Customer Support Chatbot
A large e-commerce company deploys a customer support chatbot using the PaLM API (chat-bison) on Vertex AI. The chatbot handles common queries like order status, returns, and product recommendations. The system uses a retrieval-augmented generation (RAG) architecture: user queries are first sent to a search engine (e.g., Vertex AI Search) that retrieves relevant documents from a knowledge base stored in Cloud Storage. The retrieved context is prepended to the user query to form the prompt. The PaLM API generates a coherent, context-aware response. The company configures safety settings to block any offensive language and sets temperature=0.2 to ensure consistent, factual answers. They monitor usage with Cloud Logging and set up alerts for high latency or error rates. A common issue is the model generating plausible-sounding but incorrect information (hallucination). Because the API does not expose a confidence score, and safety ratings measure potential harm rather than factual accuracy, they mitigate this with a grounding check: if a response is not supported by the retrieved documents, the chatbot escalates to a human agent.
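The prompt-assembly part of this RAG flow can be sketched as follows. This is an illustration only: the word-overlap `retrieve` function is a crude stand-in for a real search service such as Vertex AI Search, the knowledge-base sentences are invented sample data, and the prompt wording is one possible template.

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (stand-in for a search service)."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_rag_prompt(query, documents, k=2):
    """Prepend the retrieved context to the user query."""
    context = retrieve(query, documents, k)
    context_block = "\n".join(f"- {doc}" for doc in context)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Invented sample data for the example.
knowledge_base = [
    "Returns are accepted within 30 days of delivery.",
    "Standard shipping takes 3 to 5 business days.",
    "Gift cards cannot be refunded.",
]
print(build_rag_prompt("How many days do I have for returns?", knowledge_base, k=1))
```

Instructing the model to answer only from the supplied context is a common way to reduce hallucination, though it does not eliminate it.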
Enterprise Scenario 2: Automated Content Summarization
A media company uses the PaLM API to automatically summarize long articles for their mobile app. They use the text-bison model with a prompt like 'Summarize the following article in 3 bullet points.' They set max_output_tokens=150 and temperature=0.5. The system processes thousands of articles per day, triggered by Cloud Functions when new files are uploaded to Cloud Storage. The API's scalability handles burst traffic during peak hours. They use the Vertex AI SDK with batch prediction for offline processing of large datasets. A key performance consideration is token limits: articles longer than 4096 tokens must be truncated or split. They also use the embedding model (embedding-gecko) to generate vector representations for semantic search of summaries.
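Splitting long articles to fit the context window might look like the sketch below. It assumes roughly 4 characters per token, which is only a rule of thumb; a real pipeline would count tokens with the model's tokenizer. The function name `split_into_chunks` and the `reserve_tokens` allowance for the instruction and the summary are hypothetical.

```python
def split_into_chunks(text, max_tokens=4096, chars_per_token=4, reserve_tokens=256):
    """Split text on paragraph boundaries so each chunk fits an approximate token budget."""
    budget_chars = (max_tokens - reserve_tokens) * chars_per_token
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # Hard-split any single paragraph that exceeds the budget on its own.
        while len(paragraph) > budget_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:budget_chars])
            paragraph = paragraph[budget_chars:]
        if not paragraph:
            continue
        # Add the paragraph to the current chunk if it still fits.
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) > budget_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


article = "\n\n".join(["word " * 100] * 10)  # ten 500-character paragraphs
chunks = split_into_chunks(article, max_tokens=300, chars_per_token=4, reserve_tokens=50)
print(len(chunks), max(len(c) for c in chunks))
```

Each chunk can then be summarized separately, and the partial summaries combined in a final request.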
Enterprise Scenario 3: Code Generation Assistant
A software development firm integrates the PaLM API into their IDE plugin to generate code snippets, documentation, and test cases. They use the text-bison model with prompts that include the programming language, context, and desired output format. They set temperature=0.1 and top_p=0.95 for near-deterministic code generation. The plugin calls the API asynchronously to avoid blocking the UI. A common mistake is relying on safety filters alone: the filters target harmful content categories, not code quality, so the model can still generate insecure code (e.g., code vulnerable to SQL injection). The firm configures safety settings to block high-risk content and adds a post-processing step that runs static analysis tools on the generated code. They also use the API's candidate_count parameter to generate multiple code options and let the developer choose.
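The post-processing of multiple candidates could be sketched like this. It is a toy heuristic built on Python's `ast` module, not a substitute for a real static analysis tool: it discards candidates that do not parse and flags `execute(...)` calls whose first argument is an f-string or a string expression built with an operator. The function names and sample candidates are invented for illustration.

```python
import ast


def _is_risky_execute(node):
    """Flag calls like cursor.execute(f"...") or cursor.execute("..." % value)."""
    if not (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == "execute"
        and node.args
    ):
        return False
    # JoinedStr is an f-string; BinOp covers "%" formatting and "+" concatenation.
    return isinstance(node.args[0], (ast.JoinedStr, ast.BinOp))


def screen_candidates(candidates):
    """Keep candidates that parse as Python and pass the naive SQL check."""
    accepted = []
    for code in candidates:
        try:
            tree = ast.parse(code)
        except SyntaxError:
            continue  # discard code that does not even parse
        if not any(_is_risky_execute(node) for node in ast.walk(tree)):
            accepted.append(code)
    return accepted


# Invented candidates standing in for model output.
candidates = [
    'cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")',
    'cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))',
    "def broken(:",
]
print(screen_candidates(candidates))
```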
What the GCDL Exam Tests
The GCDL exam objective 3.3 focuses on understanding the capabilities and use cases of Google AI APIs, specifically the PaLM API and Vertex AI. You are NOT expected to write code or configure models. Instead, you must know:
The difference between PaLM 2, Gemini, and other models (e.g., Codey, Imagen).
How to access these models (via Vertex AI, API, or SDK).
Common use cases: text generation, chat, summarization, classification, embeddings.
Key parameters: temperature, max tokens, top_p, top_k.
Safety and responsible AI features (safety filters, content blocking).
Integration with other Google Cloud services (BigQuery, Cloud Storage, Cloud Functions).
Common Wrong Answers and Why Candidates Choose Them
'The PaLM API requires fine-tuning before use.' WRONG. The API provides pre-trained models that can be used out-of-the-box. Fine-tuning is an advanced feature for customizing models, but it is not required.
'You must train your own model using Vertex AI Training.' WRONG. The PaLM API abstracts model training entirely. You only provide prompts.
'The API only supports English.' WRONG. PaLM 2 supports multiple languages, including English, Chinese, Spanish, and more.
'Temperature controls the length of the response.' WRONG. Temperature controls randomness, not length. Length is controlled by max_output_tokens.
Specific Numbers and Terms That Appear on the Exam
PaLM 2 context window: 4096 tokens.
Gemini 1.0 Pro context window: 32760 tokens.
Default temperature: 0.0 (but often 0.5 in examples).
Default top_p: 0.95.
Default max_output_tokens: 256.
Pricing: per character (not per token).
Safety filter categories: HARM_CATEGORY_HATE_SPEECH, HARM_CATEGORY_DANGEROUS_CONTENT, HARM_CATEGORY_SEXUALLY_EXPLICIT, HARM_CATEGORY_HARASSMENT.
Thresholds: BLOCK_LOW_AND_ABOVE, BLOCK_MEDIUM_AND_ABOVE, BLOCK_ONLY_HIGH, BLOCK_NONE.
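How the thresholds relate to one another can be sketched with a small lookup. The category and threshold names come from the list above; the probability levels (NEGLIGIBLE, LOW, MEDIUM, HIGH) and the helper names `is_blocked` and `blocked_categories` are assumptions used for illustration, not a description of the service's internal logic.

```python
# Probability levels ordered from least to most likely harmful (assumed labels).
PROBABILITY_ORDER = ["NEGLIGIBLE", "LOW", "MEDIUM", "HIGH"]

# Lowest probability level that each threshold blocks; None means never block.
THRESHOLD_FLOOR = {
    "BLOCK_LOW_AND_ABOVE": "LOW",
    "BLOCK_MEDIUM_AND_ABOVE": "MEDIUM",
    "BLOCK_ONLY_HIGH": "HIGH",
    "BLOCK_NONE": None,
}


def is_blocked(probability, threshold):
    """Return True if content rated at `probability` is blocked under `threshold`."""
    floor = THRESHOLD_FLOOR[threshold]
    if floor is None:
        return False
    return PROBABILITY_ORDER.index(probability) >= PROBABILITY_ORDER.index(floor)


def blocked_categories(ratings, settings):
    """List the categories whose rating meets or exceeds the configured threshold."""
    return [cat for cat, prob in ratings.items() if is_blocked(prob, settings[cat])]


safety_settings = {
    "HARM_CATEGORY_HATE_SPEECH": "BLOCK_LOW_AND_ABOVE",
    "HARM_CATEGORY_DANGEROUS_CONTENT": "BLOCK_MEDIUM_AND_ABOVE",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT": "BLOCK_MEDIUM_AND_ABOVE",
    "HARM_CATEGORY_HARASSMENT": "BLOCK_ONLY_HIGH",
}

# Invented ratings for one response.
ratings = {
    "HARM_CATEGORY_HATE_SPEECH": "LOW",
    "HARM_CATEGORY_DANGEROUS_CONTENT": "LOW",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT": "NEGLIGIBLE",
    "HARM_CATEGORY_HARASSMENT": "MEDIUM",
}
print(blocked_categories(ratings, safety_settings))
```

BLOCK_LOW_AND_ABOVE is the strictest setting and BLOCK_NONE the most permissive, which is the ordering worth remembering for the exam.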
Edge Cases and Exceptions
If the prompt exceeds the model's context window, the API returns an error. You must truncate or summarize the input.
The API may return empty responses due to safety filters even if the prompt is benign. Always handle empty responses gracefully.
Quota limits vary by region and model. For high-volume applications, request quota increases through Google Cloud Console.
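Handling quota errors and empty responses gracefully might look like the sketch below. `QuotaExceeded` is a hypothetical stand-in for whatever quota error a client library raises, and `call_with_backoff` is an illustrative helper, not part of any SDK. The `sleep` argument is injectable so the example can run without waiting.

```python
import time


class QuotaExceeded(Exception):
    """Hypothetical stand-in for a quota error raised by a client library."""


def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry func() with exponential backoff on quota errors; map empty results to None."""
    for attempt in range(max_attempts):
        try:
            result = func()
        except QuotaExceeded:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
            continue
        # An empty result (for example, a safety-filtered response) becomes None.
        return result or None


# Simulated call that hits the quota twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise QuotaExceeded()
    return "ok"


delays = []
print(call_with_backoff(flaky_call, sleep=delays.append), delays)
```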
How to Eliminate Wrong Answers
If an answer mentions 'training a model' or 'fine-tuning' as a prerequisite, it is likely wrong.
If an answer says the API only works with specific data types (e.g., only text), check whether the model in question supports multimodal input (Gemini does).
If an answer confuses parameters (e.g., temperature for length), eliminate it.
If an answer omits safety features, it is incomplete.
The PaLM API provides pre-trained language models accessible via REST, gRPC, or Vertex AI SDK.
Key generation parameters: temperature (0.0–1.0), max_output_tokens (1–8192), top_p (0.0–1.0), top_k (1–40).
PaLM 2 text-bison has a context window of 4096 tokens; Gemini 1.0 Pro has 32760 tokens.
Safety filters block harmful content based on configurable thresholds (e.g., BLOCK_MEDIUM_AND_ABOVE).
Pricing is per character for input and output, not per token.
Common use cases: text generation, summarization, classification, chat, code generation, embeddings.
Integration with Vertex AI, BigQuery ML, Cloud Functions, and Cloud Storage.
The GCDL exam focuses on understanding capabilities, not coding. Know the differences between models and parameters.
Fine-tuning is optional; the API works out-of-the-box without training.
Always handle empty responses due to safety blocks or quota limits.
These come up on the exam all the time. Here's how to tell them apart.
PaLM API (text-bison)
Context window: 4096 tokens
Text-only input and output
Available via Vertex AI and PaLM API
Pricing: $0.001/1K input chars, $0.002/1K output chars
Designed for text generation, chat, and embeddings
Gemini API (gemini-pro)
Context window: 32760 tokens
Multimodal: text, image, video, audio input; text output
Available via Vertex AI and Gemini API
Pricing: $0.000125/1K input chars, $0.000375/1K output chars (text only)
Designed for multimodal reasoning, code generation, and long-context tasks
Mistake
The PaLM API requires you to train the model yourself.
Correct
The PaLM API provides pre-trained models that are ready to use. You only need to send prompts. Fine-tuning is an optional feature for customization, not a requirement.
Mistake
Temperature controls the length of the generated response.
Correct
Temperature controls the randomness of the output. Lower values produce more deterministic responses; higher values produce more creative ones. The length is controlled by the max_output_tokens parameter.
Mistake
The PaLM API only works with English text.
Correct
PaLM 2 supports multiple languages, including English, Chinese, Spanish, Japanese, and many others. The model was trained on multilingual data.
Mistake
You must use the REST API directly; there is no SDK.
Correct
Google provides the Vertex AI SDK (e.g., for Python, Java, Node.js) that simplifies API calls. You can also use the REST API directly if preferred.
Mistake
The API returns the same output for the same prompt every time.
Correct
The API returns near-identical output only under greedy decoding, for example with temperature=0.0 (or top_k=1), and even then small variations are possible. With a higher temperature, the output varies across calls. You can set candidate_count to generate multiple variations in a single request.
The PaLM API is a specific API for accessing Google's language models. Vertex AI is a broader platform that includes the PaLM API along with tools for model training, deployment, and management. You can call the PaLM API directly or through Vertex AI's unified interface. For the exam, know that Vertex AI provides additional features like AutoML, model evaluation, and MLOps.
Use the temperature parameter. A low temperature (e.g., 0.2) makes the model more deterministic and focused. A high temperature (e.g., 0.8) makes it more creative and diverse. For factual tasks, use low temperature. For creative writing, use higher temperature. The default is 0.0 but many examples use 0.5.
Yes, the API is designed for low-latency inference. However, response times depend on model size, input length, and current load. For real-time applications, consider using smaller models (e.g., text-bison) and setting low max_output_tokens. You can also enable streaming responses to get partial results faster.
The API returns an error indicating that the input is too long. You must truncate, summarize, or split the input to fit within the model's context window (4096 tokens for PaLM 2, 32760 for Gemini). For long documents, consider using the embedding model to retrieve relevant chunks.
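Retrieving relevant chunks with embeddings usually comes down to ranking by vector similarity. The sketch below uses cosine similarity over tiny invented vectors; in practice the vectors would come from an embedding model such as embedding-gecko and would have many more dimensions. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(query_vector, chunk_vectors, k=2):
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Invented 3-dimensional vectors standing in for real embeddings.
chunk_vectors = {
    "intro": [0.9, 0.1, 0.0],
    "pricing": [0.1, 0.9, 0.2],
    "limits": [0.0, 0.3, 0.9],
}
query_vector = [0.2, 0.8, 0.1]
print(top_chunks(query_vector, chunk_vectors, k=2))
```

Only the selected chunks are then placed in the prompt, which keeps the input within the context window.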
Google implements safety filters that block content in categories like hate speech, dangerous content, sexually explicit, and harassment. You can configure thresholds per category. Additionally, the API returns safety ratings for each response. If a response is blocked, it returns an empty result with a safety reason. Google also provides guidelines for responsible AI development.
Pricing is based on the number of characters processed, both input and output. For text-bison, it costs $0.001 per 1,000 input characters and $0.002 per 1,000 output characters. Characters are counted as UTF-8 code points, and whitespace is excluded from the count. Requests whose responses are blocked by a safety filter are charged for the input only. Check the official pricing page for updates.
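As a worked example of per-character billing, the sketch below estimates the cost of one request from billable character counts. The rates are the ones quoted in this chapter and may be out of date; the function name `estimate_cost` is hypothetical, and the counts are passed in directly so that the billing rules for what counts as a character stay outside the example.

```python
def estimate_cost(input_chars, output_chars, input_rate_per_1k, output_rate_per_1k):
    """Estimate the cost in USD of one request from billable character counts."""
    input_cost = (input_chars / 1000) * input_rate_per_1k
    output_cost = (output_chars / 1000) * output_rate_per_1k
    return input_cost + output_cost


# 2,000 input characters and 500 output characters at the rates quoted above.
text_bison_cost = estimate_cost(2000, 500, 0.001, 0.002)
gemini_pro_cost = estimate_cost(2000, 500, 0.000125, 0.000375)
print(f"text-bison:     ${text_bison_cost:.6f}")
print(f"Gemini 1.0 Pro: ${gemini_pro_cost:.7f}")
```

For text-bison this works out to 2 × $0.001 + 0.5 × $0.002 = $0.003 per request.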
Yes, Vertex AI supports fine-tuning of PaLM 2 models using your own dataset. Fine-tuning adjusts the model's weights to improve performance on specific tasks. However, this is an advanced feature and requires a dataset in the correct format. The GCDL exam expects you to know that fine-tuning is possible but not required for basic use.