🧠Simple Definition (Word-for-word)
Prompt engineering: crafting input instructions to get reliable, structured output from LLMs.
âš¡ Super Simple Line
Techniques: system prompt (set role/persona/rules), few-shot examples (show input→output pairs), chain-of-thought (ask the model to reason step by step), output format constraints (respond only in JSON with schema X), negative constraints (do not include...).
âš¡ Key Details & Explanation
Prompt engineering: crafting input instructions to get reliable, structured output from LLMs. Techniques: system prompt (set role/persona/rules), few-shot examples (show input→output pairs), chain-of-thought (ask the model to reason step by step), output format constraints (respond only in JSON with schema X), negative constraints (do not include...). For sentiment extraction: define categories (positive/negative/neutral/mixed), provide examples of each, specify output JSON schema, test edge cases (sarcasm, neutral business language).
âš¡ One-line Interview Answer
Prompt engineering: crafting input instructions to get reliable, structured output from LLMs.
🧠Simple Definition (Word-for-word)
RAG: instead of relying on LLM's training data, retrieve relevant documents at query time and inject them into the prompt context.
âš¡ Super Simple Line
Steps: (1) Chunk documents into segments, (2) Generate vector embeddings for each chunk (OpenAI text-embedding-ada-002 or similar), (3) Store in a vector database (Pinecone, Qdrant, pgvector), (4) At query time: embed the user's question, similarity-search the vector DB for top-k relevant chunks, (5) Inject retrieved chunks into the LLM prompt as context.
âš¡ Key Details & Explanation
RAG: instead of relying on LLM's training data, retrieve relevant documents at query time and inject them into the prompt context. Steps: (1) Chunk documents into segments, (2) Generate vector embeddings for each chunk (OpenAI text-embedding-ada-002 or similar), (3) Store in a vector database (Pinecone, Qdrant, pgvector), (4) At query time: embed the user's question, similarity-search the vector DB for top-k relevant chunks, (5) Inject retrieved chunks into the LLM prompt as context. LLM answers based on retrieved context, not just training data. Prevents hallucination for domain-specific knowledge.
âš¡ One-line Interview Answer
RAG: instead of relying on LLM's training data, retrieve relevant documents at query time and inject them into the prompt context.
🧠Simple Definition (Word-for-word)
OpenAI: GPT-4o/o1 — best coding and reasoning, largest ecosystem, function calling is mature.
âš¡ Super Simple Line
Anthropic Claude: best for long-context tasks (200K tokens), nuanced writing, less prone to harmful outputs, strong structured output with tool use.
âš¡ Key Details & Explanation
OpenAI: GPT-4o/o1 — best coding and reasoning, largest ecosystem, function calling is mature. Anthropic Claude: best for long-context tasks (200K tokens), nuanced writing, less prone to harmful outputs, strong structured output with tool use. Gemini: Google's model, multimodal by default (video/audio/image/text), best integrated with Google ecosystem. Pick based on: task type (code → OpenAI, long docs → Claude, multimodal → Gemini), cost/performance tradeoff, compliance requirements, existing infrastructure. You used Gemini in Career Dock — know it well.
âš¡ One-line Interview Answer
OpenAI: GPT-4o/o1 — best coding and reasoning, largest ecosystem, function calling is mature.
🧠Simple Definition (Word-for-word)
Streaming: instead of waiting for the complete response, the API sends tokens as they're generated — dramatically improves perceived latency for long outputs.
âš¡ Super Simple Line
In Career Dock: use response.body as a ReadableStream, read chunks with a reader, decode with TextDecoder, parse the event stream format (data: {...}), update UI progressively.
âš¡ Key Details & Explanation
Streaming: instead of waiting for the complete response, the API sends tokens as they're generated — dramatically improves perceived latency for long outputs. In Career Dock: use response.body as a ReadableStream, read chunks with a reader, decode with TextDecoder, parse the event stream format (data: {...}), update UI progressively. In Next.js: use Response with a ReadableStream in a Route Handler, pipe from the Gemini/OpenAI SDK's stream to the client response. Critical for good UX in AI apps.
âš¡ One-line Interview Answer
Streaming: instead of waiting for the complete response, the API sends tokens as they're generated — dramatically improves perceived latency for long outputs.
🧠Simple Definition (Word-for-word)
Hallucinations are confident-sounding but wrong outputs.
âš¡ Super Simple Line
Mitigation strategies: RAG (ground the model in retrieved facts), structured output with validation (use function calling/JSON mode, validate schema), confidence scoring (ask model to rate confidence, flag low-confidence responses for human review), constrained outputs (provide allowed values list for classification tasks), multi-model cross-check (ask two models, flag disagreements), human-in-the-loop for high-stakes outputs, clear disclaimers in UI for AI-generated content.
âš¡ Key Details & Explanation
Hallucinations are confident-sounding but wrong outputs. Mitigation strategies: RAG (ground the model in retrieved facts), structured output with validation (use function calling/JSON mode, validate schema), confidence scoring (ask model to rate confidence, flag low-confidence responses for human review), constrained outputs (provide allowed values list for classification tasks), multi-model cross-check (ask two models, flag disagreements), human-in-the-loop for high-stakes outputs, clear disclaimers in UI for AI-generated content.
âš¡ One-line Interview Answer
Hallucinations are confident-sounding but wrong outputs.
🧠Simple Definition (Word-for-word)
Function calling: you define a set of tools (functions) with names, descriptions, and JSON schemas for parameters.
âš¡ Super Simple Line
The LLM decides when to call a tool and with what arguments — it outputs a structured tool call instead of text.
âš¡ Key Details & Explanation
Function calling: you define a set of tools (functions) with names, descriptions, and JSON schemas for parameters. The LLM decides when to call a tool and with what arguments — it outputs a structured tool call instead of text. Your code executes the actual function and returns the result to the LLM, which incorporates it into its response. Use cases: fetching real-time data (weather, stock prices), database queries, sending emails/notifications, any time the LLM needs to interact with external systems.
âš¡ One-line Interview Answer
Function calling: you define a set of tools (functions) with names, descriptions, and JSON schemas for parameters.
🧠Simple Definition (Word-for-word)
Strategies: caching responses for identical or similar queries (semantic caching with embeddings similarity), use smaller/cheaper models for simpler tasks (route simple classification to GPT-3.5/Haiku, complex reasoning to GPT-4/Opus), limit context window (trim conversation history, summarize long contexts), implement request queuing and rate limiting per user, monitor token usage with logging per user/feature, set hard spending limits on API keys, batch requests when real-time not required.
âš¡ Super Simple Line
Strategies: caching responses for identical or similar queries (semantic caching with embeddings similarity), use smaller/cheaper models for simpler tasks (route simple classification to GPT-3.5/Haiku, complex reasoning to GPT-4/Opus), limit context window (trim conversation history, summarize long contexts), implement request queuing and rate limiting per user, monitor token usage with logging per user/feature, set hard spending limits on API keys, batch requests when real-time not required.
âš¡ Key Details & Explanation
Strategies: caching responses for identical or similar queries (semantic caching with embeddings similarity), use smaller/cheaper models for simpler tasks (route simple classification to GPT-3.5/Haiku, complex reasoning to GPT-4/Opus), limit context window (trim conversation history, summarize long contexts), implement request queuing and rate limiting per user, monitor token usage with logging per user/feature, set hard spending limits on API keys, batch requests when real-time not required.
âš¡ One-line Interview Answer
Strategies: caching responses for identical or similar queries (semantic caching with embeddings similarity), use smaller/cheaper models for simpler tasks (route simple classification to GPT-3.5/Haiku, complex reasoning to GPT-4/Opus), limit context window (trim conversation history, summarize long contexts), implement request queuing and rate limiting per user, monitor token usage with logging per user/feature, set hard spending limits on API keys, batch requests when real-time not required.
🧠Simple Definition (Word-for-word)
Embeddings: numerical vector representations of text (or images) in high-dimensional space where semantically similar content is close together.
âš¡ Super Simple Line
Generated by embedding models (text-embedding-ada-002, Cohere, etc.).
âš¡ Key Details & Explanation
Embeddings: numerical vector representations of text (or images) in high-dimensional space where semantically similar content is close together. Generated by embedding models (text-embedding-ada-002, Cohere, etc.). Vector databases store and index these embeddings for fast nearest-neighbor search. Examples: Pinecone (managed), Qdrant (self-hosted), Weaviate, pgvector (PostgreSQL extension). Use cases: semantic search, RAG, recommendation systems, duplicate detection. The core operation is cosine similarity or dot product search.
âš¡ One-line Interview Answer
Embeddings: numerical vector representations of text (or images) in high-dimensional space where semantically similar content is close together.
🧠Simple Definition (Word-for-word)
A simple LLM call: one input, one output.
âš¡ Super Simple Line
An agent: the LLM can use tools, observe results, and decide next actions in a loop until a goal is achieved.
âš¡ Key Details & Explanation
A simple LLM call: one input, one output. An agent: the LLM can use tools, observe results, and decide next actions in a loop until a goal is achieved. Agent loop: (1) LLM receives task + available tools, (2) LLM decides to call a tool, (3) tool executes, result returned to LLM, (4) LLM uses result to decide next step or produce final answer. Examples: an agent that can search the web, write and execute code, read emails, then synthesize a report. Frameworks: LangChain, LlamaIndex, Vercel AI SDK.
âš¡ One-line Interview Answer
A simple LLM call: one input, one output.
🧠Simple Definition (Word-for-word)
Manual evaluation doesn't scale.
âš¡ Super Simple Line
Automated approaches: LLM-as-judge (use a separate LLM to rate the output against a rubric), reference-based evaluation (compare to gold standard answers with BLEU/ROUGE for text tasks), task-specific metrics (accuracy for classification, F1 for extraction), embedding similarity between expected and actual output, A/B testing with user engagement signals (click-through, thumbs up/down).
âš¡ Key Details & Explanation
Manual evaluation doesn't scale. Automated approaches: LLM-as-judge (use a separate LLM to rate the output against a rubric), reference-based evaluation (compare to gold standard answers with BLEU/ROUGE for text tasks), task-specific metrics (accuracy for classification, F1 for extraction), embedding similarity between expected and actual output, A/B testing with user engagement signals (click-through, thumbs up/down). Build an eval dataset from real usage patterns and run it on every model/prompt change.
âš¡ One-line Interview Answer
Manual evaluation doesn't scale.
🧠Simple Definition (Word-for-word)
Never trust model output blindly.
âš¡ Super Simple Line
Ask for a strict JSON schema, parse it defensively, and validate it with a runtime validator such as zod or JSON Schema before using it.
âš¡ Key Details & Explanation
Never trust model output blindly. Ask for a strict JSON schema, parse it defensively, and validate it with a runtime validator such as zod or JSON Schema before using it. If parsing fails, retry with a repair prompt or fall back to a safer path. This matters because LLMs may return extra text, missing fields, wrong types, or hallucinated keys even when they usually behave well.
âš¡ One-line Interview Answer
Never trust model output blindly.
🧠Simple Definition (Word-for-word)
Plan for fallback.
âš¡ Super Simple Line
Use timeouts, retries with backoff, circuit breakers, and clear error handling in the UI.
âš¡ Key Details & Explanation
Plan for fallback. Use timeouts, retries with backoff, circuit breakers, and clear error handling in the UI. For critical paths, keep a secondary provider or cheaper backup model, though responses may differ so test prompt portability. Log provider latency, error rates, and token usage separately so you can detect degradation quickly. In production, graceful degradation is better than a hard outage.
âš¡ One-line Interview Answer
Plan for fallback.
🧠Simple Definition (Word-for-word)
Main concerns: sending sensitive data to third-party models, prompt injection from untrusted content, data retention policies, and leaking secrets through tool use.
âš¡ Super Simple Line
Mitigations: redact or avoid PII where possible, isolate tool permissions, validate tool inputs and outputs, use allowlists for actions, and understand your model provider's retention settings.
âš¡ Key Details & Explanation
Main concerns: sending sensitive data to third-party models, prompt injection from untrusted content, data retention policies, and leaking secrets through tool use. Mitigations: redact or avoid PII where possible, isolate tool permissions, validate tool inputs and outputs, use allowlists for actions, and understand your model provider's retention settings. For RAG systems, treat retrieved documents as untrusted input and instruct the model not to follow embedded malicious instructions. The main thing I would emphasize is the threat model: what can an attacker do, and which protection stops that attack. Security answers become stronger when they connect the risk to the mitigation.
âš¡ One-line Interview Answer
Main concerns: sending sensitive data to third-party models, prompt injection from untrusted content, data retention policies, and leaking secrets through tool use.