Back to Categories
AI Integration

AI Integration

Prompt engineering, RAG, embeddings, LLM APIs and agents

13Questions

🧠 Simple Definition (Word-for-word)

Prompt engineering: crafting input instructions to get reliable, structured output from LLMs.


âš¡ Super Simple Line

Techniques: system prompt (set role/persona/rules), few-shot examples (show input→output pairs), chain-of-thought (ask the model to reason step by step), output format constraints (respond only in JSON with schema X), negative constraints (do not include...).


âš¡ Key Details & Explanation

Prompt engineering: crafting input instructions to get reliable, structured output from LLMs. Techniques: system prompt (set role/persona/rules), few-shot examples (show input→output pairs), chain-of-thought (ask the model to reason step by step), output format constraints (respond only in JSON with schema X), negative constraints (do not include...). For sentiment extraction: define categories (positive/negative/neutral/mixed), provide examples of each, specify output JSON schema, test edge cases (sarcasm, neutral business language).


âš¡ One-line Interview Answer

Prompt engineering: crafting input instructions to get reliable, structured output from LLMs.

🧠 Simple Definition (Word-for-word)

RAG: instead of relying on LLM's training data, retrieve relevant documents at query time and inject them into the prompt context.


âš¡ Super Simple Line

Steps: (1) Chunk documents into segments, (2) Generate vector embeddings for each chunk (OpenAI text-embedding-ada-002 or similar), (3) Store in a vector database (Pinecone, Qdrant, pgvector), (4) At query time: embed the user's question, similarity-search the vector DB for top-k relevant chunks, (5) Inject retrieved chunks into the LLM prompt as context.


âš¡ Key Details & Explanation

RAG: instead of relying on LLM's training data, retrieve relevant documents at query time and inject them into the prompt context. Steps: (1) Chunk documents into segments, (2) Generate vector embeddings for each chunk (OpenAI text-embedding-ada-002 or similar), (3) Store in a vector database (Pinecone, Qdrant, pgvector), (4) At query time: embed the user's question, similarity-search the vector DB for top-k relevant chunks, (5) Inject retrieved chunks into the LLM prompt as context. LLM answers based on retrieved context, not just training data. Prevents hallucination for domain-specific knowledge.


âš¡ One-line Interview Answer

RAG: instead of relying on LLM's training data, retrieve relevant documents at query time and inject them into the prompt context.

🧠 Simple Definition (Word-for-word)

OpenAI: GPT-4o/o1 — best coding and reasoning, largest ecosystem, function calling is mature.


âš¡ Super Simple Line

Anthropic Claude: best for long-context tasks (200K tokens), nuanced writing, less prone to harmful outputs, strong structured output with tool use.


âš¡ Key Details & Explanation

OpenAI: GPT-4o/o1 — best coding and reasoning, largest ecosystem, function calling is mature. Anthropic Claude: best for long-context tasks (200K tokens), nuanced writing, less prone to harmful outputs, strong structured output with tool use. Gemini: Google's model, multimodal by default (video/audio/image/text), best integrated with Google ecosystem. Pick based on: task type (code → OpenAI, long docs → Claude, multimodal → Gemini), cost/performance tradeoff, compliance requirements, existing infrastructure. You used Gemini in Career Dock — know it well.


âš¡ One-line Interview Answer

OpenAI: GPT-4o/o1 — best coding and reasoning, largest ecosystem, function calling is mature.

🧠 Simple Definition (Word-for-word)

Streaming: instead of waiting for the complete response, the API sends tokens as they're generated — dramatically improves perceived latency for long outputs.


âš¡ Super Simple Line

In Career Dock: use response.body as a ReadableStream, read chunks with a reader, decode with TextDecoder, parse the event stream format (data: {...}), update UI progressively.


âš¡ Key Details & Explanation

Streaming: instead of waiting for the complete response, the API sends tokens as they're generated — dramatically improves perceived latency for long outputs. In Career Dock: use response.body as a ReadableStream, read chunks with a reader, decode with TextDecoder, parse the event stream format (data: {...}), update UI progressively. In Next.js: use Response with a ReadableStream in a Route Handler, pipe from the Gemini/OpenAI SDK's stream to the client response. Critical for good UX in AI apps.


âš¡ One-line Interview Answer

Streaming: instead of waiting for the complete response, the API sends tokens as they're generated — dramatically improves perceived latency for long outputs.

🧠 Simple Definition (Word-for-word)

Hallucinations are confident-sounding but wrong outputs.


âš¡ Super Simple Line

Mitigation strategies: RAG (ground the model in retrieved facts), structured output with validation (use function calling/JSON mode, validate schema), confidence scoring (ask model to rate confidence, flag low-confidence responses for human review), constrained outputs (provide allowed values list for classification tasks), multi-model cross-check (ask two models, flag disagreements), human-in-the-loop for high-stakes outputs, clear disclaimers in UI for AI-generated content.


âš¡ Key Details & Explanation

Hallucinations are confident-sounding but wrong outputs. Mitigation strategies: RAG (ground the model in retrieved facts), structured output with validation (use function calling/JSON mode, validate schema), confidence scoring (ask model to rate confidence, flag low-confidence responses for human review), constrained outputs (provide allowed values list for classification tasks), multi-model cross-check (ask two models, flag disagreements), human-in-the-loop for high-stakes outputs, clear disclaimers in UI for AI-generated content.


âš¡ One-line Interview Answer

Hallucinations are confident-sounding but wrong outputs.

🧠 Simple Definition (Word-for-word)

Function calling: you define a set of tools (functions) with names, descriptions, and JSON schemas for parameters.


âš¡ Super Simple Line

The LLM decides when to call a tool and with what arguments — it outputs a structured tool call instead of text.


âš¡ Key Details & Explanation

Function calling: you define a set of tools (functions) with names, descriptions, and JSON schemas for parameters. The LLM decides when to call a tool and with what arguments — it outputs a structured tool call instead of text. Your code executes the actual function and returns the result to the LLM, which incorporates it into its response. Use cases: fetching real-time data (weather, stock prices), database queries, sending emails/notifications, any time the LLM needs to interact with external systems.


âš¡ One-line Interview Answer

Function calling: you define a set of tools (functions) with names, descriptions, and JSON schemas for parameters.

🧠 Simple Definition (Word-for-word)

Strategies: caching responses for identical or similar queries (semantic caching with embeddings similarity), use smaller/cheaper models for simpler tasks (route simple classification to GPT-3.5/Haiku, complex reasoning to GPT-4/Opus), limit context window (trim conversation history, summarize long contexts), implement request queuing and rate limiting per user, monitor token usage with logging per user/feature, set hard spending limits on API keys, batch requests when real-time not required.


âš¡ Super Simple Line

Strategies: caching responses for identical or similar queries (semantic caching with embeddings similarity), use smaller/cheaper models for simpler tasks (route simple classification to GPT-3.5/Haiku, complex reasoning to GPT-4/Opus), limit context window (trim conversation history, summarize long contexts), implement request queuing and rate limiting per user, monitor token usage with logging per user/feature, set hard spending limits on API keys, batch requests when real-time not required.


âš¡ Key Details & Explanation

Strategies: caching responses for identical or similar queries (semantic caching with embeddings similarity), use smaller/cheaper models for simpler tasks (route simple classification to GPT-3.5/Haiku, complex reasoning to GPT-4/Opus), limit context window (trim conversation history, summarize long contexts), implement request queuing and rate limiting per user, monitor token usage with logging per user/feature, set hard spending limits on API keys, batch requests when real-time not required.


âš¡ One-line Interview Answer

Strategies: caching responses for identical or similar queries (semantic caching with embeddings similarity), use smaller/cheaper models for simpler tasks (route simple classification to GPT-3.5/Haiku, complex reasoning to GPT-4/Opus), limit context window (trim conversation history, summarize long contexts), implement request queuing and rate limiting per user, monitor token usage with logging per user/feature, set hard spending limits on API keys, batch requests when real-time not required.

🧠 Simple Definition (Word-for-word)

Embeddings: numerical vector representations of text (or images) in high-dimensional space where semantically similar content is close together.


âš¡ Super Simple Line

Generated by embedding models (text-embedding-ada-002, Cohere, etc.).


âš¡ Key Details & Explanation

Embeddings: numerical vector representations of text (or images) in high-dimensional space where semantically similar content is close together. Generated by embedding models (text-embedding-ada-002, Cohere, etc.). Vector databases store and index these embeddings for fast nearest-neighbor search. Examples: Pinecone (managed), Qdrant (self-hosted), Weaviate, pgvector (PostgreSQL extension). Use cases: semantic search, RAG, recommendation systems, duplicate detection. The core operation is cosine similarity or dot product search.


âš¡ One-line Interview Answer

Embeddings: numerical vector representations of text (or images) in high-dimensional space where semantically similar content is close together.

🧠 Simple Definition (Word-for-word)

A simple LLM call: one input, one output.


âš¡ Super Simple Line

An agent: the LLM can use tools, observe results, and decide next actions in a loop until a goal is achieved.


âš¡ Key Details & Explanation

A simple LLM call: one input, one output. An agent: the LLM can use tools, observe results, and decide next actions in a loop until a goal is achieved. Agent loop: (1) LLM receives task + available tools, (2) LLM decides to call a tool, (3) tool executes, result returned to LLM, (4) LLM uses result to decide next step or produce final answer. Examples: an agent that can search the web, write and execute code, read emails, then synthesize a report. Frameworks: LangChain, LlamaIndex, Vercel AI SDK.


âš¡ One-line Interview Answer

A simple LLM call: one input, one output.

🧠 Simple Definition (Word-for-word)

Manual evaluation doesn't scale.


âš¡ Super Simple Line

Automated approaches: LLM-as-judge (use a separate LLM to rate the output against a rubric), reference-based evaluation (compare to gold standard answers with BLEU/ROUGE for text tasks), task-specific metrics (accuracy for classification, F1 for extraction), embedding similarity between expected and actual output, A/B testing with user engagement signals (click-through, thumbs up/down).


âš¡ Key Details & Explanation

Manual evaluation doesn't scale. Automated approaches: LLM-as-judge (use a separate LLM to rate the output against a rubric), reference-based evaluation (compare to gold standard answers with BLEU/ROUGE for text tasks), task-specific metrics (accuracy for classification, F1 for extraction), embedding similarity between expected and actual output, A/B testing with user engagement signals (click-through, thumbs up/down). Build an eval dataset from real usage patterns and run it on every model/prompt change.


âš¡ One-line Interview Answer

Manual evaluation doesn't scale.

🧠 Simple Definition (Word-for-word)

Never trust model output blindly.


âš¡ Super Simple Line

Ask for a strict JSON schema, parse it defensively, and validate it with a runtime validator such as zod or JSON Schema before using it.


âš¡ Key Details & Explanation

Never trust model output blindly. Ask for a strict JSON schema, parse it defensively, and validate it with a runtime validator such as zod or JSON Schema before using it. If parsing fails, retry with a repair prompt or fall back to a safer path. This matters because LLMs may return extra text, missing fields, wrong types, or hallucinated keys even when they usually behave well.


âš¡ One-line Interview Answer

Never trust model output blindly.

🧠 Simple Definition (Word-for-word)

Plan for fallback.


âš¡ Super Simple Line

Use timeouts, retries with backoff, circuit breakers, and clear error handling in the UI.


âš¡ Key Details & Explanation

Plan for fallback. Use timeouts, retries with backoff, circuit breakers, and clear error handling in the UI. For critical paths, keep a secondary provider or cheaper backup model, though responses may differ so test prompt portability. Log provider latency, error rates, and token usage separately so you can detect degradation quickly. In production, graceful degradation is better than a hard outage.


âš¡ One-line Interview Answer

Plan for fallback.

🧠 Simple Definition (Word-for-word)

Main concerns: sending sensitive data to third-party models, prompt injection from untrusted content, data retention policies, and leaking secrets through tool use.


âš¡ Super Simple Line

Mitigations: redact or avoid PII where possible, isolate tool permissions, validate tool inputs and outputs, use allowlists for actions, and understand your model provider's retention settings.


âš¡ Key Details & Explanation

Main concerns: sending sensitive data to third-party models, prompt injection from untrusted content, data retention policies, and leaking secrets through tool use. Mitigations: redact or avoid PII where possible, isolate tool permissions, validate tool inputs and outputs, use allowlists for actions, and understand your model provider's retention settings. For RAG systems, treat retrieved documents as untrusted input and instruct the model not to follow embedded malicious instructions. The main thing I would emphasize is the threat model: what can an attacker do, and which protection stops that attack. Security answers become stronger when they connect the risk to the mitigation.


âš¡ One-line Interview Answer

Main concerns: sending sensitive data to third-party models, prompt injection from untrusted content, data retention policies, and leaking secrets through tool use.