LangSmith: The Complete Guide to Building Production-Ready LLM Applications
Large Language Model (LLM) applications are transforming industries, but building reliable, production-ready AI systems remains challenging. Enter LangSmith, a platform designed specifically for developing, monitoring, and optimizing LLM applications at scale.
In this guide, we'll explore how LangSmith addresses the critical challenges of LLM development and helps you ship AI applications with confidence.
What is LangSmith?
LangSmith is a platform for building production-grade LLM applications that provides three core capabilities:
- Observability: Monitor and analyze your LLM applications in real-time
- Evaluation: Systematically test and measure application performance
- Prompt Engineering: Iterate on prompts with version control and collaboration
The platform is framework-agnostic, meaning you can use it with or without LangChain’s open-source frameworks like langchain and langgraph.
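Because tracing lives in the langsmith SDK rather than in any one framework, you can instrument a bare OpenAI client directly. A minimal sketch using the SDK's OpenAI wrapper (it assumes the tracing environment variables described later in this guide are set):
```python
from openai import OpenAI
from langsmith.wrappers import wrap_openai

# Wrapping the client logs every chat completion call to LangSmith,
# with no LangChain dependency required
client = wrap_openai(OpenAI())

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is LangSmith?"}],
)
print(response.choices[0].message.content)
```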
Why LangSmith Matters for LLM Development
The Challenge of LLM Unpredictability
LLMs don’t always behave predictably. Small changes in prompts, models, or inputs can significantly impact results. This unpredictability makes it difficult to:
- Debug issues in production
- Measure application performance consistently
- Iterate on prompts effectively
- Ensure reliability across different scenarios
LangSmith’s Solution
LangSmith provides structured approaches to these challenges through:
- Comprehensive Tracing: Track every component of your LLM application
- Quantitative Evaluation: Measure performance with structured metrics
- Collaborative Development: Enable teams to work together on prompt engineering
- Production Monitoring: Get insights into real-world application behavior
Core Features Deep Dive
1. Observability: See Inside Your LLM Applications
Observability in LangSmith allows you to trace and analyze every aspect of your LLM application’s behavior.
Key Capabilities:
- Trace Analysis: Visualize the complete flow of your application
- Metrics & Dashboards: Configure custom metrics and monitoring dashboards
- Alerts: Set up notifications for performance issues or anomalies
- Real-time Monitoring: Track application behavior as it happens
Getting Started with Observability
Here’s how to set up basic tracing for a RAG (Retrieval-Augmented Generation) application:
Installation:
```bash
pip install -U langsmith openai
```
Environment Setup:
```bash
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-langsmith-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"
```
Basic RAG Application with Tracing:
```python
from openai import OpenAI
from langsmith import traceable

openai_client = OpenAI()

@traceable
def retriever(query: str):
    # Mock retriever - replace with your actual retrieval logic
    results = ["Harrison worked at Kensho", "He was a software engineer"]
    return results

@traceable
def rag(question: str):
    docs = retriever(question)
    system_message = f"""Answer the user's question using only the provided information below:
    {chr(10).join(docs)}"""

    return openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        model="gpt-4o-mini",
    )

# Usage
response = rag("Where did Harrison work?")
print(response.choices[0].message.content)
```
The @traceable decorator automatically captures:
- Function inputs and outputs
- Execution time and performance metrics
- Error handling and debugging information
- Nested function calls and their relationships
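The decorator also accepts optional arguments that add context to each trace. A minimal sketch reusing the mock retriever from above (the tag and metadata values here are purely illustrative):
```python
from langsmith import traceable

# name, run_type, tags, and metadata are optional parameters of @traceable;
# the specific values below are illustrative, not required
@traceable(name="vector_retriever", run_type="retriever",
           tags=["rag"], metadata={"index": "demo-index"})
def retriever(query: str):
    # Mock retriever - replace with your actual retrieval logic
    return ["Harrison worked at Kensho", "He was a software engineer"]
```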
2. Evaluation: Measure What Matters
Evaluation provides quantitative ways to measure your LLM application’s performance, crucial for maintaining quality as you iterate and scale.
Components of Evaluation:
- Dataset: Test inputs and optionally expected outputs
- Target Function: What you’re evaluating (could be a single LLM call or entire application)
- Evaluators: Functions that score your target function’s outputs
Setting Up Evaluations
Creating a Dataset:
```python
from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="rag_qa_dataset",
    description="Questions and answers for RAG evaluation"
)

# Add examples to the dataset
examples = [
    {
        "inputs": {"question": "Where did Harrison work?"},
        "outputs": {"answer": "Harrison worked at Kensho"}
    },
    {
        "inputs": {"question": "What was Harrison's role?"},
        "outputs": {"answer": "Harrison was a software engineer"}
    }
]

for example in examples:
    client.create_example(
        inputs=example["inputs"],
        outputs=example["outputs"],
        dataset_id=dataset.id
    )
```
Running Evaluations:
```python
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

def accuracy_evaluator(run: Run, example: Example) -> dict:
    """Custom evaluator to check answer accuracy"""
    predicted = run.outputs.get("answer", "").lower()
    expected = example.outputs.get("answer", "").lower()

    return {
        "key": "accuracy",
        "score": 1.0 if expected in predicted else 0.0
    }

# Wrap the traced rag() function so its output exposes the "answer" key
# that the evaluator and the dataset examples expect
def target(inputs: dict) -> dict:
    response = rag(inputs["question"])
    return {"answer": response.choices[0].message.content}

# Run evaluation
results = evaluate(
    target,
    data="rag_qa_dataset",
    evaluators=[accuracy_evaluator],
    experiment_prefix="rag_experiment"
)

# Per-example scores and aggregate results are shown in the LangSmith UI
print(f"Experiment: {results.experiment_name}")
```
Using Pre-built Evaluators
LangSmith integrates with the openevals package for common evaluation patterns:
```python
# pip install openevals
from langsmith.evaluation import evaluate
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

# Pre-built LLM-as-judge correctness evaluator from openevals
correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    feedback_key="correctness",
    model="openai:gpt-4o-mini",
)

results = evaluate(
    target,
    data="rag_qa_dataset",
    evaluators=[correctness_evaluator],
    experiment_prefix="correctness_test"
)
```
3. Prompt Engineering: Iterate with Confidence
LangSmith’s prompt engineering capabilities provide version control, collaboration features, and systematic testing for prompt development.
Key Features:
- Version Control: Automatic tracking of prompt changes
- Collaboration: Team-based prompt development
- A/B Testing: Compare different prompt versions
- Integration: Seamless connection with your applications
Prompt Engineering Workflow
- Create Prompts in the UI: Use LangSmith’s web interface to design prompts
- Version Management: Automatically track changes and iterations
- Testing: Evaluate prompts against your datasets
- Deployment: Push successful prompts to production
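The deployment step typically means referencing a prompt stored in LangSmith from your application code instead of hard-coding it. A minimal sketch, assuming a chat prompt has been saved in the UI under the hypothetical name rag-answer-prompt with a question input variable:
```python
from langsmith import Client
from langchain_openai import ChatOpenAI

client = Client()

# Pull the latest saved version of the prompt
# ("rag-answer-prompt" is a hypothetical name used for this example)
prompt = client.pull_prompt("rag-answer-prompt")

# The pulled prompt behaves like a regular LangChain prompt template
chain = prompt | ChatOpenAI(model="gpt-4o-mini")
result = chain.invoke({"question": "Where did Harrison work?"})
print(result.content)
```
A specific version can be pinned by appending its commit hash to the prompt name, for example "rag-answer-prompt:<commit-hash>".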
Advanced Use Cases
Multi-Agent Systems
LangSmith excels at tracing complex multi-agent workflows:
```python
@traceable
def research_agent(topic: str):
    """Agent that researches a topic"""
    # Research logic here
    return f"Research findings on {topic}"

@traceable
def writing_agent(research: str, style: str):
    """Agent that writes content based on research"""
    # Writing logic here
    return f"Article written in {style} style based on: {research}"

@traceable
def multi_agent_workflow(topic: str, style: str):
    research = research_agent(topic)
    article = writing_agent(research, style)
    return article
```
Production Monitoring
Set up comprehensive monitoring for production applications:
```python
import logging

from langsmith import traceable

logger = logging.getLogger(__name__)

@traceable(name="production_rag", tags=["production", "rag"])
def production_rag(question: str, user_id: str):
    try:
        # Inputs (including user_id), outputs, latency, and any errors
        # are captured on the trace automatically
        return rag(question)
    except Exception as e:
        # The exception is recorded on the trace before being re-raised
        logger.error(f"Production query failed for user {user_id}: {e}")
        raise
```
Integration with LangChain and LangGraph
If you’re using LangChain or LangGraph, LangSmith integration is even simpler:
LangChain Integration
```python
import os

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Set environment variables
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-api-key"

# Create a simple chain
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])

chain = prompt | llm | StrOutputParser()

# Automatic tracing is enabled!
result = chain.invoke({"input": "What is LangSmith?"})
```
LangGraph Integration
```python
from typing import Annotated
from typing_extensions import TypedDict

from langchain_core.messages import HumanMessage
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages

# Graph state: a message list that new messages are appended to
class State(TypedDict):
    messages: Annotated[list, add_messages]

def chatbot(state: State):
    return {"messages": [llm.invoke(state["messages"])]}

# Build graph
workflow = StateGraph(State)
workflow.add_node("chatbot", chatbot)
workflow.set_entry_point("chatbot")
workflow.add_edge("chatbot", END)

app = workflow.compile()

# Automatic tracing for the entire graph
result = app.invoke({"messages": [HumanMessage(content="Hello!")]})
```
Best Practices for Production
1. Structured Logging
```python
@traceable(tags=["production", "user-facing"])
def production_function(input_data):
    # Add context to traces
    return process_data(input_data)
```
2. Error Handling
```python
@traceable
def robust_llm_call(prompt: str):
    try:
        return llm.invoke(prompt)
    except Exception as e:
        # LangSmith will automatically capture the error
        logger.error(f"LLM call failed: {e}")
        return "I apologize, but I'm having trouble processing your request."
```
3. Performance Monitoring
```python
from langsmith import Client
import time

client = Client()

@traceable
def monitored_function(input_data):
    start_time = time.time()
    result = expensive_operation(input_data)
    duration = time.time() - start_time

    # Log custom metrics
    client.create_run(
        name="performance_metric",
        inputs={"duration": duration},
        run_type="tool"
    )

    return result
```
Getting Started: Step-by-Step
1. Sign Up and Get API Key
- Visit LangSmith (smith.langchain.com)
- Create an account
- Navigate to Settings → API Keys
- Create a new API key
2. Install Dependencies
```bash
pip install langsmith openai
```
3. Set Environment Variables
```bash
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="your-api-key"
```
4. Start Tracing
Add the @traceable decorator to your functions and start seeing insights immediately.
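In the simplest case this is a one-line change. A minimal, self-contained sketch (the function body is a stand-in for your own LLM call or chain):
```python
from langsmith import traceable

@traceable
def answer(question: str):
    # Replace the body with your own LLM call or chain
    return f"You asked: {question}"

# With tracing enabled, this call now appears as a run in LangSmith
answer("What is LangSmith?")
```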
5. Create Your First Evaluation
- Create a dataset with test cases
- Define evaluation metrics
- Run evaluations to measure performance
Conclusion
LangSmith transforms LLM application development from an art into a science. By providing comprehensive observability, systematic evaluation, and collaborative prompt engineering, it enables teams to build reliable, production-ready AI applications.
Whether you’re building a simple chatbot or a complex multi-agent system, LangSmith provides the tools you need to:
- Debug effectively with detailed tracing
- Measure performance with quantitative evaluations
- Iterate confidently with version-controlled prompt engineering
- Scale reliably with production monitoring
The platform’s framework-agnostic approach means you can integrate it into existing workflows, while its deep integration with LangChain and LangGraph provides seamless experiences for those ecosystems.
Start your LangSmith journey today and experience the difference that proper tooling makes in LLM application development. Your future self (and your users) will thank you for building more reliable, observable, and maintainable AI applications.
Ready to get started with LangSmith? Check out the official documentation and begin building production-ready LLM applications today.