LangSmith: The Complete Guide to Building Production-Ready LLM Applications
Large Language Model (LLM) applications are transforming industries, but building reliable, production-ready AI systems remains challenging. Enter LangSmith, a platform designed specifically for developing, monitoring, and optimizing LLM applications at scale.
In this guide, we'll explore how LangSmith addresses the critical challenges of LLM development and helps you ship AI applications with confidence.
What is LangSmith?
LangSmith is a platform for building production-grade LLM applications that provides three core capabilities:
- Observability: Monitor and analyze your LLM applications in real-time
- Evaluation: Systematically test and measure application performance
- Prompt Engineering: Iterate on prompts with version control and collaboration
The platform is framework-agnostic, meaning you can use it with or without LangChain’s open-source frameworks like langchain and langgraph.
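Because tracing lives in the langsmith SDK rather than in any one framework, you can instrument a bare OpenAI client directly. A minimal sketch using the SDK's OpenAI wrapper (it assumes the tracing environment variables described later in this guide are set):
```python
from openai import OpenAI
from langsmith.wrappers import wrap_openai

# Wrapping the client logs every chat completion call to LangSmith,
# with no LangChain dependency required
client = wrap_openai(OpenAI())

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is LangSmith?"}],
)
print(response.choices[0].message.content)
```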
Why LangSmith Matters for LLM Development
The Challenge of LLM Unpredictability
LLMs don’t always behave predictably. Small changes in prompts, models, or inputs can significantly impact results. This unpredictability makes it difficult to:
- Debug issues in production
- Measure application performance consistently
- Iterate on prompts effectively
- Ensure reliability across different scenarios
LangSmith’s Solution
LangSmith provides structured approaches to these challenges through:
- Comprehensive Tracing: Track every component of your LLM application
- Quantitative Evaluation: Measure performance with structured metrics
- Collaborative Development: Enable teams to work together on prompt engineering
- Production Monitoring: Get insights into real-world application behavior
Core Features Deep Dive
1. Observability: See Inside Your LLM Applications
Observability in LangSmith allows you to trace and analyze every aspect of your LLM application’s behavior.
Key Capabilities:
- Trace Analysis: Visualize the complete flow of your application
- Metrics & Dashboards: Configure custom metrics and monitoring dashboards
- Alerts: Set up notifications for performance issues or anomalies
- Real-time Monitoring: Track application behavior as it happens
Getting Started with Observability
Here’s how to set up basic tracing for a RAG (Retrieval-Augmented Generation) application:
Installation:
```bash
pip install -U langsmith openai
```
Environment Setup:
```bash
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-langsmith-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"
```
Basic RAG Application with Tracing:
```python
from openai import OpenAI
from langsmith import traceable

openai_client = OpenAI()

@traceable
def retriever(query: str):
    # Mock retriever - replace with your actual retrieval logic
    results = ["Harrison worked at Kensho", "He was a software engineer"]
    return results

@traceable
def rag(question: str):
    docs = retriever(question)
    system_message = f"""Answer the user's question using only the provided information below:
    {chr(10).join(docs)}"""

    return openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        model="gpt-4o-mini",
    )

# Usage
response = rag("Where did Harrison work?")
print(response.choices[0].message.content)
```
The @traceable decorator automatically captures:
- Function inputs and outputs
- Execution time and performance metrics
- Error handling and debugging information
- Nested function calls and their relationships
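The decorator also accepts optional arguments that add context to each trace. A minimal sketch reusing the mock retriever from above (the tag and metadata values here are purely illustrative):
```python
from langsmith import traceable

# name, run_type, tags, and metadata are optional parameters of @traceable;
# the specific values below are illustrative, not required
@traceable(name="vector_retriever", run_type="retriever",
           tags=["rag"], metadata={"index": "demo-index"})
def retriever(query: str):
    # Mock retriever - replace with your actual retrieval logic
    return ["Harrison worked at Kensho", "He was a software engineer"]
```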
2. Evaluation: Measure What Matters
Evaluation provides quantitative ways to measure your LLM application’s performance, crucial for maintaining quality as you iterate and scale.
Components of Evaluation:
- Dataset: Test inputs and optionally expected outputs
- Target Function: What you’re evaluating (could be a single LLM call or entire application)
- Evaluators: Functions that score your target function’s outputs
Setting Up Evaluations
Creating a Dataset:
```python
from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="rag_qa_dataset",
    description="Questions and answers for RAG evaluation"
)

# Add examples to the dataset
examples = [
    {
        "inputs": {"question": "Where did Harrison work?"},
        "outputs": {"answer": "Harrison worked at Kensho"}
    },
    {
        "inputs": {"question": "What was Harrison's role?"},
        "outputs": {"answer": "Harrison was a software engineer"}
    }
]

for example in examples:
    client.create_example(
        inputs=example["inputs"],
        outputs=example["outputs"],
        dataset_id=dataset.id
    )
```
Running Evaluations:
```python
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

def accuracy_evaluator(run: Run, example: Example) -> dict:
    """Custom evaluator to check answer accuracy"""
    predicted = run.outputs.get("answer", "").lower()
    expected = example.outputs.get("answer", "").lower()

    return {
        "key": "accuracy",
        "score": 1.0 if expected in predicted else 0.0
    }

# Wrap the traced rag() function so its output exposes the "answer" key
# that the evaluator and the dataset examples expect
def target(inputs: dict) -> dict:
    response = rag(inputs["question"])
    return {"answer": response.choices[0].message.content}

# Run evaluation
results = evaluate(
    target,
    data="rag_qa_dataset",
    evaluators=[accuracy_evaluator],
    experiment_prefix="rag_experiment"
)

# Per-example scores and aggregate results are shown in the LangSmith UI
print(f"Experiment: {results.experiment_name}")
```
Using Pre-built Evaluators
LangSmith integrates with the openevals package for common evaluation patterns:
```python
# pip install openevals
from langsmith.evaluation import evaluate
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

# Pre-built LLM-as-judge correctness evaluator from openevals
correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    feedback_key="correctness",
    model="openai:gpt-4o-mini",
)

results = evaluate(
    target,
    data="rag_qa_dataset",
    evaluators=[correctness_evaluator],
    experiment_prefix="correctness_test"
)
```
3. Prompt Engineering: Iterate with Confidence
LangSmith’s prompt engineering capabilities provide version control, collaboration features, and systematic testing for prompt development.
Key Features:
- Version Control: Automatic tracking of prompt changes
- Collaboration: Team-based prompt development
- A/B Testing: Compare different prompt versions
- Integration: Seamless connection with your applications
Prompt Engineering Workflow
- Create Prompts in the UI: Use LangSmith’s web interface to design prompts
- Version Management: Automatically track changes and iterations
- Testing: Evaluate prompts against your datasets
- Deployment: Push successful prompts to production
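The deployment step typically means referencing a prompt stored in LangSmith from your application code instead of hard-coding it. A minimal sketch, assuming a chat prompt has been saved in the UI under the hypothetical name rag-answer-prompt with a question input variable:
```python
from langsmith import Client
from langchain_openai import ChatOpenAI

client = Client()

# Pull the latest saved version of the prompt
# ("rag-answer-prompt" is a hypothetical name used for this example)
prompt = client.pull_prompt("rag-answer-prompt")

# The pulled prompt behaves like a regular LangChain prompt template
chain = prompt | ChatOpenAI(model="gpt-4o-mini")
result = chain.invoke({"question": "Where did Harrison work?"})
print(result.content)
```
A specific version can be pinned by appending its commit hash to the prompt name, for example "rag-answer-prompt:<commit-hash>".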
Advanced Use Cases
Multi-Agent Systems
LangSmith excels at tracing complex multi-agent workflows:
```python
@traceable
def research_agent(topic: str):
    """Agent that researches a topic"""
    # Research logic here
    return f"Research findings on {topic}"

@traceable
def writing_agent(research: str, style: str):
    """Agent that writes content based on research"""
    # Writing logic here
    return f"Article written in {style} style based on: {research}"

@traceable
def multi_agent_workflow(topic: str, style: str):
    research = research_agent(topic)
    article = writing_agent(research, style)
    return article
```
Production Monitoring
Set up comprehensive monitoring for production applications:
```python
import logging

from langsmith import traceable

logger = logging.getLogger(__name__)

@traceable(name="production_rag", tags=["production", "rag"])
def production_rag(question: str, user_id: str):
    try:
        # Inputs (including user_id), outputs, latency, and any errors
        # are captured on the trace automatically
        return rag(question)
    except Exception as e:
        # The exception is recorded on the trace before being re-raised
        logger.error(f"Production query failed for user {user_id}: {e}")
        raise
```
Integration with LangChain and LangGraph
If you’re using LangChain or LangGraph, LangSmith integration is even simpler:
LangChain Integration
```python
import os

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Set environment variables
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-api-key"

# Create a simple chain
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])

chain = prompt | llm | StrOutputParser()

# Automatic tracing is enabled!
result = chain.invoke({"input": "What is LangSmith?"})
```
LangGraph Integration
```python
from typing import Annotated
from typing_extensions import TypedDict

from langchain_core.messages import HumanMessage
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages

# Graph state: a message list that new messages are appended to
class State(TypedDict):
    messages: Annotated[list, add_messages]

def chatbot(state: State):
    return {"messages": [llm.invoke(state["messages"])]}

# Build graph
workflow = StateGraph(State)
workflow.add_node("chatbot", chatbot)
workflow.set_entry_point("chatbot")
workflow.add_edge("chatbot", END)

app = workflow.compile()

# Automatic tracing for the entire graph
result = app.invoke({"messages": [HumanMessage(content="Hello!")]})
```
Best Practices for Production
1. Structured Logging
```python
@traceable(tags=["production", "user-facing"])
def production_function(input_data):
    # Add context to traces
    return process_data(input_data)
```
2. Error Handling
```python
@traceable
def robust_llm_call(prompt: str):
    try:
        return llm.invoke(prompt)
    except Exception as e:
        # LangSmith will automatically capture the error
        logger.error(f"LLM call failed: {e}")
        return "I apologize, but I'm having trouble processing your request."
```
3. Performance Monitoring
```python
from langsmith import Client
import time

client = Client()

@traceable
def monitored_function(input_data):
    start_time = time.time()
    result = expensive_operation(input_data)
    duration = time.time() - start_time

    # Log custom metrics
    client.create_run(
        name="performance_metric",
        inputs={"duration": duration},
        run_type="tool"
    )

    return result
```
Getting Started: Step-by-Step
1. Sign Up and Get API Key
- Visit LangSmith (smith.langchain.com)
- Create an account
- Navigate to Settings → API Keys
- Create a new API key
2. Install Dependencies
```bash
pip install langsmith openai
```
3. Set Environment Variables
```bash
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="your-api-key"
```
4. Start Tracing
Add the @traceable decorator to your functions and start seeing insights immediately.
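In the simplest case this is a one-line change. A minimal, self-contained sketch (the function body is a stand-in for your own LLM call or chain):
```python
from langsmith import traceable

@traceable
def answer(question: str):
    # Replace the body with your own LLM call or chain
    return f"You asked: {question}"

# With tracing enabled, this call now appears as a run in LangSmith
answer("What is LangSmith?")
```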
5. Create Your First Evaluation
- Create a dataset with test cases
- Define evaluation metrics
- Run evaluations to measure performance
Conclusion
LangSmith transforms LLM application development from an art into a science. By providing comprehensive observability, systematic evaluation, and collaborative prompt engineering, it enables teams to build reliable, production-ready AI applications.
Whether you’re building a simple chatbot or a complex multi-agent system, LangSmith provides the tools you need to:
- Debug effectively with detailed tracing
- Measure performance with quantitative evaluations
- Iterate confidently with version-controlled prompt engineering
- Scale reliably with production monitoring
The platform’s framework-agnostic approach means you can integrate it into existing workflows, while its deep integration with LangChain and LangGraph provides seamless experiences for those ecosystems.
Start your LangSmith journey today and experience the difference that proper tooling makes in LLM application development. Your future self (and your users) will thank you for building more reliable, observable, and maintainable AI applications.
Ready to get started with LangSmith? Check out the official documentation and begin building production-ready LLM applications today.