
LangSmith

Use Weco with LangSmith datasets and evaluators for offline evaluation

Weco's LangSmith integration lets you optimize code against LangSmith datasets using both code-based evaluators and dashboard-configured LLM judges, without writing shell eval scripts.

Instead of writing an evaluation command that prints metrics to stdout, you point Weco at a LangSmith dataset and evaluators. Weco handles the rest: running your target function against each example, collecting scores, and iteratively improving your code.

How it works

  1. Target function: A Python function that receives dataset inputs and returns outputs. LangSmith calls this for each example in your dataset.
  2. Evaluators: Scoring functions that run locally (code evaluators) and/or LLM judges configured in the LangSmith dashboard (dashboard evaluators).
  3. Metric function: A function that combines all evaluator scores into a single number for Weco to optimize.
  4. Optimization loop: Weco iteratively modifies your source file, re-runs evaluation against the dataset, and keeps the version that scores best.
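Put together, a single evaluation pass conceptually looks like the sketch below. This is an illustrative simplification, not Weco's actual internals; the function and variable names are invented for the example.

```python
# Illustrative sketch of one evaluation pass (not Weco's actual internals).
def evaluate(dataset, target, evaluators, metric_fn):
    per_evaluator = {}
    for example in dataset:
        outputs = target(example["inputs"])   # 1. run the target on each example
        for ev in evaluators:                 # 2. score the outputs
            result = ev(outputs, example)     #    (real evaluators receive run objects)
            per_evaluator.setdefault(result["key"], []).append(result["score"])
    # 3. aggregate each evaluator's scores across the dataset (mean by default)
    aggregated = {k: sum(v) / len(v) for k, v in per_evaluator.items()}
    # 4. collapse to the single number Weco optimizes
    return metric_fn(aggregated)
```

Weco repeats this pass after each code modification and keeps the candidate with the best metric value.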

Prerequisites

Before you start, you'll need:

  • Python and git installed locally
  • An OpenAI API key
  • A LangSmith account and API key

Tutorial: Optimize an HR QA Agent

This walkthrough uses the ZephHR example, a QA agent that answers HR policy questions over fictional documentation. Weco optimizes the agent's prompts to improve answer quality, measured by a composite metric that combines correctness and helpfulness.

Clone the example

```
git clone https://github.com/WecoAI/weco-cli.git
cd weco-cli/examples/langsmith-zeph-hr-qa
```

Create a virtual environment and install dependencies:

macOS/Linux:

```
python -m venv .venv
source .venv/bin/activate
pip install weco openai langsmith
```

Windows (Command Prompt):

```
python -m venv .venv
.venv\Scripts\activate
pip install weco openai langsmith
```

Windows (PowerShell):

```
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install weco openai langsmith
```

Set environment variables

```
export OPENAI_API_KEY="sk-..."
export LANGCHAIN_API_KEY="lsv2_..."
```

You can get your LangSmith API key from smith.langchain.com/settings.

Understand the project structure

| File | Purpose |
| --- | --- |
| `agent.py` | QA agent using GPT-4o-mini. Weco optimizes the prompts in this file |
| `evaluators.py` | Code-based evaluators + the `qa_score` metric function |
| `setup_dataset.py` | Creates LangSmith datasets from JSON question sets |
| `docs.md` | ZephHR product documentation (the knowledge base) |
| `optimizer_exemplars.md` | Few-shot Q&A examples to guide the optimizer |
| `data/` | JSON files with optimization (15) and holdout (10) questions |

The target function in agent.py is what LangSmith calls for each dataset example:

```python
def answer_hr_question(inputs: dict) -> dict:
    """Answer an HR policy question from the ZephHR docs."""
    question = inputs.get("question", "")

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(docs=DOCS, question=question)},
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
    )

    # ... parse and return {answer, confidence, relevant_sections}
```

Weco optimizes SYSTEM_PROMPT and USER_TEMPLATE in this file to improve the agent's answers.

The metric function in evaluators.py combines evaluator scores into a single optimization target:

```python
def qa_score(scores: dict) -> float:
    """Combine correctness (binary gate) with helpfulness (1-5 signal).

    correctness * (helpfulness - 1) / 4

    - Incorrect answers always score 0.
    - Correct answers are ranked by helpfulness, normalized to 0-1.
    """
    correctness = scores.get("correctness", 0.0)
    helpfulness = scores.get("helpfulness", 1.0)
    return correctness * (helpfulness - 1.0) / 4.0
```

This gated metric means the optimizer can't game helpfulness without getting facts right. Incorrect answers always score 0.
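Plugging a few score combinations into this formula makes the gating concrete:

```python
def qa_score(scores: dict) -> float:
    """Gated metric: correctness * (helpfulness - 1) / 4."""
    correctness = scores.get("correctness", 0.0)
    helpfulness = scores.get("helpfulness", 1.0)
    return correctness * (helpfulness - 1.0) / 4.0

print(qa_score({"correctness": 1.0, "helpfulness": 5.0}))  # 1.0: correct and maximally helpful
print(qa_score({"correctness": 1.0, "helpfulness": 3.0}))  # 0.5: correct, middling helpfulness
print(qa_score({"correctness": 0.0, "helpfulness": 5.0}))  # 0.0: incorrect, helpfulness ignored
```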

The example also includes two code evaluators that run locally:

```python
def json_schema_validity(run, example) -> dict:
    """Check that the agent output contains the required fields."""
    outputs = run.outputs or {}
    checks = {
        "answer": isinstance(outputs.get("answer"), str) and len(outputs["answer"]) > 0,
        "confidence": outputs.get("confidence") in ("high", "medium", "low"),
        "relevant_sections": isinstance(outputs.get("relevant_sections"), list),
    }
    passed = all(checks.values())
    return {"key": "json_schema_validity", "score": 1.0 if passed else 0.0, "comment": "..."}


def conciseness(run, example) -> dict:
    """Score answer length: 1.0 for ≤150 words, 0.5 for ≤250, 0.0 for >250 or empty."""
    words = len(((run.outputs or {}).get("answer") or "").split())
    score = 0.0 if words == 0 or words > 250 else (1.0 if words <= 150 else 0.5)
    return {"key": "conciseness", "score": score}
```

Create the dataset

Run the setup script to create a dataset in LangSmith:

```
python setup_dataset.py
```

This creates a single dataset called zephhr-qa with two splits:

  • opt: 15 optimization questions (used during the optimization loop)
  • holdout: 10 held-out questions (used for validation after optimization)

The setup script is idempotent. Running it multiple times won't create duplicate examples.

Configure LangSmith dashboard evaluators

Before running the optimization, set up two online evaluators in your LangSmith project. These are LLM judges that run server-side and score each agent response asynchronously.

  1. Go to your LangSmith project
  2. Navigate to the evaluators section
  3. Add the correctness evaluator. This is available as a default evaluator in LangSmith (binary factual accuracy, 0 or 1).
  4. Create a custom helpfulness evaluator that scores how complete and useful the answer is (1-5 scale). Make sure the Feedback Key is set to helpfulness, as this is the name Weco uses to match the evaluator's scores.

Dashboard evaluators run asynchronously after each evaluation. Weco automatically polls for their scores (up to 15 minutes by default). You can adjust the timeout with --langsmith-dashboard-evaluator-timeout.

Run the optimization

Run the optimization with all parameters specified on the command line:

```
weco run --source agent.py \
  --eval-backend langsmith \
  --langsmith-dataset zephhr-qa \
  --langsmith-splits opt \
  --langsmith-target agent:answer_hr_question \
  --langsmith-evaluators evaluators:json_schema_validity evaluators:conciseness \
  --langsmith-dashboard-evaluators helpfulness correctness \
  --langsmith-metric-function evaluators:qa_score \
  --additional-instructions optimizer_exemplars.md \
  --metric qa_score --goal maximize --steps 10
```

Here's what each flag does:

| Flag | Purpose |
| --- | --- |
| `--source agent.py` | The file Weco will optimize |
| `--eval-backend langsmith` | Use LangSmith instead of a shell eval command |
| `--langsmith-dataset` | LangSmith dataset to evaluate against |
| `--langsmith-splits` | Evaluate only examples in these dataset splits |
| `--langsmith-target` | Target function as `module:function` |
| `--langsmith-evaluators` | Code-based evaluator functions as `module:function` |
| `--langsmith-dashboard-evaluators` | Names of LLM judges configured in LangSmith |
| `--langsmith-metric-function` | Function that combines scores into a single metric |
| `--additional-instructions` | File with hints/exemplars to guide the optimizer |
| `--metric` | Name of the metric to optimize |
| `--goal maximize` | Direction of optimization |
| `--steps 10` | Number of optimization iterations |

If you prefer a visual setup, run Weco with just the eval backend flag:

```
weco run --eval-backend langsmith
```

A browser-based setup wizard opens automatically, letting you configure everything visually:

  • Source file(s) to optimize
  • LangSmith dataset name
  • Target function (module:function)
  • Code evaluators and dashboard evaluators
  • Metric name and metric function
  • Run parameters (steps, model, instructions, timeout)

Once you submit the configuration, the optimization starts automatically.

The wizard launches automatically when required parameters (--langsmith-dataset and --langsmith-target) are not provided. You can also partially specify flags on the command line and the wizard will pre-fill those values and ask for the rest.

Monitor the optimization

Track progress in the Weco dashboard. Each iteration shows the metric score and the code changes Weco made. When the run completes, you'll be prompted to apply the best-performing version to your source file.

Validate on the holdout set

After optimization, verify that the improvements generalize to unseen questions by running a single evaluation against the holdout dataset:

```
weco run --source agent.py \
  --eval-backend langsmith \
  --langsmith-dataset zephhr-qa \
  --langsmith-splits holdout \
  --langsmith-target agent:answer_hr_question \
  --langsmith-evaluators evaluators:json_schema_validity evaluators:conciseness \
  --langsmith-dashboard-evaluators helpfulness correctness \
  --langsmith-metric-function evaluators:qa_score \
  --metric qa_score --goal maximize --steps 1
```

Alternatively, launch the setup wizard:

```
weco run --eval-backend langsmith
```

In the wizard, select the zephhr-qa dataset, choose the holdout split, and set steps to 1. This runs a single evaluation pass without optimization, giving you the holdout score for the optimized agent.

Key concepts

Target function

Your target function is specified as module:function (e.g., agent:answer_hr_question). It receives an inputs dict from the dataset and returns a dict of outputs. LangSmith calls this function once per dataset example.
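A minimal target function might look like the sketch below. The `question` and `answer` field names are assumptions; they depend on your dataset schema.

```python
def my_target(inputs: dict) -> dict:
    # `inputs` is the dataset example's inputs dict
    question = inputs.get("question", "")
    # ... call your model or pipeline here (placeholder response below) ...
    return {"answer": f"You asked: {question}"}
```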

Code evaluators

Code evaluators are Python functions specified as module:function (e.g., evaluators:json_schema_validity). Each receives (run, example) and returns a dict with key, score, and optionally comment:

```python
def my_evaluator(run, example) -> dict:
    # run.outputs contains the target function's output
    # example.outputs contains the expected output from the dataset
    return {"key": "my_evaluator", "score": 1.0, "comment": "Passed"}
```
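For instance, a simple exact-match evaluator against the dataset's reference output could be written as follows. This is a sketch; the `answer` field name is an assumption about your schema.

```python
def exact_match(run, example) -> dict:
    """Score 1.0 when the agent's answer matches the reference (case-insensitive)."""
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    score = 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0
    return {"key": "exact_match", "score": score}
```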

Dashboard evaluators

Dashboard evaluators are LLM judges configured in the LangSmith UI. They run asynchronously after evaluation. Weco polls for their scores automatically. Specify them by name with --langsmith-dashboard-evaluators.

Metric function

A metric function combines all evaluator scores into a single number for Weco to optimize. It receives a dict of {evaluator_name: aggregated_score} and returns a float:

```python
def my_metric(scores: dict) -> float:
    return scores["accuracy"] * scores["efficiency"]
```

Summary aggregation

Per-example evaluator scores are aggregated across the dataset using --langsmith-summary (default: mean). Options: mean, median, min, max.
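For example, with the default `mean` summary, each evaluator's per-example scores collapse to one number before the metric function sees them (the numbers below are hypothetical):

```python
from statistics import mean

# Hypothetical per-example scores from two evaluators over a 3-example dataset
per_example = {
    "correctness": [1.0, 0.0, 1.0],
    "helpfulness": [5.0, 2.0, 4.0],
}

# --langsmith-summary mean reduces each list to a single aggregated score;
# this dict is what the metric function receives.
aggregated = {name: mean(scores) for name, scores in per_example.items()}
```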

CLI reference

All LangSmith-specific flags for weco run --eval-backend langsmith:

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--langsmith-dataset` | string | required | LangSmith dataset name or ID |
| `--langsmith-target` | string | required | Target function as `module:function` |
| `--langsmith-splits` | string[] | (none) | Evaluate only examples in these dataset splits |
| `--langsmith-evaluators` | string[] | (none) | Code evaluator functions as `module:function` |
| `--langsmith-dashboard-evaluators` | string[] | (none) | Names of dashboard-bound LLM judge evaluators |
| `--langsmith-metric-function` | string | (none) | Custom scoring function as `module:function` |
| `--langsmith-summary` | string | mean | Aggregation method: mean, median, min, max |
| `--langsmith-experiment-prefix` | string | (none) | Prefix for experiment names in LangSmith UI |
| `--langsmith-max-examples` | int | (none) | Evaluate only N examples (faster iteration) |
| `--langsmith-max-concurrency` | int | (none) | Number of parallel evaluation threads |
| `--langsmith-target-adapter` | string | raw | Target adapter: raw, langchain, single-input |
| `--langsmith-dashboard-evaluator-timeout` | int | 900 | Seconds to poll for dashboard evaluator scores |
