
LangSmith

Use Weco with LangSmith datasets and evaluators for offline evaluation

Weco's LangSmith integration lets you optimize code against LangSmith datasets using both code-based evaluators and dashboard-configured LLM judges, without writing shell eval scripts.

Instead of writing an evaluation command that prints metrics to stdout, you point Weco at a LangSmith dataset and evaluators. Weco handles the rest: running your target function against each example, collecting scores, and iteratively improving your code.

How it works

  1. Target function: A Python function that receives dataset inputs and returns outputs. LangSmith calls this for each example in your dataset.
  2. Evaluators: Scoring functions that run locally (code evaluators) and/or LLM judges configured in the LangSmith dashboard (dashboard evaluators).
  3. Metric function: A function that combines all evaluator scores into a single number for Weco to optimize.
  4. Optimization loop: Weco iteratively modifies your source file, re-runs evaluation against the dataset, and keeps the version that scores best.
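Put together, a single evaluation pass conceptually looks like the sketch below. This is an illustrative simplification, not Weco's actual internals; the function and variable names are invented for the example.

```python
# Illustrative sketch of one evaluation pass (not Weco's actual internals).
def evaluate(dataset, target, evaluators, metric_fn):
    per_evaluator = {}
    for example in dataset:
        outputs = target(example["inputs"])   # 1. run the target on each example
        for ev in evaluators:                 # 2. score the outputs
            result = ev(outputs, example)     #    (real evaluators receive run objects)
            per_evaluator.setdefault(result["key"], []).append(result["score"])
    # 3. aggregate each evaluator's scores across the dataset (mean by default)
    aggregated = {k: sum(v) / len(v) for k, v in per_evaluator.items()}
    # 4. collapse to the single number Weco optimizes
    return metric_fn(aggregated)
```

Weco repeats this pass after each code modification and keeps the candidate with the best metric value.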

Prerequisites

Before you start, you'll need:

  • Python and git installed locally
  • An OpenAI API key
  • A LangSmith account and API key

Tutorial: Optimize an HR QA Agent

This walkthrough uses the ZephHR example, a QA agent that answers HR policy questions over fictional documentation. Weco optimizes the agent's prompts to improve answer quality, measured by a composite metric that combines correctness and helpfulness.

Clone the example

```
git clone https://github.com/WecoAI/weco-cli.git
cd weco-cli/examples/langsmith-zeph-hr-qa
```

Create a virtual environment and install dependencies:

macOS/Linux:

```
python -m venv .venv
source .venv/bin/activate
pip install weco openai langsmith
```

Windows (Command Prompt):

```
python -m venv .venv
.venv\Scripts\activate
pip install weco openai langsmith
```

Windows (PowerShell):

```
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install weco openai langsmith
```

Set environment variables

```
export OPENAI_API_KEY="sk-..."
export LANGCHAIN_API_KEY="lsv2_..."
```

You can get your LangSmith API key from smith.langchain.com/settings.

Understand the project structure

| File | Purpose |
| --- | --- |
| `agent.py` | QA agent using GPT-4o-mini. Weco optimizes the prompts in this file |
| `evaluators.py` | Code-based evaluators + the `qa_score` metric function |
| `setup_dataset.py` | Creates LangSmith datasets from JSON question sets |
| `docs.md` | ZephHR product documentation (the knowledge base) |
| `optimizer_exemplars.md` | Few-shot Q&A examples to guide the optimizer |
| `data/` | JSON files with optimization (15) and holdout (10) questions |

The target function in agent.py is what LangSmith calls for each dataset example:

```python
def answer_hr_question(inputs: dict) -> dict:
    """Answer an HR policy question from the ZephHR docs."""
    question = inputs.get("question", "")

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(docs=DOCS, question=question)},
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
    )

    # ... parse and return {answer, confidence, relevant_sections}
```

Weco optimizes SYSTEM_PROMPT and USER_TEMPLATE in this file to improve the agent's answers.

The metric function in evaluators.py combines evaluator scores into a single optimization target:

```python
def qa_score(scores: dict) -> float:
    """Combine correctness (binary gate) with helpfulness (1-5 signal).

    correctness * (helpfulness - 1) / 4

    - Incorrect answers always score 0.
    - Correct answers are ranked by helpfulness, normalized to 0-1.
    """
    correctness = scores.get("correctness", 0.0)
    helpfulness = scores.get("helpfulness", 1.0)
    return correctness * (helpfulness - 1.0) / 4.0
```

This gated metric means the optimizer can't game helpfulness without getting facts right. Incorrect answers always score 0.
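Plugging a few score combinations into this formula makes the gating concrete:

```python
def qa_score(scores: dict) -> float:
    """Gated metric: correctness * (helpfulness - 1) / 4."""
    correctness = scores.get("correctness", 0.0)
    helpfulness = scores.get("helpfulness", 1.0)
    return correctness * (helpfulness - 1.0) / 4.0

print(qa_score({"correctness": 1.0, "helpfulness": 5.0}))  # 1.0: correct and maximally helpful
print(qa_score({"correctness": 1.0, "helpfulness": 3.0}))  # 0.5: correct, middling helpfulness
print(qa_score({"correctness": 0.0, "helpfulness": 5.0}))  # 0.0: incorrect, helpfulness ignored
```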

The example also includes two code evaluators that run locally:

```python
def json_schema_validity(run, example) -> dict:
    """Check that the agent output contains the required fields."""
    outputs = run.outputs or {}
    checks = {
        "answer": isinstance(outputs.get("answer"), str) and len(outputs["answer"]) > 0,
        "confidence": outputs.get("confidence") in ("high", "medium", "low"),
        "relevant_sections": isinstance(outputs.get("relevant_sections"), list),
    }
    passed = all(checks.values())
    return {"key": "json_schema_validity", "score": 1.0 if passed else 0.0, "comment": "..."}


def conciseness(run, example) -> dict:
    """Score answer length: 1.0 for ≤150 words, 0.5 for ≤250, 0.0 for >250 or empty."""
    words = len(((run.outputs or {}).get("answer") or "").split())
    score = 0.0 if words == 0 or words > 250 else (1.0 if words <= 150 else 0.5)
    return {"key": "conciseness", "score": score}
```

Create the dataset

Run the setup script to create a dataset in LangSmith:

```
python setup_dataset.py
```

This creates a single dataset called zephhr-qa with two splits:

  • opt: 15 optimization questions (used during the optimization loop)
  • holdout: 10 held-out questions (used for validation after optimization)

The setup script is idempotent. Running it multiple times won't create duplicate examples.

Configure LangSmith dashboard evaluators

Before running the optimization, set up two online evaluators in your LangSmith project. These are LLM judges that run server-side and score each agent response asynchronously.

  1. Go to your LangSmith project
  2. Navigate to the evaluators section
  3. Add the correctness evaluator. This is available as a default evaluator in LangSmith (binary factual accuracy, 0 or 1).
  4. Create a custom helpfulness evaluator that scores how complete and useful the answer is (1-5 scale). Make sure the Feedback Key is set to helpfulness, as this is the name Weco uses to match the evaluator's scores.

Dashboard evaluators run asynchronously after each evaluation. Weco automatically polls for their scores (up to 15 minutes by default). You can adjust the timeout with --langsmith-dashboard-evaluator-timeout.

Run the optimization

Run the optimization with all parameters specified on the command line:

```
weco run --source agent.py \
  --eval-backend langsmith \
  --langsmith-dataset zephhr-qa \
  --langsmith-splits opt \
  --langsmith-target agent:answer_hr_question \
  --langsmith-evaluators evaluators:json_schema_validity evaluators:conciseness \
  --langsmith-dashboard-evaluators helpfulness correctness \
  --langsmith-metric-function evaluators:qa_score \
  --additional-instructions optimizer_exemplars.md \
  --metric qa_score --goal maximize --steps 10
```

Here's what each flag does:

| Flag | Purpose |
| --- | --- |
| `--source agent.py` | The file Weco will optimize |
| `--eval-backend langsmith` | Use LangSmith instead of a shell eval command |
| `--langsmith-dataset` | LangSmith dataset to evaluate against |
| `--langsmith-splits` | Evaluate only examples in these dataset splits |
| `--langsmith-target` | Target function as `module:function` |
| `--langsmith-evaluators` | Code-based evaluator functions as `module:function` |
| `--langsmith-dashboard-evaluators` | Names of LLM judges configured in LangSmith |
| `--langsmith-metric-function` | Function that combines scores into a single metric |
| `--additional-instructions` | File with hints/exemplars to guide the optimizer |
| `--metric` | Name of the metric to optimize |
| `--goal maximize` | Direction of optimization |
| `--steps 10` | Number of optimization iterations |

If you prefer a visual setup, run Weco with just the eval backend flag:

```
weco run --eval-backend langsmith
```

A browser-based setup wizard opens automatically, letting you configure everything visually:

  • Source file(s) to optimize
  • LangSmith dataset name
  • Target function (module:function)
  • Code evaluators and dashboard evaluators
  • Metric name and metric function
  • Run parameters (steps, model, instructions, timeout)

Once you submit the configuration, the optimization starts automatically.

The wizard launches automatically when required parameters (--langsmith-dataset and --langsmith-target) are not provided. You can also partially specify flags on the command line and the wizard will pre-fill those values and ask for the rest.

Monitor the optimization

Track progress in the Weco dashboard. Each iteration shows the metric score and the code changes Weco made. When the run completes, you'll be prompted to apply the best-performing version to your source file.

Validate on the holdout set

After optimization, verify that the improvements generalize to unseen questions by running a single evaluation against the holdout dataset:

```
weco run --source agent.py \
  --eval-backend langsmith \
  --langsmith-dataset zephhr-qa \
  --langsmith-splits holdout \
  --langsmith-target agent:answer_hr_question \
  --langsmith-evaluators evaluators:json_schema_validity evaluators:conciseness \
  --langsmith-dashboard-evaluators helpfulness correctness \
  --langsmith-metric-function evaluators:qa_score \
  --metric qa_score --goal maximize --steps 1
```

Alternatively, launch the setup wizard:

```
weco run --eval-backend langsmith
```

In the wizard, select the zephhr-qa dataset, choose the holdout split, and set steps to 1. This runs a single evaluation pass without optimization, giving you the holdout score for the optimized agent.

Key concepts

Target function

Your target function is specified as module:function (e.g., agent:answer_hr_question). It receives an inputs dict from the dataset and returns a dict of outputs. LangSmith calls this function once per dataset example.
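A minimal target function might look like the sketch below. The `question` and `answer` field names are assumptions; they depend on your dataset schema.

```python
def my_target(inputs: dict) -> dict:
    # `inputs` is the dataset example's inputs dict
    question = inputs.get("question", "")
    # ... call your model or pipeline here (placeholder response below) ...
    return {"answer": f"You asked: {question}"}
```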

Code evaluators

Code evaluators are Python functions specified as module:function (e.g., evaluators:json_schema_validity). Each receives (run, example) and returns a dict with key, score, and optionally comment:

```python
def my_evaluator(run, example) -> dict:
    # run.outputs contains the target function's output
    # example.outputs contains the expected output from the dataset
    return {"key": "my_evaluator", "score": 1.0, "comment": "Passed"}
```
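For instance, a simple exact-match evaluator against the dataset's reference output could be written as follows. This is a sketch; the `answer` field name is an assumption about your schema.

```python
def exact_match(run, example) -> dict:
    """Score 1.0 when the agent's answer matches the reference (case-insensitive)."""
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    score = 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0
    return {"key": "exact_match", "score": score}
```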

Dashboard evaluators

Dashboard evaluators are LLM judges configured in the LangSmith UI. They run asynchronously after evaluation. Weco polls for their scores automatically. Specify them by name with --langsmith-dashboard-evaluators.

Metric function

A metric function combines all evaluator scores into a single number for Weco to optimize. It receives a dict of {evaluator_name: aggregated_score} and returns a float:

```python
def my_metric(scores: dict) -> float:
    return scores["accuracy"] * scores["efficiency"]
```

Summary aggregation

Per-example evaluator scores are aggregated across the dataset using --langsmith-summary (default: mean). Options: mean, median, min, max.
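For example, with the default `mean` summary, each evaluator's per-example scores collapse to one number before the metric function sees them (the numbers below are hypothetical):

```python
from statistics import mean

# Hypothetical per-example scores from two evaluators over a 3-example dataset
per_example = {
    "correctness": [1.0, 0.0, 1.0],
    "helpfulness": [5.0, 2.0, 4.0],
}

# --langsmith-summary mean reduces each list to a single aggregated score;
# this dict is what the metric function receives.
aggregated = {name: mean(scores) for name, scores in per_example.items()}
```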

CLI reference

All LangSmith-specific flags for weco run --eval-backend langsmith:

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--langsmith-dataset` | string | required | LangSmith dataset name or ID |
| `--langsmith-target` | string | required | Target function as `module:function` |
| `--langsmith-splits` | string[] | (none) | Evaluate only examples in these dataset splits |
| `--langsmith-evaluators` | string[] | (none) | Code evaluator functions as `module:function` |
| `--langsmith-dashboard-evaluators` | string[] | (none) | Names of dashboard-bound LLM judge evaluators |
| `--langsmith-metric-function` | string | (none) | Custom scoring function as `module:function` |
| `--langsmith-summary` | string | mean | Aggregation method: mean, median, min, max |
| `--langsmith-experiment-prefix` | string | (none) | Prefix for experiment names in LangSmith UI |
| `--langsmith-max-examples` | int | (none) | Evaluate only N examples (faster iteration) |
| `--langsmith-max-concurrency` | int | (none) | Number of parallel evaluation threads |
| `--langsmith-target-adapter` | string | raw | Target adapter: raw, langchain, single-input |
| `--langsmith-dashboard-evaluator-timeout` | int | 900 | Seconds to poll for dashboard evaluator scores |
