# LangSmith

Use Weco with LangSmith datasets and evaluators for offline evaluation.
Weco's LangSmith integration lets you optimize code against LangSmith datasets using both code-based evaluators and dashboard-configured LLM judges, without writing shell eval scripts.
Instead of writing an evaluation command that prints metrics to stdout, you point Weco at a LangSmith dataset and evaluators. Weco handles the rest: running your target function against each example, collecting scores, and iteratively improving your code.
## How it works

- **Target function**: A Python function that receives dataset inputs and returns outputs. LangSmith calls this for each example in your dataset.
- **Evaluators**: Scoring functions that run locally (code evaluators) and/or LLM judges configured in the LangSmith dashboard (dashboard evaluators).
- **Metric function**: A function that combines all evaluator scores into a single number for Weco to optimize.
- **Optimization loop**: Weco iteratively modifies your source file, re-runs evaluation against the dataset, and keeps the version that scores best.
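Concretely, the pieces fit together like this. The sketch below uses hypothetical names and simplified evaluator signatures (the real code-evaluator signature is shown later in this page); it is an illustration of the data flow, not the example project's actual code:

```python
# Target function: receives one dataset example's inputs, returns outputs.
def target(inputs: dict) -> dict:
    return {"answer": inputs.get("question", "").upper()}  # stand-in for a model call

# Code evaluator (simplified signature): scores one example's outputs.
def exact_match(outputs: dict, expected: dict) -> float:
    return 1.0 if outputs["answer"] == expected["answer"] else 0.0

# Metric function: collapses aggregated evaluator scores into one number.
def metric(scores: dict) -> float:
    return scores["exact_match"]

# One evaluation pass over a tiny in-memory "dataset".
dataset = [({"question": "hi"}, {"answer": "HI"}),
           ({"question": "no"}, {"answer": "yes"})]
per_example = [exact_match(target(inp), exp) for inp, exp in dataset]
score = metric({"exact_match": sum(per_example) / len(per_example)})
print(score)  # 0.5
```

Weco's job in the optimization loop is to rewrite the target's source so that this final score improves.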
## Prerequisites
- Python 3.10+
- Weco CLI installed
- A LangSmith account
- API keys for LangSmith and your LLM provider (e.g., OpenAI)
## Tutorial: Optimize an HR QA Agent
This walkthrough uses the ZephHR example, a QA agent that answers HR policy questions over fictional documentation. Weco optimizes the agent's prompts to improve answer quality, measured by a composite metric that combines correctness and helpfulness.
### Clone the example

```bash
git clone https://github.com/WecoAI/weco-cli.git
cd weco-cli/examples/langsmith-zeph-hr-qa
```

Create a virtual environment and install dependencies.

macOS/Linux:

```bash
python -m venv .venv
source .venv/bin/activate
pip install weco openai langsmith
```

Windows (Command Prompt):

```bash
python -m venv .venv
.venv\Scripts\activate
pip install weco openai langsmith
```

Windows (PowerShell):

```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install weco openai langsmith
```

### Set environment variables

```bash
export OPENAI_API_KEY="sk-..."
export LANGCHAIN_API_KEY="lsv2_..."
```

You can get your LangSmith API key from smith.langchain.com/settings.
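Before continuing, you can sanity-check that both keys are visible to your Python environment. This is an optional helper, not part of the example repo:

```python
import os

def check_env(keys=("OPENAI_API_KEY", "LANGCHAIN_API_KEY")):
    """Return the names of required environment variables that are missing."""
    return [k for k in keys if not os.environ.get(k)]

missing = check_env()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All keys set")
```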
### Understand the project structure

| File | Purpose |
|---|---|
| `agent.py` | QA agent using GPT-4o-mini. Weco optimizes the prompts in this file |
| `evaluators.py` | Code-based evaluators + the `qa_score` metric function |
| `setup_dataset.py` | Creates LangSmith datasets from JSON question sets |
| `docs.md` | ZephHR product documentation (the knowledge base) |
| `optimizer_exemplars.md` | Few-shot Q&A examples to guide the optimizer |
| `data/` | JSON files with optimization (15) and holdout (10) questions |
The target function in `agent.py` is what LangSmith calls for each dataset example:

```python
def answer_hr_question(inputs: dict) -> dict:
    """Answer an HR policy question from the ZephHR docs."""
    question = inputs.get("question", "")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(docs=DOCS, question=question)},
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
    )
    # ... parse and return {answer, confidence, relevant_sections}
```

Weco optimizes `SYSTEM_PROMPT` and `USER_TEMPLATE` in this file to improve the agent's answers.
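The parsing step is elided in the excerpt above. A defensive sketch, assuming the JSON-mode response carries the three fields named in the comment (this is not the example repo's actual parsing code):

```python
import json

def parse_agent_response(raw: str) -> dict:
    """Parse the model's JSON-mode output into the expected output dict.

    A sketch: falls back to safe defaults if a field is missing or the
    payload is not valid JSON, so the evaluators always get a full dict.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    return {
        "answer": data.get("answer", ""),
        "confidence": data.get("confidence", "low"),
        "relevant_sections": data.get("relevant_sections", []),
    }

print(parse_agent_response('{"answer": "14 days", "confidence": "high", "relevant_sections": ["PTO"]}'))
```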
The metric function in `evaluators.py` combines evaluator scores into a single optimization target:

```python
def qa_score(scores: dict) -> float:
    """Combine correctness (binary gate) with helpfulness (1-5 signal).

    correctness * (helpfulness - 1) / 4

    - Incorrect answers always score 0.
    - Correct answers are ranked by helpfulness, normalized to 0-1.
    """
    correctness = scores.get("correctness", 0.0)
    helpfulness = scores.get("helpfulness", 1.0)
    return correctness * (helpfulness - 1.0) / 4.0
```

This gated metric means the optimizer can't game helpfulness without getting the facts right: incorrect answers always score 0.
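A few spot checks show how the gate behaves (the function is repeated here so the snippet runs standalone):

```python
def qa_score(scores: dict) -> float:
    correctness = scores.get("correctness", 0.0)
    helpfulness = scores.get("helpfulness", 1.0)
    return correctness * (helpfulness - 1.0) / 4.0

print(qa_score({"correctness": 1.0, "helpfulness": 5.0}))  # 1.0  correct and maximally helpful
print(qa_score({"correctness": 1.0, "helpfulness": 3.0}))  # 0.5  correct, middling helpfulness
print(qa_score({"correctness": 0.0, "helpfulness": 5.0}))  # 0.0  helpfulness can't rescue a wrong answer
```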
The example also includes two code evaluators that run locally:

```python
def json_schema_validity(run, example) -> dict:
    """Check that the agent output contains the required fields."""
    outputs = run.outputs or {}
    checks = {
        "answer": isinstance(outputs.get("answer"), str) and len(outputs["answer"]) > 0,
        "confidence": outputs.get("confidence") in ("high", "medium", "low"),
        "relevant_sections": isinstance(outputs.get("relevant_sections"), list),
    }
    passed = all(checks.values())
    return {"key": "json_schema_validity", "score": 1.0 if passed else 0.0, "comment": "..."}

def conciseness(run, example) -> dict:
    """Score based on answer length. Penalises empty or excessively verbose answers."""
    # Returns score: 1.0 for ≤150 words, 0.5 for ≤250, 0.0 for >250 or empty
```

### Create the dataset
Run the setup script to create a dataset in LangSmith:

```bash
python setup_dataset.py
```

This creates a single dataset called `zephhr-qa` with two splits:

- `opt`: 15 optimization questions (used during the optimization loop)
- `holdout`: 10 held-out questions (used for validation after optimization)

The setup script is idempotent. Running it multiple times won't create duplicate examples.
### Configure LangSmith dashboard evaluators

Before running the optimization, set up two online evaluators in your LangSmith project. These are LLM judges that run server-side and score each agent response asynchronously.

1. Go to your LangSmith project.
2. Navigate to the evaluators section.
3. Add the correctness evaluator. This is available as a default evaluator in LangSmith (binary factual accuracy, 0 or 1).
4. Create a custom helpfulness evaluator that scores how complete and useful the answer is (1-5 scale). Make sure the Feedback Key is set to `helpfulness`, as this is the name Weco uses to match the evaluator's scores.

Dashboard evaluators run asynchronously after each evaluation. Weco automatically polls for their scores (up to 15 minutes by default). You can adjust the timeout with `--langsmith-dashboard-evaluator-timeout`.
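The polling behaviour can be pictured as a simple poll-until-complete loop. This is a generic sketch, not Weco's actual implementation; `fetch_scores` is a hypothetical stand-in for querying LangSmith feedback:

```python
import time

def poll_for_scores(fetch_scores, expected_keys, timeout=900, interval=5):
    """Poll until every expected feedback key has a score, or the timeout expires.

    fetch_scores() is a hypothetical callable returning {feedback_key: score}
    for whatever the dashboard evaluators have finished so far.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        scores = fetch_scores()
        if all(k in scores for k in expected_keys):
            return scores
        time.sleep(interval)
    raise TimeoutError(f"dashboard evaluators incomplete after {timeout}s")

# Simulated: the 'helpfulness' judge finishes on the second poll.
calls = iter([{"correctness": 1.0}, {"correctness": 1.0, "helpfulness": 4.0}])
print(poll_for_scores(lambda: next(calls), ["correctness", "helpfulness"], timeout=60, interval=0))
```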
### Run the optimization

Run the optimization with all parameters specified on the command line:

```bash
weco run --source agent.py \
  --eval-backend langsmith \
  --langsmith-dataset zephhr-qa \
  --langsmith-splits opt \
  --langsmith-target agent:answer_hr_question \
  --langsmith-evaluators evaluators:json_schema_validity evaluators:conciseness \
  --langsmith-dashboard-evaluators helpfulness correctness \
  --langsmith-metric-function evaluators:qa_score \
  --additional-instructions optimizer_exemplars.md \
  --metric qa_score --goal maximize --steps 10
```

Here's what each flag does:
| Flag | Purpose |
|---|---|
| `--source agent.py` | The file Weco will optimize |
| `--eval-backend langsmith` | Use LangSmith instead of a shell eval command |
| `--langsmith-dataset` | LangSmith dataset to evaluate against |
| `--langsmith-splits` | Evaluate only examples in these dataset splits |
| `--langsmith-target` | Target function as `module:function` |
| `--langsmith-evaluators` | Code-based evaluator functions as `module:function` |
| `--langsmith-dashboard-evaluators` | Names of LLM judges configured in LangSmith |
| `--langsmith-metric-function` | Function that combines scores into a single metric |
| `--additional-instructions` | File with hints/exemplars to guide the optimizer |
| `--metric` | Name of the metric to optimize |
| `--goal maximize` | Direction of optimization |
| `--steps 10` | Number of optimization iterations |
If you prefer a visual setup, run Weco with just the eval backend flag:

```bash
weco run --eval-backend langsmith
```

A browser-based setup wizard opens automatically where you can configure everything visually:

- Source file(s) to optimize
- LangSmith dataset name
- Target function (`module:function`)
- Code evaluators and dashboard evaluators
- Metric name and metric function
- Run parameters (steps, model, instructions, timeout)

Once you submit the configuration, the optimization starts automatically.

The wizard launches automatically when the required parameters (`--langsmith-dataset` and `--langsmith-target`) are not provided. You can also partially specify flags on the command line; the wizard will pre-fill those values and ask for the rest.
### Monitor the optimization
Track progress in the Weco dashboard. Each iteration shows the metric score and the code changes Weco made. When the run completes, you'll be prompted to apply the best-performing version to your source file.
### Validate on the holdout set

After optimization, verify that the improvements generalize to unseen questions by running a single evaluation against the holdout dataset:

```bash
weco run --source agent.py \
  --eval-backend langsmith \
  --langsmith-dataset zephhr-qa \
  --langsmith-splits holdout \
  --langsmith-target agent:answer_hr_question \
  --langsmith-evaluators evaluators:json_schema_validity evaluators:conciseness \
  --langsmith-dashboard-evaluators helpfulness correctness \
  --langsmith-metric-function evaluators:qa_score \
  --metric qa_score --goal maximize --steps 1
```

Or use the setup wizard:

```bash
weco run --eval-backend langsmith
```

In the wizard, select the `zephhr-qa` dataset, choose the `holdout` split, and set steps to 1. This runs a single evaluation pass without optimization, giving you the holdout score for the optimized agent.
## Key concepts

### Target function

Your target function is specified as `module:function` (e.g., `agent:answer_hr_question`). It receives an `inputs` dict from the dataset and returns a dict of outputs. LangSmith calls this function once per dataset example.
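A minimal target function can be very small. The sketch below is a hypothetical toy (the input and output keys depend entirely on your dataset schema); a real target would call a model where the keyword rule is:

```python
def classify_sentiment(inputs: dict) -> dict:
    """Toy target: receives one dataset example's inputs, returns an outputs dict.

    Uses a keyword rule in place of a model call so the sketch stays
    self-contained and runnable.
    """
    text = inputs.get("text", "").lower()
    label = "positive" if any(w in text for w in ("great", "love", "good")) else "negative"
    return {"label": label}

print(classify_sentiment({"text": "I love this HR portal"}))  # {'label': 'positive'}
```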
### Code evaluators

Code evaluators are Python functions specified as `module:function` (e.g., `evaluators:json_schema_validity`). Each receives `(run, example)` and returns a dict with `key`, `score`, and optionally `comment`:

```python
def my_evaluator(run, example) -> dict:
    # run.outputs contains the target function's output
    # example.outputs contains the expected output from the dataset
    return {"key": "my_evaluator", "score": 1.0, "comment": "Passed"}
```

### Dashboard evaluators

Dashboard evaluators are LLM judges configured in the LangSmith UI. They run asynchronously after evaluation, and Weco polls for their scores automatically. Specify them by name with `--langsmith-dashboard-evaluators`.
### Metric function

A metric function combines all evaluator scores into a single number for Weco to optimize. It receives a dict of `{evaluator_name: aggregated_score}` and returns a float:

```python
def my_metric(scores: dict) -> float:
    return scores["accuracy"] * scores["efficiency"]
```

### Summary aggregation

Per-example evaluator scores are aggregated across the dataset using `--langsmith-summary` (default: `mean`). Options: `mean`, `median`, `min`, `max`.
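End to end, the scoring pipeline looks roughly like this. The per-example scores are made up, and `qa_score` is the tutorial's metric function (without its `.get` defaults) under the default `mean` aggregation:

```python
from statistics import mean

# Per-example scores from two evaluators across a 4-example dataset.
per_example = {
    "correctness": [1.0, 1.0, 0.0, 1.0],
    "helpfulness": [5.0, 3.0, 4.0, 4.0],
}

# 1. Aggregate each evaluator's scores across the dataset (default: mean).
aggregated = {name: mean(vals) for name, vals in per_example.items()}
print(aggregated)  # {'correctness': 0.75, 'helpfulness': 4.0}

# 2. The metric function collapses the aggregated scores into one number.
def qa_score(scores: dict) -> float:
    return scores["correctness"] * (scores["helpfulness"] - 1.0) / 4.0

print(qa_score(aggregated))  # 0.5625
```

That single number is what Weco maximizes (or minimizes) across optimization steps.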
## CLI reference

All LangSmith-specific flags for `weco run --eval-backend langsmith`:

| Flag | Type | Default | Description |
|---|---|---|---|
| `--langsmith-dataset` | string | required | LangSmith dataset name or ID |
| `--langsmith-target` | string | required | Target function as `module:function` |
| `--langsmith-splits` | string[] | — | Evaluate only examples in these dataset splits |
| `--langsmith-evaluators` | string[] | — | Code evaluator functions as `module:function` |
| `--langsmith-dashboard-evaluators` | string[] | — | Names of dashboard-bound LLM judge evaluators |
| `--langsmith-metric-function` | string | — | Custom scoring function as `module:function` |
| `--langsmith-summary` | string | `mean` | Aggregation method: `mean`, `median`, `min`, `max` |
| `--langsmith-experiment-prefix` | string | — | Prefix for experiment names in the LangSmith UI |
| `--langsmith-max-examples` | int | — | Evaluate only N examples (faster iteration) |
| `--langsmith-max-concurrency` | int | — | Number of parallel evaluation threads |
| `--langsmith-target-adapter` | string | `raw` | Target adapter: `raw`, `langchain`, `single-input` |
| `--langsmith-dashboard-evaluator-timeout` | int | 900 | Seconds to poll for dashboard evaluator scores |