LangFuse
Use Weco with LangFuse datasets and evaluators for offline evaluation
Weco's LangFuse integration lets you optimize code against LangFuse datasets using both local code evaluators and managed LLM-as-a-Judge evaluators, without writing shell eval scripts.
Instead of writing an evaluation command that prints metrics to stdout, you point Weco at a LangFuse dataset and evaluators. Weco handles the rest: running your target function against each example, collecting scores, and iteratively improving your code.
How it works
- Target function: A Python function that receives dataset inputs and returns outputs. LangFuse calls this for each item in your dataset.
- Evaluators: Scoring functions that run locally (code evaluators) and/or LLM-as-a-Judge evaluators configured in the LangFuse UI (managed evaluators).
- Metric function: A function that combines all evaluator scores into a single number for Weco to optimize.
- Optimization loop: Weco iteratively modifies your source file, re-runs evaluation against the dataset, and keeps the version that scores best.
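In code, the three pieces you provide look roughly like this. The names here (`answer`, `exact_match`, `metric`) are invented for illustration, not taken from the tutorial project:

```python
def answer(inputs: dict) -> dict:
    """Target function: called once per dataset item with that item's inputs."""
    return {"answer": inputs.get("question", "").strip().upper()}

def exact_match(*, input, output, expected_output=None, **kwargs):
    """Code evaluator: scores a single item's output against the expectation."""
    expected = (expected_output or {}).get("answer")
    return {"name": "exact_match", "value": 1.0 if output.get("answer") == expected else 0.0}

def metric(scores: dict) -> float:
    """Metric function: collapses aggregated evaluator scores into one number."""
    return scores.get("exact_match", 0.0)
```

Weco repeatedly edits the source file containing the target function, re-runs this pipeline over the dataset, and keeps whichever variant maximizes (or minimizes) the metric.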
Prerequisites
- Python 3.10+
- Weco CLI installed
- A LangFuse account
- API keys for LangFuse and your LLM provider (e.g., OpenAI)
Tutorial: Optimize an HR QA Agent
This walkthrough uses the ZephHR example, a QA agent that answers HR policy questions over fictional documentation. Weco optimizes the agent's prompts to improve answer quality, measured by a composite metric that combines correctness and helpfulness.
Clone the example
```bash
git clone https://github.com/WecoAI/weco-cli.git
cd weco-cli/examples/langfuse-zeph-hr-qa
```

Install the Weco CLI and set up a virtual environment for the project dependencies:

```bash
pipx install 'weco[langfuse]'
```

macOS/Linux:

```bash
python -m venv .venv
source .venv/bin/activate
pip install openai langfuse
```

Windows (Command Prompt):

```bash
python -m venv .venv
.venv\Scripts\activate
pip install openai langfuse
```

Windows (PowerShell):

```bash
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install openai langfuse
```

Set environment variables

```bash
export OPENAI_API_KEY="sk-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_BASE_URL="https://cloud.langfuse.com"  # or https://us.cloud.langfuse.com
```

You can get your LangFuse API keys from your project settings at cloud.langfuse.com.
Understand the project structure
| File | Purpose |
|---|---|
| `agent.py` | QA agent using GPT-4o-mini. Weco optimizes the prompts in this file |
| `evaluators.py` | Code-based evaluators + the `qa_score` metric function |
| `setup_dataset.py` | Creates LangFuse datasets from JSON question sets |
| `docs.md` | ZephHR product documentation (the knowledge base) |
| `optimizer_exemplars.md` | Few-shot Q&A examples to guide the optimizer |
| `data/` | JSON files with optimization (15) and holdout (10) questions |
The target function in agent.py is what LangFuse calls for each dataset item:
```python
def answer_hr_question(inputs: dict) -> dict:
    """Answer an HR policy question from the ZephHR docs."""
    question = inputs.get("question", "")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(docs=DOCS, question=question)},
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
    )
    # ... parse and return {answer, confidence, relevant_sections}
```

Weco optimizes `SYSTEM_PROMPT` and `USER_TEMPLATE` in this file to improve the agent's answers.
The metric function in evaluators.py combines evaluator scores into a single optimization target:
```python
def qa_score(scores: dict) -> float:
    """Combine correctness (binary gate) with helpfulness (0-1 signal).

    correctness * helpfulness

    - Incorrect answers always score 0.
    - Correct answers are ranked by helpfulness.
    """
    correctness = scores.get("Correctness", 0.0)
    helpfulness = scores.get("Helpfulness", 0.0)
    return correctness * helpfulness
```

This gated metric means the optimizer can't game helpfulness without getting the facts right: an incorrect answer scores 0 no matter how helpful it sounds.
The example also includes two code evaluators that run locally. LangFuse evaluators receive keyword arguments and return an Evaluation object:
```python
from langfuse import Evaluation

def json_schema_validity(*, input, output, expected_output=None, **kwargs):
    """Check that the agent output contains the required fields."""
    outputs = output or {}
    checks = {
        "answer": isinstance(outputs.get("answer"), str) and len(outputs["answer"]) > 0,
        "confidence": outputs.get("confidence") in ("high", "medium", "low"),
        "relevant_sections": isinstance(outputs.get("relevant_sections"), list),
    }
    passed = all(checks.values())
    failed_fields = [k for k, v in checks.items() if not v]
    comment = "All fields valid" if passed else f"Invalid fields: {', '.join(failed_fields)}"
    return Evaluation(name="json_schema_validity", value=1.0 if passed else 0.0, comment=comment)

def conciseness(*, input, output, expected_output=None, **kwargs):
    """Score based on answer length. Penalises empty or excessively verbose answers."""
    # Returns score: 1.0 for ≤150 words, 0.5 for ≤250, 0.0 for >250 or empty
```

Create the datasets
Run the setup script to create datasets in LangFuse:
```bash
python setup_dataset.py
```

Since LangFuse does not have native dataset splits, this creates two separate datasets:
- zephhr-qa-opt: 15 optimization questions (used during the optimization loop)
- zephhr-qa-holdout: 10 held-out questions (used for validation after optimization)
The setup script is idempotent. Running it multiple times won't create duplicate items.
Configure LangFuse managed evaluators
Before running the optimization, set up two managed evaluators (LLM-as-a-Judge) in your LangFuse project. These run server-side and score each agent response automatically.
- Go to your project in LangFuse
- Navigate to Evaluation → Evaluators
- Click + New Evaluator for each evaluator below
Correctness evaluator
- Name: `Correctness`
- Score: 0 or 1 (binary factual accuracy)
- Variable mappings (JSONPath expressions pointing to trace fields):
  - `{{input}}` → `$.input.question` (the user's question)
  - `{{output}}` → `$.output.answer` (the agent's answer)
  - `{{expected_output}}` → `$.expected_output.expected_answer` (the ground truth)
Helpfulness evaluator
- Name: `Helpfulness`
- Score: 0–1 continuous scale
- Variable mappings:
  - `{{input}}` → `$.input.question`
  - `{{output}}` → `$.output.answer`
Use the live preview in the evaluator setup to verify the mappings are picking up the right data from your traces. The preview shows historical traces matching your filter criteria, populated with the mapped variables.
Evaluator names are case-sensitive. The name you set in LangFuse (e.g., Correctness) must match exactly what you pass to --langfuse-managed-evaluators and what you use as keys in your metric function (e.g., scores.get("Correctness")).
Managed evaluators run asynchronously after each experiment. Weco automatically polls for their scores (up to 15 minutes by default). You can adjust the timeout with --langfuse-managed-evaluator-timeout.
Run the optimization
Run the optimization with all parameters specified on the command line:
```bash
weco run --source agent.py \
  --eval-backend langfuse \
  --langfuse-dataset zephhr-qa-opt \
  --langfuse-target agent:answer_hr_question \
  --langfuse-evaluators evaluators:json_schema_validity evaluators:conciseness \
  --langfuse-managed-evaluators Correctness Helpfulness \
  --langfuse-metric-function evaluators:qa_score \
  --additional-instructions optimizer_exemplars.md \
  --metric qa_score --goal maximize --steps 10
```

Here's what each flag does:
| Flag | Purpose |
|---|---|
| `--source agent.py` | The file Weco will optimize |
| `--eval-backend langfuse` | Use LangFuse instead of a shell eval command |
| `--langfuse-dataset` | LangFuse dataset to evaluate against |
| `--langfuse-target` | Target function as `module:function` |
| `--langfuse-evaluators` | Code-based evaluator functions as `module:function` |
| `--langfuse-managed-evaluators` | Names of LLM-as-a-Judge evaluators configured in LangFuse |
| `--langfuse-metric-function` | Function that combines scores into a single metric |
| `--additional-instructions` | File with hints/exemplars to guide the optimizer |
| `--metric` | Name of the metric to optimize |
| `--goal maximize` | Direction of optimization |
| `--steps 10` | Number of optimization iterations |
If you prefer a visual setup, run Weco with just the eval backend flag:
```bash
weco run --eval-backend langfuse
```

A browser-based setup wizard opens automatically where you can configure everything visually:
- Source file(s) to optimize
- LangFuse dataset name
- Target function (`module:function`)
- Code evaluators and managed evaluators
- Metric name and metric function
- Run parameters (steps, model, instructions, timeout)
Once you submit the configuration, the optimization starts automatically.
The wizard launches automatically when required parameters (--langfuse-dataset and --langfuse-target) are not provided. You can also partially specify flags on the command line and the wizard will pre-fill those values and ask for the rest.
Monitor the optimization
Track progress in the Weco dashboard. Each iteration shows the metric score and the code changes Weco made. When the run completes, you'll be prompted to apply the best-performing version to your source file.
Each optimization step also creates an experiment in LangFuse, so you can compare all variants in the LangFuse dashboard under Datasets → your dataset → Runs.
Validate on the holdout set
After optimization, verify that the improvements generalize to unseen questions by running a single evaluation against the holdout dataset:
```bash
weco run --source agent.py \
  --eval-backend langfuse \
  --langfuse-dataset zephhr-qa-holdout \
  --langfuse-target agent:answer_hr_question \
  --langfuse-evaluators evaluators:json_schema_validity evaluators:conciseness \
  --langfuse-managed-evaluators Correctness Helpfulness \
  --langfuse-metric-function evaluators:qa_score \
  --metric qa_score --goal maximize --steps 1
```

Or use the wizard instead:

```bash
weco run --eval-backend langfuse
```

In the wizard, select the `zephhr-qa-holdout` dataset and set steps to 1. This runs a single evaluation pass without optimization, giving you the holdout score for the optimized agent.
Key concepts
Target function
Your target function is specified as module:function (e.g., agent:answer_hr_question). It receives an inputs dict from the dataset and returns a dict of outputs. LangFuse calls this function once per dataset item.
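As a shape reference, here is a minimal, self-contained target function. It answers from a hard-coded FAQ rather than an LLM, so the name `lookup_answer` and the FAQ contents are purely illustrative:

```python
# Toy knowledge base standing in for a real retrieval or LLM call.
FAQ = {"What is the PTO policy?": "Employees accrue 1.5 days of PTO per month."}

def lookup_answer(inputs: dict) -> dict:
    """Target function: receives one dataset item's inputs, returns an outputs dict."""
    question = inputs.get("question", "")
    known = question in FAQ
    return {
        "answer": FAQ.get(question, "I don't know."),
        "confidence": "high" if known else "low",
    }
```

Whatever dict this returns becomes the `output` argument passed to your code evaluators and recorded on the LangFuse trace.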
Code evaluators
Code evaluators are Python functions specified as module:function (e.g., evaluators:json_schema_validity). Each receives keyword arguments and returns a LangFuse Evaluation object:
```python
from langfuse import Evaluation

def my_evaluator(*, input, output, expected_output=None, **kwargs):
    # input: the dataset item's input dict
    # output: the target function's return value
    # expected_output: the dataset item's expected output (if any)
    return Evaluation(name="my_evaluator", value=1.0, comment="Passed")
```

Evaluators can also return a plain dict with `name`/`key` and `value`/`score` keys; the bridge normalizes it to an `Evaluation` object automatically.
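For instance, this evaluator returns a plain dict and relies on that normalization. The function name and the keyword it checks for are illustrative:

```python
def keyword_present(*, input, output, expected_output=None, **kwargs):
    """Plain-dict evaluator: 1.0 if the answer mentions 'PTO', else 0.0."""
    answer = str((output or {}).get("answer", "")).lower()
    return {"name": "keyword_present", "value": 1.0 if "pto" in answer else 0.0}
```

Returning a dict avoids importing `langfuse` in evaluator modules that have no other reason to depend on it.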
Managed evaluators
Managed evaluators are LLM-as-a-Judge evaluators configured in the LangFuse UI under Evaluation → Evaluators. They run server-side automatically on experiment traces. Weco polls for their scores after each experiment run. Specify them by name with --langfuse-managed-evaluators.
Evaluator names are case-sensitive. Use the exact name as shown in the LangFuse UI (e.g., Correctness, not correctness).
When configuring managed evaluators, make sure the variable mappings point to the correct fields in your traces. For example, if your target function returns `{"answer": "...", "confidence": "..."}`, map the evaluator's output variable to `$.output.answer` in the evaluator template.
Metric function
A metric function combines all evaluator scores into a single number for Weco to optimize. It receives a dict of {evaluator_name: aggregated_score} and returns a float:
```python
def my_metric(scores: dict) -> float:
    return scores["Correctness"] * scores["Helpfulness"]
```

Summary aggregation
Per-item evaluator scores are aggregated across the dataset using `--langfuse-summary` (default: `mean`). Options: `mean`, `median`, `min`, `max`.
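Conceptually, the aggregation reduces each evaluator's per-item scores to a single number before your metric function ever sees them. The sketch below shows the semantics of the four options, not Weco's internals:

```python
from statistics import mean, median

def aggregate(per_item_scores: list[float], method: str = "mean") -> float:
    """Reduce one evaluator's per-item scores to a single summary value."""
    ops = {"mean": mean, "median": median, "min": min, "max": max}
    return ops[method](per_item_scores)

# e.g. json_schema_validity scores across a four-item dataset
scores = [1.0, 0.5, 0.0, 1.0]
```

`min` is the strictest choice (one bad item tanks the metric), while `mean` rewards average improvement across the dataset.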
CLI reference
All LangFuse-specific flags for weco run --eval-backend langfuse:
| Flag | Type | Default | Description |
|---|---|---|---|
| `--langfuse-dataset` | string | required | LangFuse dataset name |
| `--langfuse-target` | string | required | Target function as `module:function` |
| `--langfuse-evaluators` | string[] | — | Code evaluator functions as `module:function` |
| `--langfuse-managed-evaluators` | string[] | — | Names of managed LLM-as-a-Judge evaluators |
| `--langfuse-metric-function` | string | — | Custom scoring function as `module:function` |
| `--langfuse-summary` | string | `mean` | Aggregation method: `mean`, `median`, `min`, `max` |
| `--langfuse-experiment-name` | string | — | Experiment name in the LangFuse UI |
| `--langfuse-max-concurrency` | int | — | Number of parallel evaluation threads |
| `--langfuse-managed-evaluator-timeout` | int | `900` | Seconds to poll for managed evaluator scores |