
LangFuse

Use Weco with LangFuse datasets and evaluators for offline evaluation

Weco's LangFuse integration lets you optimize code against LangFuse datasets using both local code evaluators and managed LLM-as-a-Judge evaluators, without writing shell eval scripts.

Instead of writing an evaluation command that prints metrics to stdout, you point Weco at a LangFuse dataset and evaluators. Weco handles the rest: running your target function against each example, collecting scores, and iteratively improving your code.

How it works

  1. Target function: A Python function that receives dataset inputs and returns outputs. LangFuse calls this for each item in your dataset.
  2. Evaluators: Scoring functions that run locally (code evaluators) and/or LLM-as-a-Judge evaluators configured in the LangFuse UI (managed evaluators).
  3. Metric function: A function that combines all evaluator scores into a single number for Weco to optimize.
  4. Optimization loop: Weco iteratively modifies your source file, re-runs evaluation against the dataset, and keeps the version that scores best.

Prerequisites

Tutorial: Optimize an HR QA Agent

This walkthrough uses the ZephHR example, a QA agent that answers HR policy questions over fictional documentation. Weco optimizes the agent's prompts to improve answer quality, measured by a composite metric that combines correctness and helpfulness.

Clone the example

git clone https://github.com/WecoAI/weco-cli.git
cd weco-cli/examples/langfuse-zeph-hr-qa

Install the Weco CLI and set up a virtual environment for the project dependencies:

pipx install 'weco[langfuse]'

# macOS / Linux
python -m venv .venv
source .venv/bin/activate
pip install openai langfuse

# Windows (Command Prompt)
python -m venv .venv
.venv\Scripts\activate
pip install openai langfuse

# Windows (PowerShell)
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install openai langfuse

Set environment variables

export OPENAI_API_KEY="sk-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_BASE_URL="https://cloud.langfuse.com"  # or https://us.cloud.langfuse.com

You can get your LangFuse API keys from your project settings at cloud.langfuse.com.
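Before running anything, it can help to fail fast if one of these is missing. A minimal stdlib-only check (the variable names match the exports above; the helper itself is not part of the example project):

```python
import os

REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_BASE_URL",
]

def missing_vars(env=None) -> list:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Call `missing_vars()` at the top of a script and abort with a clear message if the list is non-empty.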

Understand the project structure

| File | Purpose |
| --- | --- |
| agent.py | QA agent using GPT-4o-mini. Weco optimizes the prompts in this file |
| evaluators.py | Code-based evaluators + the qa_score metric function |
| setup_dataset.py | Creates LangFuse datasets from JSON question sets |
| docs.md | ZephHR product documentation (the knowledge base) |
| optimizer_exemplars.md | Few-shot Q&A examples to guide the optimizer |
| data/ | JSON files with optimization (15) and holdout (10) questions |

The target function in agent.py is what LangFuse calls for each dataset item:

def answer_hr_question(inputs: dict) -> dict:
    """Answer an HR policy question from the ZephHR docs."""
    question = inputs.get("question", "")

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(docs=DOCS, question=question)},
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
    )

    # ... parse and return {answer, confidence, relevant_sections}

Weco optimizes SYSTEM_PROMPT and USER_TEMPLATE in this file to improve the agent's answers.

The metric function in evaluators.py combines evaluator scores into a single optimization target:

def qa_score(scores: dict) -> float:
    """Combine correctness (binary gate) with helpfulness (0-1 signal).

    correctness * helpfulness

    - Incorrect answers always score 0.
    - Correct answers are ranked by helpfulness.
    """
    correctness = scores.get("Correctness", 0.0)
    helpfulness = scores.get("Helpfulness", 0.0)
    return correctness * helpfulness

This gated metric means the optimizer can't game helpfulness without getting facts right. Incorrect answers always score 0.
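A couple of worked values make the gating concrete, using the qa_score definition above:

```python
def qa_score(scores: dict) -> float:
    # Same gated product as in evaluators.py: correctness * helpfulness.
    return scores.get("Correctness", 0.0) * scores.get("Helpfulness", 0.0)

# A correct, fairly helpful answer keeps its helpfulness signal:
#   qa_score({"Correctness": 1.0, "Helpfulness": 0.8})  -> 0.8
# An incorrect answer is zeroed out no matter how helpful it sounds:
#   qa_score({"Correctness": 0.0, "Helpfulness": 0.95}) -> 0.0
```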

The example also includes two code evaluators that run locally. LangFuse evaluators receive keyword arguments and return an Evaluation object:

from langfuse import Evaluation

def json_schema_validity(*, input, output, expected_output=None, **kwargs):
    """Check that the agent output contains the required fields."""
    outputs = output or {}
    checks = {
        "answer": isinstance(outputs.get("answer"), str) and len(outputs["answer"]) > 0,
        "confidence": outputs.get("confidence") in ("high", "medium", "low"),
        "relevant_sections": isinstance(outputs.get("relevant_sections"), list),
    }
    passed = all(checks.values())
    failed_fields = [k for k, v in checks.items() if not v]
    comment = "All fields valid" if passed else f"Invalid fields: {', '.join(failed_fields)}"
    return Evaluation(name="json_schema_validity", value=1.0 if passed else 0.0, comment=comment)

def conciseness(*, input, output, expected_output=None, **kwargs):
    """Score based on answer length. Penalises empty or excessively verbose answers."""
    # Returns score: 1.0 for ≤150 words, 0.5 for ≤250, 0.0 for >250 or empty
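The elided body follows directly from that comment; a standalone sketch of the length rule (the real evaluator in evaluators.py wraps the score in an Evaluation object):

```python
def conciseness_score(answer: str) -> float:
    """1.0 for answers of 150 words or fewer, 0.5 up to 250, else 0.0."""
    words = len(answer.split())
    if words == 0:
        return 0.0  # empty answers are penalised
    if words <= 150:
        return 1.0
    if words <= 250:
        return 0.5
    return 0.0
```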

Create the datasets

Run the setup script to create datasets in LangFuse:

python setup_dataset.py

Since LangFuse does not have native dataset splits, this creates two separate datasets:

  • zephhr-qa-opt: 15 optimization questions (used during the optimization loop)
  • zephhr-qa-holdout: 10 held-out questions (used for validation after optimization)

The setup script is idempotent. Running it multiple times won't create duplicate items.
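Idempotency here boils down to skipping items that already exist. A sketch of that pattern, with hypothetical item names (see setup_dataset.py for the real logic):

```python
def items_to_create(existing_names: set, candidates: list) -> list:
    """Return only the candidate items not already present in the dataset."""
    return [item for item in candidates if item["name"] not in existing_names]

# With existing_names = {"q1", "q2"}, only items named "q3" onwards get created,
# so re-running the script adds nothing the second time.
```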

Configure LangFuse managed evaluators

Before running the optimization, set up two managed evaluators (LLM-as-a-Judge) in your LangFuse project. These run server-side and score each agent response automatically.

  1. Go to your project in LangFuse
  2. Navigate to Evaluation → Evaluators
  3. Click + New Evaluator for each evaluator below

Correctness evaluator

  • Name: Correctness
  • Score: 0 or 1 (binary factual accuracy)
  • Variable mappings (JSONPath expressions pointing to trace fields):
    • {{input}} → $.input.question (the user's question)
    • {{output}} → $.output.answer (the agent's answer)
    • {{expected_output}} → $.expected_output.expected_answer (the ground truth)

Helpfulness evaluator

  • Name: Helpfulness
  • Score: 0–1 continuous scale
  • Variable mappings:
    • {{input}} → $.input.question
    • {{output}} → $.output.answer

Use the live preview in the evaluator setup to verify the mappings are picking up the right data from your traces. The preview shows historical traces matching your filter criteria, populated with the mapped variables.

Evaluator names are case-sensitive. The name you set in LangFuse (e.g., Correctness) must match exactly what you pass to --langfuse-managed-evaluators and what you use as keys in your metric function (e.g., scores.get("Correctness")).

Managed evaluators run asynchronously after each experiment. Weco automatically polls for their scores (up to 15 minutes by default). You can adjust the timeout with --langfuse-managed-evaluator-timeout.
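Conceptually, that polling loop looks like the sketch below. This is a generic pattern, not Weco's actual implementation; the injectable clock and sleep parameters exist only to make the sketch testable:

```python
import time

def poll_until(fetch, is_complete, timeout=900, interval=5,
               clock=time.monotonic, sleep=time.sleep):
    """Call fetch() every `interval` seconds until is_complete(result) or timeout."""
    deadline = clock() + timeout
    while True:
        result = fetch()
        if is_complete(result):
            return result
        if clock() >= deadline:
            raise TimeoutError("managed evaluator scores did not arrive in time")
        sleep(interval)
```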

Run the optimization

Run the optimization with all parameters specified on the command line:

weco run --source agent.py \
  --eval-backend langfuse \
  --langfuse-dataset zephhr-qa-opt \
  --langfuse-target agent:answer_hr_question \
  --langfuse-evaluators evaluators:json_schema_validity evaluators:conciseness \
  --langfuse-managed-evaluators Correctness Helpfulness \
  --langfuse-metric-function evaluators:qa_score \
  --additional-instructions optimizer_exemplars.md \
  --metric qa_score --goal maximize --steps 10

Here's what each flag does:

| Flag | Purpose |
| --- | --- |
| --source agent.py | The file Weco will optimize |
| --eval-backend langfuse | Use LangFuse instead of a shell eval command |
| --langfuse-dataset | LangFuse dataset to evaluate against |
| --langfuse-target | Target function as module:function |
| --langfuse-evaluators | Code-based evaluator functions as module:function |
| --langfuse-managed-evaluators | Names of LLM-as-a-Judge evaluators configured in LangFuse |
| --langfuse-metric-function | Function that combines scores into a single metric |
| --additional-instructions | File with hints/exemplars to guide the optimizer |
| --metric | Name of the metric to optimize |
| --goal maximize | Direction of optimization |
| --steps 10 | Number of optimization iterations |

If you prefer a visual setup, run Weco with just the eval backend flag:

weco run --eval-backend langfuse

A browser-based setup wizard opens automatically where you can configure everything visually:

  • Source file(s) to optimize
  • LangFuse dataset name
  • Target function (module:function)
  • Code evaluators and managed evaluators
  • Metric name and metric function
  • Run parameters (steps, model, instructions, timeout)

Once you submit the configuration, the optimization starts automatically.

The wizard launches automatically when required parameters (--langfuse-dataset and --langfuse-target) are not provided. You can also partially specify flags on the command line and the wizard will pre-fill those values and ask for the rest.

Monitor the optimization

Track progress in the Weco dashboard. Each iteration shows the metric score and the code changes Weco made. When the run completes, you'll be prompted to apply the best-performing version to your source file.

Each optimization step also creates an experiment in LangFuse, so you can compare all variants in the LangFuse dashboard under Datasets → your dataset → Runs.

Validate on the holdout set

After optimization, verify that the improvements generalize to unseen questions by running a single evaluation against the holdout dataset:

weco run --source agent.py \
  --eval-backend langfuse \
  --langfuse-dataset zephhr-qa-holdout \
  --langfuse-target agent:answer_hr_question \
  --langfuse-evaluators evaluators:json_schema_validity evaluators:conciseness \
  --langfuse-managed-evaluators Correctness Helpfulness \
  --langfuse-metric-function evaluators:qa_score \
  --metric qa_score --goal maximize --steps 1
Alternatively, launch the setup wizard:

weco run --eval-backend langfuse

In the wizard, select the zephhr-qa-holdout dataset and set steps to 1. This runs a single evaluation pass without optimization, giving you the holdout score for the optimized agent.

Key concepts

Target function

Your target function is specified as module:function (e.g., agent:answer_hr_question). It receives an inputs dict from the dataset and returns a dict of outputs. LangFuse calls this function once per dataset item.
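A minimal stub that satisfies this contract (hypothetical, not part of the example; useful for verifying dataset wiring before plugging in a real model call):

```python
def echo_target(inputs: dict) -> dict:
    """Trivial target function: echoes the question back as the answer."""
    return {
        "answer": inputs.get("question", ""),
        "confidence": "low",
        "relevant_sections": [],
    }
```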

Code evaluators

Code evaluators are Python functions specified as module:function (e.g., evaluators:json_schema_validity). Each receives keyword arguments and returns a LangFuse Evaluation object:

from langfuse import Evaluation

def my_evaluator(*, input, output, expected_output=None, **kwargs):
    # input: the dataset item's input dict
    # output: the target function's return value
    # expected_output: the dataset item's expected output (if any)
    return Evaluation(name="my_evaluator", value=1.0, comment="Passed")

Evaluators can also return a plain dict with name/key and value/score keys — the bridge will normalize it to an Evaluation object automatically.
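A rough sketch of that normalization (illustrative only, not the bridge's actual code):

```python
def normalize_result(result: dict) -> dict:
    """Map a plain dict with name/key and value/score onto Evaluation-style fields."""
    return {
        "name": result.get("name") or result.get("key"),
        "value": result.get("value", result.get("score")),
        "comment": result.get("comment", ""),
    }
```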

Managed evaluators

Managed evaluators are LLM-as-a-Judge evaluators configured in the LangFuse UI under Evaluation → Evaluators. They run server-side automatically on experiment traces. Weco polls for their scores after each experiment run. Specify them by name with --langfuse-managed-evaluators.

Evaluator names are case-sensitive. Use the exact name as shown in the LangFuse UI (e.g., Correctness, not correctness).

When configuring managed evaluators, make sure the variable mappings point to the correct fields in your traces. For example, if your target function returns {"answer": "...", "confidence": "..."}, map {{generation}} to output.answer in the evaluator template.

Metric function

A metric function combines all evaluator scores into a single number for Weco to optimize. It receives a dict of {evaluator_name: aggregated_score} and returns a float:

def my_metric(scores: dict) -> float:
    return scores["Correctness"] * scores["Helpfulness"]
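Note that direct indexing as above raises a KeyError if a score is missing; the example project uses .get with a 0.0 default instead. The combination rule itself is yours to choose, e.g. a weighted blend rather than a hard gate (illustrative weights, not from the example):

```python
def weighted_metric(scores: dict) -> float:
    """Soft trade-off between correctness and helpfulness, instead of a hard gate."""
    correctness = scores.get("Correctness", 0.0)
    helpfulness = scores.get("Helpfulness", 0.0)
    return 0.7 * correctness + 0.3 * helpfulness
```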

Summary aggregation

Per-item evaluator scores are aggregated across the dataset using --langfuse-summary (default: mean). Options: mean, median, min, max.
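The four options correspond to standard aggregations; a stdlib sketch of what each does to a list of per-item scores (illustrating the behavior of --langfuse-summary, not Weco's internals):

```python
import statistics

AGGREGATORS = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "min": min,
    "max": max,
}

def summarize(scores: list, method: str = "mean") -> float:
    """Collapse per-item evaluator scores into a single number."""
    return float(AGGREGATORS[method](scores))
```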

CLI reference

All LangFuse-specific flags for weco run --eval-backend langfuse:

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --langfuse-dataset | string | required | LangFuse dataset name |
| --langfuse-target | string | required | Target function as module:function |
| --langfuse-evaluators | string[] | | Code evaluator functions as module:function |
| --langfuse-managed-evaluators | string[] | | Names of managed LLM-as-a-Judge evaluators |
| --langfuse-metric-function | string | | Custom scoring function as module:function |
| --langfuse-summary | string | mean | Aggregation method: mean, median, min, max |
| --langfuse-experiment-name | string | | Experiment name in the LangFuse UI |
| --langfuse-max-concurrency | int | | Number of parallel evaluation threads |
| --langfuse-managed-evaluator-timeout | int | 900 | Seconds to poll for managed evaluator scores |
