StackEval
Evaluate multiple AI stacks side by side with consistent metrics like latency, accuracy, and throughput.
What will you benchmark?
Built for every task type
Select a task type to see how StackEval configures the evaluation for that use case.
Q&A from Memory
Tests a model's ability to accurately retrieve and reason over information stored from previous interactions. Evaluates how well an AI recalls personal context and details from its memory to answer factual questions.
Task Type
Selected Models
0 selected
Input Mode
Choose an input method to start evaluating
Input Text
Golden Output
Drop CSV here or click to browse
Columns: input_text, golden_output
Input Text Column
Golden Output Column
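A batch-evaluation CSV pairs each prompt with the reference answer it is scored against. A minimal sketch of the two required columns, with illustrative rows, might look like this:

```python
import csv
import io

# Illustrative contents for a batch evaluation file. Each row pairs a
# prompt (input_text) with the "golden" reference answer (golden_output)
# the model's response is compared against.
sample = """input_text,golden_output
What is the capital of France?,Paris
Who wrote Hamlet?,William Shakespeare
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row["input_text"], "->", row["golden_output"])
```

The column-mapping dropdowns above let you point StackEval at differently named columns if your file doesn't use these exact headers.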
Input Mode
Upload a LoCoMo JSON file to test model memory over long conversational contexts
Drop your LoCoMo JSON file here or click to browse
Must contain "conversation" and "qa" fields
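A pre-upload check for those two required fields can be sketched as follows. Only the presence of the "conversation" and "qa" keys is stated above; the rest of the LoCoMo schema (speakers, question/answer shapes) is illustrative:

```python
import json

# Minimal structural check, assuming only what the upload hint states:
# the JSON object must contain "conversation" and "qa" fields.
def validate_locomo(text: str) -> bool:
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "conversation" in data and "qa" in data

# Hypothetical example payload with both required fields present.
sample = json.dumps({
    "conversation": [{"speaker": "A", "text": "My cat is named Miso."}],
    "qa": [{"question": "What is the cat's name?", "answer": "Miso"}],
})
print(validate_locomo(sample))   # True
print(validate_locomo("{}"))     # False: valid JSON, required fields missing
```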
Input Mode
Upload a PDF document and ask a question about it
PDF Document *
Drop your PDF file here or click to browse
PDF files only
Question (Prompt) *
Expected Answer *
Embedding Provider *
Embedding Model *
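Behind a PDF Q&A run, the embedding provider and model are typically used to retrieve the document chunks most relevant to the question before answering. The retrieval step can be sketched as below; a real run would call the selected embedding API, whereas here a toy bag-of-words vector stands in so the example is self-contained:

```python
import math
from collections import Counter

# Stand-in embedding: token counts. A real pipeline would call the
# configured embedding provider/model instead.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical chunks extracted from an uploaded PDF.
chunks = [
    "The invoice total for March was 4,200 USD.",
    "Shipping is handled by a third-party carrier.",
]
question = "What was the invoice total in March?"

# Pick the chunk most similar to the question as context for answering.
best = max(chunks, key=lambda c: cosine(embed(c), embed(question)))
print(best)
```

The model's answer, generated from the retrieved context, is then compared against the Expected Answer field.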
Results
| Model | Cost | Latency | ROUGE-L | F1 | BLEU |
|---|---|---|---|---|---|
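One common reading of the F1 column is token-overlap F1 between the model output and the golden output, as in SQuAD-style evaluation; whether StackEval uses exactly this variant is an assumption. A minimal sketch:

```python
from collections import Counter

# Token-overlap F1: harmonic mean of precision and recall over the
# multiset of shared tokens between prediction and reference.
def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital", "the capital is Paris"))  # 1.0
```

ROUGE-L and BLEU similarly score n-gram or subsequence overlap against the golden output, while cost and latency are measured per model call.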
Upgrade to Pro
Unlock higher limits, advanced evaluations, and more control.
CHOOSE YOUR PLAN