StackEval
Evaluate multiple AI stacks side by side with consistent metrics like latency, accuracy, and throughput.
What will you benchmark?
Built for every task type
Select a task type to see how StackEval configures the evaluation for that use case.
Q&A from Memory
Tests a model's ability to accurately retrieve and reason over information stored from previous interactions. Evaluates how well an AI recalls personal context and details from its memory to answer factual questions.
Task Type
Selected Models
0 selected
Input Mode
Choose an input method to start evaluating
Input Text
Golden Output
Drop CSV here or click to browse
Columns: input_text, golden_output
Input Text Column
Golden Output Column
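A batch-evaluation CSV pairs each prompt with the reference answer it is scored against. A minimal sketch of the two required columns, with illustrative rows, might look like this:

```python
import csv
import io

# Illustrative contents for a batch evaluation file. Each row pairs a
# prompt (input_text) with the "golden" reference answer (golden_output)
# the model's response is compared against.
sample = """input_text,golden_output
What is the capital of France?,Paris
Who wrote Hamlet?,William Shakespeare
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row["input_text"], "->", row["golden_output"])
```

The column-mapping dropdowns above let you point StackEval at differently named columns if your file doesn't use these exact headers.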
Input Mode
Upload a LoCoMo JSON file to test model memory over long conversational contexts
Drop your LoCoMo JSON file here or click to browse
Must contain "conversation" and "qa" fields
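A pre-upload check for those two required fields can be sketched as follows. Only the presence of the "conversation" and "qa" keys is stated above; the rest of the LoCoMo schema (speakers, question/answer shapes) is illustrative:

```python
import json

# Minimal structural check, assuming only what the upload hint states:
# the JSON object must contain "conversation" and "qa" fields.
def validate_locomo(text: str) -> bool:
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "conversation" in data and "qa" in data

# Hypothetical example payload with both required fields present.
sample = json.dumps({
    "conversation": [{"speaker": "A", "text": "My cat is named Miso."}],
    "qa": [{"question": "What is the cat's name?", "answer": "Miso"}],
})
print(validate_locomo(sample))   # True
print(validate_locomo("{}"))     # False: valid JSON, required fields missing
```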
Input Mode
Upload a PDF document and ask a question about it
PDF Document *
Drop your PDF file here or click to browse
PDF files only
Question (Prompt) *
Expected Answer *
Embedding Provider *
Embedding Model *
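Behind a PDF Q&A run, the embedding provider and model are typically used to retrieve the document chunks most relevant to the question before answering. The retrieval step can be sketched as below; a real run would call the selected embedding API, whereas here a toy bag-of-words vector stands in so the example is self-contained:

```python
import math
from collections import Counter

# Stand-in embedding: token counts. A real pipeline would call the
# configured embedding provider/model instead.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical chunks extracted from an uploaded PDF.
chunks = [
    "The invoice total for March was 4,200 USD.",
    "Shipping is handled by a third-party carrier.",
]
question = "What was the invoice total in March?"

# Pick the chunk most similar to the question as context for answering.
best = max(chunks, key=lambda c: cosine(embed(c), embed(question)))
print(best)
```

The model's answer, generated from the retrieved context, is then compared against the Expected Answer field.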
Results
| Model | Cost | Latency | ROUGE-L | F1 | BLEU |
|---|---|---|---|---|---|
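One common reading of the F1 column is token-overlap F1 between the model output and the golden output, as in SQuAD-style evaluation; whether StackEval uses exactly this variant is an assumption. A minimal sketch:

```python
from collections import Counter

# Token-overlap F1: harmonic mean of precision and recall over the
# multiset of shared tokens between prediction and reference.
def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital", "the capital is Paris"))  # 1.0
```

ROUGE-L and BLEU similarly score n-gram or subsequence overlap against the golden output, while cost and latency are measured per model call.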
Upgrade to Pro
Unlock higher limits, advanced evaluations, and more control.
CHOOSE YOUR PLAN