Scaling FAITH Model Evaluation Analysis with BigQuery
FAITH (Foundation AI's Testing Hub) evaluates LLMs on cybersecurity benchmarks, generating performance metrics and configuration data as CSV exports and JSON files for each run.
For a handful of experiments, these local files are sufficient. But as evaluation becomes continuous (testing model checkpoints, comparing hyperparameters, tracking regressions across releases), comparing metrics across scattered files becomes unmanageable.
{
  "trials/373366/e1cbf4": {
    "accuracy": 0.1021,
    "accuracy_per_subject": {
      // 40+ nested fields...
    }
  }
}
"metadata": {
  "end_time": "2026-04-02 14:23:15",
  "run_args": [
    "--benchmarks",
    "cybermetric-80",
    // 15+ more args...
  ],
  "version": "0.1.dev1"
}
We ran into this ourselves. After evaluating dozens of models across various benchmarks, we found ourselves repeatedly writing one-off scripts to extract trends from scattered files and rebuilding the same comparison tables for different stakeholders.
FAITH produced the evaluation data, but we had no scalable way to analyze it after the fact.
Today we're releasing BigQuery integration for FAITH: a structured approach to metrics storage, querying, and analysis that treats evaluation data as a first-class dataset.
The Solution: Treating Metrics as Structured Data
FAITH now writes evaluation metrics directly to a normalized BigQuery table where each row captures a complete metric observation: model, benchmark, configuration parameters, and score.
| Key | Type | Field |
|---|---|---|
| PK | STRING | metrics_file_uri |
| | STRING | model_key |
| | STRING | source_uri |
| | STRING | benchmark |
| | STRING | metric_name |
| | FLOAT64 | metric_value |
| | BOOL | is_primary |
| | INT64 | num_shots |
| | INT64 | num_shots_pool_size |
| | FLOAT64 | temperature |
| | FLOAT64 | top_p |
| | INT64 | max_completion_tokens |
| | INT64 | context_length |
| | STRING | generation_mode |
| | STRING | prompt_format |
| | STRING | faith_version |
| | TIMESTAMP | ingest_time |
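To make the "one row per metric observation" idea concrete, here is a minimal Python sketch of how a nested metrics file might be flattened into rows shaped like the table above. It is an illustration, not FAITH's actual ingestion code: the `flatten_metrics` helper, the dotted naming of nested metrics, the `example-model` key, and the `network_security` subject are all assumptions made for the example.

```python
from datetime import datetime, timezone


def flatten_metrics(metrics_file_uri, model_key, benchmark, metrics,
                    primary_metric="accuracy"):
    """Turn one run's nested metrics dict into flat rows, one per numeric value.

    Hypothetical helper: nested keys are joined with dots, which is an
    assumption for this sketch and not necessarily FAITH's naming scheme.
    """
    ingest_time = datetime.now(timezone.utc).isoformat()
    rows = []

    def walk(prefix, value):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(f"{prefix}.{key}" if prefix else key, child)
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            rows.append({
                "metrics_file_uri": metrics_file_uri,
                "model_key": model_key,
                "benchmark": benchmark,
                "metric_name": prefix,
                "metric_value": float(value),
                "is_primary": prefix == primary_metric,
                "ingest_time": ingest_time,
            })

    walk("", metrics)
    return rows


# Illustrative input; the subject name and its score are made up.
example = {"accuracy": 0.1021, "accuracy_per_subject": {"network_security": 0.25}}
rows = flatten_metrics(
    "gs://my-bucket/results/metrics.json",  # hypothetical file URI
    "example-model",                        # hypothetical model key
    "cybermetric-80",
    example,
)
for row in rows:
    print(row["metric_name"], row["metric_value"], row["is_primary"])
```

Configuration columns such as `num_shots` or `temperature` are left out to keep the sketch short; in a real row they would sit alongside the metric fields.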
The ingestion command is a single line:
faith summarize --experiment-path gs://my-bucket/results \
--output-format bigquery
The tool manages the table automatically: it creates the table schema on first run, prevents duplicate ingestion, and ensures data consistency. Once metrics are in the table, they're immediately queryable without any preprocessing.
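How FAITH prevents duplicates is not described here, but the general idea of idempotent ingestion can be sketched in a few lines. The Python example below uses the standard library's `sqlite3` as a local stand-in for BigQuery and assumes, purely for illustration, that a file counts as already ingested when its `metrics_file_uri` is present in the table. The `ingest` function and the `metrics` table name are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery in this sketch.
conn = sqlite3.connect(":memory:")
# IF NOT EXISTS mirrors "create the table on first run".
conn.execute(
    "CREATE TABLE IF NOT EXISTS metrics ("
    "metrics_file_uri TEXT, metric_name TEXT, metric_value REAL)"
)


def ingest(conn, metrics_file_uri, metrics):
    """Insert rows for one metrics file unless that file is already present.

    Returns the number of rows written (0 when the file was skipped).
    """
    already = conn.execute(
        "SELECT 1 FROM metrics WHERE metrics_file_uri = ? LIMIT 1",
        (metrics_file_uri,),
    ).fetchone()
    if already:
        return 0
    conn.executemany(
        "INSERT INTO metrics VALUES (?, ?, ?)",
        [(metrics_file_uri, name, value) for name, value in metrics.items()],
    )
    conn.commit()
    return len(metrics)


uri = "gs://my-bucket/results/run-a/metrics.json"  # hypothetical path
first = ingest(conn, uri, {"accuracy": 0.1021})
second = ingest(conn, uri, {"accuracy": 0.1021})  # same file again: skipped
total = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
print(first, second, total)
```

Keying the check on the source file makes re-running ingestion over the same results directory safe, which matters when ingestion runs unattended.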
What This Unlocks
Longitudinal Analysis
With evaluation data in a database, tracking changes over time becomes straightforward. You can measure whether a model checkpoint improved or regressed on specific benchmarks, compare performance across experiment versions, or identify when a configuration change affected results. Each metric includes provenance metadata: the FAITH version, configuration parameters, and ingestion timestamp are stored automatically, making it possible to answer questions like "what version produced this result?" or "what config was used for this run?"
This was effectively impossible to do at scale with per-run CSV files.
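As a rough illustration of a longitudinal query, the Python sketch below orders primary metrics by ingestion time and computes the change between consecutive checkpoints. It uses `sqlite3` as a local stand-in for BigQuery; the `metrics` table name, the `ckpt-*` model keys, and all scores are invented for the example, and only a subset of the schema's columns is included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in for BigQuery
conn.execute(
    "CREATE TABLE metrics (model_key TEXT, benchmark TEXT, metric_name TEXT, "
    "metric_value REAL, is_primary INTEGER, faith_version TEXT, ingest_time TEXT)"
)
# Made-up scores for three hypothetical checkpoints of one model.
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("ckpt-100", "cybermetric-80", "accuracy", 0.50, 1, "0.1.dev1", "2026-04-01 10:00:00"),
        ("ckpt-200", "cybermetric-80", "accuracy", 0.75, 1, "0.1.dev1", "2026-04-02 10:00:00"),
        ("ckpt-300", "cybermetric-80", "accuracy", 0.625, 1, "0.1.dev1", "2026-04-03 10:00:00"),
    ],
)

# Primary metric history for one benchmark, oldest first.
history = conn.execute(
    "SELECT model_key, metric_value, faith_version, ingest_time "
    "FROM metrics WHERE benchmark = ? AND is_primary = 1 "
    "ORDER BY ingest_time",
    ("cybermetric-80",),
).fetchall()

# Change relative to the previous checkpoint; negative means a regression.
changes = []
for previous, current in zip(history, history[1:]):
    delta = current[1] - previous[1]
    changes.append((current[0], round(delta, 4)))

for model_key, delta in changes:
    print(model_key, delta)
```

In BigQuery itself the same comparison could likely be written with a window function over `ingest_time`; the Python loop is used here only to keep the sketch easy to follow.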
Cross-Run Aggregation
Aggregating metrics for multiple models across multiple runs (e.g. "average accuracy on all CTI benchmarks") no longer requires copying over various CSV files and writing custom merge scripts. It's now a simple SQL query with a GROUP BY clause.
[Interactive results table with sortable columns: Model, Benchmark, Accuracy, Num Shots, Temperature]
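A minimal sketch of the "average accuracy on all CTI benchmarks" query follows, again using Python's `sqlite3` as a local stand-in for BigQuery. The `metrics` table name, the `cti-` benchmark prefix, the model keys, and the scores are assumptions for illustration; a real BigQuery query would reference the fully qualified table instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in for BigQuery
conn.execute(
    "CREATE TABLE metrics (model_key TEXT, benchmark TEXT, "
    "metric_name TEXT, metric_value REAL)"
)
# Made-up rows: two hypothetical CTI benchmarks plus one unrelated benchmark.
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, ?)",
    [
        ("model-a", "cti-mcq", "accuracy", 0.50),
        ("model-a", "cti-rcm", "accuracy", 0.75),
        ("model-a", "secqa", "accuracy", 1.00),  # excluded by the filter
        ("model-b", "cti-mcq", "accuracy", 0.25),
        ("model-b", "cti-rcm", "accuracy", 0.50),
    ],
)

# One GROUP BY replaces the per-file merge scripts.
averages = conn.execute(
    "SELECT model_key, AVG(metric_value) AS avg_accuracy "
    "FROM metrics "
    "WHERE metric_name = 'accuracy' AND benchmark LIKE 'cti-%' "
    "GROUP BY model_key "
    "ORDER BY avg_accuracy DESC"
).fetchall()

for model_key, avg_accuracy in averages:
    print(model_key, avg_accuracy)
```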
Team Workflows
For teams running evaluation in CI/CD pipelines, BigQuery ingestion can integrate directly into the workflow. Results flow from pipeline completion to queryable data to dashboards without manual export or upload steps.
Pre-Built Views and Queries
We include pre-built SQL views for common analysis patterns:
- Leaderboard view: Latest primary metrics for each model/benchmark combination. Optimized for building rankings and dashboards that compare model performance across benchmarks.
- Latest runs view: Most recent ingestion for each unique experiment run, including all metrics. Useful for detailed analysis and debugging specific evaluation runs.
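The shipped view definitions are not reproduced here, but the leaderboard idea (latest primary metric per model/benchmark pair) can be sketched as follows. This Python example uses `sqlite3` as a local stand-in for BigQuery; the `metrics` table, the `leaderboard` view name, and all data are hypothetical, and the real views may be defined differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in for BigQuery
conn.execute(
    "CREATE TABLE metrics (model_key TEXT, benchmark TEXT, metric_name TEXT, "
    "metric_value REAL, is_primary INTEGER, ingest_time TEXT)"
)
# Made-up rows: model-a was ingested twice, so only its newer score should show.
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("model-a", "secqa", "accuracy", 0.50, 1, "2026-04-01 10:00:00"),
        ("model-a", "secqa", "accuracy", 0.75, 1, "2026-04-02 10:00:00"),
        ("model-a", "secqa", "stderr", 0.125, 0, "2026-04-02 10:00:00"),
        ("model-b", "secqa", "accuracy", 0.25, 1, "2026-04-01 12:00:00"),
    ],
)

# Keep only primary metrics from the most recent ingestion of each pair.
conn.execute(
    """
    CREATE VIEW leaderboard AS
    SELECT m.model_key, m.benchmark, m.metric_value
    FROM metrics AS m
    JOIN (
        SELECT model_key, benchmark, MAX(ingest_time) AS latest
        FROM metrics
        WHERE is_primary = 1
        GROUP BY model_key, benchmark
    ) AS l
      ON m.model_key = l.model_key
     AND m.benchmark = l.benchmark
     AND m.ingest_time = l.latest
    WHERE m.is_primary = 1
    """
)

leaderboard = conn.execute(
    "SELECT model_key, benchmark, metric_value FROM leaderboard ORDER BY model_key"
).fetchall()
for entry in leaderboard:
    print(entry)
```

Filtering on `is_primary` keeps secondary metrics out of the ranking, while the join on the latest `ingest_time` hides superseded ingestions.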
Visualization and Dashboards
[Interactive dashboard example: FAITH Leaderboard, comparing evaluation results across security benchmarks. Columns: Model, CTI, CyberMetric, MMLU, SecQA, Mean Score]
Evaluation results can be visualized using any tool that connects to BigQuery, such as Looker Studio, Tableau, Grafana, or custom dashboards. Rather than emailing CSV exports or screenshots, you can share a live dashboard that updates automatically as new results are ingested. We've included a dashboard cookbook that covers building dashboards with Looker Studio.
How to Use It
Install FAITH with BigQuery dependencies:
pip install "faith[bigquery]"
Configure your BigQuery dataset and run ingestion:
export FAITH_BIGQUERY_PROJECT=my-project
export FAITH_BIGQUERY_DATASET=faith_results
faith summarize --experiment-path /path/to/results \
--output-format bigquery
The table is created automatically on first ingestion. From there, you can query directly or create the pre-built views for common analysis patterns.
Full documentation:
- Quickstart Guide - Setup and basic usage
- Schema Reference - Table structure and field definitions
- Dashboard Cookbook - Building visualizations in Looker Studio
- SQL Examples - Ready-to-run queries for common tasks
Why This Matters
Evaluation infrastructure matters as much as the benchmarks themselves. Treating metrics as structured data that is queryable, easily aggregated, and shareable is a prerequisite for evaluation at scale. The BigQuery integration gives teams a way for their evaluation data to grow with their evaluation workload.
FAITH is open source. If you're using it and have feedback on the BigQuery integration, we'd like to hear it. If you've built custom queries or views that generalize beyond your specific use case, consider contributing them back.