Scaling FAITH Model Evaluation Analysis with BigQuery

FAITH (Foundation AI's Testing Hub) evaluates LLMs on cybersecurity benchmarks, generating performance metrics and configuration data as CSV exports and JSON files for each run.

For a handful of experiments, these local files are sufficient. But as evaluation becomes continuous (testing model checkpoints, comparing hyperparameters, tracking regressions across releases), comparing metrics across scattered files becomes unmanageable.

❌ Metrics File (scattered)
{
  "trials/373366/e1cbf4": {
    "accuracy": 0.1021,
    "accuracy_per_subject": {
      // 40+ nested fields...
    }
  }
}

❌ Config File (separate)
"metadata": {
  "end_time": "2026-04-02 14:23:15",
  "run_args": [
    "--benchmarks", "cybermetric-80",
    // 15+ more args...
  ],
  "version": "0.1.dev1"
}

We ran into this ourselves. After evaluating dozens of models across various benchmarks, we found ourselves repeatedly writing one-off scripts to extract trends from scattered files and rebuilding the same comparison tables for different stakeholders.

FAITH produced the evaluation data, but we had no scalable way to analyze it after the fact.

Today we're releasing BigQuery integration for FAITH: a structured approach to metrics storage, querying, and analysis that treats evaluation data as a first-class dataset.

Before: Scattered & Unstructured

  • eval_logs_1.json
  • eval_logs_2.json
  • metrics_out.csv
  • results_3.json
  • config.yaml
  • metadata.json

✗ Hard to query
✗ No single source of truth
✗ Manual aggregation needed
✗ Difficult to analyze trends

BigQuery Ingestion

After: Clean & Queryable

SELECT * FROM metrics

Model           Benchmark    Score  Timestamp
orion-1b        MMLU         0.412  2026-04-02
foundation-sec  CTI          0.872  2026-04-02
orion-1b        MMLU         0.398  2026-04-01
foundation-sec  CTI          0.865  2026-04-01
llama-3-8B      SecQA        0.712  2026-04-02
mistral-7b      CyberMetric  0.714  2026-04-02
foundation-sec  SecQA        0.878  2026-04-01

✓ SQL queries in seconds
✓ Centralized data store
✓ Easy trend analysis
✓ Instant visualizations

The Solution: Treating Metrics as Structured Data

FAITH now writes evaluation metrics directly to a normalized BigQuery table where each row captures a complete metric observation: model, benchmark, configuration parameters, and score.

METRICS

Key  Type       Field
PK   STRING     metrics_file_uri
     STRING     model_key
     STRING     source_uri
     STRING     benchmark
     STRING     metric_name
     FLOAT64    metric_value
     BOOL       is_primary
     INT64      num_shots
     INT64      num_shots_pool_size
     FLOAT64    temperature
     FLOAT64    top_p
     INT64      max_completion_tokens
     INT64      context_length
     STRING     generation_mode
     STRING     prompt_format
     STRING     faith_version
     TIMESTAMP  ingest_time
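
To make the row shape concrete, here is a minimal Python sketch of flattening one nested metrics file into rows that match a subset of the schema above. The `MetricRow` class, the `flatten_metrics` helper, the subject names, and the choice of `accuracy` as the primary metric are illustrative assumptions, not FAITH's actual implementation; the input only mirrors the shape of the metrics file excerpt shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRow:
    """One metric observation; field names follow the METRICS schema."""
    metrics_file_uri: str
    model_key: str
    benchmark: str
    metric_name: str
    metric_value: float
    is_primary: bool


def flatten_metrics(uri, model_key, benchmark, metrics, primary="accuracy", prefix=""):
    """Walk a nested metrics dict and emit one row per numeric leaf.

    Nested keys are joined with '.', e.g. 'accuracy_per_subject.malware'.
    Treating `primary` as the headline metric is an assumption.
    """
    rows = []
    for key, value in metrics.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted prefix.
            rows.extend(
                flatten_metrics(uri, model_key, benchmark, value, primary, f"{name}.")
            )
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            rows.append(
                MetricRow(uri, model_key, benchmark, name, float(value), name == primary)
            )
    return rows


# Hypothetical input: the subject names and values below are made up.
example = {
    "accuracy": 0.1021,
    "accuracy_per_subject": {"cryptography": 0.25, "malware": 0.05},
}
rows = flatten_metrics(
    "gs://my-bucket/results/metrics.json", "orion-1b", "cybermetric-80", example
)
for row in rows:
    print(row.metric_name, row.metric_value, row.is_primary)
```

Emitting one row per metric, rather than one wide row per run, is what lets later queries filter on `metric_name` or `is_primary` without knowing each benchmark's metric layout in advance.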

The ingestion command is a single line:

bash
faith summarize --experiment-path gs://my-bucket/results \
  --output-format bigquery

The tool manages the table automatically: it creates the table schema on first run, prevents duplicate ingestion, and ensures data consistency. Once metrics are in the table, they're immediately queryable without any preprocessing.
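
The post does not describe how duplicates are detected. One plausible approach, sketched below under that assumption, is to skip any file whose `metrics_file_uri` is already present in the table. The sketch uses Python's built-in `sqlite3` module as a local stand-in for BigQuery, and the `ingest_file` helper and trimmed three-column table are hypothetical.

```python
import sqlite3


def ingest_file(conn, uri, rows):
    """Insert rows for one metrics file unless that file was already ingested.

    Returns the number of rows written (0 when the file is skipped).
    """
    # Create the table on first use; IF NOT EXISTS makes repeat calls safe.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "metrics_file_uri TEXT, metric_name TEXT, metric_value REAL)"
    )
    already = conn.execute(
        "SELECT 1 FROM metrics WHERE metrics_file_uri = ? LIMIT 1", (uri,)
    ).fetchone()
    if already:
        return 0
    conn.executemany(
        "INSERT INTO metrics VALUES (?, ?, ?)",
        [(uri, name, value) for name, value in rows],
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
rows = [("accuracy", 0.1021)]
first = ingest_file(conn, "gs://my-bucket/results/run-a.json", rows)
second = ingest_file(conn, "gs://my-bucket/results/run-a.json", rows)
print(first, second)  # 1 0
```

Keying the check on the file URI makes re-running ingestion over the same results directory a no-op for files that were already loaded.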

What This Unlocks

Longitudinal Analysis

With evaluation data in a database, tracking changes over time becomes straightforward. You can measure whether a model checkpoint improved or regressed on specific benchmarks, compare performance across experiment versions, or identify when a configuration change affected results.

Each metric row also carries provenance metadata: the FAITH version, configuration parameters, and ingestion timestamp are stored automatically, making it possible to answer questions like "what version produced this result?" or "what config was used for this run?"

This was effectively impossible to do at scale with per-run CSV files.
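
As an illustration of this kind of question, the sketch below computes the change in a model's primary score between consecutive ingestions. It uses `sqlite3` as a local stand-in for BigQuery and reuses the illustrative orion-1b scores shown earlier; the `score_deltas` helper is hypothetical, and `ingest_time` is stored as an ISO date string for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (model_key TEXT, benchmark TEXT, "
    "metric_value REAL, is_primary INTEGER, ingest_time TEXT)"
)
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, 1, ?)",
    [
        ("orion-1b", "MMLU", 0.398, "2026-04-01"),
        ("orion-1b", "MMLU", 0.412, "2026-04-02"),
    ],
)


def score_deltas(conn, model_key, benchmark):
    """Return (ingest_time, score, change_since_previous) in time order."""
    history = conn.execute(
        "SELECT ingest_time, metric_value FROM metrics "
        "WHERE model_key = ? AND benchmark = ? AND is_primary "
        "ORDER BY ingest_time",
        (model_key, benchmark),
    ).fetchall()
    deltas = []
    previous = None
    for ingest_time, score in history:
        # The first observation has no predecessor, so its change is None.
        change = None if previous is None else round(score - previous, 4)
        deltas.append((ingest_time, score, change))
        previous = score
    return deltas


deltas = score_deltas(conn, "orion-1b", "MMLU")
for row in deltas:
    print(row)
```

In BigQuery itself, the same comparison could be written in SQL with the `LAG` window function instead of a Python loop.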

Cross-Run Aggregation

Aggregating metrics for multiple models across multiple runs (e.g., "average accuracy on all CTI benchmarks") no longer requires collecting CSV files and writing custom merge scripts. It is now a single SQL query with a GROUP BY clause.
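
A minimal sketch of such a query follows, again using `sqlite3` as a stand-in so that it runs locally. The table name `metrics` and the filter on `is_primary` are assumptions, and the rows reuse illustrative scores from earlier in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (model_key TEXT, benchmark TEXT, "
    "metric_value REAL, is_primary INTEGER)"
)
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, 1)",
    [
        ("foundation-sec", "CTI", 0.872),
        ("foundation-sec", "CTI", 0.865),
        ("foundation-sec", "SecQA", 0.878),
        ("llama-3-8B", "SecQA", 0.712),
    ],
)

# Mean primary score and run count per model/benchmark pair.
AVERAGE_BY_MODEL = """
SELECT model_key, benchmark, AVG(metric_value) AS mean_score, COUNT(*) AS runs
FROM metrics
WHERE is_primary
GROUP BY model_key, benchmark
ORDER BY model_key, benchmark
"""

results = conn.execute(AVERAGE_BY_MODEL).fetchall()
for model_key, benchmark, mean_score, runs in results:
    print(f"{model_key:15} {benchmark:6} {mean_score:.4f} ({runs} runs)")
```

Adding configuration columns such as `num_shots` or `temperature` to the `GROUP BY` list would split the averages by evaluation setting.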

[Interactive table: Cross-Run Results, with columns Model, Benchmark, Accuracy, Num Shots, Temperature]

Team Workflows

For teams running evaluation in CI/CD pipelines, BigQuery ingestion can integrate directly into the workflow. Results flow from pipeline completion to queryable data to dashboards without manual export or upload steps.

Pre-Built Views and Queries

We include pre-built SQL views for common analysis patterns:

  • Leaderboard view: Latest primary metrics for each model/benchmark combination. Optimized for building rankings and dashboards that compare model performance across benchmarks.
  • Latest runs view: Most recent ingestion for each unique experiment run, including all metrics. Useful for detailed analysis and debugging specific evaluation runs.
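
The post does not show the view definitions. As a rough sketch of what a leaderboard-style view could look like, the example below keeps only the most recent primary metric for each model/benchmark pair. It runs on `sqlite3` as a local stand-in; the view name `leaderboard` and its SQL are assumptions, not the definitions shipped with FAITH.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (model_key TEXT, benchmark TEXT, "
    "metric_value REAL, is_primary INTEGER, ingest_time TEXT)"
)
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, 1, ?)",
    [
        ("orion-1b", "MMLU", 0.398, "2026-04-01"),
        ("orion-1b", "MMLU", 0.412, "2026-04-02"),
        ("foundation-sec", "CTI", 0.865, "2026-04-01"),
        ("foundation-sec", "CTI", 0.872, "2026-04-02"),
    ],
)

# Join each row against the latest ingest_time of its model/benchmark group,
# so only the most recent primary metric per pair survives.
conn.execute(
    """
    CREATE VIEW leaderboard AS
    SELECT m.model_key, m.benchmark, m.metric_value, m.ingest_time
    FROM metrics AS m
    JOIN (
        SELECT model_key, benchmark, MAX(ingest_time) AS latest
        FROM metrics
        WHERE is_primary
        GROUP BY model_key, benchmark
    ) AS l
      ON m.model_key = l.model_key
     AND m.benchmark = l.benchmark
     AND m.ingest_time = l.latest
    WHERE m.is_primary
    """
)

board = conn.execute(
    "SELECT * FROM leaderboard ORDER BY model_key, benchmark"
).fetchall()
for row in board:
    print(row)
```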

Visualization and Dashboards

[Interactive leaderboard: FAITH Leaderboard, comparing evaluation results across security benchmarks. Columns: Model, CTI, CyberMetric, MMLU, SecQA, Mean Score. Each model name expands to show its configuration parameters.]

Evaluation results can be visualized using any tool that connects to BigQuery, such as Looker Studio, Tableau, Grafana, or custom dashboards. Rather than emailing CSV exports or screenshots, you can share a live dashboard that updates automatically as new results are ingested. We've included a dashboard cookbook that covers building dashboards with Looker Studio.

How to Use It

Install FAITH with BigQuery dependencies:

bash
pip install "faith[bigquery]"

Configure your BigQuery dataset and run ingestion:

bash
export FAITH_BIGQUERY_PROJECT=my-project
export FAITH_BIGQUERY_DATASET=faith_results

faith summarize --experiment-path /path/to/results \
  --output-format bigquery

The table is created automatically on first ingestion. From there, you can query directly or create the pre-built views for common analysis patterns.
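
For querying from Python, a hedged sketch using the `google-cloud-bigquery` client library is shown below. It reads the same two environment variables set above; the table name `metrics` is an assumption (check the name FAITH actually creates in your dataset), and `build_primary_metrics_query` and `fetch_primary_metrics` are hypothetical helpers. The client import is deferred so the query builder can be used without the library or credentials.

```python
import os


def build_primary_metrics_query(project, dataset, table="metrics"):
    """Build a query for primary metrics; `table` is an assumed name."""
    return (
        "SELECT model_key, benchmark, metric_value, ingest_time\n"
        f"FROM `{project}.{dataset}.{table}`\n"
        "WHERE is_primary\n"
        "ORDER BY ingest_time DESC"
    )


def fetch_primary_metrics():
    """Run the query on BigQuery; needs google-cloud-bigquery and credentials."""
    from google.cloud import bigquery  # deferred: optional dependency

    project = os.environ["FAITH_BIGQUERY_PROJECT"]
    dataset = os.environ["FAITH_BIGQUERY_DATASET"]
    client = bigquery.Client(project=project)
    rows = client.query(build_primary_metrics_query(project, dataset)).result()
    return [dict(row.items()) for row in rows]


# Only the query text is built here; nothing is sent to BigQuery.
sql = build_primary_metrics_query("my-project", "faith_results")
print(sql)
```

Calling `fetch_primary_metrics()` requires application default credentials with read access to the dataset.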

Full documentation:

Why This Matters

Evaluation infrastructure matters as much as the benchmarks themselves. Treating metrics as structured data (queryable, easily aggregated, and shareable) is a prerequisite for evaluation at scale. BigQuery integration gives teams a data store that grows with their evaluation workload.

FAITH is open source. If you're using it and have feedback on the BigQuery integration, we'd like to hear it. If you've built custom queries or views that generalize beyond your specific use case, consider contributing them back.