Scaling FAITH Model Evaluation Analysis with BigQuery

FAITH (Foundation AI's Testing Hub) evaluates LLMs on cybersecurity benchmarks, generating performance metrics and configuration data as CSV exports and JSON files for each run.

For a handful of experiments, these local files are sufficient. But as evaluation becomes continuous (testing model checkpoints, comparing hyperparameters, tracking regressions across releases), comparing metrics across scattered files becomes unmanageable.

❌ Metrics File (scattered)

{
  "trials/373366/e1cbf4": {
    "accuracy": 0.1021,
    "accuracy_per_subject": {
      // 40+ nested fields...
    }
  }
}

❌ Config File (separate)

"metadata": {
  "end_time": "2026-04-02 14:23:15",
  "run_args": [
    "--benchmarks",
    "cybermetric-80",
    // 15+ more args...
  ],
  "version": "0.1.dev1"
}

We ran into this ourselves. After evaluating dozens of models across various benchmarks, we found ourselves repeatedly writing one-off scripts to extract trends from scattered files and rebuilding the same comparison tables for different stakeholders.

FAITH produced the evaluation data, but we had no scalable way to analyze it after the fact.

Today we're releasing BigQuery integration for FAITH: a structured approach to metrics storage, querying, and analysis that treats evaluation data as a first-class dataset.

Before: Scattered & Unstructured

⚠ eval_logs_1.json
⚠ eval_logs_2.json
⚠ metrics_out.csv
⚠ results_3.json
⚠ config.yaml
⚠ metadata.json

✗ Hard to query
✗ No single source of truth
✗ Manual aggregation needed
✗ Difficult to analyze trends

BigQuery Ingestion

After: Clean & Queryable

🗃 SELECT * FROM metrics

Model	Benchmark	Score	Timestamp
orion-1b	MMLU	0.412	2026-04-02
foundation-sec	CTI	0.872	2026-04-02
orion-1b	MMLU	0.398	2026-04-01
foundation-sec	CTI	0.865	2026-04-01
llama-3-8B	SecQA	0.712	2026-04-02
mistral-7b	CyberMetric	0.714	2026-04-02
foundation-sec	SecQA	0.878	2026-04-01

✓ SQL queries in seconds
✓ Centralized data store
✓ Easy trend analysis
✓ Instant visualizations

The Solution: Treating Metrics as Structured Data

FAITH now writes evaluation metrics directly to a normalized BigQuery table where each row captures a complete metric observation: model, benchmark, configuration parameters, and score.

METRICS

Key	Type	Field
PK	STRING	metrics_file_uri
	STRING	model_key
	STRING	source_uri
	STRING	benchmark
	STRING	metric_name
	FLOAT64	metric_value
	BOOL	is_primary
	INT64	num_shots
	INT64	num_shots_pool_size
	FLOAT64	temperature
	FLOAT64	top_p
	INT64	max_completion_tokens
	INT64	context_length
	STRING	generation_mode
	STRING	prompt_format
	STRING	faith_version
	TIMESTAMP	ingest_time
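
To make the row shape concrete, here is a minimal Python sketch of how one nested metrics file might be flattened into one row per metric observation. The function name `flatten_metrics`, the "parent/child" naming for per-subject metrics, the choice of `accuracy` as the primary metric, and the example subject and file path are all assumptions for illustration; this is not FAITH's actual ingestion code.

```python
from typing import Any


def flatten_metrics(
    metrics: dict[str, dict[str, Any]],
    metrics_file_uri: str,
    model_key: str,
    benchmark: str,
    primary_metric: str = "accuracy",  # assumption: which metric counts as primary
) -> list[dict[str, Any]]:
    """Turn a nested metrics mapping into one flat row per metric observation."""
    rows = []
    for values in metrics.values():
        for name, value in values.items():
            if isinstance(value, dict):
                # Nested breakdowns become "parent/child" names (hypothetical convention).
                items = [(f"{name}/{sub}", sub_value) for sub, sub_value in value.items()]
            else:
                items = [(name, value)]
            for metric_name, metric_value in items:
                rows.append(
                    {
                        "metrics_file_uri": metrics_file_uri,
                        "model_key": model_key,
                        "benchmark": benchmark,
                        "metric_name": metric_name,
                        "metric_value": float(metric_value),
                        "is_primary": metric_name == primary_metric,
                    }
                )
    return rows


# The subject name "crypto" and the file path are made up for this example.
example = {
    "trials/373366/e1cbf4": {
        "accuracy": 0.1021,
        "accuracy_per_subject": {"crypto": 0.25},
    }
}
rows = flatten_metrics(
    example, "gs://my-bucket/results/metrics.json", "orion-1b", "cybermetric-80"
)
for row in rows:
    print(row["metric_name"], row["metric_value"], row["is_primary"])
```

The configuration columns (num_shots, temperature, and so on) would be copied onto every row in the same way, which is what lets a single query filter by both score and configuration.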

The ingestion command is a single line:

bash

faith summarize --experiment-path gs://my-bucket/results \
  --output-format bigquery

The tool manages the table automatically: it creates the table schema on first run, prevents duplicate ingestion, and ensures data consistency. Once metrics are in the table, they're immediately queryable without any preprocessing.
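
The post does not describe how duplicate ingestion is prevented. One plausible approach, sketched below with an in-memory SQLite database standing in for BigQuery, is to skip any metrics file whose URI is already present in the table. The helper `ingest_once`, the reduced three-column table, and the file path are hypothetical.

```python
import sqlite3


def ingest_once(conn, metrics_file_uri, rows):
    """Insert rows for a metrics file unless that file was already ingested.

    Returns the number of rows written (0 when the file is a duplicate).
    """
    already = conn.execute(
        "SELECT 1 FROM metrics WHERE metrics_file_uri = ? LIMIT 1",
        (metrics_file_uri,),
    ).fetchone()
    if already:
        return 0
    conn.executemany(
        "INSERT INTO metrics (metrics_file_uri, metric_name, metric_value) "
        "VALUES (?, ?, ?)",
        [(metrics_file_uri, name, value) for name, value in rows],
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (metrics_file_uri TEXT, metric_name TEXT, metric_value REAL)"
)
uri = "gs://my-bucket/results/run-1/metrics.json"  # hypothetical path
first = ingest_once(conn, uri, [("accuracy", 0.1021)])
second = ingest_once(conn, uri, [("accuracy", 0.1021)])
print(first, second)  # 1 0
```

Keying the check on the source file rather than on metric values means that re-running the same command over the same results directory is safe.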

What This Unlocks

Longitudinal Analysis

With evaluation data in a database, tracking changes over time reduces to a query. You can measure whether a model checkpoint improved or regressed on specific benchmarks, compare performance across experiment versions, or identify when a configuration change affected results. Each metric row also carries provenance metadata: the FAITH version, configuration parameters, and ingestion timestamp are stored automatically, making it possible to answer questions like "what version produced this result?" or "what config was used for this run?"

This was effectively impossible to do at scale with per-run CSV files.
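
As an illustration of this kind of regression tracking, the sketch below computes the change in score between consecutive ingestions of the same model/benchmark pair using the `LAG` window function. SQLite stands in for BigQuery so the example runs locally, the table is reduced to four columns, and the rows reuse the sample scores shown earlier; the same query shape should carry over to BigQuery, though that is not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics "
    "(model_key TEXT, benchmark TEXT, metric_value REAL, ingest_time TEXT)"
)
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, ?)",
    [
        ("orion-1b", "MMLU", 0.398, "2026-04-01"),
        ("orion-1b", "MMLU", 0.412, "2026-04-02"),
        ("foundation-sec", "CTI", 0.865, "2026-04-01"),
        ("foundation-sec", "CTI", 0.872, "2026-04-02"),
    ],
)

# LAG() looks up the previous score for the same model/benchmark pair,
# so delta is NULL for the first observation and signed afterwards.
query = """
SELECT model_key, benchmark, ingest_time,
       ROUND(metric_value - LAG(metric_value) OVER (
           PARTITION BY model_key, benchmark ORDER BY ingest_time
       ), 3) AS delta
FROM metrics
ORDER BY model_key, benchmark, ingest_time
"""
deltas = conn.execute(query).fetchall()
for row in deltas:
    print(row)
```

A negative delta flags a regression; filtering on `delta < 0` would turn this into a simple alerting query.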

Cross-Run Aggregation

Aggregating metrics for multiple models across multiple runs (e.g., "average accuracy on all CTI benchmarks") no longer requires copying CSV files around and writing custom merge scripts. It is now a single SQL query with a GROUP BY clause.
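
A minimal sketch of such an aggregation follows, again with SQLite standing in for BigQuery. The rows reuse the sample scores shown earlier and treat them as `accuracy` values, which is an assumption; the table is reduced to the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics "
    "(model_key TEXT, benchmark TEXT, metric_name TEXT, metric_value REAL)"
)
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, 'accuracy', ?)",
    [
        ("foundation-sec", "CTI", 0.872),
        ("foundation-sec", "CTI", 0.865),
        ("foundation-sec", "SecQA", 0.878),
        ("orion-1b", "MMLU", 0.412),
        ("orion-1b", "MMLU", 0.398),
    ],
)

# One output row per model/benchmark pair, averaged over all ingested runs.
query = """
SELECT model_key, benchmark, COUNT(*) AS runs, AVG(metric_value) AS mean_accuracy
FROM metrics
WHERE metric_name = 'accuracy'
GROUP BY model_key, benchmark
ORDER BY model_key, benchmark
"""
averages = conn.execute(query).fetchall()
for row in averages:
    print(row)
```

Adding a configuration column such as num_shots or temperature to the GROUP BY list would split the averages by configuration instead of pooling them.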

Cross-Run Results (interactive sortable table; columns: Model, Benchmark, Accuracy, Num Shots, Temperature)

Team Workflows

For teams running evaluation in CI/CD pipelines, BigQuery ingestion can integrate directly into the workflow. Results flow from pipeline completion to queryable data to dashboards without manual export or upload steps.

Pre-Built Views and Queries

We include pre-built SQL views for common analysis patterns:

Leaderboard view: Latest primary metrics for each model/benchmark combination. Optimized for building rankings and dashboards that compare model performance across benchmarks.
Latest runs view: Most recent ingestion for each unique experiment run, including all metrics. Useful for detailed analysis and debugging specific evaluation runs.
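
The shipped view definitions are not reproduced in this post. As a rough sketch of what a "latest primary metric per model/benchmark" view could look like, the example below ranks rows by recency with `ROW_NUMBER` and keeps the newest one. SQLite stands in for BigQuery, the view name `leaderboard` is hypothetical, and the rows reuse the sample scores shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (model_key TEXT, benchmark TEXT, "
    "metric_value REAL, is_primary INTEGER, ingest_time TEXT)"
)
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, ?, ?)",
    [
        ("foundation-sec", "CTI", 0.865, 1, "2026-04-01"),
        ("foundation-sec", "CTI", 0.872, 1, "2026-04-02"),
        ("foundation-sec", "SecQA", 0.878, 1, "2026-04-01"),
        ("mistral-7b", "CyberMetric", 0.714, 1, "2026-04-02"),
    ],
)

# Rank primary metrics newest-first within each model/benchmark pair,
# then keep only the most recent row of each pair.
conn.execute(
    """
    CREATE VIEW leaderboard AS
    SELECT model_key, benchmark, metric_value, ingest_time
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY model_key, benchmark ORDER BY ingest_time DESC
        ) AS recency
        FROM metrics
        WHERE is_primary = 1
    )
    WHERE recency = 1
    """
)
board = conn.execute(
    "SELECT * FROM leaderboard ORDER BY benchmark, metric_value DESC"
).fetchall()
for row in board:
    print(row)
```

Because the view is defined over the base table, it reflects new ingestions without any refresh step.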

Visualization and Dashboards

FAITH Leaderboard (interactive table; columns: Model, CTI, CyberMetric, MMLU, SecQA, Mean Score). Compares evaluation results across security benchmarks, with each model's configuration parameters available on expansion.

Evaluation results can be visualized using any tool that connects to BigQuery, such as Looker Studio, Tableau, Grafana, or custom dashboards. Rather than emailing CSV exports or screenshots, you can share a live dashboard that updates automatically as new results are ingested. We've included a dashboard cookbook that covers building dashboards with Looker Studio.

How to Use It

Install FAITH with BigQuery dependencies:

bash

pip install "faith[bigquery]"

Configure your BigQuery dataset and run ingestion:

bash

export FAITH_BIGQUERY_PROJECT=my-project
export FAITH_BIGQUERY_DATASET=faith_results

faith summarize --experiment-path /path/to/results \
  --output-format bigquery

The table is created automatically on first ingestion. From there, you can query directly or create the pre-built views for common analysis patterns.
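
For querying from Python rather than the BigQuery console, a sketch along the following lines may help. It builds a fully qualified table ID from the two environment variables above and runs a query through the `google-cloud-bigquery` client. The table name `metrics` is an assumption taken from the `SELECT * FROM metrics` illustration earlier, and the helper names are hypothetical; check the Schema Reference for the actual table name.

```python
import os


def metrics_table_id(env=os.environ, table="metrics"):
    """Build a fully qualified table ID from the FAITH environment variables.

    The default table name "metrics" is an assumption, not confirmed by FAITH.
    """
    project = env["FAITH_BIGQUERY_PROJECT"]
    dataset = env["FAITH_BIGQUERY_DATASET"]
    return f"{project}.{dataset}.{table}"


def primary_scores_sql(table_id):
    """Return a query for all primary metrics, newest ingestion first."""
    return (
        "SELECT model_key, benchmark, metric_value, ingest_time "
        f"FROM `{table_id}` "
        "WHERE is_primary "
        "ORDER BY ingest_time DESC"
    )


def fetch_primary_scores():
    """Run the query; needs google-cloud-bigquery and valid credentials."""
    from google.cloud import bigquery

    client = bigquery.Client(project=os.environ["FAITH_BIGQUERY_PROJECT"])
    return list(client.query(primary_scores_sql(metrics_table_id())).result())


# Preview the generated SQL without contacting BigQuery.
example_env = {
    "FAITH_BIGQUERY_PROJECT": "my-project",
    "FAITH_BIGQUERY_DATASET": "faith_results",
}
sql = primary_scores_sql(metrics_table_id(example_env))
print(sql)
```

Only `fetch_primary_scores` touches the network, so the SQL can be inspected or unit-tested without credentials.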

Full documentation:

Quickstart Guide - Setup and basic usage
Schema Reference - Table structure and field definitions
Dashboard Cookbook - Building visualizations in Looker Studio
SQL Examples - Ready-to-run queries for common tasks

Why This Matters

Evaluation infrastructure matters as much as the benchmarks themselves. Treating metrics as structured data that is queryable, easily aggregated, and shareable is a prerequisite for evaluation at scale. BigQuery integration provides that infrastructure for teams whose evaluation data needs to grow with their evaluation workload.

FAITH is open source. If you're using it and have feedback on the BigQuery integration, we'd like to hear it. If you've built custom queries or views that generalize beyond your specific use case, consider contributing them back.