BNNR

Model Analysis

Run zero-friction diagnostics on a trained model without running full BNNR training. bnnr analyze produces metrics, XAI insights, data quality checks, failure patterns, and actionable recommendations — all in a single command.

New in v0.3.0. Supports classification and multilabel tasks.


Quick Start

# 1. Train a model (or use an existing checkpoint)
python3 -m bnnr train --dataset cifar10 -o out_cifar -e 3
 
# 2. Run analysis
python3 -m bnnr analyze \
  --model out_cifar/checkpoints/best.pt \
  --data cifar10 \
  --output out_analyze
 
# 3. Open the HTML report (xdg-open is Linux-specific; on macOS use "open" instead)
xdg-open out_analyze/report.html

The report includes: accuracy/F1, per-class diagnostics, confusion matrix, XAI quality score, failure patterns, and actionable recommendations.


How It Works

bnnr analyze runs a 6-step pipeline on your trained model:

  1. Evaluation — forward pass on the validation set to compute metrics (accuracy, F1, precision, recall, Cohen's kappa), per-class accuracy, and confusion matrix.
  2. XAI (optional) — saliency maps and rich analysis (focus ratio, edge ratio, quality score, class diagnoses) on a configurable probe set.
  3. Data quality (optional) — duplicate detection and image quality checks on the validation data.
  4. Failure analysis — per-sample predictions ranked by loss/confidence; top-N worst predictions with optional XAI overlays.
  5. Failure patterns — automated detection of confused class pairs, classes with low XAI quality, zero/near-zero recall, class collapse, dominant bias, and calibration issues.
  6. Recommendations — prioritized, literature-backed improvement hints linked to detected findings (e.g. "add data for class X", "consider ICD", "review data labeling").
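
Steps 4 and 5 can be illustrated with a short, self-contained sketch. This is not BNNR's implementation: the function names and toy data are hypothetical, and the logic only mirrors the description above (rank samples by loss, count confused class pairs, flag classes with zero recall).

```python
from collections import Counter

def worst_predictions(losses, max_worst=20):
    """Return sample indices ordered by descending loss, truncated to max_worst."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:max_worst]

def confused_pairs(y_true, y_pred, top_k=3):
    """Count (true, predicted) pairs over misclassified samples, most frequent first."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(top_k)

def zero_recall_classes(y_true, y_pred):
    """Classes present in the ground truth that are never predicted correctly."""
    hit = {t for t, p in zip(y_true, y_pred) if t == p}
    return sorted(set(y_true) - hit)

# Toy stand-in for per-sample results cached during evaluation.
y_true = ["cat", "cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "dog", "cat", "dog"]
losses = [0.1, 1.9, 1.2, 0.2, 0.05, 2.4, 1.1]

print(worst_predictions(losses, max_worst=2))    # [5, 1]
print(confused_pairs(y_true, y_pred, top_k=1))   # [(('cat', 'dog'), 2)]
print(zero_recall_classes(y_true, y_pred))       # ['bird']
```

In this toy data the cat/dog confusion dominates and bird has zero recall, which are the kinds of findings the recommendations step would then link to.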

CLI Reference

python3 -m bnnr analyze --model PATH --data PATH_OR_DATASET --output DIR [OPTIONS]

Required Arguments

| Argument | Description |
| --- | --- |
| --model, -m | Path to a saved model checkpoint (.pt). Supports BNNR checkpoints (with model_state or model key) or raw state_dict. |
| --data, -d | Either a directory path (ImageFolder-style: class1/, class2/, …) or a built-in dataset name: mnist, fashion_mnist, cifar10, stl10. |
| --output, -o | Directory where analysis_report.json and report.html are written. |

Options

| Option | Default | Description |
| --- | --- | --- |
| --task, -t | classification | Task type: classification or multilabel. Detection is not supported by analyze. |
| --config, -c | None | Optional BNNR config YAML (for device, metrics, num_classes, etc.). |
| --max-worst | 20 | Number of worst predictions to include in the report. |
| --no-xai | false | Disable XAI analysis (faster run). |
| --no-data-quality | false | Disable data quality checks. |
| --device | auto | Device: cuda, cpu, or auto. |
| --batch-size | 64 | Batch size for evaluation. |
| --cv-folds | 0 | Number of folds for lightweight cross-validation on cached predictions (0 = disabled). |
| --xai-samples | 500 | Number of samples for XAI probe set. More = more accurate, slower. |
| --summary/--no-summary | enabled | Print executive summary, key findings, and top actions to stdout. |

Behavior Notes

  • The CLI builds a pipeline (dataset + adapter) from --data and loads the checkpoint into the adapter.
  • For an ImageFolder dataset, pass the validation root as --data /path/to/val_root. Supply --config when the defaults (for example device or num_classes) are not compatible with your checkpoint.
  • XAI requires an adapter that implements XAICapableModel (e.g. SimpleTorchAdapter with target_layers).
  • --cv-folds is a lightweight estimate of metric variability: one inference pass, then k-fold metrics on cached predictions (no retraining).
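
The idea behind --cv-folds can be sketched in a few lines: predictions are computed once, then split into k folds, and the metric is recomputed per fold to estimate its spread. The code below is an illustration under that reading, not BNNR's code; kfold_accuracy and the toy predictions are hypothetical.

```python
import random
from statistics import mean, stdev

def kfold_accuracy(y_true, y_pred, k=5, seed=0):
    """Split cached predictions into k disjoint folds and return per-fold accuracy.

    No model is retrained: each fold only re-scores predictions that already exist.
    Requires k <= len(y_true) so that no fold is empty.
    """
    indices = list(range(len(y_true)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    return [
        sum(y_true[i] == y_pred[i] for i in fold) / len(fold)
        for fold in folds
    ]

# Stand-in for predictions cached from a single inference pass:
# every fifth sample is deliberately misclassified (overall accuracy 0.8).
y_true = [i % 3 for i in range(30)]
y_pred = [t if i % 5 else (t + 1) % 3 for i, t in enumerate(y_true)]

scores = kfold_accuracy(y_true, y_pred, k=3)
print(f"accuracy: {mean(scores):.3f} +/- {stdev(scores):.3f}")
```

Because the folds share one set of predictions, the standard deviation reflects sampling variability in the validation set only, not variability from retraining.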

Python API

analyze_model

from bnnr import analyze_model, SimpleTorchAdapter
import torch
 
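# my_model (a trained torch.nn.Module) and val_loader (a DataLoader over the
# validation set) are assumed to be defined elsewhere.
# Build your adapter with target_layers for XAI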
adapter = SimpleTorchAdapter(
    model=my_model,
    criterion=torch.nn.CrossEntropyLoss(),
    optimizer=torch.optim.Adam(my_model.parameters()),
    target_layers=[my_model.features[-1]],
    device="auto",
)
 
# Run analysis
report = analyze_model(
    adapter,
    val_loader,
    config=None,
    task="classification",
    output_dir="./analysis_out",
    run_data_quality=True,
    max_worst=20,
    xai_enabled=True,
    xai_method="opticam",
    xai_samples=500,
    cv_folds=3,
    data_quality_max_samples=5000,
)
 
# Access results
print(report.metrics)           # {"accuracy": 0.92, "f1_macro": 0.91, ...}
print(report.executive_summary) # health score, key findings, top actions
print(report.findings)          # structured root-cause findings
print(report.recommendations)   # text recommendations
 
# Save outputs
report.save("./analysis_out")
report.to_html("./analysis_out/report.html")

AnalysisReport Attributes

Core fields:

| Attribute | Type | Description |
| --- | --- | --- |
| metrics | dict[str, float] | Global metrics: accuracy, F1, precision, recall, loss, Cohen's kappa, ECE. |
| per_class_accuracy | dict[str, dict] | Per-class accuracy and support. |
| confusion | dict | Confusion matrix data. |
| xai_insights | dict | XAI summary insights per class. |
| xai_diagnoses | dict | Per-class XAI diagnostic details. |
| xai_quality_summary | dict | Overall XAI quality (mean score, distribution). |
| data_quality_result | dict | Data quality analysis results. |
| failure_patterns | list[dict] | Detected failure patterns (confused pairs, low XAI, etc.). |
| recommendations | list[str] | Text improvement recommendations. |

Extended fields (v0.2+):

| Attribute | Type | Description |
| --- | --- | --- |
| schema_version | str | Report schema version (currently "0.2.1"). |
| executive_summary | dict | Health badge/score, key findings, top actions, critical classes. |
| findings | list[dict] | Structured findings with type, evidence, interpretation, severity. |
| recommendations_structured | list[dict] | Prioritized recommendations linked to findings. |
| class_diagnostics | list[dict] | Per-class precision/recall/F1/support/severity. |
| true_distribution | dict[str, int] | Ground truth class distribution. |
| pred_distribution | dict[str, int] | Predicted class distribution. |
| distribution_summary | dict | Over/under-predicted classes, collapse hints. |
| failure_patterns_extended | list[dict] | Enriched failure taxonomy with evidence. |
| cv_results | dict | Cross-validation results (per-fold metrics, mean/std). |
| calibration_summary | dict | ECE and calibration bin statistics. |
| confusion_pair_xai | list[dict] | XAI analysis for top confused class pairs. |
| best_worst_examples | dict | Best/worst examples per class with overlay paths. |
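
To make the relationship between true_distribution, pred_distribution, and distribution_summary concrete, here is a minimal sketch of how over- and under-predicted classes could be derived from the two count dictionaries. The 25% tolerance, the function name, and the output keys are assumptions made for illustration; the actual rules and schema used by BNNR may differ.

```python
def distribution_gaps(true_distribution, pred_distribution, tolerance=0.25):
    """Compare predicted and true class counts.

    A class is flagged when its predicted count deviates from its true count
    by more than `tolerance` (relative), or when it is never predicted at all.
    """
    over, under, never = [], [], []
    for cls, n_true in true_distribution.items():
        n_pred = pred_distribution.get(cls, 0)
        if n_pred == 0 and n_true > 0:
            never.append(cls)      # possible class collapse
        elif n_pred > n_true * (1 + tolerance):
            over.append(cls)
        elif n_pred < n_true * (1 - tolerance):
            under.append(cls)
    return {"over_predicted": over, "under_predicted": under, "never_predicted": never}

# Toy counts shaped like dict[str, int], as in the table above.
true_distribution = {"cat": 100, "dog": 100, "bird": 100, "fish": 100}
pred_distribution = {"cat": 250, "dog": 110, "bird": 40, "fish": 0}

gaps = distribution_gaps(true_distribution, pred_distribution)
print(gaps)
# {'over_predicted': ['cat'], 'under_predicted': ['bird'], 'never_predicted': ['fish']}
```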

Methods:

  • save(output_dir) — writes analysis_report.json and artifact directories.
  • to_html(path) — writes a self-contained HTML report with dark/light theme.
  • failure_patterns_list() — returns the list of detected failure patterns.

Output Files

After running bnnr analyze --output ./out:

out/
├── analysis_report.json    # Full structured report (all fields above)
├── report.html             # Self-contained HTML report (dark/light theme, interactive)
└── artifacts/              # Optional (when XAI enabled)
    ├── xai_examples/       # Per-class XAI overlay samples
    ├── confusion_pairs/    # XAI overlays for confused class pairs
    ├── class_examples/     # Best/worst examples per class
    └── data_quality/       # Data quality thumbnails and diagnostics
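
Since analysis_report.json is plain JSON, it can be post-processed without importing BNNR. The sketch below assumes the top-level JSON keys mirror the AnalysisReport attribute names (metrics, failure_patterns, recommendations, schema_version); that mapping is an assumption, so every lookup uses a default. The demo writes a tiny stand-in file with made-up values rather than a real report.

```python
import json
import tempfile
from pathlib import Path

def summarize_report(report_path):
    """Load analysis_report.json and pull out a few headline fields.

    Key names are assumed to match the AnalysisReport attributes;
    missing keys fall back to None or an empty list.
    """
    data = json.loads(Path(report_path).read_text(encoding="utf-8"))
    return {
        "schema_version": data.get("schema_version"),
        "accuracy": data.get("metrics", {}).get("accuracy"),
        "n_failure_patterns": len(data.get("failure_patterns", [])),
        "recommendations": data.get("recommendations", []),
    }

# Stand-in file for the demo; a real one comes from `bnnr analyze --output ./out`.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "analysis_report.json"
    path.write_text(json.dumps({
        "schema_version": "0.2.1",
        "metrics": {"accuracy": 0.92},
        "failure_patterns": [{"type": "confused_pair"}],  # illustrative entry only
        "recommendations": ["add data for class X"],
    }), encoding="utf-8")
    summary = summarize_report(path)

print(summary)
```

A helper like this is handy in CI, for example to fail a job when accuracy drops below a threshold.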

Supported Datasets

| Dataset | --data value | Resolution | Classes |
| --- | --- | --- | --- |
| MNIST | mnist | 28×28 grayscale | 10 |
| Fashion-MNIST | fashion_mnist | 28×28 grayscale | 10 |
| CIFAR-10 | cifar10 | 32×32 RGB | 10 |
| STL-10 | stl10 | 96×96 RGB | 10 |
| Custom | /path/to/dir | Any (ImageFolder layout) | Auto-detected |

For custom ImageFolder datasets, organize images as:

val_root/
├── class_a/
│   ├── img001.jpg
│   └── ...
├── class_b/
│   └── ...
└── ...
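
Before pointing --data at a custom directory, it can help to confirm that the layout matches the tree above. The following sketch uses only the standard library; the helper name and the set of accepted file extensions are assumptions, not something BNNR defines.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust to match your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}

def count_images_per_class(root):
    """Return {class_name: image_count} for each subdirectory of root."""
    counts = {}
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1 for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts

# Build a throwaway ImageFolder-style tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for cls, n in [("class_a", 2), ("class_b", 1)]:
        (Path(tmp) / cls).mkdir()
        for i in range(n):
            (Path(tmp) / cls / f"img{i:03d}.jpg").touch()
    counts = count_images_per_class(tmp)

print(counts)  # {'class_a': 2, 'class_b': 1}
```

An empty result or a class with zero images usually means the path points one level too high or too low in the directory tree.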

Examples

Analyze after BNNR training

python3 -m bnnr analyze \
  --model reports/run_20260501_120000/checkpoints/best.pt \
  --data mnist \
  --output ./analysis_mnist

Custom ImageFolder with config

python3 -m bnnr analyze \
  --model ./my_model.pt \
  --data /path/to/validation_images \
  --output ./analysis_custom \
  --config my_config.yaml

Fast mode (no XAI, no data quality)

python3 -m bnnr analyze \
  --model best.pt \
  --data cifar10 \
  --output ./out \
  --no-xai \
  --no-data-quality

With cross-validation

python3 -m bnnr analyze \
  --model best.pt \
  --data stl10 \
  --output ./out \
  --cv-folds 5

Multilabel task

python3 -m bnnr analyze \
  --model multilabel_best.pt \
  --data /path/to/multilabel_val \
  --output ./out_ml \
  --task multilabel
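
The commands above can also be assembled from Python, which is convenient when analyzing several checkpoints in a loop. This sketch uses only flags listed in the CLI reference; build_analyze_command is a hypothetical helper, not part of BNNR, and the actual subprocess call is left commented out so the snippet runs without BNNR installed.

```python
import shlex
import sys

def build_analyze_command(model, data, output, *, task="classification",
                          config=None, no_xai=False, no_data_quality=False,
                          cv_folds=0, xai_samples=None):
    """Assemble the argument list for `python -m bnnr analyze`."""
    cmd = [sys.executable, "-m", "bnnr", "analyze",
           "--model", str(model), "--data", str(data), "--output", str(output)]
    if task != "classification":      # classification is the documented default
        cmd += ["--task", task]
    if config is not None:
        cmd += ["--config", str(config)]
    if no_xai:
        cmd.append("--no-xai")
    if no_data_quality:
        cmd.append("--no-data-quality")
    if cv_folds:                      # 0 means cross-validation is disabled
        cmd += ["--cv-folds", str(cv_folds)]
    if xai_samples is not None:
        cmd += ["--xai-samples", str(xai_samples)]
    return cmd

cmd = build_analyze_command("best.pt", "cifar10", "./out", no_xai=True, cv_folds=5)
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.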

Metric Definitions

| Metric | Description |
| --- | --- |
| accuracy | Overall classification accuracy. |
| f1_macro | Macro-averaged F1 score across all classes. |
| precision_macro | Macro-averaged precision. |
| recall_macro | Macro-averaged recall. |
| cohen_kappa | Agreement beyond chance (chance-corrected), scalar in [-1, 1]. |
| ECE (top-1) | Expected calibration error on top-1 confidence bins. |
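
For reference, the standard textbook definitions of three of these metrics fit in a few lines of plain Python. This is a sketch of the usual formulas, not BNNR's implementation; in particular the number of ECE bins (10, equal width) is an assumption.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def cohen_kappa(y_true, y_pred):
    """kappa = (p_o - p_e) / (1 - p_e): observed vs. chance agreement."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n)
              for c in set(y_true) | set(y_pred))
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        bins[min(int(c * n_bins), n_bins - 1)].append(i)  # confidence 1.0 goes to the last bin
    ece = 0.0
    for idx in bins:
        if idx:
            acc = sum(correct[i] for i in idx) / len(idx)
            conf = sum(confidences[i] for i in idx) / len(idx)
            ece += len(idx) / n * abs(acc - conf)
    return ece

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 1]
confidences = [0.95, 0.95, 0.85, 0.65, 0.85, 0.65]   # top-1 confidence per sample
correct = [t == p for t, p in zip(y_true, y_pred)]

print(round(macro_f1(y_true, y_pred), 3))                          # 0.667
print(round(cohen_kappa(y_true, y_pred), 3))                       # 0.5
print(round(expected_calibration_error(confidences, correct), 3))  # 0.283
```

In the toy data, accuracy is 4/6 while chance agreement is 1/3, giving kappa = 0.5; the two wrong predictions at confidence 0.65 account for most of the calibration error.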

Limitations (current code)

  • Detection: Not supported by bnnr analyze or analyze_model; supported tasks are classification and multilabel only.
  • Compare: compare_runs compares training report.json files; the CLI has no built-in side-by-side comparison of two analyze HTML reports.
  • Events: Analyze does not emit events to events.jsonl; it produces standalone artifacts only.
  • ROC/PR curves: Analyze focuses on point metrics and diagnostics; ROC-AUC / PR curves are not rendered in report.html.
  • Advanced concept XAI (e.g. CRAFT/NMF): analyze_model uses saliency/CAM-style methods (xai_method, default opticam); CRAFT/NMF are available in training/XAI modules but not wired into the analyze pipeline.

See Also