fix: prevent div-by-zero in evaluator when base_refusals is 0 (#225)

* fix: prevent div-by-zero in evaluator when base_refusals is 0

When a model refuses none of the prompts from the start, base_refusals is 0.
Return refusals directly in that case so ablations that introduce new
refusals are still penalized correctly.

* fix: cast refusals to float for type consistency

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
cpagac
2026-03-13 00:51:23 -05:00
committed by GitHub
parent e26da5e0e6
commit 515a7b9eb5
+3 -1
@@ -110,7 +110,9 @@ class Evaluator:
         kl_divergence_scale = self.settings.kl_divergence_scale
         kl_divergence_target = self.settings.kl_divergence_target
 
-        refusals_score = refusals / self.base_refusals
+        refusals_score = (
+            refusals / self.base_refusals if self.base_refusals > 0 else float(refusals)
+        )
 
         if kl_divergence >= kl_divergence_target:
             kld_score = kl_divergence / kl_divergence_scale
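
The guarded expression in the hunk above can be exercised in isolation. The following is a minimal sketch, not the project's code: the free function `refusals_score` and its parameters are hypothetical stand-ins for the expression inside `Evaluator`, assuming both counts are non-negative integers.

```python
def refusals_score(refusals: int, base_refusals: int) -> float:
    """Hypothetical standalone version of the guarded expression in the diff."""
    if base_refusals > 0:
        # Normal case: refusals relative to the unmodified model's count.
        return refusals / base_refusals
    # base_refusals == 0: no baseline to normalize by, so return the raw
    # count as a float. Any newly introduced refusal still raises the score.
    return float(refusals)


if __name__ == "__main__":
    print(refusals_score(5, 10))  # ratio against a non-zero baseline
    print(refusals_score(3, 0))   # zero baseline, raw count returned
    print(refusals_score(0, 0))   # zero baseline, no new refusals
```

Returning the raw count keeps the score at 0.0 when nothing changed and makes it grow with each new refusal, which is the penalty behavior the commit message describes. The `float(...)` cast keeps the return type the same on both branches.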