fix: prevent div-by-zero in evaluator when base_refusals is 0 (#225)
* fix: prevent div-by-zero in evaluator when base_refusals is 0

  When the base model refuses none of the prompts, base_refusals is 0. Return refusals directly in that case so ablations that introduce new refusals are still penalized correctly.

* fix: cast refusals to float for type consistency

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@@ -110,7 +110,9 @@ class Evaluator:
         kl_divergence_scale = self.settings.kl_divergence_scale
         kl_divergence_target = self.settings.kl_divergence_target
 
-        refusals_score = refusals / self.base_refusals
+        refusals_score = (
+            refusals / self.base_refusals if self.base_refusals > 0 else float(refusals)
+        )
 
         if kl_divergence >= kl_divergence_target:
             kld_score = kl_divergence / kl_divergence_scale
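
The guarded expression can be tried in isolation. The following is a minimal sketch, not the project's code: `compute_refusals_score` is a hypothetical standalone helper that takes `refusals` and `base_refusals` as plain integers instead of reading `self.base_refusals` from the `Evaluator`.

```python
def compute_refusals_score(refusals: int, base_refusals: int) -> float:
    """Hypothetical helper mirroring the guarded expression from the diff.

    Normalizes by the baseline refusal count when it is positive.
    When the baseline is 0, returns the raw count as a float so that
    any newly introduced refusals still raise the score.
    """
    if base_refusals > 0:
        return refusals / base_refusals
    return float(refusals)


if __name__ == "__main__":
    # Normal case: the score is the fraction of baseline refusals remaining.
    print(compute_refusals_score(5, 10))
    # Baseline of 0: no ZeroDivisionError, and new refusals are still counted.
    print(compute_refusals_score(3, 0))
    # Baseline of 0 and no new refusals: the score stays at zero.
    print(compute_refusals_score(0, 0))
```

Both branches return a `float`, which matches the intent of the second fix in the commit message: true division already yields a float in Python 3, so the fallback is cast to keep the result type consistent.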