feat: avoid excessive low divergence iteration (#73)

* feat: adjust scoring to avoid useless iteration

Adjusts the scoring function so the optimizer stops targeting meaninglessly low KL divergences.
Below a threshold value, the KL divergence score is replaced by a score based on the refusal count.
Adds the config option kl_divergence_target (default: 0.01).

* fix: Clean up parameter selection in objective

Create variables for num_layers and last_layer_index.
This improves readability and makes the choices explicit.

* feat: Print the parameters of the selected model
Spiky Moth
2025-12-14 09:56:48 +01:00
committed by GitHub
parent 740aab61ba
commit 9d1734855d
4 changed files with 53 additions and 11 deletions
+4
@@ -49,6 +49,10 @@ residual_plot_style = "dark_background"
# This is used to ensure balanced co-optimization of KL divergence and refusal count.
kl_divergence_scale = 1.0
# The KL divergence to target. Below this value, an objective based on the refusal count is used.
# This helps prevent the sampler from extensively exploring parameter combinations that "do nothing".
kl_divergence_target = 0.01
# Number of abliteration trials to run during optimization.
n_trials = 200
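
The threshold behaviour described in the commit message and in the config comments could look roughly like the sketch below. It is not the code from this commit: the function name objective_scores, the base_refusals argument, and the exact scaling used below the threshold are assumptions made for illustration. Only kl_divergence_scale and kl_divergence_target come from the config shown above.

```python
def objective_scores(
    kl_divergence: float,
    refusals: int,
    base_refusals: int,
    kl_divergence_scale: float = 1.0,
    kl_divergence_target: float = 0.01,
) -> tuple[float, float]:
    """Return (kl_score, refusal_score) for one trial.

    Hypothetical sketch: at or above the target, the KL score is the scaled
    KL divergence. Below the target, it is replaced by a refusal-based value,
    so that pushing KL divergence even lower earns no further reward.
    """
    if base_refusals <= 0:
        raise ValueError("base_refusals must be positive")

    # Fraction of refusals remaining relative to the unmodified model (assumed).
    refusal_score = refusals / base_refusals

    if kl_divergence >= kl_divergence_target:
        kl_score = kl_divergence / kl_divergence_scale
    else:
        # Assumed scaling: keep the substitute score at or below the level the
        # KL score has at the target, so the switch does not penalize the trial.
        kl_score = refusal_score * kl_divergence_target / kl_divergence_scale

    return kl_score, refusal_score


if __name__ == "__main__":
    print(objective_scores(0.5, 10, 100))    # above the target
    print(objective_scores(0.001, 50, 100))  # below the target
```

With this shape, two trials that both fall below kl_divergence_target are ranked by how many refusals remain rather than by tiny differences in KL divergence, which is the behaviour the config comment describes as avoiding parameter combinations that "do nothing".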