From 12ecf5003395220fff1e97c2408987d0c6c92788 Mon Sep 17 00:00:00 2001
From: Philipp Emanuel Weidmann
Date: Sun, 16 Nov 2025 15:19:27 +0530
Subject: [PATCH] Add README

---
 README.md      | 141 ++++++++++++++++++++++++++++++++++++++++++++++++-
 pyproject.toml |   2 +-
 2 files changed, 140 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index e1d1655..126cb83 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,143 @@
-# Heretic
+# Heretic: Fully automatic censorship removal for language models
 
-TBD
+Heretic is a tool that removes censorship (aka "safety alignment") from
+transformer-based language models without expensive post-training.
+It combines an advanced implementation of directional ablation, also known
+as "abliteration" ([Arditi et al. 2024](https://arxiv.org/abs/2406.11717)),
+with a TPE-based parameter optimizer powered by [Optuna](https://optuna.org/).
+
+This approach enables Heretic to work **completely automatically.** Heretic
+finds high-quality abliteration parameters by co-minimizing the number of
+refusals and the KL divergence from the original model. This results in a
+decensored model that retains as much of the original model's intelligence
+as possible. Using Heretic does not require an understanding of transformer
+internals. In fact, anyone who knows how to run a command-line program
+can use Heretic to decensor language models.
+
+[Image: Screenshot]
+
+
+
+Running unsupervised with the default configuration, Heretic can produce
+decensored models that rival the quality of abliterations created manually
+by human experts:
+
+| Model | Refusals for "harmful" prompts | KL divergence from original model for "harmless" prompts |
+| :--- | ---: | ---: |
+| [google/gemma-3-12b-it](https://huggingface.co/google/gemma-3-12b-it) (original) | 97/100 | 0 *(by definition)* |
+| [mlabonne/gemma-3-12b-it-abliterated-v2](https://huggingface.co/mlabonne/gemma-3-12b-it-abliterated-v2) | 3/100 | 1.04 |
+| [huihui-ai/gemma-3-12b-it-abliterated](https://huggingface.co/huihui-ai/gemma-3-12b-it-abliterated) | 3/100 | 0.45 |
+| **[p-e-w/gemma-3-12b-it-heretic](https://huggingface.co/p-e-w/gemma-3-12b-it-heretic) (ours)** | **3/100** | **0.16** |
+
+The Heretic version, generated without any human effort, achieves the same
+level of refusal suppression as other abliterations, but at a much lower
+KL divergence, indicating less damage to the original model's capabilities.
+*(You can reproduce those numbers using Heretic's built-in evaluation functionality,
+e.g. `heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic`.
+Note that the exact values might be platform- and hardware-dependent.
+The table above was compiled using PyTorch 2.8 on an RTX 5090.)*
+
+Heretic supports most dense models, including many multimodal models, and
+several different MoE architectures. It does not yet support SSMs/hybrid models,
+models with inhomogeneous layers, and certain novel attention systems.
+
+You can find a collection of models that have been decensored using Heretic
+[on Hugging Face](https://huggingface.co/collections/p-e-w/the-bestiary).
+
+
+## Usage
+
+Prepare a Python 3.10+ environment with PyTorch 2.2+ installed as appropriate
+for your hardware. Then run:
+
+```
+pip install heretic
+heretic Qwen/Qwen3-4B-Instruct-2507
+```
+
+Replace `Qwen/Qwen3-4B-Instruct-2507` with whatever model you want to decensor.
+
+The process is fully automatic and does not require configuration; however,
+Heretic has a variety of configuration parameters that can be changed for
+greater control. Run `heretic --help` to see available command-line options,
+or look at [`config.default.toml`](config.default.toml) if you prefer to use
+a configuration file.
+
+At the start of a program run, Heretic benchmarks the system to determine
+the optimal batch size to make the most of the available hardware.
+On an RTX 3090, with the default configuration, decensoring Llama-3.1-8B
+takes about 45 minutes.
+
+After Heretic has finished decensoring a model, you are given the option to
+save the model, upload it to Hugging Face, chat with it to test how well it works,
+or any combination of those actions.
+
+
+## How it works
+
+Heretic implements a parametrized variant of directional ablation. For each
+supported transformer component (currently, attention out-projection and
+MLP down-projection), it identifies the associated matrices in each transformer
+layer, and orthogonalizes them with respect to the relevant "refusal direction",
+inhibiting the expression of that direction in the result of multiplications
+with that matrix.
+
+Refusal directions are computed for each layer as a difference-of-means between
+the first-token residuals for "harmful" and "harmless" example prompts.
+
+The ablation process is controlled by several optimizable parameters:
+
+* `direction_index`: Either the index of a refusal direction, or the special
+  value `per layer`, indicating that each layer should be ablated using the
+  refusal direction associated with that layer.
+* `max_weight`, `max_weight_position`, `min_weight`, and `min_weight_distance`:
+  For each component, these parameters describe the shape and position of the
+  ablation weight kernel over the layers. The following diagram illustrates this:
+
+[Diagram: Explanation]
+
+
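As a rough illustration of the ablation described above, the following plain-Python sketch computes a difference-of-means direction and orthogonalizes a small matrix against it. It is not Heretic's code: the function names `mean`, `refusal_direction`, `ablate`, and `matvec` are hypothetical, the toy residuals are made up, and the assumption that a scalar `weight` linearly scales the removed projection is mine, since the text above does not spell out how the ablation weight is applied.

```python
# Illustrative sketch only; all names and numbers here are hypothetical.

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refusal_direction(harmful_residuals, harmless_residuals):
    """Unit-length difference of means between two sets of residual vectors."""
    diff = [a - b for a, b in zip(mean(harmful_residuals), mean(harmless_residuals))]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]


def ablate(matrix, direction, weight=1.0):
    """Return matrix - weight * r r^T matrix for a unit vector r.

    `matrix` is a list of rows that maps an input vector into the residual
    space, so each row index corresponds to one residual dimension.
    ASSUMPTION: `weight` scales the removed projection linearly. With
    weight == 1.0 the result writes nothing along `direction`.
    """
    rows, cols = len(matrix), len(matrix[0])
    # r^T @ matrix: how strongly each input column writes along the direction.
    projection = [sum(direction[i] * matrix[i][j] for i in range(rows)) for j in range(cols)]
    return [
        [matrix[i][j] - weight * direction[i] * projection[j] for j in range(cols)]
        for i in range(rows)
    ]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Toy first-token residuals (3 dimensions) for two prompts of each kind.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = refusal_direction(harmful, harmless)

# A toy projection matrix that maps 2 inputs into the 3-dimensional residual space.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_ablated = ablate(W, r, weight=1.0)

# After full ablation, any output of the matrix is orthogonal to the direction.
output = matvec(W_ablated, [0.5, -1.5])
component = sum(a * b for a, b in zip(r, output))
```

With `weight=1.0` this is ordinary orthogonalization. A smaller weight removes only part of the projection, which is one plausible reading of how a per-layer ablation weight could soften the intervention.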
+
+Heretic's main innovations over existing abliteration systems are:
+
+* The shape of the ablation weight kernel is highly flexible, which, combined with
+  automatic parameter optimization, can improve the compliance/quality tradeoff.
+  Non-constant ablation weights were previously explored by Maxime Labonne in
+  [gemma-3-12b-it-abliterated-v2](https://huggingface.co/mlabonne/gemma-3-12b-it-abliterated-v2).
+* The refusal direction index is a float rather than an integer. For non-integral
+  values, the two nearest refusal direction vectors are linearly interpolated.
+  This unlocks a vast space of additional directions beyond the ones identified
+  by the difference-of-means computation, and often enables the optimization
+  process to find a better direction than that belonging to any individual layer.
+* Ablation parameters are chosen separately for each component. I have found that
+  MLP interventions tend to be more damaging to the model than attention interventions,
+  so using different ablation weights can squeeze out some extra performance.
+
+
+## Prior art
+
+I'm aware of the following publicly available implementations of abliteration
+techniques:
+
+* [AutoAbliteration](https://huggingface.co/posts/mlabonne/714992455492422)
+* [abliterator.py](https://github.com/FailSpy/abliterator)
+* [wassname's Abliterator](https://github.com/wassname/abliterator)
+* [ErisForge](https://github.com/Tsadoq/ErisForge)
+* [Removing refusals with HF Transformers](https://github.com/Sumandora/remove-refusals-with-transformers)
+* [deccp](https://github.com/AUGMXNT/deccp)
+
+Note that Heretic was written from scratch, and does not reuse code from
+any of those projects.
+
+
+## Acknowledgments
+
+The development of Heretic was informed by:
+
+* [The original abliteration paper (Arditi et al. 2024)](https://arxiv.org/abs/2406.11717)
+* [Maxime Labonne's article on abliteration](https://huggingface.co/blog/mlabonne/abliteration),
+  as well as some details from the model cards of his own abliterated models (see above)
+* [Jim Lai's article describing "projected abliteration"](https://huggingface.co/blog/grimjim/projected-abliteration)
 
 
 ## License
diff --git a/pyproject.toml b/pyproject.toml
index 1159408..7228068 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,7 +1,7 @@
 [project]
 name = "heretic"
 version = "1.0.0"
-description = "Fully automatic decensoring for transformer language models"
+description = "Fully automatic censorship removal for language models"
 readme = "README.md"
 license = "AGPL-3.0-or-later"
 authors = [