Add README

2025-11-16 15:19:27 +05:30
parent ea699dce46
commit 12ecf50033
2 changed files with 140 additions and 3 deletions
@@ -1,6 +1,143 @@
-# Heretic
+# Heretic: Fully automatic censorship removal for language models

-TBD
+Heretic is a tool that removes censorship (aka "safety alignment") from
+transformer-based language models without expensive post-training.
+It combines an advanced implementation of directional ablation, also known
+as "abliteration" ([Arditi et al. 2024](https://arxiv.org/abs/2406.11717)),
+with a TPE-based parameter optimizer powered by [Optuna](https://optuna.org/).
+
+This approach enables Heretic to work **completely automatically.** Heretic
+finds high-quality abliteration parameters by co-minimizing the number of
+refusals and the KL divergence from the original model. This results in a
+decensored model that retains as much of the original model's intelligence
+as possible. Using Heretic does not require an understanding of transformer
+internals. In fact, anyone who knows how to run a command-line program
+can use Heretic to decensor language models.
+
+<img width="650" height="715" alt="Screenshot" src="https://github.com/user-attachments/assets/d71a5efa-d6be-4705-a817-63332afb2d15" />
+
+<br>
+
+Running unsupervised with the default configuration, Heretic can produce
+decensored models that rival the quality of abliterations created manually
+by human experts:
+
+| Model | Refusals for "harmful" prompts | KL divergence from original model for "harmless" prompts |
+| :--- | ---: | ---: |
+| [google/gemma-3-12b-it](https://huggingface.co/google/gemma-3-12b-it) (original) | 97/100 | 0 *(by definition)* |
+| [mlabonne/gemma-3-12b-it-abliterated-v2](https://huggingface.co/mlabonne/gemma-3-12b-it-abliterated-v2) | 3/100 | 1.04 |
+| [huihui-ai/gemma-3-12b-it-abliterated](https://huggingface.co/huihui-ai/gemma-3-12b-it-abliterated) | 3/100 | 0.45 |
+| **[p-e-w/gemma-3-12b-it-heretic](https://huggingface.co/p-e-w/gemma-3-12b-it-heretic) (ours)** | **3/100** | **0.16** |
+
+The Heretic version, generated without any human effort, achieves the same
+level of refusal suppression as other abliterations, but at a much lower
+KL divergence, indicating less damage to the original model's capabilities.
+*(You can reproduce those numbers using Heretic's built-in evaluation functionality,
+e.g. `heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic`.
+Note that the exact values might be platform- and hardware-dependent.
+The table above was compiled using PyTorch 2.8 on an RTX 5090.)*
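
For readers unfamiliar with the metric in the table, here is a minimal sketch of KL divergence between two next-token distributions. The helper name `kl_divergence`, the toy probabilities, and the argument order (original model first) are assumptions made for illustration; Heretic's own evaluation code may compute the metric differently.

```python
import math


def kl_divergence(p, q):
    """Return KL(P || Q) in nats for two discrete distributions.

    Both arguments are lists of probabilities over the same tokens.
    Terms with p_i == 0 contribute nothing and are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token distributions (made-up numbers, not taken from any model).
original = [0.7, 0.2, 0.1]
modified = [0.6, 0.3, 0.1]

print(kl_divergence(original, original))  # identical distributions give zero
print(kl_divergence(original, modified))  # a small positive value
```

A value of zero means the two distributions are identical, which is why the original model's row reads 0 by definition.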
+
+Heretic supports most dense models, including many multimodal models, and
+several different MoE architectures. It does not yet support SSMs/hybrid models,
+models with inhomogeneous layers, or certain novel attention systems.
+
+You can find a collection of models that have been decensored using Heretic
+[on Hugging Face](https://huggingface.co/collections/p-e-w/the-bestiary).
+
+
+## Usage
+
+Prepare a Python 3.10+ environment with PyTorch 2.2+ installed as appropriate
+for your hardware. Then run:
+
+```
+pip install heretic
+heretic Qwen/Qwen3-4B-Instruct-2507
+```
+
+Replace `Qwen/Qwen3-4B-Instruct-2507` with whatever model you want to decensor.
+
+The process is fully automatic and does not require configuration; however,
+Heretic has a variety of configuration parameters that can be changed for
+greater control. Run `heretic --help` to see available command-line options,
+or look at [`config.default.toml`](config.default.toml) if you prefer to use
+a configuration file.
+
+At the start of a program run, Heretic benchmarks the system to determine
+the optimal batch size to make the most of the available hardware.
+On an RTX 3090, with the default configuration, decensoring Llama-3.1-8B
+takes about 45 minutes.
+
+After Heretic has finished decensoring a model, you can save the model,
+upload it to Hugging Face, chat with it to test how well it works,
+or do any combination of these.
+
+
+## How it works
+
+Heretic implements a parametrized variant of directional ablation. For each
+supported transformer component (currently, attention out-projection and
+MLP down-projection), it identifies the associated matrices in each transformer
+layer, and orthogonalizes them with respect to the relevant "refusal direction",
+inhibiting the expression of that direction in the result of multiplications
+with that matrix.
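
The orthogonalization described above can be sketched in plain Python. This is an illustration of the general technique, not Heretic's actual code: the function names, the `weight` argument, and the toy matrix are all assumptions. It assumes the matrix writes into the residual stream, so the direction has the same length as the matrix's output.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ablate(matrix, direction, weight=1.0):
    """Return W - weight * r r^T W, where r is the unit-length direction.

    `matrix` is a list of rows with shape (d_out, d_in) and `direction`
    has length d_out. With weight=1.0 the outputs of the result have no
    component along r (hypothetical helper, for illustration only).
    """
    r = normalize(direction)
    d_out, d_in = len(matrix), len(matrix[0])
    # r^T W: how strongly each input column writes along the direction.
    proj = [sum(r[i] * matrix[i][j] for i in range(d_out)) for j in range(d_in)]
    return [
        [matrix[i][j] - weight * r[i] * proj[j] for j in range(d_in)]
        for i in range(d_out)
    ]


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in matrix]


# Toy example: a 3x2 matrix and a made-up direction.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_direction = [1.0, 0.0, 1.0]

W_ablated = ablate(W, refusal_direction)
y = matvec(W_ablated, [0.5, -1.5])

# Component of the output along the direction; zero up to rounding error.
leak = sum(a * b for a, b in zip(normalize(refusal_direction), y))
print(leak)
```

A `weight` below 1.0 removes only part of the component, which is one way a per-layer ablation weight could enter the computation.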
+
+Refusal directions are computed for each layer as a difference-of-means between
+the first-token residuals for "harmful" and "harmless" example prompts.
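
As a rough illustration of the difference-of-means computation for a single layer, assuming the residual vectors have already been collected: the function names and the two-dimensional toy vectors are hypothetical, and whether the resulting direction is normalized afterwards is not stated here.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def difference_of_means(harmful, harmless):
    """Mean of the "harmful" residuals minus mean of the "harmless" ones."""
    mean_harmful = mean_vector(harmful)
    mean_harmless = mean_vector(harmless)
    return [a - b for a, b in zip(mean_harmful, mean_harmless)]


# Toy first-token residuals for one layer (made-up numbers).
harmful_residuals = [[3.0, 1.0], [5.0, 1.0]]
harmless_residuals = [[1.0, 1.0], [1.0, 3.0]]

direction = difference_of_means(harmful_residuals, harmless_residuals)
print(direction)  # [3.0, -1.0]
```

Repeating this for every layer yields one candidate direction per layer.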
+
+The ablation process is controlled by several optimizable parameters:
+
+* `direction_index`: Either the index of a refusal direction, or the special
+  value `per layer`, indicating that each layer should be ablated using the
+  refusal direction associated with that layer.
+* `max_weight`, `max_weight_position`, `min_weight`, and `min_weight_distance`:
+  For each component, these parameters describe the shape and position of the
+  ablation weight kernel over the layers. The following diagram illustrates this:
+
+<img width="800" height="500" alt="Explanation" src="https://github.com/user-attachments/assets/82e4b84e-5a82-4faf-b918-ac642f9e4892" />
+
+<br>
+
+Heretic's main innovations over existing abliteration systems are:
+
+* The shape of the ablation weight kernel is highly flexible, which, combined with
+  automatic parameter optimization, can improve the compliance/quality tradeoff.
+  Non-constant ablation weights were previously explored by Maxime Labonne in
+  [gemma-3-12b-it-abliterated-v2](https://huggingface.co/mlabonne/gemma-3-12b-it-abliterated-v2).
+* The refusal direction index is a float rather than an integer. For non-integral
+  values, the two nearest refusal direction vectors are linearly interpolated.
+  This unlocks a vast space of additional directions beyond the ones identified
+  by the difference-of-means computation, and often enables the optimization
+  process to find a better direction than that belonging to any individual layer.
+* Ablation parameters are chosen separately for each component. I have found that
+  MLP interventions tend to be more damaging to the model than attention interventions,
+  so using different ablation weights can squeeze out some extra performance.
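
The float-valued direction index can be pictured as linear interpolation between the directions of the two neighboring layers. The sketch below is an assumption about how such an interpolation could look; the function name and toy directions are hypothetical, and Heretic may handle details such as renormalization or boundary indices differently.

```python
import math


def interpolated_direction(directions, index):
    """Linearly interpolate between the two directions nearest to `index`.

    `directions` holds one vector per layer; `index` is a float in
    [0, len(directions) - 1]. Integral values return that layer's direction.
    """
    lower = math.floor(index)
    upper = min(lower + 1, len(directions) - 1)
    t = index - lower  # fractional part: 0.0 at `lower`, approaching 1.0 at `upper`
    return [
        (1.0 - t) * a + t * b
        for a, b in zip(directions[lower], directions[upper])
    ]


# Toy per-layer directions (made-up numbers).
layer_directions = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]

print(interpolated_direction(layer_directions, 0.25))  # [0.75, 0.25]
print(interpolated_direction(layer_directions, 1.0))   # [0.0, 1.0]
```

Because the index is continuous, an optimizer can search between layers instead of being limited to one direction per layer.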
+
+
+## Prior art
+
+I'm aware of the following publicly available implementations of abliteration
+techniques:
+
+* [AutoAbliteration](https://huggingface.co/posts/mlabonne/714992455492422)
+* [abliterator.py](https://github.com/FailSpy/abliterator)
+* [wassname's Abliterator](https://github.com/wassname/abliterator)
+* [ErisForge](https://github.com/Tsadoq/ErisForge)
+* [Removing refusals with HF Transformers](https://github.com/Sumandora/remove-refusals-with-transformers)
+* [deccp](https://github.com/AUGMXNT/deccp)
+
+Note that Heretic was written from scratch, and does not reuse code from
+any of those projects.
+
+
+## Acknowledgments
+
+The development of Heretic was informed by:
+
+* [The original abliteration paper (Arditi et al. 2024)](https://arxiv.org/abs/2406.11717)
+* [Maxime Labonne's article on abliteration](https://huggingface.co/blog/mlabonne/abliteration),
+  as well as some details from the model cards of his own abliterated models (see above)
+* [Jim Lai's article describing "projected abliteration"](https://huggingface.co/blog/grimjim/projected-abliteration)


 ## License