Add README

2025-11-16 15:19:27 +05:30
parent ea699dce46
commit 12ecf50033
2 changed files with 140 additions and 3 deletions
@@ -1,6 +1,143 @@
-# Heretic
+# Heretic: Fully automatic censorship removal for language models
-TBD
+Heretic is a tool that removes censorship (aka "safety alignment") from
 transformer-based language models without expensive post-training.
 It combines an advanced implementation of directional ablation, also known
 as "abliteration" ([Arditi et al. 2024](https://arxiv.org/abs/2406.11717)),
 with a TPE-based parameter optimizer powered by [Optuna](https://optuna.org/).
 This approach enables Heretic to work **completely automatically.** Heretic
 finds high-quality abliteration parameters by co-minimizing the number of
 refusals and the KL divergence from the original model. This results in a
 decensored model that retains as much of the original model's intelligence
 as possible. Using Heretic does not require an understanding of transformer
 internals. In fact, anyone who knows how to run a command-line program
 can use Heretic to decensor language models.
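
For readers unfamiliar with the second objective, here is a minimal sketch of KL divergence between two next-token probability distributions. It is an illustration only, not Heretic's code: the `kl_divergence` helper and the toy distributions are hypothetical, and measuring from the original model's distribution to the modified one is an assumption.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) for two discrete distributions over the same tokens."""
    # Terms with p_i == 0 contribute nothing, so they are skipped.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token distributions over a three-token vocabulary (made up).
original = [0.7, 0.2, 0.1]
modified = [0.6, 0.3, 0.1]

kl_same = kl_divergence(original, original)     # identical distributions
kl_changed = kl_divergence(original, modified)  # small shift in probabilities

print(kl_same, kl_changed)
```

A value of zero means the modified model predicts exactly what the original does; larger values mean its predictions have drifted further.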
 <img width="650" height="715" alt="Screenshot" src="https://github.com/user-attachments/assets/d71a5efa-d6be-4705-a817-63332afb2d15" />
 <br>
 Running unsupervised with the default configuration, Heretic can produce
 decensored models that rival the quality of abliterations created manually
 by human experts:
 | Model | Refusals for "harmful" prompts | KL divergence from original model for "harmless" prompts |
 | :--- | ---: | ---: |
 | [google/gemma-3-12b-it](https://huggingface.co/google/gemma-3-12b-it) (original) | 97/100 | 0 *(by definition)* |
 | [mlabonne/gemma-3-12b-it-abliterated-v2](https://huggingface.co/mlabonne/gemma-3-12b-it-abliterated-v2) | 3/100 | 1.04 |
 | [huihui-ai/gemma-3-12b-it-abliterated](https://huggingface.co/huihui-ai/gemma-3-12b-it-abliterated) | 3/100 | 0.45 |
 | **[p-e-w/gemma-3-12b-it-heretic](https://huggingface.co/p-e-w/gemma-3-12b-it-heretic) (ours)** | **3/100** | **0.16** |
 The Heretic version, generated without any human effort, achieves the same
 level of refusal suppression as other abliterations, but at a much lower
 KL divergence, indicating less damage to the original model's capabilities.
 *(You can reproduce these numbers using Heretic's built-in evaluation functionality,
 e.g. `heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic`.
 Note that the exact values might be platform- and hardware-dependent.
 The table above was compiled using PyTorch 2.8 on an RTX 5090.)*
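
The table can be read as a two-objective comparison in which fewer refusals and lower KL divergence are both better. As a rough sketch (not part of Heretic; `Candidate`, `dominates`, and `pareto_front` are hypothetical names introduced here), the following keeps only the models that no other model beats on both metrics, using the numbers from the table:

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    refusals: int         # refusals out of 100 "harmful" prompts
    kl_divergence: float  # KL divergence on "harmless" prompts


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.refusals <= b.refusals and a.kl_divergence <= b.kl_divergence
    strictly_better = a.refusals < b.refusals or a.kl_divergence < b.kl_divergence
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that is not dominated by any other candidate."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Values copied from the table above.
candidates = [
    Candidate("google/gemma-3-12b-it", 97, 0.0),
    Candidate("mlabonne/gemma-3-12b-it-abliterated-v2", 3, 1.04),
    Candidate("huihui-ai/gemma-3-12b-it-abliterated", 3, 0.45),
    Candidate("p-e-w/gemma-3-12b-it-heretic", 3, 0.16),
]

front = pareto_front(candidates)
for c in front:
    print(c.name, c.refusals, c.kl_divergence)
```

Among the three abliterated models, only the Heretic version remains; the unmodified model also stays on the front because its KL divergence is zero by definition.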
 Heretic supports most dense models, including many multimodal models, and
 several different MoE architectures. It does not yet support SSMs/hybrid models,
 models with inhomogeneous layers, or certain novel attention systems.
 You can find a collection of models that have been decensored using Heretic
 [on Hugging Face](https://huggingface.co/collections/p-e-w/the-bestiary).
 ## Usage
 Prepare a Python 3.10+ environment with PyTorch 2.2+ installed as appropriate
 for your hardware. Then run:
 ```
 pip install heretic
 heretic Qwen/Qwen3-4B-Instruct-2507
 ```
 Replace `Qwen/Qwen3-4B-Instruct-2507` with whatever model you want to decensor.
 The process is fully automatic and does not require configuration; however,
 Heretic has a variety of configuration parameters that can be changed for
 greater control. Run `heretic --help` to see available command-line options,
 or look at [`config.default.toml`](config.default.toml) if you prefer to use
 a configuration file.
 At the start of a program run, Heretic benchmarks the system to determine
 the optimal batch size to make the most of the available hardware.
 On an RTX 3090, with the default configuration, decensoring Llama-3.1-8B
 takes about 45 minutes.
 After Heretic has finished decensoring a model, you can save the model,
 upload it to Hugging Face, chat with it to test how well it works,
 or do any combination of those.
 ## How it works
 Heretic implements a parametrized variant of directional ablation. For each
 supported transformer component (currently, attention out-projection and
 MLP down-projection), it identifies the associated matrices in each transformer
 layer, and orthogonalizes them with respect to the relevant "refusal direction",
 inhibiting the expression of that direction in the result of multiplications
 with that matrix.
 Refusal directions are computed for each layer as a difference-of-means between
 the first-token residuals for "harmful" and "harmless" example prompts.
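
The two steps described above can be sketched in plain Python on toy data. This is an illustration under stated assumptions, not Heretic's implementation: all function names are hypothetical, the matrix is assumed to have shape (d_model, d_in) so that its output lives in the residual stream, and the ablation is applied at full strength, without the per-layer weights described below.

```python
import math

Vector = list[float]
Matrix = list[Vector]


def mean(vectors: list[Vector]) -> Vector:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def refusal_direction(harmful: list[Vector], harmless: list[Vector]) -> Vector:
    """Unit-length difference of the mean residuals of the two prompt sets."""
    difference = [a - b for a, b in zip(mean(harmful), mean(harmless))]
    return normalize(difference)


def orthogonalize(matrix: Matrix, direction: Vector) -> Matrix:
    """Return W - r (r^T W), removing the component along r from each column."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # r^T W: how strongly each column of W points along r.
    projection = [
        sum(direction[i] * matrix[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]
    return [
        [matrix[i][j] - direction[i] * projection[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


def matvec(matrix: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


# Toy 3-dimensional residuals for two tiny prompt sets (made-up numbers).
harmful = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
harmless = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
r = refusal_direction(harmful, harmless)

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_ablated = orthogonalize(W, r)

# Whatever the input, the ablated matrix's output has no component along r.
output = matvec(W_ablated, [0.5, -1.5])
component_along_r = sum(a * b for a, b in zip(output, r))
print(r, W_ablated, component_along_r)
```

After orthogonalization, the product of the matrix with any input vector is perpendicular to the refusal direction, which is what "inhibiting the expression of that direction" means here.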
 The ablation process is controlled by several optimizable parameters:
 * `direction_index`: Either the index of a refusal direction, or the special
  value `per layer`, indicating that each layer should be ablated using the
  refusal direction associated with that layer.
 * `max_weight`, `max_weight_position`, `min_weight`, and `min_weight_distance`:
  For each component, these parameters describe the shape and position of the
  ablation weight kernel over the layers. The following diagram illustrates this:
 <img width="800" height="500" alt="Explanation" src="https://github.com/user-attachments/assets/82e4b84e-5a82-4faf-b918-ac642f9e4892" />
 <br>
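
One plausible reading of these four parameters is sketched below. The linear falloff and the zero weight outside the kernel are assumptions made for illustration, and `ablation_weight` is a hypothetical helper, not Heretic's API.

```python
def ablation_weight(
    layer_index: int,
    max_weight: float,
    max_weight_position: float,
    min_weight: float,
    min_weight_distance: float,
) -> float:
    """Ablation weight for one layer under an assumed linear kernel.

    The weight peaks at max_weight_position and falls off linearly to
    min_weight at min_weight_distance layers away. Layers further away
    are left untouched (weight 0). Requires min_weight_distance > 0.
    """
    distance = abs(layer_index - max_weight_position)
    if distance > min_weight_distance:
        return 0.0
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


# Example values are made up for illustration.
weights = [
    ablation_weight(
        layer,
        max_weight=1.0,
        max_weight_position=10.0,
        min_weight=0.2,
        min_weight_distance=4.0,
    )
    for layer in range(20)
]

for layer, weight in enumerate(weights):
    print(layer, round(weight, 3))
```

With these example values, layer 10 is ablated at full strength, layers 6 and 14 at the minimum weight, and layers outside that range are not modified.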
 Heretic's main innovations over existing abliteration systems are:
 * The shape of the ablation weight kernel is highly flexible, which, combined with
  automatic parameter optimization, can improve the compliance/quality tradeoff.
  Non-constant ablation weights were previously explored by Maxime Labonne in
  [gemma-3-12b-it-abliterated-v2](https://huggingface.co/mlabonne/gemma-3-12b-it-abliterated-v2).
 * The refusal direction index is a float rather than an integer. For non-integral
  values, the two nearest refusal direction vectors are linearly interpolated.
  This unlocks a vast space of additional directions beyond the ones identified
  by the difference-of-means computation, and often enables the optimization
  process to find a better direction than that belonging to any individual layer.
 * Ablation parameters are chosen separately for each component. I have found that
  MLP interventions tend to be more damaging to the model than attention interventions,
  so using different ablation weights can squeeze out some extra performance.
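
The float-valued direction index from the second point can be illustrated with a short sketch. The helper name `interpolated_direction` is hypothetical, and renormalizing the blended vector to unit length is an assumption of this example.

```python
import math


def interpolated_direction(
    directions: list[list[float]], direction_index: float
) -> list[float]:
    """Blend the two per-layer directions nearest to a fractional index.

    Expects 0 <= direction_index <= len(directions) - 1.
    """
    lower = math.floor(direction_index)
    upper = min(lower + 1, len(directions) - 1)
    fraction = direction_index - lower
    blended = [
        (1.0 - fraction) * a + fraction * b
        for a, b in zip(directions[lower], directions[upper])
    ]
    # Renormalize so the result is a unit vector again (assumption).
    norm = math.sqrt(sum(x * x for x in blended))
    return [x / norm for x in blended]


# Toy per-layer refusal directions in two dimensions (made-up numbers).
directions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

halfway = interpolated_direction(directions, 0.5)  # between layers 0 and 1
exact = interpolated_direction(directions, 1.0)    # exactly layer 1

print(halfway, exact)
```

An index of 0.5 yields a direction that belongs to neither layer 0 nor layer 1, which is how fractional indices extend the search space beyond the per-layer directions.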
 ## Prior art
 I'm aware of the following publicly available implementations of abliteration
 techniques:
 * [AutoAbliteration](https://huggingface.co/posts/mlabonne/714992455492422)
 * [abliterator.py](https://github.com/FailSpy/abliterator)
 * [wassname's Abliterator](https://github.com/wassname/abliterator)
 * [ErisForge](https://github.com/Tsadoq/ErisForge)
 * [Removing refusals with HF Transformers](https://github.com/Sumandora/remove-refusals-with-transformers)
 * [deccp](https://github.com/AUGMXNT/deccp)
 Note that Heretic was written from scratch, and does not reuse code from
 any of those projects.
 ## Acknowledgments
 The development of Heretic was informed by:
 * [The original abliteration paper (Arditi et al. 2024)](https://arxiv.org/abs/2406.11717)
 * [Maxime Labonne's article on abliteration](https://huggingface.co/blog/mlabonne/abliteration),
  as well as some details from the model cards of his own abliterated models (see above)
 * [Jim Lai's article describing "projected abliteration"](https://huggingface.co/blog/grimjim/projected-abliteration)
 ## License
@@ -1,7 +1,7 @@
 [project]
 name = "heretic"
 version = "1.0.0"
-description = "Fully automatic decensoring for transformer language models"
+description = "Fully automatic censorship removal for language models"
 readme = "README.md"
 license = "AGPL-3.0-or-later"
 authors = [