Add README
@@ -1,6 +1,143 @@
# Heretic: Fully automatic censorship removal for language models
Heretic is a tool that removes censorship (aka "safety alignment") from
transformer-based language models without expensive post-training.
It combines an advanced implementation of directional ablation, also known
as "abliteration" ([Arditi et al. 2024](https://arxiv.org/abs/2406.11717)),
with a parameter optimizer based on TPE (the Tree-structured Parzen Estimator)
and powered by [Optuna](https://optuna.org/).

This approach enables Heretic to work **completely automatically.** Heretic
finds high-quality abliteration parameters by co-minimizing the number of
refusals and the KL divergence from the original model. This results in a
decensored model that retains as much of the original model's intelligence
as possible. Using Heretic does not require an understanding of transformer
internals. In fact, anyone who knows how to run a command-line program
can use Heretic to decensor language models.

<img width="650" height="715" alt="Screenshot" src="https://github.com/user-attachments/assets/d71a5efa-d6be-4705-a817-63332afb2d15" />

<br>

Running unsupervised with the default configuration, Heretic can produce
decensored models that rival the quality of abliterations created manually
by human experts:

| Model | Refusals for "harmful" prompts | KL divergence from original model for "harmless" prompts |
| :--- | ---: | ---: |
| [google/gemma-3-12b-it](https://huggingface.co/google/gemma-3-12b-it) (original) | 97/100 | 0 *(by definition)* |
| [mlabonne/gemma-3-12b-it-abliterated-v2](https://huggingface.co/mlabonne/gemma-3-12b-it-abliterated-v2) | 3/100 | 1.04 |
| [huihui-ai/gemma-3-12b-it-abliterated](https://huggingface.co/huihui-ai/gemma-3-12b-it-abliterated) | 3/100 | 0.45 |
| **[p-e-w/gemma-3-12b-it-heretic](https://huggingface.co/p-e-w/gemma-3-12b-it-heretic) (ours)** | **3/100** | **0.16** |

The Heretic version, generated without any human effort, achieves the same
level of refusal suppression as other abliterations, but at a much lower
KL divergence, indicating less damage to the original model's capabilities.
*(You can reproduce these numbers using Heretic's built-in evaluation functionality,
e.g. `heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic`.
Note that the exact values might be platform- and hardware-dependent.
The table above was compiled using PyTorch 2.8 on an RTX 5090.)*
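To make the KL divergence column more concrete, here is a minimal sketch of the metric for two next-token probability distributions. It is illustrative only: the function name `kl_divergence`, the toy distributions, and the direction of the divergence (original relative to modified) are assumptions and are not taken from Heretic's source.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens.

    Assumes q[i] > 0 wherever p[i] > 0 (hypothetical helper, not Heretic's API).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p[i] == 0 contribute nothing, by the usual convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token probabilities (made up for illustration).
original = [0.70, 0.20, 0.10]  # original model
modified = [0.60, 0.25, 0.15]  # model after ablation

print(kl_divergence(original, original))  # 0.0, matching "0 (by definition)"
print(kl_divergence(original, modified))  # a small positive number
```

A model compared against itself always scores exactly zero, which is why the original model's row reads "0 *(by definition)*". Larger values mean the modified model's predictions on "harmless" prompts have drifted further from the original's.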

Heretic supports most dense models, including many multimodal models, and
several different MoE architectures. It does not yet support SSMs/hybrid models,
models with inhomogeneous layers, or certain novel attention systems.

You can find a collection of models that have been decensored using Heretic
[on Hugging Face](https://huggingface.co/collections/p-e-w/the-bestiary).

## Usage

Prepare a Python 3.10+ environment with PyTorch 2.2+ installed as appropriate
for your hardware. Then run:

```
pip install heretic
heretic Qwen/Qwen3-4B-Instruct-2507
```

Replace `Qwen/Qwen3-4B-Instruct-2507` with whatever model you want to decensor.

The process is fully automatic and does not require configuration; however,
Heretic has a variety of configuration parameters that can be changed for
greater control. Run `heretic --help` to see available command-line options,
or look at [`config.default.toml`](config.default.toml) if you prefer to use
a configuration file.

At the start of a program run, Heretic benchmarks the system to determine
the optimal batch size to make the most of the available hardware.
On an RTX 3090, with the default configuration, decensoring Llama-3.1-8B
takes about 45 minutes.

After Heretic has finished decensoring a model, you are given the option to
save the model, upload it to Hugging Face, chat with it to test how well it works,
or do any combination of these.

## How it works

Heretic implements a parametrized variant of directional ablation. For each
supported transformer component (currently, attention out-projection and
MLP down-projection), it identifies the associated matrices in each transformer
layer, and orthogonalizes them with respect to the relevant "refusal direction",
inhibiting the expression of that direction in the result of multiplications
with that matrix.
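As a rough illustration of what orthogonalizing a matrix against a direction means, the sketch below subtracts the rank-one component `r (rᵀ W)` from a small matrix `W`, scaled by an ablation weight. The helper names, the plain-list representation, and the row/column convention (rows index the output space that contains the direction) are assumptions made for this example; Heretic itself operates on PyTorch tensors and may differ in detail.

```python
def ablate_direction(matrix, direction, weight=1.0):
    """Return W - weight * r (r^T W), where r is the normalized direction.

    `matrix` is a list of rows with shape (d_out, d_in); `direction` has
    length d_out. Hypothetical helper for illustration only.
    """
    norm = sum(x * x for x in direction) ** 0.5
    r = [x / norm for x in direction]
    n_rows, n_cols = len(matrix), len(matrix[0])
    # r^T W: how strongly each input column writes along the direction.
    projection = [sum(r[i] * matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Remove that component from every column, scaled by the ablation weight.
    return [
        [matrix[i][j] - weight * r[i] * projection[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


def output_along(matrix, direction, vector):
    """Component of (matrix @ vector) along the normalized direction."""
    norm = sum(x * x for x in direction) ** 0.5
    output = [sum(w * v for w, v in zip(row, vector)) for row in matrix]
    return sum(d / norm * o for d, o in zip(direction, output))


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal = [0.0, 3.0, 4.0]
x = [1.0, -2.0]

print(output_along(W, refusal, x))                            # nonzero before ablation
print(output_along(ablate_direction(W, refusal), refusal, x))  # approximately zero after
```

With `weight=1.0` the matrix can no longer write anything along the direction, whatever the input. A weight between 0 and 1 shrinks that component proportionally instead of removing it.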

Refusal directions are computed for each layer as a difference-of-means between
the first-token residuals for "harmful" and "harmless" example prompts.
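The difference-of-means computation for a single layer can be sketched as follows. The residual vectors are tiny made-up examples; the function names, the sign convention (harmful minus harmless), and the normalization to unit length are assumptions for illustration rather than a description of Heretic's code.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def refusal_direction(harmful_residuals, harmless_residuals):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    harmful_mean = mean_vector(harmful_residuals)
    harmless_mean = mean_vector(harmless_residuals)
    difference = [a - b for a, b in zip(harmful_mean, harmless_mean)]
    norm = sum(x * x for x in difference) ** 0.5
    return [x / norm for x in difference]


# Toy first-token residuals for one layer (two prompts per class).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 3.0], [0.0, 1.0, 5.0]]

print(refusal_direction(harmful, harmless))  # [0.6, 0.0, -0.8]
```

Repeating this for every layer yields one candidate direction per layer, which is what the `direction_index` parameter selects from.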

The ablation process is controlled by several optimizable parameters:

* `direction_index`: Either the index of a refusal direction, or the special
  value `per layer`, indicating that each layer should be ablated using the
  refusal direction associated with that layer.
* `max_weight`, `max_weight_position`, `min_weight`, and `min_weight_distance`:
  For each component, these parameters describe the shape and position of the
  ablation weight kernel over the layers. The following diagram illustrates this:

<img width="800" height="500" alt="Explanation" src="https://github.com/user-attachments/assets/82e4b84e-5a82-4faf-b918-ac642f9e4892" />

<br>
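One plausible reading of the four kernel parameters is sketched below: the weight peaks at `max_weight_position`, falls off linearly to `min_weight` over `min_weight_distance` layers on either side, and is zero (layer untouched) beyond that. The linear falloff, the zero outside the kernel, and the function name `ablation_weight` are assumptions for this example; the diagram above is the authoritative description.

```python
def ablation_weight(layer_index, max_weight, max_weight_position,
                    min_weight, min_weight_distance):
    """Ablation weight for one layer under an assumed linear-falloff kernel.

    Assumes min_weight_distance > 0. Hypothetical helper for illustration.
    """
    distance = abs(layer_index - max_weight_position)
    if distance > min_weight_distance:
        # Outside the kernel: this layer is left unmodified.
        return 0.0
    # Interpolate linearly from max_weight (at the peak) to min_weight (at the edge).
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


# Toy 8-layer model with the peak at layer 4.
weights = [
    ablation_weight(layer, max_weight=1.0, max_weight_position=4.0,
                    min_weight=0.5, min_weight_distance=2.0)
    for layer in range(8)
]
print(weights)  # [0.0, 0.0, 0.5, 0.75, 1.0, 0.75, 0.5, 0.0]
```

Because each component (attention out-projection, MLP down-projection) gets its own set of these four values, the optimizer can place a strong, narrow kernel on one component and a weak, broad one on the other.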

Heretic's main innovations over existing abliteration systems are:

* The shape of the ablation weight kernel is highly flexible, which, combined with
  automatic parameter optimization, can improve the compliance/quality tradeoff.
  Non-constant ablation weights were previously explored by Maxime Labonne in
  [gemma-3-12b-it-abliterated-v2](https://huggingface.co/mlabonne/gemma-3-12b-it-abliterated-v2).
* The refusal direction index is a float rather than an integer. For non-integral
  values, the two nearest refusal direction vectors are linearly interpolated.
  This unlocks a vast space of additional directions beyond the ones identified
  by the difference-of-means computation, and often enables the optimization
  process to find a better direction than that belonging to any individual layer.
* Ablation parameters are chosen separately for each component. I have found that
  MLP interventions tend to be more damaging to the model than attention interventions,
  so using different ablation weights can squeeze out some extra performance.
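The float-valued direction index from the list above can be sketched like this: the two per-layer directions nearest to the index are blended linearly. Renormalizing the blend to unit length and clamping at the last layer are assumptions added for this example, as is the function name `interpolated_direction`.

```python
import math


def interpolated_direction(directions, index):
    """Linearly interpolate between the two per-layer directions nearest to `index`.

    `directions` is a list of per-layer vectors; `index` is a float in
    [0, len(directions) - 1]. Hypothetical helper for illustration.
    """
    if not 0 <= index <= len(directions) - 1:
        raise ValueError("index out of range")
    lower = math.floor(index)
    upper = min(lower + 1, len(directions) - 1)
    fraction = index - lower
    blended = [
        (1.0 - fraction) * a + fraction * b
        for a, b in zip(directions[lower], directions[upper])
    ]
    # Renormalize (an assumption); this fails if the blend is the zero vector.
    norm = sum(x * x for x in blended) ** 0.5
    return [x / norm for x in blended]


layer_directions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(interpolated_direction(layer_directions, 1.0))  # exactly layer 1's direction
print(interpolated_direction(layer_directions, 0.5))  # halfway between layers 0 and 1
```

An integral index reproduces a single layer's direction, so the integer case is a special case of the float one, and the optimizer can search over a continuous range instead of a discrete set.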


## Prior art

I'm aware of the following publicly available implementations of abliteration
techniques:

* [AutoAbliteration](https://huggingface.co/posts/mlabonne/714992455492422)
* [abliterator.py](https://github.com/FailSpy/abliterator)
* [wassname's Abliterator](https://github.com/wassname/abliterator)
* [ErisForge](https://github.com/Tsadoq/ErisForge)
* [Removing refusals with HF Transformers](https://github.com/Sumandora/remove-refusals-with-transformers)
* [deccp](https://github.com/AUGMXNT/deccp)

Note that Heretic was written from scratch, and does not reuse code from
any of those projects.


## Acknowledgments

The development of Heretic was informed by:

* [The original abliteration paper (Arditi et al. 2024)](https://arxiv.org/abs/2406.11717)
* [Maxime Labonne's article on abliteration](https://huggingface.co/blog/mlabonne/abliteration),
  as well as some details from the model cards of his own abliterated models (see above)
* [Jim Lai's article describing "projected abliteration"](https://huggingface.co/blog/grimjim/projected-abliteration)

## License
+1
-1
@@ -1,7 +1,7 @@
 [project]
 name = "heretic"
 version = "1.0.0"
-description = "Fully automatic decensoring for transformer language models"
+description = "Fully automatic censorship removal for language models"
 readme = "README.md"
 license = "AGPL-3.0-or-later"
 authors = [