diff --git a/README.md b/README.md
index a4e3738..feb29f7 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,7 @@
 # Heretic: Fully automatic censorship removal for language models
 
+[![Discord](https://img.shields.io/discord/1447831134212984903?color=5865F2&label=discord&labelColor=black&logo=discord&logoColor=white&style=for-the-badge)](https://discord.gg/gdXc48gSyT)
+
 Heretic is a tool that removes censorship (aka "safety alignment") from
 transformer-based language models without expensive post-training.
 It combines an advanced implementation of directional ablation, also known
@@ -37,6 +39,28 @@ e.g. `heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-i
 Note that the exact values might be platform- and hardware-dependent.
 The table above was compiled using PyTorch 2.8 on an RTX 5090.)*
 
+Of course, mathematical metrics and automated benchmarks never tell the whole
+story, and are no substitute for human evaluation. Models generated with
+Heretic have been well-received by users (links and emphasis added):
+
+> "I was skeptical before, but I just downloaded
+> [**GPT-OSS 20B Heretic**](https://huggingface.co/p-e-w/gpt-oss-20b-heretic)
+> model and holy shit. It gives properly formatted long responses to sensitive topics,
+> using the exact uncensored words that you would expect from an uncensored model,
+> produces markdown format tables with details and whatnot. Looks like this is
+> the best abliterated version of this model so far..."
+> [*(Link to comment)*](https://old.reddit.com/r/LocalLLaMA/comments/1oymku1/heretic_fully_automatic_censorship_removal_for/np6tba6/)
+
+> "[**Heretic GPT 20b**](https://huggingface.co/p-e-w/gpt-oss-20b-heretic)
+> seems to be the best uncensored model I have tried yet. It doesn't destroy a
+> the model's intelligence and it is answering prompts normally would be
+> rejected by the base model."
+> [*(Link to comment)*](https://old.reddit.com/r/LocalLLaMA/comments/1oymku1/heretic_fully_automatic_censorship_removal_for/npe9jng/)
+
+> "[[**Qwen3-4B-Instruct-2507-heretic**](https://huggingface.co/p-e-w/Qwen3-4B-Instruct-2507-heretic)]
+> Has been the best unquantized abliterated model that I have been able to run on 16gb vram."
+> [*(Link to comment)*](https://old.reddit.com/r/LocalLLaMA/comments/1phjxca/im_calling_these_people_out_right_now/nt06tji/)
+
 Heretic supports most dense models, including many multimodal models, and
 several different MoE architectures. It does not yet support SSMs/hybrid models,
 models with inhomogeneous layers, and certain novel attention systems.
@@ -51,7 +75,7 @@ Prepare a Python 3.10+ environment with PyTorch 2.2+ installed as appropriate
 for your hardware. Then run:
 
 ```
-pip install heretic-llm
+pip install -U heretic-llm
 heretic Qwen/Qwen3-4B-Instruct-2507
 ```
 
@@ -73,7 +97,88 @@ save the model, upload it to Hugging Face, chat with it to test how well it work
 or any combination of those actions.
 
 
-## How it works
+## Research features
+
+In addition to its primary function of removing model censorship, Heretic also
+provides features designed to support research into the semantics of model internals
+(interpretability). To use those features, you need to install Heretic with the
+optional `research` extra:
+
+```
+pip install -U heretic-llm[research]
+```
+
+This gives you access to the following functionality:
+
+### Generate plots of residual vectors by passing `--plot-residuals`
+
+When run with this flag, Heretic will:
+
+1. Compute residual vectors (hidden states) for the first output token,
+   for each transformer layer, for both "harmful" and "harmless" prompts.
+2. Perform a [PaCMAP projection](https://github.com/YingfanWang/PaCMAP)
+   from residual space to 2D-space.
+3. Left-right align the projections of "harmful"/"harmless" residuals
+   by their geometric medians to make projections for consecutive layers
+   more similar. Additionally, PaCMAP is initialized with the previous
+   layer's projections for each new layer, minimizing disruptive transitions.
+4. Scatter-plot the projections, generating a PNG image for each layer.
+5. Generate an animation showing how residuals transform between layers,
+   as an animated GIF.
+
+Plot of residual vectors
+
+See [the configuration file](config.default.toml) for options that allow you
+to control various aspects of the generated plots.
+
+Note that PaCMAP is an expensive operation that is performed on the CPU.
+For larger models, it can take an hour or more to compute projections
+for all layers.
+
+### Print details about residual geometry by passing `--print-residual-geometry`
+
+If you are interested in a quantitative analysis of how residual vectors
+for "harmful" and "harmless" prompts relate to each other, this flag gives you
+the following table, packed with metrics that can facilitate understanding
+the same (for [gemma-3-270m-it](https://huggingface.co/google/gemma-3-270m-it)
+in this case):
+
+```
+┏━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
+┃ Layer ┃ S(g,b) ┃ S(g*,b*) ┃  S(g,r) ┃ S(g*,r*) ┃  S(b,r) ┃ S(b*,r*) ┃      |g| ┃     |g*| ┃      |b| ┃     |b*| ┃     |r| ┃    |r*| ┃   Silh ┃
+┡━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
+│     1 │ 1.0000 │   1.0000 │ -0.4311 │  -0.4906 │ -0.4254 │  -0.4847 │   170.29 │   170.49 │   169.78 │   169.85 │    1.19 │    1.31 │ 0.0480 │
+│     2 │ 1.0000 │   1.0000 │  0.4297 │   0.4465 │  0.4365 │   0.4524 │   768.55 │   768.77 │   771.32 │   771.36 │    6.39 │    5.76 │ 0.0745 │
+│     3 │ 0.9999 │   1.0000 │ -0.5699 │  -0.5577 │ -0.5614 │  -0.5498 │  1020.98 │  1021.13 │  1013.80 │  1014.71 │   12.70 │   11.60 │ 0.0920 │
+│     4 │ 0.9999 │   1.0000 │  0.6582 │   0.6553 │  0.6659 │   0.6627 │  1356.39 │  1356.20 │  1368.71 │  1367.95 │   18.62 │   17.84 │ 0.0957 │
+│     5 │ 0.9987 │   0.9990 │ -0.6880 │  -0.6761 │ -0.6497 │  -0.6418 │   766.54 │   762.25 │   731.75 │   732.42 │   51.97 │   45.24 │ 0.1018 │
+│     6 │ 0.9998 │   0.9998 │ -0.1983 │  -0.2312 │ -0.1811 │  -0.2141 │  2417.35 │  2421.08 │  2409.18 │  2411.40 │   43.06 │   43.47 │ 0.0900 │
+│     7 │ 0.9998 │   0.9997 │ -0.5258 │  -0.5746 │ -0.5072 │  -0.5560 │  3444.92 │  3474.99 │  3400.01 │  3421.63 │   86.94 │   94.38 │ 0.0492 │
+│     8 │ 0.9990 │   0.9991 │  0.8235 │   0.8312 │  0.8479 │   0.8542 │  4596.54 │  4615.62 │  4918.32 │  4934.20 │  384.87 │  377.87 │ 0.2278 │
+│     9 │ 0.9992 │   0.9992 │  0.5335 │   0.5441 │  0.5678 │   0.5780 │  5322.30 │  5316.96 │  5468.65 │  5466.98 │  265.68 │  267.28 │ 0.1318 │
+│    10 │ 0.9974 │   0.9973 │  0.8189 │   0.8250 │  0.8579 │   0.8644 │  5328.81 │  5325.63 │  5953.35 │  5985.15 │  743.95 │  779.74 │ 0.2863 │
+│    11 │ 0.9977 │   0.9978 │  0.4262 │   0.4045 │  0.4862 │   0.4645 │  9644.02 │  9674.06 │  9983.47 │  9990.28 │  743.28 │  726.99 │ 0.1576 │
+│    12 │ 0.9904 │   0.9907 │  0.4384 │   0.4077 │  0.5586 │   0.5283 │ 10257.40 │ 10368.50 │ 11114.51 │ 11151.21 │ 1711.18 │ 1664.69 │ 0.1890 │
+│    13 │ 0.9867 │   0.9874 │  0.4007 │   0.3680 │  0.5444 │   0.5103 │ 12305.12 │ 12423.75 │ 13440.31 │ 13432.47 │ 2386.43 │ 2282.47 │ 0.1293 │
+│    14 │ 0.9921 │   0.9922 │  0.3198 │   0.2682 │  0.4364 │   0.3859 │ 16929.16 │ 17080.37 │ 17826.97 │ 17836.03 │ 2365.23 │ 2301.87 │ 0.1282 │
+│    15 │ 0.9846 │   0.9850 │  0.1198 │   0.0963 │  0.2913 │   0.2663 │ 16858.58 │ 16949.44 │ 17496.00 │ 17502.88 │ 3077.08 │ 3029.60 │ 0.1611 │
+│    16 │ 0.9686 │   0.9689 │ -0.0029 │  -0.0254 │  0.2457 │   0.2226 │ 18912.77 │ 19074.86 │ 19510.56 │ 19559.62 │ 4848.35 │ 4839.75 │ 0.1516 │
+│    17 │ 0.9782 │   0.9784 │ -0.0174 │  -0.0381 │  0.1908 │   0.1694 │ 27098.09 │ 27273.00 │ 27601.12 │ 27653.12 │ 5738.19 │ 5724.21 │ 0.1641 │
+│    18 │ 0.9184 │   0.9196 │  0.1343 │   0.1430 │  0.5155 │   0.5204 │   190.16 │   190.35 │   219.91 │   220.62 │   87.82 │   87.59 │ 0.1855 │
+└───────┴────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┴──────────┴──────────┴─────────┴─────────┴────────┘
+g = mean of residual vectors for good prompts
+g* = geometric median of residual vectors for good prompts
+b = mean of residual vectors for bad prompts
+b* = geometric median of residual vectors for bad prompts
+r = refusal direction for means (i.e., b - g)
+r* = refusal direction for geometric medians (i.e., b* - g*)
+S(x,y) = cosine similarity of x and y
+|x| = L2 norm of x
+Silh = Mean silhouette coefficient of residuals for good/bad clusters
+```
+
+
+## How Heretic works
 
 Heretic implements a parametrized variant of directional ablation. For each
 supported transformer component (currently, attention out-projection and
diff --git a/src/heretic/analyzer.py b/src/heretic/analyzer.py
index ac31245..aef65c3 100644
--- a/src/heretic/analyzer.py
+++ b/src/heretic/analyzer.py
@@ -38,7 +38,7 @@ class Analyzer:
                 (
                     "[red]Research dependencies not found. Printing residual geometry requires "
                     "installing Heretic with the optional research feature, i.e., "
-                    'using "pip install heretic-llm\\[research]".[/]'
+                    'using "pip install -U heretic-llm\\[research]".[/]'
                 )
             )
             return
@@ -164,7 +164,7 @@ class Analyzer:
                 (
                     "[red]Research dependencies not found. Plotting residuals requires "
                     "installing Heretic with the optional research feature, i.e., "
-                    'using "pip install heretic-llm\\[research]".[/]'
+                    'using "pip install -U heretic-llm\\[research]".[/]'
                 )
             )
             return
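
The legend under the residual-geometry table in the README hunk defines each column in terms of means, geometric medians, cosine similarities, and L2 norms. As a rough illustration of how one row of such a table could be derived, here is a minimal NumPy sketch. It is not Heretic's implementation: the function names (`cosine_similarity`, `geometric_median`, `residual_geometry`), the choice of Weiszfeld's algorithm for the geometric median, and the `(n_prompts, hidden_size)` input layout are all assumptions made for this example.

```python
import numpy as np


def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """S(x,y): cosine of the angle between vectors x and y."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


def geometric_median(
    points: np.ndarray, iterations: int = 100, eps: float = 1e-8
) -> np.ndarray:
    """Approximate the geometric median of the rows of `points`.

    Uses Weiszfeld's algorithm (an assumption for this sketch): repeatedly
    re-average the points, weighting each by the inverse of its distance
    to the current estimate.
    """
    median = points.mean(axis=0)  # start from the arithmetic mean
    for _ in range(iterations):
        distances = np.linalg.norm(points - median, axis=1)
        weights = 1.0 / np.maximum(distances, eps)  # avoid division by zero
        updated = (weights[:, None] * points).sum(axis=0) / weights.sum()
        shift = np.linalg.norm(updated - median)
        median = updated
        if shift < eps:  # estimate has stopped moving
            break
    return median


def residual_geometry(good: np.ndarray, bad: np.ndarray) -> dict[str, float]:
    """Compute the legend's quantities for one layer.

    `good` and `bad` are hypothetical (n_prompts, hidden_size) arrays of
    residual vectors for "harmless" and "harmful" prompts.
    """
    g, b = good.mean(axis=0), bad.mean(axis=0)
    g_star, b_star = geometric_median(good), geometric_median(bad)
    r, r_star = b - g, b_star - g_star  # refusal directions
    return {
        "S(g,b)": cosine_similarity(g, b),
        "S(g*,b*)": cosine_similarity(g_star, b_star),
        "S(g,r)": cosine_similarity(g, r),
        "S(g*,r*)": cosine_similarity(g_star, r_star),
        "S(b,r)": cosine_similarity(b, r),
        "S(b*,r*)": cosine_similarity(b_star, r_star),
        "|g|": float(np.linalg.norm(g)),
        "|g*|": float(np.linalg.norm(g_star)),
        "|b|": float(np.linalg.norm(b)),
        "|b*|": float(np.linalg.norm(b_star)),
        "|r|": float(np.linalg.norm(r)),
        "|r*|": float(np.linalg.norm(r_star)),
    }
```

The `Silh` column is left out of this sketch. A mean silhouette coefficient over the two clusters could be obtained with, for example, scikit-learn's `silhouette_score`, given the stacked residuals and a good/bad label per row.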