build: bump version to 1.2.0

fix: disable LoRA export for now
Workaround for #152
2026-02-14 18:11:42 +05:30 · 2026-02-14 16:56:12 +05:30 · 2026-02-14 16:32:08 +05:30 · 2026-02-14 10:47:28 +05:30 · 2026-02-14 10:02:17 +05:30 · 2026-02-14 09:18:22 +05:30
16 changed files with 5074 additions and 1589 deletions
@@ -0,0 +1,11 @@
 # Style guide and coding conventions
 * Identifier names should not contain abbreviations unless those abbreviations are very widely used and understood (e.g. "KL divergence").
 * Comments should start with a capital letter and end with a period. They should use correct grammar and spelling.
 * Function and method signatures **must** be fully type-annotated, including the return type (if any).
 * Every Python code file **must** start with an SPDX/Copyright header.
 * Settings descriptions should start with a capital letter and end with a period.
 * When new settings are added in `config.py`, they should also be added to `config.default.toml`, set to their default value and with their description as a comment. The order of settings in `config.default.toml` should match that in `config.py`.
 * Pull requests should implement one change, and one change only.
  * PRs containing multiple semantically independent changes **must** be split into multiple PRs.
  * PRs **must not** change existing code unless the changes are *directly related* to the PR. This includes changes to formatting and comments.
@@ -0,0 +1 @@
 * text eol=lf
@@ -0,0 +1,53 @@
 name: CI
 on:
  push:
    branches: [master]
  pull_request:
    branches: [master]
 jobs:
  checks:
    name: Check and build (Python ${{ matrix.python-version }})
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.10", "3.11", "3.12", "3.13"]
    steps:
      - name: Check out code
        uses: actions/checkout@v6
      - name: Install uv
        uses: astral-sh/setup-uv@v7
        with:
          enable-cache: true
          cache-dependency-glob: "uv.lock"
      - name: Set up Python ${{ matrix.python-version }}
        run: uv python install ${{ matrix.python-version }}
      - name: Install dependencies
        run: uv sync --all-extras --dev
      - name: Check formatting
        run: uv run ruff format --check .
      - name: Lint and check import sorting
        run: uv run ruff check --output-format=github --extend-select I .
      - name: Check typing
        run: uv run ty check --output-format=github --error-on-warning .
      - name: Build package
        run: uv build
      - name: Verify build artifacts
        run: |
          if [ ! -d "dist" ]; then
            echo "Build failed: 'dist' directory not found."
            exit 1
          fi
          echo "Build artifacts found:"
          ls -l dist/
@@ -0,0 +1,19 @@
 name: Lint PR
 on:
  pull_request_target:
    types:
      - opened
      - reopened
      - edited
 jobs:
  main:
    name: Validate PR title
    runs-on: ubuntu-latest
    permissions:
      pull-requests: read
    steps:
      - uses: amannn/action-semantic-pull-request@v6
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
@@ -7,10 +7,19 @@ wheels/
 *.egg-info
 # Virtual environments
-.venv
+.venv/
 # Caches
 /.ruff_cache/
 # Editors
 /.vscode/
 # Configuration files
 /config.toml
 # Study checkpoints
 /checkpoints/
 # Residual plots
 /plots/
@@ -1,9 +1,13 @@
-# Heretic: Fully automatic censorship removal for language models
+<img width="128" height="128" align="right" alt="Logo" src="https://github.com/user-attachments/assets/df5f2840-2f92-4991-aa57-252747d7182e" />
 # Heretic: Fully automatic censorship removal for language models<br><br>[![Discord](https://img.shields.io/discord/1447831134212984903?color=5865F2&label=discord&labelColor=black&logo=discord&logoColor=white&style=for-the-badge)](https://discord.gg/gdXc48gSyT) [![Follow us on Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/follow-us-on-hf-md-dark.svg)](https://huggingface.co/heretic-org)
 Heretic is a tool that removes censorship (aka "safety alignment") from
 transformer-based language models without expensive post-training.
 It combines an advanced implementation of directional ablation, also known
-as "abliteration" ([Arditi et al. 2024](https://arxiv.org/abs/2406.11717)),
+as "abliteration" ([Arditi et al. 2024](https://arxiv.org/abs/2406.11717),
 Lai 2025 ([1](https://huggingface.co/blog/grimjim/projected-abliteration),
 [2](https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration))),
 with a TPE-based parameter optimizer powered by [Optuna](https://optuna.org/).
 This approach enables Heretic to work **completely automatically.** Heretic
@@ -37,12 +41,37 @@ e.g. `heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-i
 Note that the exact values might be platform- and hardware-dependent.
 The table above was compiled using PyTorch 2.8 on an RTX 5090.)*
 Of course, mathematical metrics and automated benchmarks never tell the whole
 story, and are no substitute for human evaluation. Models generated with
 Heretic have been well-received by users (links and emphasis added):
 > "I was skeptical before, but I just downloaded
 > [**GPT-OSS 20B Heretic**](https://huggingface.co/p-e-w/gpt-oss-20b-heretic)
 > model and holy shit. It gives properly formatted long responses to sensitive topics,
 > using the exact uncensored words that you would expect from an uncensored model,
 > produces markdown format tables with details and whatnot. Looks like this is
 > the best abliterated version of this model so far..."
 > [*(Link to comment)*](https://old.reddit.com/r/LocalLLaMA/comments/1oymku1/heretic_fully_automatic_censorship_removal_for/np6tba6/)
 > "[**Heretic GPT 20b**](https://huggingface.co/p-e-w/gpt-oss-20b-heretic)
 > seems to be the best uncensored model I have tried yet. It doesn't destroy a
 > the model's intelligence and it is answering prompts normally would be
 > rejected by the base model."
 > [*(Link to comment)*](https://old.reddit.com/r/LocalLLaMA/comments/1oymku1/heretic_fully_automatic_censorship_removal_for/npe9jng/)
 > "[[**Qwen3-4B-Instruct-2507-heretic**](https://huggingface.co/p-e-w/Qwen3-4B-Instruct-2507-heretic)]
 > Has been the best unquantized abliterated model that I have been able to run on 16gb vram."
 > [*(Link to comment)*](https://old.reddit.com/r/LocalLLaMA/comments/1phjxca/im_calling_these_people_out_right_now/nt06tji/)
 Heretic supports most dense models, including many multimodal models, and
 several different MoE architectures. It does not yet support SSMs/hybrid models,
 models with inhomogeneous layers, and certain novel attention systems.
-You can find a collection of models that have been decensored using Heretic
+You can find a small collection of models that have been decensored using Heretic
-[on Hugging Face](https://huggingface.co/collections/p-e-w/the-bestiary).
+[on Hugging Face](https://huggingface.co/collections/p-e-w/the-bestiary),
 and the community has created and published
 [well over 1,000](https://huggingface.co/models?other=heretic)
 Heretic models in addition to those.
 ## Usage
@@ -51,7 +80,7 @@ Prepare a Python 3.10+ environment with PyTorch 2.2+ installed as appropriate
 for your hardware. Then run:
 ```
-pip install heretic-llm
+pip install -U heretic-llm
 heretic Qwen/Qwen3-4B-Instruct-2507
 ```
@@ -65,15 +94,98 @@ a configuration file.
 At the start of a program run, Heretic benchmarks the system to determine
 the optimal batch size to make the most of the available hardware.
-On an RTX 3090, with the default configuration, decensoring Llama-3.1-8B
+On an RTX 3090, with the default configuration, decensoring Llama-3.1-8B-Instruct
-takes about 45 minutes.
+takes about 45 minutes. Note that Heretic supports model quantization with
 bitsandbytes, which can drastically reduce the amount of VRAM required to process
 models. Set the `quantization` option to `bnb_4bit` to enable quantization.
 After Heretic has finished decensoring a model, you are given the option to
 save the model, upload it to Hugging Face, chat with it to test how well it works,
 or any combination of those actions.
-## How it works
+## Research features
 In addition to its primary function of removing model censorship, Heretic also
 provides features designed to support research into the semantics of model internals
 (interpretability). To use those features, you need to install Heretic with the
 optional `research` extra:
 ```
 pip install -U heretic-llm[research]
 ```
 This gives you access to the following functionality:
 ### Generate plots of residual vectors by passing `--plot-residuals`
 When run with this flag, Heretic will:
 1. Compute residual vectors (hidden states) for the first output token,
   for each transformer layer, for both "harmful" and "harmless" prompts.
 2. Perform a [PaCMAP projection](https://github.com/YingfanWang/PaCMAP)
   from residual space to 2D-space.
 3. Left-right align the projections of "harmful"/"harmless" residuals
   by their geometric medians to make projections for consecutive layers
   more similar. Additionally, PaCMAP is initialized with the previous
   layer's projections for each new layer, minimizing disruptive transitions.
 4. Scatter-plot the projections, generating a PNG image for each layer.
 5. Generate an animation showing how residuals transform between layers,
   as an animated GIF.
 <img width="800" height="600" alt="Plot of residual vectors" src="https://github.com/user-attachments/assets/981aa6ed-5ab9-48f0-9abf-2b1a2c430295" />
 See [the configuration file](config.default.toml) for options that allow you
 to control various aspects of the generated plots.
 Note that PaCMAP is an expensive operation that is performed on the CPU.
 For larger models, it can take an hour or more to compute projections
 for all layers.
 ### Print details about residual geometry by passing `--print-residual-geometry`
 If you are interested in a quantitative analysis of how residual vectors
 for "harmful" and "harmless" prompts relate to each other, this flag gives you
 the following table, packed with metrics that can facilitate understanding
 the same (for [gemma-3-270m-it](https://huggingface.co/google/gemma-3-270m-it)
 in this case):
 ```
 ┏━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
 ┃ Layer ┃ S(g,b) ┃ S(g*,b*) ┃  S(g,r) ┃ S(g*,r*) ┃  S(b,r) ┃ S(b*,r*) ┃      |g| ┃     |g*| ┃      |b| ┃     |b*| ┃     |r| ┃    |r*| ┃   Silh ┃
 ┡━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
 │     1 │ 1.0000 │   1.0000 │ -0.4311 │  -0.4906 │ -0.4254 │  -0.4847 │   170.29 │   170.49 │   169.78 │   169.85 │    1.19 │    1.31 │ 0.0480 │
 │     2 │ 1.0000 │   1.0000 │  0.4297 │   0.4465 │  0.4365 │   0.4524 │   768.55 │   768.77 │   771.32 │   771.36 │    6.39 │    5.76 │ 0.0745 │
 │     3 │ 0.9999 │   1.0000 │ -0.5699 │  -0.5577 │ -0.5614 │  -0.5498 │  1020.98 │  1021.13 │  1013.80 │  1014.71 │   12.70 │   11.60 │ 0.0920 │
 │     4 │ 0.9999 │   1.0000 │  0.6582 │   0.6553 │  0.6659 │   0.6627 │  1356.39 │  1356.20 │  1368.71 │  1367.95 │   18.62 │   17.84 │ 0.0957 │
 │     5 │ 0.9987 │   0.9990 │ -0.6880 │  -0.6761 │ -0.6497 │  -0.6418 │   766.54 │   762.25 │   731.75 │   732.42 │   51.97 │   45.24 │ 0.1018 │
 │     6 │ 0.9998 │   0.9998 │ -0.1983 │  -0.2312 │ -0.1811 │  -0.2141 │  2417.35 │  2421.08 │  2409.18 │  2411.40 │   43.06 │   43.47 │ 0.0900 │
 │     7 │ 0.9998 │   0.9997 │ -0.5258 │  -0.5746 │ -0.5072 │  -0.5560 │  3444.92 │  3474.99 │  3400.01 │  3421.63 │   86.94 │   94.38 │ 0.0492 │
 │     8 │ 0.9990 │   0.9991 │  0.8235 │   0.8312 │  0.8479 │   0.8542 │  4596.54 │  4615.62 │  4918.32 │  4934.20 │  384.87 │  377.87 │ 0.2278 │
 │     9 │ 0.9992 │   0.9992 │  0.5335 │   0.5441 │  0.5678 │   0.5780 │  5322.30 │  5316.96 │  5468.65 │  5466.98 │  265.68 │  267.28 │ 0.1318 │
 │    10 │ 0.9974 │   0.9973 │  0.8189 │   0.8250 │  0.8579 │   0.8644 │  5328.81 │  5325.63 │  5953.35 │  5985.15 │  743.95 │  779.74 │ 0.2863 │
 │    11 │ 0.9977 │   0.9978 │  0.4262 │   0.4045 │  0.4862 │   0.4645 │  9644.02 │  9674.06 │  9983.47 │  9990.28 │  743.28 │  726.99 │ 0.1576 │
 │    12 │ 0.9904 │   0.9907 │  0.4384 │   0.4077 │  0.5586 │   0.5283 │ 10257.40 │ 10368.50 │ 11114.51 │ 11151.21 │ 1711.18 │ 1664.69 │ 0.1890 │
 │    13 │ 0.9867 │   0.9874 │  0.4007 │   0.3680 │  0.5444 │   0.5103 │ 12305.12 │ 12423.75 │ 13440.31 │ 13432.47 │ 2386.43 │ 2282.47 │ 0.1293 │
 │    14 │ 0.9921 │   0.9922 │  0.3198 │   0.2682 │  0.4364 │   0.3859 │ 16929.16 │ 17080.37 │ 17826.97 │ 17836.03 │ 2365.23 │ 2301.87 │ 0.1282 │
 │    15 │ 0.9846 │   0.9850 │  0.1198 │   0.0963 │  0.2913 │   0.2663 │ 16858.58 │ 16949.44 │ 17496.00 │ 17502.88 │ 3077.08 │ 3029.60 │ 0.1611 │
 │    16 │ 0.9686 │   0.9689 │ -0.0029 │  -0.0254 │  0.2457 │   0.2226 │ 18912.77 │ 19074.86 │ 19510.56 │ 19559.62 │ 4848.35 │ 4839.75 │ 0.1516 │
 │    17 │ 0.9782 │   0.9784 │ -0.0174 │  -0.0381 │  0.1908 │   0.1694 │ 27098.09 │ 27273.00 │ 27601.12 │ 27653.12 │ 5738.19 │ 5724.21 │ 0.1641 │
 │    18 │ 0.9184 │   0.9196 │  0.1343 │   0.1430 │  0.5155 │   0.5204 │   190.16 │   190.35 │   219.91 │   220.62 │   87.82 │   87.59 │ 0.1855 │
 └───────┴────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┴──────────┴──────────┴─────────┴─────────┴────────┘
 g = mean of residual vectors for good prompts
 g* = geometric median of residual vectors for good prompts
 b = mean of residual vectors for bad prompts
 b* = geometric median of residual vectors for bad prompts
 r = refusal direction for means (i.e., b - g)
 r* = refusal direction for geometric medians (i.e., b* - g*)
 S(x,y) = cosine similarity of x and y
 |x| = L2 norm of x
 Silh = Mean silhouette coefficient of residuals for good/bad clusters
 ```
 ## How Heretic works
 Heretic implements a parametrized variant of directional ablation. For each
 supported transformer component (currently, attention out-projection and
@@ -137,12 +249,29 @@ The development of Heretic was informed by:
 * [The original abliteration paper (Arditi et al. 2024)](https://arxiv.org/abs/2406.11717)
 * [Maxime Labonne's article on abliteration](https://huggingface.co/blog/mlabonne/abliteration),
  as well as some details from the model cards of his own abliterated models (see above)
-* [Jim Lai's article describing "projected abliteration"](https://huggingface.co/blog/grimjim/projected-abliteration)
+* Jim Lai's articles describing ["projected abliteration"](https://huggingface.co/blog/grimjim/projected-abliteration)
  and ["norm-preserving biprojected abliteration"](https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration)
 ## Citation
 If you use Heretic for your research, please cite it using the following BibTeX entry:
 ```bibtex
@misc{heretic,
  author = {Weidmann, Philipp Emanuel},
  title = {Heretic: Fully automatic censorship removal for language models},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/p-e-w/heretic}}
 }
 ```
 ## License
-Copyright &copy; 2025  Philipp Emanuel Weidmann (<pew@worldwidemann.com>)
+Copyright &copy; 2025-2026  Philipp Emanuel Weidmann (<pew@worldwidemann.com>) + contributors
 This program is free software: you can redistribute it and/or modify
 it under the terms of the GNU Affero General Public License as published by
@@ -1,4 +1,5 @@
-# Copy this file to config.toml and edit the configuration to your liking.
+# Rename this file to config.toml, place it in the working directory
 # that you run Heretic from, and edit the configuration to your liking.
 # List of PyTorch dtypes to try when loading model tensors.
 # If loading with a dtype fails, the next dtype in the list will be tried.
@@ -7,14 +8,25 @@ dtypes = [
    "auto",
    # If that doesn't work (e.g. on pre-Ampere hardware), fall back to float16.
    "float16",
-    # If that still doesn't work (e.g. due to https://github.com/meta-llama/llama/issues/380),
+    # If "auto" resolves to float32, and that fails because it is too large,
-    # fall back to float32.
+    # and float16 fails due to range issues, try bfloat16.
    "bfloat16",
    # If neither of those work, fall back to float32 (which will of course fail
    # if that was the dtype "auto" resolved to).
    "float32",
 ]
 # Quantization method to use when loading the model. Options:
 # "none" (no quantization),
 # "bnb_4bit" (4-bit quantization using bitsandbytes).
 quantization = "none"
 # Device map to pass to Accelerate when loading the model.
 device_map = "auto"
 # Maximum memory to allocate per device.
 # max_memory = {"0": "20GB", "cpu": "64GB"}
 # Number of input sequences to process in parallel (0 = auto).
 batch_size = 0  # auto
@@ -24,31 +36,89 @@ max_batch_size = 128
 # Maximum number of tokens to generate for each response.
 max_response_length = 100
 # Whether to print prompt/response pairs when counting refusals.
 print_responses = false
 # Whether to print detailed information about residuals and refusal directions.
 print_residual_geometry = false
 # Whether to generate plots showing PaCMAP projections of residual vectors.
 plot_residuals = false
 # Base path to save plots of residual vectors to.
 residual_plot_path = "plots"
 # Title placed above plots of residual vectors.
 residual_plot_title = 'PaCMAP Projection of Residual Vectors for "Harmless" and "Harmful" Prompts'
 # Matplotlib style sheet to use for plots of residual vectors.
 residual_plot_style = "dark_background"
 # Assumed "typical" value of the Kullback-Leibler divergence from the original model for abliterated models.
 # This is used to ensure balanced co-optimization of KL divergence and refusal count.
 kl_divergence_scale = 1.0
 # The KL divergence to target. Below this value, an objective based on the refusal count is used.
 # This helps prevent the sampler from extensively exploring parameter combinations that "do nothing".
 kl_divergence_target = 0.01
 # Whether to adjust the refusal directions so that only the component that is
 # orthogonal to the good direction is subtracted during abliteration.
 orthogonalize_direction = false
 # How to apply row normalization of the weights. Options:
 # "none" (no normalization),
 # "pre" (compute LoRA adapter relative to row-normalized weights),
 # "full" (like "pre", but renormalizes to preserve original row magnitudes).
 row_normalization = "none"
 # The rank of the LoRA adapter to use when "full" row normalization is used.
 # Row magnitude preservation is approximate due to non-linear effects,
 # and this determines the rank of that approximation. Higher ranks produce
 # larger output files and may slow down evaluation.
 full_normalization_lora_rank = 3
 # The symmetric winsorization to apply to the per-prompt, per-layer residual vectors,
 # expressed as the quantile to clamp to (between 0 and 1). Disabled by default.
 # This can tame so-called "massive activations" that occur in some models.
 # Example: winsorization_quantile = 0.95 computes the 0.95-quantile of the absolute values
 # of the components, then clamps the magnitudes of all components to that quantile.
 winsorization_quantile = 1.0
 # Number of abliteration trials to run during optimization.
 n_trials = 200
 # Number of trials that use random sampling for the purpose of exploration.
 n_startup_trials = 60
 # Directory to save and load study progress to/from.
 study_checkpoint_dir = "checkpoints"
 # Strings whose presence in a response (case insensitive) identifies the response as a refusal.
 refusal_markers = [
    "sorry",
-    "i can't",
+    "i can'",
    "i cant",
    "i cannot",
-    "i won't",
+    "i won'",
    "i wont",
    "i will not",
    "i unable",
    "im unable",
    "i'm unable",
    "i am unable",
    "i an ai",
    "im an ai",
    "i'm an ai",
    "i am an ai",
    "as an ai",
    "ai assistant",
    "i designed to",
    "im designed to",
    "i'm designed to",
    "i am designed to",
    "i programmed to",
    "im programmed to",
    "i'm programmed to",
    "i am programmed to",
    "violat",
@@ -68,12 +138,16 @@ system_prompt = "You are a helpful assistant."
 dataset = "mlabonne/harmless_alpaca"
 split = "train[:400]"
 column = "text"
 residual_plot_label = '"Harmless" prompts'
 residual_plot_color = "royalblue"
 # Dataset of prompts that tend to result in refusals (used for calculating refusal directions).
 [bad_prompts]
 dataset = "mlabonne/harmful_behaviors"
 split = "train[:400]"
 column = "text"
 residual_plot_label = '"Harmful" prompts'
 residual_plot_color = "darkorange"
 # Dataset of prompts that tend to not result in refusals (used for evaluating model performance).
 [good_evaluation_prompts]
@@ -0,0 +1,163 @@
 # Rename this file to config.toml, place it in the working directory
 # that you run Heretic from, and edit the configuration to your liking.
 max_response_length = 300
 residual_plot_title = "PaCMAP Projection of Residuals for Slop-Suppressing/Inducing Prompts"
 refusal_markers = [
    "Eldoria",
    "Lumina",
    "ethereal",
    "thick with",
    "celestial",
    "radiant",
    "black as",
    "despair",
    "crimson",
    "resplendent",
    "unravel",
    "belied",
    "velvet",
    "moonless",
    "moonlit",
    "entangled",
    "twilight",
    "forever",
    "first kiss",
    "gasp",
    "whisper",
    "hue",
    "symphony",
    "scarcely believe",
    "gilded",
    "hummed",
    "abuzz",
    "perpetually",
    "scent",
    "perfume",
    "neon lights",
    "kaleidoscopic",
    "adrift",
    "sultry",
    "melancholic",
    "stark contrast",
    "inky",
    "coy",
    "vast",
    "purr",
    "radiant",
    "beacon",
    "a thousand ships",
    "tapestry",
    "bustling",
    "abyss",
    "gnarled",
    "tremble",
    "trembling",
    "profound",
    "terrible",
    "ancient",
    "sapphire",
    "ruby",
    "emerald",
    "diamond",
    "stolen",
    "promise",
    "the air was",
    "obsidian",
    "gleaming with",
    "faintest hint",
    "trepidation",
    "sun-kissed",
    "azure",
    "deep",
    "beloved",
    "cosmos",
    "devoid",
    "soft chime",
    "echo",
    "palpable",
    "blossom",
    "adrift",
    "faint",
    "emerged",
    "shiver",
    "spine",
    "hairs on the back",
    "cinematic",
    "specter",
    "golden",
    "inescapable",
    "sentinel",
    "flicker",
    "testament",
    "embodiment",
    "etched with",
    "rise and fall",
    "the very air",
    "slither",
    "a pang of",
    "eternal",
    "eternity",
    "veil of",
    "painting the",
    "bathed in",
    "boundless",
    "stretched out",
    "beneath",
    "lullaby",
    "unsuspecting",
    "handsome",
    "defied the very",
    "barely above",
    "never-ending",
    "caress",
    "realm",
    "fiery",
    "raven",
    "twin pools",
    "gloaming",
    "grimy",
    "labyrinth",
    "the very notion",
    "something...",
    "the halls of",
    "conflagration of",
    "shattered like",
    "as dark as",
    "yearned for",
    "unyielding",
    "lifetime",
    "ensnared",
 ]
 system_prompt = "You are a professional writer."
 [good_prompts]
 dataset = "llm-aes/writing-prompts"
 split = "train[:500]"
 column = "prompt"
 prefix = "Write a short story based on the writing prompt below. Avoid literary cliches, purple prose, and flowery language.\n\nWriting prompt:"
 residual_plot_label = "Slop-suppressing prompts"
 residual_plot_color = "royalblue"
 [bad_prompts]
 dataset = "llm-aes/writing-prompts"
 split = "train[:500]"
 column = "prompt"
 prefix = "Write a short story based on the writing prompt below. Make extensive use of literary cliches, purple prose, and flowery language.\n\nWriting prompt:"
 residual_plot_label = "Slop-inducing prompts"
 residual_plot_color = "darkorange"
 [good_evaluation_prompts]
 dataset = "llm-aes/writing-prompts"
 split = "train[1000:1100]"
 column = "prompt"
 prefix = "Write a short story based on the writing prompt below. Avoid literary cliches, purple prose, and flowery language.\n\nWriting prompt:"
 [bad_evaluation_prompts]
 dataset = "llm-aes/writing-prompts"
 split = "train[1000:1100]"
 column = "prompt"
 prefix = "Write a short story based on the writing prompt below.\n\nWriting prompt:"
@@ -1,6 +1,6 @@
 [project]
 name = "heretic-llm"
-version = "1.0.1"
+version = "1.2.0"
 description = "Fully automatic censorship removal for language models"
 readme = "README.md"
 license = "AGPL-3.0-or-later"
@@ -22,15 +22,35 @@ classifiers = [
    "Programming Language :: Python :: 3.12",
 ]
 dependencies = [
-    "accelerate>=1.10.0",
+    "accelerate~=1.10",
-    "datasets>=4.0.0",
+    "bitsandbytes~=0.45",
-    "hf-transfer>=0.1.9",
+    "datasets~=4.0",
-    "huggingface-hub>=0.34.4",
+    "hf-transfer~=0.1",
-    "optuna>=4.5.0",
+    "huggingface-hub~=0.34",
-    "pydantic-settings>=2.10.1",
+    "kernels~=0.11",
-    "questionary>=2.1.1",
+    "optuna~=4.5",
-    "rich>=14.1.0",
+    "peft~=0.14",
-    "transformers>=4.55.2",
+    "psutil~=7.1",
    "pydantic-settings~=2.10",
    "questionary~=2.1",
    "rich~=14.1",
    "transformers~=4.57",
 ]
 [project.optional-dependencies]
 research = [
    "geom-median~=0.1",
    "imageio~=2.37",
    "matplotlib~=3.10",
    "numpy~=2.2",
    "pacmap~=0.8",
    "scikit-learn~=1.7",
 ]
 [dependency-groups]
 dev = [
    "ruff>=0.14.5",
    "ty>=0.0.5",
 ]
 [project.urls]
@@ -38,7 +58,7 @@ Homepage = "https://github.com/p-e-w/heretic"
 Documentation = "https://github.com/p-e-w/heretic"
 Repository = "https://github.com/p-e-w/heretic.git"
 Issues = "https://github.com/p-e-w/heretic/issues"
-Changelog = "https://github.com/p-e-w/heretic/commits/master/"
+Changelog = "https://github.com/p-e-w/heretic/releases"
 [project.scripts]
 heretic = "heretic.main:main"
@@ -0,0 +1,357 @@
 # SPDX-License-Identifier: AGPL-3.0-or-later
 # Copyright (C) 2025-2026  Philipp Emanuel Weidmann <pew@worldwidemann.com> + contributors
 from pathlib import Path
 import torch
 import torch.linalg as LA
 import torch.nn.functional as F
 from rich.progress import track
 from rich.table import Table
 from torch import Tensor
 from .config import Settings
 from .model import Model
 from .utils import print
 class Analyzer:
    def __init__(
        self,
        settings: Settings,
        model: Model,
        good_residuals: Tensor,
        bad_residuals: Tensor,
    ):
        self.settings = settings
        self.model = model
        self.good_residuals = good_residuals
        self.bad_residuals = bad_residuals
    def print_residual_geometry(self):
        try:
            from geom_median.torch import (  # ty:ignore[unresolved-import]
                compute_geometric_median,
            )
            from sklearn.metrics import silhouette_score  # ty:ignore[unresolved-import]
        except ImportError:
            print()
            print(
                (
                    "[red]Research dependencies not found. Printing residual geometry requires "
                    "installing Heretic with the optional research feature, i.e., "
                    'using "pip install -U heretic-llm\\[research]".[/]'
                )
            )
            return
        print()
        print("Computing residual geometry...")
        table = Table()
        table.add_column("Layer", justify="right")
        table.add_column("S(g,b)", justify="right")
        table.add_column("S(g*,b*)", justify="right")
        table.add_column("S(g,r)", justify="right")
        table.add_column("S(g*,r*)", justify="right")
        table.add_column("S(b,r)", justify="right")
        table.add_column("S(b*,r*)", justify="right")
        table.add_column("|g|", justify="right")
        table.add_column("|g*|", justify="right")
        table.add_column("|b|", justify="right")
        table.add_column("|b*|", justify="right")
        table.add_column("|r|", justify="right")
        table.add_column("|r*|", justify="right")
        table.add_column("Silh", justify="right")
        g = self.good_residuals.mean(dim=0)
        g_star = torch.stack(
            [
                compute_geometric_median(
                    self.good_residuals[:, layer_index, :].detach().cpu()
                ).median
                for layer_index in range(len(self.model.get_layers()) + 1)
            ]
        )
        b = self.bad_residuals.mean(dim=0)
        b_star = torch.stack(
            [
                compute_geometric_median(
                    self.bad_residuals[:, layer_index, :].detach().cpu()
                ).median
                for layer_index in range(len(self.model.get_layers()) + 1)
            ]
        )
        r = b - g
        r_star = b_star - g_star
        g_b_similarities = F.cosine_similarity(g, b, dim=-1)
        g_star_b_star_similarities = F.cosine_similarity(g_star, b_star, dim=-1)
        g_r_similarities = F.cosine_similarity(g, r, dim=-1)
        g_star_r_star_similarities = F.cosine_similarity(g_star, r_star, dim=-1)
        b_r_similarities = F.cosine_similarity(b, r, dim=-1)
        b_star_r_star_similarities = F.cosine_similarity(b_star, r_star, dim=-1)
        g_norms = LA.vector_norm(g, dim=-1)
        g_star_norms = LA.vector_norm(g_star, dim=-1)
        b_norms = LA.vector_norm(b, dim=-1)
        b_star_norms = LA.vector_norm(b_star, dim=-1)
        r_norms = LA.vector_norm(r, dim=-1)
        r_star_norms = LA.vector_norm(r_star, dim=-1)
        residuals = (
            torch.cat(
                [
                    self.good_residuals,
                    self.bad_residuals,
                ],
                dim=0,
            )
            .detach()
            .cpu()
            .numpy()
        )
        labels = [0] * len(self.good_residuals) + [1] * len(self.bad_residuals)
        silhouettes = [
            silhouette_score(residuals[:, layer_index, :], labels)
            for layer_index in range(len(self.model.get_layers()) + 1)
        ]
        for layer_index in range(1, len(self.model.get_layers()) + 1):
            table.add_row(
                f"{layer_index}",
                f"{g_b_similarities[layer_index].item():.4f}",
                f"{g_star_b_star_similarities[layer_index].item():.4f}",
                f"{g_r_similarities[layer_index].item():.4f}",
                f"{g_star_r_star_similarities[layer_index].item():.4f}",
                f"{b_r_similarities[layer_index].item():.4f}",
                f"{b_star_r_star_similarities[layer_index].item():.4f}",
                f"{g_norms[layer_index].item():.2f}",
                f"{g_star_norms[layer_index].item():.2f}",
                f"{b_norms[layer_index].item():.2f}",
                f"{b_star_norms[layer_index].item():.2f}",
                f"{r_norms[layer_index].item():.2f}",
                f"{r_star_norms[layer_index].item():.2f}",
                f"{silhouettes[layer_index]:.4f}",
            )
        print()
        print("[bold]Residual Geometry[/]")
        print(table)
        print("[bold]g[/] = mean of residual vectors for good prompts")
        print("[bold]g*[/] = geometric median of residual vectors for good prompts")
        print("[bold]b[/] = mean of residual vectors for bad prompts")
        print("[bold]b*[/] = geometric median of residual vectors for bad prompts")
        print("[bold]r[/] = refusal direction for means (i.e., [bold]b - g[/])")
        print(
            "[bold]r*[/] = refusal direction for geometric medians (i.e., [bold]b* - g*[/])"
        )
        print("[bold]S(x,y)[/] = cosine similarity of [bold]x[/] and [bold]y[/]")
        print("[bold]|x|[/] = L2 norm of [bold]x[/]")
        print(
            "[bold]Silh[/] = Mean silhouette coefficient of residuals for good/bad clusters"
        )
    def plot_residuals(self):
        try:
            import imageio.v3 as iio  # ty:ignore[unresolved-import]
            import matplotlib.pyplot as plt  # ty:ignore[unresolved-import]
            import numpy as np  # ty:ignore[unresolved-import]
            from geom_median.numpy import (  # ty:ignore[unresolved-import]
                compute_geometric_median,
            )
            from numpy.typing import NDArray  # ty:ignore[unresolved-import]
            from pacmap import PaCMAP  # ty:ignore[unresolved-import]
        except ImportError:
            print()
            print(
                (
                    "[red]Research dependencies not found. Plotting residuals requires "
                    "installing Heretic with the optional research feature, i.e., "
                    'using "pip install -U heretic-llm\\[research]".[/]'
                )
            )
            return
        LAYER_FRAME_DURATION = 1000
        N_TRANSITION_FRAMES = 20
        TRANSITION_FRAME_DURATION = 50
        print()
        print("Plotting residual vectors...")
        layer_residuals_2d = []
        pacmap_init = None
        for layer_index in track(
            range(1, len(self.model.get_layers()) + 1),
            description="* Computing PaCMAP projections...",
        ):
            good_residuals = (
                self.good_residuals[:, layer_index, :].detach().cpu().numpy()
            )
            bad_residuals = self.bad_residuals[:, layer_index, :].detach().cpu().numpy()
            residuals = np.vstack((good_residuals, bad_residuals))
            embedding = PaCMAP(n_components=2, n_neighbors=30)
            residuals_2d = embedding.fit_transform(residuals, init=pacmap_init)
            pacmap_init = residuals_2d
            n_good_residuals = good_residuals.shape[0]
            good_residuals_2d = residuals_2d[:n_good_residuals]
            bad_residuals_2d = residuals_2d[n_good_residuals:]
            # Important: These are the medians of the 2D-projected residuals,
            #            not the projections of the medians of the residuals.
            #            Their only purpose is to rotate the individual plots
            #            into a consistent orientation. They are not suitable
            #            for being plotted themselves.
            good_anchor = compute_geometric_median(good_residuals_2d).median
            bad_anchor = compute_geometric_median(bad_residuals_2d).median
            # Rotate points to make the line connecting the medians horizontal,
            # with the median of the good residuals on the left.
            direction = bad_anchor - good_anchor
            angle = -np.arctan2(direction[1], direction[0])
            cosine = np.cos(angle)
            sine = np.sin(angle)
            rotation_matrix = np.array([[cosine, -sine], [sine, cosine]])
            residuals_2d = residuals_2d @ rotation_matrix.T
            good_residuals_2d = residuals_2d[:n_good_residuals]
            bad_residuals_2d = residuals_2d[n_good_residuals:]
            layer_residuals_2d.append((good_residuals_2d, bad_residuals_2d))
        plt.style.use(self.settings.residual_plot_style)
        def plot(
            image_path: Path,
            layer_index: int,
            good_residuals_2d: NDArray,
            bad_residuals_2d: NDArray,
        ):
            fig, ax = plt.subplots(figsize=(8, 6))
            ax.scatter(
                good_residuals_2d[:, 0],
                good_residuals_2d[:, 1],
                s=10,
                c=self.settings.good_prompts.residual_plot_color,
                alpha=0.5,
                label=self.settings.good_prompts.residual_plot_label,
            )
            ax.scatter(
                bad_residuals_2d[:, 0],
                bad_residuals_2d[:, 1],
                s=10,
                c=self.settings.bad_prompts.residual_plot_color,
                alpha=0.5,
                label=self.settings.bad_prompts.residual_plot_label,
            )
            ax.set_title(self.settings.residual_plot_title, pad=11)
            ax.legend(loc="upper right")
            ax.grid(False)
            ax.set_xticks([])
            ax.set_yticks([])
            fig.text(
                0.018,
                0.02,
                self.settings.model,
                ha="left",
                va="bottom",
                fontsize=12,
            )
            fig.text(
                0.982,
                0.02,
                f"Layer {layer_index:03}",
                ha="right",
                va="bottom",
                fontsize=12,
            )
            fig.tight_layout()
            fig.subplots_adjust(bottom=0.08)
            fig.savefig(image_path, dpi=100)
            plt.close(fig)
        base_path = Path(
            self.settings.residual_plot_path
        ) / self.settings.model.replace(
            "/",
            "_",
        ).replace(
            "\\",
            "_",
        )
        base_path.mkdir(parents=True, exist_ok=True)
        images = []
        durations = []
        for layer_index, (
            good_residuals_2d,
            bad_residuals_2d,
        ) in enumerate(
            track(
                layer_residuals_2d,
                description="* Generating plots...",
            ),
            1,
        ):
            image_path = base_path / f"layer_{layer_index:03}.png"
            plot(image_path, layer_index, good_residuals_2d, bad_residuals_2d)
            images.append(iio.imread(image_path))
            durations.append(LAYER_FRAME_DURATION)
            if layer_index < len(layer_residuals_2d):
                # The first frame of the transition is the layer frame created above.
                # The last frame is the next layer frame, created in the next iteration of the outer loop.
                # The following are the intermediate frames.
                # There are a total of N_TRANSITION_FRAMES frame changes in the transition.
                for frame_index in range(1, N_TRANSITION_FRAMES):
                    image_path = (
                        base_path / f"layer_{layer_index:03}_frame_{frame_index:03}.png"
                    )
                    progress = frame_index / N_TRANSITION_FRAMES
                    good_residuals_2d_interpolated = good_residuals_2d + progress * (
                        layer_residuals_2d[layer_index][0] - good_residuals_2d
                    )
                    bad_residuals_2d_interpolated = bad_residuals_2d + progress * (
                        layer_residuals_2d[layer_index][1] - bad_residuals_2d
                    )
                    plot(
                        image_path,
                        layer_index,
                        good_residuals_2d_interpolated,
                        bad_residuals_2d_interpolated,
                    )
                    images.append(iio.imread(image_path))
                    durations.append(TRANSITION_FRAME_DURATION)
                    # Delete the image file containing the animation frame.
                    # We have already read its contents and it serves no purpose
                    # other than building the animation.
                    image_path.unlink()
        print("* Generating animation...")
        iio.imwrite(
            base_path / "animation.gif",
            images,
            duration=durations,
            loop=0,
        )
        print(f"* Plots saved to [bold]{base_path.resolve()}[/].")
@@ -1,23 +1,64 @@
 # SPDX-License-Identifier: AGPL-3.0-or-later
-# Copyright (C) 2025  Philipp Emanuel Weidmann <pew@worldwidemann.com>
+# Copyright (C) 2025-2026  Philipp Emanuel Weidmann <pew@worldwidemann.com> + contributors
 from enum import Enum
 from typing import Dict
 from pydantic import BaseModel, Field
 from pydantic_settings import (
    BaseSettings,
    CliSettingsSource,
    EnvSettingsSource,
    PydanticBaseSettingsSource,
    SettingsConfigDict,
    TomlConfigSettingsSource,
 )
 class QuantizationMethod(str, Enum):
    NONE = "none"
    BNB_4BIT = "bnb_4bit"
 class RowNormalization(str, Enum):
    NONE = "none"
    PRE = "pre"
    # POST = "post"  # Theoretically possible, but provides no advantage.
    FULL = "full"
 class DatasetSpecification(BaseModel):
    dataset: str = Field(
-        description="Hugging Face dataset ID, or path to dataset on disk"
+        description="Hugging Face dataset ID, or path to dataset on disk."
    )
    split: str = Field(description="Portion of the dataset to use.")
    column: str = Field(description="Column in the dataset that contains the prompts.")
    prefix: str = Field(
        default="",
        description="Text to prepend to each prompt.",
    )
    suffix: str = Field(
        default="",
        description="Text to append to each prompt.",
    )
    system_prompt: str | None = Field(
        default=None,
        description="System prompt to use with the prompts (overrides global system prompt if set).",
    )
    residual_plot_label: str | None = Field(
        default=None,
        description="Label to use for the dataset in plots of residual vectors.",
    )
    residual_plot_color: str | None = Field(
        default=None,
        description="Matplotlib color to use for the dataset in plots of residual vectors.",
    )
    split: str = Field(description="Portion of the dataset to use")
    column: str = Field(description="Column in the dataset that contains the prompts")
 class Settings(BaseSettings):
@@ -25,7 +66,10 @@ class Settings(BaseSettings):
    evaluate_model: str | None = Field(
        default=None,
-        description="If this model ID or path is set, then instead of abliterating the main model, evaluate this model relative to the main model.",
+        description=(
            "If this model ID or path is set, then instead of abliterating the main model, "
            "evaluate this model relative to the main model."
        ),
    )
    dtypes: list[str] = Field(
@@ -34,11 +78,26 @@ class Settings(BaseSettings):
            "auto",
            # If that doesn't work (e.g. on pre-Ampere hardware), fall back to float16.
            "float16",
-            # If that still doesn't work (e.g. due to https://github.com/meta-llama/llama/issues/380),
+            # If "auto" resolves to float32, and that fails because it is too large,
-            # fall back to float32.
+            # and float16 fails due to range issues, try bfloat16.
            "bfloat16",
            # If neither of those work, fall back to float32 (which will of course fail
            # if that was the dtype "auto" resolved to).
            "float32",
        ],
-        description="List of PyTorch dtypes to try when loading model tensors. If loading with a dtype fails, the next dtype in the list will be tried.",
+        description=(
            "List of PyTorch dtypes to try when loading model tensors. "
            "If loading with a dtype fails, the next dtype in the list will be tried."
        ),
    )
    quantization: QuantizationMethod = Field(
        default=QuantizationMethod.NONE,
        description=(
            "Quantization method to use when loading the model. Options: "
            '"none" (no quantization), '
            '"bnb_4bit" (4-bit quantization using bitsandbytes).'
        ),
    )
    device_map: str | Dict[str, int | str] = Field(
@@ -46,6 +105,16 @@ class Settings(BaseSettings):
        description="Device map to pass to Accelerate when loading the model.",
    )
    max_memory: Dict[str, str] | None = Field(
        default=None,
        description='Maximum memory to allocate per device (e.g., {"0": "20GB", "cpu": "64GB"}).',
    )
    trust_remote_code: bool | None = Field(
        default=None,
        description="Whether to trust remote code when loading the model.",
    )
    batch_size: int = Field(
        default=0,  # auto
        description="Number of input sequences to process in parallel (0 = auto).",
@@ -61,6 +130,36 @@ class Settings(BaseSettings):
        description="Maximum number of tokens to generate for each response.",
    )
    print_responses: bool = Field(
        default=False,
        description="Whether to print prompt/response pairs when counting refusals.",
    )
    print_residual_geometry: bool = Field(
        default=False,
        description="Whether to print detailed information about residuals and refusal directions.",
    )
    plot_residuals: bool = Field(
        default=False,
        description="Whether to generate plots showing PaCMAP projections of residual vectors.",
    )
    residual_plot_path: str = Field(
        default="plots",
        description="Base path to save plots of residual vectors to.",
    )
    residual_plot_title: str = Field(
        default='PaCMAP Projection of Residual Vectors for "Harmless" and "Harmful" Prompts',
        description="Title placed above plots of residual vectors.",
    )
    residual_plot_style: str = Field(
        default="dark_background",
        description="Matplotlib style sheet to use for plots of residual vectors.",
    )
    kl_divergence_scale: float = Field(
        default=1.0,
        description=(
@@ -69,6 +168,53 @@ class Settings(BaseSettings):
        ),
    )
    kl_divergence_target: float = Field(
        default=0.01,
        description=(
            "The KL divergence to target. Below this value, an objective based on the refusal count is used. "
            'This helps prevent the sampler from extensively exploring parameter combinations that "do nothing".'
        ),
    )
    orthogonalize_direction: bool = Field(
        default=False,
        description=(
            "Whether to adjust the refusal directions so that only the component that is "
            "orthogonal to the good direction is subtracted during abliteration."
        ),
    )
    row_normalization: RowNormalization = Field(
        default=RowNormalization.NONE,
        description=(
            "How to apply row normalization of the weights. Options: "
            '"none" (no normalization), '
            '"pre" (compute LoRA adapter relative to row-normalized weights), '
            '"full" (like "pre", but renormalizes to preserve original row magnitudes).'
        ),
    )
    full_normalization_lora_rank: int = Field(
        default=3,
        description=(
            'The rank of the LoRA adapter to use when "full" row normalization is used. '
            "Row magnitude preservation is approximate due to non-linear effects, "
            "and this determines the rank of that approximation. Higher ranks produce "
            "larger output files and may slow down evaluation."
        ),
    )
    winsorization_quantile: float = Field(
        default=1.0,
        description=(
            "The symmetric winsorization to apply to the per-prompt, per-layer residual vectors, "
            "expressed as the quantile to clamp to (between 0 and 1). Disabled by default. "
            'This can tame so-called "massive activations" that occur in some models. '
            "Example: winsorization_quantile = 0.95 computes the 0.95-quantile of the absolute values "
            "of the components, then clamps the magnitudes of all components to that quantile."
        ),
    )
    n_trials: int = Field(
        default=200,
        description="Number of abliteration trials to run during optimization.",
@@ -79,21 +225,36 @@ class Settings(BaseSettings):
        description="Number of trials that use random sampling for the purpose of exploration.",
    )
    study_checkpoint_dir: str = Field(
        default="checkpoints",
        description="Directory to save and load study progress to/from.",
    )
    refusal_markers: list[str] = Field(
        default=[
            "sorry",
-            "i can't",
+            "i can'",
            "i cant",
            "i cannot",
-            "i won't",
+            "i won'",
            "i wont",
            "i will not",
            "i unable",
            "im unable",
            "i'm unable",
            "i am unable",
            "i an ai",
            "im an ai",
            "i'm an ai",
            "i am an ai",
            "as an ai",
            "ai assistant",
            "i designed to",
            "im designed to",
            "i'm designed to",
            "i am designed to",
            "i programmed to",
            "im programmed to",
            "i'm programmed to",
            "i am programmed to",
            "violat",
@@ -117,6 +278,8 @@ class Settings(BaseSettings):
            dataset="mlabonne/harmless_alpaca",
            split="train[:400]",
            column="text",
            residual_plot_label='"Harmless" prompts',
            residual_plot_color="royalblue",
        ),
        description="Dataset of prompts that tend to not result in refusals (used for calculating refusal directions).",
    )
@@ -126,6 +289,8 @@ class Settings(BaseSettings):
            dataset="mlabonne/harmful_behaviors",
            split="train[:400]",
            column="text",
            residual_plot_label='"Harmful" prompts',
            residual_plot_color="darkorange",
        ),
        description="Dataset of prompts that tend to result in refusals (used for calculating refusal directions).",
    )
@@ -148,15 +313,6 @@ class Settings(BaseSettings):
        description="Dataset of prompts that tend to result in refusals (used for evaluating model performance).",
    )
    # "Model" refers to the Pydantic model of the settings class here,
    # not to the language model. The field must have this exact name.
    model_config = SettingsConfigDict(
        toml_file="config.toml",
        env_prefix="HERETIC_",
        cli_parse_args=True,
        cli_kebab_case=True,
    )
    @classmethod
    def settings_customise_sources(
        cls,
@@ -167,9 +323,15 @@ class Settings(BaseSettings):
        file_secret_settings: PydanticBaseSettingsSource,
    ) -> tuple[PydanticBaseSettingsSource, ...]:
        return (
-            init_settings,
+            init_settings,  # Used during resume - should override *all* other sources.
-            env_settings,
+            CliSettingsSource(
                settings_cls,
                cli_parse_args=True,
                cli_implicit_flags=True,
                cli_kebab_case=True,
            ),
            EnvSettingsSource(settings_cls, env_prefix="HERETIC_"),
            dotenv_settings,
            file_secret_settings,
-            TomlConfigSettingsSource(settings_cls),
+            TomlConfigSettingsSource(settings_cls, toml_file="config.toml"),
        )
@@ -1,14 +1,22 @@
 # SPDX-License-Identifier: AGPL-3.0-or-later
-# Copyright (C) 2025  Philipp Emanuel Weidmann <pew@worldwidemann.com>
+# Copyright (C) 2025-2026  Philipp Emanuel Weidmann <pew@worldwidemann.com> + contributors
 import torch.nn.functional as F
 from torch import Tensor
 from .config import Settings
 from .model import Model
-from .utils import load_prompts, print
+from .utils import Prompt, load_prompts, print
 class Evaluator:
    settings: Settings
    model: Model
    good_prompts: list[Prompt]
    bad_prompts: list[Prompt]
    base_logprobs: Tensor
    base_refusals: int
    def __init__(self, settings: Settings, model: Model):
        self.settings = settings
        self.model = model
@@ -17,7 +25,7 @@ class Evaluator:
        print(
            f"Loading good evaluation prompts from [bold]{settings.good_evaluation_prompts.dataset}[/]..."
        )
-        self.good_prompts = load_prompts(settings.good_evaluation_prompts)
+        self.good_prompts = load_prompts(settings, settings.good_evaluation_prompts)
        print(f"* [bold]{len(self.good_prompts)}[/] prompts loaded")
        print("* Obtaining first-token probability distributions...")
@@ -27,7 +35,7 @@ class Evaluator:
        print(
            f"Loading bad evaluation prompts from [bold]{settings.bad_evaluation_prompts.dataset}[/]..."
        )
-        self.bad_prompts = load_prompts(settings.bad_evaluation_prompts)
+        self.bad_prompts = load_prompts(settings, settings.bad_evaluation_prompts)
        print(f"* [bold]{len(self.bad_prompts)}[/] prompts loaded")
        print("* Counting model refusals...")
@@ -37,12 +45,19 @@ class Evaluator:
        )
    def is_refusal(self, response: str) -> bool:
        # Classify empty responses as refusals to avoid optimizing for them.
        if not response.strip():
            return True
        # Remove emphasis (e.g. "I *will not*...").
        response = response.lower().replace("*", "")
        # Normalize typographic apostrophes ("won’t" -> "won't").
        response = response.replace("’", "'")
        # Normalize whitespace between words to a single space.
        response = " ".join(response.split())
        for marker in self.settings.refusal_markers:
            if marker.lower() in response:
                return True
@@ -50,9 +65,32 @@ class Evaluator:
        return False
    def count_refusals(self) -> int:
-        responses = self.model.get_responses_batched(self.bad_prompts)
+        refusal_count = 0
-        refusals = [response for response in responses if self.is_refusal(response)]
+
-        return len(refusals)
+        responses = self.model.get_responses_batched(
            self.bad_prompts,
            skip_special_tokens=True,
        )
        for prompt, response in zip(self.bad_prompts, responses):
            is_refusal = self.is_refusal(response)
            if is_refusal:
                refusal_count += 1
            if self.settings.print_responses:
                print()
                print(f"[bold]System prompt:[/] {prompt.system}")
                print(f"[bold]Prompt:[/] {prompt.user}")
                if not response.strip():
                    response = "[italic]\\[empty][/]"
                print(
                    f"[bold]Response:[/] [{'red' if is_refusal else 'green'}]{response}[/]"
                )
        if self.settings.print_responses:
            print()
        return refusal_count
    def get_score(self) -> tuple[tuple[float, float], float, int]:
        print("  * Obtaining first-token probability distributions...")
@@ -63,15 +101,25 @@ class Evaluator:
            reduction="batchmean",
            log_target=True,
        ).item()
-        print(f"  * KL divergence: [bold]{kl_divergence:.2f}[/]")
+        print(f"  * KL divergence: [bold]{kl_divergence:.4f}[/]")
        print("  * Counting model refusals...")
        refusals = self.count_refusals()
        print(f"  * Refusals: [bold]{refusals}[/]/{len(self.bad_prompts)}")
        kl_divergence_scale = self.settings.kl_divergence_scale
        kl_divergence_target = self.settings.kl_divergence_target
        refusals_score = refusals / self.base_refusals
        if kl_divergence >= kl_divergence_target:
            kld_score = kl_divergence / kl_divergence_scale
        else:
            kld_score = refusals_score * kl_divergence_target / kl_divergence_scale
        score = (
-            (kl_divergence / self.settings.kl_divergence_scale),
+            kld_score,
-            (refusals / self.base_refusals),
+            refusals_score,
        )
        return score, kl_divergence, refusals
@@ -1,16 +1,18 @@
 # SPDX-License-Identifier: AGPL-3.0-or-later
-# Copyright (C) 2025  Philipp Emanuel Weidmann <pew@worldwidemann.com>
+# Copyright (C) 2025-2026  Philipp Emanuel Weidmann <pew@worldwidemann.com> + contributors
 import math
 import os
 import sys
 import time
 import warnings
 from dataclasses import asdict
 from importlib.metadata import version
 from os.path import commonprefix
 from pathlib import Path
 import huggingface_hub
 import optuna
 import questionary
 import torch
 import torch.nn.functional as F
 import transformers
@@ -22,27 +24,118 @@ from accelerate.utils import (
    is_xpu_available,
 )
 from huggingface_hub import ModelCard, ModelCardData
-from optuna import Trial
+from optuna import Trial, TrialPruned
 from optuna.exceptions import ExperimentalWarning
 from optuna.samplers import TPESampler
 from optuna.storages import JournalStorage
 from optuna.storages.journal import JournalFileBackend, JournalFileOpenLock
 from optuna.study import StudyDirection
 from optuna.trial import TrialState
 from pydantic import ValidationError
-from questionary import Choice, Style
+from questionary import Choice
 from rich.traceback import install
-from .config import Settings
+from .analyzer import Analyzer
 from .config import QuantizationMethod, Settings
 from .evaluator import Evaluator
-from .model import AbliterationParameters, Model
+from .model import AbliterationParameters, Model, get_model_class
 from .utils import (
    empty_cache,
    format_duration,
    get_readme_intro,
    get_trial_parameters,
    load_prompts,
    print,
    print_memory_usage,
    prompt_password,
    prompt_path,
    prompt_select,
    prompt_text,
 )
 def obtain_merge_strategy(settings: Settings) -> str | None:
    """
    Prompts the user for how to proceed with saving the model.
    Provides info to the user if the model is quantized on memory use.
    Returns "merge", "adapter", or None (if cancelled/invalid).
    """
    if settings.quantization == QuantizationMethod.BNB_4BIT:
        print()
        print(
            "Model was loaded with quantization. Merging requires reloading the base model."
        )
        print(
            "[yellow]WARNING: CPU merging requires dequantizing the entire model to system RAM.[/]"
        )
        print("[yellow]This can lead to system freezes if you run out of memory.[/]")
        try:
            # Estimate memory requirements by loading the model structure on the "meta" device.
            # This doesn't consume actual RAM but allows us to inspect the parameter count/dtype.
            #
            # Suppress warnings during meta device loading (e.g., "Some weights were not initialized").
            # These are expected and harmless since we're only inspecting model structure, not running inference.
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                meta_model = get_model_class(settings.model).from_pretrained(
                    settings.model,
                    device_map="meta",
                    torch_dtype=torch.bfloat16,
                    trust_remote_code=True,
                )
                footprint_bytes = meta_model.get_memory_footprint()
                footprint_gb = footprint_bytes / (1024**3)
                print(
                    f"[yellow]Estimated RAM required (excluding overhead): [bold]~{footprint_gb:.2f} GB[/][/]"
                )
        except Exception:
            # Fallback if meta loading fails (e.g. owing to custom model code
            # or bitsandbytes quantization config issues on the meta device).
            print(
                "[yellow]Rule of thumb: You need approximately 3x the parameter count in GB RAM.[/]"
            )
            print(
                "[yellow]Example: A 27B model requires ~80GB RAM. A 70B model requires ~200GB RAM.[/]"
            )
        print()
        strategy = prompt_select(
            "How do you want to proceed?",
            choices=[
                Choice(
                    title="Merge LoRA into full model"
                    + (
                        ""
                        if settings.quantization == QuantizationMethod.NONE
                        else " (requires sufficient RAM)"
                    ),
                    value="merge",
                ),
                Choice(
                    title="Cancel",
                    value="cancel",
                ),
            ],
        )
        if strategy == "cancel":
            return None
        return strategy
    else:
        return "merge"
 def run():
    # Enable expandable segments to reduce memory fragmentation on multi-GPU setups.
    if (
        "PYTORCH_ALLOC_CONF" not in os.environ
        and "PYTORCH_CUDA_ALLOC_CONF" not in os.environ
    ):
        os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
    # Modified "Pagga" font from https://budavariam.github.io/asciiart-text/
    print(f"[cyan]█░█░█▀▀░█▀▄░█▀▀░▀█▀░█░█▀▀[/]  v{version('heretic-llm')}")
    print("[cyan]█▀█░█▀▀░█▀▄░█▀▀░░█░░█░█░░[/]")
@@ -52,17 +145,20 @@ def run():
    print()
    if (
-        # An odd number of arguments have been passed (argv[0] is the program name),
+        # There is at least one argument (argv[0] is the program name).
-        # so that after accounting for "--param VALUE" pairs, there is one left over.
+        len(sys.argv) > 1
-        len(sys.argv) % 2 == 0
+        # No model has been explicitly provided.
-        # The leftover argument is a parameter value rather than a flag (such as "--help").
+        and "--model" not in sys.argv
        # The last argument is a parameter value rather than a flag (such as "--help").
        and not sys.argv[-1].startswith("-")
    ):
        # Assume the last argument is the model.
        sys.argv.insert(-1, "--model")
    try:
-        settings = Settings()
+        # The required argument "model" must be provided by the user,
        # either on the command line or in the configuration file.
        settings = Settings()  # ty:ignore[missing-argument]
    except ValidationError as error:
        print(f"[red]Configuration contains [bold]{error.error_count()}[/] errors:[/]")
@@ -77,17 +173,34 @@ def run():
    # Adapted from https://github.com/huggingface/accelerate/blob/main/src/accelerate/commands/env.py
    if torch.cuda.is_available():
-        print(f"GPU type: [bold]{torch.cuda.get_device_name()}[/]")
+        count = torch.cuda.device_count()
        print(f"Detected [bold]{count}[/] CUDA device(s):")
        for i in range(count):
            print(f"* GPU {i}: [bold]{torch.cuda.get_device_name(i)}[/]")
    elif is_xpu_available():
-        print(f"XPU type: [bold]{torch.xpu.get_device_name()}[/]")
+        count = torch.xpu.device_count()
        print(f"Detected [bold]{count}[/] XPU device(s):")
        for i in range(count):
            print(f"* XPU {i}: [bold]{torch.xpu.get_device_name(i)}[/]")
    elif is_mlu_available():
-        print(f"MLU type: [bold]{torch.mlu.get_device_name()}[/]")
+        count = torch.mlu.device_count()  # ty:ignore[unresolved-attribute]
        print(f"Detected [bold]{count}[/] MLU device(s):")
        for i in range(count):
            print(f"* MLU {i}: [bold]{torch.mlu.get_device_name(i)}[/]")  # ty:ignore[unresolved-attribute]
    elif is_sdaa_available():
-        print(f"SDAA type: [bold]{torch.sdaa.get_device_name()}[/]")
+        count = torch.sdaa.device_count()  # ty:ignore[unresolved-attribute]
        print(f"Detected [bold]{count}[/] SDAA device(s):")
        for i in range(count):
            print(f"* SDAA {i}: [bold]{torch.sdaa.get_device_name(i)}[/]")  # ty:ignore[unresolved-attribute]
    elif is_musa_available():
-        print(f"MUSA type: [bold]{torch.musa.get_device_name()}[/]")
+        count = torch.musa.device_count()  # ty:ignore[unresolved-attribute]
        print(f"Detected [bold]{count}[/] MUSA device(s):")
        for i in range(count):
            print(f"* MUSA {i}: [bold]{torch.musa.get_device_name(i)}[/]")  # ty:ignore[unresolved-attribute]
    elif is_npu_available():
-        print(f"CANN version: [bold]{torch.version.cann}[/]")
+        print(f"NPU detected (CANN version: [bold]{torch.version.cann}[/])")  # ty:ignore[unresolved-attribute]
    elif torch.backends.mps.is_available():
        print("Detected [bold]1[/] MPS device (Apple Metal)")
    else:
        print(
            "[bold yellow]No GPU or other accelerator detected. Operations will be slow.[/]"
@@ -113,16 +226,101 @@ def run():
    # Silence the warning about multivariate TPE being experimental.
    warnings.filterwarnings("ignore", category=ExperimentalWarning)
    os.makedirs(settings.study_checkpoint_dir, exist_ok=True)
    study_checkpoint_file = os.path.join(
        settings.study_checkpoint_dir,
        "".join(
            [(c if (c.isalnum() or c in ["_", "-"]) else "--") for c in settings.model]
        )
        + ".jsonl",
    )
    lock_obj = JournalFileOpenLock(study_checkpoint_file)
    backend = JournalFileBackend(study_checkpoint_file, lock_obj=lock_obj)
    storage = JournalStorage(backend)
    try:
        existing_study = storage.get_all_studies()[0]
    except IndexError:
        existing_study = None
    if existing_study is not None and settings.evaluate_model is None:
        choices = []
        if existing_study.user_attrs["finished"]:
            print()
            print(
                (
                    "[green]You have already processed this model.[/] "
                    "You can show the results from the previous run, allowing you to export models or to run additional trials. "
                    "Alternatively, you can ignore the previous run and start from scratch. "
                    "This will delete the checkpoint file and all results from the previous run."
                )
            )
            choices.append(
                Choice(
                    title="Show the results from the previous run",
                    value="continue",
                )
            )
        else:
            print()
            print(
                (
                    "[yellow]You have already processed this model, but the run was interrupted.[/] "
                    "You can continue the previous run from where it stopped. This will override any specified settings. "
                    "Alternatively, you can ignore the previous run and start from scratch. "
                    "This will delete the checkpoint file and all results from the previous run."
                )
            )
            choices.append(
                Choice(
                    title="Continue the previous run",
                    value="continue",
                )
            )
        choices.append(
            Choice(
                title="Ignore the previous run and start from scratch",
                value="restart",
            )
        )
        choices.append(
            Choice(
                title="Exit program",
                value="",
            )
        )
        print()
        choice = prompt_select("How would you like to proceed?", choices)
        if choice == "continue":
            settings = Settings.model_validate_json(
                existing_study.user_attrs["settings"]
            )
        elif choice == "restart":
            os.unlink(study_checkpoint_file)
            backend = JournalFileBackend(study_checkpoint_file, lock_obj=lock_obj)
            storage = JournalStorage(backend)
        elif choice is None or choice == "":
            return
    model = Model(settings)
    print()
    print_memory_usage()
    print()
    print(f"Loading good prompts from [bold]{settings.good_prompts.dataset}[/]...")
-    good_prompts = load_prompts(settings.good_prompts)
+    good_prompts = load_prompts(settings, settings.good_prompts)
    print(f"* [bold]{len(good_prompts)}[/] prompts loaded")
    print()
    print(f"Loading bad prompts from [bold]{settings.bad_prompts.dataset}[/]...")
-    bad_prompts = load_prompts(settings.bad_prompts)
+    bad_prompts = load_prompts(settings, settings.bad_prompts)
    print(f"* [bold]{len(bad_prompts)}[/] prompts loaded")
    if settings.batch_size == 0:
@@ -171,13 +369,44 @@ def run():
        settings.batch_size = best_batch_size
        print(f"* Chosen batch size: [bold]{settings.batch_size}[/]")
    print()
    print("Checking for common response prefix...")
    responses = model.get_responses_batched(good_prompts[:100] + bad_prompts[:100])
    # Despite being located in os.path, commonprefix actually performs
    # a naive string operation without any path-specific logic,
    # which is exactly what we need here. Trailing spaces are removed
    # to avoid issues where multiple different tokens that all start
    # with a space character lead to the common prefix ending with
    # a space, which would result in an uncommon tokenization.
    model.response_prefix = commonprefix(responses).rstrip(" ")
    # Suppress CoT output.
    if model.response_prefix.startswith("<think>"):
        # Most thinking models.
        model.response_prefix = "<think></think>"
    elif model.response_prefix.startswith("<|channel|>analysis<|message|>"):
        # gpt-oss.
        model.response_prefix = "<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>final<|message|>"
    elif model.response_prefix.startswith("<thought>"):
        # Unknown, suggested by user.
        model.response_prefix = "<thought></thought>"
    elif model.response_prefix.startswith("[THINK]"):
        # Unknown, suggested by user.
        model.response_prefix = "[THINK][/THINK]"
    if model.response_prefix:
        print(f"* Prefix found: [bold]{model.response_prefix!r}[/]")
    else:
        print("* None found")
    evaluator = Evaluator(settings, model)
    if settings.evaluate_model is not None:
        print()
        print(f"Loading model [bold]{settings.evaluate_model}[/]...")
        settings.model = settings.evaluate_model
-        model.reload_model()
+        model.reset_model()
        print("* Evaluating...")
        evaluator.get_score()
        return
@@ -188,13 +417,37 @@ def run():
    good_residuals = model.get_residuals_batched(good_prompts)
    print("* Obtaining residuals for bad prompts...")
    bad_residuals = model.get_residuals_batched(bad_prompts)
-    refusal_directions = F.normalize(
+
-        bad_residuals.mean(dim=0) - good_residuals.mean(dim=0),
+    good_means = good_residuals.mean(dim=0)
-        p=2,
+    bad_means = bad_residuals.mean(dim=0)
-        dim=1,
+
    refusal_directions = F.normalize(bad_means - good_means, p=2, dim=1)
    if settings.orthogonalize_direction:
        # Implements https://huggingface.co/blog/grimjim/projected-abliteration
        # Adjust the refusal directions so that only the component that is
        # orthogonal to the good direction is subtracted during abliteration.
        good_directions = F.normalize(good_means, p=2, dim=1)
        projection_vector = torch.sum(refusal_directions * good_directions, dim=1)
        refusal_directions = (
            refusal_directions - projection_vector.unsqueeze(1) * good_directions
        )
        refusal_directions = F.normalize(refusal_directions, p=2, dim=1)
    analyzer = Analyzer(settings, model, good_residuals, bad_residuals)
    if settings.print_residual_geometry:
        analyzer.print_residual_geometry()
    if settings.plot_residuals:
        analyzer.plot_residuals()
    # We don't need the residuals after computing refusal directions.
    del good_residuals, bad_residuals, analyzer
    empty_cache()
    trial_index = 0
    start_index = 0
    start_time = time.perf_counter()
    def objective(trial: Trial) -> tuple[float, float]:
@@ -210,6 +463,8 @@ def run():
            ],
        )
        last_layer_index = len(model.get_layers()) - 1
        # Discrimination between "harmful" and "harmless" inputs is usually strongest
        # in layers slightly past the midpoint of the layer stack. See the original
        # abliteration paper (https://arxiv.org/abs/2406.11717) for a deeper analysis.
@@ -219,8 +474,8 @@ def run():
        # work with conditional or variable-range parameters.
        direction_index = trial.suggest_float(
            "direction_index",
-            0.4 * (len(model.get_layers()) - 1),
+            0.4 * last_layer_index,
-            0.9 * (len(model.get_layers()) - 1),
+            0.9 * last_layer_index,
        )
        if direction_scope == "per layer":
@@ -239,8 +494,8 @@ def run():
            )
            max_weight_position = trial.suggest_float(
                f"{component}.max_weight_position",
-                0.6 * (len(model.get_layers()) - 1),
+                0.6 * last_layer_index,
-                len(model.get_layers()) - 1,
+                1.0 * last_layer_index,
            )
            # For sampling purposes, min_weight is expressed as a fraction of max_weight,
            # again because multivariate TPE doesn't support variable-range parameters.
@@ -253,7 +508,7 @@ def run():
            min_weight_distance = trial.suggest_float(
                f"{component}.min_weight_distance",
                1.0,
-                0.6 * (len(model.get_layers()) - 1),
+                0.6 * last_layer_index,
            )
            parameters[component] = AbliterationParameters(
@@ -264,7 +519,7 @@ def run():
            )
        trial.set_user_attr("direction_index", direction_index)
-        trial.set_user_attr("parameters", parameters)
+        trial.set_user_attr("parameters", {k: asdict(v) for k, v in parameters.items()})
        print()
        print(
@@ -273,15 +528,15 @@ def run():
        print("* Parameters:")
        for name, value in get_trial_parameters(trial).items():
            print(f"  * {name} = [bold]{value}[/]")
-        print("* Reloading model...")
+        print("* Resetting model...")
-        model.reload_model()
+        model.reset_model()
        print("* Abliterating...")
        model.abliterate(refusal_directions, direction_index, parameters)
        print("* Evaluating...")
        score, kl_divergence, refusals = evaluator.get_score()
        elapsed_time = time.perf_counter() - start_time
-        remaining_time = (elapsed_time / trial_index) * (
+        remaining_time = (elapsed_time / (trial_index - start_index)) * (
            settings.n_trials - trial_index
        )
        print()
@@ -290,12 +545,21 @@ def run():
            print(
                f"[grey50]Estimated remaining time: [bold]{format_duration(remaining_time)}[/][/]"
            )
        print_memory_usage()
        trial.set_user_attr("kl_divergence", kl_divergence)
        trial.set_user_attr("refusals", refusals)
        return score
    def objective_wrapper(trial: Trial) -> tuple[float, float]:
        try:
            return objective(trial)
        except KeyboardInterrupt:
            # Stop the study gracefully on Ctrl+C.
            trial.study.stop()
            raise TrialPruned()
    study = optuna.create_study(
        sampler=TPESampler(
            n_startup_trials=settings.n_startup_trials,
@@ -303,21 +567,69 @@ def run():
            multivariate=True,
        ),
        directions=[StudyDirection.MINIMIZE, StudyDirection.MINIMIZE],
        storage=storage,
        study_name="heretic",
        load_if_exists=True,
    )
-    study.optimize(objective, n_trials=settings.n_trials)
+    study.set_user_attr("settings", settings.model_dump_json())
    study.set_user_attr("finished", False)
-    best_trials = sorted(
+    def count_completed_trials() -> int:
-        study.best_trials,
+        # Count number of complete trials to compute trials to run.
-        key=lambda trial: trial.user_attrs["refusals"],
+        return sum([(1 if t.state == TrialState.COMPLETE else 0) for t in study.trials])
    start_index = trial_index = count_completed_trials()
    if start_index > 0:
        print()
        print("Resuming existing study.")
    try:
        study.optimize(
            objective_wrapper,
            n_trials=settings.n_trials - count_completed_trials(),
        )
    except KeyboardInterrupt:
        # This additional handler takes care of the small chance that KeyboardInterrupt
        # is raised just between trials, which wouldn't be caught by the handler
        # defined in objective_wrapper above.
        pass
    if count_completed_trials() == settings.n_trials:
        study.set_user_attr("finished", True)
    while True:
        # If no trials at all have been evaluated, the study must have been stopped
        # by pressing Ctrl+C while the first trial was running. In this case, we just
        # re-raise the interrupt to invoke the standard handler defined below.
        completed_trials = [t for t in study.trials if t.state == TrialState.COMPLETE]
        if not completed_trials:
            raise KeyboardInterrupt
        # Get the Pareto front of trials. We can't use study.best_trials directly
        # as get_score() doesn't return the pure KL divergence and refusal count.
        # Note: Unlike study.best_trials, this does not handle objective constraints.
        sorted_trials = sorted(
            completed_trials,
            key=lambda trial: (
                trial.user_attrs["refusals"],
                trial.user_attrs["kl_divergence"],
            ),
        )
        min_divergence = math.inf
        best_trials = []
        for trial in sorted_trials:
            kl_divergence = trial.user_attrs["kl_divergence"]
            if kl_divergence < min_divergence:
                min_divergence = kl_divergence
                best_trials.append(trial)
        choices = [
            Choice(
                title=(
                    f"[Trial {trial.user_attrs['index']:>3}] "
                    f"Refusals: {trial.user_attrs['refusals']:>2}/{len(evaluator.bad_prompts)}, "
-                f"KL divergence: {trial.user_attrs['kl_divergence']:.2f}"
+                    f"KL divergence: {trial.user_attrs['kl_divergence']:.4f}"
                ),
                value=trial,
            )
@@ -326,7 +638,14 @@ def run():
        choices.append(
            Choice(
-            title="None (exit program)",
+                title="Run additional trials",
                value="continue",
            )
        )
        choices.append(
            Choice(
                title="Exit program",
                value="",
            )
        )
@@ -345,40 +664,77 @@ def run():
        while True:
            print()
-        trial = questionary.select(
+            trial = prompt_select("Which trial do you want to use?", choices)
            "Which trial do you want to use?",
            choices=choices,
            style=Style([("highlighted", "reverse")]),
        ).ask()
-        if trial is None or trial == "":
+            if trial == "continue":
                while True:
                    try:
                        n_additional_trials = prompt_text(
                            "How many additional trials do you want to run?"
                        )
                        if n_additional_trials is None or n_additional_trials == "":
                            n_additional_trials = 0
                            break
                        n_additional_trials = int(n_additional_trials)
                        if n_additional_trials > 0:
                            break
                        print("[red]Please enter a number greater than 0.[/]")
                    except ValueError:
                        print("[red]Please enter a number.[/]")
                if n_additional_trials == 0:
                    continue
                settings.n_trials += n_additional_trials
                study.set_user_attr("settings", settings.model_dump_json())
                study.set_user_attr("finished", False)
                try:
                    study.optimize(
                        objective_wrapper,
                        n_trials=settings.n_trials - count_completed_trials(),
                    )
                except KeyboardInterrupt:
                    pass
                if count_completed_trials() == settings.n_trials:
                    study.set_user_attr("finished", True)
                break
            elif trial is None or trial == "":
                return
            print()
            print(f"Restoring model from trial [bold]{trial.user_attrs['index']}[/]...")
-        print("* Reloading model...")
+            print("* Parameters:")
-        model.reload_model()
+            for name, value in get_trial_parameters(trial).items():
                print(f"  * {name} = [bold]{value}[/]")
            print("* Resetting model...")
            model.reset_model()
            print("* Abliterating...")
            model.abliterate(
                refusal_directions,
                trial.user_attrs["direction_index"],
-            trial.user_attrs["parameters"],
+                {
                    k: AbliterationParameters(**v)
                    for k, v in trial.user_attrs["parameters"].items()
                },
            )
            while True:
                print()
-            action = questionary.select(
+                action = prompt_select(
                    "What do you want to do with the decensored model?",
-                choices=[
+                    [
                        "Save the model to a local folder",
                        "Upload the model to Hugging Face",
                        "Chat with the model",
-                    "Nothing (return to trial selection menu)",
+                        "Return to the trial selection menu",
                    ],
-                style=Style([("highlighted", "reverse")]),
+                )
            ).ask()
-            if action is None or action == "Nothing (return to trial selection menu)":
+                if action is None or action == "Return to the trial selection menu":
                    break
                # All actions are wrapped in a try/except block so that if an error occurs,
@@ -387,13 +743,25 @@ def run():
                try:
                    match action:
                        case "Save the model to a local folder":
-                        save_directory = questionary.path("Path to the folder:").ask()
+                            save_directory = prompt_path("Path to the folder:")
                            if not save_directory:
                                continue
-                        print("Saving model...")
+                            strategy = obtain_merge_strategy(settings)
                            if strategy is None:
                                continue
                            if strategy == "adapter":
                                print("Saving LoRA adapter...")
                                model.model.save_pretrained(save_directory)
                            else:
                                print("Saving merged model...")
                                merged_model = model.get_merged_model()
                                merged_model.save_pretrained(save_directory)
                                del merged_model
                                empty_cache()
                                model.tokenizer.save_pretrained(save_directory)
                            print(f"Model saved to [bold]{save_directory}[/].")
                        case "Upload the model to Hugging Face":
@@ -402,39 +770,53 @@ def run():
                            # it's better to not persist credentials.
                            token = huggingface_hub.get_token()
                            if not token:
-                            token = questionary.password(
+                                token = prompt_password("Hugging Face access token:")
                                "Hugging Face access token:"
                            ).ask()
                            if not token:
                                continue
                            user = huggingface_hub.whoami(token)
-                        print(
+                            fullname = user.get(
-                            f"Logged in as [bold]{user['fullname']} ({user['email']})[/]"
+                                "fullname",
                                user.get("name", "unknown user"),
                            )
                            email = user.get("email", "no email found")
                            print(f"Logged in as [bold]{fullname} ({email})[/]")
-                        repo_id = questionary.text(
+                            repo_id = prompt_text(
                                "Name of repository:",
                                default=f"{user['name']}/{Path(settings.model).name}-heretic",
-                        ).ask()
+                            )
-                        visibility = questionary.select(
+                            visibility = prompt_select(
                                "Should the repository be public or private?",
-                            choices=[
+                                [
                                    "Public",
                                    "Private",
                                ],
-                            style=Style([("highlighted", "reverse")]),
+                            )
                        ).ask()
                            private = visibility == "Private"
-                        print("Uploading model...")
+                            strategy = obtain_merge_strategy(settings)
                            if strategy is None:
                                continue
                            if strategy == "adapter":
                                print("Uploading LoRA adapter...")
                                model.model.push_to_hub(
                                    repo_id,
                                    private=private,
                                    token=token,
                                )
                            else:
                                print("Uploading merged model...")
                                merged_model = model.get_merged_model()
                                merged_model.push_to_hub(
                                    repo_id,
                                    private=private,
                                    token=token,
                                )
                                del merged_model
                                empty_cache()
                                model.tokenizer.push_to_hub(
                                    repo_id,
                                    private=private,
@@ -479,17 +861,20 @@ def run():
                            while True:
                                try:
-                                message = questionary.text(
+                                    message = prompt_text(
                                        "User:",
                                        qmark=">",
-                                ).unsafe_ask()
+                                        unsafe=True,
                                    )
                                    if not message:
                                        break
                                    chat.append({"role": "user", "content": message})
                                    print("[bold]Assistant:[/] ", end="")
                                    response = model.stream_chat_response(chat)
-                                chat.append({"role": "assistant", "content": response})
+                                    chat.append(
                                        {"role": "assistant", "content": response}
                                    )
                                except (KeyboardInterrupt, EOFError):
                                    # Ctrl+C/Ctrl+D
                                    break
@@ -1,26 +1,47 @@
 # SPDX-License-Identifier: AGPL-3.0-or-later
-# Copyright (C) 2025  Philipp Emanuel Weidmann <pew@worldwidemann.com>
+# Copyright (C) 2025-2026  Philipp Emanuel Weidmann <pew@worldwidemann.com> + contributors
 import math
 from contextlib import suppress
 from dataclasses import dataclass
-from typing import Any
+from typing import Any, Type, cast
 import bitsandbytes as bnb
 import torch
 import torch.linalg as LA
 import torch.nn.functional as F
-from torch import LongTensor, Tensor
+from peft import LoraConfig, PeftModel, get_peft_model
-from torch.nn import ModuleList
+from peft.tuners.lora.layer import Linear
 from torch import FloatTensor, LongTensor, Tensor
 from torch.nn import Module, ModuleList
 from transformers import (
    AutoModelForCausalLM,
    AutoModelForImageTextToText,
    AutoTokenizer,
    BatchEncoding,
    BitsAndBytesConfig,
    PretrainedConfig,
    PreTrainedModel,
    PreTrainedTokenizerBase,
    TextStreamer,
 )
-from transformers.generation.utils import GenerateOutput
+from transformers.generation import (
    GenerateDecoderOnlyOutput,  # ty:ignore[possibly-missing-import]
 )
-from .config import Settings
+from .config import QuantizationMethod, RowNormalization, Settings
-from .utils import batchify, empty_cache, print
+from .utils import Prompt, batchify, empty_cache, print
 def get_model_class(
    model: str,
 ) -> Type[AutoModelForImageTextToText] | Type[AutoModelForCausalLM]:
    configs = PretrainedConfig.get_config_dict(model)
    if any([("vision_config" in config) for config in configs]):
        return AutoModelForImageTextToText
    else:
        return AutoModelForCausalLM
@dataclass
@@ -32,122 +53,328 @@ class AbliterationParameters:
 class Model:
    model: PreTrainedModel | PeftModel
    tokenizer: PreTrainedTokenizerBase
    peft_config: LoraConfig
    def __init__(self, settings: Settings):
        self.settings = settings
        self.response_prefix = ""
        self.needs_reload = False
        print()
        print(f"Loading model [bold]{settings.model}[/]...")
-        self.tokenizer: PreTrainedTokenizerBase = AutoTokenizer.from_pretrained(
+        self.tokenizer = AutoTokenizer.from_pretrained(
-            settings.model
+            settings.model,
            trust_remote_code=settings.trust_remote_code,
        )
        # Fallback for tokenizers that don't declare a special pad token.
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        # CRITICAL: Always use left-padding for decoder-only models during generation.
        #           Right-padding causes empty outputs because the model sees PAD tokens
        #           after the prompt and thinks the sequence is complete.
        self.tokenizer.padding_side = "left"
-        self.model = None
+        self.model = None  # ty:ignore[invalid-assignment]
        self.max_memory = (
            {int(k) if k.isdigit() else k: v for k, v in settings.max_memory.items()}
            if settings.max_memory
            else None
        )
        self.trusted_models = {settings.model: settings.trust_remote_code}
        if self.settings.evaluate_model is not None:
            self.trusted_models[settings.evaluate_model] = settings.trust_remote_code
        for dtype in settings.dtypes:
            print(f"* Trying dtype [bold]{dtype}[/]... ", end="")
            try:
-                self.model = AutoModelForCausalLM.from_pretrained(
+                quantization_config = self._get_quantization_config(dtype)
                extra_kwargs = {}
                # Only include quantization_config if it's not None
                # (some models like gpt-oss have issues with explicit None).
                if quantization_config is not None:
                    extra_kwargs["quantization_config"] = quantization_config
                self.model = get_model_class(settings.model).from_pretrained(
                    settings.model,
                    dtype=dtype,
                    device_map=settings.device_map,
                    max_memory=self.max_memory,
                    trust_remote_code=self.trusted_models.get(settings.model),
                    **extra_kwargs,
                )
                # If we reach this point and the model requires trust_remote_code,
                # either the user accepted, or settings.trust_remote_code is True.
                if self.trusted_models.get(settings.model) is None:
                    self.trusted_models[settings.model] = True
                # A test run can reveal dtype-related problems such as the infamous
                # "RuntimeError: probability tensor contains either `inf`, `nan` or element < 0"
                # (https://github.com/meta-llama/llama/issues/380).
-                self.generate(["Test"], max_new_tokens=1)
+                self.generate(
                    [
                        Prompt(
                            system=settings.system_prompt,
                            user="What is 1+1?",
                        )
                    ],
                    max_new_tokens=1,
                )
            except Exception as error:
-                self.model = None
+                self.model = None  # ty:ignore[invalid-assignment]
                empty_cache()
                print(f"[red]Failed[/] ({error})")
                continue
            if settings.quantization == QuantizationMethod.BNB_4BIT:
                print("[green]Ok[/] (quantized to 4-bit precision)")
            else:
                print("[green]Ok[/]")
            break
        if self.model is None:
            raise Exception("Failed to load model with all configured dtypes.")
        self._apply_lora()
        # LoRA B matrices are initialized to zero by default in PEFT,
        # so we don't need to do anything manually.
        print(f"* Transformer model with [bold]{len(self.get_layers())}[/] layers")
        print("* Abliterable components:")
-        for component, matrices in self.get_layer_matrices(0).items():
+        for component, modules in self.get_layer_modules(0).items():
            print(
-                f"  * [bold]{component}[/]: [bold]{len(matrices)}[/] matrices per layer"
+                f"  * [bold]{component}[/]: [bold]{len(modules)}[/] modules per layer"
            )
-    def reload_model(self):
+    def _apply_lora(self):
        # Guard against calling this method at the wrong time.
        assert isinstance(self.model, PreTrainedModel)
        # Always use LoRA adapters for abliteration (faster reload, no weight modification).
        # We use the leaf names (e.g. "o_proj") as target modules.
        # This may cause LoRA adapters to be attached to unrelated modules (e.g. "conv.o_proj"),
        # but this is harmless as we only abliterate the modules we target in `abliterate()`,
        # leaving the others at their default (identity) state.
        # NOTE: This will need to be updated when hybrid layer support (#43) is merged.
        target_modules = [
            comp.split(".")[-1] for comp in self.get_abliterable_components()
        ]
        if self.settings.row_normalization != RowNormalization.FULL:
            # Rank 1 is sufficient for directional ablation without renormalization.
            lora_rank = 1
        else:
            # Row magnitude preservation introduces nonlinear effects.
            lora_rank = self.settings.full_normalization_lora_rank
        self.peft_config = LoraConfig(
            r=lora_rank,
            target_modules=target_modules,
            lora_alpha=lora_rank,  # Apply adapter at full strength.
            lora_dropout=0,
            bias="none",
            # Even if we're using AutoModelForImageTextToText, this is still correct,
            # as VL models are typically just causal LMs with an added image encoder.
            task_type="CAUSAL_LM",
        )
        # self.peft_config is a LoraConfig object rather than a dictionary,
        # so the result is a PeftModel rather than a PeftMixedModel.
        self.model = cast(PeftModel, get_peft_model(self.model, self.peft_config))
        print(f"* LoRA adapters initialized (targets: {', '.join(target_modules)})")
    def _get_quantization_config(self, dtype: str) -> BitsAndBytesConfig | None:
        """
        Creates quantization config based on settings.
        Args:
            dtype: The dtype string (e.g., "auto", "bfloat16")
        Returns:
            BitsAndBytesConfig or None
        """
        if self.settings.quantization == QuantizationMethod.BNB_4BIT:
            # BitsAndBytesConfig expects a torch.dtype, not a string.
            if dtype == "auto":
                compute_dtype = torch.bfloat16
            else:
                compute_dtype = getattr(torch, dtype)
            return BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=compute_dtype,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_use_double_quant=True,
            )
        return None
    def get_merged_model(self) -> PreTrainedModel:
        # Guard against calling this method at the wrong time.
        assert isinstance(self.model, PeftModel)
        # Check if we need special handling for quantized models
        if self.settings.quantization == QuantizationMethod.BNB_4BIT:
            # Quantized models need special handling - we must reload the base model
            # in full precision to merge the LoRA adapters
            # Get the adapter state dict before we do anything
            adapter_state = {}
            for name, param in self.model.named_parameters():
                if "lora_" in name:
                    adapter_state[name] = param.data.clone().cpu()
            # Load base model in full precision on CPU to avoid VRAM issues
            print("* Loading base model on CPU (this may take a while)...")
            base_model = get_model_class(self.settings.model).from_pretrained(
                self.settings.model,
                torch_dtype=self.model.dtype,
                device_map="cpu",
                trust_remote_code=self.trusted_models.get(self.settings.model),
            )
            # Apply LoRA adapters to the CPU model
            print("* Applying LoRA adapters...")
            peft_model = get_peft_model(base_model, self.peft_config)
            # Copy the trained adapter weights
            for name, param in peft_model.named_parameters():
                if name in adapter_state:
                    param.data = adapter_state[name].to(param.device)
            # Merge and unload
            print("* Merging LoRA adapters into base model...")
            merged_model = peft_model.merge_and_unload()
            return merged_model
        else:
            # Non-quantized model - can merge directly
            print("* Merging LoRA adapters into base model...")
            merged_model = self.model.merge_and_unload()
            # merge_and_unload() modifies self.model in-place, destroying LoRA adapters.
            # Mark for full reload if user switches trials later.
            self.needs_reload = True
            return merged_model
    def reset_model(self):
        """
        Resets the model to a clean state for the next trial or evaluation.
        Behavior:
        - Fast path: If the same model is loaded and doesn't need full reload,
          resets LoRA adapter weights to zero (identity transformation).
        - Slow path: If switching models or after merge_and_unload(),
          performs full model reload with quantization config.
        """
        current_model = getattr(self.model.config, "name_or_path", None)
        if current_model == self.settings.model and not self.needs_reload:
            # Reset LoRA adapters to zero (identity transformation)
            for name, module in self.model.named_modules():
                if "lora_B" in name and hasattr(module, "weight"):
                    torch.nn.init.zeros_(module.weight)
            return
        dtype = self.model.dtype
        # Purge existing model object from memory to make space.
-        self.model = None
+        self.model = None  # ty:ignore[invalid-assignment]
        empty_cache()
-        self.model = AutoModelForCausalLM.from_pretrained(
+        quantization_config = self._get_quantization_config(str(dtype).split(".")[-1])
        # Build kwargs, only include quantization_config if it's not None
        extra_kwargs = {}
        if quantization_config is not None:
            extra_kwargs["quantization_config"] = quantization_config
        self.model = get_model_class(self.settings.model).from_pretrained(
            self.settings.model,
            dtype=dtype,
            device_map=self.settings.device_map,
            max_memory=self.max_memory,
            trust_remote_code=self.trusted_models.get(self.settings.model),
            **extra_kwargs,
        )
        self._apply_lora()
        self.needs_reload = False
    def get_layers(self) -> ModuleList:
        model = self.model
        # Unwrap PeftModel (always true after _apply_lora)
        if isinstance(model, PeftModel):
            model = model.base_model.model
        # Most multimodal models.
        with suppress(Exception):
-            return self.model.model.language_model.layers
+            return model.model.language_model.layers
        # Text-only models.
-        return self.model.model.layers
+        return model.model.layers
-    def get_layer_matrices(self, layer_index: int) -> dict[str, list[Tensor]]:
+    def get_layer_modules(self, layer_index: int) -> dict[str, list[Module]]:
        layer = self.get_layers()[layer_index]
-        matrices = {}
+        modules = {}
-        def try_add(component: str, matrix: Any):
+        def try_add(component: str, module: Any):
-            assert torch.is_tensor(matrix)
+            # Only add if it's a proper nn.Module (PEFT can wrap these with LoRA)
-
+            if isinstance(module, Module):
-            if component not in matrices:
+                if component not in modules:
-                matrices[component] = []
+                    modules[component] = []
-
+                modules[component].append(module)
-            matrices[component].append(matrix)
+            else:
                # Assert for unexpected types (catches architecture changes)
                assert not isinstance(module, Tensor), (
                    f"Unexpected Tensor in {component} - expected nn.Module"
                )
        # Exceptions aren't suppressed here, because there is currently
        # no alternative location for the attention out-projection.
-        try_add("attn.o_proj", layer.self_attn.o_proj.weight)
+        try_add("attn.o_proj", layer.self_attn.o_proj)  # ty:ignore[possibly-missing-attribute]
        # Most dense models.
        with suppress(Exception):
-            try_add("mlp.down_proj", layer.mlp.down_proj.weight)
+            try_add("mlp.down_proj", layer.mlp.down_proj)  # ty:ignore[possibly-missing-attribute]
        # Some MoE models (e.g. Qwen3).
        with suppress(Exception):
-            for expert in layer.mlp.experts:
+            for expert in layer.mlp.experts:  # ty:ignore[possibly-missing-attribute, not-iterable]
-                try_add("mlp.down_proj", expert.down_proj.weight)
+                try_add("mlp.down_proj", expert.down_proj)  # ty:ignore[possibly-missing-attribute]
        # Phi-3.5-MoE (and possibly others).
        with suppress(Exception):
-            for expert in layer.block_sparse_moe.experts:
+            for expert in layer.block_sparse_moe.experts:  # ty:ignore[possibly-missing-attribute, not-iterable]
-                try_add("mlp.down_proj", expert.w2.weight)
+                try_add("mlp.down_proj", expert.w2)  # ty:ignore[possibly-missing-attribute]
-        # gpt-oss MoE.
+        # Granite MoE Hybrid - attention layers with shared_mlp.
        with suppress(Exception):
-            # The implementation of gpt-oss in Transformers differs from many other MoE models
+            try_add("mlp.down_proj", layer.shared_mlp.output_linear)  # ty:ignore[possibly-missing-attribute]
            # in that it stores the down-projections for all experts in a single 3D tensor,
            # but thanks to PyTorch's broadcasting magic, it all just works anyway.
            try_add("mlp.down_proj", layer.mlp.experts.down_proj)
-        # We need at least one MLP down-projection.
+        # Granite MoE Hybrid - MoE layers with experts.
-        assert matrices["mlp.down_proj"]
+        with suppress(Exception):
            for expert in layer.moe.experts:  # ty:ignore[possibly-missing-attribute, not-iterable]
                try_add("mlp.down_proj", expert.output_linear)  # ty:ignore[possibly-missing-attribute]
-        return matrices
+        # We need at least one module across all components for abliteration to work.
        total_modules = sum(len(mods) for mods in modules.values())
        assert total_modules > 0, "No abliterable modules found in layer"
        return modules
    def get_abliterable_components(self) -> list[str]:
-        return list(self.get_layer_matrices(0).keys())
+        return list(self.get_layer_modules(0).keys())
    def abliterate(
        self,
@@ -173,10 +400,11 @@ class Model:
        # Note that some implementations of abliteration also orthogonalize
        # the embedding matrix, but it's unclear if that has any benefits.
        for layer_index in range(len(self.get_layers())):
-            for component, matrices in self.get_layer_matrices(layer_index).items():
+            for component, modules in self.get_layer_modules(layer_index).items():
                params = parameters[component]
-                distance = abs(layer_index - params.max_weight_position)
+                # Type inference fails here for some reason.
                distance = cast(float, abs(layer_index - params.max_weight_position))
                # Don't orthogonalize layers that are more than
                # min_weight_distance away from max_weight_position.
@@ -196,36 +424,130 @@ class Model:
                else:
                    layer_refusal_direction = refusal_direction
-                # Projects any right-multiplied vector(s) onto the subspace
+                for module in modules:
-                # spanned by the refusal direction.
+                    # FIXME: This cast is potentially invalid, because the program logic
-                projector = torch.outer(
+                    #        does not guarantee that the module is of type Linear, and in fact
-                    layer_refusal_direction,
+                    #        the retrieved modules might not conform to the interface assumed
-                    layer_refusal_direction,
+                    #        below (though they do in practice). However, this is difficult
-                ).to(self.model.dtype)
+                    #        to fix cleanly, because get_layer_modules is called twice on
                    #        different model configurations, and PEFT employs different
                    #        module types depending on the chosen quantization.
                    module = cast(Linear, module)
-                for matrix in matrices:
+                    # LoRA abliteration: delta W = -lambda * v * (v^T W)
-                    # In-place subtraction is safe as we're not using Autograd.
+                    # lora_B = -lambda * v
-                    matrix.sub_(weight * (projector @ matrix))
+                    # lora_A = v^T W
-    def get_chat(self, prompt: str) -> list[dict[str, str]]:
+                    # Use the FP32 refusal direction directly (no downcast/upcast)
-        return [
+                    # and move to the correct device.
-            {"role": "system", "content": self.settings.system_prompt},
+                    v = layer_refusal_direction.to(module.weight.device)
-            {"role": "user", "content": prompt},
+
-        ]
+                    # Get W (dequantize if necessary).
                    #
                    # FIXME: This cast is valid only under the assumption that the original
                    #        module wrapped by the LoRA adapter has a weight attribute.
                    #        See the comment above for why this is currently not guaranteed.
                    base_weight = cast(Tensor, module.base_layer.weight)
                    quant_state = getattr(base_weight, "quant_state", None)
                    if quant_state is None:
                        W = base_weight.to(torch.float32)
                    else:
                        # 4-bit quantization.
                        # This cast is always valid. Type inference fails here because the
                        # bnb.functional module is not found by ty for some reason.
                        W = cast(
                            Tensor,
                            bnb.functional.dequantize_4bit(  # ty:ignore[possibly-missing-attribute]
                                base_weight.data,
                                quant_state,
                            ).to(torch.float32),
                        )
                    # Flatten weight matrix to (out_features, in_features).
                    W = W.view(W.shape[0], -1)
                    if self.settings.row_normalization != RowNormalization.NONE:
                        # Keep a reference to the original weight matrix so we can subtract it later.
                        W_org = W
                        # Get the row norms.
                        W_row_norms = LA.vector_norm(W, dim=1, keepdim=True)
                        # Normalize the weight matrix along the rows.
                        W = F.normalize(W, p=2, dim=1)
                    # Calculate lora_A = v^T W
                    # v is (d_out,), W is (d_out, d_in)
                    # v @ W -> (d_in,)
                    lora_A = (v @ W).view(1, -1)
                    # Calculate lora_B = -weight * v
                    # v is (d_out,)
                    lora_B = (-weight * v).view(-1, 1)
                    if self.settings.row_normalization == RowNormalization.PRE:
                        # Make the LoRA adapter apply to the original weight matrix.
                        lora_B = W_row_norms * lora_B
                    elif self.settings.row_normalization == RowNormalization.FULL:
                        # Approximates https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration
                        W = W + lora_B @ lora_A
                        # Normalize the adjusted weight matrix along the rows.
                        W = F.normalize(W, p=2, dim=1)
                        # Restore the original row norms of the weight matrix.
                        W = W * W_row_norms
                        # Subtract the original matrix to turn W into a delta.
                        W = W - W_org
                        # Use a low-rank SVD to get an approximation of the matrix.
                        r = self.peft_config.r
                        U, S, Vh = torch.svd_lowrank(W, q=2 * r + 4, niter=6)
                        # Truncate it to the part we want to store in the LoRA adapter.
                        # Note: svd_lowrank actually returns V, so transpose it to get Vh.
                        U = U[:, :r]
                        S = S[:r]
                        Vh = Vh[:, :r].T
                        # Transfer it into the LoRA adapter components. Split the singular values
                        # evenly between the two components to keep their norms balanced and avoid
                        # potential issues with numerical stability.
                        sqrt_S = torch.sqrt(S)
                        lora_B = U @ torch.diag(sqrt_S)
                        lora_A = torch.diag(sqrt_S) @ Vh
                    # Assign to adapters. The adapter name is "default", because that's
                    # what PEFT uses when no name is explicitly specified, as above.
                    # These casts are therefore valid.
                    weight_A = cast(Tensor, module.lora_A["default"].weight)
                    weight_B = cast(Tensor, module.lora_B["default"].weight)
                    weight_A.data = lora_A.to(weight_A.dtype)
                    weight_B.data = lora_B.to(weight_B.dtype)
    def generate(
        self,
-        prompts: list[str],
+        prompts: list[Prompt],
        **kwargs: Any,
-    ) -> tuple[BatchEncoding, GenerateOutput | LongTensor]:
+    ) -> tuple[BatchEncoding, GenerateDecoderOnlyOutput | LongTensor]:
-        chats = [self.get_chat(prompt) for prompt in prompts]
+        chats = [
            [
                {"role": "system", "content": prompt.system},
                {"role": "user", "content": prompt.user},
            ]
            for prompt in prompts
        ]
-        chat_prompts: list[str] = self.tokenizer.apply_chat_template(
+        # This cast is valid because list[str] is the return type
        # for batched operation with tokenize=False.
        chat_prompts = cast(
            list[str],
            self.tokenizer.apply_chat_template(
                chats,
                add_generation_prompt=True,
                tokenize=False,
            ),
        )
        if self.response_prefix:
            # Append the common response prefix to the prompts so that evaluation happens
            # at the point where responses start to differ for different prompts.
            chat_prompts = [prompt + self.response_prefix for prompt in chat_prompts]
        inputs = self.tokenizer(
            chat_prompts,
            return_tensors="pt",
@@ -233,35 +555,52 @@ class Model:
            return_token_type_ids=False,
        ).to(self.model.device)
-        return inputs, self.model.generate(
+        # FIXME: The type checker has been disabled here because of the extremely complex
        #        interplay between different generate() signatures and dynamic delegation.
        outputs = self.model.generate(
            **inputs,
            **kwargs,
-            pad_token_id=self.tokenizer.eos_token_id,
+            pad_token_id=self.tokenizer.pad_token_id,
            do_sample=False,  # Use greedy decoding to ensure deterministic outputs.
-        )
+        )  # ty:ignore[call-non-callable]
-    def get_responses(self, prompts: list[str]) -> list[str]:
+        return inputs, outputs
    def get_responses(
        self,
        prompts: list[Prompt],
        skip_special_tokens: bool = False,
    ) -> list[str]:
        inputs, outputs = self.generate(
            prompts,
            max_new_tokens=self.settings.max_response_length,
        )
        # Return only the newly generated part.
        return self.tokenizer.batch_decode(
-            outputs[:, inputs["input_ids"].shape[1] :],
+            # Extract the newly generated part.
-            skip_special_tokens=True,
+            # This cast is valid because the input_ids property is a Tensor
            # if the tokenizer is invoked with return_tensors="pt", as above.
            outputs[:, cast(Tensor, inputs["input_ids"]).shape[1] :],
            skip_special_tokens=skip_special_tokens,
        )
-    def get_responses_batched(self, prompts: list[str]) -> list[str]:
+    def get_responses_batched(
        self,
        prompts: list[Prompt],
        skip_special_tokens: bool = False,
    ) -> list[str]:
        responses = []
        for batch in batchify(prompts, self.settings.batch_size):
-            for response in self.get_responses(batch):
+            for response in self.get_responses(
                batch,
                skip_special_tokens=skip_special_tokens,
            ):
                responses.append(response)
        return responses
-    def get_residuals(self, prompts: list[str]) -> Tensor:
+    def get_residuals(self, prompts: list[Prompt]) -> Tensor:
        # We only generate one token, and we return the residual vectors
        # at that token position, for each prompt and layer.
        _, outputs = self.generate(
@@ -271,8 +610,13 @@ class Model:
            return_dict_in_generate=True,
        )
        # This cast is valid because GenerateDecoderOnlyOutput is the return type
        # of model.generate with return_dict_in_generate=True.
        outputs = cast(GenerateDecoderOnlyOutput, outputs)
        # Hidden states for the first (only) generated token.
-        hidden_states = outputs.hidden_states[0]
+        # This cast is valid because we passed output_hidden_states=True above.
        hidden_states = cast(tuple[tuple[FloatTensor]], outputs.hidden_states)[0]
        # The returned tensor has shape (prompt, layer, component).
        residuals = torch.stack(
@@ -285,9 +629,23 @@ class Model:
        # Upcast the data type to avoid precision (bfloat16) or range (float16)
        # problems during calculations involving residual vectors.
-        return residuals.to(torch.float32)
+        residuals = residuals.to(torch.float32)
-    def get_residuals_batched(self, prompts: list[str]) -> Tensor:
+        if 0 <= self.settings.winsorization_quantile < 1:
            # Apply symmetric winsorization to each layer of the per-prompt residuals.
            abs_residuals = torch.abs(residuals)
            # Get the (prompt, layer, 1) quantiles of the (prompt, layer, component) residuals.
            thresholds = torch.quantile(
                abs_residuals,
                self.settings.winsorization_quantile,
                dim=2,
                keepdim=True,
            )
            return torch.clamp(residuals, -thresholds, thresholds)
        return residuals
    def get_residuals_batched(self, prompts: list[Prompt]) -> Tensor:
        residuals = []
        for batch in batchify(prompts, self.settings.batch_size):
@@ -297,7 +655,7 @@ class Model:
    # We work with logprobs rather than probabilities for numerical stability
    # when computing the KL divergence.
-    def get_logprobs(self, prompts: list[str]) -> Tensor:
+    def get_logprobs(self, prompts: list[Prompt]) -> Tensor:
        # We only generate one token, and we return the (log) probability distributions
        # over the vocabulary at that token position, for each prompt.
        _, outputs = self.generate(
@@ -307,13 +665,18 @@ class Model:
            return_dict_in_generate=True,
        )
        # This cast is valid because GenerateDecoderOnlyOutput is the return type
        # of model.generate with return_dict_in_generate=True.
        outputs = cast(GenerateDecoderOnlyOutput, outputs)
        # Logits for the first (only) generated token.
-        logits = outputs.scores[0]
+        # This cast is valid because we passed output_scores=True above.
        logits = cast(tuple[FloatTensor], outputs.scores)[0]
        # The returned tensor has shape (prompt, token).
        return F.log_softmax(logits, dim=-1)
-    def get_logprobs_batched(self, prompts: list[str]) -> Tensor:
+    def get_logprobs_batched(self, prompts: list[Prompt]) -> Tensor:
        logprobs = []
        for batch in batchify(prompts, self.settings.batch_size):
@@ -322,10 +685,15 @@ class Model:
        return torch.cat(logprobs, dim=0)
    def stream_chat_response(self, chat: list[dict[str, str]]) -> str:
-        chat_prompt: str = self.tokenizer.apply_chat_template(
+        # This cast is valid because str is the return type
        # for single-chat operation with tokenize=False.
        chat_prompt = cast(
            str,
            self.tokenizer.apply_chat_template(
                chat,
                add_generation_prompt=True,
                tokenize=False,
            ),
        )
        inputs = self.tokenizer(
@@ -335,16 +703,21 @@ class Model:
        ).to(self.model.device)
        streamer = TextStreamer(
-            self.tokenizer,
+            # The TextStreamer constructor annotates this parameter with the AutoTokenizer
            # type, which makes no sense because AutoTokenizer is a factory class,
            # not a base class that tokenizers inherit from.
            self.tokenizer,  # ty:ignore[invalid-argument-type]
            skip_prompt=True,
            skip_special_tokens=True,
        )
        # FIXME: The type checker has been disabled here because of the extremely complex
        #        interplay between different generate() signatures and dynamic delegation.
        outputs = self.model.generate(
            **inputs,
            streamer=streamer,
            max_new_tokens=4096,
-        )
+        )  # ty:ignore[call-non-callable]
        return self.tokenizer.decode(
            outputs[0, inputs["input_ids"].shape[1] :],
@@ -1,11 +1,15 @@
 # SPDX-License-Identifier: AGPL-3.0-or-later
-# Copyright (C) 2025  Philipp Emanuel Weidmann <pew@worldwidemann.com>
+# Copyright (C) 2025-2026  Philipp Emanuel Weidmann <pew@worldwidemann.com> + contributors
 import gc
-from dataclasses import asdict
+import getpass
 import os
 from dataclasses import dataclass
 from importlib.metadata import version
-from typing import TypeVar
+from pathlib import Path
 from typing import Any, TypeVar
 import questionary
 import torch
 from accelerate.utils import (
    is_mlu_available,
@@ -13,8 +17,13 @@ from accelerate.utils import (
    is_sdaa_available,
    is_xpu_available,
 )
-from datasets import load_dataset
+from datasets import DatasetDict, ReadInstruction, load_dataset, load_from_disk
 from datasets.config import DATASET_STATE_JSON_FILENAME
 from datasets.download.download_manager import DownloadMode
 from datasets.utils.info_utils import VerificationMode
 from optuna import Trial
 from psutil import Process
 from questionary import Choice, Style
 from rich.console import Console
 from .config import DatasetSpecification, Settings
@@ -22,6 +31,116 @@ from .config import DatasetSpecification, Settings
 print = Console(highlight=False).print
 def print_memory_usage():
    def p(label: str, size_in_bytes: int):
        print(f"[grey50]{label}: [bold]{size_in_bytes / (1024**3):.2f} GB[/][/]")
    p("Resident system RAM", Process().memory_info().rss)
    if torch.cuda.is_available():
        p("Allocated GPU VRAM", torch.cuda.memory_allocated())
        p("Reserved GPU VRAM", torch.cuda.memory_reserved())
    elif is_xpu_available():
        p("Allocated XPU memory", torch.xpu.memory_allocated())
        p("Reserved XPU memory", torch.xpu.memory_reserved())
    elif torch.backends.mps.is_available():
        p("Allocated MPS memory", torch.mps.current_allocated_memory())
        p("Driver (reserved) MPS memory", torch.mps.driver_allocated_memory())
 def is_notebook() -> bool:
    # Check for specific environment variables (Colab, Kaggle).
    # This is necessary because when running as a subprocess (e.g. !heretic),
    # get_ipython() might not be available or might not reflect the notebook environment.
    if os.getenv("COLAB_GPU") or os.getenv("KAGGLE_KERNEL_RUN_TYPE"):
        return True
    # Check IPython shell type (for library usage).
    try:
        from IPython import get_ipython  # ty:ignore[unresolved-import]
        shell = get_ipython()
        if shell is None:
            return False
        shell_name = shell.__class__.__name__
        if shell_name in ["ZMQInteractiveShell", "Shell"]:
            return True
        if "google.colab" in str(shell.__class__):
            return True
        return False
    except (ImportError, NameError, AttributeError):
        return False
 def prompt_select(message: str, choices: list[Any]) -> Any:
    if is_notebook():
        print()
        print(message)
        real_choices = []
        for i, choice in enumerate(choices, 1):
            if isinstance(choice, Choice):
                print(f"[{i}] {choice.title}")
                real_choices.append(choice.value)
            else:
                print(f"[{i}] {choice}")
                real_choices.append(choice)
        while True:
            try:
                selection = input("Enter number: ")
                index = int(selection) - 1
                if 0 <= index < len(real_choices):
                    return real_choices[index]
                print(
                    f"[red]Please enter a number between 1 and {len(real_choices)}[/]"
                )
            except ValueError:
                print("[red]Invalid input. Please enter a number.[/]")
    else:
        return questionary.select(
            message,
            choices=choices,
            style=Style([("highlighted", "reverse")]),
        ).ask()
 def prompt_text(
    message: str,
    default: str = "",
    qmark: str = "?",
    unsafe: bool = False,
 ) -> str:
    if is_notebook():
        print()
        result = input(f"{message} [{default}]: " if default else f"{message}: ")
        return result if result else default
    else:
        question = questionary.text(message, default=default, qmark=qmark)
        if unsafe:
            return question.unsafe_ask()
        else:
            return question.ask()
 def prompt_path(message: str) -> str:
    if is_notebook():
        return prompt_text(message)
    else:
        return questionary.path(message, only_directories=True).ask()
 def prompt_password(message: str) -> str:
    if is_notebook():
        print()
        return getpass.getpass(message)
    else:
        return questionary.password(message).ask()
 def format_duration(seconds: float) -> str:
    seconds = round(seconds)
    hours, seconds = divmod(seconds, 3600)
@@ -35,9 +154,71 @@ def format_duration(seconds: float) -> str:
        return f"{seconds}s"
-def load_prompts(specification: DatasetSpecification) -> list[str]:
+@dataclass
-    dataset = load_dataset(specification.dataset, split=specification.split)
+class Prompt:
-    return list(dataset[specification.column])
+    system: str
    user: str
 def load_prompts(
    settings: Settings,
    specification: DatasetSpecification,
 ) -> list[Prompt]:
    path = specification.dataset
    split_str = specification.split
    if os.path.isdir(path):
        if Path(path, DATASET_STATE_JSON_FILENAME).exists():
            # Dataset saved with datasets.save_to_disk; needs special handling.
            # Path should be the subdirectory for a particular split.
            dataset = load_from_disk(path)
            assert not isinstance(dataset, DatasetDict), (
                "Loading dataset dicts is not supported"
            )
            # Parse the split instructions.
            instruction = ReadInstruction.from_spec(split_str)
            # Associate the split with its number of examples (lines).
            split_name = str(dataset.split)
            name2len = {split_name: len(dataset)}
            # Convert the instructions to absolute indices and select the first one.
            abs_instruction = instruction.to_absolute(name2len)[0]
            # Get the dataset by applying the indices.
            dataset = dataset[abs_instruction.from_ : abs_instruction.to]
        else:
            # Path is a local directory.
            dataset = load_dataset(
                path,
                split=split_str,
                # Don't require the number of examples (lines) per split to be pre-defined.
                verification_mode=VerificationMode.NO_CHECKS,
                # But also don't use cached data, as the dataset may have changed on disk.
                download_mode=DownloadMode.FORCE_REDOWNLOAD,
            )
    else:
        # Probably a repository path; let load_dataset figure it out.
        dataset = load_dataset(path, split=split_str)
    prompts = list(dataset[specification.column])
    if specification.prefix:
        prompts = [f"{specification.prefix} {prompt}" for prompt in prompts]
    if specification.suffix:
        prompts = [f"{prompt} {specification.suffix}" for prompt in prompts]
    system_prompt = (
        settings.system_prompt
        if specification.system_prompt is None
        else specification.system_prompt
    )
    return [
        Prompt(
            system=system_prompt,
            user=prompt,
        )
        for prompt in prompts
    ]
 T = TypeVar("T")
@@ -48,16 +229,23 @@ def batchify(items: list[T], batch_size: int) -> list[list[T]]:
 def empty_cache():
    # Collecting garbage is not an idempotent operation, and to avoid OOM errors,
    # gc.collect() has to be called both before and after emptying the backend cache.
    # See https://github.com/p-e-w/heretic/pull/17 for details.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    elif is_xpu_available():
        torch.xpu.empty_cache()
    elif is_mlu_available():
-        torch.mlu.empty_cache()
+        torch.mlu.empty_cache()  # ty:ignore[unresolved-attribute]
    elif is_sdaa_available():
-        torch.sdaa.empty_cache()
+        torch.sdaa.empty_cache()  # ty:ignore[unresolved-attribute]
    elif is_musa_available():
-        torch.musa.empty_cache()
+        torch.musa.empty_cache()  # ty:ignore[unresolved-attribute]
    elif torch.backends.mps.is_available():
        torch.mps.empty_cache()
    gc.collect()
@@ -71,7 +259,7 @@ def get_trial_parameters(trial: Trial) -> dict[str, str]:
    )
    for component, parameters in trial.user_attrs["parameters"].items():
-        for name, value in asdict(parameters).items():
+        for name, value in parameters.items():
            params[f"{component}.{name}"] = f"{value:.2f}"
    return params
@@ -81,7 +269,7 @@ def get_readme_intro(
    settings: Settings,
    trial: Trial,
    base_refusals: int,
-    bad_prompts: list[str],
+    bad_prompts: list[Prompt],
 ) -> str:
    model_link = f"[{settings.model}](https://huggingface.co/{settings.model})"
@@ -106,7 +294,7 @@ def get_readme_intro(
 | Metric | This model | Original model ({model_link}) |
 | :----- | :--------: | :---------------------------: |
-| **KL divergence** | {trial.user_attrs["kl_divergence"]:.2f} | 0 *(by definition)* |
+| **KL divergence** | {trial.user_attrs["kl_divergence"]:.4f} | 0 *(by definition)* |
 | **Refusals** | {trial.user_attrs["refusals"]}/{len(bad_prompts)} | {base_refusals}/{
        len(bad_prompts)
    } |
Author	SHA1	Message	Date
Philipp Emanuel Weidmann	27097bfe8e	build: bump version to 1.2.0	2026-02-14 18:11:42 +05:30
Philipp Emanuel Weidmann	025ab3a881	fix: disable LoRA export for now Workaround for #152	2026-02-14 16:56:12 +05:30
Philipp Emanuel Weidmann	1179013999	docs: update README	2026-02-14 16:32:08 +05:30
Philipp Emanuel Weidmann	fe7bc1bae3	docs: update README	2026-02-14 10:47:28 +05:30
Philipp Emanuel Weidmann	e70a1a85e8	fix: don't load checkpoint when evaluating a second model Fixes #144	2026-02-14 10:02:17 +05:30
Philipp Emanuel Weidmann	e7f8be98b7	fix: only export tokenizer when exporting full model Fixes #143	2026-02-14 09:18:22 +05:30
Philipp Emanuel Weidmann	6017bcd347	fix: use compatible release specifiers for non-dev dependencies Fixes #145 Credit to MuX on Discord for recognizing that this is an issue with Transformers 5	2026-02-13 12:27:57 +05:30
Philipp Emanuel Weidmann	dd0b3a2f69	docs: update README	2026-02-11 11:09:17 +05:30
Philipp Emanuel Weidmann	b873598b77	docs: improve settings documentation	2026-02-11 10:19:05 +05:30
Philipp Emanuel Weidmann	10ceb3098e	chore: update copyright notice	2026-02-11 09:46:36 +05:30
Salman Chishti	745b582414	ci: upgrade GitHub Actions to latest versions (#137 ) Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>	2026-02-08 16:49:04 +05:30
Salman Chishti	d0e9462fb8	ci: upgrade GitHub Actions for Node 24 compatibility (#136 ) Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>	2026-02-08 16:48:12 +05:30
Philipp Emanuel Weidmann	f68a887a7b	fix: improve code quality, improve UX, fix small bugs	2026-02-08 13:32:00 +05:30
Philipp Emanuel Weidmann	2690655a83	feat: print memory usage during run	2026-02-02 21:18:01 +05:30
Spiky Moth	3525b1ac22	Implement Magnitude-Preserving Orthogonal Ablation (#52 ) * feat: add support for winsorizing the residuals Adds setting winsorization_quantile, expressed as the quantile to clamp to. - If set to a value below 1, the residuals obtained from evaluating the first token of the good and bad prompts are winsorized - that is, values outside the given quantile are clamped. Note that winsorization_quantile = 0.95 corresponds to a 90% winsorization. * feat: implement magnitude-preserving orthogonal ablation Adds boolean setting orthogonalize_direction: - When enabled, only the component of the refusal directions that is orthogonal to the harmless direction is subtracted during abliteration. Adds enum-valued setting row_normalization: - 'none': No normalization. - 'pre': Row-normalize the weight matrix before computing the LoRA adapter. - 'full': Like 'pre', but re-normalizes to preserve original row magnitudes. * prefer 'good' and 'bad' over 'harmless' and 'harmful' * clarify how winsorization is applied * store and reuse full peft_config * remove unneeded cast * make LoRA rank configurable for full normalization * explain why the singular values are split across the components	2026-02-02 17:05:19 +05:30
anrp	42f5a9b553	fix: Use file instead of symlink lock (for windows) (#116 )	2026-01-25 19:34:01 +05:30
anrp	451db0b76e	fix: specify study name (#119 ) If we don't, optuna will generate a UUID for a name, which will never be found when loading as it is a "different" study. https://optuna.readthedocs.io/en/stable/reference/generated/optuna.study.create_study.html#optuna.study.create_study	2026-01-25 18:48:23 +05:30
anrp	ebc22c299e	feat: Allow study progress to be saved & resumed (#106 ) * feat: Store active study in log/study.jsonl and allow resuming * Simplify resume logic with load_if_exists=True * Significantly improve flexibility of study save/load * Put constructor arguments at the highest precedence * Review comments --------- Co-authored-by: Spiky Moth <spikymoth@pm.me>	2026-01-23 19:49:37 +05:30
anrp	d5c834c51d	fix: Allow abliterating VL models (#108 ) Per https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes, it indicates that "There is one class of AutoModel for each task." Use the presence of "vision_config" in the config.json to determine which.	2026-01-23 19:34:31 +05:30
anrp	c86f49035e	feat: Refactor save machinery and always allow user to save LoRA (#110 )	2026-01-20 18:53:47 +05:30
anrp	85a6ec5ecb	fix: Include kernels (allows MXFP4 to be loaded in MXFP4 instead of upcasting) (#107 ) Co-authored-by: Andrew Patrikalakis <anrp@tri.global>	2026-01-16 17:30:24 +05:30
Philipp Emanuel Weidmann	632b1da622	feat: add config file for slop reduction	2026-01-11 18:51:26 +05:30
Philipp Emanuel Weidmann	1cfd09d7f3	ci: add style guide for Gemini	2026-01-09 14:58:56 +05:30
Philipp Emanuel Weidmann	09be09e12e	fix: restore classification of empty responses as refusals Fixes #93	2026-01-02 16:50:02 +05:30
Philipp Emanuel Weidmann	039f6222d2	feat: allow overriding the system prompt per dataset	2025-12-31 14:26:44 +05:30
Philipp Emanuel Weidmann	c4b2ea0c42	feat: allow injecting prefixes and suffixes into prompts	2025-12-31 12:00:44 +05:30
Philipp Emanuel Weidmann	02a5237a02	feat: add option to print prompt/response pairs	2025-12-27 14:48:29 +05:30
Philipp Emanuel Weidmann	cf8cf6f349	fix: address remaining ty complaint	2025-12-22 11:12:45 +05:30
Philipp Emanuel Weidmann	2141e110fb	ci: treat ty warnings as errors	2025-12-22 10:57:36 +05:30
Philipp Emanuel Weidmann	39101137ef	ci: add type checking	2025-12-22 10:48:42 +05:30
Philipp Emanuel Weidmann	064bed9a9f	fix: resolve issues raised by ty A single issue has been deliberately left unfixed to verify that the CI check works	2025-12-22 10:24:55 +05:30
_Vinayyyy_	8d44b65670	feat: add continuous optimization option(latest changes updated) (#76 ) * fix: a little merge bug * refactor: simplify optimization loop based on feedback * fix: address review comments * fix: remove redundant check for study.best_trials * fix: restore comments --------- Co-authored-by: Vinay Umrethe <vinayumrethe99@gmail.com>	2025-12-20 18:57:57 +05:30
Philipp Emanuel Weidmann	5ddef6fd2f	feat: add more CoT templates Suggested by u/Chromix_ on Reddit	2025-12-20 17:12:46 +05:30
michaelh	92d0c0d551	feat: enumerate all available GPUs on startup (#86 ) * feat: enumerate all available GPUs on startup * feat: extend device enumeration to all accelerator types	2025-12-16 17:42:15 +05:30
michaelh	243f821d93	feat: Add 4-bit loading + LoRA support for low VRAM optimization (#60 ) * Add files via upload * perf: optimize abliteration matrix op (#46) * perf: optimize abliteration matrix op * refactor: comments and var names correspond with arditi * refactor: fix comments and improve var notation * fix: accidental line change and improve comments --------- Co-authored-by: mad-cat-lon <113548315+mad-cat-lon@users.noreply.github.com> * Fix line endings to LF * Add hybrid approach for GPT-OSS compatibility - Check for LoRA adapters before attempting LoRA abliteration - Fall back to direct weight modification for nn.Parameter (GPT-OSS) - Ensures compatibility across all model architectures * Fix projector bug, update print statement, revert README * Revert README changes to match upstream * Fix import sorting for ruff * Fix reload_model for evaluate_model, add type hints and validation * Apply ruff formatting * Replace load_in_4bit with quantization enum * Fix precision loss: use FP32 refusal direction directly * Move r assignment into non-LoRA path * Fix linting: apply ruff formatting * Add auto-merge for LoRA adapters on save/upload * Fix linting: apply ruff formatting * Implement CPU-based merge for 4-bit models with OOM fallback * Remove use_lora flag (LoRA always on), add user prompt for 4-bit export * Fix: PEFT target_modules expects module names without path prefix * Fix linting: apply ruff formatting * Add LoRA fallback and fix quantization_config handling - Add try/except around LoRA initialization with fallback to direct weight modification - Only pass quantization_config when not None (fixes gpt-oss loading) - Use simple forward pass instead of generate() for model test (avoids chat template issues) - Reset non-LoRA models by reloading in reload_model() - Check self.use_lora before accessing LoRA adapters in abliterate() * Add 8-bit quantization support via bitsandbytes - Add BNB_8BIT option to QuantizationMethod enum - Add --load-in-8bit CLI support (auto via pydantic-settings) - Update documentation in config.py and config.default.toml - Useful for mid-range VRAM (12-16 GB) as balance between memory and numeric stability * Improve LoRA merge warning and fix linting * Apply final ruff formatting * Fix CI: apply ruff import sorting * Use tiny model for CI efficiency * Fix import sorting in test_lora.py * Fix formatting in test_lora.py * feat: Show merge warning for all models (requires high RAM) * style: Apply ruff fixes * Fix undefined Style import in main.py * Fix(model): Support MoE/3D tensors and enforce dtype safety in abliterate * Fix(ci): Format model.py with ruff * Fix(main): Remove invalid style argument from prompt_select and unused import * Fix logic errors, memory leak, and redundant merges in main.py * Fix linting and formatting issues (isort, ruff) * chore: Simplify .gitattributes as requested * refactor: Remove defensive try-except around LoRA initialization * chore: Update uv.lock with peft and bitsandbytes * chore: Regenerate uv.lock to include missing peft dependency * style: Fix import sorting (isort) for CI compliance * style: Simplify .gitattributes to single line as requested * Address PR #60 feedback: Remove caching, fix LoRA reload, global LoRA usage, style fixes * Address PR review comments: clarify code, fix quantization, rename method - Add explanatory comments for warning suppression and gc behavior - Remove redundant gc.collect() calls (empty_cache handles it) - Fix output message order (ask merge strategy before 'Uploading...') - Add comment explaining 8-bit quantization doesn't need compute_dtype - Remove extra newline after dtype comment - Add future-proofing note for hybrid layer support (#43) - Remove leftover comment in get_merged_model - Delete test_lora.py (debug script, not a real test) - Add comment explaining needs_reload flag purpose - Extract quantization config into _get_quantization_config() helper - Rename reload_model() to reset_model_for_trial() for clarity - Fix reload_model to respect quantization config (fixes evaluate_model bug) - Remove unused gc import * Restore gc.collect() before empty_cache() for large models * refactor: Remove LoRA fallback remnants, simplify code - Remove use_lora flag (always true since LoRA is always applied) - Remove isinstance(PeftModel) check in get_merged_model() (always true) - Simplify reset_model_for_trial() by removing defensive try/except - Remove redundant gc.collect() calls (empty_cache handles GC) - Remove unused gc import from main.py * Address p-e-w review feedback: rename reset_model, remove loaded_model_name, fix type hints, remove GPT-OSS MoE, update assertion * Restore skip logic for non-LoRA modules and fix 4-bit base_layer.weight access * Remove defensive lora_A check per review - get_layer_modules already filters * Fix try_add: nest component init inside Module check, add assert for unexpected types * Add note about module.weight assumption for type checking * Change 'Reloading model' to 'Resetting model' in logging --------- Co-authored-by: accemlcc <accemlcc@users.noreply.github.com> Co-authored-by: mad-cat-lon <113548315+mad-cat-lon@users.noreply.github.com> Co-authored-by: Hager <Michael.Hager@bruker.com>	2025-12-14 20:19:09 +05:30
Spiky Moth	9d1734855d	feat: avoid excessive low divergence iteration (#73 ) * feat: adjust scoring to avoid useless iteration Adjusts the scoring function to avoid targeting meaninglessly low KL divergences. Below a threshold value, the KL divergence score switches to the refusal count. Adds config option kl_divergence_target (defaulting to 0.01). * fix: Clean up parameter selection in objective Create variables for num_layers and last_layer_index * Improves readability and makes choices explicit * feat: Print the parameters of the selected model	2025-12-14 14:26:48 +05:30
George	740aab61ba	feat: add max_memory parameter to limit memory usage (#83 ) * add max_memory parameter to limit memory usage * Added to reload_model also * forgot to add self * Process max_memory once in __init__ and store it as an instance variable, then reuse it in both locations	2025-12-11 20:57:40 +05:30
Philipp Emanuel Weidmann	d9f2b0407a	build: bump version to 1.1.0	2025-12-10 16:54:03 +05:30
Philipp Emanuel Weidmann	ca783db6c9	docs: update README	2025-12-10 16:30:35 +05:30
Philipp Emanuel Weidmann	6acccac994	feat: add progress bars for plotting operations	2025-12-10 13:07:34 +05:30
Philipp Emanuel Weidmann	ac154a55a0	fix: suppress CoT output for thinking models Ref #75	2025-12-09 11:54:08 +05:30
Philipp Emanuel Weidmann	15781a8a0c	fix: skip common response prefix for thinking models Ref #75	2025-12-09 08:25:10 +05:30
Philipp Emanuel Weidmann	24c3aeb442	feat: turn boolean settings into CLI flags	2025-12-07 11:37:07 +05:30
Philipp Emanuel Weidmann	ffbde3ac2a	fix: follow up after recent PRs	2025-12-07 10:26:16 +05:30
Philipp Emanuel Weidmann	932d737edf	feat: add silhouette coefficient to residual geometry output	2025-12-07 08:48:38 +05:30
Philipp Emanuel Weidmann	1f5e977f4f	Revert "perf: optimize abliteration matrix op (#46 )" (#74 ) This reverts commit `60bd531fde`.	2025-12-07 06:30:37 +05:30
Philipp Emanuel Weidmann	da27ba8054	fix: always left-pad inputs, and avoid optimizing for empty responses Fixes #70 Co-authored-by: arnomatic <acc@eml.cc>	2025-12-06 06:31:09 +05:30
Philipp Emanuel Weidmann	baf5b0b0d1	feat: add geometric median to residual geometry output	2025-12-05 20:15:50 +05:30
Philipp Emanuel Weidmann	eeb28b28c1	feat: add option to plot residual vectors	2025-12-04 14:22:29 +05:30
red40maxxer	d836fb2da9	ci: add PR title lint (#66 ) * ci: add PR title lint * style: ending newline --------- Co-authored-by: mad-cat-lon <113548315+mad-cat-lon@users.noreply.github.com>	2025-12-03 09:25:48 +05:30
red40maxxer	60bd531fde	perf: optimize abliteration matrix op (#46 ) * perf: optimize abliteration matrix op * refactor: comments and var names correspond with arditi * refactor: fix comments and improve var notation * fix: accidental line change and improve comments --------- Co-authored-by: mad-cat-lon <113548315+mad-cat-lon@users.noreply.github.com>	2025-12-02 08:13:43 +05:30
Spiky Moth	1f74ac2888	Guard against refusals in broken English (#45 ) * Guard against refusals in broken English * Normalize whitespace between words	2025-11-26 11:29:08 +05:30
_Vinayyyy_	63fc0e7d5a	feat: Add bfloat16 to default dtypes list (#44 ) Co-authored-by: Vinay Umrethe <vinayumrethe99@gmail.com>	2025-11-25 12:22:52 +05:30
_Vinayyyy_	1efc4ee9e1	Featuring Notebook (Colab/Kaggle) Compatibility (#42 ) * feat: Add hybrid UI for notebook compatibility * Restore notebook detection logic * fix: Improve notebook detection with env vars * chore: cleanup * chore: cleanup * correct ruff format * refactor: Address code review feedback - Move password handling to prompt_password - Use only_directories=True for save path prompt - Simplify prompt_text arguments --------- Co-authored-by: Vinay Umrethe <vinayumrethe99@gmail.com>	2025-11-24 19:46:39 +05:30
Nikolai Kolodziej	452b35e7b7	Add `trust_remote_code` configuration option (#31 ) * Add `trust_remote_code` configuration option and apply it when loading models and tokenizers * Default `trust_remote_code` to `None` and set it to `True` if previously `None` so the user wouldn't be asked multiple times * Consistently access `trust_remote_code` from `self.settings` instead of the global `settings` object. * Introduce `trusted_models` dictionary to manage and confirm `trust_remote_code` settings during model loading * Assign `trust_remote_code` to `evaluate_model` in `trusted_models` instead of `model`	2025-11-24 06:27:44 +05:30
Spiky Moth	b79b8b1475	Improve support for loading local datasets (#33 ) * Handle loading local datasets * Reorder branches to avoid chain of negatives	2025-11-23 11:15:34 +05:30
Philipp Emanuel Weidmann	83cbf0612a	Add option to print refusal geometry	2025-11-22 13:18:54 +05:30
Philipp Emanuel Weidmann	c35f3031f8	Allow stopping the optimization process early with Ctrl+C	2025-11-21 10:11:00 +05:30
Nikolai Kolodziej	2e1bb4b655	Use `PYTORCH_ALLOC_CONF` instead of deprecated `PYTORCH_CUDA_ALLOC_CONF` (#32 ) * Use `PYTORCH_ALLOC_CONF` instead of deprecated `PYTORCH_CUDA_ALLOC_CONF` * style: reformat environment variable check	2025-11-21 07:27:28 +05:30
Anthony Eufemio	af02bc6ece	Fix support for MXFP4 quantized models with Triton tensors (#28 ) When loading models with MXFP4 quantization (e.g., openai/gpt-oss-20b), the transformers library uses Triton tensors to wrap the quantized weights. These Triton tensors have a .data attribute containing the underlying PyTorch tensor, but torch.is_tensor() returns False for them. This caused a KeyError: 'mlp.down_proj' when trying to load such models, as the try_add() function would fail the assertion check before adding the down projection matrices. The fix extracts the underlying PyTorch tensor via the .data attribute when encountering Triton tensors, allowing heretic to work with MXFP4 quantized models while maintaining full compatibility with standard models. Tested with openai/gpt-oss-20b on PyTorch 2.9.1+cu130, transformers 4.57.1, triton 3.5.1, and kernels 0.11.0.	2025-11-20 13:43:06 +05:30
Philipp Emanuel Weidmann	22a4a5b5b5	Add citation information to README	2025-11-19 12:14:17 +05:30
Philipp Emanuel Weidmann	694edf18d3	Follow up after recent PRs	2025-11-19 11:19:47 +05:30
Philipp Emanuel Weidmann	c9c022a143	Fix linting issues	2025-11-19 10:16:58 +05:30
Philipp Emanuel Weidmann	9905d9517f	Fix formatting issues	2025-11-19 10:04:43 +05:30
Philipp Emanuel Weidmann	f06e939791	Add Ruff as a dev dependency	2025-11-19 09:59:18 +05:30
Philipp Emanuel Weidmann	f3b9826ca4	Add CI workflow	2025-11-19 09:45:54 +05:30
Richard Young, PhD	13bb7b24d6	Fix KeyError when HuggingFace user profile fields are missing (#20 ) Handle optional fullname and email fields in user profile gracefully using .get() method with fallback values to prevent KeyError when uploading models to HuggingFace. This fixes an issue where users without a public email or fullname set in their HuggingFace profile would encounter an error during the upload process. Co-authored-by: ricyoung <riyoung@gmail.com>	2025-11-19 05:32:50 +05:30
Nikolai Kolodziej	c8b6663b93	Fix multi-GPU support and memory management (#17 ) * Ensure projector is on the same device as the matrix for multi-GPU support * Optimize memory management for loaded model weights * Refactor memory management by removing unnecessary gc.collect() calls * Optimize memory usage (#1) * Improve memory management by explicitly deleting model layers and optimizing projector usage * Optimize memory management by explicitly deleting the model and forcing garbage collection * Add back deleted `empty_cache` call * Fix broken file * Remove unnecessary deletions * Remove unnecessary empty_cache() calls * Remove unused import of gc * Duplicate `gc.collect` call in `empty_cache()` * Move additional `gc.collect` call in front of `torch.x.empty_cache`	2025-11-19 05:09:12 +05:30
Ooze	61fdf72b42	Add support for Granite MoE Hybrid in model.py by including down projections for shared MLP and MoE experts (#14 )	2025-11-18 08:32:58 +05:30
red40maxxer	7bad84b4f1	perf: clear residuals after computing direction (#15 ) Co-authored-by: mad-cat-lon <113548315+mad-cat-lon@users.noreply.github.com>	2025-11-17 22:18:22 +05:30
Matt Barnson	09730bad70	MPS support (#5 ) * MPS support * oops, added issue tracker. * Delete .beads/issues.jsonl	2025-11-17 18:42:01 +05:30