Philipp Emanuel Weidmann
b873598b77
docs: improve settings documentation
2026-02-11 10:19:05 +05:30
Philipp Emanuel Weidmann
f68a887a7b
fix: improve code quality, improve UX, fix small bugs
2026-02-08 13:32:00 +05:30
Spiky Moth
3525b1ac22
Implement Magnitude-Preserving Orthogonal Ablation (#52)
* feat: add support for winsorizing the residuals
Adds setting winsorization_quantile, expressed as the quantile to clamp to.
- If set to a value below 1, the residuals obtained from evaluating the first token of the good and bad prompts are winsorized, that is, values outside the given quantile are clamped. Note that winsorization_quantile = 0.95 corresponds to a 90% winsorization.
* feat: implement magnitude-preserving orthogonal ablation
Adds boolean setting orthogonalize_direction:
- When enabled, only the component of the refusal directions that is orthogonal to the harmless direction is subtracted during abliteration.
Adds enum-valued setting row_normalization:
- 'none': No normalization.
- 'pre': Row-normalize the weight matrix before computing the LoRA adapter.
- 'full': Like 'pre', but re-normalizes to preserve original row magnitudes.
* prefer 'good' and 'bad' over 'harmless' and 'harmful'
* clarify how winsorization is applied
* store and reuse full peft_config
* remove unneeded cast
* make LoRA rank configurable for full normalization
* explain why the singular values are split across the components
2026-02-02 17:05:19 +05:30
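The two numeric ideas in the commit above (#52), winsorizing residuals and keeping only the component of a direction orthogonal to a reference direction, can be illustrated with a minimal standard-library sketch. This is not the project's code: the function names, the two-sided clamp at the `1 - q` and `q` quantiles, and the linear quantile interpolation are all assumptions made for illustration.

```python
import math


def winsorize(values, quantile):
    """Clamp values to the [1 - quantile, quantile] quantile range (assumed two-sided)."""
    values = list(values)
    if not values or quantile >= 1:
        return values  # nothing to clamp, or winsorization disabled

    ordered = sorted(values)
    n = len(ordered)

    def q(p):
        # Quantile by linear interpolation between the two closest ranks.
        pos = p * (n - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

    low, high = q(1 - quantile), q(quantile)
    return [min(max(v, low), high) for v in values]


def orthogonal_component(direction, reference):
    """Return the part of `direction` that is orthogonal to `reference`."""
    ref_norm_sq = sum(r * r for r in reference)
    if ref_norm_sq == 0:
        return list(direction)  # no reference direction to project out
    scale = sum(d * r for d, r in zip(direction, reference)) / ref_norm_sq
    return [d - scale * r for d, r in zip(direction, reference)]
```

With `quantile = 0.95`, the lowest and highest 5% of the values are clamped, which matches the "90% winsorization" wording in the commit message.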
Philipp Emanuel Weidmann
02a5237a02
feat: add option to print prompt/response pairs
2025-12-27 14:48:29 +05:30
michaelh
243f821d93
feat: Add 4-bit loading + LoRA support for low VRAM optimization (#60)
* Add files via upload
* perf: optimize abliteration matrix op (#46)
* perf: optimize abliteration matrix op
* refactor: comments and var names correspond with Arditi
* refactor: fix comments and improve var notation
* fix: accidental line change and improve comments
---------
Co-authored-by: mad-cat-lon <113548315+mad-cat-lon@users.noreply.github.com>
* Fix line endings to LF
* Add hybrid approach for GPT-OSS compatibility
- Check for LoRA adapters before attempting LoRA abliteration
- Fall back to direct weight modification for nn.Parameter (GPT-OSS)
- Ensures compatibility across all model architectures
* Fix projector bug, update print statement, revert README
* Revert README changes to match upstream
* Fix import sorting for ruff
* Fix reload_model for evaluate_model, add type hints and validation
* Apply ruff formatting
* Replace load_in_4bit with quantization enum
* Fix precision loss: use FP32 refusal direction directly
* Move r assignment into non-LoRA path
* Fix linting: apply ruff formatting
* Add auto-merge for LoRA adapters on save/upload
* Fix linting: apply ruff formatting
* Implement CPU-based merge for 4-bit models with OOM fallback
* Remove use_lora flag (LoRA always on), add user prompt for 4-bit export
* Fix: PEFT target_modules expects module names without path prefix
* Fix linting: apply ruff formatting
* Add LoRA fallback and fix quantization_config handling
- Add try/except around LoRA initialization with fallback to direct weight modification
- Only pass quantization_config when not None (fixes gpt-oss loading)
- Use simple forward pass instead of generate() for model test (avoids chat template issues)
- Reset non-LoRA models by reloading in reload_model()
- Check self.use_lora before accessing LoRA adapters in abliterate()
* Add 8-bit quantization support via bitsandbytes
- Add BNB_8BIT option to QuantizationMethod enum
- Add --load-in-8bit CLI support (auto via pydantic-settings)
- Update documentation in config.py and config.default.toml
- Useful for mid-range VRAM (12-16 GB) as balance between memory and numeric stability
* Improve LoRA merge warning and fix linting
* Apply final ruff formatting
* Fix CI: apply ruff import sorting
* Use tiny model for CI efficiency
* Fix import sorting in test_lora.py
* Fix formatting in test_lora.py
* feat: Show merge warning for all models (requires high RAM)
* style: Apply ruff fixes
* Fix undefined Style import in main.py
* Fix(model): Support MoE/3D tensors and enforce dtype safety in abliterate
* Fix(ci): Format model.py with ruff
* Fix(main): Remove invalid style argument from prompt_select and unused import
* Fix logic errors, memory leak, and redundant merges in main.py
* Fix linting and formatting issues (isort, ruff)
* chore: Simplify .gitattributes as requested
* refactor: Remove defensive try-except around LoRA initialization
* chore: Update uv.lock with peft and bitsandbytes
* chore: Regenerate uv.lock to include missing peft dependency
* style: Fix import sorting (isort) for CI compliance
* style: Simplify .gitattributes to single line as requested
* Address PR #60 feedback: Remove caching, fix LoRA reload, global LoRA usage, style fixes
* Address PR review comments: clarify code, fix quantization, rename method
- Add explanatory comments for warning suppression and gc behavior
- Remove redundant gc.collect() calls (empty_cache handles it)
- Fix output message order (ask merge strategy before 'Uploading...')
- Add comment explaining 8-bit quantization doesn't need compute_dtype
- Remove extra newline after dtype comment
- Add future-proofing note for hybrid layer support (#43)
- Remove leftover comment in get_merged_model
- Delete test_lora.py (debug script, not a real test)
- Add comment explaining needs_reload flag purpose
- Extract quantization config into _get_quantization_config() helper
- Rename reload_model() to reset_model_for_trial() for clarity
- Fix reload_model to respect quantization config (fixes evaluate_model bug)
- Remove unused gc import
* Restore gc.collect() before empty_cache() for large models
* refactor: Remove LoRA fallback remnants, simplify code
- Remove use_lora flag (always true since LoRA is always applied)
- Remove isinstance(PeftModel) check in get_merged_model() (always true)
- Simplify reset_model_for_trial() by removing defensive try/except
- Remove redundant gc.collect() calls (empty_cache handles GC)
- Remove unused gc import from main.py
* Address p-e-w review feedback: rename reset_model, remove loaded_model_name, fix type hints, remove GPT-OSS MoE, update assertion
* Restore skip logic for non-LoRA modules and fix 4-bit base_layer.weight access
* Remove defensive lora_A check per review - get_layer_modules already filters
* Fix try_add: nest component init inside Module check, add assert for unexpected types
* Add note about module.weight assumption for type checking
* Change 'Reloading model' to 'Resetting model' in logging
---------
Co-authored-by: accemlcc <accemlcc@users.noreply.github.com>
Co-authored-by: mad-cat-lon <113548315+mad-cat-lon@users.noreply.github.com>
Co-authored-by: Hager <Michael.Hager@bruker.com>
2025-12-14 20:19:09 +05:30
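The quantization handling described in the commit above (#60) can be sketched roughly as follows: an enum replaces the `load_in_4bit` flag, a helper builds the quantization settings, and `quantization_config` is only passed when it is not None. Only the names `QuantizationMethod` and `BNB_8BIT` appear in the commit message. The other member names, the string values, both function names, and the default compute dtype are assumptions. The helper returns a plain dict, so the sketch runs without `transformers` installed.

```python
from enum import Enum


class QuantizationMethod(str, Enum):
    NONE = "none"          # hypothetical member name and value
    BNB_4BIT = "bnb_4bit"  # hypothetical member name and value
    BNB_8BIT = "bnb_8bit"  # member named in the commit message; value assumed


def quantization_kwargs(method, compute_dtype="bfloat16"):
    """Return keyword arguments for a bitsandbytes config, or None."""
    if method is QuantizationMethod.BNB_4BIT:
        return {"load_in_4bit": True, "bnb_4bit_compute_dtype": compute_dtype}
    if method is QuantizationMethod.BNB_8BIT:
        # Per the commit message, 8-bit quantization needs no compute dtype.
        return {"load_in_8bit": True}
    return None


def model_load_kwargs(method):
    """Build loader kwargs, adding quantization_config only when it is set."""
    kwargs = {}
    config = quantization_kwargs(method)
    if config is not None:
        # Real code would wrap this dict, e.g. BitsAndBytesConfig(**config).
        kwargs["quantization_config"] = config
    return kwargs
```

Leaving the key out entirely for unquantized models mirrors the "Only pass quantization_config when not None" note in the commit body.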
Spiky Moth
9d1734855d
feat: avoid excessive low divergence iteration (#73)
* feat: adjust scoring to avoid useless iteration
Adjusts the scoring function to avoid targeting meaninglessly low KL divergences.
Below a threshold value, the KL divergence score switches to the refusal count.
Adds config option kl_divergence_target (defaulting to 0.01).
* fix: Clean up parameter selection in objective
Create variables for num_layers and last_layer_index
* Improves readability and makes choices explicit
* feat: Print the parameters of the selected model
2025-12-14 14:26:48 +05:30
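One possible reading of the scoring change in the commit above (#73), as a hedged sketch rather than the actual objective: above `kl_divergence_target` the KL term is used as-is, and below it the term is replaced by a value derived from the refusal count, so the optimizer stops chasing ever smaller divergences. The function name, the `base_refusals` parameter, and the scaling by the target are assumptions. The commit message only states that the score "switches to the refusal count".

```python
def kl_score(kl_divergence, refusals, base_refusals, kl_divergence_target=0.01):
    """Return the KL objective term, switching to refusals below the target."""
    if kl_divergence >= kl_divergence_target:
        # Above the target, the divergence itself is what should be minimized.
        return kl_divergence
    if base_refusals <= 0:
        return 0.0  # no baseline refusals to compare against
    # Below the target, lower divergence earns nothing more; only fewer
    # refusals improve the score. Scaling by the target (an assumption)
    # keeps these values from exceeding the target when refusals <= base_refusals.
    return kl_divergence_target * refusals / base_refusals
```

For example, with the default target of 0.01, two trials with divergences 0.004 and 0.0001 but the same refusal count would receive the same score under this sketch.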
George
740aab61ba
feat: add max_memory parameter to limit memory usage (#83)
* add max_memory parameter to limit memory usage
* Added to reload_model also
* forgot to add self
* Process max_memory once in __init__ and store it as an instance variable, then reuse it in both locations
2025-12-11 20:57:40 +05:30
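The "process once in __init__, reuse in both locations" pattern from the commit above (#83) might look like the sketch below. The class name, the helper name, and the idea that processing means converting numeric string keys (as they would arrive from a TOML file) to integer device indices are assumptions. `max_memory` itself is an existing keyword of the Hugging Face `from_pretrained` loaders, which accept a mapping such as `{0: "20GiB", "cpu": "64GiB"}`.

```python
def normalize_max_memory(raw):
    """Convert numeric string keys to int device indices; keep keys like 'cpu'."""
    if not raw:
        return None
    normalized = {}
    for key, limit in raw.items():
        key = str(key)
        normalized[int(key) if key.isdigit() else key] = limit
    return normalized


class ModelWrapper:  # hypothetical stand-in for the project's model class
    def __init__(self, max_memory=None):
        # Processed once here and stored, so the initial load and any later
        # reload both see the same value.
        self.max_memory = normalize_max_memory(max_memory)

    def load_kwargs(self):
        """Keyword arguments shared by the initial load and the reload path."""
        return {"max_memory": self.max_memory} if self.max_memory else {}
```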
Philipp Emanuel Weidmann
ffbde3ac2a
fix: follow up after recent PRs
2025-12-07 10:26:16 +05:30
Philipp Emanuel Weidmann
eeb28b28c1
feat: add option to plot residual vectors
2025-12-04 14:22:29 +05:30
Spiky Moth
1f74ac2888
Guard against refusals in broken English (#45)
* Guard against refusals in broken English
* Normalize whitespace between words
2025-11-26 11:29:08 +05:30
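The whitespace normalization mentioned in the commit above (#45) can be illustrated with a small sketch of substring-based refusal detection. The marker list and the function name are hypothetical, and the project's real markers are not shown in this log. The point is only that collapsing runs of whitespace before matching lets a marker match text that contains extra spaces or line breaks.

```python
# Hypothetical markers, for illustration only.
REFUSAL_MARKERS = ("i cannot", "i can't", "i not able", "i am unable")


def is_refusal(response, markers=REFUSAL_MARKERS):
    """Return True if the normalized response contains any refusal marker."""
    # Lowercase and collapse every run of whitespace into a single space.
    normalized = " ".join(response.lower().split())
    return any(marker in normalized for marker in markers)
```

A response such as `"I   cannot\nhelp with that"` would be missed by a plain substring check for `"i cannot"` but is caught after normalization.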
Philipp Emanuel Weidmann
83cbf0612a
Add option to print refusal geometry
2025-11-22 13:18:54 +05:30
Philipp Emanuel Weidmann
8a1aceff11
Switch to multi-objective optimization
2025-11-14 18:04:23 +05:30
Philipp Emanuel Weidmann
fae39ffb89
Move default configuration to Python
2025-11-02 09:29:55 +05:30
Philipp Emanuel Weidmann
a24e6eba96
Improve optimization
2025-10-31 16:04:28 +05:30
Philipp Emanuel Weidmann
c638d3d012
Adjust score parameters
2025-10-25 13:15:31 +05:30
Philipp Emanuel Weidmann
e6aba71186
Improve refusal detection
2025-10-24 11:27:28 +05:30
Philipp Emanuel Weidmann
7caf9fcdc5
Separate training and evaluation prompts
2025-10-09 12:51:31 +05:30
Philipp Emanuel Weidmann
c447805fc2
Improve default dtype configuration
2025-09-23 13:31:41 +05:30
Philipp Emanuel Weidmann
1b37160490
Fix model loading issues
2025-09-21 16:04:41 +05:30
Philipp Emanuel Weidmann
af19fbd254
Initial commit
2025-09-21 11:10:30 +05:30