feat: support plain text files as prompt datasets (#337)

A dataset path that points to a plain file is now read as one prompt per
line, with empty lines ignored. For text files, "column" is ignored and
"split" is optional; when given, it selects a subset of lines using slice
notation (e.g. "[:400]").
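
The text-file behavior described above can be sketched roughly as follows.
This is an illustrative sketch only, not the code from this commit: the names
parse_line_slice and load_text_prompts are hypothetical, only integer slices
are handled, and stripping surrounding whitespace from each line is an
assumption.

```python
import os
import re

# Hypothetical helper for illustration; NOT the project's get_split_slice.
# Accepts only integer slices such as "[:400]", "[100:]" or "[100:200]".
_SLICE_RE = re.compile(r"^\[(-?\d+)?:(-?\d+)?\]$")


def parse_line_slice(split):
    """Turn an optional slice string into a Python slice object."""
    if split is None:
        return slice(None)  # no split given: keep every line
    match = _SLICE_RE.match(split.strip())
    if match is None:
        # An invalid split raises instead of being silently ignored.
        raise ValueError(f"Invalid split for a text file: {split!r}")
    start, stop = (int(g) if g is not None else None for g in match.groups())
    return slice(start, stop)


def load_text_prompts(path, split=None):
    """Read one prompt per line from a plain file, skipping empty lines."""
    # os.path.isfile does not look at the extension, so "prompts" works
    # as well as "prompts.txt".
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    with open(path, encoding="utf-8") as f:
        # Assumption: whitespace-only lines count as empty.
        prompts = [line.strip() for line in f if line.strip()]
    return prompts[parse_line_slice(split)]
```

With such a helper, load_text_prompts("prompts.txt", "[:400]") would return
at most the first 400 non-empty lines, and omitting the split would return
all of them.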

Detection uses os.path.isfile, so files without an extension also work. The
split-parsing logic is factored into a shared get_split_slice helper, which
derives the split name from the specification. split and column are now
optional in DatasetSpecification; the dataset branches raise a clear error
when either is missing. An invalid split now raises an error instead of
being silently ignored.

ReadInstruction.from_spec in the pinned datasets version expects a named
split and does not parse a bare slice, so the text branch prepends a
synthetic split name.

Revives the approach from #103.

Closes #98.

Co-authored-by: Ric <ricyoung@gmail.com>
Rocker Zhang
2026-05-31 17:36:47 +08:00
committed by GitHub
parent 6338e2c99b
commit b790094193
3 changed files with 56 additions and 19 deletions
+5
@@ -173,6 +173,11 @@ refusal_markers = [
# System prompt to use when prompting the model.
system_prompt = "You are a helpful assistant."
# Each "dataset" below can be a Hugging Face dataset ID, a path to a dataset on disk,
# or a path to a plain text file with one prompt per line (empty lines are ignored).
# For text files, "column" is ignored and "split" is optional; when given, it selects
# a subset of the lines using slice notation (e.g. "[:400]").
# Dataset of prompts that tend to not result in refusals (used for calculating refusal directions).
[good_prompts]
dataset = "mlabonne/harmless_alpaca"