There are roughly two approaches.
At a high level the “common approach” is:
Treat your local / GCS dataset as normal files, and point TRL (or your script) at those file paths via `datasets.load_dataset` or the TRL `datasets:` config.
You are not forced to use a Hugging Face Hub dataset name.
Concretely, there are two standard patterns:
- Use `dataset_name` as a local path (simple case).
- Use a YAML config with `datasets:` + `data_files` (more explicit, ideal for JSONL/CSV on GCS).
And on Vertex AI, “local path” just means “path under /gcs/<BUCKET>” because Cloud Storage is mounted into the container. (Google Cloud)
1. Background: TRL and Hugging Face Datasets
1.1 TRL’s dataset_name is “path or name”
In the TRL docs for the CLI/script utilities, the key line is:
`dataset_name` (str, optional) — Path or name of the dataset to load. (Hugging Face)
This means:
- If you pass `dataset_name=timdettmers/openassistant-guanaco`, it loads from the Hugging Face Hub.
- If you pass `dataset_name=/path/to/mycorpus`, it treats it as a path and calls `datasets.load_dataset(path="/path/to/mycorpus", ...)`.
You can see a real example of using a local path with TRL in a forum thread:
```bash
python examples/scripts/sft.py \
    --model_name google/gemma-7b \
    --dataset_name path/to/mycorpus \
    ...
```
and the same script works with a Hub dataset name like OpenAssistant/oasst_top1_2023-08-25. (Hugging Face Forums)
So the CLI is designed to handle both.
1.2 Hugging Face datasets handles local and remote files
Hugging Face Datasets supports:
- Datasets from the Hub
- Local datasets
- Remote datasets (HTTP, S3/GCS/… via URLs or storage options)
The canonical docs say:

“Datasets can be loaded from local files stored on your computer and from remote files… CSV, JSON, TXT, parquet… `load_dataset()` can load each of these file types.” (Hugging Face)
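You can also skip local files entirely and point `data_files` at remote URIs; a hedged sketch for GCS (assumes `gcsfs` is installed and application-default credentials are available; the paths are placeholders):

```python
from datasets import load_dataset

# fsspec resolves gs:// URIs when gcsfs is installed.
ds = load_dataset(
    "json",
    data_files={
        "train": "gs://my-bucket/drug-herg/train.jsonl",
        "validation": "gs://my-bucket/drug-herg/eval.jsonl",
    },
)
```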
Typical local example:
```python
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={
        "train": "path/to/train.jsonl",
        "validation": "path/to/val.jsonl",
    },
)
```
You can also pass lists of paths or multiple splits. (Hugging Face Forums)
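For example, a split can take a list of files instead of a single path (a minimal sketch; the shard names are placeholders):

```python
from datasets import load_dataset

# Each split accepts either a single file or a list of files.
ds = load_dataset(
    "json",
    data_files={
        "train": ["train-shard-0000.jsonl", "train-shard-0001.jsonl"],
        "validation": "val.jsonl",
    },
)
```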
TRL just delegates to this API under the hood.
2. Vertex AI detail: GCS buckets are mounted under /gcs
For Vertex AI custom training jobs, Google uses Cloud Storage FUSE so that Cloud Storage looks like a normal filesystem inside the container:
“When you start a custom training job, the job sees a directory /gcs, which contains all your Cloud Storage buckets as subdirectories.” (Google Cloud)
So if you have data at:
```
gs://my-bucket/drug-herg/train.jsonl
gs://my-bucket/drug-herg/eval.jsonl
```
then inside the training container you see:
```
/gcs/my-bucket/drug-herg/train.jsonl
/gcs/my-bucket/drug-herg/eval.jsonl
```
From the TRL / `datasets.load_dataset` perspective, these are just normal local paths.
That’s the key: GCS → /gcs/<BUCKET> → treat as local files.
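A quick sanity check inside the container (the bucket and file names are the placeholders from above):

```python
import os

# gs://my-bucket/drug-herg/train.jsonl is exposed at this FUSE-mounted path.
path = "/gcs/my-bucket/drug-herg/train.jsonl"
assert os.path.exists(path), f"expected Cloud Storage FUSE mount at {path}"
```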
3. Pattern 1 (simplest): use dataset_name as a path
If your data directory is something `datasets` can detect automatically (e.g., Parquet or a saved HF dataset), you can often point `dataset_name` straight at it:
3.1 Local machine
Assume:

```
/home/you/data/drug-herg/
    train.jsonl
    eval.jsonl
```
You could optionally save this as an HF dataset first:
```python
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={
        "train": "/home/you/data/drug-herg/train.jsonl",
        "validation": "/home/you/data/drug-herg/eval.jsonl",
    },
)
ds.save_to_disk("/home/you/data/drug-herg-hf")
```
Then run TRL:
```bash
trl sft \
    --model_name_or_path google/gemma-2b-it \
    --dataset_name=/home/you/data/drug-herg-hf \
    ...
```
Here `dataset_name` is a path, and TRL will internally call `datasets.load_from_disk` / `load_dataset` as appropriate. The Stack Overflow / GeeksforGeeks posts show exactly this pattern for local paths. (Stack Overflow)
3.2 Vertex AI
Upload your HF-saved dataset directory to GCS, e.g. gs://my-bucket/drug-herg-hf/ (containing the files written by `save_to_disk`).
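One way to get that directory into the bucket, besides uploading it with gsutil, is to let `datasets` write to Cloud Storage directly. This is a sketch that assumes `gcsfs` is installed so `save_to_disk` can accept a `gs://` URI via fsspec:

```python
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={
        "train": "/home/you/data/drug-herg/train.jsonl",
        "validation": "/home/you/data/drug-herg/eval.jsonl",
    },
)

# With gcsfs installed, save_to_disk can write through fsspec to Cloud Storage.
ds.save_to_disk("gs://my-bucket/drug-herg-hf")
```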
Inside the container, that is /gcs/my-bucket/drug-herg-hf.
Then in your CustomContainerTrainingJob args:
```python
args = [
    "--model_name_or_path=google/gemma-2b-it",
    "--dataset_name=/gcs/my-bucket/drug-herg-hf",
    # other TRL args...
]
```
This is the simplest approach when you want to reuse a pre-saved HF dataset. But it requires you to create that HF dataset once (either locally and upload, or directly on GCS).
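If you read that directory yourself inside the training container (rather than letting the TRL CLI do it), it would look roughly like the sketch below; the bucket layout is the placeholder from above.

```python
from datasets import load_from_disk

# /gcs/<BUCKET>/... is the Cloud Storage FUSE view of gs://<BUCKET>/...
ds = load_from_disk("/gcs/my-bucket/drug-herg-hf")
print(ds)  # DatasetDict with "train" and "validation" splits
```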
4. Pattern 2 (more flexible, common with JSONL/CSV): YAML datasets: with data_files
This is the pattern most people use when they have raw JSONL/CSV files and want full control, especially on Vertex AI.
4.1 Why use datasets: instead of dataset_name?
The TRL script-utils docs explicitly support a datasets mixture config:
`dataset_name` (str, optional) - Path or name of the dataset to load. If `datasets` is provided, this will be ignored. (Hugging Face)
That is, if you define `datasets` in the YAML:

- TRL ignores `dataset_name`.
- TRL uses your `datasets` entries (each mapping more or less directly to `datasets.load_dataset`).
This is the cleanest way to tell TRL:
- “Use the JSON builder”
- “Here are my train/validation files”
- “Use only the `prompt` and `completion` columns”
4.2 Example dataset on GCS
Say you have:
```
gs://my-bucket/drug-herg/train.jsonl
gs://my-bucket/drug-herg/eval.jsonl
```
With prompt–completion records (your current SFT format):
{"prompt": "Instructions... SMILES: O=C(...)\nAnswer:", "completion": " (B)<eos>"}
{"prompt": "Instructions... SMILES: CCN(...)\nAnswer:", "completion": " (A)<eos>"}
...
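If you still need to produce files in this shape, here is a minimal sketch (the records are placeholders for your real prompts):

```python
import json

# Hypothetical records; only the "prompt" and "completion" keys matter for SFT.
records = [
    {"prompt": "Instructions... SMILES: O=C(...)\nAnswer:", "completion": " (B)<eos>"},
    {"prompt": "Instructions... SMILES: CCN(...)\nAnswer:", "completion": " (A)<eos>"},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```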
Inside the Vertex container:
```
/gcs/my-bucket/drug-herg/train.jsonl
/gcs/my-bucket/drug-herg/eval.jsonl
```
4.3 YAML config for TRL CLI
The `trl sft` CLI can be driven by a config like:
```yaml
# sft_config.yaml

# ---------- Model ----------
model_name_or_path: google/gemma-2b-it

# ---------- Output ----------
output_dir: /gcs/my-bucket/outputs/txgemma-herg
overwrite_output_dir: true

# ---------- Training ----------
max_seq_length: 1024
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
gradient_accumulation_steps: 8
num_train_epochs: 3
learning_rate: 5e-5
warmup_ratio: 0.05
weight_decay: 0.01
bf16: true

# ---------- LoRA / PEFT ----------
use_peft: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.1
lora_target_modules: all-linear

# ---------- Dataset(s) ----------
datasets:
  - path: json                        # use HF "json" dataset builder
    data_files:
      train: /gcs/my-bucket/drug-herg/train.jsonl
      validation: /gcs/my-bucket/drug-herg/eval.jsonl
    split: train                      # the split used for training
    columns: [prompt, completion]     # keep only these columns

# Ignored when datasets: is defined
dataset_name: null
dataset_text_field: null

# ---------- SFT options ----------
completion_only_loss: true            # train only on completion tokens
```
Key points:
- `path: json` tells `datasets.load_dataset("json", ...)` to use the JSON builder. (Hugging Face)
- `data_files` uses the GCS-mounted paths under `/gcs/my-bucket`.
- `columns` trims the dataset to exactly the fields SFTTrainer needs.
- `completion_only_loss: true` ensures the loss is applied only on the completion, not the prompt. (Hugging Face)
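Roughly speaking, that `datasets:` entry corresponds to a `load_dataset` call like the one below; this is a sketch of the mapping, not TRL's literal internals.

```python
from datasets import load_dataset

# Approximate equivalent of the datasets: entry in sft_config.yaml.
ds = load_dataset(
    "json",
    data_files={
        "train": "/gcs/my-bucket/drug-herg/train.jsonl",
        "validation": "/gcs/my-bucket/drug-herg/eval.jsonl",
    },
    split="train",
)
ds = ds.select_columns(["prompt", "completion"])
```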
4.4 Running this locally vs Vertex
Locally (for testing):

- Replace `/gcs/my-bucket/drug-herg/...` with `/home/you/data/drug-herg/...`.
- Run: `trl sft --config sft_config.yaml`
On Vertex AI:

- Upload `sft_config.yaml` itself to GCS, e.g. `gs://my-bucket/configs/sft_config.yaml`.
- Inside the container it appears as `/gcs/my-bucket/configs/sft_config.yaml`.
- In `CustomContainerTrainingJob`:

```python
from google.cloud import aiplatform

args = ["--config=/gcs/my-bucket/configs/sft_config.yaml"]

job = aiplatform.CustomContainerTrainingJob(
    display_name="txgemma-herg-lora-sft",
    container_uri=CONTAINER_URI,
    command=[
        "sh",
        "-c",
        'exec trl sft "$@"',
        "--",
    ],
)

job.run(
    args=args,
    # machine_type, accelerator, etc.
)
```
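The `sh -c 'exec trl sft "$@"' --` command simply forwards whatever `args` Vertex appends at runtime (here, `--config=...`) to `trl sft`, so the same container command works for any config path.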
From TRL’s perspective, this is indistinguishable from local training with a JSON dataset; the only difference is the /gcs/... prefix.
5. Summary: “common approach” in one place
Putting it all together, the standard practice to point TRL (and TRL CLI on Vertex) to local or GCS data instead of a Hub dataset is:
- Store the dataset as normal files (JSONL/CSV/Parquet) either:
  - on local disk for local runs, or
  - in a Cloud Storage bucket for Vertex.
- Treat the GCS paths as local paths under `/gcs/<BUCKET>` inside the Vertex container. (Google Cloud)
- Use one of:
  - `--dataset_name=/gcs/<BUCKET>/path/to/hf-saved-dataset` if you’re using a dataset saved with `save_to_disk`, or
  - a YAML `datasets:` config that calls `datasets.load_dataset("json"/"csv", data_files={...})` on those paths.
- Avoid thinking of `dataset_name` as “must be from the Hub” – per TRL’s own docs, it is “path or name.” (Hugging Face)
That is the common and recommended approach when you want to keep data off the Hub and inside your own filesystem or GCS environment.