There are roughly two approaches.
At a high level the “common approach” is:
Treat your local / GCS dataset as normal files, and point TRL (or your script) at those file paths via `datasets.load_dataset` or the TRL `datasets:` config.
You are not forced to use a Hugging Face Hub dataset name.
Concretely, there are two standard patterns:
- Use `dataset_name` as a local path (simple case).
- Use a YAML config with `datasets:` + `data_files` (more explicit, ideal for JSONL/CSV on GCS).
And on Vertex AI, “local path” just means “path under /gcs/<BUCKET>” because Cloud Storage is mounted into the container. (Google Cloud)
1. Background: TRL and Hugging Face Datasets
1.1 TRL’s dataset_name is “path or name”
In the TRL docs for the CLI/script utilities, the key line is:
`dataset_name` (str, optional) — Path or name of the dataset to load. (Hugging Face)
This means:
- If you pass `dataset_name=timdettmers/openassistant-guanaco`, it loads from the Hugging Face Hub.
- If you pass `dataset_name=/path/to/mycorpus`, it treats it as a path and calls `datasets.load_dataset(path="/path/to/mycorpus", ...)`.
You can see a real example of using a local path with TRL in a forum thread:
```bash
python examples/scripts/sft.py \
    --model_name google/gemma-7b \
    --dataset_name path/to/mycorpus \
    ...
```
and the same script works with a Hub dataset name like OpenAssistant/oasst_top1_2023-08-25. (Hugging Face Forums)
So the CLI is designed to handle both.
1.2 Hugging Face datasets handles local and remote files
Hugging Face Datasets supports:
- Datasets from the Hub
- Local datasets
- Remote datasets (HTTP, S3/GCS/… via URLs or storage options)
The canonical docs say:

“Datasets can be loaded from local files stored on your computer and from remote files… CSV, JSON, TXT, parquet… `load_dataset()` can load each of these file types.” (Hugging Face)
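You can also skip local files entirely and point `data_files` at remote URIs; a hedged sketch for GCS (assumes `gcsfs` is installed and application-default credentials are available; the paths are placeholders):

```python
from datasets import load_dataset

# fsspec resolves gs:// URIs when gcsfs is installed.
ds = load_dataset(
    "json",
    data_files={
        "train": "gs://my-bucket/drug-herg/train.jsonl",
        "validation": "gs://my-bucket/drug-herg/eval.jsonl",
    },
)
```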
Typical local example:
```python
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={
        "train": "path/to/train.jsonl",
        "validation": "path/to/val.jsonl",
    },
)
```
You can also pass lists of paths or multiple splits. (Hugging Face Forums)
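For example, a split can take a list of files instead of a single path (a minimal sketch; the shard names are placeholders):

```python
from datasets import load_dataset

# Each split accepts either a single file or a list of files.
ds = load_dataset(
    "json",
    data_files={
        "train": ["train-shard-0000.jsonl", "train-shard-0001.jsonl"],
        "validation": "val.jsonl",
    },
)
```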
TRL just delegates to this API under the hood.
2. Vertex AI detail: GCS buckets are mounted under /gcs
For Vertex AI custom training jobs, Google uses Cloud Storage FUSE so that Cloud Storage looks like a normal filesystem inside the container:
“When you start a custom training job, the job sees a directory /gcs, which contains all your Cloud Storage buckets as subdirectories.” (Google Cloud)
So if you have data at:
```
gs://my-bucket/drug-herg/train.jsonl
gs://my-bucket/drug-herg/eval.jsonl
```
then inside the training container you see:
```
/gcs/my-bucket/drug-herg/train.jsonl
/gcs/my-bucket/drug-herg/eval.jsonl
```
From the TRL / `datasets.load_dataset` perspective, these are just normal local paths.
That’s the key: GCS → /gcs/<BUCKET> → treat as local files.
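A quick sanity check inside the container (the bucket and file names are the placeholders from above):

```python
import os

# gs://my-bucket/drug-herg/train.jsonl is exposed at this FUSE-mounted path.
path = "/gcs/my-bucket/drug-herg/train.jsonl"
assert os.path.exists(path), f"expected Cloud Storage FUSE mount at {path}"
```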
3. Pattern 1 (simplest): use dataset_name as a path
If your data directory is something `datasets` can detect automatically (e.g., Parquet or a saved HF dataset), you can often point `dataset_name` straight at it:
3.1 Local machine
Assume:

```
/home/you/data/drug-herg/
    train.jsonl
    eval.jsonl
```
You could optionally save this as an HF dataset first:
```python
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={
        "train": "/home/you/data/drug-herg/train.jsonl",
        "validation": "/home/you/data/drug-herg/eval.jsonl",
    },
)
ds.save_to_disk("/home/you/data/drug-herg-hf")
```
Then run TRL:
```bash
trl sft \
    --model_name_or_path google/gemma-2b-it \
    --dataset_name=/home/you/data/drug-herg-hf \
    ...
```
Here `dataset_name` is a path, and TRL will internally call `datasets.load_from_disk` / `load_dataset` as appropriate. The Stack Overflow / GeeksforGeeks posts show exactly this pattern for local paths. (Stack Overflow)
3.2 Vertex AI
Upload your HF-saved dataset directory to GCS, e.g. gs://my-bucket/drug-herg-hf/ (containing the files written by `save_to_disk`).
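One way to get that directory into the bucket, besides uploading it with gsutil, is to let `datasets` write to Cloud Storage directly. This is a sketch that assumes `gcsfs` is installed so `save_to_disk` can accept a `gs://` URI via fsspec:

```python
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={
        "train": "/home/you/data/drug-herg/train.jsonl",
        "validation": "/home/you/data/drug-herg/eval.jsonl",
    },
)

# With gcsfs installed, save_to_disk can write through fsspec to Cloud Storage.
ds.save_to_disk("gs://my-bucket/drug-herg-hf")
```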
Inside the container, that is /gcs/my-bucket/drug-herg-hf.
Then in your CustomContainerTrainingJob args:
```python
args = [
    "--model_name_or_path=google/gemma-2b-it",
    "--dataset_name=/gcs/my-bucket/drug-herg-hf",
    # other TRL args...
]
```
This is the simplest approach when you want to reuse a pre-saved HF dataset. But it requires you to create that HF dataset once (either locally and upload, or directly on GCS).
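If you read that directory yourself inside the training container (rather than letting the TRL CLI do it), it would look roughly like the sketch below; the bucket layout is the placeholder from above.

```python
from datasets import load_from_disk

# /gcs/<BUCKET>/... is the Cloud Storage FUSE view of gs://<BUCKET>/...
ds = load_from_disk("/gcs/my-bucket/drug-herg-hf")
print(ds)  # DatasetDict with "train" and "validation" splits
```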
4. Pattern 2 (more flexible, common with JSONL/CSV): YAML datasets: with data_files
This is the pattern most people use when they have raw JSONL/CSV files and want full control, especially on Vertex AI.
4.1 Why use datasets: instead of dataset_name?
The TRL script-utils docs explicitly support a datasets mixture config:
`dataset_name` (str, optional) - Path or name of the dataset to load. If `datasets` is provided, this will be ignored. (Hugging Face)
That is, if you define `datasets` in the YAML:

- TRL ignores `dataset_name`.
- TRL uses your `datasets` entries (each mapping more or less directly to `datasets.load_dataset`).
This is the cleanest way to tell TRL:
- “Use the JSON builder”
- “Here are my train/validation files”
- “Use only the `prompt` and `completion` columns”
4.2 Example dataset on GCS
Say you have:
```
gs://my-bucket/drug-herg/train.jsonl
gs://my-bucket/drug-herg/eval.jsonl
```
With prompt–completion records (your current SFT format):
{"prompt": "Instructions... SMILES: O=C(...)\nAnswer:", "completion": " (B)<eos>"}
{"prompt": "Instructions... SMILES: CCN(...)\nAnswer:", "completion": " (A)<eos>"}
...
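If you still need to produce files in this shape, here is a minimal sketch (the records are placeholders for your real prompts):

```python
import json

# Hypothetical records; only the "prompt" and "completion" keys matter for SFT.
records = [
    {"prompt": "Instructions... SMILES: O=C(...)\nAnswer:", "completion": " (B)<eos>"},
    {"prompt": "Instructions... SMILES: CCN(...)\nAnswer:", "completion": " (A)<eos>"},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```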
Inside the Vertex container:
```
/gcs/my-bucket/drug-herg/train.jsonl
/gcs/my-bucket/drug-herg/eval.jsonl
```
4.3 YAML config for TRL CLI
The `trl sft` CLI can be driven by a config like:
```yaml
# sft_config.yaml

# ---------- Model ----------
model_name_or_path: google/gemma-2b-it

# ---------- Output ----------
output_dir: /gcs/my-bucket/outputs/txgemma-herg
overwrite_output_dir: true

# ---------- Training ----------
max_seq_length: 1024
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
gradient_accumulation_steps: 8
num_train_epochs: 3
learning_rate: 5e-5
warmup_ratio: 0.05
weight_decay: 0.01
bf16: true

# ---------- LoRA / PEFT ----------
use_peft: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.1
lora_target_modules: all-linear

# ---------- Dataset(s) ----------
datasets:
  - path: json                        # use HF "json" dataset builder
    data_files:
      train: /gcs/my-bucket/drug-herg/train.jsonl
      validation: /gcs/my-bucket/drug-herg/eval.jsonl
    split: train                      # the split used for training
    columns: [prompt, completion]     # keep only these columns

# Ignored when datasets: is defined
dataset_name: null
dataset_text_field: null

# ---------- SFT options ----------
completion_only_loss: true            # train only on completion tokens
```
Key points:
- `path: json` tells `datasets.load_dataset("json", ...)` to use the JSON builder. (Hugging Face)
- `data_files` uses the GCS-mounted paths under `/gcs/my-bucket`.
- `columns` trims the dataset to exactly the fields SFTTrainer needs.
- `completion_only_loss: true` ensures the loss is applied only on the completion, not the prompt. (Hugging Face)
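Roughly speaking, that `datasets:` entry corresponds to a `load_dataset` call like the one below; this is a sketch of the mapping, not TRL's literal internals.

```python
from datasets import load_dataset

# Approximate equivalent of the datasets: entry in sft_config.yaml.
ds = load_dataset(
    "json",
    data_files={
        "train": "/gcs/my-bucket/drug-herg/train.jsonl",
        "validation": "/gcs/my-bucket/drug-herg/eval.jsonl",
    },
    split="train",
)
ds = ds.select_columns(["prompt", "completion"])
```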
4.4 Running this locally vs Vertex
Locally (for testing):

- Replace `/gcs/my-bucket/drug-herg/...` with `/home/you/data/drug-herg/...`.
- Run: `trl sft --config sft_config.yaml`
On Vertex AI:

- Upload `sft_config.yaml` itself to GCS, e.g. `gs://my-bucket/configs/sft_config.yaml`.
- Inside the container it appears as `/gcs/my-bucket/configs/sft_config.yaml`.
- In `CustomContainerTrainingJob`:

```python
from google.cloud import aiplatform

args = ["--config=/gcs/my-bucket/configs/sft_config.yaml"]

job = aiplatform.CustomContainerTrainingJob(
    display_name="txgemma-herg-lora-sft",
    container_uri=CONTAINER_URI,
    command=[
        "sh",
        "-c",
        'exec trl sft "$@"',
        "--",
    ],
)

job.run(
    args=args,
    # machine_type, accelerator, etc.
)
```
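The `sh -c 'exec trl sft "$@"' --` command simply forwards whatever `args` Vertex appends at runtime (here, `--config=...`) to `trl sft`, so the same container command works for any config path.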
From TRL’s perspective, this is indistinguishable from local training with a JSON dataset; the only difference is the /gcs/... prefix.
5. Summary: “common approach” in one place
Putting it all together, the standard practice to point TRL (and TRL CLI on Vertex) to local or GCS data instead of a Hub dataset is:
- Store the dataset as normal files (JSONL/CSV/Parquet) either:
  - on local disk for local runs, or
  - in a Cloud Storage bucket for Vertex.
- Treat the GCS paths as local paths under `/gcs/<BUCKET>` inside the Vertex container. (Google Cloud)
- Use one of:
  - `--dataset_name=/gcs/<BUCKET>/path/to/hf-saved-dataset` if you’re using a dataset saved with `save_to_disk`, or
  - a YAML `datasets:` config that calls `datasets.load_dataset("json"/"csv", data_files={...})` on those paths.
- Avoid thinking of `dataset_name` as “must be from the Hub” – per TRL’s own docs, it is “path or name.” (Hugging Face)
That is the common and recommended approach when you want to keep data off the Hub and inside your own filesystem or GCS environment.