The logic is simple: num_workers specifies the number of Python worker processes the loader spawns. Images will likely be preprocessed automatically in some way right before being passed to the model unless you consciously preprocess them yourself beforehand. If you want to preserve the aspect ratio and avoid cropping the image, padding is probably the way to go.
If disk space allows, preprocess and save the images using .map beforehand. This avoids burdening the CPU or GPU right before or during training.
You’re right that dataset.map() is only run where you call it, and there’s no hidden second .map inside Trainer. Given that and no group_by_length, the remaining suspects are:
- How data is loaded (DataLoader / num_workers / batch size).
- How heavy each sample is (large, variable-size images + preprocessing).
I’ll walk through:
- What dataloader_num_workers is and how “sample size” is determined.
- Why image size and variability matter, and what to do about them.
- How this all affects startup time specifically.
- A concrete checklist to improve startup performance.
Along the way I’ll also touch on your idea about dataset.with_transform(pipeline(...)).
1. num_workers and “sample size”: what they actually control
1.1 Where num_workers comes from in your setup
Under the hood, Hugging Face’s Trainer creates a PyTorch DataLoader for your train and eval datasets. In plain PyTorch you might write:
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=True,
    num_workers=0,      # this is the key parameter
    pin_memory=False,
)
In Trainer, those parameters are controlled by TrainingArguments:
- per_device_train_batch_size → batch_size
- dataloader_num_workers → num_workers
- dataloader_pin_memory → pin_memory
You didn’t set dataloader_num_workers, so it defaults to 0. PyTorch’s own tuning guide summarizes the effect:
- num_workers=0 → only the main process loads batches (can be a bottleneck).
- num_workers>0 → separate worker processes load data asynchronously and overlap with training.
So one major knob you haven’t touched yet is:
TrainingArguments(
    ...,
    dataloader_num_workers=2,   # or 4 if CPU cores allow
    dataloader_pin_memory=True,
)
1.2 What is “sample size” here?
There are two separate notions of “size”:
- Dataset size:
  - Number of images in ds["train"] / ds["test"].
  - Determined by how many files imagefolder found in your directories.
- Batch (or mini-batch) size:
  - per_device_train_batch_size in TrainingArguments.
  - The effective batch of samples whose gradients are averaged before one optimizer step is:

    effective_batch = per_device_train_batch_size × num_gpus × gradient_accumulation_steps

  - In your case (1 GPU, batch=4, grad_accum=16):
    - Effective batch = 4 × 1 × 16 = 64 images per optimizer step.
Each DataLoader worker loads batches of batch_size samples in parallel. So:
- num_workers controls how many processes are fetching / transforming batches concurrently.
- If num_workers=0, only the main process does it; that’s usually okay on tiny datasets, but for images and on-the-fly transforms it becomes a bottleneck.
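Before picking a value, it’s worth checking how many CPU cores you actually have (a tiny sketch; 2–4 workers is a common starting point, staying at or below the core count):

import os

# Cores visible to this machine/container; keep dataloader_num_workers at or
# below this so the workers don't starve the main training process.
print(os.cpu_count())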
2. Variable image sizes and resizing: why it matters
You said:
dataset images are various different sizes … I didn’t want to crop or skew the images. Would it probably help if I make the images a certain size before running through the tokenizer?
2.1 What the image processor already does
For vision models, Hugging Face’s AutoImageProcessor (or older FeatureExtractor) usually:
- Resizes the image (often with center crop).
- Normalizes pixel values (mean/std).
- Converts to a tensor of shape [batch_size, channels, height, width].
From the docs:
To achieve this, an image is resized (center cropped) and the pixel values are normalized and rescaled to the model’s expected values.
So even if your raw images come in various sizes, processor(image, return_tensors="pt") is almost certainly:
- Resizing them to a fixed target size (e.g. 224×224),
- Or at least to a predictable size (e.g. shortest edge to a fixed value + crop).
That means:
- The model itself expects standardized size.
- You’re not really avoiding crop/resize; the processor is doing it anyway.
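If you want to see this concretely, push one raw image through the processor and inspect the output shape (a small sketch; the checkpoint and file name below are placeholders, not from your setup):

from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")  # placeholder checkpoint

img = Image.open("some_photo.jpg")            # e.g. an 1800x1200 original
inputs = processor(img, return_tensors="pt")
print(inputs["pixel_values"].shape)           # torch.Size([1, 3, 224, 224]) no matter the input size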
2.2 Why resizing ahead of time can still help
If your originals are very large (e.g. 2000×1500, 4K photos), decoding and resizing every image on the fly is expensive.
Pre-resizing the raw images (offline or via a one-time script) to a “reasonable” size, like 256–512 pixels on the shortest edge, will:
- Reduce CPU work per image.
- Reduce I/O (smaller files).
- Make per-sample preprocessing cheaper.
Typical vision pipelines (including many Kaggle + HF examples) apply transforms like:
- Resize,
- Center crop,
- Random horizontal flip,
- Normalize.
So yes: if your raw images are much larger than your model’s expected size, downscaling them before training can noticeably help.
If you don’t like hard crops or distortion:
- Use a resize that preserves aspect ratio (e.g. shortest edge to 256).
- Optionally pad to square instead of cropping (letterboxing).
- Let the processor do the final center crop to its native size.
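To make the offline pre-resize concrete, here is a minimal sketch (the destination folder, target edge length, and .jpg-only glob are assumptions; adapt them to your data). It shrinks the shortest edge to 512 px while preserving aspect ratio and mirrors the imagefolder directory structure, so labels stay intact:

from pathlib import Path
from PIL import Image

SRC = Path("folder_datasets")            # your existing imagefolder root
DST = Path("folder_datasets_resized")    # hypothetical output folder
SHORT_EDGE = 512

for path in SRC.rglob("*.jpg"):          # extend the glob for other formats
    img = Image.open(path)
    scale = SHORT_EDGE / min(img.size)
    if scale < 1:                        # only downscale, never upscale
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.Resampling.LANCZOS)  # Pillow >= 9.1
    out = DST / path.relative_to(SRC)    # mirror class subfolders -> labels preserved
    out.parent.mkdir(parents=True, exist_ok=True)
    img.save(out)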
2.3 Do you need uniform size before the processor?
No. For correctness:
- AutoImageProcessor can handle varying input sizes and resize them appropriately.
Uniform size before the processor is mainly about performance:
- Smaller files + fewer pixels to process.
- Less per-image CPU work when building batches.
So the priority order is:
- Fix the data pipeline (no huge .map, use with_transform, multiple workers).
- If raw images are very large, pre-resize them once offline.
3. About dataset.with_transform(pipeline(...)) and evaluator
You wrote:
I was considering using dataset.with_transform(pipeline()) to run a dataset through a pipeline, but based on what you said I may want to avoid it since I have some additional post-processing that needs to happen for evaluator.
That instinct is correct.
3.1 What with_transform(pipeline()) would do
Using pipeline inside with_transform would mean running full model inference every time a batch is accessed. Pipelines are built for inference, not for preparing training inputs, so this is both wasteful and the wrong tool for a training transform.
For evaluation with extra post-processing, HF’s evaluate library recommends:
- Either pass the pipeline directly to an evaluator, which handles looping under the hood, or
- Write a simple for loop over the dataset, call pipeline, and feed predictions into metrics manually.
So your plan:
Instead I’ll be using a for loop based on one of the examples…
is the right approach for that separate evaluation task.
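As a rough sketch of that for-loop pattern (the pipeline task, model path, metric, and column names are assumptions based on your image-classification setup; plug in your own post-processing where indicated):

import evaluate
from transformers import pipeline

clf = pipeline("image-classification", model="path/to/your-finetuned-model")  # placeholder path
accuracy = evaluate.load("accuracy")

predictions, references = [], []
for example in ds["test"]:
    outputs = clf(example["image"])                        # list of {"label", "score"} dicts
    top_label = outputs[0]["label"]                        # your extra post-processing goes here
    predictions.append(clf.model.config.label2id[top_label])
    references.append(example["labels"])

print(accuracy.compute(predictions=predictions, references=references))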
4. Why startup is still slow even with one .map
You clarified:
dataset.map() isn’t being run more than once … no group_by_length. I guess the issue is data loading?
Given that:
- One heavy .map before Trainer:
  - You already pay the cost of decoding + processing every image up front.
  - You then store large pixel_values arrays in Arrow.
  - Trainer’s DataLoader now reads large tensors from disk rather than small image files.
- dataloader_num_workers=0:
  - Main process does all the reading + collation.
  - There is no overlap; GPU waits for CPU I/O.
- Variable large images:
  - If your process() function is decoding large images and resizing them on the fly, the .map step itself is heavy.
  - If you instead kept raw images and used with_transform, the per-batch preprocessing becomes heavy unless you add workers.
So even if .map runs only once, startup still includes:
- Scanning ImageFolder (once).
- Running that .map (once, but heavy).
- Building DataLoaders on a tensor-heavy dataset (which may require scanning metadata).
And then each epoch still has to read those large tensors from disk.
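A quick back-of-envelope calculation shows why those stored tensors are heavy (assuming a 224×224 model and float32 pixel_values):

# 3 channels x 224 x 224 pixels x 4 bytes per float32
bytes_per_image = 3 * 224 * 224 * 4
print(bytes_per_image)   # 602112 -> roughly 0.6 MB per image stored uncompressed
                         # in Arrow; compare with your raw JPEG sizes on disk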
5. Concrete ways to speed up training startup in your situation
Here’s a prioritized list tailored to your setup.
5.1 If you can change the data pipeline (recommended)
- Stop precomputing tensors with .map for training runs.
- Keep the dataset as {"image": Image, "labels": int} from imagefolder.
- Use with_transform (or set_transform) to apply the processor lazily per batch:

  # Runs lazily on each accessed batch instead of precomputing everything up front.
  def transform(batch):
      inputs = processor(batch["image"], return_tensors="pt")
      inputs["labels"] = batch["labels"]
      return inputs

  prepared_ds = dataset.with_transform(transform)
- This is the pattern shown in the HF image classification guide and ViT examples.
- Set dataloader_num_workers and dataloader_pin_memory:
  training_args = TrainingArguments(
      ...,
      dataloader_num_workers=2,        # or 4 if CPU allows
      dataloader_pin_memory=True,
      remove_unused_columns=False,     # important with transforms
  )
From HF’s own performance guide:
Configure dataloader_pin_memory and dataloader_num_workers to allocate pinned memory and increase the number of workers.
And PyTorch’s tuning guide:
Setting num_workers > 0 enables asynchronous data loading and overlap between training and data loading. pin_memory=True speeds up host→GPU transfers.
- Optionally resize raw images offline if they are huge:
  - Downsample originals to a modest size (e.g. 256–512 on the shortest edge).
  - Preserve aspect ratio to avoid distortion; crop or pad later if needed.
  - Let the image processor do the final resize/center-crop to the model’s expected size.
This shrinks per-sample CPU work and disk I/O.
With this setup, startup time becomes:
- Folder scan (once).
- Very light Trainer initialization (no heavy .map).
- Workers begin prefetching the first batches while the GPU is still being set up, so you see the first training logs faster.
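One practical note on the with_transform setup above: because the processor returns batched tensors, the HF ViT fine-tuning example also defines a small data collator so the Trainer can stack individual samples back into a batch. A sketch of that collator (the key names match the transform sketch above; treat it as an assumption to adapt):

import torch

def collate_fn(batch):
    # Each element of `batch` is one transformed example from prepared_ds.
    return {
        "pixel_values": torch.stack([x["pixel_values"] for x in batch]),
        "labels": torch.tensor([x["labels"] for x in batch]),
    }

# Pass it to the Trainer via data_collator=collate_fn.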
5.2 If you insist on using .map for preprocessing
If you really want to stick to .map (e.g. you prefer to pay preprocessing cost once and reuse the dataset many times), you can still improve startup by:
- Preprocess once in a separate script and save to disk.

  # preprocess.py
  from datasets import load_dataset
  from transformers import AutoImageProcessor

  raw = load_dataset("imagefolder", data_dir="folder_datasets")
  processor = AutoImageProcessor.from_pretrained("...")  # same checkpoint as your model

  def process(batch):
      feats = processor(batch["image"], return_tensors="pt")
      feats["labels"] = batch["labels"]
      return feats

  ds = raw.map(process, batched=True)
  ds = ds.remove_columns("image")
  ds.save_to_disk("processed_ds")
Then in your training notebook:
from datasets import load_from_disk
ds = load_from_disk("processed_ds")
load_from_disk is generally much faster than recomputing the map; the Datasets docs recommend this when you have expensive preprocessing you don’t want to repeat.
- Still use dataloader_num_workers and pin_memory.
  Even with a preprocessed dataset, using multiple workers helps reading and batching these large tensors. HF and PyTorch both recommend non-zero num_workers for performance.
- Keep resolution reasonable.
  If you precompute pixel_values, make sure you’re not storing huge resolutions you don’t need. Stick close to the model’s expected size.
This doesn’t remove the cost of the initial .map (you pay it once in preprocess.py), but it removes that cost from training startup and makes training notebooks lighter.
6. Summary
- dataset.map() is only executed where you call it; Trainer does not call it internally.
- Your slow startup is now mainly from:
  - Heavy preprocessing (large images → tensors) in .map and/or the processor.
  - The DataLoader running in the main process (dataloader_num_workers=0).
- num_workers controls how many CPU worker processes load and preprocess batches in parallel; set it in TrainingArguments.dataloader_num_workers.
- Standardizing image size is already done by the AutoImageProcessor (resize + crop + normalize). Pre-resizing large originals offline can still significantly reduce CPU and I/O.
- dataset.with_transform(pipeline(...)) is not a good idea for training; pipelines are for inference, and using them in with_transform would run full inference at each access. For evaluator + custom post-processing, a simple for loop around the pipeline (as in the docs) is the right pattern.
If you (a) switch to with_transform for training, (b) set a non-zero dataloader_num_workers, and (c) avoid extremely large raw images, you should see a noticeable reduction in how long it takes for trainer.train() to move from “starting…” to actually printing the first training progress.