The logic is simple: num_workers specifies the number of Python worker processes the loader spawns. Images will likely be preprocessed automatically in some way right before being passed to the model unless you consciously preprocess them yourself beforehand. If you want to preserve the aspect ratio and avoid cropping the image, padding is probably the way to go.
If disk space allows, preprocess and save the images using .map beforehand. This avoids burdening the CPU or GPU right before or during training.
You’re right that dataset.map() is only run where you call it, and there’s no hidden second .map inside Trainer. Given that and no group_by_length, the remaining suspects are:
- How data is loaded (DataLoader / num_workers / batch size).
- How heavy each sample is (large, variable-size images + preprocessing).
I’ll walk through:
- What dataloader_num_workers is and how “sample size” is determined.
- Why image size and variability matter, and what to do about them.
- How this all affects startup time specifically.
- A concrete checklist to improve startup performance.
Along the way I’ll also touch on your idea about dataset.with_transform(pipeline(...)).
1. num_workers and “sample size”: what they actually control
1.1 Where num_workers comes from in your setup
Under the hood, Hugging Face’s Trainer creates a PyTorch DataLoader for your train and eval datasets. In plain PyTorch you might write:
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=True,
    num_workers=0,      # this is the key parameter
    pin_memory=False,
)
In Trainer, those parameters are controlled by TrainingArguments:
- per_device_train_batch_size → batch_size
- dataloader_num_workers → num_workers
- dataloader_pin_memory → pin_memory
You didn’t set dataloader_num_workers, so it defaults to 0. PyTorch’s own tuning guide summarizes the effect:
- num_workers=0 → only the main process loads batches (can be a bottleneck).
- num_workers>0 → separate worker processes load data asynchronously and overlap with training.
So one major knob you haven’t touched yet is:
TrainingArguments(
    ...,
    dataloader_num_workers=2,   # or 4 if CPU cores allow
    dataloader_pin_memory=True,
)
1.2 What is “sample size” here?
There are two separate notions of “size”:
- Dataset size:
  - Number of images in ds["train"] / ds["test"].
  - Determined by how many files imagefolder found in your directories.
- Batch (or mini-batch) size:
  - per_device_train_batch_size in TrainingArguments.
  - The effective batch of samples whose gradients are averaged before one optimizer step is:

    effective_batch = per_device_train_batch_size × num_gpus × gradient_accumulation_steps

  - In your case (1 GPU, batch=4, grad_accum=16):
    - Effective batch = 4 × 1 × 16 = 64 images per optimizer step.
Each DataLoader worker loads batches of batch_size samples in parallel. So:
- num_workers controls how many processes are fetching / transforming batches concurrently.
- If num_workers=0, only the main process does it; that’s usually okay on tiny datasets, but for images and on-the-fly transforms it becomes a bottleneck.
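Before picking a value, it’s worth checking how many CPU cores you actually have (a tiny sketch; 2–4 workers is a common starting point, staying at or below the core count):

import os

# Cores visible to this machine/container; keep dataloader_num_workers at or
# below this so the workers don't starve the main training process.
print(os.cpu_count())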
2. Variable image sizes and resizing: why it matters
You said:
dataset images are various different sizes … I didn’t want to crop or skew the images. Would it probably help if I make the images a certain size before running through the tokenizer?
2.1 What the image processor already does
For vision models, Hugging Face’s AutoImageProcessor (or older FeatureExtractor) usually:
- Resizes the image (often with center crop).
- Normalizes pixel values (mean/std).
- Converts to a tensor of shape [batch_size, channels, height, width].
From the docs:
To achieve this, an image is resized (center cropped) and the pixel values are normalized and rescaled to the model’s expected values.
So even if your raw images come in various sizes, processor(image, return_tensors="pt") is almost certainly:
- Resizing them to a fixed target size (e.g. 224×224),
- Or at least to a predictable size (e.g. shortest edge to a fixed value + crop).
That means:
- The model itself expects standardized size.
- You’re not really avoiding crop/resize; the processor is doing it anyway.
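If you want to see this concretely, push one raw image through the processor and inspect the output shape (a small sketch; the checkpoint and file name below are placeholders, not from your setup):

from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")  # placeholder checkpoint

img = Image.open("some_photo.jpg")            # e.g. an 1800x1200 original
inputs = processor(img, return_tensors="pt")
print(inputs["pixel_values"].shape)           # torch.Size([1, 3, 224, 224]) no matter the input size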
2.2 Why resizing ahead of time can still help
If your originals are very large (e.g. 2000×1500, 4K photos), decoding and resizing every image on the fly is expensive.
Pre-resizing the raw images (offline or via a one-time script) to a “reasonable” size, like 256–512 pixels on the shortest edge, will:
- Reduce CPU work per image.
- Reduce I/O (smaller files).
- Make per-sample preprocessing cheaper.
Typical vision pipelines (including many Kaggle + HF examples) apply transforms like:
- Resize,
- Center crop,
- Random horizontal flip,
- Normalize.
So yes: if your raw images are much larger than your model’s expected size, downscaling them before training can noticeably help.
If you don’t like hard crops or distortion:
- Use a resize that preserves aspect ratio (e.g. shortest edge to 256).
- Optionally pad to square instead of cropping (letterboxing).
- Let the processor do the final center crop to its native size.
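To make the offline pre-resize concrete, here is a minimal sketch (the destination folder, target edge length, and .jpg-only glob are assumptions; adapt them to your data). It shrinks the shortest edge to 512 px while preserving aspect ratio and mirrors the imagefolder directory structure, so labels stay intact:

from pathlib import Path
from PIL import Image

SRC = Path("folder_datasets")            # your existing imagefolder root
DST = Path("folder_datasets_resized")    # hypothetical output folder
SHORT_EDGE = 512

for path in SRC.rglob("*.jpg"):          # extend the glob for other formats
    img = Image.open(path)
    scale = SHORT_EDGE / min(img.size)
    if scale < 1:                        # only downscale, never upscale
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.Resampling.LANCZOS)  # Pillow >= 9.1
    out = DST / path.relative_to(SRC)    # mirror class subfolders -> labels preserved
    out.parent.mkdir(parents=True, exist_ok=True)
    img.save(out)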
2.3 Do you need uniform size before the processor?
No. For correctness:
- AutoImageProcessor can handle varying input sizes and resize them appropriately.
Uniform size before the processor is mainly about performance:
- Smaller files + fewer pixels to process.
- Less per-image CPU work when building batches.
So the priority order is:
- Fix the data pipeline (no huge .map, use with_transform, multiple workers).
- If raw images are very large, pre-resize them once offline.
3. About dataset.with_transform(pipeline(...)) and evaluator
You wrote:
I was considering using dataset.with_transform(pipeline()) to run a dataset through a pipeline, but based on what you said I may want to avoid it since I have some additional post-processing that needs to happen for evaluator.
That instinct is correct.
3.1 What with_transform(pipeline()) would do
Using pipeline inside with_transform would mean running full model inference every time a batch is accessed. Pipelines are built for inference, not for preparing training inputs, so this is both wasteful and the wrong tool for a training transform.
For evaluation with extra post-processing, HF’s evaluate library recommends:
- Either pass the pipeline directly to an evaluator, which handles looping under the hood, or
- Write a simple for loop over the dataset, call pipeline, and feed predictions into metrics manually.
So your plan:
Instead I’ll be using a for loop based on one of the examples…
is the right approach for that separate evaluation task.
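As a rough sketch of that for-loop pattern (the pipeline task, model path, metric, and column names are assumptions based on your image-classification setup; plug in your own post-processing where indicated):

import evaluate
from transformers import pipeline

clf = pipeline("image-classification", model="path/to/your-finetuned-model")  # placeholder path
accuracy = evaluate.load("accuracy")

predictions, references = [], []
for example in ds["test"]:
    outputs = clf(example["image"])                        # list of {"label", "score"} dicts
    top_label = outputs[0]["label"]                        # your extra post-processing goes here
    predictions.append(clf.model.config.label2id[top_label])
    references.append(example["labels"])

print(accuracy.compute(predictions=predictions, references=references))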
4. Why startup is still slow even with one .map
You clarified:
dataset.map() isn’t being run more than once … no group_by_length. I guess the issue is data loading?
Given that:
- One heavy .map before Trainer:
  - You already pay the cost of decoding + processing every image up front.
  - You then store large pixel_values arrays in Arrow.
  - Trainer’s DataLoader now reads large tensors from disk rather than small image files.
- dataloader_num_workers=0:
  - Main process does all the reading + collation.
  - There is no overlap; GPU waits for CPU I/O.
- Variable large images:
  - If your process() function is decoding large images and resizing them on the fly, the .map step itself is heavy.
  - If you instead kept raw images and used with_transform, the per-batch preprocessing becomes heavy unless you add workers.
So even if .map runs only once, startup still includes:
- Scanning ImageFolder (once).
- Running that .map (once, but heavy).
- Building DataLoaders on a tensor-heavy dataset (which may require scanning metadata).
And then each epoch still has to read those large tensors from disk.
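A quick back-of-envelope calculation shows why those stored tensors are heavy (assuming a 224×224 model and float32 pixel_values):

# 3 channels x 224 x 224 pixels x 4 bytes per float32
bytes_per_image = 3 * 224 * 224 * 4
print(bytes_per_image)   # 602112 -> roughly 0.6 MB per image stored uncompressed
                         # in Arrow; compare with your raw JPEG sizes on disk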
5. Concrete ways to speed up training startup in your situation
Here’s a prioritized list tailored to your setup.
5.1 If you can change the data pipeline (recommended)
- Stop precomputing tensors with .map for training runs.
- Keep the dataset as {"image": Image, "labels": int} from imagefolder.
- Use with_transform (or set_transform) to apply the processor lazily per batch:

  # Runs lazily on each accessed batch instead of precomputing everything up front.
  def transform(batch):
      inputs = processor(batch["image"], return_tensors="pt")
      inputs["labels"] = batch["labels"]
      return inputs

  prepared_ds = dataset.with_transform(transform)
- This is the pattern shown in the HF image classification guide and ViT examples.
- Set dataloader_num_workers and dataloader_pin_memory:
  training_args = TrainingArguments(
      ...,
      dataloader_num_workers=2,        # or 4 if CPU allows
      dataloader_pin_memory=True,
      remove_unused_columns=False,     # important with transforms
  )
From HF’s own performance guide:
Configure dataloader_pin_memory and dataloader_num_workers to allocate pinned memory and increase the number of workers.
And PyTorch’s tuning guide:
Setting num_workers > 0 enables asynchronous data loading and overlap between training and data loading. pin_memory=True speeds up host→GPU transfers.
- Optionally resize raw images offline if they are huge:
  - Downsample originals to a modest size (e.g. 256–512 on the shortest edge).
  - Preserve aspect ratio to avoid distortion; crop or pad later if needed.
  - Let the image processor do the final resize/center-crop to the model’s expected size.
This shrinks per-sample CPU work and disk I/O.
With this setup, startup time becomes:
- Folder scan (once).
- Very light Trainer initialization (no heavy .map).
- Workers begin prefetching the first batches while the GPU is still being set up, so you see the first training logs faster.
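One practical note on the with_transform setup above: because the processor returns batched tensors, the HF ViT fine-tuning example also defines a small data collator so the Trainer can stack individual samples back into a batch. A sketch of that collator (the key names match the transform sketch above; treat it as an assumption to adapt):

import torch

def collate_fn(batch):
    # Each element of `batch` is one transformed example from prepared_ds.
    return {
        "pixel_values": torch.stack([x["pixel_values"] for x in batch]),
        "labels": torch.tensor([x["labels"] for x in batch]),
    }

# Pass it to the Trainer via data_collator=collate_fn.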
5.2 If you insist on using .map for preprocessing
If you really want to stick to .map (e.g. you prefer to pay preprocessing cost once and reuse the dataset many times), you can still improve startup by:
- Preprocess once in a separate script and save to disk.

  # preprocess.py
  from datasets import load_dataset
  from transformers import AutoImageProcessor

  raw = load_dataset("imagefolder", data_dir="folder_datasets")
  processor = AutoImageProcessor.from_pretrained("...")  # same checkpoint as your model

  def process(batch):
      feats = processor(batch["image"], return_tensors="pt")
      feats["labels"] = batch["labels"]
      return feats

  ds = raw.map(process, batched=True)
  ds = ds.remove_columns("image")
  ds.save_to_disk("processed_ds")
Then in your training notebook:
from datasets import load_from_disk
ds = load_from_disk("processed_ds")
load_from_disk is generally much faster than recomputing the map; the Datasets docs recommend this when you have expensive preprocessing you don’t want to repeat.
- Still use dataloader_num_workers and pin_memory.
  Even with a preprocessed dataset, using multiple workers helps reading and batching these large tensors. HF and PyTorch both recommend non-zero num_workers for performance.
- Keep resolution reasonable.
  If you precompute pixel_values, make sure you’re not storing huge resolutions you don’t need. Stick close to the model’s expected size.
This doesn’t remove the cost of the initial .map (you pay it once in preprocess.py), but it removes that cost from training startup and makes training notebooks lighter.
6. Summary
- dataset.map() is only executed where you call it; Trainer does not call it internally.
- Your slow startup is now mainly from:
  - Heavy preprocessing (large images → tensors) in .map and/or the processor.
  - The DataLoader running in the main process (dataloader_num_workers=0).
- num_workers controls how many CPU worker processes load and preprocess batches in parallel; set it in TrainingArguments.dataloader_num_workers.
- Standardizing image size is already done by the AutoImageProcessor (resize + crop + normalize). Pre-resizing large originals offline can still significantly reduce CPU and I/O.
- dataset.with_transform(pipeline(...)) is not a good idea for training; pipelines are for inference, and using them in with_transform would run full inference at each access. For evaluator + custom post-processing, a simple for loop around the pipeline (as in the docs) is the right pattern.
If you (a) switch to with_transform for training, (b) set a non-zero dataloader_num_workers, and (c) avoid extremely large raw images, you should see a noticeable reduction in how long it takes for trainer.train() to move from “starting…” to actually printing the first training progress.