Automated ML job - Error loading data schema Timeout exceeded
I am getting the error, "Error loading data schema, please go back and choose another data. Timeout of 20000ms exceeded." while submitting an Automated ML job for image classification against a source with approximately 1,700 training images. The soon-to-be-decommissioned Azure Custom Vision was able to handle this, plus an additional 700 images associated with testing and validation.
Azure AI Custom Vision
-
Sridhar M • 5,340 Reputation points • Microsoft External Staff • Moderator
2025-12-15T19:54:06.4066667+00:00 Welcome to Microsoft Q&A and Thank you for reaching out.
This error occurs before the AutoML job even starts, during the step where Azure ML tries to scan your dataset and infer the schema. When the dataset contains hundreds or thousands of image files (your case: ~1,700 total), Azure ML must enumerate every file and request metadata from the storage account. Azure ML imposes backend execution timeouts (commonly 20–60 seconds), and these limits cannot be changed, so schema loading may fail when the dataset contains many small files.
Image classification datasets stored as folders of hundreds of small files trigger large numbers of:
- Directory lookups
- Blob metadata reads
- File open/read calls
Azure ML’s storage-layer guidance confirms that many small files create high request overhead, making it easier to hit storage limits, bandwidth limits, or request throttles — all of which slow down schema loading.
Network/security issues can make schema loading even slower
If your workspace uses VNet + private endpoints or storage firewall restrictions, Azure ML may struggle to read dataset files fast enough. Azure ML documentation notes that dataset preview and schema loading can fail if the storage account does not allow required access paths. The fix is to temporarily enable:
- Public Network Access: Enabled, or
- Allow trusted Microsoft services

The most reliable fix: materialize the dataset into fewer, larger files
Microsoft's recommendation for handling timeout‑prone datasets is to pre-materialize the dataset into a small number of larger files so Azure ML avoids scanning thousands of separate images at schema time. Examples that work well:
- ZIP file containing images
- Parquet file containing encoded image bytes
- TFRecord file for vision data
Azure ML engineering explicitly states that materializing and registering the dataset before submitting the AutoML job is the most stable solution.
Additional optimizations to avoid future schema failures
You can further improve schema loading reliability by:
- Flattening the folder structure (avoid deeply nested directories)
- Avoiding mounts for huge numbers of small files (downloads perform better than mount-on-open behaviors)
- Using premium storage if you have very high file count workloads
Recommended practical workflow for your 1,700-image dataset:
To avoid the AutoML “Error loading data schema” completely:
- Zip your 1,700 images, upload the ZIP to your datastore.
- Create a data asset pointing to the ZIP.
- Use AutoML Vision with the ZIP as input — Azure ML will unpack internally.
- (Optional) Convert images → TFRecords or parquet if you want maximum scalability.
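Step 1 of the workflow above can be sketched with the standard-library `zipfile` module. This is a minimal local sketch; the folder and archive names are hypothetical placeholders, and the archive would still need to be uploaded to your datastore afterwards:

```python
import zipfile
from pathlib import Path


def zip_images(image_dir: str, archive_path: str) -> int:
    """Bundle every file under image_dir into a single ZIP archive.

    Returns the number of files added. Uploading one archive instead of
    ~1,700 individual blobs avoids the per-file enumeration that can
    trigger the schema-load timeout described above.
    """
    root = Path(image_dir)
    count = 0
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for img in sorted(root.rglob("*")):
            if img.is_file():
                # Store paths relative to the folder root so the
                # archive layout stays flat and predictable.
                zf.write(img, img.relative_to(root))
                count += 1
    return count
```

The resulting archive can then be uploaded with Azure Storage Explorer or the `az` CLI, whichever you already use for your datastore.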
References:
- Set up AutoML to train computer vision models
- Prepare data for computer vision tasks with automated ML
- Storage limits of experiment snapshots
- Data source and format
- Known Issues and Troubleshooting
I hope this helps. Do let me know if you have any further queries.
Thank you!
-
Onyango, David • 31 Reputation points
2025-12-15T22:37:38.1966667+00:00 in the case of zipping the file, how will it interpret the .jsonl placed on the storage container along with the images it references?
-
Sridhar M • 5,340 Reputation points • Microsoft External Staff • Moderator
2025-12-15T23:20:16.6+00:00 Azure AutoML does NOT read images from inside a ZIP archive. AutoML for Images requires that each JSONL entry’s image_url points to an AzureML datastore path (an azureml:// URI), not a file path inside a ZIP. Zipping is only a workaround to reduce schema-load timeouts; AutoML does not open and enumerate the ZIP’s internal files. Each JSON line must reference the full AzureML datastore path where the actual image file exists:
{ "image_url": "azureml://subscriptions/<sub>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore>/paths/images/image001.jpg", "label": "cat" }
If you upload a ZIP archive to the datastore to reduce file-count-related timeouts, you must explicitly unzip the archive (e.g., in a preprocessing step or manually) so that the images exist as individual files in the datastore before AutoML starts. AutoML will not internally unzip or look inside the ZIP. The JSONL requires that each image exists as an accessible datastore path.
The MLTable you use for AutoML Vision points to the JSONL file. That JSONL file must reference real, accessible cloud image paths. The AutoML Vision pipeline loads images from those URIs only. If the images remain inside a ZIP, they simply cannot be resolved. Therefore, uploading the ZIP → unzipping → generating JSONL with datastore URIs is the correct workflow.
So in the ZIP scenario:
- AutoML does NOT interpret JSONL paths as ZIP-internal references.
- The JSONL is interpreted normally (each image_url must point to a datastore path).
- The ZIP is only a temporary storage optimization.
- Before training, images must be extracted so the JSONL can correctly point to them.
If JSONL points to images still inside a ZIP, AutoML will fail because the files cannot be located.
If you’d like, I can provide a correct end-to-end workflow showing how to:
- Upload ZIP →
- Unzip inside the datastore →
- Generate JSONL automatically →
- Build MLTable for AutoML Vision.
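The JSONL-generation step of that workflow can be sketched as follows. This is a local sketch only; the subscription, resource group, workspace, and datastore segments are placeholders you would substitute with your own values, and the images are assumed to already sit (unzipped) under an `images/` path in the datastore:

```python
import json
from pathlib import Path

# Placeholder identifiers -- replace with your own workspace details.
DATASTORE_PREFIX = (
    "azureml://subscriptions/<sub>/resourcegroups/<rg>"
    "/workspaces/<ws>/datastores/<datastore>/paths"
)


def build_jsonl(image_dir: str, label_fn, out_path: str) -> int:
    """Write one JSONL line per local image, each pointing at the full
    AzureML datastore URI where the corresponding uploaded image lives.

    label_fn maps a filename to its class label.
    """
    lines = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for img in sorted(Path(image_dir).glob("*.jpg")):
            record = {
                "image_url": f"{DATASTORE_PREFIX}/images/{img.name}",
                "label": label_fn(img.name),
            }
            out.write(json.dumps(record) + "\n")
            lines += 1
    return lines
```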
-
Onyango, David • 31 Reputation points
2025-12-16T06:39:33.7+00:00 I think you had misunderstood. The images are already in the datastore along with the JSONL pointing to each relative path of the images already in the datastore.
-
Sridhar M • 5,340 Reputation points • Microsoft External Staff • Moderator
2025-12-16T16:27:43.33+00:00 When you zip a JSONL file together with the images it references, the system does not “scan” the storage container or resolve URLs dynamically. Instead, interpretation depends on how the JSONL references the images and how the zip is unpacked by the service.
The JSONL is interpreted exactly as written. Zipping files together does not automatically rewrite paths or “link” images unless the JSONL references them correctly.
If your JSONL references images using relative paths, and those files exist in the same ZIP, everything works.
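If you take the relative-path route, it is worth checking that every reference actually resolves before submitting anything. A minimal local sketch (filenames are illustrative; this assumes the JSONL and the images sit side by side):

```python
import json
from pathlib import Path


def missing_images(jsonl_path: str) -> list:
    """Return the image_url values in a JSONL file whose relative
    paths do not resolve to files next to the JSONL itself."""
    base = Path(jsonl_path).parent
    missing = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            url = json.loads(line)["image_url"]
            if not (base / url).is_file():
                missing.append(url)
    return missing
```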
-
Aryan Parashar • 3,690 Reputation points • Microsoft External Staff • Moderator
2025-12-26T07:35:21.6966667+00:00 Hi Onyango, David,
I understand that you're encountering some issues, and I’d like to help resolve this as smoothly as possible.
To mitigate the problem, please ensure the following steps are followed. Use the correct schema as shown below:
{
  "image_url": "azureml://subscriptions/<my-subscription-id>/resourcegroups/<my-resource-group>/workspaces/<my-workspace>/datastores/<my-datastore>/paths/<path_to_image>",
  "image_details": {
    "format": "image_format",
    "width": "image_width",
    "height": "image_height"
  },
  "label": "class_name"
}
Example of a JSONL file for multi-class image classification:
{"image_url": "azureml://subscriptions/my-subscription-id/resourcegroups/my-resource-group/workspaces/my-workspace/datastores/my-datastore/paths/image_data/Image_01.jpg", "image_details": {"format": "jpg", "width": "400px", "height": "258px"}, "label": "can"}
{"image_url": "azureml://subscriptions/my-subscription-id/resourcegroups/my-resource-group/workspaces/my-workspace/datastores/my-datastore/paths/image_data/Image_02.jpg", "image_details": {"format": "jpg", "width": "397px", "height": "296px"}, "label": "milk_bottle"}
...
{"image_url": "azureml://subscriptions/my-subscription-id/resourcegroups/my-resource-group/workspaces/my-workspace/datastores/my-datastore/paths/image_data/Image_n.jpg", "image_details": {"format": "jpg", "width": "1024px", "height": "768px"}, "label": "water_bottle"}
Ensure that the images exist in the datastore. The image URL should not return an error when accessed.
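One way to catch schema problems before uploading is a quick structural check over each JSONL line. A minimal sketch that validates only the fields described in this answer (it flags relative paths as non-azureml:// URIs, which is what this answer recommends against):

```python
import json

# Keys this answer's schema treats as required in every record.
REQUIRED_KEYS = {"image_url", "label"}


def validate_jsonl_line(line: str) -> list:
    """Return a list of problems found in one JSONL record.

    An empty list means the record looks structurally valid.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    problems = []
    for key in REQUIRED_KEYS - record.keys():
        problems.append(f"missing required key: {key}")
    url = record.get("image_url", "")
    if url and not url.startswith("azureml://"):
        problems.append("image_url is not a full azureml:// datastore URI")
    return problems
```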
Please let me know if you continue to face any issues after verifying the above details.
I’m here to help. Thank you for your patience.
-
Onyango, David • 31 Reputation points
2025-12-26T21:03:03.4766667+00:00 Hi Aryan, thanks for the feedback. The format suggested above has taken me a step closer. I have added the image details as shown below and am now able to parse the data (same results whether using the image filename only as a relative path or the full URL as per your sample). However, I get the error message, "The selected data must have a stream column to submit a computer vision job."
{"image_url": "img_0.jpg","image_details":{"format": "jpg"}, "label": "Unknown"}
{"image_url": "img_1.jpg","image_details":{"format": "jpg"},"label": "Genuine"}
I have attempted adding the conversion parameters in the MLTable as below, unsuccessfully.
paths:
  - file: ./train.jsonl
transformations:
  - read_json_lines_with_schema:  # also tried read_json_lines:
      encoding: utf8
      convert:
        image_url: stream
        label: string
The only thing I haven't done as per your example is adding the image dimensions, since there are about 1,700 images.
-
Onyango, David • 31 Reputation points
2025-12-26T21:10:02.38+00:00 See below screenshot of the latest error message after adding the image_details element in the JSON as described above.
-
Onyango, David • 31 Reputation points
2025-12-29T00:03:48.83+00:00 I have addressed the stream column issue by specifying the column type as stream_info in the MLTable, as shown below. I now have a problem with the compute: it appears either my region does not support GPUs or I need to request quota. I will take this offline and raise a separate ticket. Thanks for the pointers.
paths:
  - file: ./train.jsonl
transformations:
  - read_json_lines:
      encoding: utf8
  - convert_column_types:
      - columns: image_url
        column_type: stream_info
-
Aryan Parashar • 3,690 Reputation points • Microsoft External Staff • Moderator
2025-12-29T06:55:22.0166667+00:00 Hi Onyango, David,
I’m glad to hear that things are working for you now.
You are right in your analysis that you need to request a quota. If you face any issues while doing that, please create a new thread.
Let me know if you have any further questions regarding the data schema. If everything looks good, please confirm so that I can post an answer and we can close this thread.
-
Onyango, David • 31 Reputation points
2026-01-06T22:39:21.8666667+00:00 Hi Aryan, no further questions on this even though I haven't succeeded in the quota request to test this. You may close this thread as successful as the specific query has been addressed.