Hi Soham, you can cover this in two ways: use public traces for realism, and synthetic traces for controlled stress testing.
Public datasets worth checking:
Google cluster traces (Borg) for job/task scheduling and resource usage patterns
Alibaba cluster trace for container workloads and utilization over time
Azure traces and other public workload datasets from academic benchmarking papers
Also look for “cluster trace”, “workload trace”, “autoscaling trace”, “request trace”, “datacenter telemetry”, “Kubernetes trace” on the Hub
If you cannot find a dataset with all signals in one place, a common approach is to fuse:
a request-arrival trace (per service) with
a resource-utilization trace (per node or pod),
then derive autoscaling events by simulating a scaling policy over the fused series.
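Here is a minimal sketch of that fusion step plus policy simulation. Everything in it is illustrative: the column names, timestamps, and values are made up (real traces will have different schemas), and the scaling rule is the standard HPA-style formula desired = ceil(current_replicas * observed_util / target_util), with an assumed 60% utilization target.

```python
import math
import pandas as pd

# Hypothetical per-service request-arrival trace (column names are illustrative)
requests = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="1min"),
    "service": "checkout",
    "request_rate": [120, 300, 450, 800, 400, 150],
})
# Hypothetical per-pod CPU-utilization trace, aggregated to the service level
utilization = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="1min"),
    "service": "checkout",
    "cpu_util": [0.30, 0.55, 0.70, 0.95, 0.60, 0.35],  # fraction of requested CPU
})

# Fuse the two traces on timestamp + service (real traces may need an asof join
# if their sampling intervals differ)
fused = pd.merge(requests, utilization, on=["timestamp", "service"])

# Derive autoscaling events with an HPA-style rule:
# desired = ceil(current_replicas * observed_util / target_util)
TARGET_UTIL = 0.60  # assumed target; tune per experiment
replicas = 1
events = []
for _, row in fused.iterrows():
    desired = max(1, math.ceil(replicas * row.cpu_util / TARGET_UTIL))
    if desired != replicas:
        events.append({"timestamp": row.timestamp,
                       "from": replicas, "to": desired})
        replicas = desired

print(pd.DataFrame(events))
```

The derived events become supervised labels (scale-up / scale-down / hold) aligned to the fused telemetry rows.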
How I can help you directly:
Provide a ready-to-use synthetic dataset generator that produces time series for CPU, memory, network, request rate, latency, and error rate, plus autoscaling actions under different policies (HPA-style, predictive, RL-style)
Include bursty traffic, diurnal seasonality, noisy telemetry, failures, and multi-service interference
Output formats that plug into training easily, such as Parquet plus a Gym-style environment spec for RL, or a supervised dataset for predicting scale-up and scale-down actions
Add evaluation scripts for cost, latency, SLO violations, and stability metrics, so you can compare heuristics vs. learned policies
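To make the generator idea concrete, here is a small sketch covering diurnal seasonality, bursts, noisy telemetry, HPA-style replica counts, and a toy SLO metric. Every model parameter in it is an assumption for illustration: the per-replica capacity, the 60% utilization target, the queueing-flavored latency curve, and the 200 ms SLO threshold are all placeholders you would calibrate against a real trace.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 24 * 60  # one day at 1-minute resolution
t = np.arange(n)

# Diurnal seasonality: baseline plus a sinusoid peaking mid-day
base = 200 + 150 * np.sin(2 * np.pi * (t - 6 * 60) / (24 * 60))

# Bursty traffic: a few random spikes with exponential decay
bursts = np.zeros(n)
for start in rng.choice(n - 60, size=5, replace=False):
    bursts[start:start + 60] += 400 * np.exp(-np.arange(60) / 15)

# Noisy telemetry: Gaussian measurement noise, clipped at zero
request_rate = np.clip(base + bursts + rng.normal(0, 20, n), 0, None)

# Assumed capacity model: each replica serves ~100 req/min at full CPU
CAP_PER_REPLICA = 100.0
TARGET_UTIL = 0.60
replicas = np.empty(n, dtype=int)
r = 2
for i in range(n):
    replicas[i] = r  # scaling takes effect at the next step
    util = request_rate[i] / (r * CAP_PER_REPLICA)
    r = max(1, int(np.ceil(r * util / TARGET_UTIL)))  # HPA-style rule

cpu_util = request_rate / (replicas * CAP_PER_REPLICA)
# Toy latency model: latency blows up as utilization approaches 1
latency_ms = 20.0 / (1.0 - np.minimum(cpu_util, 0.99))
slo_violations = float((latency_ms > 200).mean())  # assumed 200 ms SLO

df = pd.DataFrame({"minute": t, "request_rate": request_rate,
                   "replicas": replicas, "cpu_util": cpu_util,
                   "latency_ms": latency_ms})
# df.to_parquet("synthetic_autoscaling_trace.parquet")  # needs pyarrow/fastparquet
print(f"SLO violation rate: {slo_violations:.3f}")
```

From the same frame you can label scale-up/scale-down actions for supervised training, or wrap the loop in a Gym-style step function so an RL agent chooses the replica count instead of the HPA rule.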