Architecture advice for distributed load testing Azure SignalR to 1 million concurrent users with custom scenarios

Benyamin Radmard 0 Reputation points
2025-12-05T08:17:31.7366667+00:00

I am architecting a large-scale real-time application using Azure SignalR Service (Premium Tier) and need to validate our system's performance scaling from 50k up to 1 million concurrent users.

The Challenge: My requirement is to run custom "bot" scenarios where clients are active, not just idle. Each simulated user needs to:

  1. Negotiate with our backend (HTTP request).
  2. Connect to SignalR.
  3. Join a specific group.
  4. Send/receive messages at a fixed interval (e.g., 1 message every 5 seconds).
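To make the scenario concrete, each bot does roughly the following (a minimal sketch with the .NET client; the hub URL, group name, and hub method names are placeholders for our real ones):

```csharp
using Microsoft.AspNetCore.SignalR.Client;

// One simulated user. WithUrl() runs the /negotiate handshake against our
// backend automatically before opening the WebSocket.
var connection = new HubConnectionBuilder()
    .WithUrl("https://example-backend/hubs/load") // placeholder hub endpoint
    .WithAutomaticReconnect()
    .Build();

connection.On<string>("broadcast", msg => { /* record received messages */ });

await connection.StartAsync();
await connection.InvokeAsync("JoinGroup", "group-42"); // placeholder hub method

// Active bot: one message every 5 seconds, until the test is stopped.
while (true)
{
    await connection.SendAsync("Echo", "ping"); // placeholder hub method
    await Task.Delay(TimeSpan.FromSeconds(5));
}
```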

I am concerned that standard load testing approaches (like a simple JMeter script on a few VMs) will hit client-side bottlenecks long before we reach the SignalR Service limits. Specifically:

Ephemeral Port Exhaustion: A single load agent IP is limited to ~65k ports.

CPU Context Switching: Managing 1M active connections requires massive client-side resources.

Negotiation Bottleneck: Ramping up 1M users creates a "thundering herd" on our backend API /negotiate endpoint.

My Questions:

Distributed Architecture: What is the recommended Azure architecture for generating this level of distributed load? Is there a standard pattern using AKS (Kubernetes) to orchestrate thousands of lightweight clients to avoid port exhaustion?

Tooling: Are there specific tools or SDKs recommended by Microsoft for orchestrating custom SignalR scenarios at this scale? (Standard tools often struggle to simulate "smart" client logic without consuming excessive resources).

Ramp-up Strategy: How should we handle the load on the backend negotiation endpoint during the test? Is it common practice to mock the negotiation step during load tests to isolate the SignalR Service performance?

Any advice on the "Right Way" to architect this test bench on Azure would be appreciated.

Azure SignalR Service
An Azure service that is used for adding real-time communications to web applications.

2 answers

  1. Golla Venkata Pavani 270 Reputation points Microsoft External Staff Moderator
    2025-12-05T09:37:47.39+00:00

    Hi @Benyamin Radmard,

    Thank you for reaching out about architecting an Azure-based distributed load-testing solution to simulate 50K–1M active SignalR clients with custom scenarios (negotiation, group join, messaging) without hitting client-side bottlenecks such as port exhaustion or overload on the backend negotiate endpoint.

    Here is some quick guidance for distributed load testing of Azure SignalR.

    1. Architecture Patterns
    • AKS + NAT Gateway (recommended)
      • Deploy lightweight .NET bot clients in AKS pods (1–2K active connections per pod).
      • Attach a NAT Gateway to avoid SNAT port exhaustion (~65K ports per IP); with up to 16 public IPs you get ~1M ports.
      • Use HPA/KEDA for horizontal scaling.
      • Use multiple subnets, each with its own NAT Gateway, for isolation and a higher aggregate SNAT inventory.
    • Azure Load Testing + JMeter plugin
      • Managed orchestration with the SignalR plugin for connect/join/send scenarios.
      • Attach ALT engines to a VNet with a NAT Gateway for SNAT scaling.
    2. Tools
    • Microsoft client SDKs: use Microsoft.AspNetCore.SignalR.Client (.NET) or @microsoft/signalr (JS/TS) for full custom logic (Negotiate > Connect > Join Group > Send/Receive messages); see the sketch after this list.
    • Azure JMeter SignalR plugin: ideal for declarative orchestration at scale via Azure Load Testing; supports connect, join group, and send intervals, and runs in distributed mode.
    • Azure SignalR Benchmark Tool (GitHub), which includes AKS deployment templates.
    • Crank/Crankier: Microsoft's perf tools for connection density. Useful to validate max concurrent connections and transport behavior, but they primarily hold idle connections.
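    As a sketch of what one lightweight bot host per pod can look like with the .NET client SDK (the endpoint, group, and hub method names are placeholders; the 2,000-bot and 50-connection figures are illustrative):

    ```csharp
    using Microsoft.AspNetCore.SignalR.Client;

    // Hosts N bots in one process. Async I/O lets thousands of connections
    // share a small thread pool; the semaphore throttles the connect rate so
    // a single pod does not spike the negotiate endpoint.
    const int botsPerPod = 2000;             // matches the 1–2K/pod guidance
    var connectGate = new SemaphoreSlim(50); // max concurrent StartAsync calls

    var bots = Enumerable.Range(0, botsPerPod).Select(async i =>
    {
        var conn = new HubConnectionBuilder()
            .WithUrl("https://example-backend/hubs/load") // placeholder endpoint
            .WithAutomaticReconnect()
            .Build();

        conn.On<string>("broadcast", _ => { /* record counters/latency here */ });

        await connectGate.WaitAsync();
        try { await conn.StartAsync(); }
        finally { connectGate.Release(); }

        await conn.InvokeAsync("JoinGroup", $"group-{i % 100}"); // placeholder hub method

        while (true) // active bot: 1 message every 5 seconds
        {
            await conn.SendAsync("Echo", $"bot-{i}"); // placeholder hub method
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    });

    await Task.WhenAll(bots); // runs until the pod is stopped
    ```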
    3. Ramp-Up & Negotiation
    • Front the negotiate endpoint with APIM (or a lightweight negotiate server) to rate-limit, cache short-TTL tokens, and apply jittered backoff at the edge. APIM can act as the negotiate server when you want to decouple the app core from load-generation spikes.
    • Pre-warm hub server connections: the SDK opens ~5 server WebSockets per hub by default; configure InitialHubServerConnectionCount higher so servers are ready before you start the ramp.
    • Use staged ramps with jitter: start at 1–5K clients/sec, then increase in steps while randomly delaying each client's negotiate call by 0–N ms to avoid synchronized spikes on your API, as sketched below.
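    A minimal sketch of such a staged ramp (the stage sizes, jitter window, and StartBotAsync helper are illustrative, not prescriptive):

    ```csharp
    var rng = Random.Shared;
    int[] stages = { 1_000, 5_000, 25_000, 100_000 }; // clients added per stage

    foreach (var stageSize in stages)
    {
        var batch = Enumerable.Range(0, stageSize).Select(async i =>
        {
            // A random 0–2,000 ms offset de-synchronizes the /negotiate calls.
            await Task.Delay(rng.Next(0, 2_000));
            await StartBotAsync(i);
        });
        await Task.WhenAll(batch);

        // Hold at this level, watch SignalR/NAT/agent metrics, then step up.
        await Task.Delay(TimeSpan.FromSeconds(30));
    }

    // Hypothetical helper: negotiates, connects, joins a group, and starts the
    // bot's send loop in the background, returning once the connection is up.
    async Task StartBotAsync(int id)
    {
        await Task.CompletedTask; // placeholder for the bot lifecycle sketched earlier
    }
    ```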
    4. Scale & Monitoring
    • Capacity planning
      • Pre-scale units (Premium tier recommended) before test day.
      • Each SignalR unit handles ≈1,000 concurrent connections, so 1M connections implies on the order of 1,000 units; verify the unit ceiling for your SKU and region.
    • Monitoring
      • SignalR metrics: Server Load, connection count, outbound messages.
      • AKS/ALT agents: CPU, memory, SNAT failures.
      • NAT Gateway: SNAT port usage and reuse.

    References:
    https://dori-uw-1.kuma-moon.com/en-us/azure/azure-signalr/signalr-concept-performance
    https://github.com/Azure/azure-signalr-bench
    https://dori-uw-1.kuma-moon.com/en-us/azure/azure-signalr/signalr-howto-use-management-sdk
    https://dori-uw-1.kuma-moon.com/en-us/azure/azure-signalr/signalr-howto-scale-multi-instances
    https://dori-uw-1.kuma-moon.com/en-us/azure/app-testing/load-testing/how-to-high-scale-load?tabs=portal

    Kindly let us know if the above helps or if you need further assistance with this issue.

    Please "Upvote" if the information helped you. This will help us and others in the community as well.

    1 person found this answer helpful.

  2. Q&A Assist
    2025-12-05T08:17:56.0866667+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    To architect a distributed load testing solution for Azure SignalR Service that scales from 50,000 to 1 million concurrent users, consider the following recommendations:

    Distributed Architecture

    1. Use Azure Kubernetes Service (AKS): Deploy your load testing agents in AKS to orchestrate thousands of lightweight clients. This setup helps manage resources effectively and mitigates port exhaustion by spreading connections across many nodes and egress IPs (for example, via a NAT Gateway).
    2. Horizontal Scaling: Scale your load testing agents horizontally by deploying multiple instances of your testing application within AKS. This allows you to handle more connections without hitting the ephemeral port limit.
    3. Client-Side Optimization: Keep each simulated client lightweight. Rely on async I/O so thousands of connections can share a small thread pool, and prefer the WebSockets transport to minimize per-client overhead; note that each simulated user must still hold its own SignalR connection for the concurrency numbers to be meaningful. A small example follows this list.
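    For instance, a hedged snippet pinning the .NET client to the WebSockets transport (the endpoint is a placeholder), which avoids the extra requests and polling cost of the fallback transports:

    ```csharp
    using Microsoft.AspNetCore.Http.Connections; // HttpTransportType
    using Microsoft.AspNetCore.SignalR.Client;

    // One socket per bot; skips the SSE/long-polling fallbacks, which consume
    // more client-side resources at high connection counts.
    var connection = new HubConnectionBuilder()
        .WithUrl("https://example-backend/hubs/load", HttpTransportType.WebSockets)
        .WithAutomaticReconnect()
        .Build();

    await connection.StartAsync();
    ```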

    Tooling

    1. Custom Load Testing Tools: While standard tools like JMeter may not suffice for simulating complex client logic, consider building a custom load testing tool using .NET or Node.js that can handle the SignalR negotiation and messaging logic. This tool can be designed to simulate active users effectively.
    2. Microsoft SDKs: Utilize the Azure SignalR Service SDK to manage connections and messaging efficiently. This SDK can help streamline the process of connecting clients and sending messages.

    Ramp-up Strategy

    1. Staggered Ramp-up: Implement a staggered ramp-up strategy for your load tests to avoid overwhelming the backend negotiation endpoint. Gradually increase the number of connections over time rather than starting all at once.
    2. Mocking Negotiation: It is common practice to mock the negotiation step during load tests to isolate the performance of the SignalR Service. This approach allows you to focus on the scalability of SignalR without the additional load on your backend API; see the sketch after this list.
    3. Monitoring and Metrics: Continuously monitor the performance of both the SignalR Service and your backend during tests. Use Azure Monitor and Application Insights to gather metrics and identify bottlenecks.
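    If you do bypass your backend this way, one hedged approach is to mint the {url, accessToken} pair a real /negotiate call would return directly with the Microsoft.Azure.SignalR.Management package (this sketch assumes a recent SDK version with NegotiateAsync; the hub name and user ID are placeholders):

    ```csharp
    using Microsoft.AspNetCore.SignalR.Client;
    using Microsoft.Azure.SignalR.Management; // NuGet: Microsoft.Azure.SignalR.Management

    // Serverless-style access to the service, using its connection string.
    var serviceManager = new ServiceManagerBuilder()
        .WithOptions(o => o.ConnectionString =
            Environment.GetEnvironmentVariable("AZURE_SIGNALR_CONNECTIONSTRING"))
        .BuildServiceManager();

    var hubContext = await serviceManager.CreateHubContextAsync("loadHub", default); // placeholder hub name

    // Generate what the backend's /negotiate endpoint would have returned.
    var negotiation = await hubContext.NegotiateAsync(new NegotiationOptions { UserId = "bot-1" });

    // The client then connects straight to the service, skipping the app backend.
    var connection = new HubConnectionBuilder()
        .WithUrl(negotiation.Url, o => o.AccessTokenProvider = () => Task.FromResult(negotiation.AccessToken))
        .Build();
    await connection.StartAsync();
    ```

    Keep in mind this isolates the service but leaves your real negotiate path untested, so a separate HTTP-only test of /negotiate is still worthwhile.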

    By following these strategies, you can effectively architect a robust load testing environment for your Azure SignalR Service application, ensuring it can handle the desired scale of concurrent users while minimizing potential bottlenecks.


