Architecture advice for distributed load testing Azure SignalR to 1 million concurrent users with custom scenarios

Benyamin Radmard 0 Reputation points
2025-12-05T08:17:31.7366667+00:00

I am architecting a large-scale real-time application using Azure SignalR Service (Premium Tier) and need to validate our system's performance scaling from 50k up to 1 million concurrent users.

The Challenge: My requirement is to run custom "bot" scenarios where clients are active, not just idle. Each simulated user needs to:

  1. Negotiate with our backend (HTTP request).
  2. Connect to SignalR.
  3. Join a specific group.
  4. Send/receive messages at a fixed interval (e.g., 1 message every 5 seconds).
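To make the scenario concrete, each bot does roughly the following (a minimal sketch with the .NET client; the hub URL, group name, and hub method names are placeholders for our real ones):

```csharp
using Microsoft.AspNetCore.SignalR.Client;

// One simulated user. WithUrl() runs the /negotiate handshake against our
// backend automatically before opening the WebSocket.
var connection = new HubConnectionBuilder()
    .WithUrl("https://example-backend/hubs/load") // placeholder hub endpoint
    .WithAutomaticReconnect()
    .Build();

connection.On<string>("broadcast", msg => { /* record received messages */ });

await connection.StartAsync();
await connection.InvokeAsync("JoinGroup", "group-42"); // placeholder hub method

// Active bot: one message every 5 seconds, until the test is stopped.
while (true)
{
    await connection.SendAsync("Echo", "ping"); // placeholder hub method
    await Task.Delay(TimeSpan.FromSeconds(5));
}
```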

I am concerned that standard load testing approaches (like a simple JMeter script on a few VMs) will hit client-side bottlenecks long before we reach the SignalR Service limits. Specifically:

Ephemeral Port Exhaustion: A single load agent IP is limited to ~65k ports.

CPU Context Switching: Managing 1M active connections requires massive client-side resources.

Negotiation Bottleneck: Ramping up 1M users creates a "thundering herd" on our backend API /negotiate endpoint.

My Questions:

Distributed Architecture: What is the recommended Azure architecture for generating this level of distributed load? Is there a standard pattern using AKS (Kubernetes) to orchestrate thousands of lightweight clients to avoid port exhaustion?

Tooling: Are there specific tools or SDKs recommended by Microsoft for orchestrating custom SignalR scenarios at this scale? (Standard tools often struggle to simulate "smart" client logic without consuming excessive resources).

Ramp-up Strategy: How should we handle the load on the backend negotiation endpoint during the test? Is it common practice to mock the negotiation step during load tests to isolate the SignalR Service performance?

Any advice on the "Right Way" to architect this test bench on Azure would be appreciated.

Azure SignalR Service
An Azure service that is used for adding real-time communications to web applications.

2 answers

  1. Golla Venkata Pavani 270 Reputation points Microsoft External Staff Moderator
    2025-12-05T09:37:47.39+00:00

    Hi @Benyamin Radmard,

    Thank you for reaching out about architecting an Azure-based distributed load-testing solution to simulate 50K–1M active SignalR clients with custom scenarios (negotiation, group join, messaging) without hitting client-side bottlenecks such as port exhaustion or overload on the backend negotiate endpoint.

    Here is some quick guidance for distributed load testing of Azure SignalR.

    1. Architecture Patterns
    • AKS + NAT Gateway (recommended)
      • Deploy lightweight .NET bot clients in AKS pods (1–2K active connections per pod).
      • Attach a NAT Gateway to avoid SNAT port exhaustion (~65K ports per IP); with up to 16 public IPs you get ~1M ports.
      • Use HPA/KEDA for horizontal scaling.
      • Use multiple subnets, each with its own NAT Gateway, for isolation and a higher aggregate SNAT inventory.
    • Azure Load Testing + JMeter plugin
      • Managed orchestration with the SignalR plugin for connect/join/send scenarios.
      • Attach ALT engines to a VNet with a NAT Gateway for SNAT scaling.
    2. Tools
    • Microsoft client SDKs: use Microsoft.AspNetCore.SignalR.Client (.NET) or @microsoft/signalr (JS/TS) for full custom logic (Negotiate > Connect > Join Group > Send/Receive messages); see the sketch after this list.
    • Azure JMeter SignalR plugin: ideal for declarative orchestration at scale via Azure Load Testing; supports connect, join group, and send intervals, and runs in distributed mode.
    • Azure SignalR Benchmark Tool (GitHub), which includes AKS deployment templates.
    • Crank/Crankier: Microsoft's perf tools for connection density. Useful to validate max concurrent connections and transport behavior, but they primarily hold idle connections.
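    As a sketch of what one lightweight bot host per pod can look like with the .NET client SDK (the endpoint, group, and hub method names are placeholders; the 2,000-bot and 50-connection figures are illustrative):

    ```csharp
    using Microsoft.AspNetCore.SignalR.Client;

    // Hosts N bots in one process. Async I/O lets thousands of connections
    // share a small thread pool; the semaphore throttles the connect rate so
    // a single pod does not spike the negotiate endpoint.
    const int botsPerPod = 2000;             // matches the 1–2K/pod guidance
    var connectGate = new SemaphoreSlim(50); // max concurrent StartAsync calls

    var bots = Enumerable.Range(0, botsPerPod).Select(async i =>
    {
        var conn = new HubConnectionBuilder()
            .WithUrl("https://example-backend/hubs/load") // placeholder endpoint
            .WithAutomaticReconnect()
            .Build();

        conn.On<string>("broadcast", _ => { /* record counters/latency here */ });

        await connectGate.WaitAsync();
        try { await conn.StartAsync(); }
        finally { connectGate.Release(); }

        await conn.InvokeAsync("JoinGroup", $"group-{i % 100}"); // placeholder hub method

        while (true) // active bot: 1 message every 5 seconds
        {
            await conn.SendAsync("Echo", $"bot-{i}"); // placeholder hub method
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    });

    await Task.WhenAll(bots); // runs until the pod is stopped
    ```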
    3. Ramp-Up & Negotiation
    • Front the negotiate endpoint with APIM (or a lightweight negotiate server) to rate-limit, cache short-TTL tokens, and apply jittered backoff at the edge. APIM can act as the negotiate server when you want to decouple the app core from load-generation spikes.
    • Pre-warm hub server connections: the SDK opens ~5 server WebSockets per hub by default; configure InitialHubServerConnectionCount higher so servers are ready before you start the ramp.
    • Use staged ramps with jitter: start at 1–5K clients/sec, then increase in steps while randomly delaying each client's negotiate call by 0–N ms to avoid synchronized spikes on your API, as sketched below.
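    A minimal sketch of such a staged ramp (the stage sizes, jitter window, and StartBotAsync helper are illustrative, not prescriptive):

    ```csharp
    var rng = Random.Shared;
    int[] stages = { 1_000, 5_000, 25_000, 100_000 }; // clients added per stage

    foreach (var stageSize in stages)
    {
        var batch = Enumerable.Range(0, stageSize).Select(async i =>
        {
            // A random 0–2,000 ms offset de-synchronizes the /negotiate calls.
            await Task.Delay(rng.Next(0, 2_000));
            await StartBotAsync(i);
        });
        await Task.WhenAll(batch);

        // Hold at this level, watch SignalR/NAT/agent metrics, then step up.
        await Task.Delay(TimeSpan.FromSeconds(30));
    }

    // Hypothetical helper: negotiates, connects, joins a group, and starts the
    // bot's send loop in the background, returning once the connection is up.
    async Task StartBotAsync(int id)
    {
        await Task.CompletedTask; // placeholder for the bot lifecycle sketched earlier
    }
    ```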
    4. Scale & Monitoring
    • Capacity planning
      • Pre-scale units (Premium tier recommended) before test day.
      • Each SignalR unit handles ≈1,000 concurrent connections, so 1M connections implies on the order of 1,000 units; verify the unit ceiling for your SKU and region.
    • Monitoring
      • SignalR metrics: Server Load, connection count, outbound messages.
      • AKS/ALT agents: CPU, memory, SNAT failures.
      • NAT Gateway: SNAT port usage and reuse.

    References:
    https://dori-uw-1.kuma-moon.com/en-us/azure/azure-signalr/signalr-concept-performance
    https://github.com/Azure/azure-signalr-bench
    https://dori-uw-1.kuma-moon.com/en-us/azure/azure-signalr/signalr-howto-use-management-sdk
    https://dori-uw-1.kuma-moon.com/en-us/azure/azure-signalr/signalr-howto-scale-multi-instances
    https://dori-uw-1.kuma-moon.com/en-us/azure/app-testing/load-testing/how-to-high-scale-load?tabs=portal

    Kindly let us know if the above helps or if you need further assistance with this issue.

    Please "Upvote" if the information helped you. This will help us and others in the community as well.

    1 person found this answer helpful.

  2. Q&A Assist
    2025-12-05T08:17:56.0866667+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    To architect a distributed load testing solution for Azure SignalR Service that scales from 50,000 to 1 million concurrent users, consider the following recommendations:

    Distributed Architecture

    1. Use Azure Kubernetes Service (AKS): Deploy your load testing agents in AKS to orchestrate thousands of lightweight clients. This setup helps manage resources effectively and mitigates port exhaustion by spreading connections across many nodes and egress IPs (for example, via a NAT Gateway).
    2. Horizontal Scaling: Scale your load testing agents horizontally by deploying multiple instances of your testing application within AKS. This allows you to handle more connections without hitting the ephemeral port limit.
    3. Client-Side Optimization: Keep each simulated client lightweight. Rely on async I/O so thousands of connections can share a small thread pool, and prefer the WebSockets transport to minimize per-client overhead; note that each simulated user must still hold its own SignalR connection for the concurrency numbers to be meaningful. A small example follows this list.
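    For instance, a hedged snippet pinning the .NET client to the WebSockets transport (the endpoint is a placeholder), which avoids the extra requests and polling cost of the fallback transports:

    ```csharp
    using Microsoft.AspNetCore.Http.Connections; // HttpTransportType
    using Microsoft.AspNetCore.SignalR.Client;

    // One socket per bot; skips the SSE/long-polling fallbacks, which consume
    // more client-side resources at high connection counts.
    var connection = new HubConnectionBuilder()
        .WithUrl("https://example-backend/hubs/load", HttpTransportType.WebSockets)
        .WithAutomaticReconnect()
        .Build();

    await connection.StartAsync();
    ```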

    Tooling

    1. Custom Load Testing Tools: While standard tools like JMeter may not suffice for simulating complex client logic, consider building a custom load testing tool using .NET or Node.js that can handle the SignalR negotiation and messaging logic. This tool can be designed to simulate active users effectively.
    2. Microsoft SDKs: Utilize the Azure SignalR Service SDK to manage connections and messaging efficiently. This SDK can help streamline the process of connecting clients and sending messages.

    Ramp-up Strategy

    1. Staggered Ramp-up: Implement a staggered ramp-up strategy for your load tests to avoid overwhelming the backend negotiation endpoint. Gradually increase the number of connections over time rather than starting all at once.
    2. Mocking Negotiation: It is common practice to mock the negotiation step during load tests to isolate the performance of the SignalR Service. This approach allows you to focus on the scalability of SignalR without the additional load on your backend API; see the sketch after this list.
    3. Monitoring and Metrics: Continuously monitor the performance of both the SignalR Service and your backend during tests. Use Azure Monitor and Application Insights to gather metrics and identify bottlenecks.
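    If you do bypass your backend this way, one hedged approach is to mint the {url, accessToken} pair a real /negotiate call would return directly with the Microsoft.Azure.SignalR.Management package (this sketch assumes a recent SDK version with NegotiateAsync; the hub name and user ID are placeholders):

    ```csharp
    using Microsoft.AspNetCore.SignalR.Client;
    using Microsoft.Azure.SignalR.Management; // NuGet: Microsoft.Azure.SignalR.Management

    // Serverless-style access to the service, using its connection string.
    var serviceManager = new ServiceManagerBuilder()
        .WithOptions(o => o.ConnectionString =
            Environment.GetEnvironmentVariable("AZURE_SIGNALR_CONNECTIONSTRING"))
        .BuildServiceManager();

    var hubContext = await serviceManager.CreateHubContextAsync("loadHub", default); // placeholder hub name

    // Generate what the backend's /negotiate endpoint would have returned.
    var negotiation = await hubContext.NegotiateAsync(new NegotiationOptions { UserId = "bot-1" });

    // The client then connects straight to the service, skipping the app backend.
    var connection = new HubConnectionBuilder()
        .WithUrl(negotiation.Url, o => o.AccessTokenProvider = () => Task.FromResult(negotiation.AccessToken))
        .Build();
    await connection.StartAsync();
    ```

    Keep in mind this isolates the service but leaves your real negotiate path untested, so a separate HTTP-only test of /negotiate is still worthwhile.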

    By following these strategies, you can effectively architect a robust load testing environment for your Azure SignalR Service application, ensuring it can handle the desired scale of concurrent users while minimizing potential bottlenecks.


