One of the (many!) features I like about Visual Studio Team Services is that, while I can create my build definitions in the cloud-hosted service, I do not have to rely on the hosted build agents to run my builds. This is important because my builds might rely on specialized software that isn’t installed on the hosted agents. It’s also possible that my build processes rely on access to other resources on my local network that are not exposed to the Internet. In the case of the company I work for, we have several on-premises build servers, each running multiple build agents.
When we first set up the new build servers and agents and registered them with our VSTS account, it didn’t take long for us to notice something a bit odd (at least it was odd to us!). While we have a total of 16(‘ish) build agents running across four(‘ish) build servers, we noticed that the builds tended to favor the agents running on our first build server instance. While the first build server saw a lot of action, most of the remaining build agents rarely picked up any builds. The problem was that all four build agents on the first server would be put to work before any build moved on to the next build server, placing undue stress/load on the first build server while the remaining build servers essentially sat idle.
My initial assumption was that VSTS would simply perform a “round-robin” selection process to select the next available build agent on the next available build server. What I discovered was something entirely different. Let’s look at an example scenario…
Let’s Build a Scenario…
Let’s assume we’ve set up three build servers (on-premises), each with three build agents named as shown in the following image:
You can see in the above image that the servers are named “Server-1”, “Server-2” and “Server-3” and the agents are named similarly (1 – 3).
With this setup, this is what I expected the selection order to look like (at least something really close to this):
I expected “Agent-1” to be selected on “Server-1” for the first build and then “Agent-1” on “Server-2” to be selected for the second active build, and so on. However, what we experienced was the following:
In reality, the first build landed on “Agent-1” on “Server-1” followed by the second active build landing on “Agent-2” on “Server-1” and so on. Essentially, the available agents on the first server were being utilized before moving to the next server.
So, what’s going on? After discussing this with some folks at Microsoft, it turns out that the selection routine works something like this* – when a new build is queued:
- Select all build agents that match the specified demands
- Sort the results by the clustered index on agent name
Translation: The build agents are selected in the same order in which they were registered with VSTS. I suppose that’s a simple approach but not the one that I was expecting.
(*While this is how the selection logic currently works there is no guarantee that Microsoft will leave it this way in future updates.)
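As a rough illustration of the two steps described above, here is a minimal Python sketch that simulates the selection for the three-server scenario. It is only a model of the behavior as I understand it, not the actual VSTS code: the `Agent` class, the `registered` sequence number, and the `select_agent` and `queue_builds` helpers are all hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    server: str
    name: str
    registered: int                      # hypothetical registration sequence number
    capabilities: frozenset = frozenset()
    busy: bool = False


def select_agent(agents, demands=frozenset()):
    """Return the first idle agent that meets the demands, in registration order."""
    # Step 1: keep only idle agents whose capabilities satisfy the demands.
    matching = [a for a in agents if not a.busy and demands <= a.capabilities]
    # Step 2: sort by registration order (assumed stand-in for the index order).
    matching.sort(key=lambda a: a.registered)
    return matching[0] if matching else None


def queue_builds(agents, count):
    """Queue `count` concurrent builds and record which agent picks up each one."""
    picked = []
    for _ in range(count):
        agent = select_agent(agents)
        if agent is None:
            break                        # no idle agent left
        agent.busy = True
        picked.append((agent.server, agent.name))
    return picked


# Agents registered server by server: all of Server-1 first, then Server-2, ...
agents = [
    Agent(f"Server-{s}", f"Agent-{a}", registered=(s - 1) * 3 + a)
    for s in range(1, 4)
    for a in range(1, 4)
]

first_four = queue_builds(agents, 4)
for server, name in first_four:
    print(server, name)
```

With the agents registered one server at a time, the first three concurrent builds all land on “Server-1” and only the fourth reaches “Server-2”, which matches what we saw.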
Can It Be “Fixed”?
While I haven’t actually attempted this just yet, the fix should be as simple as removing the agents from VSTS and re-adding them in the order in which you’d like to see them selected (e.g. like the middle image shown above). I’ll be trying this out soon and I’ll report back on how it goes. If you try this before I get a chance to, please comment below with your findings.
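To plan the re-registration, a small Python sketch like the one below can list the agents in the interleaved order I was originally expecting. It does not talk to VSTS at all; it only prints a suggested order, and the `interleaved_order` helper is a hypothetical name for this example. It also assumes that registration order really does drive selection, which, as noted above, I have not verified yet.

```python
def interleaved_order(servers, agents_per_server):
    """List (server, agent) pairs so that consecutive entries rotate across servers."""
    return [
        (server, f"Agent-{n}")
        for n in range(1, agents_per_server + 1)   # outer loop: agent slot
        for server in servers                      # inner loop: rotate servers
    ]


order = interleaved_order(["Server-1", "Server-2", "Server-3"], 3)
for position, (server, agent) in enumerate(order, start=1):
    print(position, server, agent)
```

Registering the agents in this sequence should, in theory, send the first three concurrent builds to three different servers rather than stacking them on “Server-1”.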