Unexplained drop in load ramp up in the ramping-arrival-rate executor


Hoping that someone might take a look at the strange test behaviour we are observing.


The script we are trying to test is a simple HTTP script with GET/POST requests. We are using the ramping-arrival-rate executor with the workload model below:

  • Initial Load - 1 iteration per second.
  • Target Load - 30 iterations per second.
  • Stage 1 - Target 30, Duration 30 minutes
  • Stage 2 - Target 30, Duration 60 minutes
  • Stage 3 - Target 0, Duration 5 minutes
  • Total Planned Duration - 1 hour 35 minutes 30 seconds
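For reference, this workload model corresponds roughly to the following k6 options (the scenario name and the preAllocatedVUs/maxVUs values are assumptions, not taken from our actual script):

```javascript
export const options = {
  scenarios: {
    ramping_load: { // hypothetical scenario name
      executor: 'ramping-arrival-rate',
      startRate: 1,          // initial load: 1 iteration per second
      timeUnit: '1s',
      preAllocatedVUs: 500,  // assumption; k6 can go up to maxVUs if needed
      maxVUs: 3000,          // assumption
      stages: [
        { target: 30, duration: '30m' }, // Stage 1: ramp 1 -> 30 iters/s
        { target: 30, duration: '60m' }, // Stage 2: hold 30 iters/s
        { target: 0, duration: '5m' },   // Stage 3: ramp down to 0
      ],
      // 95m of stages + the default 30s gracefulStop = 1h 35m 30s planned
    },
  },
};
```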

Issue/Problem Statement

During the test, at multiple points (in all stages), we have observed that k6 drops the load being sent to the system under test. For example, it scaled up to 2500 virtual users and then suddenly dropped to 300. I would like to highlight that we observed a sudden surge in HTTP error responses from the application at the same time. The scripts use fail(), so a virtual user exits the current iteration as soon as it encounters an error.

My understanding is that when all 2500 users are working simultaneously and all of a sudden there is a spike in errors causing, let’s say, 2200 virtual users to fail, all of these will immediately start the next iteration. So although there should be a drop in load due to the errors, it shouldn’t be a huge drop, as all the failed users should immediately start their next iteration.

My questions are -

  • When virtual users fail at different steps and abort the current iteration, how does the ramping-arrival-rate executor restart all these users?
  • Does it just drop all these users from the test and start scaling again based on the load pattern?
  • Or will it restart all 2200 users on their next iteration at once?
  • The documentation on fail() states that the virtual user will drop the current iteration, but in the test we observed the user count drop from 2500 to 300.

Thank you so much in advance

Hi there! I’ll try to answer your questions with a general explanation.

The arrival-rate executors aren’t really concerned with whether an iteration finishes successfully or is aborted. The only variable driving their behavior is whether the requested iteration rate is achieved. If it isn’t, they will try to schedule additional VUs to reach it.

So the number of VUs is not the main concern of arrival-rate executors; the iteration rate per whatever timeUnit period you configured is. Have you noticed whether this rate is achieved successfully in your tests?

My understanding is that when all the 2500 users are working simultaneously and all of a sudden there is a spike in errors causing let’s say 2200 virtual users to fail, all of these will immediately start the next iteration.

It depends. If initially 2500 VUs are needed to reach the desired iteration rate, but then the configured rate drops, or the iteration duration becomes shorter for whatever reason, and only 300 VUs are needed, then the other 2200 VUs will remain in “standby”. They’re still ready to be used at any point, and instead of k6 initializing more from scratch (a potentially expensive process), the executor will use these standby VUs first.

Note, however, that VUs don’t “fail”. They simply run iterations, which may fail one or more times. So a VU will continue to run iterations regardless of whether they’re successful, and the arrival-rate executor will decide whether to use more or fewer VUs to run them. When the SUT recovers, the same VUs will run iterations to completion.
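Conceptually, a VU under an arrival-rate executor behaves like this loop (a simplified stand-in to illustrate the point, not k6’s actual implementation):

```javascript
// Simplified stand-in for a VU's life cycle: it keeps running iterations
// whether or not the previous one was aborted (e.g. by fail() throwing).
function runVU(iterations) {
  const results = [];
  for (const iteration of iterations) {
    try {
      iteration();              // fail() inside would throw here
      results.push('passed');
    } catch (err) {
      results.push('aborted');  // the iteration ends early; the VU lives on
    }
  }
  return results;
}

const healthy = () => {};
const erroring = () => { throw new Error('fail() called'); };
console.log(runVU([healthy, erroring, healthy])); // [ 'passed', 'aborted', 'passed' ]
```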

So based on your workload model, and the fact that you’re seeing many aborted iterations, my guess is that since the aborted iterations are much shorter, the executor is able to reach the desired rate with far fewer VUs. So initially 2500 VUs are required to reach 30 iters/s, but when your script starts aborting iterations, they end much more quickly, and only 300 VUs are needed to maintain 30 iters/s.
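A quick back-of-the-envelope check with Little’s law supports this (the iteration durations below are assumptions chosen to match the VU counts you observed):

```javascript
// VUs kept busy ≈ arrival rate (iters/s) × average iteration duration (s).
function vusNeeded(ratePerSec, avgIterDurationSec) {
  return Math.ceil(ratePerSec * avgIterDurationSec);
}

// Healthy SUT: if a full iteration takes ~83 s, 30 iters/s keeps ~2500 VUs busy.
console.log(vusNeeded(30, 83)); // 2490
// Erroring SUT: if fail() aborts the iteration after ~10 s, ~300 VUs suffice.
console.log(vusNeeded(30, 10)); // 300
```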

Keep in mind that iters/s doesn’t necessarily equal requests per second, and it’s not related to the number of VUs, either. It all depends on the work you’re doing in a single iteration. So I wouldn’t focus so much on the number of VUs, but on whether you’re able to keep the desired rate without reporting any errors. You need to look into why your SUT is returning error responses, as that probably means it’s not able to handle the amount of load you’re trying to send it.
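As a made-up illustration of that distinction:

```javascript
// iters/s and requests/s differ by the work done per iteration.
const itersPerSec = 30;          // the configured arrival rate
const requestsPerIteration = 5;  // assumption: each iteration makes 5 HTTP calls
console.log(itersPerSec * requestsPerIteration); // 150 requests/s at 30 iters/s
```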

Hope this helps, and let us know if you have further questions.