k6 VU count off in Grafana

I noticed with k6 output to InfluxDB 1.8/Grafana that the VU count reported in Grafana is always off by a few VUs. I even tried logging output locally to a JSON log file and converting it to InfluxDB line protocol format, and the VU count is still under-reported.

Example at k6-benchmarking/bench-ramping-vus.sh-2.md at master · centminmod/k6-benchmarking · GitHub

Parsed k6 JSON output log at /home/k6-workdir/influxdb-vus.log for a benchmark run using the ramping-vus executor in 4 stages with targets of 5, 10, 15, and 0 VUs respectively.

cat /home/k6-workdir/influxdb-vus.log

vus,testname=rampingvus value=1 1665497928000154000
vus,testname=rampingvus value=3 1665497929000153300
vus,testname=rampingvus value=4 1665497930000158500
vus,testname=rampingvus value=6 1665497931000157200
vus,testname=rampingvus value=8 1665497932000220000
vus,testname=rampingvus value=9 1665497933000154400
vus,testname=rampingvus value=11 1665497934000158500
vus,testname=rampingvus value=13 1665497935000171500
vus,testname=rampingvus value=14 1665497936000158000
vus,testname=rampingvus value=11 1665497937000153900
vus,testname=rampingvus value=6 1665497938000155600
vus,testname=rampingvus value=1 1665497939000153000
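The JSON-to-line-protocol step described above could be sketched like this (a minimal converter of my own, not the poster's actual script; it assumes k6's documented JSON output shape, where each line is a record with `"type":"Point"`, a `metric` name, and a `data` object carrying `time`, `value`, and `tags`):

```javascript
// Hypothetical converter: k6 --out json NDJSON -> InfluxDB line protocol,
// keeping only the vus gauge.
function toLineProtocol(ndjson) {
  const lines = [];
  for (const raw of ndjson.split('\n')) {
    if (!raw.trim()) continue;
    const rec = JSON.parse(raw);
    if (rec.type !== 'Point' || rec.metric !== 'vus') continue;
    const tags = rec.data.tags || {};
    const tagStr = Object.entries(tags).map(([k, v]) => `,${k}=${v}`).join('');
    const ns = BigInt(Date.parse(rec.data.time)) * 1000000n; // ms -> ns
    lines.push(`vus${tagStr} value=${rec.data.value} ${ns}`);
  }
  return lines.join('\n');
}

const sample =
  '{"type":"Point","metric":"vus","data":{"time":"2022-10-11T14:18:48Z","value":1,"tags":{"testname":"rampingvus"}}}';
console.log(toLineProtocol(sample));
// -> vus,testname=rampingvus value=1 1665497928000000000
```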

Edit: I’m using a gracefulStop of 30s and guess I’m hitting this: The VU metric shows a smaller number than what I've configured - #3 by mstoykov ?

Sorry for the super-late response, I forgot to reply here earlier in the week :disappointed:

There are a couple of things that might be going on here. First, k6 only emits the vus metric once every second. Second, the target in a ramping-vus executor’s stages represents the number of VUs the executor is supposed to reach right at the end of that stage. If the following stage has a lower target, k6 might execute just a single iteration (or no iteration, if gracefulRampDown is 0 or some small duration) and then drop the VU. If that happens in less than a second, the vus metric emission might miss that one VU that was initialized.
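That timing argument can be illustrated with a tiny simulation (illustration only, not k6's actual scheduler; the stage durations here are made up so that the 15-VU peak falls between whole-second samples):

```javascript
// Illustration only (not k6 internals): a once-per-second sample of an
// instantaneous VU gauge can miss a peak that is held for under a second.

// Hypothetical linear ramp, loosely shaped like the forum example:
// targets 5 -> 10 -> 15 -> 0, each stage 3.5s, so the peak lands at t=10.5s.
const stages = [
  { duration: 3.5, target: 5 },
  { duration: 3.5, target: 10 },
  { duration: 3.5, target: 15 },
  { duration: 3.5, target: 0 },
];

// Instantaneous VU count at time t, interpolating linearly within each stage.
function vusAt(t) {
  let start = 0; // stage start time
  let from = 0;  // VU count at stage start
  for (const s of stages) {
    if (t <= start + s.duration) {
      return Math.round(from + (s.target - from) * ((t - start) / s.duration));
    }
    start += s.duration;
    from = s.target;
  }
  return from;
}

// Sample once per second, like the vus metric emission.
const total = stages.reduce((sum, s) => sum + s.duration, 0);
const samples = [];
for (let t = 0; t <= total; t++) samples.push(vusAt(t));

const sampledMax = Math.max(...samples);
console.log(samples.join(','));          // 0,1,3,4,6,7,9,10,11,13,14,13,9,4,0
console.log('sampled max:', sampledMax); // 14, below the configured peak of 15
```

The sampled sequence looks a lot like the influxdb-vus.log above: the configured peak of 15 is never observed, because it is held for less than one sampling interval.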

Can you provide a self-contained example script that demonstrates the issue? There might also be some sort of an off-by-one bug in k6 or something like that :thinking:


So a longer gracefulRampDown would mean a better chance of hitting all desired VU numbers for multi-stage tests?

I don’t have a self-contained example now. I’ve since evolved the script to combine two executors, constant-arrival-rate plus ramping-vus (benchmark-scenarios-multi.js), and I’m testing higher concurrency numbers, which sort of hides the finer VU reporting, i.e. a 16,000 VU test at k6-benchmarking/bench-ramping-vus.sh-3.md at master · centminmod/k6-benchmarking · GitHub

taskset -c 0-3 k6 run --tag testname=rampingvus -e RPS=5000 -e REQRATE_USERS=16000 -e USERS=0 -e STAGETIME=180s -e STAGE_VU1=5000 -e STAGE_VU2=10000 -e STAGE_VU3=15000 -e STAGE_VU4=0 -e URL=https://domain1.com --no-usage-report --out json=/home/k6-workdir/summary-raw-scenarios-multi.gz benchmark-scenarios-multi.js

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: benchmark-scenarios-multi.js
     output: json (/home/k6-workdir/summary-raw-scenarios-multi.gz)

  scenarios: (100.00%) 2 scenarios, 150000 max VUs, 15m23s max duration (incl. graceful stop):
           * constant_arrival_rate: 5000.00 iterations/s for 3m0s (maxVUs: 16000-150000, exec: constantarrival, gracefulStop: 5s)
           * ramping_vus: Up to 15000 looping VUs for 12m0s over 4 stages (gracefulRampDown: 5s, exec: default, startTime: 3m18s, gracefulStop: 5s)


running (15m22.6s), 000000/016000 VUs, 2707350 complete and 2 interrupted iterations
constant_arrival_rate ✓ [======================================] 000000/016000 VUs  3m0s  5000.00 iters/s
ramping_vus           ✓ [======================================] 00000/15000 VUs    12m0s
INFO[0931] [k6-reporter v2.3.0] Generating HTML summary report  source=console
     ✓ is status 200

     checks.........................: 100.00% ✓ 2707352     ✗ 0      
     data_received..................: 6.5 GB  7.0 MB/s
     data_sent......................: 122 MB  132 kB/s
     http_req_blocked...............: avg=7.18ms   min=132ns    med=248ns    max=2.41s    p(95)=367ns    p(99)=792.56µs p(99.99)=1.94s    count=2707352
     http_req_connecting............: avg=3.38ms   min=0s       med=0s       max=1.73s    p(95)=0s       p(99)=70.6µs   p(99.99)=1.02s    count=2707352
     http_req_duration..............: avg=3.93ms   min=167.36µs med=241.02µs max=1.75s    p(95)=1.32ms   p(99)=17.46ms  p(99.99)=1.03s    count=2707352
       { expected_response:true }...: avg=3.93ms   min=167.36µs med=241.02µs max=1.75s    p(95)=1.32ms   p(99)=17.46ms  p(99.99)=1.03s    count=2707352
     http_req_failed................: 0.00%   ✓ 0           ✗ 2707352
     http_req_receiving.............: avg=176.86µs min=7.85µs   med=19.12µs  max=1.11s    p(95)=205.75µs p(99)=2.1ms    p(99.99)=70.45ms  count=2707352
     http_req_sending...............: avg=87.23µs  min=18.38µs  med=28.61µs  max=890.98ms p(95)=262.05µs p(99)=601.72µs p(99.99)=27.21ms  count=2707352
     http_req_tls_handshaking.......: avg=3.77ms   min=0s       med=0s       max=1.58s    p(95)=0s       p(99)=645.25µs p(99.99)=928.06ms count=2707352
     http_req_waiting...............: avg=3.67ms   min=0s       med=174.41µs max=1.16s    p(95)=860.58µs p(99)=14.12ms  p(99.99)=995.83ms count=2707352
     http_reqs......................: 2707352 2934.531442/s
     iteration_duration.............: avg=2.01s    min=206.09µs med=2s       max=5.03s    p(95)=5s       p(99)=5s       p(99.99)=5.01s    count=2707351
     iterations.....................: 2707351 2934.530359/s
     vus............................: 1       min=0         max=16000
     vus_max........................: 16000   min=3089      max=16000

Yes, but the VU iterations will also need to be longer than a second. k6 will hit whatever VU target you give it, but if the next target is lower, it will immediately start decreasing the actively running VUs. And if either your iteration duration or your gracefulRampDown option is very short, the 1-second measurement and emission of the vus metric might not coincide with the short peak of active VUs.

If you don’t want to mess with these (and you might not be able to, depending on the test), you can also add a stage with the same peak target and a duration of a few seconds, like this:

  stages: [
    { duration: '30m', target: 16000 },
    { duration: '5s', target: 16000 },
    { duration: '30m', target: 0 },
  ],
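In context, that plateau would sit inside a ramping-vus scenario definition something like this (a sketch only; the scenario name and the gracefulRampDown value are placeholders):

```javascript
scenarios: {
  ramping_vus: {
    executor: 'ramping-vus',
    startVUs: 0,
    stages: [
      { duration: '30m', target: 16000 },
      { duration: '5s', target: 16000 }, // hold the peak so the 1s vus sample can catch it
      { duration: '30m', target: 0 },
    ],
    gracefulRampDown: '30s', // placeholder value
  },
},
```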

I see. I compared my previous runs with the current one and noticed that after I introduced a random sleep time between 1.5s and 4.5s, the reported VUs are slightly more accurate. Is that because of this?

before

iteration_duration.............: avg=1.85ms   min=555µs    med=1.15ms   max=36.28ms p(95)=5.13ms   p(99)=12.22ms p(99.99)=30.84ms  count=46674

after

iteration_duration.............: avg=2.01s    min=206.09µs med=2s       max=5.03s    p(95)=5s       p(99)=5s       p(99.99)=5.01s    count=2707351
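A random pause in that range can be produced with a tiny helper like this (the function name is mine, not from the poster's script; in a k6 script you would pass the result to k6's `sleep()`):

```javascript
// Hypothetical helper: uniformly random pause length between min and max seconds.
// In a k6 script: import { sleep } from 'k6'; then sleep(randomSeconds(1.5, 4.5));
function randomSeconds(min, max) {
  return min + Math.random() * (max - min);
}

const s = randomSeconds(1.5, 4.5);
console.log(s >= 1.5 && s < 4.5); // always true
```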

Sweet, thanks for the tip :slight_smile:
