Dial i/o timeout when running more than 5,000 VUs

I am trying to find the maximum number of sessions our API Gateway at AWS is able to handle together with Lambda.

I am running k6 on a machine with 16 GB of memory and 16 cores, but I am not able to ramp up VUs above 5,000; I get a dial i/o timeout error.

I have done the OS fine tuning below, and while the test is running the CPU and memory on the test machine are not overloaded, yet I am unable to increase the VUs:

sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
sudo sysctl -w net.ipv4.tcp_timestamps=1
ulimit -n 250000  # ulimit is a shell builtin, so "sudo ulimit" has no effect; run this in the shell that starts k6

Is there any setting I am missing that is preventing me from ramping up the VUs?

k6 run --vus=8000 --duration=1s

WARN[0033] Request Failed error="Get \"https://test-urlv1/retro/device/TESTE9C5D9CF7683332C-B9F11TESTUS2\": dial: i/o timeout"
WARN[0033] Request Failed error="Get \"https://test-urlv1/retro/device/TESTE9C5D9CF7683332C-B9F11TESTUS2\": dial: i/o timeout"

running (32.8s), 0000/8000 VUs, 7826 complete and 174 interrupted iterations
default ✓ [======================================] 8000 VUs 1s

 █ setup

 data_received..................: 25 MB  756 kB/s
 data_sent......................: 4.5 MB 137 kB/s
 http_req_blocked...............: avg=3.33s    min=0s      med=1.13s    max=16.34s   p(90)=15.98s   p(95)=16.12s  
 http_req_connecting............: avg=2.86s    min=0s      med=247.52ms max=15.33s   p(90)=15.26s   p(95)=15.28s  
 http_req_duration..............: avg=168.15ms min=0s      med=267.72ms max=973.06ms p(90)=304.8ms  p(95)=503.84ms
   { expected_response:true }...: avg=310.78ms min=258.5ms med=283.52ms max=973.06ms p(90)=489.67ms p(95)=521.39ms
 http_req_failed................: 45.89% ✓ 3672       ✗ 4329  
 http_req_receiving.............: avg=25.91µs  min=0s      med=24.49µs  max=19.96ms  p(90)=53.85µs  p(95)=64.69µs 
 http_req_sending...............: avg=105.6µs  min=0s      med=120.55µs max=4.45ms   p(90)=214.44µs p(95)=242.15µs
 http_req_tls_handshaking.......: avg=320.26ms min=0s      med=503.38ms max=1.02s    p(90)=649.68ms p(95)=672.02ms
 http_req_waiting...............: avg=168.01ms min=0s      med=267.54ms max=972.85ms p(90)=304.56ms p(95)=503.56ms
 http_reqs......................: 8001   243.904172/s
 iteration_duration.............: avg=17.15s   min=1.1s    med=16.43s   max=30.67s   p(90)=30.4s    p(95)=30.55s  
 iterations.....................: 7826   238.569435/s
 vus............................: 3672   min=0        max=8000
 vus_max........................: 8000   min=8000     max=8000

Hi @karthisk19, welcome to the community forum :tada:

Given your arguments, you start 8k VUs for 1s, which then, due to the default gracefulStop of 30s, can run for up to another 30s, so for a total of 31s before being stopped.

(For some reason you are above that, which is likely just some slowdown due to the number of VUs.)

Given that you have 174 *interrupted* iterations and the log messages are at the 33rd second (WARN[0033]), I would expect that this is where the dial i/o timeout is coming from.

It looks like either the load generator, something in between, or the SUT can't handle creating 8k connections this fast, and likely either drops them or takes a lot more time to establish them.

This is doubly confirmed by the fact that http_req_connecting has an avg of 2.86s and a p(90) > 15s … so at least 10% of the connections took over 15s just to connect. Given that, and http_req_blocked being a bit above it, I would expect that you are just severely blocked on connections.
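One way to surface this kind of connection pressure automatically, instead of digging through the end-of-test summary, is a threshold on those two metrics. A minimal sketch of the options — the metric names and threshold syntax are standard k6, but the limits here are purely illustrative, not recommendations:

```javascript
// k6 options fragment (sketch): fail the test when connection setup
// is slow, rather than eyeballing http_req_connecting afterwards.
export const options = {
  thresholds: {
    // fail if more than 10% of connections take over 1s to establish
    http_req_connecting: ['p(90)<1000'],
    // also watch time spent blocked waiting before a connection starts
    http_req_blocked: ['p(90)<2000'],
  },
};
```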

Your iteration_duration is also > 17s on average, and at least 10% of iterations take 30.4s or more, which means they are very close to being timed out.

I would investigate the SUT, and also monitor the CPU and memory usage on both systems continuously, as I expect there is a bigger spike at the beginning of the test than later on.

I would also look at what is in dmesg and/or any available syslog, as it might have something interesting as well.

> max no of session our API gateway at AWS is able to handle along with lambda

If you are spawning a Lambda per request, I would expect AWS to throttle this at some point. While with 5k connections it might have worked okay (or been very close to erroring), with 8k it likely takes a bit longer.

You can also try to just increase the gracefulStop, for which you will need to configure a scenario. While that will likely get you a test without errors, I would argue, given the blocking and connecting times, that you already know something isn't okay. Whether that is the EC2 instance configuration or the SUT is a question that is hard to answer based on this data alone, sorry.
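For reference, a scenario with a longer gracefulStop could look roughly like this. It's a sketch: the executor and option names are standard k6, but the scenario name and the durations are just placeholders matching the test above:

```javascript
// k6 options fragment (sketch): the --vus/--duration CLI flags don't
// expose gracefulStop, so a scenario is needed to raise it above the
// default of 30s.
export const options = {
  scenarios: {
    ramp_test: {
      executor: 'constant-vus',
      vus: 8000,
      duration: '1s',
      gracefulStop: '50s', // let in-flight iterations finish for up to 50s
    },
  },
};
```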

Hope this helps you

I have run the test against the k6 test domain and I am seeing similar dial i/o timeout results (I increased the graceful time to 50s).

So is this a limitation of the machine or of k6?

Memory usage was less than 4 GB during the test (checked using iftop).
WARN[0031] Request Failed error="Get \"https://test.k6.io\": dial: i/o timeout"
WARN[0031] Request Failed error="Get \"https://test.k6.io\": dial: i/o timeout"

running (31.7s), 0000/8000 VUs, 8000 complete and 0 interrupted iterations
contacts ✓ [======================================] 8000 VUs 1s

 data_received..................: 50 MB  1.6 MB/s
 data_sent......................: 1.3 MB 42 kB/s
 http_req_blocked...............: avg=813.03ms min=0s      med=0s       max=5.8s     p(90)=3.68s    p(95)=3.75s   
 http_req_connecting............: avg=590.15ms min=0s      med=0s       max=3.28s    p(90)=3.22s    p(95)=3.25s   
 http_req_duration..............: avg=106.33ms min=0s      med=0s       max=1.81s    p(90)=283.16ms p(95)=305.23ms
   { expected_response:true }...: avg=284.97ms min=206.6ms med=250.84ms max=1.81s    p(90)=465.35ms p(95)=494.11ms
 http_req_failed................: 62.68% ✓ 5015       ✗ 2985  
 http_req_receiving.............: avg=2.34ms   min=0s      med=0s       max=257.55ms p(90)=97.81µs  p(95)=261.49µs
 http_req_sending...............: avg=47.38µs  min=0s      med=0s       max=6.77ms   p(90)=53.62µs  p(95)=81.62µs 
 http_req_tls_handshaking.......: avg=129ms    min=0s      med=0s       max=2.52s    p(90)=271.24ms p(95)=287.46ms
 http_req_waiting...............: avg=103.93ms min=0s      med=0s       max=1.81s    p(90)=280.31ms p(95)=298.56ms
 http_reqs......................: 8000   252.465968/s
 iteration_duration.............: avg=20.9s    min=1.69s   med=31.2s    max=31.65s   p(90)=31.37s   p(95)=31.4s   
 iterations.....................: 8000   252.465968/s
 vus............................: 5015   min=5015     max=8000
 vus_max........................: 8000   min=8000     max=8000


> so is this limitation on the machine or the k6

This is a really hard question to answer, but I just ran 10k VUs on my laptop (which has a 3500U, so nothing crazy) against another machine on my local network. After some nginx tweaks to have enough workers, a higher file limit, and port reuse, I get more or less 100% correct responses.

After 6-7 runs I got 1 dial i/o timeout that I haven’t been able to reproduce …

Given that I can do this on much smaller hardware with k6 v0.38.2, I would argue it is the system under test that is not performing, or something else (including the OS configuration, or a router in between) that is not handling the load.

Again, at 5k+ requests at the same time, you are likely hitting some rate limiting somewhere.

P.S. I also went with per-vu-iterations, with 10k VUs and 1 iteration each, as at least on my machine k6 can't start 10k VUs in under 1s.
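A per-vu-iterations setup like the one described could be sketched as follows — the executor name and options are standard k6, but the scenario name and maxDuration value are illustrative assumptions:

```javascript
// k6 options fragment (sketch): each of the 10k VUs runs exactly one
// iteration, instead of looping for a fixed wall-clock duration.
export const options = {
  scenarios: {
    one_shot: {
      executor: 'per-vu-iterations',
      vus: 10000,
      iterations: 1,
      maxDuration: '2m', // upper bound; VUs still running after this are interrupted
    },
  },
};
```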

OK, understood. I will run it on another machine and check the results. Thanks again for the results and for helping me understand the tests.