Have you ever tried to optimize a system with many moving parts and have had trouble finding where the problem was? Did you ever make a code change thinking you fixed the problem only for it to come back hours later?
I had been working on project in Golang with Websockets. I was starting to collect some metrics about the server like load, saturation, and most importantly latency.
Something peculiar started to happen. I noticed that only specific devices would ping back in a less than 100 milliseconds. Other devices would ping back sporadically ranging from 100ms to 1000ms.
Eventually I narrowed down on which devices this was happening. Ping from laptop to desktop - 5ms. Ping from desktop to laptop - 5ms. Ping from desktop to unplugged laptop - 700ms. Bingo. It was then I had a lesson on 802.11 power saving.
That's right. For 802.11 packets there's a PWR_MGT flag that, when set, tells the router that the device is going to sleep and to start queuing packets. When the device wakes up, the router flushes the queue of packets. The delay from my phone and from my laptop was different depending on how aggressive the power saving was, whether or not the device was plugged in, and if the device just went to sleep or was still awake.
Imagine you were making changes to your Websocket code to debug the problem and you happened to plug in your laptop to power. The problem would magically solve itself. After this happened to me a few times, it was time to get back to basics.
To get to that point, a key step in the debugging process was to, unexpectedly, send more frequent ping messages. If I sent a ping every 100 ms instead of every 1000 ms, I would see acceptable ping times. The device has to fully wake up to send ping reply messages so during that time you can send low latency messages to the device. This might be a workaround if you need it although it has drawbacks like more power consumption and higher data volumes.
This problem eventually motivated me to get a wireless adapter capable of capturing 802.11 power saving packets. If you're interested I included a cleaned up capture below with some notes of the power saving process of a ping on a local network from a laptop to an Android device [0].
This problem has shown me the importance of challenging your assumptions and being diligent in isolating unknown variables.
Please take a look at this post on rachelbythebay.com as well. I felt like I was reliving the same experience her teammates shared and wanted to connect these stories together!
More details on power saving mode: https://dot11zen.blogspot.com/2018/02/80211-power-management-with-packet.html
Note: Something interesting: the WIFI_MODE_FULL_LOW_LATENCY flag on Android basically just disables this power saving feature according to the spec. Unfortunately this isn't a workaround if you are developing browser websockets but still good to know.
[0] Frames from Wireshark are included below to showcase the power saving behavior. The ping below from my laptop to my phone took around 943.3 milliseconds.
Notes | No. | Time (ms) | Transmitter | Reciever | Protocol | PWR MGT | Info |
---|---|---|---|---|---|---|---|
Laptop wakes up | 0 | -0.3 | laptop | router | 802.11 | STA will stay up | QoS Null function (No data), SN=0, FN=0, Flags=.......T |
Laptop sends ping request packet | 1 | 0.0 | laptop | router | ICMP | STA will stay up | Echo (ping) request id=0x7e87, seq=1/256, ttl=64 |
Laptop starts sleep | 2 | 100.5 | laptop | router | 802.11 | STA will go to sleep | QoS Null function (No data), SN=0, FN=0, Flags=...P...T |
Android wakes up, requests buffered packets | 3 | 793.8 | android | router | 802.11 | STA will stay up | QoS Null function (No data), SN=818, FN=0, Flags=.......T |
Router sends ping request packet | 4 | 795.7 | router | android | ICMP | STA will stay up | Echo (ping) request id=0x7e87, seq=1/256, ttl=64 |
Android sends ping reply packet | 5 | 797.3 | android | router | ICMP | STA will stay up | Echo (ping) reply id=0x7e87, seq=1/256, ttl=64 |
Android starts sleep | 6 | 838.4 | android | router | 802.11 | STA will go to sleep | QoS Null function (No data), SN=819, FN=0, Flags=...P...T |
Laptop wakes up, requests buffered packets | 7 | 942.4 | laptop | router | 802.11 | STA will stay up | QoS Null function (No data), SN=0, FN=0, Flags=.......T |
Router sends ping reply packet | 8 | 943.3 | router | laptop | ICMP | STA will stay up | Echo (ping) reply id=0x7e87, seq=1/256, ttl=64 |
Laptop starts sleep | 9 | 1043.3 | laptop | router | 802.11 | STA will go to sleep | QoS Null function (No data), SN=0, FN=0, Flags=...P...T |