Saturday, August 7, 2021

Interesting Networking Issues reported by customers, and Important simple things to remember..!!!

 With around 15 years of experience in networking products like switches, routers, and GPON access devices, I have worked on several features across different companies, including Brocade, Nokia, Matisse Networks, and 3Com. After a brief stint in telecom I moved to core networking, and along the way I worked on many customer issues, some of which were quite interesting, some easy, and some very tedious.

Each of the issues I worked on taught me a lot. Here I present some of the more interesting customer issues. But before I jump into them, let me tell you that in most cases the customer reports only a symptom, and R&D has to start from there to identify the actual root cause. In the most difficult scenarios, the issue WILL NOT be reproducible, so the R&D team has to grab as much information as possible before cornering the root cause.

For starters, when I say the customer reports only a symptom, it literally means they tell you what they see. For example: "I executed the command and the system crashed." Chances are the system could crash on other commands as well if it is running low on memory, but the customer will not be aware of that.

Another example: "Packets are dropping" or "connections are timing out." They will not tell you whether they made any configuration changes, swapped any hardware, or changed any cabling. It is R&D's job to ask as many questions as possible to get the answers. The following things are very important to collect:

1. In case of a crash, the stack trace left behind. On Linux-based systems, a coredump would be valuable.

2. Syslog data, which is crucial to identify what the customer executed around the time of the issue.

3. Observing timestamps is of utmost importance. In fact, this is the first thing to do: check the timestamps of the issue and match them against syslog. In some cases, it saves a lot of time.

4. Any non-default configuration present in the system, and the last command executed before the issue cropped up.

5. The customer DB. It usually is not shared unless it becomes important; get it if possible for easier analysis.

6. Depending on the kind of problem reported, the relevant debug commands need to be collected and shared with the TEC/TAC team that interacts with the customer. These need to be handled with a lot of caution, as any wrong execution on the device can create other problems too.

7. It is better for R&D to avoid direct interaction with the customer and let TEC/TAC front-end. In most cases, by the time the issue lands in R&D's lap, a lot of time has already elapsed, and this can irritate the customer. Their frustration can spill over, and it is better that TEC/TAC handle it, as they are trained to deal with it. In my experience, women handle these scenarios much better.

8. Ensure there is a dry run of the debug activity before facing the customer.

9. Gather details of the kind of terminals the customer is using as well.

So, moving on to the issues I faced: the title is the customer-described symptom, and the description has the steps we followed to root-cause it. Chances are you may end up surprised as to where the issue began and where it ended !!!

1. Issue: Executing a show command over SSH crashes the CLI, but the same command works fine over TELNET:

Usually when such crashes are reported, it should be a cakewalk for R&D, as the stack trace should point at the location of the crash and a relevant code walkthrough would do the rest. But in this case, the same command did not crash when executed through TELNET. This meant there was no apparent problem with the functionality being run, so the suspicion fell on the SSH protocol stack. The stack trace pointed at code related to an array. When we tried to reproduce it by running the same command over SSH, it worked fine. One clue we got from the trace was that the problem was related to the window-size aspect of the display, but we could not figure out what exactly was wrong. We had a lucky breakthrough when the customer shared what the output looked like when the issue happened. We noticed an abnormally large number of header characters like "---------" before the columns were printed; normally the column names follow immediately. This made me wonder which terminal the customer was using: PuTTY, Tera Term, xterm, etc. I asked the TEC team to find out how the customer was running the command, what the terminal's settings were, and its name, so we could use the same setup.

When TEC approached the customer, he gave a demo of their terminal, and TEC observed one interesting setting: the window size was set to -1. This was sufficient for me to identify the problem.

Whenever an SSH session is established, the window size is exchanged as part of the negotiation so the display can be aligned. There is no limit on it other than the size of an integer; in the TELNET server, however, this limit is capped at 1024. When the customer used the terminal with the -1 setting, the window size was received as 0xFFFFFFFF, and SSH tried to allocate this huge amount of memory and failed. The failure was not handled correctly in the code, as this path was always deemed to succeed!!!
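The bug pattern boils down to an allocation sized by an unvalidated, client-supplied value. Here is a minimal sketch of the hardened version (all names are hypothetical, not the actual product code): clamp the advertised width to the same cap the TELNET server already uses, and actually check the allocation result.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_TERM_COLS 1024  /* same cap the TELNET server enforced */

/* Hypothetical sketch: validate the client-advertised window width
 * before sizing the display buffer. A -1 in the terminal settings
 * arrives on the wire as 0xFFFFFFFF, so an unchecked malloc() fails,
 * and that failure path must be handled instead of assumed away. */
char *alloc_display_buffer(uint32_t advertised_cols)
{
    if (advertised_cols == 0 || advertised_cols > MAX_TERM_COLS)
        advertised_cols = 80;   /* fall back to a sane default width */

    char *buf = malloc(advertised_cols + 1);
    if (buf == NULL)            /* the original code skipped this check */
        return NULL;
    buf[0] = '\0';
    return buf;
}
```

With this guard in place, a 0xFFFFFFFF window size degrades to the default width instead of an unchecked allocation failure and a crash.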

Now that we understood the cause, we were able to reproduce the crash using a script that starts a session with a window size as high as 2000000.

The fix is anybody's guess ..!! 

Lesson 1: Each CLI terminal application brings its own features and limitations.

2. Starting a download through SNMP causes the router to reboot (crash):

This issue started with the simple notion that the stack trace would be sufficient to identify the fix. Since it is SNMP-related, the SNMP interface calls some API that initiates the download. As is usual with crashes, the last API in the trace was suspected, as it is deemed to have caused the crash. But then came the surprise: the last API in the trace did not seem related to the purpose of the object used to initiate the download. It was as if we were initiating a download of one file, but the action seen was a download for a totally different application.

I started walking back up the trace and arrived at the SNMP API that was supposed to have started it all. When we are less experienced, we tend to believe the code already in the system is well tested and expected to work. But that is not always the case. There will be legacy code that undergoes transformation over time, and in some cases the old APIs cease to be relevant.

That is what happened here. The interface called from the SNMP set code was calling an old API that was no longer applicable or useful for this application. I immediately checked with the R&D team to understand why they continued to use it. They were equally surprised, and agreed to change it to the newer API, which fixed the issue.

Lesson 2: It is equally important to suspect the first API as the last. Suspect the legacy..!!

3. Ports are removed from ALL VLANs, including default VLAN 1 ..!!!!!!!!!!

How can a port not be part of any VLAN? At the very least it should be part of the default VLAN, shouldn't it? This issue was both surprising and annoying for the customer. There was a lot of packet loss before they noticed this behavior, and almost all of their managers were lined up demanding an explanation of how the system entered this state.

When they were asked what they did, they just said, "We do not know what happened. We configured a few modules and then noticed it."

The customer-facing TEC team was deployed to reproduce the issue. With so little information, they were never able to reproduce it. One of my teammates working on it offered to add some debug messages and share the build with the customer so they could repeat the same process in their lab, in the hope that the debug messages would give some hint.

I was assigned after he went on PTO, and by that time the issue already had the attention of the Director of Sustaining Engineering. I went through the syslog file, checked the logs along with the debug prints shared by the TEC team, and found that a TFTP-based configuration download had taken place.

So I tried the same with various combinations, initially with little luck. Finally I hit the condition of a port not being part of any VLAN, but I did not know how I got into that state !!! I had to recollect the combination of things I had tried, reboot the system to restore it, and start all over again.

Somehow I remembered that I had downloaded the configuration through TFTP and then applied some manual configuration afterwards. I tried that sequence again, got lucky, and hit the issue.

Then I realized the problem is seen only when the configuration is downloaded through TFTP. I started trimming the configuration down to find which command was causing the problem.

Finally I zeroed in on the "end" command!!!

The root cause lay in the router's CLI architecture. Normally in a CLI, to move from one configuration mode back to the previous one, there is an exit or end command, depending on the product. The end command drives the CLI from whatever mode it is in back to the root node. In this process, the CLI applies/sets each mode and stores all the command nodes available in that mode.

In one of the command nodes executed along the way, there was erroneous code that memset a port-VLAN membership bitmask to 0, which removed the port from every VLAN.

Once we identified this code, the fix was just a LINE...!!!!
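In miniature, the bug looks something like this (the types, names, and the exact one-line fix here are hypothetical illustrations, not the actual router code): a handler on the "end" path that zeroes the entire membership bitmask instead of only the bit it owns, silently dropping the port from default VLAN 1 as well.

```c
#include <stdint.h>
#include <string.h>

#define MAX_VLANS    4096
#define DEFAULT_VLAN 1

/* One bit per VLAN ID: bit set => the port is a member of that VLAN. */
typedef struct {
    uint8_t member[MAX_VLANS / 8];
} port_vlan_mask_t;

static void vlan_mask_set(port_vlan_mask_t *m, int vlan)
{
    m->member[vlan / 8] |= (uint8_t)(1u << (vlan % 8));
}

static int vlan_mask_test(const port_vlan_mask_t *m, int vlan)
{
    return (m->member[vlan / 8] >> (vlan % 8)) & 1u;
}

/* The bug in miniature: clearing the whole struct wipes membership in
 * EVERY VLAN, including default VLAN 1. */
static void buggy_mode_apply(port_vlan_mask_t *m)
{
    memset(m, 0, sizeof(*m));   /* the offending line */
}

/* The spirit of the one-line fix: clear only the bit this mode owns,
 * leaving all other memberships (including VLAN 1) untouched. */
static void fixed_mode_apply(port_vlan_mask_t *m, int owned_vlan)
{
    m->member[owned_vlan / 8] &= (uint8_t)~(1u << (owned_vlan % 8));
}
```

The buggy handler only shows its damage when every mode is replayed in sequence, which is exactly what a TFTP configuration download followed by "end" does, and why the issue never appeared during ordinary interactive configuration.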

Lesson 3: Write down the set of actions you perform when trying to reproduce an issue. You never know when you will get lucky..!!

More to come ...!!!!