Last week I had a difficult bug to track down where an Angular frontend was talking to a Node.js backend (made up of multiple microservices) through a load balancer. The browser appeared to be sending 2 POST requests when it should have been sending one.

The first thing I tried to do was shorten the feedback loop. The problem culminated after a long workflow in the browser with a single HTTP POST to the server. I opened Chrome DevTools and converted the request to a cURL command line. Strangely, cURL didn’t show the same behavior as Chrome. Running through the workflow a few times in Firefox also succeeded without error. I used tcpflow to examine the requests but there were no material differences.

Because this error only manifested itself after a substantial upgrade of the backend services I reasoned I could run the frontend locally. I did this, removed all the guards from the Angular frontend and pointed it the test server. This allowed me to navigate directly to the endpoint in the workflow so I could reproduce the request just by clicking a button.

The next thing I wanted to look at was what was happening on the server. I enabled logging on the load balancer and what I saw was unexpected. One request received by the load balancer was being turned into two requests sent to the internal application. Again, this was not behavior that could be reproduced on cURL or Firefox. After adding additional logging to the application, I turned on network logging for Node.js. This led me to the real issue.

I noticed something different between the Firefox and Chrome requests, namely an additional socket timeout when dealing with connections from Firefox. This led me to think that perhaps because of the upgrade, which included bumping Node.js to the next major version (8.0.0), that it could be a platform issue. Indeed, when I downgraded Node.js on the server to 7.10.0, the problem vanished.

The final technique that I want to share is something that I learned from the git bisect command. It uses binary search to narrow problems down to a specific commit. After upgrading Node.js to 8.4.0 and seeing the problem had been fixed, I backtracked through the versions using binary search until I found the version where this issue had been fixed, which was 8.1.1. Reading through the changelog, I ultimately found the bug (#13391) which summarized my problem exactly.

It seems the convergence of keep-alive differences between browsers, the time taken for a request to complete and CORS all contributed to a bug that caused me to want to tear my hair out. At least I got a blog post out of it.

In summary, a list of tools and techniques I used for debugging this problem were:

  • Shorten feedback loops e.g.
    • Reduce the issue to the smallest amount of services or piece of code that still exhibits the problem
    • use cURL to replay problematic network requests
    • use Selenium or other headless browser frameworks to replicate browser issues
  • Use network sniffers like tcpflow or Wireshark to understand exactly what is being sent over the network
  • Ensure an appropriate amount of log statements in code and enable platform logs when required.
  • Binary search (e.g. git bisect)