Why 90% of Engineers Can't Debug (They're Looking in the Wrong Place)

Have you ever joined a new team, with a new tech stack and a new process, and found a production issue already waiting for you? Here is how I ramp up fast: I zoom out before I zoom in.

The Problem: Everyone Hunts Where They’re Comfortable

During my consulting gigs, I am constantly dropped into unfamiliar environments: new teams, new architecture, new tech stack, new development process, new deployment process. It’s always new. The question is always the same: how do you stay on top of it?

My strategy is simple: I start by understanding the end-to-end flow. Before I touch one line of code, I map how data moves from the first user interaction to the last byte written. Once I can see the whole picture, I can choose where to zoom in.

Take the classic “app is slow” example. It is a complaint we all hear constantly: the user clicks on a web page, and the page feels sluggish.

In a typical setup, a client app or website sends a request to a reverse proxy, which forwards it to backend services. Those services may push work to a queue, and they may also need to talk to a distributed database.

There are many moving parts, and the common trap is to hunt only where you are comfortable:

A backend developer dives into server code
A frontend developer opens the performance tools on the browser
An ops-focused engineer checks proxy settings

All of those are valid, but each is incomplete without understanding the overall system.

My End-to-End Approach

So I begin with the symptom: which pages or endpoints are slow? Then I track the request step by step.

I start with the frontend. Which endpoint is it calling? Is something slow on the frontend itself? If not, I follow the request to the reverse proxy, open its logs, and check the timings and retries. No luck? Then I follow the request to the backend servers.

What does the backend service do? Does it send work to a queue? Is there backlog pressure on the queue? Did a background job get rescheduled due to some failure? Are there any issues on the backend?

Maybe something is blocking the process. Maybe the problem is in the database: missing indexes where they’re needed, poor partitioning, or a database design issue. It might even sit lower in the stack, such as a network issue where packets are being lost.

Those are roughly the steps I follow for this kind of problem, instead of going with assumptions and focusing on only one section.
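To make the walk above concrete, here is a minimal sketch in Python of what I mean by following the evidence. It assumes you have already collected, for one slow request, the time spent in each hop on its own (excluding time spent waiting on downstream hops). The hop names and numbers are hypothetical, not real measurements, and `Hop`, `slowest_hop`, and `time_shares` are illustrative names rather than part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """Time spent in one hop of a request, excluding downstream waits."""
    name: str
    self_time_ms: float


def slowest_hop(hops):
    """Return the hop with the largest self time."""
    if not hops:
        raise ValueError("need at least one hop")
    return max(hops, key=lambda hop: hop.self_time_ms)


def time_shares(hops):
    """Return each hop's fraction of the total request time."""
    total = sum(hop.self_time_ms for hop in hops)
    if total <= 0:
        raise ValueError("total time must be positive")
    return {hop.name: hop.self_time_ms / total for hop in hops}


# Hypothetical numbers for one slow request, listed in request order.
SAMPLE_REQUEST = [
    Hop("frontend", 120.0),
    Hop("reverse proxy", 15.0),
    Hop("backend service", 90.0),
    Hop("queue wait", 40.0),
    Hop("database", 1850.0),
]

if __name__ == "__main__":
    shares = time_shares(SAMPLE_REQUEST)
    # Print hops from slowest to fastest so the bottleneck is on top.
    for hop in sorted(SAMPLE_REQUEST, key=lambda h: h.self_time_ms, reverse=True):
        print(f"{hop.name:16} {hop.self_time_ms:8.1f} ms  {shares[hop.name]:6.1%}")
```

With numbers like these, tuning the backend code or the proxy settings would barely move the total. The breakdown tells you where to zoom in.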

The same mental model applies to data pipelines. Whether it is an ETL or an ELT pipeline, I still try to walk the full path.

What is the nature of the source? Are we streaming or is it batch? Are we backfilling or is it live? Is the source an API or a database? In transformation steps, I try to figure out: is the transformation more CPU-intensive or I/O-intensive?

On the load side, I look for choices the team has made. What are the constraints? What is the partitioning strategy? How are these choices affecting the throughput?

If you don’t map the end-to-end flow, you will waste your time fixing a step that’s probably not the bottleneck. That is why you need to walk the entire path and zoom out before zooming in.
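For pipelines, one way to gather that evidence is to time every stage and compare wall-clock time with CPU time. The sketch below is a toy example under stated assumptions: the three stages are stand-ins (a `time.sleep` plays the role of waiting on a source), and `StageTiming`, `timed`, `run_pipeline`, and `bottleneck` are hypothetical names. A stage whose CPU time is close to its wall time is likely CPU-bound; a low ratio suggests it is mostly waiting on I/O.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class StageTiming:
    wall_s: float  # elapsed wall-clock time
    cpu_s: float   # CPU time used by this process during the stage

    @property
    def cpu_ratio(self) -> float:
        """Near 1.0 suggests CPU-bound; near 0.0 suggests waiting on I/O."""
        return self.cpu_s / self.wall_s if self.wall_s > 0 else 0.0


@contextmanager
def timed(stage, report):
    """Record wall and CPU time for the enclosed block under `stage`."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield
    finally:
        report[stage] = StageTiming(
            wall_s=time.perf_counter() - wall_start,
            cpu_s=time.process_time() - cpu_start,
        )


def run_pipeline(rows):
    """Run a toy extract/transform/load flow and return per-stage timings."""
    report = {}
    with timed("extract", report):
        time.sleep(0.1)  # stand-in for waiting on an API or a database
        extracted = list(rows)
    with timed("transform", report):
        transformed = [sum(i * i for i in range(r)) for r in extracted]
    with timed("load", report):
        loaded = list(transformed)  # stand-in for writing to a target
    return report


def bottleneck(report):
    """Return the name of the stage with the largest wall-clock time."""
    return max(report, key=lambda stage: report[stage].wall_s)


if __name__ == "__main__":
    report = run_pipeline([10, 20, 30])
    for stage, timing in report.items():
        print(f"{stage:10} wall={timing.wall_s:.4f}s cpu_ratio={timing.cpu_ratio:.2f}")
    print("bottleneck:", bottleneck(report))
```

In this toy run the extract stage dominates and barely uses the CPU, so optimizing the transformation code would be wasted effort.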

Production Reality and Communication Flow

Now let’s talk about production. First of all, let’s accept this: we have all said, “it works on my machine.” I remember saying it, and I’m sure you’ve said it too.

That usually means our own world isn’t the real world. Our local environment is not the production service.

As engineers and problem-solvers, our responsibility is to see production as much as we safely can. That doesn’t mean we need unrestricted access. It means we need to observe—we need real observability.

I want read-only access to logs, metrics, traces, and dashboards. I need feature flags and canary deployments to test my hypothesis without putting everyone at risk.
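As an illustration of testing a hypothesis on a small slice of traffic, here is a minimal sketch of a percentage-based feature flag in Python. It is not a full canary deployment, and it is not how any particular flag service works. It assumes each user has a stable ID and hashes it into one of 100 buckets, so the same user always gets the same answer. `in_rollout` and the flag name are hypothetical.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` buckets.

    The bucket depends only on the flag name and the user ID, so a user
    who is in the rollout at 10% stays in it when you raise it to 50%.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    # Hypothetical flag guarding a suspected fix, enabled for 5% of users.
    users = [f"user-{n}" for n in range(1000)]
    enabled = [u for u in users if in_rollout("new-query-path", u, 5)]
    print(f"{len(enabled)} of {len(users)} users see the new code path")
```

Comparing metrics between the two groups then tells you whether the hypothesis holds before everyone is exposed to the change.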

In complex, regulated environments like finance or large companies, deployments are careful and the environments are segmented for good reasons. That doesn’t change the method. I still try to map the end-to-end process, even though it’s not as streamlined as it is at startups or small and medium companies.

Here’s what changes in these large, regulated environments: as engineers, our communication skills become crucial, because now we need to map not only the technical flow end-to-end but also the communication flow, meaning how teams communicate with each other.

So this is how I ramp up in unfamiliar teams and systems: I zoom out before I zoom in. I look at how the entire system creates value. I choose to focus based on evidence and not comfort.

That is the most important part. I don’t pick where to focus because I’m comfortable there. I pick it because the evidence points there.

If this approach makes sense to you, try it the next time you’re dropped into an unfamiliar system.

Watch the Video

I also shared this debugging approach in video format. You can watch it here:
