2026-03 Time at ClickHouse

Back in 2022, I was interested in learning how to build SaaS products on cloud infrastructure. I joined ClickHouse, spent three and a half years there, and had a great time. Here are some notes on what I learned.

Don’t give up easily

We can find ourselves in challenging situations: investigating the root cause of a complicated incident, or working on a PR you have already spent so much time on, yet you keep finding more corner cases with failing tests. It’s very tempting to give up.

But don’t give up.

The conviction that you can beat the challenge can be a self-fulfilling prophecy. Over time, you will get used to it and live up to it. It’s surprising how much such a belief helps in difficult situations.

Of course, there are cases where the effort is not worth it; that’s why I said “easily” :)

Production issue investigation

Debugging is usually harder than implementing features. Here are a few things I found valuable.

Think critically about the root cause. A plausible root cause should explain the incident and withstand challenging questions.

  • Why didn’t this happen before?
  • Was it because the triggering condition was missing, or because the binary or configuration was different?

Either way, we find the true root cause or we learn something new.

Take a step back and re-learn the basics when they are crucial to debugging yet still elusive to you. When I was debugging a VPC peering issue, not understanding how Cilium manages IPs and ENIs prevented me from asking the right questions.

Networking issues

Networking issues can be mysterious. For a “simple” question like why a connection was reset or timed out, the answer can lie in many places:

  • server implementation
  • proxy configuration
  • client libraries or their network settings
  • cloud provider flaws, etc.

Some long-tail tickets or errors require a sheer amount of investigation effort. Building a repro, swapping components, or running controlled experiments for comparison helps, but only to a certain extent.

I don’t have a good answer to this. Let me know if you have any thoughts! That said, a few things might help:

  • Establish a better SLA and triage model.
  • Build a wide range of testing clients, for all traffic paths and protocols you want to cover.
  • Build comprehensive troubleshooting tools and observability across the stack.
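
As a rough illustration of the testing-client idea, here is a minimal Python sketch of a TCP probe that labels how a connection attempt ended. The `probe` function and its labels are hypothetical names chosen for this example; a real client would also cover TLS, HTTP, and whichever protocols and traffic paths you care about.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0, payload: bytes = b"") -> str:
    """Open a TCP connection and return a coarse label for the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if payload:
                sock.sendall(payload)
            # Wait for at least one byte; an empty read means the peer closed.
            data = sock.recv(1)
            return "ok" if data else "closed_by_peer"
    except socket.timeout:
        # Either the connect or the read exceeded the timeout.
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except ConnectionResetError:
        return "reset"
    except OSError as exc:
        # Anything else (unreachable network, DNS failure, ...).
        return f"error:{exc.errno}"
```

Running the same probe from several vantage points (inside the cluster, through the proxy, from outside the cloud provider) gives a cheap way to compare where along the path a reset or timeout first appears.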

Misc

Divide and conquer is your best friend on a large and complicated project. It’s very common to need buy-in from many stakeholders, so focus on one aspect at a time. For example: using a service mesh, private link, or VPC peering; how to change the control plane to meet new project requirements.

Integration tests are undervalued and are best invested in early on. It takes a lot more effort to address them later if we don’t set them up properly upfront. With a proper Docker Compose setup and test framework, you can get quite far. Be creative. Missing the Proxy Protocol header that a cloud load balancer would add? Put HAProxy in the middle to simulate it. Want to inject a network partition? Use iptables inside the Docker container.
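
When even HAProxy is more than a test needs, a test client can prepend the header itself. Below is a minimal Python sketch that builds a PROXY protocol version 1 (human-readable) header line; `proxy_v1_header` is a hypothetical helper written for this example, and it assumes the server under test accepts the v1 text format rather than only the binary v2 format.

```python
import ipaddress


def proxy_v1_header(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Build a PROXY protocol v1 header line for a TCP connection."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    if src.version != dst.version:
        raise ValueError("source and destination must use the same IP version")
    for port in (src_port, dst_port):
        if not 0 <= port <= 65535:
            raise ValueError(f"port out of range: {port}")
    family = "TCP4" if src.version == 4 else "TCP6"
    # The v1 header is a single ASCII line terminated by CRLF.
    return f"PROXY {family} {src} {dst} {src_port} {dst_port}\r\n".encode("ascii")
```

The test client would send these bytes first on a freshly opened socket, before any application data, so the server sees the simulated client address instead of the test container’s own.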