Observability 2.0

March 04, 2024

I'm not an observability guru but everything inside of me screamed “yes yes yes” when I read Ivan’s article about logs, metrics and traces vs "wide events".

When I was working at Meta, I wasn’t aware that I was privileged to be using the best observability system ever. The basic idea is extremely simple and doesn’t require a glossary page for people to grasp. It operates with Wide Events. Wide Event is just a collection of fields with names and values, pretty much like a json document. If you need to record some information - whether it’s the current state of the system, or an event caused by an API call, background job or whatever - you can just write some Wide Event.

Such events are called wide because it’s encouraged to dump to them all the information one can think of. Anything that might be relevant in the context of a certain data - just put it there, it might be useful later. This approach is laying the groundwork for dealing with unknown unknowns - something you can’t think of now that may be revealed later during an incident investigation.
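
To make the idea concrete for myself, here is a minimal Python sketch of what writing a wide event could look like. This is my own illustration, not Meta's system or anything from the article: the function names (`make_wide_event`, `to_json_line`) and every field in the example are made up.

```python
import json
import time


def make_wide_event(name, **fields):
    """Build one wide event: a flat mapping of field names to values.

    Hypothetical helper for illustration only. There is no fixed schema;
    callers attach whatever context they have on hand.
    """
    event = {"event": name, "timestamp": time.time()}
    event.update(fields)
    return event


def to_json_line(event):
    """Serialize the event as a single JSON line, ready to append to a log or queue."""
    return json.dumps(event, sort_keys=True) + "\n"


if __name__ == "__main__":
    # All field names and values below are invented for the example.
    event = make_wide_event(
        "api_request",
        endpoint="/checkout",
        status=500,
        duration_ms=412.7,
        user_id="u_18423",
        region="eu-west-1",
        build_sha="3f9c2ab",
        cache_hit=False,
        retry_count=2,
    )
    print(to_json_line(event), end="")
```

Nothing here decides in advance which fields matter. Fields like `build_sha` or `retry_count` cost almost nothing to attach and may turn out to be exactly what an incident investigation needs.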

This hit home for me because I have often felt there is insufficient information to properly debug or diagnose a situation, even though we have logs, traces and metrics. In the article, he also linked to an incredible thread by Charity Majors where she contrasts where observability is today (1.0) with a better future (2.0):

Observability 1.0 is f-ing HARD, yo. Trying to predict which custom metrics to define, managing the cost profile, pruning the expensive or unused ones, managing your keyspace and avoiding anything with cardinality (i.e. the useful stuff!), managing dashboard sprawl... With o11y 2.0, if there's something you think might be useful someday, stuff it in the blob. Do a printf, basically. Any type of data -- just stuff it and forget about it. With metrics, your bill goes up linearly (at least) with the number of custom metrics you store, but with o11y 2.0, you effectively get infinite free custom metrics. You get charged per event (or span), but you can make those events as wide as you want. Load em up! The more context the better! You never know what might be valuable someday. 😻
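
The "infinite free custom metrics" point clicked for me once I sketched it out: if the raw events are kept, a metric is just an aggregation you run later, grouped by whatever field you like, including high-cardinality ones such as a user ID. A rough Python sketch, assuming events are plain dictionaries held in memory (a real system would run this as a query against an event store); the helper names and the sample data are hypothetical.

```python
from collections import defaultdict


def count_by(events, field):
    """Count events per distinct value of `field` (a metric derived after the fact)."""
    counts = defaultdict(int)
    for event in events:
        counts[event.get(field)] += 1
    return dict(counts)


def error_rate_by(events, field):
    """Fraction of events with status >= 500, grouped by `field`."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event.get(field)
        totals[key] += 1
        if event.get("status", 0) >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


if __name__ == "__main__":
    # Invented sample events; note that nobody had to predefine a "per-user error rate" metric.
    events = [
        {"event": "api_request", "user_id": "u_1", "region": "eu", "status": 200},
        {"event": "api_request", "user_id": "u_1", "region": "eu", "status": 500},
        {"event": "api_request", "user_id": "u_2", "region": "us", "status": 200},
    ]
    print(count_by(events, "region"))
    print(error_rate_by(events, "user_id"))
```

Grouping by `user_id` is the kind of cardinality that the thread says you have to avoid with predefined metrics. Here it is just another field to group on.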

I'm sold on "wide events" serving as better logs and, with a system like the one Meta has built, even as metrics, but I'm unsure how they would replace tracing. I'll have to do more digging because this vision is compelling.