Who wants to spend time dreaming about the ability to leap tall buildings with a single bound, when we can recast stories we live day to day as powers of our own… and improve our own lives in a practical way? When observability is folded into the development process itself, it represents the potential for a beautifully virtuous cycle: production stops being just where our development code runs into issues, and it becomes where part of our development process lives.
Christine Yen’s talk took inspiration from the Marvel Cinematic Universe (with a small dabble into DC Justice League towards the end to reference the broody Batman) to explain to both Devs and Ops the importance of shifting the production feedback loop all the way left into the developer’s hands. With the right set of tools and some cultural willingness to change, Observability provides the entire team with quantitative data straight from the real world, making it the latest superpower for developers to harness in the ever-changing, software-focused world we work in.
Christine is the co-founder of honeycomb.io, a company that practices what they preach. She started with a brief survey of the room to get a sense of who identifies as “ops” and who identifies as “developer”. She is firmly in the developer camp (and delights in it).
As part of the exposition, her early career included glee at being a fast developer. WRITE-TEST-COMMIT! It was some time later she met her first Ops person and started to understand how her cycle of development actually included debugging truly impactful changes pushed to production, which the ops team had to deal with.
Through continued partnership with Ops, Christine learned about taking responsibility for the changes she made, rather than giving the auto-response “it worked on my machine”. A February 2019 Medium post by Subbu Allamaraju at Expedia (https://m.subbu.org/incidents-trends-from-the-trenches-e2f8497d52ed) helps show the reality that changes in production lead to production incidents. Allamaraju analyzed incident data and provided the following insight: “Observation 1: Change is the most common trigger” (not the root cause, but the actual trigger of an incident). A change could be any number of things: automated CI/CD releases, partially automated legacy deploys, manual changes, config updates, and/or experimental changes like A/B tests.
Observability: “What is my software doing, and why is it behaving that way?” ~ Christine Yen
If the first wave in getting dev and ops to work better together was teaching Ops to develop, the second wave is to teach devs to own their code all the way into production. Observability as defined on Wikipedia is “understanding the behavior of a system based on knowledge of its external outputs.” As simple yet rigid monolith apps are replaced with flexible yet complex collections of services, Observability is the bridge that continues to blur the line between DEV and OPS to create positive software outcomes for everyone.
What is a standard(ish) software development process?
Items 3, 4, and 6 in the list above are generally agreed to be “testing” and lead to approaches like test-driven development. Tests form a feedback loop from non-prod back to the developer. Successful dev teams do lots of testing! Christine affirms that item 9 above, “observe our code in production”, can be seen as an extension of the testing process. Unlike the testing we can do in non-prod, observing our code in production provides exposure to real world usage and situations we cannot predict with static non-prod testing. Try as we might to anticipate everything, once our code is in a production environment with users actually using it, unpredictable outcomes will happen.
Note from Rainya: My favorite use of American pop-culture reference material in the whole presentation was when Christine used Meeko (the raccoon from the animated film Pocahontas) to illustrate the “expected” outcome, and then Rocket (the raccoon from Guardians of the Galaxy) for the “actual” outcome.
As a real world example of the value of including Observability in software development, @ceejbbot recently posted a thread about how her team prioritized observability, leading to the direct quote “it no longer feels like a scary fucking conundrum” about the performance problems they were experiencing. (https://threader.app/thread/1169408562855940097)
By folding Observability into the development process, we create a virtuous cycle that shortens the feedback loop from production to the developer. Adding “Observe” in the development cycle is more than a set of tools or a set of data. It is also about process and the culture of a team practicing looking through the code together.
“Having Thor’s hammer doesn’t make you Thor.” ~Christine Yen
Support natural dev vocabulary
Support (custom) high-cardinality data
Instrumentation should evolve alongside code
Start with the familiar
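As a rough illustration of the points above (developer vocabulary, custom high-cardinality fields, instrumentation that grows with the code), here is a minimal Python sketch using only the standard library. It is not Honeycomb's API or anything shown in the talk; `build_event`, `emit`, and every field name (`user_id`, `cart_id`, `build_id`, `feature_flag_new_pricing`) are hypothetical names chosen for the example.

```python
import json
import time
import uuid


def build_event(name, duration_ms, **fields):
    """Build one wide, structured event for a unit of work.

    Hypothetical helper: `fields` carries whatever custom context the
    developer cares about, including high-cardinality values such as a
    user ID or build ID.
    """
    event = {
        "name": name,
        "timestamp": time.time(),
        "duration_ms": duration_ms,
    }
    event.update(fields)
    return event


def emit(event, stream=None):
    """Serialize the event as one JSON line (stdout by default) and return it."""
    line = json.dumps(event, sort_keys=True)
    print(line, file=stream)
    return line


# Field names mirror the words a developer already uses for the feature.
# Adding a field later is a one-line change next to the code it describes.
event = build_event(
    "checkout",
    182.4,
    user_id="user-48213",           # high cardinality: one value per user
    cart_id=str(uuid.uuid4()),      # high cardinality: one value per cart
    build_id="2019-09-17.3",        # assumed build identifier format
    feature_flag_new_pricing=True,
)
emit(event)
```

Because each event keeps the raw identifiers rather than pre-aggregated counters, a question such as "which users or builds are slow?" can be asked after the fact, which is the kind of "why" a summary graph alone tends not to answer.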
“Tracing is what happens when logs grow up.” ~Christine Yen
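One way to read that quote: a trace is a set of structured log events that also record which request they belong to and which event caused them. The sketch below is a toy, standard-library-only illustration under that assumption; the `Tracer` class and the span names are hypothetical and do not represent any real tracing library.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Toy tracer: each span is a structured event with trace and parent IDs."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []    # finished spans, in completion order
        self._stack = []   # IDs of spans that are currently open

    @contextmanager
    def span(self, name, **fields):
        span_id = uuid.uuid4().hex[:16]
        parent_id = self._stack[-1] if self._stack else None
        self._stack.append(span_id)
        start = time.perf_counter()
        try:
            yield span_id
        finally:
            self._stack.pop()
            record = dict(fields)  # custom context first, core keys win
            record.update({
                "trace_id": self.trace_id,
                "span_id": span_id,
                "parent_id": parent_id,
                "name": name,
                "duration_ms": (time.perf_counter() - start) * 1000,
            })
            self.spans.append(record)


tracer = Tracer()
with tracer.span("handle_request", user_id="user-48213"):
    with tracer.span("load_cart"):
        pass  # stand-in for real work
    with tracer.span("charge_card"):
        pass  # stand-in for real work

for record in tracer.spans:
    print(record["name"], "parent:", record["parent_id"])
```

Each record is still just a log line with key-value fields; the added `trace_id` and `parent_id` are what let a tool reassemble the lines into a tree of work for a single request.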
Know WHAT code to write
Know HOW to write the code
Know if the code WORKS
In the past, Devs cared about code in dev envs while Ops cared about production.
Observability REDUCES THE BATTLES WE FIGHT, allowing us to skip the entire CGI battle sequence. It reduces the tension around releases and how often we get woken up in the middle of the night. We can ship more reliably. We can think through expected vs actual outcomes, be resilient, and do what we love, avoiding burnout along the way.
For Operators: think about how you can share the great responsibility (and great power!)
For Developers: embrace observability; bring prod closer to dev; ground your code in the reality of production insight, not just intuition
COMPLIANCE: Production Write vs Read! Adding production to the dev process means reading signals about how our code behaves, not writing to or changing anything in production. The other option is to rely on users to tell you when things go wrong.
LOW PRODUCTION TRAFFIC PRODUCTS: Still meaningful for any setup; whenever you take software you deeply understand and put it into an environment you don’t deeply understand, observability can still help; feedback loops are necessary at every stage for every product; exception trackers only work if the thing that you don’t understand throws an exception
WHAT ABOUT WHEN IT’S SEEN AS NICE TO HAVE: Show them what you can do when you get there; it’s like tests, and you like tests, right?!; work on speaking the language of devs who don’t want to slow down; counter the perception that instrumentation is a big heavy lift, since it can be incremental and evolve with the code you’re shipping
HOW CAN YOU CHECK IF YOU’RE DOING IT WELL? Do you have outages or incidents where you don’t know the answer? If you have tools in place and still can’t figure out why something is busted, that is a signal something isn’t right yet. It is easy to have 5 tools do the same thing with none of them giving you an actual ANSWER! Seeing a graph tells you something is off, but doesn’t give you the WHY behind it!
IS UAT A WASTE OF TIME? Christine didn’t know what UAT stood for, so guessing it isn’t part of her world! Once it was defined as “user acceptance testing”, her take was that having end users test with their goals in mind can be useful, but you still need to ask whether you will get coverage in a way that you care about. This sort of testing still relies on predicting what will matter in a world where you just cannot predict without gathering data. She does see a trend towards moving away from staging environments as production observability increases.