The Operations HOWTO

Generally, when you run a Software-as-a-Service (SaaS) or Platform-as-a-Service (PaaS) offering, your product is delivered via a web browser, as a service without a direct User Interface (UI), or as some combination of the two. Broadly speaking, everything else consists of:

Ticketing/Tracking system – The record of what to do and what was done.

Source Code – The code that runs and the tests that validate it.

Environments and Configuration – The hardware and software that delivers your product. This might consist of even more code and lots of data needed by everything else to do its job.

Monitoring/Alerting – The real-time or near-real-time data indicating the performance and health of your systems.

Document Store – The long-term store of data that doesn’t fit anywhere else. Think Atlassian Confluence or a wiki.

There should be a single source of truth for everything, and everything needs to be linked to everything else. What do I mean by this?

If a developer is asked to do something, whether it is to add a new feature or fix a bug, their first response should always be: “Have you filed a ticket?” The ticket not only records the work, it also contains links to the merge request / pull request (MR/PR) and to whatever design documents were created in the Document Store. The MR/PR should trigger both automated tests and automated builds. These should generate another ticket, although it might not be in the same system as the one tracking developer work. This “build and test” ticket will contain links to the MR/PR, the build results and the test results.
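One way to back up the “Have you filed a ticket?” rule is to have a commit hook or CI step reject commit messages that do not mention a ticket. The following is a minimal sketch, not a definitive implementation: the ticket key format (an uppercase project prefix, a hyphen and a number, such as DEV-42) and the function names are assumptions for illustration, so adjust the pattern to whatever your ticketing system uses.

```python
from __future__ import annotations

import re

# Assumed ticket key format, e.g. "DEV-42". This is hypothetical; change the
# pattern to match the keys your own ticketing system generates.
TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def find_ticket_ids(commit_message: str) -> list[str]:
    """Return every ticket key mentioned in a commit message, in order."""
    return TICKET_PATTERN.findall(commit_message)


def check_commit_message(commit_message: str) -> bool:
    """Return True when the message references at least one ticket."""
    return bool(find_ticket_ids(commit_message))
```

A hook could call check_commit_message() on each new commit and fail the push or the pipeline when it returns False.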

Let’s track what we have so far:

A developer ticket, with links to the Document Store and the MR/PR. The MR/PR links back to the ticket in its commit message, and any documents in the store should also link back to the original developer ticket. We have another ticket with links to the build and test results as well as the original MR/PR. So far so good: nothing we’ve generated can become an island unto itself. We can take any individual piece and reconstruct the entire chain of work and progress.
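The claim that any individual piece lets you reconstruct the whole chain can be sketched as a small graph problem: treat each artifact as a node, each link as an edge recorded in both directions, and walk the graph from wherever you happen to start. The class name, the artifact identifiers and the "kind:id" naming scheme below are all hypothetical; in practice the links would live in your ticketing system, source control and document store rather than in memory.

```python
from collections import defaultdict, deque


class LinkGraph:
    """Bidirectional links between artifacts (tickets, MR/PRs, documents)."""

    def __init__(self):
        self._links = defaultdict(set)

    def link(self, a: str, b: str) -> None:
        # Record the link in both directions so neither side is an island.
        self._links[a].add(b)
        self._links[b].add(a)

    def chain(self, start: str) -> set:
        """Return every artifact reachable from `start`, including itself."""
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            # Breadth-first walk over links we have not followed yet.
            for neighbor in self._links[current] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        return seen


# Hypothetical artifacts mirroring the chain described above.
graph = LinkGraph()
graph.link("ticket:DEV-42", "mr:101")
graph.link("ticket:DEV-42", "doc:design-login-throttling")
graph.link("ticket:BUILD-7", "mr:101")
graph.link("ticket:BUILD-7", "build:results-7")
graph.link("ticket:BUILD-7", "test:results-7")
```

Starting from the test results alone, graph.chain("test:results-7") walks back through the build and test ticket and the MR/PR to the developer ticket and its design document.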

Now it’s time to get this into production. Another ticket should be generated, with links back to both the developer ticket and the build and test ticket. More often than not, new features will generate new metrics that need to be monitored and new alerts that need to be defined. Ideally, these are kept in source control too, and their MR/PRs are linked in the “Ops ticket”. If additional documentation is created in the Document Store around moving the new things into production, it is linked as well. As the change moves up from the lower “pre-prod” environments toward production, new details are added and tracked.

What we end up with is a system that people in the future can understand and maintain. When the new hire responds to an alert, the links in the alert will contain enough information for her to understand which metrics are out of line and how to mitigate the issue. If all else fails, she can track down the original MR/PR, which will identify which team is responsible for that area of code, even if the original developer hasn’t worked at the organization for quite some time.
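If alert definitions are kept in source control, a check in the pipeline can refuse any alert that does not carry the links a responder will need. This is only a sketch under stated assumptions: the alert is represented as a plain dictionary, and the field names ("links", "ops_ticket", "merge_request", "runbook") are invented for illustration and do not correspond to any particular monitoring product.

```python
from __future__ import annotations

# Hypothetical link fields every alert definition is expected to carry.
REQUIRED_LINKS = ("ops_ticket", "merge_request", "runbook")


def missing_links(alert: dict) -> list[str]:
    """Return the required link fields that are absent or empty."""
    links = alert.get("links", {})
    return [name for name in REQUIRED_LINKS if not links.get(name)]


# Example alert definition with made-up identifiers; the runbook is missing.
alert = {
    "name": "login_throttle_rejections_high",
    "links": {
        "ops_ticket": "OPS-9",
        "merge_request": "mr:101",
    },
}
```

Here missing_links(alert) reports the absent runbook, so the gap surfaces during review instead of during an incident.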

I don’t think many experienced people will find these ideas novel, but as we move into a future of fully automated CI/CD and things like GitOps, the focus often drifts toward just getting everything working, and the attention to keeping everything traceable falls into the TODO bucket.
