A Rare Insight Into The Daily Challenges Of An Experiments Team


I won’t, however, just leave you with a table open to interpretation, as I think all of those traits are worth a paragraph or two of clarification.

  • Impact driven as opposed to being creative or technology driven. In over a decade, I have met more engineers passionate about technologies, coding paradigms and abstractions than engineers who just want to get stuff out there for the sake of learning and iterating. To be perfectly frank, you need both in an engineering organisation if you don’t want your product to become unusable and unmaintainable. Heck, even within our team, some of us care more about the architecture, software integrity and efficiency of the product than others, but we all share the conviction that whatever we do must have a tangible benefit to us as a team, to Prezi and, ultimately, to the user. This is not the team where you randomly get to try a new frontend library and rewrite one of your services in Rust — as exciting as that may sound.
  • It’s important not to confuse being versatile with being a jack-of-all-trades. That said, while the ideal engineer on our team may not necessarily be a hard-core full-stack engineer, they won’t shy away from jumping into either side of the codebase. In our case, that means a wide range of frontend and backend libraries and frameworks. It sounds intimidating perhaps, but in reality it’s a lot less about the expectation of knowing everything and a lot more about the openness to discover it all over time.
  • Being an efficient engineer deserves an article — if not a book — of its own, but let me condense it into a couple of thoughts. We engineers have the tendency to polish code and refactor to the point of giving the impression we’re not writing software but sculpting the Venus de Milo. In an experiments team, we’re more focused on creating meaningful stick figures. As long as we’re able to gauge from the experiment the data we need, the goal is achieved. The code doesn’t have to be optimised (unless that gets in the way of being able to run the test), and keeping the implementation as simple as possible is a prime objective. As long as it’s testable and revertible, you have yourself a candidate for release.
  • Having a data-driven attitude is key, and I think it drives a lot of the other traits. How often have we engineers developed useless features over weeks, months, maybe even years? It’s not uncommon. In an experiments team, however, you don’t have that luxury. Unless there is data to support a code change, a new feature or a variant of a feature, it simply won’t happen.
  • Being an avid learner goes hand-in-hand with being data-driven. The focus in an experiments team is on understanding what happened but more importantly why, as the answer will drive the next experiments and possibly a considerable part of the product strategy.
  • Being a competitive engineer who is comfortable with bold ideas doesn’t necessarily mean being reckless. It also doesn’t mean a lot of “hacking stuff together”. It’s rather the fine-tuned skill of seeing through the technical challenges in such a way that you can propose the shortest technically viable path to success, and that path doesn’t have to follow the status quo.

On a personal note, I would argue that many of the above skills are worth picking up over time for any engineer. As one moves from company to company, from team to team, being able to adapt to different mindsets can very positively impact one’s career.

If you find yourself having the opportunity to join an experiments team, go for it, learn from it, make the most of it. You’ll thank yourself later.

All fingers in all pies

Before joining the GM team in Prezi, I was lead engineer on Prezi Video for Zoom and, later, on the first two waves of Prezi AI. Both, and especially the former, meant that development time was mostly spent in a couple of repositories, in very distinct areas of the product. Prezi Video for Zoom was a web app of its own, and Prezi Present — where Prezi AI was released — is mostly a self-contained entity as well, unless you start veering into service territory, but we have dedicated teams for that. In contrast, on the very first day I joined the GM team, I found myself checking out not one, not two, but a whole list of repositories, and as time passed, a few more. I have eight running in my development environment at the moment, and that still doesn’t cover all the possible flows a user could take on the Prezi website. Add to that Prezi Present, which we still contribute to with experiments, and you have yourself a context in which certain complexities are unavoidable.

You may wonder, why unavoidable? Can’t other teams run their own growth and monetisation experiments in their respective areas of expertise and ownership? I have no doubt that in certain organisations that is possible. And even in Prezi, for instance, we were able to do that with Prezi Video for Zoom. Our Infogram team can also operate similarly, as it’s a distinct product. However, when it comes to the rest of what Prezi essentially is — the Prezi website, Prezi Present and Prezi Video — one has to approach it holistically, and we must be able to own the experiment end-to-end, which conveniently brings me to what an experiment lifecycle looks like.

Experiment lifecycle

A picture’s worth a thousand words, and because this article is vertiginously approaching 4,000, I’ll rely on a diagram to tell most of the experiment lifecycle story.

Experiment lifecycle diagram created by author in Apple Freeform

Releasing an A/B test is — quite literally — only half the work and half the story, but let’s see briefly what these 13 steps in the experiment lifecycle are:

  1. Ideation is a somewhat nebulous step, and it involves a lot of product manager/product owner (PM/PO) sorcery outside the scope of this article, but generally speaking, ideas will be based on market research, data, previous findings, user feedback, etc.
  2. Ideas there may be many, but it’s important to keep focus on what moves the company goals forward in a viable context. Sometimes ideas can be really good, but other things need to happen before they become feasible.
  3. Having a low-fidelity design — a rough sketch — of what the experiment and the user flow would look like can further validate the idea or uncover logical fallacies. At this point, you might already find engineers to be a great asset in the conversation.
  4. Getting to the planning stage means the experiment is now going ahead full-steam and gets into the upcoming sprint. In our case, we tend to work kanban style, so whoever is next willing and well-suited to pick the work up gets to do so. Every so often you’ll find that the experiment is not just a story but an entire epic, in which case several engineers might allocate their time to it, led by a project lead.
  5. Development is as self-explanatory as it can be. It’s the coding stage, including writing automated unit, integration and regression tests, adding the feature switches and getting everything into one (or more) pull requests for code review.
  6. Our manual QA team member(s) ensure everything has been done to spec and execute some regression testing as well. Given the number of experiments we run, it’s a much-needed peace of mind to know at least one set of objective eyes checks everything.
  7. Releasing deserves a section of its own, so keep reading. For now, let’s just say it involves setting up the feature switch configuration for the desired cohorts and enabling them; a minimal bucketing sketch follows this list. Once it’s released, a cleanup task is automatically generated for a later date (see step 12).
  8. Spot checking ensures we’re on the right track with the experiment: nothing blew up, and we’re not seeing any majorly negative results or collateral damage in signups or upgrades.
  9. After a few weeks, the experiment is stopped, so no more new users are getting exposed to the test. At times, we might allow the users who have been getting the test variant to keep having access to it to further observe user behaviour. This usually lasts no more than another 2–3 weeks.
  10. Evaluation is all about interpreting the data and understanding the learnings. This is the moment we may decide to release a variant to all users (success) or stick to the control (fail).
  11. Rollout is essentially the outcome of the evaluation — all users get one variant going forward, which from that point on becomes the control.
  12. Cleanup is another phase I deemed important enough to highlight in its own section, so do keep reading, but the short of it is, we ensure that all redundant code, tests, and feature switches are done away with. This triggers steps 4, 5 and 6, all culminating in the final step…
  13. Everything is done. The variant is rolled out, the code is cleaned up, and we have either learned something (failed experiment) or achieved something (successful experiment).
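On step 7’s cohort setup: the mechanics depend entirely on your feature-switch tooling, but here is a minimal TypeScript sketch of deterministic bucketing, assuming a hypothetical setup where user ids are hashed into 100 buckets and bucket ranges map to variants. It is illustrative only, not a description of Prezi’s actual system.

// A minimal, hypothetical sketch of deterministic cohort bucketing.
// The hash, bucket count and percentage splits are illustrative only.

type Variant = 'control' | 'variant-a' | 'variant-b';

// Stable string hash (djb2), so a user always lands in the same bucket.
function hashToBucket(userId: string, buckets = 100): number {
  let h = 5381;
  for (let i = 0; i < userId.length; i++) {
    h = (h * 33 + userId.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit int
  }
  return h % buckets;
}

// Example split: 10% variant A, 10% variant B, the remaining 80% control.
function assignVariant(userId: string): Variant {
  const bucket = hashToBucket(userId);
  if (bucket < 10) return 'variant-a';
  if (bucket < 20) return 'variant-b';
  return 'control';
}

The point of the stable hash is that a returning user always lands in the same cohort, keeping their exposure to the experiment consistent across sessions.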

That’s the gist of the experiment lifecycle, but as I mentioned, there are a couple of stages there that are really worth digging into more to truly understand some of the complexities and challenges a team like ours can face on a daily basis.

Dealing with feature switches

Some will call them the best human invention since fire; others, a necessary evil. I, for one, think they’re a very useful tool, but like every tool, they can be overused or misused. In our case, it’s invaluable to have the option of setting up a new feature switch for every experiment and variant. The more challenging part is keeping track of them all.

For context, we have 9 engineers on the team, and generally speaking, we aim for just as many experiments per sprint. Some quick maths suggests 160 experiments per year, but let’s go with a more conservative 100 experiments instead. Assuming just two variants per experiment already means 300 feature switches, 100 of which control the bucketing of the variants. If not handled correctly, things can quickly get out of hand, so we have devised some ways to avoid that (a sketch follows the list):

  • Adding a special prefix to feature switches that control the variants.
  • Using team-based feature switch prefixes.
  • Making sure each feature switch has clear ownership marked — we use a unique team email address.
  • Linking each feature switch to the experiment note or the Jira ticket it refers to.
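To make those conventions concrete, here is a hypothetical TypeScript sketch of what a switch definition could carry. The field names and the team prefix are made up for illustration; this is not Prezi’s actual feature-switch schema.

// Illustrative shape of a feature-switch definition; the field names and
// the 'gm' team prefix are invented for this example.
interface FeatureSwitch {
  name: string;     // team prefix, plus a variant prefix for variant switches
  owner: string;    // clear ownership: a unique team email address
  link: string;     // the experiment note or Jira ticket it refers to
  enabled: boolean;
}

const variantSwitch: FeatureSwitch = {
  name: 'gm--variant--amazing-feature-variant-a',
  owner: 'gm-team@example.com',
  link: 'https://jira.example.com/browse/GM-1234',
  enabled: false,
};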

This varies from organisation to organisation, but in Prezi, it’s mostly the software engineers who add, configure and clean up feature switches. We opted for this approach as it keeps the control of software integrity in engineering’s hands. We don’t have to worry about product owners inadvertently breaking regression tests by turning switches on and off at the wrong time.

Releasing an experiment

While releasing an experiment ultimately comes down to just flipping a switch — a feature switch, that is — there’s a lot more to it, and how much exactly can vary from experiment to experiment. Some are a lot more involved than others. As I am writing this, I am working on an A/B test that involves three frontend bundles (think apps) and four different services. Even if you’re experienced and QA did a fantastic job making sure we haven’t broken anything, there is still a myriad of things that can fall through the cracks.

To make sure releases go as smoothly as possible, we adopted an already standard practice from aviation and medicine — a checklist.

Surgeons use Surgical Safety Checklists, and pilots rely on Pre-flight Checklists to ensure the best outcomes. We call ours a release document, but it’s really a checklist, as clearly stated at the head of each document:

This document is meant to be used as a checklist for the person who’s driving the release to be able to do it in a calm, collected, professional way. Also meant to act as a document for others, so when troubleshooting is needed, all the information about what was happening during a release is recorded. — Prezi internal release document

All such documents are signed off by at least one — but ideally two — senior or lead engineers on the team.

To some, this might seem excessive, and at times it really is, but in weighing the costs and benefits, as a team we concluded this approach gives us enough value and confidence to stick to it. Just to illustrate some of the items on the checklist, here’s what we look for:

  • Have the relevant senior/lead engineers signed off on the plan?
  • What components are meant to be deployed and have they deployed successfully?
  • Have all relevant teams been notified about our intent to release the experiment?
  • What’s the feature switch configuration?
  • Is the testing scenario working on production as expected?
  • Is the A/B test distribution as expected on OpenSearch?
  • Any unexpected spikes in Grafana?
  • Are there any new relevant Sentry error logs?
  • What action(s) to take in case of needing to revert?
  • If all of the above is OK, notify internal stakeholders of the successful release.

It’s a cross your “T”s and dot your “I”s kind of exercise, but out of it we get a log we can reference later and the assurance that anything that could have been prevented, has been prevented because, you know… Murphy’s Law. 😉

Having released, however, doesn’t mean we’re done. Far from it. There’s cleanup, and it’s such an important part of what our team does that I felt it deserved its own section, so without further ado…

Cleaning up

I hate doing the dishes, so by week’s end there’s a literal pile of them waiting to be washed. Now, remember those 300 feature switches? That’s precisely the pile we desperately want to avoid, because feature switches, as useful as they are, quickly pollute the code to the point where it becomes unmaintainable, which would result in us losing more and more velocity over time. As a team, you can easily grind to a screeching halt if code is not maintained, and as an experiments team, we’re particularly prone to this happening if we’re not vigilant.

One way we’re working on preventing such a situation is by automatically creating cleanup tickets for each experiment. Jira isn’t so bad after all, aye? 😄 You see, once an experiment goes live, it will stay live for at least a couple of weeks. Gathering useful enough data to make pragmatic product decisions doesn’t happen instantly, so usually a few weeks after the A/B test release a decision gets made. Either we stick to what we had before — aka we keep the control variant — or we keep one of the other variants. Often it’s just one, but there are times when an A/B test has as many as four variants in total. Let me pseudocode an example:

// A hypothetical React component gating the variants behind feature
// switches; isActive stands in for the switch-checking helper.
function AmazingFeature() {
  if (isActive('amazing-feature-variant-a')) {
    return <ABTestComponentVariantA />;
  } else if (isActive('amazing-feature-variant-b')) {
    return <ABTestComponentVariantB />;
  } else if (isActive('amazing-feature-variant-c')) {
    return <ABTestComponentVariantC />;
  }
  return <ControlVariant />;
}

Regardless of which one we keep, three of those have to go. You can imagine, of course, that oftentimes an A/B test is far more involved than just showing a component or not, so cleanup can become quite an undertaking: you want to make sure you understand the variants that have been added, their relationship with the rest of the codebase and the overall user flows, so that cleaning up doesn’t result in collateral damage. This typically means editing tests as well.
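To make the cleanup concrete, here is what the earlier sketch might collapse to if, say, variant B won; the component names are the same illustrative ones as above.

// The same hypothetical component after cleanup: the losing variants and
// all the feature switches are gone, and variant B is the new control.
function AmazingFeature() {
  return <ABTestComponentVariantB />;
}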

You might wonder: in the case of a lost A/B test, where we end up sticking to control — to what we had before — can’t we just revert the original PR? The answer is maybe, perhaps partially, or not at all, for the following reasons:

  • You might be able to revert if the initial change was very clean and no other changes have been made to those files since. In a high-traffic codebase, though, that’s quite unlikely.
  • You might only be able to do a partial revert if some of the changes happened in a low-traffic codebase, while others happened in higher-traffic ones. The A/B test I am working on right now touches several repositories. I could imagine one or two of those seeing light enough traffic that I could just revert, but the rest would require a more involved approach.
  • If you’re only dealing with high-traffic codebases, you simply don’t have this option. You will also find cases where you added code for one of the variants that’s actually going to be useful for future work: maybe you wrote a nice utility function or refactored some code as part of the A/B test to make your life easier. You surely don’t want to revert that.

When all is clean and done

I won’t gaslight you into thinking we don’t deal with technical debt, awkward tech stacks, or breaking pipelines like every other team and engineering organisation out there. We do, and some of our challenges aren’t even new to many developers out there. It’s more like a unique flavour of what other teams deal with daily, and it’s unique enough that we found ourselves having to fine-tune how we do things, improve our processes, and continuously refine and shape ourselves as engineers into individuals reflecting the previously illustrated skills (mindset) table.

This is what has worked for us. This is what gets things done. For now. Just like we experiment with features, we experiment with ourselves as individuals and as a team. Sometimes that means we succeed, other times it means we learn and move on, or we learn to move on. It’s a journey, and it requires stamina, but ultimately, it’s well-worth the effort. So, yes, A/B tests for the win! 🎉


