One of the bigger challenges a software engineering organisation tends to face at one point or another in its lifetime is technical debt that simply cannot be “paid back”. Even with the best of intentions, it does happen, and it can happen for a myriad of reasons, one of them being a stack change over time or a certain language or framework’s fall from grace over the space of a decade or two. Add to that some inevitable brain drain, and you have yourself a migration trifecta.
Over the last few years, an ongoing topic of conversation between engineering teams was our hefty suite of Cucumber regression (E2E) tests written in Ruby. As the years have gone by, Ruby has slowly become the abandoned child of our stack. There was a lot of appetite for it initially, and a fairly widespread skillset across the teams. The language was popular, Cucumber was popular, so writing tests in Ruby was also popular. Until it wasn’t. By 2021, whenever our Cucumber tests came up in conversation, you could feel the dread setting in. Everyone wanted to get rid of them, but nobody had the time, will, or energy to do it. After all, we were talking about roughly 200 tests.
By the end of 2023 we had virtually no Ruby skills left in the company, be that on the infrastructure or the development side.
That matters, because regression tests don’t just run in a vacuum or on local machines. Writing and updating them is only half the equation. The other half is an entire infrastructure that enables those tests to run as part of your CI pipelines. At this point, it wasn’t just developers who wanted to (and I quote word for word) “kill it with fire”. Our developer experience team (DX), who were tasked with maintaining the Ruby infrastructure, were also getting exhausted by its costly and unsustainable maintenance, never mind the risk of ending up in a situation where some dependencies would simply not be supported at all anymore, blocking the pipelines and thus critical production releases of our products. I mean, just look at these gems, and I say that both literally and figuratively:
ruby 2.5: release: 2017-12-25, EOL: 2021-04-05 (latest version: 3.3.6)
google chrome 75: release: 2019-06-04 (latest version: 131)
bundler gem v1.17.3: release: 2018-12-27 (latest version: 2.5.23)
cucumber 3.1: release: 2017-11-28
As one of my DX team-mates aptly put it, it was a time bomb ready to blow at any moment. The last time I heard that, I had to migrate an entire frontend from Angular 1 to React while also moving a monolith to microfrontends.
But I’ll be honest, I also tend to be intrigued by challenges that keep not getting solved for a long time. Perhaps it’s a form of self-validation or just “red energy” as one of my therapist friends calls it.
If you ever used anger to fuel positive change, you used red energy.
By spring of 2024 it was decided: I was going to make it my personal goal for the year to migrate all the Ruby Cucumber tests to our Java-based E2E framework, once and for all. I was hell-bent on doing whatever was necessary to get it done. Unbeknownst to me, Turu, a colleague of mine from the QA team, had a very similar energy fueling a very similar goal. I know that 9 times out of 10 the word “synergy” is used completely unnecessarily in conversations, and we’re all tired of hearing it, but this time the synergy was real. I was going to need the QA team’s support to some extent anyway, but seeing our goals intersect (love the boardroom lingo, aye? 🙂) was a massive relief, as it meant we were going to be able to share the load more evenly and accomplish what was now our collective goal faster. Believe it or not, sometimes throwing more people at the problem does help. As much as I love Fred Brooks’ timeless software engineering classic, it doesn’t always apply.
A few words on strategy
In short? Let’s call it the “80 days around the world” strategy. I could say we time-boxed it, but that sounds boring, and tying our success somehow to Jules Verne sounds more fun. Regardless of what you call it, that aspect was crucial to getting this migration done, especially in hindsight, and hindsight is always 20/20.
I learnt this from doing a lot of proof-of-concept projects and hackathons. Creating an unmovable constraint (designers know this first-hand) inspires people. Creative ideas surface, people suddenly become more dynamic and adaptable, and they start focusing on what truly matters: the outcome by a certain date. In this case, we really did give ourselves around 80 days with a singular goal: migrate everything.
Migrate everything in 80 days. How? Doesn’t matter. Get creative. Stay pragmatic. Get. It. Done.
Anyone who works in software development knows that prioritisation is a tricky business. A lot hinges on it. In this case, everything did. I ran all the Cucumber tests locally, and quickly realised we would have to be smart about what we migrated, when and why, to make sure we stayed efficient:
- I reached out to teams to find out if they had any redundant or deprecated tests. Some did, so I marked them for deletion.
- I looked at the currently passing tests and created the first batch to migrate. These got priority because all of them were running against live software used by millions of customers. If, for whatever reason, we suddenly ran out of time, we’d at least have the most important tests migrated.
- Then I created a second batch, while my colleagues from QA already began giving a helping hand in migrating them to our own test automation framework (TAF). This second batch was all the flaky tests, the ones failing for whatever reason or the disabled ones.
- Finally, there was a last set of tests that covered some of our A/B tests. Initially, I almost made the mistake of starting with these, but then I realised that by the time we were done with the migration, most of these A/B tests would have already concluded. That turned out to be true, and out of 20 or so, we only had to write tests for 3.
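The triage above could be sketched roughly like this in Java. It is only an illustration under assumptions: the `LegacyTest` record, the `Status` and `Batch` enums, the test names, and the rule that an A/B test always goes last regardless of its status are all hypothetical, not taken from our actual tooling.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class MigrationPlanner {

    /** Hypothetical states a legacy Cucumber test could be in when triaged. */
    enum Status { PASSING, FLAKY, FAILING, DISABLED }

    /** The buckets from the list above, declared in the order they were worked on. */
    enum Batch { DELETE, FIRST_PASSING, SECOND_UNSTABLE, LAST_AB_TESTS }

    /** Hypothetical summary of what is known about one legacy test. */
    record LegacyTest(String name, Status status, boolean redundant, boolean coversAbTest) {}

    /** Assigns a single test to a batch; redundancy wins, then A/B coverage, then status. */
    static Batch batchFor(LegacyTest test) {
        if (test.redundant()) return Batch.DELETE;
        if (test.coversAbTest()) return Batch.LAST_AB_TESTS; // assumption: A/B tests always wait
        if (test.status() == Status.PASSING) return Batch.FIRST_PASSING;
        return Batch.SECOND_UNSTABLE; // flaky, failing, or disabled
    }

    /** Groups test names by batch, keeping every batch present even when it is empty. */
    static Map<Batch, List<String>> plan(List<LegacyTest> tests) {
        Map<Batch, List<String>> plan = new EnumMap<>(Batch.class);
        for (Batch batch : Batch.values()) plan.put(batch, new ArrayList<>());
        for (LegacyTest test : tests) plan.get(batchFor(test)).add(test.name());
        return plan;
    }

    public static void main(String[] args) {
        // All test names below are made up for illustration.
        List<LegacyTest> tests = List.of(
            new LegacyTest("signup.feature", Status.PASSING, false, false),
            new LegacyTest("payments.feature", Status.FLAKY, false, false),
            new LegacyTest("old_banner.feature", Status.DISABLED, true, false),
            new LegacyTest("checkout_variant_b.feature", Status.PASSING, false, true));
        plan(tests).forEach((batch, names) -> System.out.println(batch + " -> " + names));
    }
}
```

The point of encoding the rules, even informally, is that the order of the checks makes the priorities explicit: a redundant test is never migrated, and a passing A/B test still waits until the end.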
Once prioritisation was done, the QA team (part-time) and I (full-time) got working on the implementation. Test after test, one by one, day after day, we could see the progress. We used a traffic-light system: tests that we had migrated we marked with green 🟢, tests we were working on we marked with amber 🟠, and tests we found did not need migrating we marked with red ❌. At all times, everyone involved knew who was working on what test. I decided to waste as little time with Jira tickets as possible, so we did most of the tracking in a Confluence doc.
Were we ruthless with our time-saving measures? Perhaps. But did we deliver the work on time? You bet!
Once all the tests were migrated, QA did a final review to make sure we had tagged everything correctly and hadn’t missed any important test cases. As an output, we created a log table that showed which Cucumber test ended up in which TAF test. Literally within days of migrating, we already had engineers making use of this log, as they now had to find the old Cucumber test cases in their new home.
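For readers who prefer code to tables, here is a minimal Java sketch of what such a log boils down to: one row per Cucumber test, with its traffic-light state and, once migrated, its new home. Our real log lived in a Confluence doc, so the `MigrationLog` class, its fields, and the sample names are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Optional;

class MigrationLog {

    /** The three traffic-light states used for tracking (green, amber, red). */
    enum Light { MIGRATED, IN_PROGRESS, NOT_NEEDED }

    /** One row of the log; tafTest stays null until the test has a new home. */
    record Entry(String cucumberTest, String owner, Light light, String tafTest) {}

    private final List<Entry> entries;

    MigrationLog(List<Entry> entries) {
        this.entries = List.copyOf(entries);
    }

    /** Answers "where did this Cucumber test end up?"; empty if it has no new home. */
    Optional<String> newHomeOf(String cucumberTest) {
        return entries.stream()
            .filter(e -> e.cucumberTest().equals(cucumberTest))
            .filter(e -> e.light() == Light.MIGRATED && e.tafTest() != null)
            .map(Entry::tafTest)
            .findFirst();
    }

    /** Counts rows in a given state, which is all a progress report needs. */
    long count(Light light) {
        return entries.stream().filter(e -> e.light() == light).count();
    }

    public static void main(String[] args) {
        // Hypothetical rows; the names are not from the real suite.
        MigrationLog log = new MigrationLog(List.of(
            new Entry("signup.feature", "QA", Light.MIGRATED, "SignupFlowTest"),
            new Entry("payments.feature", "Dev", Light.IN_PROGRESS, null),
            new Entry("old_banner.feature", "Dev", Light.NOT_NEEDED, null)));

        System.out.println(log.newHomeOf("signup.feature").orElse("not migrated"));
        System.out.println("Migrated so far: " + log.count(Light.MIGRATED));
    }
}
```

Whatever form it takes, the lookup from old test to new test is the part engineers reached for first, so it is worth keeping that mapping explicit rather than buried in commit messages.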
The final step in the strategy was setting up the CI appropriately. We wanted to make sure these tests were parallelised, but in doing so, we had to keep infrastructure cost in mind. Our Ruby tests, while a pain in the neck in every other way, used a fairly low amount of resources, while the Java tests were a tad more resource-hungry, but DX figured out a good resource-to-test ratio to keep costs in check. With that in place, I had the honour of pressing the archive button on the repository and announcing to the entire company that we had finally killed all our Cucumber tests.
What ultimately enabled a successful migration
Looking back, trying to run a retrospective in my head of what went well, and why we finally managed to pull this migration off, there are a few things that come to mind, and some of these I have come to consider key to any successful project going forward.
We had a common goal. It cannot be overstated just how important it is for everyone to row in the same direction. It empowers those doing the work to focus on it and do it well. So, the support of both my team and the QA team was crucial. Turu, our senior automated QA specialist, had this migration as a personal third-quarter goal just as much as I did, so we were both heavily invested in getting the work done successfully.
Zero wasted time. Apart from a few initial meetings that QA, my team and I had about what we wanted to achieve and some historical context, the only meetings we had were a weekly 1-hour sync between Turu and myself. That’s roughly a day’s worth of meetings over a 10-week project. That’s not to say that meetings are bad, but every so often they cost the project, and we couldn’t afford that.
Keeping the goal in mind. And the goal was clear: migrate all the tests as effectively as possible within the time we had. At times, that meant merging several test scenarios into one, or moving a test into another existing test as a scenario rather than keeping it as a standalone test.
For each test, we did whatever made most sense instead of sticking to a 1:1 carbon-copy approach.
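To make the idea of folding several standalone scenarios into one test a little more concrete, here is a small data-driven sketch in plain Java. Everything in it is hypothetical: the `Scenario` record, the toy email check standing in for the system under test, and the scenario names. A real E2E test would drive the browser through page objects instead of calling a predicate.

```java
import java.util.List;
import java.util.function.Predicate;

class MergedScenarios {

    /** One former standalone scenario, reduced to its input and expected outcome. */
    record Scenario(String name, String input, boolean expectedValid) {}

    /** Toy stand-in for the behaviour under test (an assumption, not real validation logic). */
    static final Predicate<String> EMAIL_LOOKS_VALID =
        email -> email.contains("@") && !email.startsWith("@") && !email.endsWith("@");

    /** Runs every scenario through the same check and returns the names of those that failed. */
    static List<String> run(List<Scenario> scenarios, Predicate<String> systemUnderTest) {
        return scenarios.stream()
            .filter(s -> systemUnderTest.test(s.input()) != s.expectedValid())
            .map(Scenario::name)
            .toList();
    }

    public static void main(String[] args) {
        // Three scenarios that might once have been three separate tests.
        List<Scenario> scenarios = List.of(
            new Scenario("accepts a regular address", "user@example.com", true),
            new Scenario("rejects a missing @", "user.example.com", false),
            new Scenario("rejects a trailing @", "user@", false));

        List<String> failures = run(scenarios, EMAIL_LOOKS_VALID);
        System.out.println(failures.isEmpty() ? "all scenarios passed" : "failed: " + failures);
    }
}
```

The trade-off is the usual one: shared setup runs once and the suite gets shorter, at the cost of a single test now covering several behaviours, so the scenario names need to stay descriptive enough to pinpoint a failure.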
Translated to tangible business outcomes
But that’s the engineering (including QA) success story, and as I mentioned in “How to Sell Engineering Needs To Product Managers”, we as engineers owe it to ourselves and the business to translate engineering needs to business needs. I’d be the first to shy away from work that makes no business sense. While I’m no CFO, nor do I intend to ever become one, any effort that doesn’t make any business sense doesn’t sit well with me. That said, no project will ever be done “because it sits well with Attila”, so let me translate this particular engineering need to a business need.
When you have tests written in a language that nobody knows or cares to learn, those tests will be either poorly written or not written at all. This increases the chance of customer-blocking bugs that could go unnoticed until customer support is alerted, at which point it’s already too late and costly. So, a more robust product results in fewer customer support calls, aka money saved.
The other downside of a severely outdated test infrastructure is maintenance. Ideally, a software company wants to spend as little money as possible on maintenance. Features or A/B tests are more interesting, and they make more money. Maintenance that costs 10 times more than it should is a waste of money, brings down morale, and might even make it harder to hire new engineers. There’s only so much money in an engineering pot, and we much preferred spending it on new tools or perhaps even additional headcount rather than on maintaining a severely aged infrastructure.
Reducing complexity increases velocity. It really does come down to that.
As our DX team repeatedly highlighted, we were sitting on a time bomb. Waking up every day to the very real possibility that one of our Ruby-Cucumber dependencies gets nixed because of its age is not a great place to be in when the tests for the core functionality of your product (such as signup, payments, and analytics) depend on it. Such a situation would have caused severe disruption for Product, wasted A/B testing runtime, increased manual QA and customer support costs for weeks if not months, and a potential loss of customers and revenue. This is unacceptable, especially when you are on a growth trajectory.
Finally, this migration was also a massive enabler. Within weeks of completion, having all of our tests in one place, we were already able to identify areas where we could make our tests more efficient, spend less time in the CI, and be more confident in what is being tested, aka have a real and meaningful understanding of our coverage. This can only mean one thing: better velocity in 2025 and beyond, and if there is one thing that Product Managers love hearing, it is higher throughput. 😉
Closing thoughts on migrations, AI, and machine learning
As QA and I were wrapping the migration up, I couldn’t help but reach certain tangential conclusions that, I feel, will be food for thought for many of us software engineers and quality engineers in the coming year(s).
While completing a migration like this is an exciting opportunity for some of us, myself included, it’s not something most engineers would volunteer for, and for good reason. Migrations can be a can of worms: you’re touching a lot of legacy code you’ve never seen before and have no historical context on. You’re likely going in a little blind.
Then there’s also the monotonous aspect of the job. Especially when it comes to writing E2E tests, once you have everything in the framework available to you, writing the tests themselves can feel like more of the same, which brings me to my next point and an interesting realisation.
At one point, by pure luck, I downloaded the latest version of IntelliJ, which features Full Line code completion. Within minutes, I started seeing the IDE suggest my next line of code, be that a new page object or an assertion, and what do you know? It was often right! Often enough that I saved 2–3 days’ worth of time over the course of the migration. This was machine learning in action, under human supervision, which made me think…
If there is one job that I’d like generative AI to do in the future, it’s maintenance and migrations.
It would have been great to feed a model our Cucumber and TAF tests, let it figure out what was missing, migrate those tests, run them and even deploy them with minimal human supervision. Now that’s something I could really get behind, and who knows, with another healthy dose of red energy it might soon become reality. 😉