
Our guest in today’s episode is Matt Emery, Product Manager & Consultant at Turbine Games Consulting.

Matt has extensive experience with over 750 A/B tests across gaming genres – and across retention, monetization, and game development. In this episode, he delves into the key takeaways from these tests and how to audit and improve mobile games.





About Matthew: LinkedIn | Turbine Games

ABOUT ROCKETSHIP HQ: Website | LinkedIn  | Twitter | YouTube


KEY HIGHLIGHTS

📍Product Market Fit (PMF) is vital; optimization won’t fix fundamental PMF issues.

📈 Analyze CPI vs. LTV to identify growth opportunities.

🔑 Defining PMF for games involves competitive retention and CPI thresholds.

🗂️ Prioritize improving retention over monetization for struggling games.

⚖️ Balance monetization to enhance retention without compromising user experience.

📌 Casual games are easier to optimize due to lower complexity.

FULL TRANSCRIPT BELOW

Shamanth:

I’m excited to welcome Matt Emery to the Mobile UA show. I’m thrilled to have you because you’ve seen hundreds of A/B tests and I’ve always enjoyed and loved reading some of the stuff you talk about and some of the tests you guys have run and all the insights around retention, monetization and game development. So for several reasons, I’m thrilled to have you on the show today. 

Matt: Thanks for having me. 

Shamanth: In today’s episode, we’re going to talk about some of the learnings you’ve had from the hundreds of tests you’ve run, and how you audit and review a game to identify opportunities for improvement.

What are some of the first things you look for when you’re auditing a game? Where do you begin? 

Matt: 

Our general approach is to scan the entire product stack and marketing stack for low-hanging fruit, essentially. That said, that approach is not always appropriate or what’s best for the client or the game. It depends on the phase of the product, and whether a game is pre- or post-product-market-fit validation.

We’ve worked with teams who behave as if they’re in post-product-market-fit growth mode, but for various reasons, they never credibly demonstrated product-market fit using any objective, measurable standard.

And so one of the learnings that we’ve had, after running over 750 split tests and optimizing many games, is that if a game doesn’t have product-market fit, even doubling, tripling, or 9Xing revenue is not likely to save it from a fundamental product-market fit problem.

To put it more bluntly, without PMF, it’s unlikely that any level of optimization will be enough to overcome that structural disadvantage.

But when we’re looking at a game that does have product market fit, or that is already in growth mode, the first thing we want to do is to look at the CPI versus LTV equation and understand the shape of what we’re working with. We want to understand whether they’re currently running profitable UA, whether they’re close to getting over the hump, and what scale they’re currently operating at. That helps us prioritize where the lever to unlock growth is for that product.
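To make that CPI-versus-LTV framing concrete, here is a minimal sketch in Python; the CPI and LTV figures are hypothetical placeholders rather than numbers from the episode.

```python
# Minimal sketch of the CPI-vs-LTV check described above.
# The cpi and ltv values are illustrative placeholders, not benchmarks.

def ua_margin(cpi: float, ltv: float) -> float:
    """Per-install margin: positive means UA is profitable at current scale."""
    return ltv - cpi

def lifetime_roas(cpi: float, ltv: float) -> float:
    """Lifetime return on ad spend as a ratio of predicted revenue to spend."""
    return ltv / cpi

if __name__ == "__main__":
    cpi = 2.40  # blended cost per install (hypothetical)
    ltv = 2.10  # predicted lifetime value per install (hypothetical)
    print(f"Margin per install: ${ua_margin(cpi, ltv):+.2f}")
    print(f"Lifetime ROAS: {lifetime_roas(cpi, ltv):.0%}")
    # A margin of -$0.30 (ROAS ~88%) means the game is close to the hump:
    # the growth lever is either lowering CPI (icons, creatives) or lifting
    # LTV (retention, monetization) by roughly 15% to break even.
```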

Shamanth: 

What I found surprising in your answer was that games and apps could grow even if they don’t have product market fit, but that may not be sustainable. So if I might segue based on that, what are some of the objective measurable criteria based on which you would decide a game has product market fit?

 I also ask that because I would imagine it’s very different from what it would be for a consumer product, which has some form of consumer utility, whereas for games it would be different. So how would you define product market fit for a game? 

Matt: 

This is something we’re currently solving.

I think part of the problem is that in the industry, especially among smaller publishers and developers, there’s not an objective standard for what product market fit looks like. Our current approach is that, in pre-release, before you have a build, the things you can look at are creative performance and survey results across certain thresholds.

When you’re in soft launch, that is when most teams converge around thinking about product-market fit in the form of retention and CPI. And that’s correct. In soft launch, getting your CPI to whatever objective threshold you view as a green light is half of the equation.

And the other half is getting retention past a benchmark threshold with your target audience. Hopefully, you haven’t launched worldwide and entered growth mode without first having confidence that your CPI and retention are competitive, although many apps do end up in that place for one reason or another.
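As a rough illustration of treating retention and CPI as a product-market-fit gate before going worldwide, here is a hedged sketch; the genre label and every threshold value are invented placeholders, not benchmarks cited in the episode.

```python
# Hypothetical PMF "gate": both retention and CPI must clear the benchmarks
# of the competitive set. All numbers below are made-up placeholders.

GENRE_BENCHMARKS = {
    "casual_puzzle": {"d1": 0.35, "d7": 0.12, "d30": 0.05, "max_cpi": 2.00},
}

def passes_pmf_gate(genre: str, d1: float, d7: float, d30: float, cpi: float) -> bool:
    """Return True only if retention and CPI are both competitive for the genre."""
    b = GENRE_BENCHMARKS[genre]
    retention_ok = d1 >= b["d1"] and d7 >= b["d7"] and d30 >= b["d30"]
    cpi_ok = cpi <= b["max_cpi"]
    return retention_ok and cpi_ok

# D1 looks fine, but D7/D30 lag the peer group, so this build fails the gate.
print(passes_pmf_gate("casual_puzzle", d1=0.38, d7=0.10, d30=0.04, cpi=1.80))  # False
```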

Shamanth: 

I hear you, and I understand that retention and monetization are the building blocks of the LTV that eventually dictates how scalable and profitable the game becomes.

To switch gears a bit, you talked about pre-product-market fit. Are you running any tests at that point? And by extension, how does the stage the product is at affect the tests that you initially run?

Matt: 

The simple answer is that if we’re not hitting our retention targets, that’s where we focus. Often, there will be a mandate to focus on monetization.

But we can pretty quickly ascertain whether the problem is a retention problem or a product-market fit problem. We may attack some low-hanging fruit on the monetization front just to check the boxes, but really the problem is retention, and that’s where the focus needs to be.

Retention is a lot more difficult to fix; again, it’s built on a solid economy model. Improving monetization is quite straightforward, and those are the experiments that reproduce the most frequently. So we’re much more confident in making that happen than we are in trying to improve retention for a product that is not clicking. That’s the most important thing. We define product-market fit as having competitive retention and CPI. If you haven’t crossed that threshold, then we’re just laser-focused on CPI and retention.

Shamanth: 

If I might push back on some aspects of what you said and maybe I’m missing something very fundamental here, but would it not be possible to improve monetization to the point where your LTV to CAC just becomes sustainable and helps with your retention? 

The most obvious example I can think of is having super aggressive paywalls very early in the user flow. It’s a terrible user experience and terrible for retention, but you might just monetize to a point where your LTV to CAC is sustainable and profitable. So why not go down that path, especially if monetization fixes are also easier?

Matt: 

It is possible.

I’ve never seen us monetize our way out of a retention problem. I’ve worked on over a hundred games. I know that there are games that certainly appear to be low retention experiences. They look very punishing and they look like they’re designed to get you to watch 20 ads on day zero and then churn out. But I think the key is just what is your competitive set. The goal is to have competitive retention relative to your peers.

It’s such a competitive market that if you try to go out of the gate with retention that is underperforming relative to your peer group, or the group of products that are chasing your same target audience, then you’re just walking out with a limp.

You can try to optimize your way out of that, but it’s very unlikely to work and, I would argue, not the most efficient path.

Shamanth: 

And staying with monetization: you talked about product-market fit, so how do you differentiate a lack of product-market fit from a game just not being sufficiently monetized? I would imagine both of these would manifest in product metrics like ARPDAU or LTV. And by extension, how do you differentiate not having product-market fit from a retention problem, that is, bad retention versus no product-market fit?

Matt: 

It’s a great question. You can go upstream. Many teams don’t do this, but you can go upstream to non-retention signals of product-market fit, like surveys. But usually, when a team has retention that isn’t where they want it to be, they won’t just give up. They’ll optimize for retention. They’ll play with difficulty tuning. They’ll pull the obvious levers to try to improve early retention and make sure that it’s not just some operational problem or some game-specific problem that is causing their retention to be lower than it could be. So generally, they’ll bang on that egg and see if they can crack it. But if they can’t, I would argue that it’s probably a product-market fit problem.

It’s probably that the users they are targeting with their experience are not particularly interested in that experience. 

Shamanth: 

You’ve worked across multiple genres and done hundreds of tests. In your experience, what genres are easier to optimize or improve through testing than others?

Matt: 

Hyper-casual and casual are the easiest to optimize if we’re in a growth-optimization mode. Those are the easiest for a variety of reasons. One of the most difficult and sometimes annoying issues with mid-core and core titles is that the communities are extremely active; they’re very good at detecting experiments and detecting differences in experience, and they can be conspiratorial at times. When we’re in the casual and hyper-casual space, we’re usually less likely to encounter that.

There are casual games where people pay a lot of attention, but in general, on average, it’s safer to do more invasive experimentation on hyper-casual and casual games. The scope of features in casual and hyper-casual tends to be lower than in mid-core, so it’s just faster and easier to build and make client changes to test.

The UA costs to power experiments are lower and the TAMs are bigger, so you just have greater user flow coming into experiments. And then finally, in casual games, the mechanics and solutions are pretty portable across different genres within casual. I’m sure you’ve seen battle passes and daily rewards; these types of features tend to work equally well in a match-3 game or a merge game or what have you. It’s very fun and easy to play within the realm of casual because the learnings can be shared so readily, and you can acquire them pretty quickly.

Shamanth: 

I would imagine a lot of the formats and features tend to be relatively standardized. So I imagine a lot of the learnings from one kind of casual game can, with some modification, be imported over to different casual games and hyper-casual games.

Matt: 

Yes.

Shamanth: 

With that said, what are some examples of commonalities that could be ported over as learnings from one genre to another? Or what are some patterns that you’ve seen across different genres?

Matt:

One that immediately comes to mind: I think every game has this now, but there was a period where piggy banks were being added to every casual and casino title under the sun, and people were playing with the tuning and with whether to treat them as events or permanent features. Gradually, those have spread throughout the casual side of free-to-play. That’s a pretty large feature.

So it’s on the large end of the types of initiatives that we run with clients. But some of the low-hanging fruit that we commonly lean on if we’re chasing CPI is icon testing. It’s pretty profound: the simplest art asset you can imagine can produce a double-digit lift in ROAS. It’s generally lower cost and higher impact than screenshots or store videos. It’s a really fun place to play to get quick wins.

The other obvious pillar asset is ad creative.

As I’m sure you know, ad creatives can easily have a double-digit impact on ROAS, and in the grand scheme of things, they’re easy to produce and don’t require any client work. So that’s an obvious place to play.

On the retention front, as we mentioned briefly before, difficulty tuning and progression-speed tuning are really good places to play. In many cases, just turning a knob, changing a numerical value, can have double-digit impacts on LTV via retention.

And then on the monetization front, IAP and ad optimization is one of the places where we focus the most heavily and where we have run many experiments.

Shamanth: 

Some of these can be very quick wins. It’s funny that you mentioned ad creative: somebody I was speaking to thought their game was broken. They thought their retention was terrible and their monetization was terrible. It turned out they had one misleading, fake creative that their ad person was running; they hadn’t questioned the ad person, and that person hadn’t questioned it either. The moment they took that creative out of rotation, their retention and monetization metrics all looked better.

Certainly, a good creative can move your numbers up, but a bad one can make it look like your game’s just not working. So I think that’s one of the pitfalls to be very careful about.

Matt:

Yes, that’s part of what makes this so hard. You have to make sure all of your metrics are based on the right users. Whatever your theory of your target audience is, that’s what you use for product-market fit testing, and that’s what you use when you’re optimizing. And if you’re simultaneously shifting the sands of UA creatives, then your target audience changes underneath you.

It’s hard to know that that’s happening. And then when it happens, it can be very disruptive. That’s the strange nature of the business that we’re in. 

Shamanth: 

I think it becomes much harder when there are larger teams. I was at a public company where this happened: one team was running incentivized traffic to a certain country and nobody else knew about it. Everybody hit the panic button thinking the game was terrible in that particular country, until we all found out it was just incentivized traffic. It’s very important to be careful about these interdependencies.

To switch gears a bit: with monetization tests, I would imagine it’s very important to get the economy balanced, which is to say in-game currency and in-game goods need to be valued just right. You can’t give too much away, and you can’t give too little away. If you give too little away, people churn. If you give too much away, you’re destroying the economy.

How do you determine the right way to keep the economy balanced and the different in-game currencies valued right? How do you think about that?

Matt: 

That’s a good question. I have somewhat of a cop-out answer here, which is that if we’re starting with a new economy, we don’t have a perspective yet on tuning. I would generally recommend a pretty blunt 80-20 approach, which is to copy the tuning of your most popular competitor, the game that your target audience is most used to. Try to calibrate your game to what your audience is comfortable with and expects, and then, using that as your baseline, experiment from there.

We would approach it the same way when it comes to balancing free output from the economy and difficulty, and then experiment from that baseline.

It’s very common to look at wallets to see if people are hoarding currency, to see if people are not using the currency that you’re giving them. That can reveal a number of different problems with the economy, but in general, I would recommend not reinventing the wheel on many fronts, including this one.
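As one way to picture that wallet check, here is a small pandas sketch that flags currency hoarding; the event table, column names, and the 25% threshold are assumptions made for the example, not a real client’s schema.

```python
import pandas as pd

# Compare currency earned vs. spent per user to spot hoarders: players who
# sit on currency either don't find the sinks attractive or are being given
# more than the prices demand. Sample data is invented for illustration.

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "type":    ["earn", "earn", "spend", "earn", "spend", "earn"],
    "amount":  [500, 300, 200, 400, 380, 900],
})

earned = events[events["type"] == "earn"].groupby("user_id")["amount"].sum()
spent = events[events["type"] == "spend"].groupby("user_id")["amount"].sum()

wallets = pd.DataFrame({"earned": earned, "spent": spent}).fillna(0)
wallets["spend_rate"] = wallets["spent"] / wallets["earned"]

# Users spending less than a quarter of what they earn are likely hoarding.
print(wallets[wallets["spend_rate"] < 0.25])
```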

Shamanth:

It ties into the fact that a lot of games have so many similarities in their mechanics. So if you’re building on what’s proven, you’re getting 80% of the way there. You could certainly improve and optimize based on the uniqueness of your own game, but it sounds like what’s established is always a good place to start.

Having worked on 750-plus tests, what do you still find are some of the common mistakes that developers make when it comes to running tests, improving games, improving monetization, improving retention, and the whole gamut of making a game better?

Matt:

I think probably the number one problem we see is trying to continue to invest in a game without demonstrating product-market fit, or without really treating that as a gate before focusing on growth mode: not understanding what phase you’re in and not prioritizing your activities accordingly. That’s probably the number one mistake I see.

When it comes to split testing specifically, it is not a silver bullet. It is one of many tools that we use with clients. We don’t by any means split test every initiative that we drive with clients.

We split test when appropriate and when it’s valuable to do so. One of the mistakes we see clients making when they’re first playing around with split testing is just testing initiatives that don’t need testing at all.

Testing things that have no downside. Split testing is not free: it incurs overhead costs, both to instrument and to analyze it, and opportunity costs from the other experiments you could be running. So if you’re making a quality-of-life change that’s unlikely to hurt the experience, you don’t need to split test it; unless you’re just absolutely determined to measure the impact, I would generally recommend just doing it.

The second big mistake is split testing initiatives that aren’t designed to produce large lifts, where you’re not setting out to move some measurable metric by 10% or more. It depends: if you have millions and millions of users coming in, you can detect much finer changes in metrics, but generally speaking, if you’re trying to move a metric by 3%, it’s very unlikely that a split test is going to be able to detect such a small change.

We are generally split testing initiatives where we’re trying to move some metric by a large amount. Otherwise, we would just do it or not do it, depending on whether we think there’s downside risk. The canonical example is changing a button color and expecting sales to increase.

That’s not a recommended experiment. It’s not going to move anything by 10%. Maybe it would move a number, but you wouldn’t know that from a split test unless you had millions of users in that experiment.
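To illustrate why a 3% lift is so much harder to detect than a 10% lift, here is a rough sample-size sketch based on a standard two-proportion z-test; the 35% baseline retention, alpha, and power values are illustrative assumptions, not figures from the episode.

```python
from scipy.stats import norm

# Approximate users needed per arm to detect a relative lift in a binary
# metric (e.g. D1 retention) with a two-proportion z-test.

def users_per_arm(p_baseline: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance
    z_beta = norm.ppf(power)            # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(round((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2))

print(users_per_arm(0.35, 0.10))  # ~10% relative lift: a few thousand users per arm
print(users_per_arm(0.35, 0.03))  # ~3% relative lift: tens of thousands per arm
```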

Shamanth: 

What you said also reminded me of some teams I’ve worked with, where getting statistical significance and getting a clear measure of how much better A is than B is very important to the team.

And sometimes it strikes me that that isn’t very critical, because it comes at an opportunity cost: you have to wait, and you have to spend resources to make sure you’re getting that level of statistical significance.

Matt: And you’ll never know for sure what the actual impact was. The best you can get is a probability distribution where you might say, “We’re very confident it improved by between 1% and 10%.” You’ll never know exactly. As long as the number goes up, you can just kind of celebrate and move on.
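One way to picture that probability distribution is a simple Beta-Binomial simulation of the lift between two variants; the conversion counts below are invented for illustration, and the approach is a generic Bayesian sketch rather than Turbine’s specific analysis.

```python
import numpy as np

# Posterior over the relative lift between control (A) and variant (B),
# using a uniform Beta(1, 1) prior on each arm's conversion rate.

rng = np.random.default_rng(7)

conv_a, n_a = 700, 20_000   # control conversions / users (hypothetical)
conv_b, n_b = 770, 20_000   # variant conversions / users (hypothetical)

p_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
p_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

lift = p_b / p_a - 1
lo, hi = np.percentile(lift, [5, 95])
print(f"P(variant beats control) = {(lift > 0).mean():.1%}")
print(f"90% credible interval for the lift: {lo:+.1%} to {hi:+.1%}")
# The true lift is never pinned down exactly; more users only narrow the interval.
```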

Shamanth:

100%. This has been incredibly insightful. And, like we talked about, you’ve done 750 plus tests. So you clearly can pattern match to understand what’s really moving the needle. 

This is perhaps a good place for us to wrap up this interview. But before we do that, can you tell folks how they can find out more about you and everything you do? 

Matt: Come to our website at Turbine Games or find me on LinkedIn. I’m pretty active there.
