
I Built an Autonomous Loop That Rewrites Google Play Descriptions

I built an autonomous loop that rewrites a Google Play long description, scores every version, keeps what helps, and reverts what doesn't. Here's how it works, along with the open-source code.

Kevser Imirogullari

I noticed I was doing the same thing on every app.

Open the Google Play long description. Check keyword density. Spot the primary keyword that’s diluted to nothing. Find the high-value term that isn’t mentioned once. Move a keyword out of the cluster it’s stuck in. Tighten the first 250 characters because that’s the only part most people read. Re-check density. Repeat until it stops getting better.

That’s not strategy. That’s a loop. And loops are exactly what agents are good at.

So I built one. It’s an autonomous loop that rewrites a Google Play long description, scores every version against a fixed function, keeps the changes that help, and throws out the ones that don’t. The code is public and MIT licensed: google-play-description-autoresearcher. This is the walkthrough of how it works and why I built it the way I did.

The pattern I stole

The structure comes from Andrej Karpathy’s autoresearch idea. The shape is simple: an agent runs experiments, a deterministic function scores each one, and the agent accepts or rejects based on the score. The agent’s intelligence goes into deciding what to try next. The scorer’s job is to keep the agent honest.

The reason that split matters: an agent left to “improve” a description with no scoring function will happily rewrite it into something that sounds great and ranks worse. It has no ground truth. The scorer is the ground truth. The agent is just smart search over text edits, and the scorer is the contract it can’t argue with.

Which means the hard part of this project was never the agent. It was the scorer.

The scorer is the whole game

If the fitness function is wrong, the loop optimizes confidently in the wrong direction. So most of the thinking went into score.py: a deterministic 0 to 100 composite built from nine components, seven weighted ones plus two guardrails (a length floor and repetition caps).

| Component | Weight | What it measures |
|---|---|---|
| Keyword density | 30% | Each keyword’s share of total words against a target band (2-3% primary, 1-2% secondary) |
| Keyword coverage | 15% | How many target keywords appear at least once |
| First 250 characters | 15% | NorthStar keyword presence and hook quality before “Read More” |
| Keyword distribution | 15% | Whether primary keywords appear in at least 2 of the 3 thirds |
| Formatting quality | 10% | HTML tags, bullets, paragraph length, emoji use |
| Structure | 10% | Hook, features, social proof, CTA all present |
| Policy compliance | 5% | Stuffing, prohibited superlatives, comma-list keyword spam |
| Character utilization | floor | Binary penalty below 2,000 characters |
| Repetition | built-in | Caps on per-keyword counts |
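
To make the weighting concrete, here is a minimal sketch of how a composite like this could be assembled. It is not the actual score.py: the component names, the 0.0 to 1.0 scale for each component, and the size of the floor penalty are all assumptions for illustration. Only the weights and the 2,000-character floor come from the table.

```python
from typing import Dict

# Weights from the table; the dictionary keys are illustrative names.
WEIGHTS = {
    "density": 0.30,
    "coverage": 0.15,
    "first_250": 0.15,
    "distribution": 0.15,
    "formatting": 0.10,
    "structure": 0.10,
    "policy": 0.05,
}

MIN_CHARS = 2000       # binary floor from the table
FLOOR_PENALTY = 10.0   # ASSUMED size; the article does not state the penalty


def composite_score(components: Dict[str, float], char_count: int) -> float:
    """Combine per-component scores (each 0.0 to 1.0) into a 0 to 100 composite."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    total = 100.0 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    if char_count < MIN_CHARS:
        total -= FLOOR_PENALTY  # flat deduction, not a sliding scale
    return round(max(0.0, min(100.0, total)), 2)
```

Repetition caps are left out here; in this sketch they would live inside the density component rather than carry a weight of their own.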

Density carries the most weight on purpose. On Google Play the entire long description is indexed, and keyword density is the primary on-page ranking signal. This is the part people coming from iOS get wrong: on iOS the long description isn’t indexed at all, so the instinct to write a pretty paragraph and move on is fine there and actively costs you on Android. The scorer encodes that difference. It rewards 2-3% density for primary keywords and 1-2% for secondary, and it penalizes anything above 3.5% as stuffing risk, because that’s where Google’s policy starts flagging.
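
A rough sketch of the density check, under a few assumptions: density is counted as keyword words divided by total words, and the score ramps linearly outside the target band. The bands and the 3.5% threshold are the ones above. The function names and the shape of the ramps are hypothetical, not taken from score.py.

```python
import re

# Target bands and stuffing threshold from the article, as fractions of total words.
BANDS = {"primary": (0.02, 0.03), "secondary": (0.01, 0.02)}
STUFFING_THRESHOLD = 0.035

_WORD = re.compile(r"[a-z0-9']+")


def keyword_density(text: str, keyword: str) -> float:
    """Share of the description's words taken up by occurrences of `keyword`."""
    words = _WORD.findall(text.lower())
    kw = _WORD.findall(keyword.lower())
    if not words or not kw:
        return 0.0
    n = len(kw)
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == kw)
    # ASSUMPTION: a two-word keyword counts as two words per occurrence.
    return hits * n / len(words)


def band_score(density: float, tier: str) -> float:
    """1.0 inside the target band, 0.0 above the stuffing threshold."""
    low, high = BANDS[tier]
    if density > STUFFING_THRESHOLD:
        return 0.0                      # stuffing risk
    if density < low:
        return density / low            # ASSUMED linear ramp up to the band
    if density <= high:
        return 1.0
    # ASSUMED linear decay between the top of the band and the threshold.
    return (STUFFING_THRESHOLD - density) / (STUFFING_THRESHOLD - high)
```

For example, a primary keyword used twice in a 100-word description sits at exactly 2% and gets full marks, while the same keyword at 4% scores zero.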

A few design choices in the scorer that I’d defend in a room:

  • Relevancy is a multiplier, not a label. A NorthStar keyword at the right density is worth 3x a merely relevant one. The function reads a tiered keyword list and weights accordingly, so the loop spends its effort where ranking actually matters instead of treating every keyword as equal.
  • Length is not a weighted component. Character count is a binary floor at 2,000, nothing more. A 2,500-character description with clean density beats a 4,000-character one with diluted density. I did not want the loop padding text to “use the space.”
  • The first 250 characters get their own 15%. That’s the slice visible before “Read More.” It’s both the conversion hook and a strong keyword signal, and it deserves to be scored as its own thing rather than averaged into the body.
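
Two of those choices can be sketched in a few lines. The 3x ratio between NorthStar and merely relevant keywords is from the article, as is the 250-character cutoff. The tier names, the two middle multipliers, and both function names are assumptions, and the real scorer also judges hook quality, which this sketch does not attempt.

```python
from typing import Dict, List

# Only the 3x NorthStar-to-relevant ratio is from the article.
# The tier names and the middle values are ASSUMED for illustration.
TIER_MULTIPLIER = {"northstar": 3.0, "high": 2.0, "medium": 1.5, "relevant": 1.0}


def tier_weighted_average(scores: Dict[str, float], tiers: Dict[str, str]) -> float:
    """Average per-keyword scores, weighting each keyword by its tier multiplier."""
    total_weight = sum(TIER_MULTIPLIER[tiers[kw]] for kw in scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(TIER_MULTIPLIER[tiers[kw]] * score for kw, score in scores.items())
    return weighted / total_weight


def northstar_in_hook(description: str, northstar: List[str], limit: int = 250) -> bool:
    """True if any NorthStar keyword appears in the text shown before "Read More"."""
    hook = description[:limit].lower()
    return any(kw.lower() in hook for kw in northstar)
```

With one NorthStar keyword on target and one merely relevant keyword missing entirely, the weighted average is 0.75 rather than 0.5, which is the point of the multiplier.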

The agent is explicitly told never to modify score.py. The scorer is the constitution. The agent only edits the description.

The loop itself

The mechanics are deliberately boring, because boring is what survives an unattended run.

The agent reads program.md, which is its playbook: the rules it follows, the rules it can’t break, and the order to attack problems. Then it runs this cycle:

  1. Read history. It keeps a results.tsv log of every experiment, kept or discarded, with the score and what changed. Before each new experiment it reads the whole log so it never repeats a change that already failed.
  2. Pick one targeted change. One. Not a rewrite. Fix a policy violation, or correct a density imbalance, or strengthen the first 250 characters, or weave in a missing high-value keyword. One variable at a time, so the score delta is attributable.
  3. Edit the description. Only description.txt. Natural prose, no comma-separated keyword lists, varied sentence structure. The scorer can smell spam and so can Google’s NLP.
  4. Score it.
  5. Keep or revert. If the composite improved, git commit. If it didn’t, git checkout -- description.txt and the change is gone.
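
Steps 4 and 5 can be sketched as a single function. This is an illustration of the keep-or-revert decision, not code from the repo: the function name, its parameters, and the injectable `run` argument are hypothetical, and in the actual project the agent runs these git commands itself. The git commands are the standard ones named in step 5.

```python
import subprocess
from pathlib import Path
from typing import Callable, Tuple


def keep_or_revert(
    repo: Path,
    best_score: float,
    score_fn: Callable[[str], float],
    message: str,
    run: Callable = subprocess.run,  # injectable so the logic can be tested without git
) -> Tuple[bool, float]:
    """Score the edited description, then commit it or restore the last kept version."""
    text = (repo / "description.txt").read_text(encoding="utf-8")
    new_score = score_fn(text)
    if new_score > best_score:
        # Improvement: record the experiment as a commit.
        run(["git", "add", "description.txt"], cwd=repo, check=True)
        run(["git", "commit", "-m", message], cwd=repo, check=True)
        return True, new_score
    # No improvement: discard the working-tree change.
    run(["git", "checkout", "--", "description.txt"], cwd=repo, check=True)
    return False, best_score
```

The comparison is strict on purpose: a change that leaves the score flat is reverted, so the commit history only contains edits that moved the number.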

That last step is the part I’m happiest with. Git is the undo stack. Every kept experiment is a commit, every reverted one leaves no trace in the commit history, and the description’s entire optimization history is a readable git log. No custom state management, no “best version so far” variable to corrupt. The version control system already solved that problem in 2005.

One more design decision: the playbook tells the agent to never stop and ask the human. The human might be asleep. When it runs out of obvious moves it’s instructed to re-read the keyword list for missed angles, check which keywords are one mention from target, try restructuring instead of word edits, or test whether removing content improves the density ratios. It stops on its own terms: 25 experiments, or 5 consecutive discards (a plateau), or a score of 90+.
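
The three stopping rules are simple enough to write down directly. A minimal sketch, assuming the history is available as a list of (score, kept) pairs; that data shape and the function name are mine, while the three thresholds are the ones stated above.

```python
from typing import List, Optional, Tuple

MAX_EXPERIMENTS = 25
MAX_CONSECUTIVE_DISCARDS = 5
TARGET_SCORE = 90.0


def should_stop(history: List[Tuple[float, bool]]) -> Optional[str]:
    """Return a stop reason, or None to keep going.

    `history` holds one (score, kept) pair per experiment, oldest first.
    """
    if any(kept and score >= TARGET_SCORE for score, kept in history):
        return "target score reached"
    if len(history) >= MAX_EXPERIMENTS:
        return "experiment budget reached"
    tail = history[-MAX_CONSECUTIVE_DISCARDS:]
    if len(tail) == MAX_CONSECUTIVE_DISCARDS and not any(kept for _, kept in tail):
        return "plateau"
    return None
```

Only kept experiments count toward the target, since a discarded version is no longer the description on disk.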

Watching it run

To show the loop working without exposing any real app, I built a fictional one. PlantPal, a houseplant care app, with a made-up brand, made-up competitors, and a keyword list across four relevancy tiers.

Two choices there were deliberate. First, the example is obviously fictional so nobody mistakes it for a competitor teardown or a client’s data. Second, the baseline description is intentionally mediocre. It scores 44.87 out of 100. A polished baseline would make the loop look pointless. A weak one, in the mid-40s, is also what most real unoptimized long descriptions actually look like, so the demo is honest about the starting point.

Let the loop run 20 to 25 experiments and the PlantPal description lands in the 75 to 90 range. Together, the git log and results.tsv read like a lab notebook: kept “raised primary density in section two,” discarded “added social proof line, no score change,” kept “moved tracker keyword into the first 250 characters.” You can watch it reason.

To be clear about what that number is: 44.87 to ~85 is a synthetic example chosen to demonstrate the mechanism. It is not a client result, and I’m not going to dress it up as one.

What this is, and what it isn’t

This is the part I want to be careful about, because it’s where credibility gets won or lost.

What the loop optimizes is the on-page quality of the description: keyword density, coverage, distribution, structure, policy safety. The part you fully control. A higher composite score means the description is doing its job better as a keyword surface and reads cleanly to a human.

What it does not do is move your rank on its own. Google’s actual ranking algorithm is closed and runs on signals this scorer never sees: install velocity, retention, ratings, reviews, behavioral data. A strong description is necessary, not sufficient. The scorer is a proxy for one input into ranking, not a prediction of ranking. Anyone who tells you a text optimizer alone moves the needle on store rank is selling something.

It also doesn’t replace judgment. The scorer can’t tell you whether the copy actually sells the app. The last step is always a human reading the final description before it ships. The loop gets you to a strong, policy-safe, well-distributed draft fast. It doesn’t get you to “publish” by itself.

And it’s not for every app. If you’re running an app doing more than 10,000 installs a day, I would not point this loop at your live description. High-volume listings need surgical, conservative changes made one at a time and observed over a longer timeline across several optimization rounds, because at that scale a swing in the wrong direction is expensive and slow to walk back. This is a tool for getting a weak or unoptimized description to a strong, policy-safe draft fast. It is not a tool for aggressive rewrites of a listing that’s already carrying real volume.

I’d rather state that plainly than have someone run it, not see an instant rank jump, and conclude the whole idea is broken. It isn’t broken. It’s scoped.

Why it’s open

I put it out under MIT because there’s no patent surface worth defending and the point isn’t the code, it’s the pattern. The density and placement rules in the playbook lean on public ASO research from yellowHEAD and Phiture, so it would be strange to gate work built on shared knowledge.

If you want to run it on your own app, the README has the steps. Copy the example folder, drop in your real description and a tiered keyword list, point the run command at your folder. The .gitignore already excludes per-app folders so your own data doesn’t get committed if you fork it.

The broader reason I’m sharing this: most of the repetitive craft in growth is a loop with a missing scorer. We do the same moves across accounts and apps and call it experience. Most experienced marketers already know what “good” looks like. The hard part was never the knowing. It’s that the knowing lived in our heads, where an agent can’t reach it. The scorer is how you write it down. It’s the guidebook that teaches the AI what “good” looks like, so the loop you’ve been running by hand can run without you while you go work on the part that actually needs you.

The repo is here: github.com/kevserimirogullari-hash/google-play-description-autoresearcher. If you run it on something real, I want to hear what the score did and, more importantly, what it didn’t.


