AI can write code, generate videos, craft poetry, make art, and count badly. These are all pretty impressive, except the last one, but I don’t really want an LLM in my everyday life that can write good prose. What I want is a model that can tell me what’s wrong with my prose and offer suggestions on how to make it better. In short, an editor.
Writing is fun, so I wouldn’t want to outsource it to a machine, but editing is frankly a slog. An AI editor wouldn’t have to be better than a human editor, since no eyes but mine see my posts before they go live on Substack. It would just have to offer suggestions that improve on what I’d written to start with, or save me time polishing up typos and the like. This shouldn’t be too high a bar to clear, but it would also have to satisfy my concerns about ethics and about taking away from my learning.
So, before publishing a recent post of mine about phytomining, I tried out three LLMs on it to see if they could add some juice to the text. The results were not good, to put it mildly. Gemini 2.5 Pro had the best performance of the free models, coming in at maybe the level of a human editor who was doing a lazy job and wasn’t a particularly skilled writer in the first place. The paid ChatGPT Deep Research was slightly better, but not enough to justify its price tag. Claude Sonnet 4 alternated between wildly hallucinating problems and semi-useful advice. GPT-4o was basically useless. Despite explicit instructions in the prompt to do otherwise, all of them made edits that strongly went against my voice as a writer, with OpenAI’s GPT-4o being particularly vociferous.1
Let’s start with Claude, since it was the first model I tried and the one reputed to have the best writing style. It made 28 suggestions, although I had to use some judgement as to what counted as one suggestion versus several. Two of the three suggestions in the “typos and mechanical errors” section were flat-out wrong, and it also made some patently nonsensical comments on topics without a clear right answer.2 It pointed out two terms that might be overly technical, “concomitant” and “scalable”, which led me to replace them, though not with the alternatives it suggested. Like the other models, it suggested removing a footnote that I was already leaning towards taking out. Its outline for restructuring the piece would probably work about as well as the article’s current structure, but in general its more high-level suggestions tried to add in pop science clichés. All in all, it wasn’t a disaster, but it didn’t make a compelling case for itself.
Where Claude tried to make my language a bit less academic, ChatGPT instead suggested removing a few of my more colloquial expressions, advice which I didn’t follow.3 It did catch a non-American spelling of “signalled”, which it had no way of knowing is actually the spelling I prefer.4 Most of its suggestions were slight rewordings that wouldn’t have hurt the piece, but I couldn’t see where they’d help either.5 However, they definitely pushed my writing towards a bland style akin to that of completely AI-generated pieces or of the Grammarly AI: adding section headers, removing all footnotes, and starting with a “this isn’t science fiction” hook similar to the one Claude suggested. Not an impressive performance, however you want to measure it.6
Gemini, unlike the other two models, focused more on small-scale fixes. It caught an awkward phrase involving “could be having to”, although I replaced it with “could need” rather than its equivalent formulation, “would need.” It also identified a somewhat awkward referent in the last sentence of the first paragraph, although its proposed fix was even worse, and it correctly pointed out that I should specify ARPA-E rather than just ARPA. Like ChatGPT, many of its other comments reworded sentences with no discernible difference between the versions, or suggested unnecessary definitions and redefinitions. I’ll be generous and claim that 3 of its 22 suggestions were useful, although I suspect I could have identified many of these problems myself on a reread. To its credit, mindlessly incorporating all of its suggestions would have made the piece only slightly worse by my standards, and I could imagine a reader out there preferring the Gemini version, which makes it a cut above the other two.
ChatGPT with Deep Research is maybe a touch better than Gemini, but not enough to make up for the cost difference. Its structural suggestions were reasonable. I wouldn’t want to combine my discussion of the advantages of phytomining with the more argumentative sections towards the end, for instance, but it was a framing I had considered and one I wouldn’t be surprised to see suggested by a flesh-and-blood person. Like the other ChatGPT model, it tended to sand down my stylistic tendencies, including correcting my use of “signalled.” Its unnecessarily verbose output did include a few errors, such as misidentifying what “this” referred to in a sentence I later changed for other reasons, but it struggled with parsing links and footnotes somewhat less than the other models did. It can be difficult to tell when it’s pointing something out versus making a correction, but I’d put its rate of useful suggestions at approximately Gemini’s.
Taken as a whole, the AI models shared some tendencies. I’ve already mentioned their uncalled-for obsession with punchy hooks and hackneyed phrases, and they had such a strong antipathy to footnotes that I’m not sure they’re parsing them correctly.7 They also had a habit of suggesting lots of definitions, an understandable preference if a bit overstated. But they departed from each other in many ways too. ChatGPT took a broom to colloquialisms, while Claude and Gemini cautioned against technical jargon. And with the exception of the one unnecessary footnote, each suggestion that made it into the final piece was flagged by only one model.8
Although these initial results were not good, there are reasons this might not capture the full promise of AI for editing. I made plenty of edits to the piece, but they were mostly about bringing it in line with my voice rather than fixing copyediting errors, which had already been caught by spell check. And while my prompt is reasonably detailed, I didn’t do any real prompt engineering, which should at least be able to stop certain models from providing too much output. When I tested a few variants of the prompt, adding a sentence describing the audience did get Gemini to tell me to define fewer things, but it didn’t otherwise improve the quality of the suggestions. However, I can’t rule out that other people have writing styles where a round of AI revisions helps more; Bentham’s Bulldog and Andy Masley both report good results.9
Even with a revised prompt, there are plenty of problems that AI editors can’t surmount now or in the near future. To start, I’m writing for an audience of humans, so when a human reader tells me a stylistic choice is confusing, I know that at least one real person was confused by it. But the LLMs here are more or less predicting what my intended audience thinks, which means there could be as few as zero humans who agree with their choices. I don’t know enough about the technical details to say whether AI editing will inevitably be homogenizing, but that’s definitely another obstacle future systems would need to overcome.
Additionally, there are reasons to be concerned about the ethics of using AI models. The argument from environmental costs is overstated, at least for individual uses; model training is a different story. But I don’t really trust AI companies enough to want to give them money for fancier models, and I’d have some smaller concerns about coming off as supporting their future agendas. Copyright and training on human feedback also raise related concerns.
Furthermore, editing is a skill, and AI editing isn’t going to improve my ability to produce second and third drafts. That said, AI editing would more likely add another round on top of the editing I’d already be doing than replace it, so this might not be as major a problem as it sounds at first. Even if these programs became more competent, I’d still have to see if they were taking away opportunities for me to strengthen my writing.
So, while I’d love to have a second pair of whatever the AI equivalent of eyes is on my writing, practically speaking, there isn’t much use here for me. AI is almost certain to continue to improve, as will AI-driven applications and prompting techniques, so I know I’ll have to reëvaluate this conclusion at some point. For now, it’s difficult to imagine a situation where I’d care enough to have an LLM review my piece before publication, but not a human. In that rare case, I’d use Deep Research if I had access to it and Gemini if I didn’t, but there’s still a good ways to go before this is a valuable tool for my writing.
Here’s my prompt. “This is a piece intended to be posted on a blog covering a variety of scientific and social scientific topics for a lay audience. Please provide all editing suggestions for this piece. List all typos that need to be fixed, but also provide larger-scale advice if there are problems with the structure of the piece, and point out ways to clarify the technical content for non-technical audiences. Do not smooth out the style of the writer of this piece, except when it interferes with comprehension of the text. A large number of suggestions are wanted. It is acceptable to provide rewritten portions of the text if that is the best way to convey suggestions. Be critical and pay attention to all aspects of the piece, just like the best human editors.” A PDF of the Google Doc in which I wrote the piece was attached.
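If you wanted to script this instead of pasting into a chat window, a minimal sketch might look like the following. To be clear, this is not the setup I actually used; it assumes the OpenAI Python SDK and the pypdf package, and the file name and model are placeholders.

```python
# Hypothetical sketch, not my actual workflow: send the editing prompt plus
# the draft's text to a model through an API rather than a chat interface.
# Assumes the openai and pypdf packages; "draft.pdf" is a placeholder name.
from openai import OpenAI
from pypdf import PdfReader

EDITING_PROMPT = (
    "This is a piece intended to be posted on a blog covering a variety "
    "of scientific and social scientific topics for a lay audience. ..."
    # (the rest of the prompt quoted above)
)

# Pull the text out of the PDF, since plain text is the simplest thing
# to hand to a chat-completion endpoint.
reader = PdfReader("draft.pdf")
draft = "\n".join(page.extract_text() or "" for page in reader.pages)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; swap in whichever model you're testing
    messages=[{"role": "user", "content": f"{EDITING_PROMPT}\n\n{draft}"}],
)
print(response.choices[0].message.content)
```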
For example, suggesting that I should break a paragraph of just over four lines, one of the shorter paragraphs in the piece, into smaller chunks.
Although I later edited out one of the colloquialisms for completely different reasons.
This is a personal blog, so I’ll use my natural spelling system, which is mostly American English but with occasional Briticisms.
Plus a complaint that I don’t understand about plants not actually storing metals.
GPT-4.5 has similar issues, but since most people have to pay to use it I didn’t bother to critique it thoroughly.
Particularly Claude, which hallucinated a problem with footnote numbering.
Not counting Deep Research, because I ran that one later.
Although both are unfortunately pretty vague on what it does and how they’re using it.