Coverage of AI tools tends toward either uncritical enthusiasm or wholesale dismissal. Neither is particularly useful. Large language models are genuinely capable at some tasks and genuinely unreliable at others, and knowing which is which is what makes them useful rather than merely impressive.
Here are three areas where AI fails consistently enough that you should assume it will fail, not hope it won’t.
1. Multi-step maths
AI language models are not calculators. They generate text by predicting statistically likely tokens, and that process handles numbers poorly: to the model, a figure is another sequence of tokens to predict, not a quantity to compute. Straightforward arithmetic on familiar number combinations often works, because those patterns appear frequently in training data. Multi-step calculations, probability problems, financial modeling, and anything else that requires reliable numerical reasoning across several operations are not handled reliably.
The failure mode that makes this especially problematic: AI gets the maths wrong confidently. The answer is formatted like a correct answer, explained like a correct answer, and wrong. A calculator returns a number you can verify. An AI model returns a fluent explanation of an incorrect result.
For anything numerical beyond basic arithmetic, verify with a calculator. Every time.
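As an illustration of what "verify with a calculator" can look like for a multi-step problem, here is a minimal Python sketch that recomputes a compound-interest balance one period at a time and compares it with a figure an AI tool claimed. The function names, the one-cent tolerance, and the example figures are assumptions made for this illustration, not part of any particular tool.

```python
from decimal import Decimal, ROUND_HALF_UP


def compound_balance(principal: str, annual_rate: str, years: int,
                     periods_per_year: int = 12) -> Decimal:
    """Recompute a compound-interest balance one period at a time."""
    balance = Decimal(principal)
    periodic_rate = Decimal(annual_rate) / periods_per_year
    for _ in range(years * periods_per_year):
        # Add one period of interest to the running balance.
        balance += balance * periodic_rate
    # Round to cents only at the end, so rounding error does not accumulate.
    return balance.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def matches(claimed: str, computed: Decimal, tolerance: str = "0.01") -> bool:
    """True if the claimed figure is within the tolerance of the recomputed one."""
    return abs(Decimal(claimed) - computed) <= Decimal(tolerance)


# Hypothetical check: 100 at 10% a year, compounded annually for 2 years.
computed = compound_balance("100", "0.10", years=2, periods_per_year=1)
print(computed)                     # 121.00
print(matches("120.00", computed))  # False: the claimed figure is wrong
```

The point is not this particular formula. Any calculation that matters should be redone by something that actually computes, whether that is a script, a spreadsheet, or a calculator.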
2. Memory
Start a new conversation with an AI assistant and, by default, it has no knowledge of any previous conversation. Every session begins blank. There is no continuity between chats unless a specific memory feature has been enabled, and even memory features typically store summaries rather than a full record of previous exchanges.
Within a single long conversation, context from early in the session can be dropped as the conversation grows and the context window fills. Instructions given at the beginning may be forgotten by the middle. References to earlier parts of the conversation may be misremembered or ignored entirely.
“Remember what I told you earlier” is among the most reliably frustrating things to type into a chat window, because the earlier context may simply no longer be there.
This is worth understanding when deciding how to structure requests. Anything important that was established in a previous session needs to be re-established. Anything from early in a long conversation may need to be restated as the session progresses.
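To make the context-window effect concrete, here is a deliberately simplified Python sketch. It assumes a fixed token budget, counts one token per whitespace-separated word, and keeps only the most recent messages that fit. Real assistants use different tokenizers and may handle long conversations in other ways, so treat the functions and numbers below as hypothetical.

```python
def count_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word counts as one token.
    # Real tokenizers split text differently.
    return len(text.split())


def visible_context(messages: list[str], budget: int) -> list[str]:
    """Return the most recent messages that fit within the token budget."""
    kept: list[str] = []
    used = 0
    # Walk backwards from the newest message and stop at the first
    # message that would push the total over the budget.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # Restore chronological order.
    return kept


history = [
    "Always answer in French.",   # instruction given at the start
    "Tell me about Paris.",
    "Now tell me about Lyon.",
    "And what about Nice?",
]

# With a small budget, the opening instruction no longer fits.
print(visible_context(history, budget=13))

# Restating the instruction late in the session puts it back in view.
print(visible_context(history + ["Reminder: always answer in French."], budget=13))
```

In this toy model the first call returns only the last three messages, so the instruction to answer in French has dropped out. The second call shows why restating works: the reminder is recent, so it sits inside the window.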
3. Legal and medical documents
AI can produce legal contracts, medical summaries, and clinical documents that look indistinguishable from professionally produced work. The formatting, structure, and terminology may all be correct. The substance may not be.
Errors in AI-generated legal documents (incorrect clauses, provisions that contradict each other, missing protections that standard contracts include) are invisible to someone without legal expertise. The document looks like what a lawyer would produce. The errors are the kind that lawyers catch and non-lawyers don’t.
The same applies to medical documents: a language model can reproduce the format of clinical language, but that does not guarantee the accuracy of the content within it.
This is the category where the gap between “looks right” and “is right” has the most direct consequences. Well-formatted output is not evidence of correct output. For legal and medical matters with real stakes, professional review is not optional.
The practical takeaway
None of this argues against using AI. It argues for using it with accurate expectations. The tools are genuinely powerful for drafting, summarizing, brainstorming, explaining, and dozens of other tasks where the standard is “useful output I can evaluate and refine” rather than “definitely correct output I don’t need to check.”
Knowing where AI reliably fails is what separates useful AI use from blind trust in impressive-looking output.