I don’t think the fact that LLMs can handle small numbers more reliably has anyt...

luke0016 · 2025-11-08T16:28:23 1762619303

> reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.

Or given a calculator. Which it's running on. Which it in some sense is. There's something deeply ironic about the fact that we have an "AI" running on the most technologically advanced calculator in the history of mankind and...it can't do basic math.

novok · 2025-11-08T17:55:10 1762624510

This is like saying it's ironic that an alternator in a car cannot combust gasoline when the gasoline engine is right beside it, even though the alternator 'runs' on the gasoline engine.

luke0016 · 2025-11-08T19:08:50 1762628930

Or similarly having a gasoline engine without an alternator and making the observation that there's an absurdity there in that you're generating large amounts of energy, yet aren't able to charge a relatively small 12V battery with any of it. It's a very practical and natural limitation, yet in some sense you have exactly what you want - energy - you just can't use it because of the form. If you step back there's an amusing irony buried in that. At least in my humble opinion :-)

benjiro · 2025-11-08T17:48:23 1762624103

Thing is, an LLM is nothing but a prediction algorithm based on what it was trained on. So it missing basic calculator functionality is a given. This is why tool usage is more and more a thing for LLMs, so that the LLM can itself use a calculator for the actual math parts it needs, thus increasing accuracy ...
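
For illustration, a minimal sketch of what the calculator side of such a tool call could look like. The message shape (`{"tool": ..., "expression": ...}`) and the names `calculator` and `handle_tool_call` are made up for this example and do not correspond to any particular vendor's tool-calling API. The model only has to emit the expression; exact integer arithmetic is done by ordinary code.

```python
import ast
import operator

# Operators the hypothetical calculator tool accepts (integer arithmetic only).
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}


def calculator(expression: str) -> int:
    """Evaluate an integer expression exactly by walking its syntax tree (no eval)."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval"))


def handle_tool_call(call: dict) -> str:
    """Dispatch a tool-call message (hypothetical format) and return the result as text."""
    if call.get("tool") != "calculator":
        raise ValueError("unknown tool")
    return str(calculator(call["expression"]))
```

Python integers are arbitrary precision, so the result stays exact however many digits the operands have.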

DrewADesign · 2025-11-08T18:54:05 1762628045

If they were selling LLMs as “LLMs” instead of magic code-writing, answer-giving PhD replacements, the lack of basic arithmetic capability would be a given… but they aren’t. Judging a paid service using their own implied claims is perfectly reasonable.

throwup238 · 2025-11-08T18:02:07 1762624927

Why is it a given? The universal approximation theorem should apply since addition is a continuous function. Now whether the network is sufficiently trained for that is another question but I don’t think it's a given that a trillion parameter model can’t approximate the most basic math operations.

I think the tokenization is a bigger problem than the model itself.
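
A toy sketch of what I mean. This is not any real tokenizer; the fixed three-digit chunks and both function names are assumptions for illustration. Chunking digits greedily from the left means the chunk boundaries shift with the length of the number, so they stop lining up with place value.

```python
def chunk_left_to_right(digits: str, size: int = 3) -> list[str]:
    """Split a digit string into fixed-size chunks starting from the left."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_by_place_value(digits: str, size: int = 3) -> list[str]:
    """Split a digit string so chunk boundaries align with thousands groups."""
    head = len(digits) % size
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks


for number in ("234567", "1234567"):
    print(number, chunk_left_to_right(number), chunk_by_place_value(number))
```

With left-to-right chunking, prepending a single digit changes every chunk; with place-value chunking, the trailing groups stay the same.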

benjiro · 2025-11-08T19:01:43 1762628503

Easy to answer that one ... predictions are based upon accuracy. So if you have an int4 vs a float16, the chance that the prediction goes off is higher with an int4. But even with a float16, you're still going to run into issues where your prediction model goes off. It's going to be a lot less, but you're still going to get rounding issues, which may result in a 5 being an 8 (just an example).

So while it can look like an LLM calculates correctly, it's still restricted by this accuracy issue. And when you get a single number wrong in a calculation, everything is wrong.
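
A rough illustration of how that compounds. It is a sketch that assumes every output digit is right independently with the same probability, which is a simplification and not a measured property of any model; the 0.99 is an arbitrary example value.

```python
def p_all_digits_correct(per_digit_accuracy: float, n_digits: int) -> float:
    """Chance that every digit of an n-digit answer is right, assuming independence."""
    return per_digit_accuracy ** n_digits


# The longer the answer, the less likely it is to come out entirely right.
for n in (3, 10, 30, 100):
    print(n, round(p_all_digits_correct(0.99, n), 3))
```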

A calculator, on the other hand, does not deal with predictions but with basic adding/multiplying/subtracting etc., things that are 100% accurate (if we do not count issues like cosmic rays hitting, failures in silicon, etc.).

A trillion parameter model is just that, a trillion parameters, but what matters is not the tokens but the accuracy, as in, do they use int, float16, float32, float64 ... The issue is, the higher we go, the more the memory usage explodes.

There is no point in spending terabytes of memory just to get a somewhat accurate predictive calculator, when we can just have the LLM call an actual calculator to ensure its results are accurate.

Think of an LLM more like somebody with Dyslexia / Dyscalculia ... It does not matter how good you are, all it takes is switching one number in an algebraic calculation to get a 0/10 ... The reason why I mention this is that I often think of an LLM like a person with Dyslexia / Dyscalculia. It can have insane knowledge, be smart, but be considered dumb by society because of that less than accurate prediction (or number swapping issue).

Take it from somebody that wasted a few years in school thanks to that issue, it really does not matter if you're a good programmer later in life when you flunk a few years thanks to undiagnosed issues. And yet, just like an LLM, I simply rely on tool usage to fix my inaccuracy issues. No point in wasting good shoulder space trying to graft a dozen more heads/brains onto me, when I can simply delegate the issue away. ;)

The fact that we can get computer models that can almost program, write texts, ... and do so much more like a slightly malfunctioning human, amazes me. And at the same time, I curse at it like my teachers did, and also call it dumb at times hehehe ... I now understand how my teachers felt loool

halJordan · 2025-11-08T17:07:16 1762621636

This is a very unserious take. It's not ironic, because it's not a calculator.

rrr_oh_man · 2025-11-08T17:26:04 1762622764

What's the meaning of `computer`, remind me quick?

anamexis · 2025-11-08T17:37:32 1762623452

Computer vision algorithms run on computers and they can’t do basic arithmetic.

My email client runs on my computer and it doesn’t do basic arithmetic either.

Something running on a computer does not imply that it can or should do basic arithmetic.

TheOtherHobbes · 2025-11-08T18:01:50 1762624910

That's confusing basic arithmetic as a user feature and as an implementation requirement.

I guarantee that computer vision and email clients both use basic arithmetic in implementation. And it would be trivially easy to bolt a calculator into an email app, because the languages used to write email apps include math features.

That's not true of LLMs. There's math at the bottom of the stack. But LLMs run as a separate closed and opaque application of a unique and self-contained type, which isn't easily extensible.

They don't include hooks into math features on the GPUs, and there's no easy way to add hooks.

If you want math, you need a separate tool call to conventional code.

IMO testing LLMs as if they "should" be able to do arithmetic is bizarre. They can't. They're not designed to. And even if they did, they'd be ridiculously inefficient at it.

anamexis · 2025-11-08T18:04:56 1762625096

Yes, you are agreeing with me.

gishh · 2025-11-08T18:05:35 1762625135

Pretty sure the only thing computer vision does is math.

I’ve also observed email clients tallying the number of unread emails I have. It’s quite obnoxious actually, but I qualify adding as math.

ghurtado · 2025-11-08T19:32:15 1762630335

> Pretty sure the only thing computer vision does is math.

That is only marginally less pedantic than saying that the only thing computer vision does is run discrete electrical signals through billions of transistors.

gishh · 2025-11-09T20:16:08 1762719368

If you’ve ever written code for a computer vision application, you’d realize how incorrect this statement is.

anamexis · 2025-11-08T18:11:22 1762625482

Yes, everything that a computer does, it does using math. This does not imply that things running on the computer can do basic arithmetic tasks for the user.

zamadatix · 2025-11-08T16:18:01 1762618681

Pencil and paper is just testing with tools enabled.

LadyCailin · 2025-11-08T16:23:07 1762618987

I’d say it’s fair for LLMs to be able to use any tool in benchmarks, so long as they are the ones to decide to use them.

zamadatix · 2025-11-08T16:34:15 1762619655

Agreed. I don't like when the prompt sets up a good portion of how to go about finding the answer by saying which tools to use and how. The LLM needs to decide when and how to use them, not the prompt.

daveguy · 2025-11-08T17:11:38 1762621898

I don't think it should be completely open ended. I mean, you could have an "ask_hooman" tool that solves a ton of problems with current LLMs. But that doesn't mean the LLM is capable with respect to the benchmark.

vntok · 2025-11-08T18:15:27 1762625727

Why not? One of the most intelligent things to do when stuck on a problem is to get outside help.

If allowing this behaviour raises a problem, you can always add constraints to the benchmark such as "final answer must come out under 15s" or something. The LLM can then make the decision to ask around in accordance with the time risk.

daveguy · 2025-11-08T19:28:23 1762630103

Because AI are good at devolving to the highest score, regardless of test intent. For most problems "ask_hooman", or especially the plural, would be much more effective. So, the degenerate case would dominate and tell you precisely zero about the intelligence of the AI. If a specific "tool" is more adept than the "AI" then "choose tool" will always be the correct answer. But I agree, a tight time constraint would help.

Dylan16807 · 2025-11-08T17:53:30 1762624410

On some level this makes sense, but on the other hand LLMs already have perfect recall of thousands of symbols built into them, which is what pencil and paper gives to a human test taker.

zamadatix · 2025-11-08T19:11:35 1762629095

If only context recall was actually perfect! The data is certainly stored well, accurately accessing the right part... maybe worse than a human :D.

Dylan16807 · 2025-11-08T19:15:51 1762629351

If you're not doing clever hacks for very long windows, I thought a basic design fed in the entire window and it's up to the weights to use it properly.

layer8 · 2025-11-08T16:22:20 1762618940

You seem to be addressing an argument that wasn’t made.

Personally, I’d say that such tool use is more akin to a human using a calculator.

zamadatix · 2025-11-08T16:26:20 1762619180

I'm not addressing an argument, just stating that this is already a form of LLM testing done today, for people wanting to look at the difference in results in the same way as in the human analogy.

layer8 · 2025-11-08T16:28:45 1762619325

Okay, but then I don’t understand why you replied to my comment for that, there is no direct connection to what I wrote, nor to what bee_rider wrote.

zamadatix · 2025-11-08T16:29:50 1762619390

> To the contrary, reasoning ability should enable them to handle numbers of arbitrary size, just as it enables humans to do so, given some pencil and paper.

People interested can see the results of giving LLMs pen and paper today by looking at benchmarks with tools enabled. It's an addition to what you said, not an attack on a portion of your comment :).

layer8 · 2025-11-08T16:35:17 1762619717

I see now. My focus was on the effect of LLMs’ (and by analogy, humans’) reasoning abilities argued by bee_rider. The fact that tool use can enable more reliable handling of large numbers has no bearing on that, hence I found the reply confusing.

zamadatix · 2025-11-08T16:40:10 1762620010

Hmm, maybe it depends on the specific test and reasoning in it? I certainly think reasoning about how and when to use allowed tools, and when not to, is a big part of the reasoning and verification process. E.g. most human math tests allow for a pen and paper calculation, or even a calculator, and that can be a great way to, say, spot check a symbolic derivative and see it needs to be revisited without relying on the calculator/paper to do the actual reasoning for the testee. Or to see that the equation of motion for a system can't possibly have been right with some test values (without which I'm not sure I'd have passed my mid level physics course haha).

At the very least, the scores for benchmarking a human on such a test with and without tools would differ, just as they would for an LLM with and without the analogous constraints. That is (IMO) a useful note in comparing reasoning abilities, and it's why I thought it was interesting to note that this kind of testing is just called testing with tools on the LLM side (not sure there is an equally standard term on the human testing side? Guess the same could be used for both though).

At the same time I'm sure other reasoning tests don't gain much from/expect use of tools at all. So it wouldn't be relevant for those reasoning tests.

ambicapter · 2025-11-08T16:31:01 1762619461

> Since performance on large numbers is not what these exams are intended to test for,

How so? Isn't the point of these exams to test arithmetic skills? I would hope we'd like arithmetic skills to be at a constant level regardless of the size of the number?

singron · 2025-11-08T16:47:16 1762620436

No. AIME is a test for advanced high schoolers that mostly tests higher level math concepts like algebra and combinatorics. The arithmetic required is basic. All the answers are 3-digit numbers so that judging is objective and automated while making guessing infeasible. You have 12 minutes on average for each question, so even if you are terribly slow at arithmetic, you should still be able to calculate the correct answer if you can perform all the other math.
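
The numbers behind that, as a quick sketch. The 15 questions and 3 hours are the standard AIME format rather than something stated in this thread, and answers are integers from 000 to 999.

```python
QUESTIONS = 15          # standard AIME length (assumption, not stated above)
TOTAL_MINUTES = 180     # 3 hours (assumption, not stated above)
POSSIBLE_ANSWERS = 1000  # integers 000 through 999

minutes_per_question = TOTAL_MINUTES / QUESTIONS
p_single_guess = 1 / POSSIBLE_ANSWERS
expected_lucky_guesses = QUESTIONS * p_single_guess

print(f"{minutes_per_question:.0f} minutes per question")
print(f"{p_single_guess:.1%} chance per blind guess")
print(f"{expected_lucky_guesses:.3f} expected correct answers from guessing everything")
```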

ambicapter · 2025-11-08T18:26:45 1762626405

That's probably a great test for high schoolers but it doesn't really test what we want from AI, no? I would expect AI to be limited by the far greater constraints of its computing ability, and not the working memory of a human high schooler.