Nope, just “reasoning: high” and nothing else. You can run it here and try it yourself, though I’m using it on my PC; compared to the local version, the online one seems to use some kind of system prompt beyond high reasoning that alters its responses: https://gpt-oss.com/


Great, I had missed the fact that it is easily available via a browser interface. This is the response to the same question at high reasoning; it’s similar, but not identical, to yours.

What a “limiting amino‑acid” is

  1. Every protein is made of 20 different amino‑acids.

    • 9 of them are essential (the body cannot make them, you must get them from food).
    • The other 11 are “non‑essential” because the body can synthesize them from other nutrients.
  2. When you eat a food, it supplies a mixture of those amino‑acids.
    The relative amounts are not the same in every protein source.
    For example, a scoop of wheat protein is rich in leucine, isoleucine, valine, but very low in lysine.

  3. The “limiting” amino‑acid is the essential one that is present in the smallest amount relative to the body’s requirement.
    Think of a recipe that calls for ten different ingredients; the ingredient you have the least of determines how many complete dishes you can make. In protein synthesis, the “dish” is a new protein molecule, and the “ingredient you have the least of” is the limiting AA.


Why the limiting amino‑acid caps new‑protein synthesis

| Step in protein synthesis | What happens if an essential AA is missing |
|---|---|
| tRNA charging (amino‑acyl‑tRNA synthetases attach an AA to its tRNA) | The tRNA for the missing AA stays “empty.” |
| Ribosome elongation (the ribosome moves along mRNA, adding one AA at a time) | When the ribosome reaches a codon that calls for the empty tRNA, it stalls. Translation stops or the incomplete peptide is degraded. |
| Net protein balance (new protein added – protein broken down) | Even if you have plenty of all the other AAs, the shortage of one essential AA prevents the whole chain from being completed, so the extra AAs are either oxidized for energy or recycled, not stored as new muscle, enzyme, or tissue protein. |

Bottom line: The body can only incorporate as much protein as the least‑available essential amino‑acid allows. This is why the limiting AA is often called the “first‑limiting factor” in protein nutrition.


How it looks in real diets

| Common food source | Typical limiting AA | Why it matters |
|---|---|---|
| Cereals (wheat, rice, corn) | Lysine | Grain proteins are high in methionine & cysteine but low in lysine; without extra lysine you can’t fully use the grain protein. |
| Legumes (beans, peas, lentils) | Methionine + cysteine | Legume proteins are rich in lysine but low in sulfur‑containing AAs. |
| Soy protein | Methionine (borderline) | Soy is a relatively balanced plant protein, but its methionine is still a little lower than the “ideal” pattern. |
| Animal muscle (beef, chicken, fish) | Usually none (all essential AAs in adequate proportions) | Animal proteins are “complete” – they rarely have a limiting AA, though very low‑fat cuts can be slightly lower in lysine. |
| Egg white | Methionine (slightly) | Egg white is high in most AAs, but methionine is the closest to the “first‑limiting” spot. |

What this means: Even if you eat 2 g of “protein” from wheat, the body can only use the amount that contains enough lysine. Any extra wheat protein that exceeds the lysine supply is wasted for growth or repair.


Measuring protein quality → the limiting AA is the key

| Score | What it does | How it uses the limiting AA concept |
|---|---|---|
| PDCAAS (Protein Digestibility‑Corrected Amino‑Acid Score) | Compares the pattern of essential AAs in a food to a reference pattern (human requirement) and adjusts for digestibility. | The lowest ratio (the limiting AA) becomes the overall score. |
| DIAAS (Digestible Indispensable AA Score, newer) | Similar, but uses ileal digestibility for each AA individually. | Again, the most limiting AA determines the final DIAAS value. |

If a protein source scores 0.60, that means the most limiting essential AA is only 60 % of the amount the body needs (once you factor in digestibility). Even if total protein grams look ample, the limiting AA caps the usable protein to 60 % of the ingested amount.
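
To make the scoring arithmetic concrete, here is a minimal Python sketch of a PDCAAS‑style calculation. The amino‑acid numbers and digestibility below are illustrative placeholders (not the official FAO reference pattern), and the function name is made up; the point is only that the smallest food‑to‑reference ratio – the limiting AA – sets the whole score.

```python
# Toy PDCAAS-style calculation: the limiting amino acid sets the score.
# All numbers below are illustrative placeholders, not official reference values.

def pdcaas_like_score(food_mg_per_g, reference_mg_per_g, digestibility):
    """Return (score, limiting_aa) for a PDCAAS-style protein-quality score.

    food_mg_per_g      -- mg of each essential AA per g of protein in the food
    reference_mg_per_g -- mg of each essential AA per g of protein in the
                          reference requirement pattern
    digestibility      -- fractional protein digestibility (0..1)
    """
    ratios = {aa: food_mg_per_g[aa] / reference_mg_per_g[aa]
              for aa in reference_mg_per_g}
    limiting_aa = min(ratios, key=ratios.get)   # the "first-limiting" AA
    score = ratios[limiting_aa] * digestibility
    return min(score, 1.0), limiting_aa         # PDCAAS is capped at 1.0

# Hypothetical wheat-like profile: plenty of sulfur AAs, short on lysine.
reference  = {"lysine": 45, "met+cys": 22, "threonine": 23, "leucine": 59}
wheat_like = {"lysine": 25, "met+cys": 35, "threonine": 27, "leucine": 68}

score, limiting = pdcaas_like_score(wheat_like, reference, digestibility=0.90)
print(f"limiting AA: {limiting}, score: {score:.2f}")   # lysine caps the score
```

Run on a wheat‑like profile, lysine ends up limiting and the score collapses to roughly its ratio times digestibility – which is exactly what a reported score of 0.60 means in practice.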


Practical take‑aways for everyday eating

  1. Aim for “complete” protein or combine complementary proteins.

    • Animal proteins (meat, dairy, eggs, fish) are usually complete.
    • Plant‑based combos:
      • Cereal + Legume (e.g., rice + beans, tortilla + black beans) → lysine from legumes + methionine from cereals.
      • Grain + Nuts/Seeds (e.g., oatmeal + almonds) → adds lysine and other limiting AAs.
    • Soy and quinoa are plant sources that are close to complete; they’re excellent “stand‑alone” options.
  2. Don’t rely on gram‑count alone.

    • 30 g of whey protein, 30 g of wheat protein, and 30 g of pea protein all deliver 30 g of total nitrogen, but their net contribution to new tissue varies because of different limiting AAs.
  3. Spread protein intake through the day.

    • The body’s “AA pool” turns over constantly. Consuming a moderate amount of high‑quality protein every 3–5 h keeps all essential AAs above the threshold needed for continuous synthesis (important for athletes, older adults, or anyone recovering from injury).
  4. Special populations need extra attention.

    • Infants, children, pregnant women, and the elderly have higher requirements for one or more essential AAs (e.g., lysine for growth, leucine for muscle maintenance). Diets that are marginal in those AAs can become limiting faster.
    • Clinical settings (e.g., patients on tube feeding) use formulas designed to be free of a limiting AA—often by adding crystalline AAs to meet the exact pattern.
  5. If you’re on a strict vegan diet, check which AAs might be low.

    • Lysine is often the first‑limiting AA for many grain‑heavy meals.
    • Methionine & cysteine can be limiting when legumes dominate.
    • Adding a handful of pumpkin seeds, hemp seeds, or soy products helps cover sulfur‑containing AAs, while beans or lentils cover lysine.

A quick “mental model” to remember

Protein synthesis = building a LEGO tower.
You have many different coloured bricks (the 20 AAs). To finish a floor, you need one of each colour. If you run out of red bricks (say, lysine), you can’t place any more floors, no matter how many blue, green, or yellow bricks you have left. The red bricks are the limiting ones, and the height of the tower (your net new protein) is set by how many red bricks you have.


Bottom line in a nutshell

  • Limiting amino‑acid = the essential AA that is supplied in the lowest proportion relative to the body’s needs.
  • Protein synthesis stops when that AA runs out, even if all other AAs are abundant.
  • Total grams of protein matter less than the balance of all essential AAs.
  • Optimize diets by choosing complete proteins or by pairing complementary foods so that no essential AA is left “short‑changed.”

Understanding the limiting‑AA concept lets you design meals that truly support growth, repair, and maintenance—rather than just “eating protein” in the abstract. :blush:

What If A.I. Doesn’t Get Much Better Than This?

The “scaling hypothesis” has not been falsified, as far as I know. What has happened is that companies are running out of internet data. For example, in training GPT-4.5, OpenAI was rumored not to have had enough data to reach optimal performance, so it settled for a model trained on what it had. But don’t worry, they have lots of other ideas for how to improve their models, including synthetic data, data-efficiency gains, and RL “reasoning model” training.

The scaling hypothesis is about how models should improve – as measured by a loss function – at predicting text drawn from a given distribution, for models in some class with a certain number of parameters and a certain amount of training compute. It’s empirical, yes, though people have tried to give heuristic justifications (like Jared Kaplan). Also, it doesn’t say anything about downstream performance, i.e. how much “smarter” the model is getting on any particular task.
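
For readers who haven’t seen it written down, the hypothesis is usually expressed as a smooth parametric fit of training loss against parameter count and training tokens. The sketch below uses the common two‑term power‑law form from the scaling‑law literature (Kaplan- and Chinchilla-style fits); the constants are made‑up placeholders, not real fitted values.

```python
# Illustrative scaling-law shape: loss as a power law in parameters N and tokens D.
# The functional form L(N, D) = E + A / N**alpha + B / D**beta is the one commonly
# fitted in the scaling-law literature; the constants here are placeholders.

def predicted_loss(n_params, n_tokens,
                   E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    return E + A / n_params**alpha + B / n_tokens**beta

for n, d in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"N={n:.0e}, D={d:.0e} -> predicted loss {predicted_loss(n, d):.3f}")
```

The constants E, A, B, α, β are fitted empirically for a given model family and data distribution.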

One thing worth mentioning, regarding testing, is that it’s really hard to compare models to see whether they are on the curve predicted by the scaling hypothesis. That’s because they don’t all use the same data or data distribution – in fact, some may even use copious amounts of synthetic data. Thus, each model is like a point on its own scaling curve, and that curve may be different from the curve for another model; trying to put them all on the same curve is like comparing apples to oranges.

The “scaling hypothesis” is presented as a scientific hypothesis, but the companies building models are doing a lot of engineering that defeats attempts at verifying or falsifying it. It’s like saying someone has a mathematical formula to predict the amount of gas a car needs to drive 100 miles on a certain road, and then they want to test it. Meanwhile, someone comes along and builds a new road to run the test on, completely redesigns the car and even the fuel you feed into it, and gets results different from those predicted by the formula. “I guess the formula has been falsified!”

Addendum: I fed my comment into GPT-5-thinking, and you can see what it says about what I got right and what I got wrong (or overstated):

Everyone, if the AI-generated text is more than a couple of paragraphs then please just include a short posting with a link to the full posting. We don’t want to fill up the forums with AI generated text.

Thanks!


GPT-5 is a joke. Will it matter?

How about quoting the AI text instead and you limiting the size of the quote to a few paragraphs (with the possibility of expansion)?

I think that is basically what I said… the link would be to the full text (expansion)… or are you envisioning some other way to implement this idea?

Yes, as you can see when you’re quoting someone else you only see a couple lines of text, but then you can expand it.

If you could implement similar functionality for

> Blockquote

with `>`.

In the top right you have “expand/collapse”:

Nobody clicks on links. At the same time, probably no one reads full-length AI posts right now unless they’re a summary of a video or something similar; at least I don’t.

If I wanted an AI answer, I could just ask it myself.

But it’ll get much better over time, so personally I don’t mind people posting it; I just scroll by, and it doesn’t take much. I think it’s good that people use it.

This paywalled paper looks interesting:

Unexpected ability of large language models: predicting aging status

We developed a large language model-based framework to predict the magnitude of aging across diverse populations using unstructured and heterogeneous data. The predicted aging was highly correlated with multiple aging-related outcomes. Our research suggests that large language models can do more than generate text — they can also predict a person’s aging status.

https://www.nature.com/articles/s41591-025-03865-7

Anyone have access to the PDF (or an LLM summary of the paper)?
@cl-user @John_Hemming

I don’t have either of those, but if I remember rightly I saw an image where it was using blood biomarkers, much like Levine’s formula (although we don’t know how it did the calculations).

Succinct summary:

(I would have used GPT-5, but sometimes when there are PDF files in the conversation the conversation sharing doesn’t work.)

The paper is very short, by the way. It’s like a poster.


New arxiv paper testing GPT-5 on medical questions:

Capabilities of GPT-5 on Multimodal Medical Reasoning

Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.26% and +26.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5’s ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.

And those are absolute (not relative) improvements, which is just crazy. And note that this isn’t even the thinking model (GPT-5-thinking) or the GPT-5-pro version. Those models are even stronger, with the pro model especially strong.


(video version: https://www.youtube.com/watch?v=ZxH9QWfzifo)

How AI could transform the future of medicine

Health-care and technology leaders discuss how assistive AI could revolutionize the future of medicine.

https://www.washingtonpost.com/washington-post-live/

Here are a few references on the “reliability” of GPT-5, including “hallucinations”. OpenAI claimed it is significantly better in this respect, so it’s worth considering. (Note that this doesn’t cover image synthesis, since that’s a separate module from GPT-5; in fact, I think image-generation requests get handed off to the separate gpt-image-1 model.) Reliability is clearly a problem with many facets, so improvement in one facet doesn’t necessarily mean improvement in others; however, the model does seem improved in several of the facets where people pointed out failings in the past. And it’s especially important to get this right if we ever want to trust AIs to make accurate medical diagnoses.

For example, you may have heard about the problem of “counting the r’s in strawberry” that previous models tended to get wrong. See this tweet by Daniel Litt, U. of Toronto mathematician:

https://x.com/littmath/status/1954978847759405501#m

Just asked GPT-5 to count letters in various words (b’s in blueberry, y’s in syzygy, r’s in strawberry, i’s in antidisestablishmentarianism, etc.) and it got 10/10 right. I have no doubt one can elicit poor performance in this kind of question but it’s atypical.

Also see this tweet where a guy did a more thorough test:

https://x.com/ctjlewis/status/1955202094131974438#m

there’s no point showing a heatmap here because i measured a straight diagonal line. i couldn’t produce an error at all for strings up to 50 characters over API. lengths 1-50 chars, 10 samples each, 500 samples total, zero errors.
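
The tweet doesn’t include the script, but a letter‑counting eval over the API could look roughly like the sketch below. Everything here is an assumption on my part: it uses the official `openai` Python client, “gpt-5” as the model name, and my own prompt wording and sampling scheme, so treat it as an illustration rather than a reproduction of that test.

```python
# Rough sketch of a letter-counting eval over the chat completions API.
# Assumes: `pip install openai`, OPENAI_API_KEY set, and access to a "gpt-5" model.
import random
import string
from openai import OpenAI

client = OpenAI()
random.seed(0)

def count_letter_trial(word: str, letter: str) -> bool:
    """Ask the model to count occurrences of `letter` in `word`; check exactness."""
    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[{
            "role": "user",
            "content": f"How many times does the letter '{letter}' appear in "
                       f"'{word}'? Reply with just the number.",
        }],
    )
    answer = resp.choices[0].message.content.strip()
    return answer == str(word.count(letter))

# Random strings of length 1..50, a few samples each.
correct = total = 0
for length in range(1, 51):
    for _ in range(3):
        word = "".join(random.choices(string.ascii_lowercase, k=length))
        letter = random.choice(word)
        correct += count_letter_trial(word, letter)
        total += 1
print(f"{correct}/{total} exact answers")
```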

Also see this tweet from OpenRouter about their testing of various models:

https://x.com/OpenRouterAI/status/1956030489900560769#m

After one week, GPT-5 has topped our proprietary model charts for tool calling accuracy🥇

In second is Claude 4.1 Opus, at 99.5%

GPT-5 is in first-place with a score of 99.9%, which is a 5x lower error-rate than Claude 4.1 Opus.

Then there are some of the benchmark results in OpenAI’s GPT-5 System Card technical report:

https://openai.com/index/gpt-5-system-card/

They ran tests using some of their own new benchmarks, but also used ones from Meta, Google DeepMind (LongFact), and the U. of Washington (FActScore), and saw large improvements. They saw less improvement – but still significant – on SimpleQA, OpenAI’s own older benchmark. However, for SimpleQA they only showed the “no web” results. In that setting, “accuracy” mainly measures how good a model’s latent memory of facts is, since it can’t look them up to verify. And “hallucination rate” here is the difference between the error rate and the refusal rate. You see an improvement here too, though it’s less impressive than the other reliability gains.

Finally, it’s worth pointing out that, even if it isn’t used in GPT-5, OpenAI does seem to know how to mitigate hallucinations quite a lot in certain specific domains like math. This is evident from their system that achieved a gold‑medal score on the 2025 IMO without using an external verifier – just text, pages and pages of natural-language math without a single error.

Isn’t that because of the way words are represented in tokens though?

e.g. the word might be represented by tokens like “str”, “awb”, “erry” rather than by individual characters. If that hasn’t changed, all the model might be doing is re-expressing those three tokens as individual per-character tokens for its reasoning – not demonstrating an underlying understanding – because words are compressed into fewer tokens than they have characters. And representing the tokens with individual tokens for each character could then just come from training on that specifically.

Yes, I know about BPE encoding. The issue is deeper than that. For example, you could ask previous models to spell out “strawberry” and they would dutifully write “s t r a w b e r r y”, with the letters separated by spaces; you could even ask them to write the letters each on a different line, which means the letters end up in different tokens. But even given this, they still struggled to count the number of r’s – at least that was how it was a couple of months/years ago… but not anymore!
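
As a side note on the BPE point, you can inspect the splits yourself with the `tiktoken` library. A small sketch (the choice of encoding is an assumption – different models use different encodings, and the exact split of “strawberry” depends on which one you pick):

```python
# Inspect how a BPE tokenizer splits words vs. spelled-out letters.
# Requires `pip install tiktoken`; "o200k_base" is one of OpenAI's public encodings,
# but which encoding a given model actually uses is an assumption here.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for text in ["strawberry", "s t r a w b e r r y"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")
```

The spelled‑out version comes back as many short tokens, which is why asking the model to write the letters on separate lines was a meaningful test of whether the counting failure was purely a tokenization artifact.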


Good point. If the LLM did write that out, then it’s indeed an issue of next-token prediction being unable to count the occurrences of a specific earlier token. It would be an interesting benchmark to try this on 1,000 words, especially ones not in the training data, and to see whether it matters for overall performance.


How to write effective prompts for GPT-5
