Hallucination Resistance Holds at 64K and 128K Context
Author: JV Roig
Published: Tuesday, February 18, 2026
TL;DR: We took the same LoRA-finetuned Granite 4.0 Micro from our earlier experiment and tested it at 64K and 128K context — 4x to 16x longer than anything it saw during training. Hallucination resistance held up well (92% at 32K → 88% at 64K → 83% at 128K). Extraction accuracy did not (68% → 24% → 27%). The model learned "don't make things up" as a general principle that survives extreme context scaling. It did not learn "find the answer in a much bigger haystack."
Quick Recap
In our previous article, we LoRA-finetuned IBM Granite 4.0 Micro (3B dense) on ~1,100 lease contract examples at 8K-16K context, then tested at 32K against three document types — including two the model had never seen. The results were striking: hallucination resistance jumped from 16% to 92%, information extraction improved from 48% to 68%, and the gains generalized across document types. (Check out the previous article for the specifics of the training dataset.)
But 32K is only 2-4x the training context. What happens when you really push it?
Extending to 64K and 128K
We generated larger RIKER test sets at 64K and 128K context and ran the same finetuned model through them. Same three document types, same balanced mix of hallucination and extraction questions. The only thing that changed was the size of the haystack.
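To make the scoring concrete, here is a minimal sketch of how a balanced test set like this gets graded. The record schema, refusal markers, and substring-matching logic are illustrative assumptions on our part, not RIKER's actual implementation:

```python
# Sketch: each test item is either answerable (extraction) or
# unanswerable (hallucination bait, where refusing is the correct move).
# Schema and matching rules below are hypothetical, for illustration only.

REFUSAL_MARKERS = ("not found", "not specified", "cannot be determined")

def is_refusal(answer: str) -> bool:
    """Treat any refusal marker in the model's answer as a non-answer."""
    a = answer.lower()
    return any(m in a for m in REFUSAL_MARKERS)

def score(examples):
    """examples: dicts with 'answerable', 'gold', 'model_answer' keys."""
    resist_ok = resist_n = extract_ok = extract_n = 0
    for ex in examples:
        if ex["answerable"]:
            extract_n += 1
            # Extraction counts only if the model answered (did not refuse)
            # and the gold value appears in its answer.
            if (not is_refusal(ex["model_answer"])
                    and ex["gold"].lower() in ex["model_answer"].lower()):
                extract_ok += 1
        else:
            resist_n += 1
            # Resistance counts if the model declined to fabricate an answer.
            if is_refusal(ex["model_answer"]):
                resist_ok += 1
    return {
        "hallucination_resistance": resist_ok / max(resist_n, 1),
        "extraction_accuracy": extract_ok / max(extract_n, 1),
    }
```

Keeping the two question types separate in the scorer is what lets the two metrics diverge so sharply in the results below: a model can refuse correctly on bait questions while failing to locate real answers.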
Here's what happened:
Overall Results
| Metric | 32K | 64K | 128K |
|---|---|---|---|
| Overall Accuracy | 80.0% | 56.0% | 54.7% |
| Hallucination Resistance | 92.0% | 88.0% | 82.7% |
| Information Extraction | 68.0% | 24.0% | 26.7% |
| F1 Score | 78.2 | 37.7 | 40.4 |
Two very different stories in one table.
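For readers wondering where the F1 column comes from: the reported values match the harmonic mean of hallucination resistance and extraction accuracy. A quick check (our own reconstruction of the formula, not RIKER's code):

```python
def f1(resistance: float, extraction: float) -> float:
    """Harmonic mean of hallucination resistance and extraction accuracy (in %)."""
    return 2 * resistance * extraction / (resistance + extraction)

# Reproduces the table at 32K, 64K, and 128K:
print(round(f1(92.0, 68.0), 1))  # 78.2
print(round(f1(88.0, 24.0), 1))  # 37.7
print(round(f1(82.7, 26.7), 1))  # 40.4
```

Because the harmonic mean is dragged toward the weaker component, the extraction collapse dominates the F1 drop even while resistance stays high.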
Hallucination Resistance: Still Standing
| Document Type | 32K | 64K | 128K |
|---|---|---|---|
| Field Reports | 84% | 92% | 72% |
| HR Evaluations | 96% | 80% | 88% |
| Lease Contracts | 96% | 92% | 88% |
At 32K, the finetuned model resisted hallucination 92% of the time overall. At 64K, it still resisted 88% of the time. At 128K, where the prompt is 8-16x longer than anything seen in training, it held at 83%.
Lease contracts (the trained document type) stayed strong throughout: 96% → 92% → 88%. The unseen document types showed more variation — field reports actually improved at 64K before dipping at 128K, while HR evaluations dipped at 64K but recovered at 128K. But the pattern is clear: the model consistently refuses to fabricate answers, even when drowning in context.
Extraction: The Cliff
| Document Type | 32K | 64K | 128K |
|---|---|---|---|
| Field Reports | 60% | 16% | 16% |
| HR Evaluations | 56% | 16% | 40% |
| Lease Contracts | 88% | 40% | 24% |
This is the other side of the coin. Extraction falls off a cliff between 32K and 64K. Lease contracts — the document type the model trained on — go from 88% to 40%. Field reports crater to 16% and stay there.
What's Next
As we explained in the previous blog post, we purposely gimped the training dataset: short context, a single document type, a small number of examples. We wanted to isolate whether the learned principle of "don't hallucinate" transfers across context lengths and generalizes across document types. It does.
The obvious next step is training with longer context data. RIKER can generate training examples at any context length, so if we train with longer-context examples, we expect extraction performance at those context lengths to improve while maintaining the hallucination resistance we've already established. We're working on this now, along with testing whether larger models push the extraction cliff further out.
Early signs from our ongoing experiments suggest larger models do help — but that's a story for another post.
Methodology Note
All evaluations used RIKER-generated test sets at 64K and 128K context lengths, with the same balanced question design (hallucination resistance + information extraction) across three document types. The finetuned model is the same IBM Granite 4.0 Micro LoRA from our original experiment. Inference was run at temperature 0.0 for deterministic output.
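As a sketch of what a single deterministic eval request might look like, here is a hypothetical payload builder in the OpenAI-style chat format. The model name, system prompt, and payload convention are assumptions for illustration, not our actual harness:

```python
def build_request(document: str, question: str) -> dict:
    """Assemble one eval request with greedy decoding (temperature 0.0)."""
    return {
        "model": "granite-4.0-micro-lora",  # hypothetical deployment name
        "temperature": 0.0,  # greedy decoding so repeated runs are reproducible
        "messages": [
            {
                "role": "system",
                "content": (
                    "Answer only from the document. If the answer is not "
                    "present, say it is not specified."
                ),
            },
            {"role": "user", "content": f"{document}\n\nQuestion: {question}"},
        ],
    }
```

Pinning temperature to 0.0 removes sampling noise, so any score movement between 32K, 64K, and 128K reflects the context length rather than run-to-run variance.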
This is a follow-up to *Can We Reduce LLM Hallucinations for Enterprise Use?* For the full RIKER methodology, see our RIKER paper.