# benchmarks

Public benchmark snapshots for Memcone on BEAM.

We track judged answer quality, prompt footprint, memory footprint, and latency against a simple full-transcript replay baseline.


| metric             | value  | detail                                      |
| ------------------ | ------ | ------------------------------------------- |
| judged accuracy    | 39.0%  | 100 judged questions                        |
| prompt compression | 50.4×  | 473.9 avg tokens vs 23,886.5 replay         |
| memory compression | 59.5×  | 449.4 avg memory tokens vs 26,750.1 replay  |
| context latency    | 614 ms | remember avg 4,798 ms                       |
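The compression ratios follow directly from the average token counts; a quick arithmetic check (values taken from the snapshot above):

```python
# Verify the published compression ratios from the average token counts.
prompt_avg_memcone = 473.9     # avg prompt tokens per question (Memcone)
prompt_avg_replay = 23_886.5   # avg prompt tokens per question (full replay)
memory_avg_memcone = 449.4     # avg memory tokens held by Memcone
memory_avg_replay = 26_750.1   # avg tokens of the full transcript

prompt_compression = prompt_avg_replay / prompt_avg_memcone
memory_compression = memory_avg_replay / memory_avg_memcone

print(f"prompt compression: {prompt_compression:.1f}x")  # prompt compression: 50.4x
print(f"memory compression: {memory_compression:.1f}x")  # memory compression: 59.5x
```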

| metric          | Memcone   | full transcript replay |
| --------------- | --------- | ---------------------- |
| accuracy        | 39.0%     | 53.0%                  |
| prompt avg      | 473.9 tok | 23,886.5 tok           |
| memory avg      | 449.4 tok | 26,750.1 tok           |
| model latency   | 1,458 ms  | 3,011 ms               |
| context latency | 614 ms    | n/a                    |

| category                 | memcone | replay |
| ------------------------ | ------- | ------ |
| abstention               | 85.0%   | 80.0%  |
| contradiction resolution | 20.0%   | 17.5%  |
| event ordering           | 18.9%   | 19.1%  |
| information extraction   | 42.5%   | 85.8%  |
| instruction following    | 28.7%   | 62.5%  |

## run history

| date         | run                        | accuracy | note                  |
| ------------ | -------------------------- | -------- | --------------------- |
| Apr 24, 2026 | Phase 5 tightened (10×10)  | 45.0%    | latest                |
| Mar 2026     | Phase 4 semantic baseline  | 49.1%    | prior best            |
| Feb 2026     | Lexical retrieval          | ~38%     | lexical floor (worst) |

Accuracy has fluctuated run-to-run: the latest run (45.0%) sits below the Mar 2026 prior best (49.1%), though well above the lexical floor (~38%). The token efficiency advantage (~50.4× fewer prompt tokens than replay) is consistent across all runs and is the primary product metric.

## dataset

Source: Mohammadta/BEAM. The latest snapshot covers 10 conversations, 500 turns, and 100 judged questions.

## setup

Both strategies use the same answer model; Memcone answers from compressed memory, while the baseline answers from a full transcript replay.
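The setup can be sketched as follows. This is an illustrative harness, not the actual Memcone API: the two strategies share one answer model and differ only in the context they assemble.

```python
def build_context(strategy: str, transcript: list[str], summaries: list[str]) -> str:
    """Assemble the prompt context for one judged question.

    Hypothetical sketch (names are illustrative, not the real Memcone API):
    both strategies call the same answer model and differ solely in the
    context they pass to it.
    """
    if strategy == "memcone":
        # Memcone path: compressed memory (~449 avg tokens in the benchmark).
        return "\n".join(summaries)
    if strategy == "replay":
        # Baseline path: the full transcript (~23.9k avg tokens).
        return "\n".join(transcript)
    raise ValueError(f"unknown strategy: {strategy!r}")

# Toy data: 500 raw turns vs a handful of compressed summaries.
transcript = [f"turn {i}: ..." for i in range(500)]
summaries = ["user prefers metric units", "project deadline is Friday"]

assert len(build_context("replay", transcript, summaries)) > len(
    build_context("memcone", transcript, summaries)
)
```

The benchmark's prompt-token gap is simply the size difference between these two contexts, measured per judged question.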

## judging

Predictions are scored with a BEAM-aligned rubric judge and published as product benchmark snapshots.
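At the aggregation level, judged accuracy reduces to the passing fraction of per-question verdicts. A minimal sketch, assuming a binary pass/fail verdict per question (the verdict list is illustrative, sized only to reproduce the published 39.0% over 100 judged questions):

```python
# Illustrative verdict list: 39 passes out of 100 judged questions.
verdicts = [True] * 39 + [False] * 61

# Judged accuracy is the fraction of questions the rubric judge passed.
accuracy = sum(verdicts) / len(verdicts)
print(f"judged accuracy: {accuracy:.1%}")  # judged accuracy: 39.0%
```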