I’ve published a theoretical framework for combining LLM ensemble outputs using Born’s rule from quantum mechanics, producing bilinear interference cross-terms that no existing ensemble method generates.
The Problem
Every current ensemble method (probability averaging, log-linear pooling, routing, majority vote) combines model outputs linearly, whether in probability or log-probability space: no term ever couples one model's prediction to another's, so the degree of agreement between models never enters the combination rule.
The Idea
Convert each model’s token probability into a complex amplitude ψ(t) = √P(t) · e^(iφ), superpose them, and extract the combined probability via |Σψ|². This produces interference terms √(Pᵢ·Pⱼ)·cos(φᵢ−φⱼ) that:
Amplify tokens where models agree (constructive interference)
Suppress tokens where models disagree (destructive interference)
Can push combined probability below any individual model’s estimate — impossible with any weighted average
The phase φ is derived from the logit gap (top-1 minus top-2 logit), which is already available from any forward pass. No new information is needed — just a structurally different mathematical operation on existing outputs.
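To make the cross-term concrete, here is a minimal sketch of the Born-rule combination for two models on a single token. The expansion |ψ₁ + ψ₂|² = P₁ + P₂ + 2√(P₁P₂)·cos(φ₁−φ₂) is straight from the formula above; note that the paper derives φ from the top-1/top-2 logit gap, and the phases here are just supplied directly for illustration.

```python
import math

def born_combine(P1, P2, phi1, phi2):
    """Combine two models' probabilities for one token via Born's rule.

    psi_k = sqrt(P_k) * exp(i * phi_k); the combined weight |psi_1 + psi_2|^2
    expands to P1 + P2 + 2*sqrt(P1*P2)*cos(phi1 - phi2), where the last
    term is the interference cross-term.
    """
    psi = math.sqrt(P1) * complex(math.cos(phi1), math.sin(phi1)) \
        + math.sqrt(P2) * complex(math.cos(phi2), math.sin(phi2))
    return abs(psi) ** 2  # unnormalized; renormalize over the vocabulary

# Aligned phases: constructive interference, 0.4 + 0.4 + 2*0.4 = 1.6
w_agree = born_combine(0.4, 0.4, 0.0, 0.0)
# Opposed phases: destructive interference, weight ~0, below both models'
# individual estimates -- unreachable by any weighted average of 0.4 and 0.4
w_disagree = born_combine(0.4, 0.4, 0.0, math.pi)
```

The destructive case is the structurally new behaviour: a convex combination of 0.4 and 0.4 can never leave 0.4, but the cross-term can cancel it entirely.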
The paper includes a systematic comparison against DeePEn, FusionRoute, MoA, self-consistency, and all major ensemble classes — none produce these cross-terms. It also presents three potentially fatal problems with full rigour and proposes falsifiable experiments (Phase Coherence Test via Rayleigh statistic, Cross-Term Discrimination Test).
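For readers unfamiliar with the Rayleigh statistic mentioned above: the Phase Coherence Test presumably asks whether per-token phase differences cluster or scatter uniformly on the circle, and the standard circular-statistics tool for that is sketched below. This is my generic implementation of the Rayleigh test, not code from the paper.

```python
import math

def rayleigh(phases):
    """Rayleigh test for non-uniformity of angles (in radians).

    R is the length of the mean resultant vector: 1 means all phases
    coincide (perfect coherence), 0 means they cancel (uniform scatter).
    The approximate p-value exp(-n * R^2) tests the uniform null.
    """
    n = len(phases)
    c = sum(math.cos(p) for p in phases) / n
    s = sum(math.sin(p) for p in phases) / n
    R = math.hypot(c, s)
    p_value = math.exp(-n * R * R)  # first-order approximation, large n
    return R, p_value

# Tightly clustered phase differences -> R near 1, p-value near 0:
# evidence that the interference terms are coherent rather than noise.
R, p = rayleigh([0.01 * k for k in range(100)])
```

A uniform scatter of phases drives R toward 0, which would indicate the cross-terms average out and the framework adds nothing over linear pooling.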
Looking For
Feedback on the framework and phase extraction approach
Collaborators interested in running the empirical validation (two open-weight models and 10K tokens would be sufficient)
Pointers to any prior work I may have missed
Happy to discuss the maths, limitations, or experimental design.
Your approach modifies the rule for combining probabilities without changing model weights; it's very interesting.
I am exploring a different direction where structured axiomatic prompts may locally modify the decision field of the model within a conversation.
Do you think structural modifications of the inference dynamics (like your Born-rule ensemble) and structural constraints at the prompt level could potentially produce similar effects in trajectory stabilization?
Structurally these are different mechanisms at different pipeline stages — BRF operates between models post-inference, axiomatic prompting within a single model pre-inference.
But what interests me is composability: if your axiomatic constraints stabilise each model’s trajectory before BRF combines them, do the cross-model phase relationships tighten?
That would be the condition under which interference becomes non-trivial (Section 5.2).
Do you have logit-level outputs from your Qwen experiments that could be used to test this?
Thank you for your insightful response and for emphasizing the importance of composability.
I can confirm that I have logit-level results from the Qwen PCE experiments. If you wish, I can share the complete reports with you. I am sending you an image of the logits so that you can analyse them directly, compute the phase relations, and evaluate whether axiomatic stabilization tightens the interactions between models, as you suggested.
Moreover, the model is openly available on my Hugging Face lab, so you can also run your own tests on it if you wish. This should provide a fully reproducible environment for experimentation.
Thank you Allan, I appreciate you sharing these results and the open access to your model — that’s generous.
The summary tables are useful context. For the specific phase analysis I described in Section 3.3, I’d need the raw logit values (pre-softmax) at each token generation step — the gap between the top-1 and top-2 logits is what maps to the phase angle in BRF.
Would it be possible to log those during generation? CSV would work well — one row per token position, columns for the top-k logit values. Happy to share the extraction script once I see the format.
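For concreteness, here is the CSV schema I have in mind, with a stand-in source of per-step logits; the generation loop itself is omitted, since with a real model you would capture the pre-softmax logit vector at each decoding step and feed those vectors in.

```python
import csv
import random

def log_topk_logits(logit_rows, path, k=5):
    """Write one CSV row per token position: the position index, then the
    top-k logits in descending order. `logit_rows` is any iterable of
    per-step logit vectors (lists of floats) captured during generation."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["position"] + [f"logit_top{i + 1}" for i in range(k)])
        for pos, logits in enumerate(logit_rows):
            top = sorted(logits, reverse=True)[:k]
            writer.writerow([pos] + [f"{v:.6f}" for v in top])

# Stand-in for real per-step logits (replace with the model's outputs)
fake_steps = [[random.gauss(0.0, 3.0) for _ in range(50)] for _ in range(10)]
log_topk_logits(fake_steps, "logits.csv")
```

The top-1 minus top-2 column difference in each row is then exactly the logit gap that maps to the phase angle.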
Thank you Aaron, this is a very useful clarification.
The exploratory trials were conducted by a collaborator, and the reports I currently have mainly include behavioural observations and summary metrics rather than token-level logits.
So, unfortunately, I don’t have the raw pre-softmax logits for these runs, and I have neither the technical means nor the knowledge to retrieve that data myself.
However, since the model and prompts are openly available on my Hugging Face lab, it should be possible to reproduce the generation and save these values. If you wish, I can even send you the model by email for easier access.
This would allow us to see whether the axiomatic induction produces detectable stabilization patterns in the logit dynamics.
Thank you again for these suggestions and I agree, this could be an interesting direction to explore.