Skip to content

improve Simpleupdate performances#361

Merged
lkdvos merged 49 commits intoQuantumKitHub:masterfrom
ogauthe:simpleupdate
May 5, 2026
Merged

improve Simpleupdate performances#361
lkdvos merged 49 commits intoQuantumKitHub:masterfrom
ogauthe:simpleupdate

Conversation

@ogauthe
Copy link
Copy Markdown
Contributor

@ogauthe ogauthe commented Apr 16, 2026

This is a work in progress to improve SimpleUpdate performances.

Context: SimpleUpdate is quite slow, especially for 2nd neighbor interaction. I had cases where it became the bottleneck, above CTMRG. I started working on improving the performances last week. I realized many parts could be ported upstream.

I followed 4 different strategies:

  • improving type stability
  • removing copies where I think they could be avoided
  • removing intermediate normalization
  • removing permutations by simplifying perm1∘ perm2

This is a work in progress. It already shows significant speed-up for e.g. finite temperature J1-J2 with SU(2) symmetry. I know it is possible to do much better. I just saw #360 which has overlap for _get_cluster_trunc and I realized it may be relevant to mention my work (objective is not to block any part of #360)

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 93.63296% with 17 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/algorithms/time_evolution/get_cluster.jl 83.78% 12 Missing ⚠️
src/algorithms/time_evolution/simpleupdate.jl 97.10% 2 Missing ⚠️
src/algorithms/time_evolution/simpleupdate3site.jl 88.23% 2 Missing ⚠️
src/utility/util.jl 83.33% 1 Missing ⚠️
Files with missing lines Coverage Δ
src/algorithms/contractions/absorb_weight.jl 100.00% <100.00%> (ø)
src/algorithms/contractions/bondenv/als_solve.jl 84.90% <ø> (-11.46%) ⬇️
src/algorithms/contractions/bondenv/benv_ctm.jl 100.00% <100.00%> (ø)
src/algorithms/contractions/bondenv/benv_tools.jl 50.00% <ø> (ø)
src/algorithms/contractions/bondenv/gaugefix.jl 100.00% <100.00%> (ø)
src/algorithms/time_evolution/apply_gate.jl 85.00% <100.00%> (-7.86%) ⬇️
src/algorithms/time_evolution/apply_mpo.jl 99.20% <100.00%> (-0.80%) ⬇️
src/algorithms/time_evolution/time_evolve.jl 89.18% <ø> (-2.71%) ⬇️
src/algorithms/truncation/bond_tensor.jl 100.00% <100.00%> (ø)
src/algorithms/truncation/bond_truncation.jl 93.67% <100.00%> (-1.21%) ⬇️
... and 5 more

... and 34 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@Yue-Zhengyuan Yue-Zhengyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for improving SU! Left some initial comments.

Comment thread src/algorithms/time_evolution/simpleupdate.jl Outdated
Comment thread src/algorithms/time_evolution/simpleupdate.jl Outdated
(d, r, c), = _nn_bondrev(sites..., (Nr, Nc))
alg.bipartite && r > 1 && continue
ϵ′ = _su_iter!(state2, gate, env2, sites, alg)
ϵ′ = _su_iter_gate!(state2, gate, env2, sites[1], sites[2], alg)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I wanted _su_iter! to be able to handle both AbstractTensorMap{E, S, 2, 2} and length-2 MPO gates, though this is mainly for test purposes to ensure that MPO gates are applied correctly even for fermions. Is it possible to recover this behavior without loss of performance?

Copy link
Copy Markdown
Contributor Author

@ogauthe ogauthe Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I now see the point of using dispatch. There is a traedoff here between code readibility/efficiency (gate is much faster) and this test feature. Is this something that is currently tested?

[copy from above] In su_iter, we need an if statement to distinguish nearest neighbor from long distance at run time. Therefore my idea was that dispatch does not bring much here, while using different function names makes it explicit and easier to find which function/method is called.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something that is currently tested?

It is tested with Hubbard model in test/timeevol/cluster_projectors.jl.

gate is much faster

The MPO way can still be optimized. When applying the gate MPO on the first and the last cluster sites, we can also use the trick of reduced bond tensors, which has not been done yet. Even for longer-range (e.g. NNN) bonds, this can still be useful. Separating out two unitary tensors at the two ends of an OBC-MPO will not affect finding the Vidal gauge.

We can try this out in a follow-up PR if you prefer not to do too much here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I will revert to dispatch. Keeping this conversation, closing the other relative to this one.

Comment thread src/algorithms/time_evolution/simpleupdate3site.jl Outdated
@Yue-Zhengyuan
Copy link
Copy Markdown
Member

Yue-Zhengyuan commented Apr 19, 2026

There is another recent change that further slowed down SU for 2nd neighbor interaction. After introducing LocalCircuit in #347 (and its predecessor TrotterGates in #339), each term in the Hamiltonian is individually exponentiated to a gate in the LocalCircuit. Nearest neighbor and next-nearest neighbor terms are no longer grouped together when exponentiating, resulting in an increase in the number of Trotter gates to be applied.

Comment thread src/algorithms/time_evolution/simpleupdate.jl
Comment thread src/algorithms/time_evolution/simpleupdate3site.jl Outdated
Comment thread src/environments/suweight.jl
Comment thread test/timeevol/j1j2_finiteT.jl Outdated
Copy link
Copy Markdown
Member

@lkdvos lkdvos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments, will try and actually look into this in detail next week :)

Comment thread src/algorithms/time_evolution/simpleupdate3site.jl Outdated
Comment thread src/algorithms/time_evolution/apply_gate.jl Outdated
Comment thread src/algorithms/time_evolution/apply_mpo.jl Outdated
Comment thread src/algorithms/time_evolution/apply_mpo.jl Outdated
Comment on lines +147 to +148
gate_axs = alg.purified ? (1:1) : (1:2)
for gate_ax in gate_axs
X, a, b, Y = _qr_bond(A, B; gate_ax, positive = true)
for gate_ax in gate_axs # TODO try to use type stable helper function
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be better if I always do for gate_ax in 1:2, and manually break after the first loop when alg.purified === true?

for (site, vertex, open_vaxs) in ((siteA, A, open_vaxs_A), (siteB, B, open_vaxs_B))
s′ = (mod1(site[1], Nr), mod1(site[2], Nc))
rotated = _bond_rotation(vertex, dir, rev; inv = true)
state[s′...] = absorb_weight(rotated, env, s′..., open_vaxs; inv = true)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state tensors need normalization, especially when evolving for the ground state; otherwise, the norm will explode.

rightpermuted = permute(last(Ms), right_invperm)
state[s′] = absorb_weight(rightpermuted, env, s′, right_vaxs; inv = true)

for (vertex, s, invperm, vaxs) in zip(Ms[(begin + 1):(end - 1)], sites[(begin + 1):(end - 1)], map(t -> t[3], mids), map(t -> t[2], mids))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ogauthe Is it really that important to separate out the first and last tensors for type stability because they have 3 instead of 2 open virtual legs?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I mitigate the problem by wrapping contents in the for-loop as a function?

Copy link
Copy Markdown
Member

@Yue-Zhengyuan Yue-Zhengyuan May 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the speedup for the J1-J2 test mainly comes from the improvement in applying nearest neighbor gates. In my test, separating out the first and the last tensors of the cluster offers little gain in performance, especially since we currently only have examples with up to 3-site MPO gates. To reduce redundancy, I'll restore the old approach, but with more things wrapped inside dedicated functions.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think indeed a function barrier can mitigate some of it, but it will still give a dynamic dispatch. This might completely be fine though, that by itself doesn't cost too much as long as the type instability doesn't leak out of that region and taint the subsequent code

local wts
for gate_ax in 1:2
_apply_gatempo!(Ms, gates; gate_ax)
wts, ϵs = _cluster_truncate!(Ms, truncs)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that I have recently updated _cluster_truncate! in #348, so it can handle MPSs with any number of physical legs (and even different numbers of physical legs at each site). So no longer need to first fuse the physical legs for PEPO.

@Yue-Zhengyuan
Copy link
Copy Markdown
Member

I know it is possible to do much better.

@ogauthe Do you mind briefly outlining what can be further improved starting from the current PR?

@Yue-Zhengyuan Yue-Zhengyuan marked this pull request as ready for review May 2, 2026 12:30
@Yue-Zhengyuan Yue-Zhengyuan requested a review from lkdvos May 4, 2026 00:51
Comment thread src/algorithms/contractions/bondenv/benv_ctm.jl Outdated
@lkdvos lkdvos merged commit c5b94ba into QuantumKitHub:master May 5, 2026
62 of 63 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants