Fake GLV: You don't need an efficient endomorphism to implement GLV-like scalar multiplication in SNARK circuits

_____ _ ____ _ __ __| ___|_ _| | _____ / ___| |\ \ / /| |_ / _` | |/ / _ \ | | _| | \ \ / /| _| (_| | < __/ | |_| | |__\ V /|_| \__,_|_|\_\___| \____|_____\_/

You don’t need an efficient endomorphism to implement GLV-like scalar multiplication in SNARK circuits

Introduction

P-256, also known as secp256r1 and prime256v1, is a 256-bit prime field Weierstrass curve standardized by the NIST. It is widely adopted in internet systems, which explains its myriad use cases in platforms such as TLS, DNSSEC, Apple’s Secure Enclave, Passkeys, Android Keystore, and Yubikey. The key operation in elliptic curves based cryptography is the scalar multiplication. When the curve is equipped with an efficient endomorphism it is possible to speed up this operation through the well-known GLV algorithm. P-256 does unfortunately not have an efficient endomorphism (see parameters) to enjoy this speedup.

Verifying ECDSA signatures on Ethereum through precompiled contracts, i.e. smart contracts built into the Ethereum protocol (there are only 9) is only possible with the secp256k1 curve and not the P-256.
Verifying ECDSA signatures on P-256 requires computing scalar multiplications in Solidity and is especially useful for smart-contract wallets, enabling hardware-based signing keys and safer, easier self-custody. Different solutions can bring P-256 signatures on-chain. There are primarily three interesting approaches: (zk)-SNARK based verifiers, smart contract verifiers (e.g. [Dubois23], Ledger/FCL (deprecated), smoo.th/SCL and daimo/p256verifier), and native protocol precompiles (EIP/RIP 7212).

Using SNARK (succinctness) properties, provides a great way to reduce gas cost for computation on Ethereum (e.g. ~232k gas for Groth16, ~285k gas for PLONK and ~185k gas for FFLONK). This is very competitive with (and sometimes better that) the currently gas-optimal smart contract verifier. Moreover one can batch many ECDSA verifications in a single proof, amortizing thus the gas cost. However verifying P-256 signatures in a SNARK circuit can be very expensive i.e. long proving time. This is because the field where the points on the P-256 curve lie is different than the field where the SNARK computation is usually expressed. To be able to verify the proof onchain through the procompile the SNARK field needs to be the BN254 scalar field. Different teams tried to implement the ECDSA verification on P-256 in a BN254 SNARK circuit efficiently. Among these: zkwebauthn/webauthn-halo2, https://github.com/zkwebauthn/webauthn-circom and PSE/circom-ecdsa-p256.

If P-256 had an efficient endomorphism we could have optimized the proving time a great deal!

In this note we show a way to implement a GLV-like scalar multiplications in-circuit without having an efficient endomorphism.

Other applications

Background

Standard scalar multiplication

Let EE be an elliptic curve defined over the prime field \mathbb{F}_pFp and let rr be a prime divisor of the curve order \#E(\mathbb{F}_p)#E(Fp) (i.e. the number of points).
Let s \in \mathbb{F}_rsFr and P(x,y) \in E(\mathbb{F}_p)P(x,y)E(Fp), we are interested in proving scalar multiplication s\cdot PsP over the rr-torsion subgroup of EE, denoted E[r]E[r] (i.e. the subset of points of order rr).

The simplest algorithm is the standard left-to-right double-and-add:

INPUT: s = (s_{t−1},..., s_1, s_0), P ∈ E(Fp).OUTPUT: sP.1. Q ← ∞.2. For i from t−1 downto 0 do2.1 Q ← 2Q.2.2 If s_i = 1 then Q ← Q + P.3. Return(Q).

If/else branching is not possible in SNARK circuits so this is replaced by constant window table lookups inside the circuit. This can be achieved using polynomials which vanish at the constants that aren’t being selected, i.e. a 1-bit table lookup Q ← s_i * Q + (1 - s_i) * (Q+P). Hence this double-and-add algorithm requires tt doublings, tt additions and tt 1-bit table lookup.
This can be extended to windowed double-and-add, i.e. scanning more than a bit per iteration using larger window tables, but the multiplicative depth of the evaluation increases exponentially. We use affine coordinates for doubling/adding points because inverses cost as much as multiplications, i.e. instead of checking that 1/x1/x is yy we provide yy out-circuit and check in-circuit that x\cdot y = 1xy=1. However since we start with Q ← ∞Q it is infeasible to avoid conditional branching since affine formulas are incomplete. Instead, we scan the bits right-to-left and assume that the first bit s_0 is 1 (so that we start at Q ← P), we double the input point P instead of the accumulator Q in this algorithm and finally conditionally subtract (using the 1-bit lookup) P if s_0 was 0.

INPUT: s = (s_{t−1},..., s_1, s_0), P ∈ E(Fp).OUTPUT: sP.1. Q ← P.2. For i from 1 to t−1 do2.1 If s_i = 1 then Q ← Q + P.2.2 P ← 2P.3. if s_0 = 0 then Q ← Q - P4. Return(Q).

GLV scalar multiplication

However it is well known that if the curve is equipped with an efficient endomorphism then there exists a faster algorithm known as [GLV].

Example 1 : suppose that EE has Complex Multiplication (CM) with discrimant -D=-3D=3, i.e. EE is of the form y^2=x^3+by2=x3+b, with b \in \mathbb{F}_pbFp. This is the case of BN254, BLS12-381 and secp256k1 elliptic curves used in Ethereum. There is an efficient endomorphism \phi: E \rightarrow Eϕ:EE defined by (x,y)\mapsto (\omega x,y)(x,y)(ωx,y) (and \mathcal{O} \mapsto \mathcal{O}OO) that acts on P \in E[r]PE[r] as \phi(P)=\lambda \cdot Pϕ(P)=λP. Both \omegaω and \lambdaλ are cube roots of unity in \mathbb{F}_pFp and \mathbb{F}_rFr respectively, i.e. \omega^2+\omega+1 \equiv 0 \pmod pω2+ω+10(modp) and \lambda^2+\lambda+1 \equiv 0 \pmod rλ2+λ+10(modr).

Example 2 : suppose that EE has Complex Multiplication (CM) with discrimant -D=-8D=8, meaning that the endomorphism ring is \mathbf{Z}[\sqrt{−2}]Z[2]. This is the case of the Bandersnatch elliptic curves specified in Ethereum Verkle trie. There is an efficient endomorphism \phi: E \rightarrow Eϕ:EE whose kernel is generated by a 2-torsion point. The map can be found by looking at 2-isogeneous curves and applying Vélu’s formulas. For Bandersnatch it is defined by (x,y)\mapsto (u^2\cdot \frac{x^2+wx+t}{x+w},u^3\cdot y\cdot \frac{x^2+2wx+v}{(x+w)^2})(x,y)(u2x2+wx+tx+w,u3yx2+2wx+v(x+w)2) for some constants u,v,w,tu,v,w,t (and \mathcal{O} \mapsto \mathcal{O}OO) that acts on P \in E[r]PE[r] as \phi(P)=\lambda \cdot Pϕ(P)=λP where \lambda^2+2 \equiv 0 \pmod rλ2+20(modr).

The GLV algorithm starts by decomposing ss as s = s_0 + \lambda s_1s=s0+λs1 and then replacing the scalar multiplication s \cdot PsP by s_0 \cdot P + s_1 \cdot \phi(P)s0P+s1ϕ(P). Because s_0s0 and s_1s1 are guaranteed to be \leq \sqrt{r}r (see Sec.4 of [GLV] and Sec.4 of [FourQ] for an optimization trick), we can halve the size of the for loop in the double-and-add algorithm. We can then scan simultaenously the bits of s_0s0 and s_1s1 and apply the Strauss-Shamir trick. This results in a significant speed up but only when an endomorphism is available. For example the left-to-right double-and-add would become:

INPUT: s and P ∈ E(Fp).OUTPUT: sP.1. Find s1 and s2 s.t. s = s1 + 𝜆 * s2 mod r1.1 let s1 = (s1_{t−1},..., s1_1, s1_0)1.2 and s2 = = (s2_{t−1},..., s2_1, s2_0)2. P1 ← P, P2 ← 𝜙(P) and Q ← ∞.3. For i from t−1 downto 0 do3.1 Q ← 2Q.3.2 If s1_i = 0 and s2_i = 0 then Q ← Q.3.3 If s1_i = 1 and s2_i = 0 then Q ← Q + P1.3.4 If s1_i = 0 and s2_i = 1 then Q ← Q + P2.3.5 If s1_i = 1 and s2_i = 1 then Q ← Q + P1 + P2.4. Return(Q).

Using the efficient endomorphism in-circuit is also possible (see [Halo, Sec. 6.2 and Appendix C] or [gnark implementation] for short Weierstrass curves and [arkworks] and [gnark] implementations for twisted Edwards). But one should be careful about some extra checks of the decomposition s = s_0 + \lambda s_1 \mod rs=s0+λs1modr (not the SNARK modulus). The integers s_0, s_1s0,s1 can possibly be negative in which case they will be reduced in-circuit modulo the SNARK field and not rr.

The fake GLV trick

Remember that we are proving that s\cdot P = QsP=Q and not computing it. We can “hint” the result QQ and check in-circuit that s\cdot P - Q = \mathcal{O}sPQ=O. Now, if we can find u,v \leq \sqrt{r}u,vr such that v\cdot s = u \pmod rvs=u(modr) then we can check instead that

(v\cdot s)\cdot P - v\cdot Q = \mathcal{O}(vs)PvQ=O

which is equivalent to

u\cdot P - v\cdot Q = \mathcal{O}uPvQ=O

The thing now is that uu and vv are “small” and we can, similarly to the GLV algorithm, halve the size of the double-and-add loop and apply the Strauss-Shamir trick.

Solution: running the half-GCD algorithm (i.e. running GCD half-way) is sufficient to find uu and vv. We can apply the exact same trick for finding the lattice basis as in the GLV paper (Sec. 4). For completeness we recall the algorithm hereafter.
We apply the extended Euclidean algorithm to find the greatest common divisor of rr and ss (This gcd is 1 since rr is prime.) The algorithm produces a sequence of equations

w_i \cdot r + v_i \cdot s = u_iwir+vis=ui

for i = 0, 1, 2, \dotsi=0,1,2, where w_0 = 1, v_0 = 0, u_0 = r, w_1 = 0, v_1 = 1, u_1 = sw0=1,v0=0,u0=r,w1=0,v1=1,u1=s, and u_i \geq 0ui0 for all ii. We stop at the index mm for which u_m \geq \sqrt{r}umr and take u = u_{m+1}u=um+1 and v = -v_{m+1}v=vm+1.
Note: By construction uu is guaranteed to be a positive integer but vv can be negative, in which case it would be reduced in-circuit modulo the SNARK modulus and not rr. To circumvent this we return in the hint uu, vv and a \texttt{b}=1b=1 if vv is negative and \texttt{b}=0b=0 otherwise. In-circuit we negate QQ instead when \texttt{b}=1b=1.

Implementation

A generic implementation in the gnark library is available at gnark.io (feat/fake-GLV branch). For Short Weierstrass (e.g. P256) look at the scalarMulFakeGLV method in the emulated package and for twisted Edwards (e.g. Bandersnatch/Jubjub) look at the scalarMulFakeGLV method in the native package.

Benchmark

The best algorithm to implement scalar multiplication in a non-native circuit (i.e. circuit field ≠ curve field) when an efficient endomorphism is not available is an adaptation of [Joye07] (implemented in gnark here).
Next we compare this scalar multiplication with our fake GLV in a PLONKish vanilla (i.e. no custom gates) circuit (scs) over the BN254 curve (Ethereum compatible). We also give benchmarks in R1CS.

P-256Old (Joye07)New (fake GLV)
[s]P[s]P738,031 scs
186,466 r1cs
385,412 scs
100,914 r1cs
ECDSA verification1,135,876 scs
293,814 r1cs
742,541 scs
195,266 r1cs

Note here that the old ECDSA verification uses Strauss-Shamir trick for computing [s]P+[t]Q[s]P+[t]Q while the new version is merely two fake GLV multiplications and an addition.

Comparison

p256wallet.org is an ERC-4337 smart contract wallet that leverages zk-SNARKs for WebAuthn and P-256 signature verification. It uses PSE/circom-ecdsa-p256 to generate the webAuthn proof, and underneath PSE/circom-ecdsa-p256 to generate the ECDSA proof on P-256 curve. The github README reports 1,972,905 R1CS. Compiling our circuit in R1CS results in 195,266 R1CS. This is more than a 10x reduction, which is not only due to the fake GLV algorithm but also to optimized non-native field arithmetic in gnark.

Other curves

Similar results are noticed for other curves in short Weirstrass, e.g. P-384 and STARK curve:

P-384Old (Joye07)New (fake GLV)
[s]P[s]P1,438,071 scs782,674 scs
ECDSA verification2,174,027 scs1,419,929 scs
STARK curveOld (Joye07)New (fake GLV)
[s]P[s]P727,033 scs380,210 scs
ECDSA verification1,137,459 scs732,131 scs

and also in twisted Edwards e.g. Jubjub vs. Bandersnatch:

JubjubOld (2-bit double-and-add)New (fake GLV)
[s]P[s]P5,863 scs
3,314 r1cs
4,549 scs
2,401 r1cs
BandersnatchOld (GLV)New (fake GLV)
[s]P[s]P4,781 scs
2,455 r1cs
4,712 scs
2,420 r1cs

Source
Disclaimer: The content above is only the author's opinion which does not represent any position of Followin, and is not intended as, and shall not be understood or construed as, investment advice from Followin.
Like
1
Add to Favorites
Comments