a16z: Checks and Balances Between Machine Learning and Zero-Knowledge Proofs

MarsBit
04-07
The author of this article, Elena Burger, is a deal partner at a16z crypto, focusing on games, NFTs, web3 media, and decentralized infrastructure. Before joining the team, she spent four years as an equity analyst at Gilder, Gagnon, Howe & Co. She holds a BA in History from Barnard College, Columbia University.

Over the past few years, zero-knowledge proofs on blockchains have served two primary purposes: (1) scaling computation-constrained networks by processing transactions off-chain and verifying the results on mainnet; and (2) enabling shielded transactions that protect user privacy, visible only to those who hold the decryption key. In the context of blockchains, it's clear why these properties are desirable: a decentralized network like Ethereum cannot increase throughput or block size without placing unsustainable demands on validator processing power, bandwidth, and latency (hence the need for validity rollups), and all transactions are visible to anyone (hence the demand for on-chain privacy solutions).

But zero-knowledge proofs are also useful for a third class of capability: efficiently verifying that any kind of computation ran correctly (not just computations in off-chain instantiations of the EVM). This has implications far beyond blockchains.

Advances in harnessing the power of zero-knowledge proofs to succinctly verify computation now make it possible to demand the same degree of trustlessness and verifiability that blockchains offer from every digital product in existence, and most critically from machine learning models. The high demand for blockchain compute has spurred zero-knowledge proof research, producing modern proof systems with smaller memory footprints and faster proving and verification times. As a result, it is now possible to verify certain small machine learning algorithms on-chain.

By now, most of us have probably experienced the potential of interacting with an extremely powerful machine learning product. A few days ago, I used GPT-4 to help me create an AI that consistently beat me at chess. It felt like a microcosm of the past few decades of progress in machine learning: it took IBM developers twelve years to build Deep Blue, a model running on a 32-node IBM RS/6000 SP computer capable of evaluating nearly 200 million chess moves per second, which defeated world champion Garry Kasparov in 1997. By contrast, it took me a few hours and very little code to create a program that could beat me.

Granted, I doubt the AI I created could beat Garry Kasparov at chess, but that's beside the point. The point is that anyone playing with GPT-4 has likely had a similar experience of gaining superpowers: with little effort, you can create something that approaches or exceeds your own abilities. We are all IBM researchers; we are all Garry Kasparov.

Obviously, this is exciting and a little daunting to think about. For anyone working in the crypto industry, the natural impulse (after marveling at what machine learning can do) is to consider potential vectors of centralization, and how those vectors can be decentralized into networks of systems that people can transparently audit and own. Current models are built by ingesting vast amounts of publicly available text and data, yet they are controlled and owned by only a handful of players. More specifically, the question is not "is artificial intelligence enormously valuable," but "how do we build these systems so that anyone who interacts with them can capture their economic benefits and, if they so choose, ensure that their data is used in a way that respects their right to privacy?"

Recently, there have been calls for a moratorium or slowdown on large-scale artificial-intelligence projects like ChatGPT. Halting progress is probably not the solution here. A better approach would be to push for open-source models, and, in cases where model providers want to keep their weights or data private, to protect them with fully auditable, privacy-preserving zero-knowledge proofs on-chain. The latter use case, involving private model weights and data, is not yet feasible on-chain today, but advances in zero-knowledge proof systems will make it possible in the future.

Verifiable and Ownable Machine Learning

A chess AI like the one I built with ChatGPT feels relatively benign at this point: a program with fairly uniform output, trained on data that doesn't infringe on valuable intellectual property or violate anyone's privacy. But what about when we want assurance that the model we're told is running behind an API is actually the one that ran? Or what if I wanted to feed attested data into an on-chain model, and be assured the data really came from a legitimate source? And what if I wanted to verify that the "person" submitting data is actually a human, and not a bot attempting to Sybil-attack my network? Zero-knowledge proofs, with their ability to succinctly represent and verify arbitrary programs, offer one way to do this.

It's worth noting that today, the primary use of zero-knowledge proofs in on-chain machine learning settings is to verify correct computation. In other words, zero-knowledge proofs, and more specifically SNARKs (Succinct Non-Interactive Arguments of Knowledge), are most useful in an ML context for their succinctness property. That's because zero-knowledge proofs protect the privacy of the prover (and of the data it processes) from a prying verifier; privacy-enhancing technologies like Fully Homomorphic Encryption (FHE), Functional Encryption, or Trusted Execution Environments (TEEs) are better suited to letting an untrusted verifier run computations over private input data (a deeper exploration of those technologies is beyond the scope of this piece).

Let's take a step back and understand, at a high level, the kinds of machine learning applications you can represent in zero knowledge. (For a deeper dive into ZK, see our articles on zero-knowledge proof algorithms and hardware improvements, Justin Thaler's work on SNARK performance, or our zero-knowledge canon.) Zero-knowledge proof systems typically represent programs as arithmetic circuits: using these circuits, a prover generates a proof from public and private inputs, and a verifier mathematically checks that the statement's output is correct, without learning anything about the private inputs.
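As a toy illustration of this idea (not any particular proof system), consider the statement "I know x such that x³ + x + 5 = 35." An arithmetic circuit flattens this into a sequence of additions and multiplications over a prime field, each producing an intermediate "wire" value. The sketch below, with illustrative names, only checks that a witness satisfies the constraints; a real SNARK adds commitments and polynomial checks on top so the verifier never sees x:

```python
# Toy flattening of "I know x such that x^3 + x + 5 = 35" into a chain of
# circuit-style constraints over a prime field. Illustrative only: a real
# SNARK proves satisfaction without revealing the witness x.
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617

def check_witness(x: int) -> bool:
    # Intermediate wires, computed gate by gate as a circuit would.
    sym1 = (x * x) % P        # gate 1: sym1 = x * x
    y    = (sym1 * x) % P     # gate 2: y = sym1 * x  (i.e., x^3)
    sym2 = (y + x) % P        # gate 3: sym2 = y + x
    out  = (sym2 + 5) % P     # gate 4: out = sym2 + 5
    return out == 35 % P      # public output must equal 35

print(check_witness(3))   # True: 27 + 3 + 5 = 35
print(check_witness(4))   # False
```

The prime here is the BN254 scalar field modulus commonly used in SNARK implementations, chosen just to show that all arithmetic happens modulo a large prime rather than over the integers.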

We are still very early in the practical verification of computation using on-chain zero-knowledge proofs, but algorithmic improvements are expanding what's feasible. Here are five ways zero-knowledge proofs can be applied to machine learning.

1. Model authenticity: You want assurance that the machine learning model some entity claims to have run is indeed the one that ran. One example is the case where a model is accessible behind an API, and the provider of that model maintains multiple versions, say, a cheaper, less accurate one and a more expensive, higher-performance one. Without proofs, you have no way of knowing whether the provider is serving you the cheaper model when you've actually paid for the more expensive one (for example, because the provider wants to save on server costs and boost its profit margin).

To do this, you would need separate proofs for each instantiation of a model. One practical way to accomplish this is through Dan Boneh, Wilson Nguyen, and Alex Ozdemir's functional commitment framework, a SNARK-based zero-knowledge commitment scheme that allows a model owner to commit to a model, whereupon users can input their data into the model and receive verification that the committed model was run. Some applications built on Risc Zero, a STARK-based general-purpose virtual machine, enable this as well. Other research by Daniel Kang, Tatsunori Hashimoto, Ion Stoica, and Yi Sun has demonstrated that verified inference can be run on the ImageNet dataset at 92% accuracy (comparable to the highest-performing non-ZK-verified ImageNet models).

But simply receiving proof that a committed model was run is not necessarily enough. A model may not accurately represent a given program, so one would want the committed model to be audited by a third party. Functional commitments allow the prover to establish that it ran the committed model, but they don't guarantee anything about the model that was committed. If we can make zero-knowledge proofs performant enough to prove training (see example #4 below), we may one day be able to get those guarantees as well.
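The commitment half of this pattern can be sketched very simply: the provider publishes a binding commitment to the model, and any response can later be checked against it. The hash commitment below is illustrative only; a real functional-commitment scheme would additionally prove, in zero knowledge, that the committed model produced the returned output.

```python
import hashlib
import json

# Sketch of the commitment half of model authenticity (names illustrative).
# The provider publishes commit(model); inference responses are then checked
# against that commitment. The ZK part -- proving the committed model
# actually produced a given output -- is omitted here.

def commit(weights: dict) -> str:
    # Canonical serialization so identical weights always hash identically.
    blob = json.dumps(weights, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

expensive_model = {"w": [0.92, -0.31], "b": 0.07}
cheap_model     = {"w": [0.90, -0.30], "b": 0.10}

published = commit(expensive_model)          # provider's public commitment

# A user can detect a silent swap to the cheaper model:
print(commit(expensive_model) == published)  # True
print(commit(cheap_model) == published)      # False
```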

2. Model integrity: You want assurance that the same machine learning algorithm is run on different users' data in the same way. This is useful in areas where you don't want arbitrary bias applied, like credit-scoring decisions and loan applications. Function commitments work here too: you would commit to a model and its parameters, and allow people to submit data. The output would verify that the model was run with the committed parameters against each user's data. Alternatively, the model and its parameters could be made public, and users could themselves prove that they applied the appropriate model and parameters to their own (attested) data. This could be especially useful in medicine, where certain patient information is legally required to remain private. In the future, this could enable medical diagnosis systems that learn and improve from live user data while keeping it fully private.
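The shape of this guarantee can be sketched as follows, with illustrative names: every user's result record is bound to a hash of the same published parameters, so anyone can check that no per-user variant was substituted. A real deployment would replace the hash check with a zero-knowledge proof over the user's private data.

```python
import hashlib
import json

# Sketch of model integrity: one committed set of parameters scores every
# applicant, and each result record carries a hash binding it to those
# parameters. Illustrative only; a ZK system would prove this relation
# without revealing user data.

PARAMS = {"w": [0.4, 0.6], "b": -0.2}
PARAMS_HASH = hashlib.sha256(
    json.dumps(PARAMS, sort_keys=True).encode()
).hexdigest()

def score(features):
    # A trivial linear model standing in for a credit-scoring algorithm.
    return sum(w * f for w, f in zip(PARAMS["w"], features)) + PARAMS["b"]

def score_with_record(features):
    return {"params_hash": PARAMS_HASH, "output": score(features)}

alice = score_with_record([1.0, 2.0])
bob   = score_with_record([0.5, 0.5])
print(alice["params_hash"] == bob["params_hash"])  # True: same model for both
```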

3. Attestations: You want to integrate attestations from external verified parties (for example, any digital platform or piece of hardware that can produce a digital signature) into a model, or into any other type of smart contract running on-chain. To do this, you would verify the signature using a zero-knowledge proof and use that proof as an input to the program. Anna Rose and Tarun Chitra recently hosted an episode of the Zero Knowledge podcast with Daniel Kang and Yi Sun exploring the latest advances in this area.

Specifically, Daniel and Yi recently published work on verifying that images captured by attested sensors were subjected only to certain transformations, like cropping, resizing, or limited redactions. This is useful in cases where you want to prove that an image isn't a deepfake but did undergo some legitimate form of editing. Dan Boneh and Trisha Datta have done similar work on using zero-knowledge proofs to verify an image's provenance.

More broadly, though, any digitally attested piece of information is a candidate for this form of verification: Jason Morton, who is working on the EZKL library (more on this in the next section), calls this "giving the blockchain eyes." Any signed endpoint (e.g., Cloudflare's SXG service, third-party notaries) produces digital signatures that can be verified, which is useful for proving provenance and authenticity from a trusted party.
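The gating pattern can be sketched with a symmetric MAC standing in for a hardware digital signature; this is purely illustrative, since real attestations use asymmetric signatures (e.g., Ed25519) and the verification would run inside the proof circuit so only the proof, not the raw signed data, reaches the chain. The key and field names below are assumptions:

```python
import hashlib
import hmac

# Sketch of attestation-gated input. A shared-secret MAC stands in for a
# sensor's digital signature; a real system verifies an asymmetric
# signature inside the ZK circuit instead.

SENSOR_KEY = b"demo-sensor-key"   # assumption: provisioned into trusted hardware

def attest(reading: bytes) -> bytes:
    return hmac.new(SENSOR_KEY, reading, hashlib.sha256).digest()

def accept_input(reading: bytes, tag: bytes) -> bool:
    # Only attested readings are allowed into the model / smart contract.
    return hmac.compare_digest(attest(reading), tag)

reading = b"temperature=21.5C"
tag = attest(reading)
print(accept_input(reading, tag))             # True: genuine attestation
print(accept_input(b"temperature=99C", tag))  # False: forged reading
```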

4. Decentralized inference or training: You want to perform machine learning inference or training in a decentralized way, and allow people to submit data to a public model. To do this, you might deploy an existing model on-chain, or architect an entirely new network, and use zero-knowledge proofs to compress the model. Jason Morton's EZKL library is creating a method for ingesting ONNX and JSON files and converting them into ZK-SNARK circuits. A recent demo at ETH Denver showed this could be used to build applications like an image-recognition-based on-chain treasure hunt, where the game's creator uploads a photo, a proof of the image is generated, and players upload their own images; the verifier checks whether a player's uploaded image sufficiently matches the proof generated by the creator. EZKL can now verify models of up to 100 million parameters, implying it could be used to verify ImageNet-sized models (which have 60 million parameters) on-chain.

Other teams, such as Modulus Labs, are benchmarking different proof systems for on-chain inference; Modulus' benchmarks ran models of up to 18 million parameters. On the training side, Gensyn is building a decentralized compute network where users can submit public data and have their models trained by a decentralized network of nodes, with the correctness of the training verified.
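The "sufficiently matches" check from the treasure-hunt demo boils down to comparing feature vectors against a threshold. The sketch below uses hand-written vectors standing in for an embedding model's output, and plain cosine similarity; it is an illustration of the matching logic only, since on-chain the comparison itself would be proven with a ZK-SNARK rather than recomputed:

```python
import math

# Sketch of a "sufficiently matches" check (illustrative). Vectors stand in
# for image embeddings; on-chain, a SNARK would attest to this comparison.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.95  # assumption: tuned per game

creator_embedding = [0.12, 0.80, 0.55, 0.10]
player_close      = [0.13, 0.79, 0.56, 0.09]   # near-identical photo
player_far        = [0.90, 0.05, 0.10, 0.40]   # unrelated photo

print(cosine_similarity(creator_embedding, player_close) > THRESHOLD)  # True
print(cosine_similarity(creator_embedding, player_far) > THRESHOLD)    # False
```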

5. Proof of personhood: You want to verify that someone is a unique individual without compromising their privacy. To do this, you would create a method of verification, for example, a biometric scan, or a way to cryptographically commit to a government ID. Then you would use a zero-knowledge proof to check that the person has been verified, without revealing any information about their identity, whether that identity is fully identifying or pseudonymous, like a public key.

Worldcoin does this through its proof-of-personhood protocol, a way of ensuring Sybil resistance by generating a unique iris code for each user. Crucially, the private keys created for a WorldID (and the other private keys for the crypto wallet created for Worldcoin users) are entirely separate from the iris code generated locally by the project's eye-scanning orb. This separation fully decouples biometric identifiers from any form of user key that could be attributed to a person. Worldcoin also allows applications to embed an SDK that lets users log in with their WorldID, using zero-knowledge proofs to preserve privacy: an application can check that a person has a WorldID without being able to track individual users (see this article for more detail).

This example uses the privacy-preserving properties of zero-knowledge proofs to combat weaker, more malicious forms of artificial intelligence (i.e., proving you are a real human, not a bot, without revealing any information about yourself), so it is quite different from the other examples listed above.
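The Sybil-resistance mechanism can be sketched with a uniqueness commitment, sometimes called a nullifier: only a hash derived from the biometric code is stored, so registering the same person twice is rejected without the registry ever learning the code. All names here are illustrative; a real system like WorldID proves set membership in zero knowledge, so that even the hash is not revealed at login time.

```python
import hashlib

# Sketch of Sybil resistance via uniqueness commitments (illustrative).
# Duplicate registrations by the same person are rejected, yet only a
# one-way hash of the biometric-derived code is ever stored.

registered: set[str] = set()

def register(iris_code: bytes) -> bool:
    nullifier = hashlib.sha256(iris_code).hexdigest()
    if nullifier in registered:
        return False              # duplicate person: Sybil attempt rejected
    registered.add(nullifier)
    return True

print(register(b"person-A-iris-code"))  # True: first registration
print(register(b"person-B-iris-code"))  # True: distinct person
print(register(b"person-A-iris-code"))  # False: duplicate rejected
```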

Model Architecture and Challenges

Breakthroughs in the proof systems underlying SNARKs (Succinct Non-Interactive Arguments of Knowledge) have been a key driver of putting many machine learning models on-chain. Some teams are building custom circuits in existing architectures (including Plonk, Plonky2, AIR, and others). On the custom-circuit side, Halo2 has become a popular backend, used in work such as Daniel Kang's projects and Jason Morton's EZKL. Halo2's prover times are quasilinear, proof sizes are typically just a few kilobytes, and verifier times are constant. Perhaps more importantly, Halo2 has strong developer tooling, making it a popular SNARK backend among developers. Other teams, like Risc Zero, are pursuing a general-purpose VM strategy. Still others are building custom frameworks using Justin Thaler's super-efficient proof systems based on the sum-check protocol.

Proof generation and verification times depend heavily on the hardware generating and checking the proofs, and on the size of the circuit being proved. But the key point to note here is that regardless of the program being represented, the proof itself is always relatively small, so the verifier's burden in checking it stays bounded. There are some subtleties, though: for proof systems like Plonky2 that use a FRI-based commitment scheme, proof size can grow with the complexity of the statement, unless the proof is wrapped in a pairing-based SNARK like Plonk or Groth16, whose proof sizes do not grow with statement complexity.

The implication for machine learning models is that once you've designed a proof system that accurately represents a model, actually verifying its outputs will be quite cheap. What developers need to think about most are proving time and memory: representing models in ways that can be proved relatively quickly, ideally with proofs on the order of a few kilobytes. To prove the correct execution of a machine learning model in zero knowledge, the model architecture (layers, nodes, and activation functions), parameters, constraints, and matrix-multiplication operations all need to be encoded and represented as circuits. This involves decomposing these properties into arithmetic operations performed over a finite field.
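The core of that decomposition is expressing linear algebra in field arithmetic. The sketch below shows a matrix multiply where every value is reduced modulo a prime, as it would be inside an arithmetic circuit; it is illustrative only, since real frameworks also encode layers, activations, and the constraint system itself:

```python
# Sketch of a circuit-friendly matrix multiply: all values live in a prime
# field, as inside an arithmetic circuit. Illustrative only; real frameworks
# also encode layers, activations, and constraints.
P = 2**61 - 1  # a Mersenne prime, standing in for a proof system's field

def matmul_mod_p(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) % P for j in range(m)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_mod_p(A, B))  # [[19, 22], [43, 50]]
```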

The field is still nascent, and accuracy and fidelity can suffer in the conversion of a model to a circuit. When a model is represented as an arithmetic circuit, the aforementioned model parameters, constraints, and matrix-multiplication operations may need to be approximated and simplified, and some precision may be lost when arithmetic operations are encoded as elements of the proof's finite field (without these optimizations, proofs would be prohibitively expensive to generate with current zero-knowledge frameworks). Moreover, the parameters and activations of machine learning models are typically encoded with 32 bits of precision, but today's zero-knowledge proofs cannot represent 32-bit floating-point operations in the necessary arithmetic-circuit format without enormous overhead. As a result, developers may opt for quantized machine learning models, whose 32-bit parameters have been converted to 8-bit precision. These models are far easier to represent as zero-knowledge circuits, but the model being verified may be only a rough approximation of the higher-quality original.

At this stage, it's admittedly a game of catch-up: as zero-knowledge proofs become more optimized, machine learning models grow more complex. But there are already a number of promising directions for optimization: proof recursion can reduce overall proof size by allowing a proof to be used as an input to the next proof, unlocking proof compression. There are also emerging frameworks, like Linear A's fork of Apache's Tensor Virtual Machine (TVM), which advances a transpiler for converting floating-point numbers into zero-knowledge-friendly integer representations. And finally, we at a16z crypto are optimistic that future work will make it far more reasonable to represent 32-bit integers inside SNARKs.

Two Definitions of "Size"

Zero-knowledge proofs scale through compression: SNARKs let you take an enormously complex system (a virtual machine, a machine learning model) and represent it mathematically so that the cost of verifying it is less than the cost of running it. Machine learning, on the other hand, scales through expansion: today's models get better with more data, more parameters, and more GPUs/TPUs involved in training and inference. Centralized companies can run servers at practically unbounded scale, charging monthly fees for API calls and covering their operating costs.

The economic realities of blockchain networks operate almost in reverse: developers are encouraged to optimize their code so that it is computationally feasible (and cheap) to run. This asymmetry has an enormous benefit: it creates an environment in which proof systems are pressured to become more efficient. We should be pushing for machine learning to offer the same benefits that blockchains provide, namely verifiable ownership and a shared notion of truth.

While blockchains incentivize the optimization of zk-SNARKs, every area of computing will benefit.

Acknowledgments: Justin Thaler, Dan Boneh, Guy Wuollet, Sam Ragsdale, Ali Yahya, Chris Dixon, Eddy Lazzarin, Tim Roughgarden, Robert Hackett, Tim Sullivan, Jason Morton, Peiyuan Liao, Tarun Chitra, Brian Retford, Daniel Kang, Yi Sun, Anna Rose, Modulus Labs, DC Builder.
