AI, Data Protection and Blockchain
A Promising Distributed AI Platform That Addresses Data Issues
Author: Dr. Feng Xiao, Founder of PlatON
It’s an honor to have the opportunity to share my ideas at this AI conference. I have been researching blockchain for the past five years, investing in it and promoting its applications. So today I would like to talk about blockchain. And of course, since this is an AI conference, I am going to talk about blockchain in a way that is closely connected to artificial intelligence.
Blockchain technology can help tackle some of the most challenging data problems that AI faces right now. In fact, these problems mostly stem from a rising awareness of data ownership, valuation and privacy, issues that are closely related to AI’s development or even derive from it.
The third wave of artificial intelligence in 2016 amazed us with how much value could be created from data. We can’t help but wonder: where is my personal data stored? Is my data effectively managed or protected? Can I get a share if my data is used to create value?
With so much being questioned and discussed, I have identified four major data issues related to privacy, valuation and sharing:
First, data ownership. I think most people in the room have left some data behind on the internet, and that raises several questions: i) How do we identify the personal data we have left behind on the internet or other platforms? ii) Who owns all that data? Is it us? The internet platforms? Or do we and the platforms share ownership of our personal data? In the case of a medical platform, do we retain ownership of data such as our genetic information and medical records?
Second, data privacy. No one wants his personal data to be publicly circulated or even sold on the internet. That’s why we need to talk about data protection.
And third, collaborative data computing. Data is worth nothing unless it is used, so computing on data becomes more and more important. But there is no single platform on which multivariate and multi-dimensional data can be computed and “produced” to meet the needs of AI algorithms. Each platform, whether it’s an e-commerce or a social platform, has its own database. Try to imagine how AI could benefit from unifying these massive databases!
In the internet era, however, no one is willing to hand over their data, because handing it over is “once and for all”: once shared, it cannot be taken back. There seems to be no guarantee that the data will not be leaked or circulated. Even a recipient with good intentions cannot ensure effective protection, because it is technically impossible. Therefore, it remains a challenge for companies to win enough trust for data to be shared, and to collaboratively compute on this multivariate and multi-dimensional data so that AI training yields greater benefits.
Fourth, data value distribution. Can I get a share if my data is used to create massive commercial value? The answer is no, not in the context of the internet. “I can enjoy some services for free,” you might say. Some platforms do offer free services in exchange for their clients’ data, which recognizes the value of data in an indirect way. But what about a more straightforward way of distributing data value, one that serves as an incentive to encourage data sharing? This incentive mechanism is the basis of collaborative data computing, since it is under such incentives that we are willing to share our private data, which scientists or business organizations might use to train robots, optimize algorithms and draw conclusions.
How to use data is currently a heated topic among artificial intelligence researchers and practitioners. In fact, AI’s development gave rise to the problems listed above, but AI itself is not designed to tackle them. Without proper solutions they will become AI’s Achilles’ heel, particularly given the lack of an incentive mechanism for data sharing.
Take the example of collecting ten thousand cases of a particular disease. A scientist would find this difficult if he simply went to a hospital. But blockchain technology makes possible a distributed AI platform that combines smart contracts, privacy-protecting algorithms and an incentive mechanism based on digital currency. On such a platform the scientist can obtain disease cases from ten thousand strangers, because payment for their sharing is guaranteed by the smart contracts and their privacy is protected by the algorithms.
The landscape and topics of AI will change dramatically if such a distributed AI platform is realized. Today AI researchers and practitioners discuss AI on the assumption that the technologies and methodologies used to build AI systems have a centralized model at their core. The entire cycle of an AI project mostly starts from large centralized datasets. But this is only one way of collecting data, and it can be complemented by a decentralized approach that reaches data a centralized solution fails to provide.
Many cryptologists have been working on data security and have achieved good results, which partly contributed to a decentralized data collecting approach.
Hash functions can verify whether data has been tampered with. A hash function converts an input of arbitrary size, be it a paragraph or a book, into an output of fixed size. If a single punctuation mark in a book is changed, the hash value of that book changes too. Hashing thus secures data by providing certainty about whether a file has been altered. Instead of writing a guarantee letter or signing a contract, we only need to run the data through a hashing algorithm and compare the resulting hash to a given hash value.
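The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, assuming SHA-256 as the hash function (the talk does not name one); the helper names and the sample sentences are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(data: bytes, expected_hash: str) -> bool:
    """Recompute the hash and compare it to the published value."""
    return fingerprint(data) == expected_hash


# Hypothetical sample texts; they differ by a single punctuation mark.
original = b"It was a dark and stormy night, and the rain fell in torrents."
altered = b"It was a dark and stormy night; and the rain fell in torrents."

published_hash = fingerprint(original)
print(is_unaltered(original, published_hash))  # True
print(is_unaltered(altered, published_hash))   # False
```

Whatever the size of the input, the digest is always 64 hex characters, which is why a whole book can be checked against one short published value.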
Asymmetric cryptography, also known as public-key cryptography, helps ensure data security, integrity and anonymity, and confirms data ownership because the private key is known only to its owner. In a blockchain, a private key is the sole proof of ownership: with it, the key holder can unlock the account and claim the assets and data inside it.
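To make the idea of "only the private key holder can prove ownership" concrete, here is a toy sketch of textbook RSA signing using only the Python standard library. The primes are tiny and there is no padding, so this is an illustration under loud assumptions and must not be used for real security; production blockchains typically use elliptic-curve signatures, and nothing here is claimed to be PlatON's scheme. All names are hypothetical.

```python
import hashlib

# Toy key material (assumption for illustration): far too small to be secure.
P, Q = 61, 53
N = P * Q                    # public modulus
PHI = (P - 1) * (Q - 1)
E = 17                       # public exponent, shared with everyone
D = pow(E, -1, PHI)          # private exponent, known only to the owner (Python 3.8+)


def digest(message: bytes) -> int:
    """Hash the message and reduce it into the range of the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes, private_exponent: int) -> int:
    """Only the holder of the private exponent can produce this value."""
    return pow(digest(message), private_exponent, N)


def verify(message: bytes, signature: int, public_exponent: int) -> bool:
    """Anyone holding the public key (N, E) can check the signature."""
    return pow(signature, public_exponent, N) == digest(message)


claim = b"this record belongs to account 42"  # hypothetical ownership claim
signature = sign(claim, D)
print(verify(claim, signature, E))  # True
```

The asymmetry is the point: verification needs only public values, while producing a valid signature requires the private exponent.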
A zero-knowledge proof lets a prover convince a verifier that a statement about hidden data is true without revealing anything about the data itself.
Homomorphic encryption converts data into ciphertexts that can be analyzed and computed on, producing an encrypted result that only the key holder can decrypt. Privacy concerns are reduced because the analytics service provider operates on encrypted data rather than on the original plaintext.
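As a rough illustration of computing on ciphertexts, the sketch below implements a toy version of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The choice of Paillier is an assumption made for this example (the talk does not name a scheme), and the primes are far too small for real use.

```python
import math
import secrets

# Toy primes (assumption for illustration); real keys use very large primes.
P, Q = 61, 53
N = P * Q                      # public key
N_SQUARED = N * N
G = N + 1                      # standard simple choice of generator
LAMBDA = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private: lcm(P-1, Q-1)
MU = pow(LAMBDA, -1, N)        # private: modular inverse (Python 3.8+)


def encrypt(message: int) -> int:
    """Encrypt an integer 0 <= message < N using only public values."""
    while True:
        r = secrets.randbelow(N - 1) + 1   # random r in [1, N-1]
        if math.gcd(r, N) == 1:
            break
    return (pow(G, message, N_SQUARED) * pow(r, N, N_SQUARED)) % N_SQUARED


def decrypt(ciphertext: int) -> int:
    """Recover the plaintext; requires the private values LAMBDA and MU."""
    x = pow(ciphertext, LAMBDA, N_SQUARED)
    return ((x - 1) // N * MU) % N


def add_encrypted(c1: int, c2: int) -> int:
    """Add two plaintexts by multiplying their ciphertexts (no key needed)."""
    return (c1 * c2) % N_SQUARED


# The service provider only ever sees the two ciphertexts.
encrypted_total = add_encrypted(encrypt(120), encrypt(37))
print(decrypt(encrypted_total))  # 157
```

The party calling `add_encrypted` never sees 120, 37 or 157; only the key holder can decrypt the total.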
Like homomorphic encryption, multi-party computation (MPC) allows computation on values that remain hidden. MPC methods enable parties to jointly compute a function on their inputs while keeping those inputs private. For example, in the case of collecting ten thousand disease cases mentioned above, patients can take part in the computation on the data they provide or have the work done by other parties. Nothing is revealed about the parties’ inputs, and the result of the computation is shared with each participant.
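One standard building block of MPC is additive secret sharing, sketched below for a joint sum. This is a simplified, single-process simulation chosen for illustration; it is not presented as PlatON's protocol, and the modulus, function names and sample inputs are assumptions.

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo this (assumed) prime


def share(value: int, n_parties: int) -> list:
    """Split `value` into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_inputs: list) -> int:
    """Simulate n parties computing the sum of their inputs.

    Party i splits its input and sends share j to party j, so no party
    ever sees another party's raw input, only one random-looking share.
    """
    n = len(private_inputs)
    all_shares = [share(v, n) for v in private_inputs]
    # Each party j adds up the shares it received and publishes the subtotal.
    subtotals = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                 for j in range(n)]
    # Combining the published subtotals reveals only the final result.
    return sum(subtotals) % MODULUS


# Hypothetical private values held by three patients.
print(secure_sum([3, 5, 7]))  # 15
```

Any single share or subtotal is uniformly random on its own; only the combination of all subtotals reveals the sum, which is then known to every participant.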
Although cryptographic algorithms have a long history, it is only in the past two years that they have been applied to the burgeoning work of data security, and then to collaborative computation.
But cryptographic algorithms alone cannot help AI make the most of data. In addition to data security, the issues of data ownership, authenticity, capitalization and collection must be addressed in order to realize the full potential of data for AI’s development. Other technologies are needed to determine data ownership. As for data authenticity, hash functions help identify whether data has been altered, but they cannot by themselves make data irreversible, immutable or traceable.
The determination of data value depends on data capitalization, without which data valuation, trading and payback are infeasible. And valuing data as an asset, the basis for trading it, is in turn premised on determining who owns it.
The last challenge is data collection. An incentive mechanism is crucial for encouraging multiple entities to share their data, especially for decentralized computing, or what we might call distributed and collaborative computing on a peer-to-peer network. Cryptographic algorithms alone cannot provide an incentive mechanism that identifies who contributed which data and rewards the data providers.
So what can be used to complement cryptographic algorithms and tackle the data issues above? Some might think the internet is a candidate, but it is not.
First of all, it’s hard to trust internet technologies to ensure data privacy and security. A couple of days ago, personal data from hotels operated by a famous hotel group were reportedly leaked, affecting tens of millions of customers. According to today’s report, the “data thief” was caught in time before he could sell the data. Again, we see the weakness of the internet in data protection.
Secondly, personal information is collected and used to make profits by agencies, who have little concern for data privacy and ownership.
And there is a clash between the expectations of two parties. We expect to exclusively own the data we post and leave behind on the internet and to get a share of the profits from it, while internet platforms and companies would like to claim ownership of that data.
Internet technologies also face the challenge of managing data. Previously, an engineer from an influential logistics company was reported to have accidentally deleted one of its databases. The company was paralyzed for up to 590 minutes before the system was restored.
Therefore, it is clear that the internet can’t help cryptographic algorithms to tackle the listed data problems.
Now let’s step back and reconsider the earlier question: how can we solve the data problems? The answer lies in blockchain. Blockchain is often touted as “the next internet”, but I think that, despite a few similarities, it differs significantly from the internet, mainly in the following areas:
When the internet first appeared, the U.S. media described it as an “information net”. The internet makes it much easier to acquire information, since it reduces the cost of producing, exchanging and disseminating information to the point that the marginal cost is zero, and it allows faster communication. Blockchain, on the other hand, is regarded as a “facts machine”, because it helps guarantee the validity of data by recording it not on a single main register but on a connected, distributed system of registers, to which data can only be added and in which it is infeasible to revoke or tamper with existing records. A facts machine is clearly more beneficial to AI’s development.
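The "data can only be added" property rests on each record carrying the hash of the record before it. The sketch below shows that idea with a single in-memory register; it deliberately omits the distribution across many registers and the consensus step, and the class and field names are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index: int, data: str, prev_hash: str) -> str:
    """Hash a block's contents together with its predecessor's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "prev_hash": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Ledger:
    """A minimal append-only register (single copy, no consensus)."""

    def __init__(self):
        self.blocks = []

    def append(self, data: str) -> None:
        index = len(self.blocks)
        prev_hash = self.blocks[-1]["hash"] if self.blocks else GENESIS_HASH
        self.blocks.append({
            "index": index,
            "data": data,
            "prev_hash": prev_hash,
            "hash": block_hash(index, data, prev_hash),
        })

    def is_valid(self) -> bool:
        """Recompute every hash and check each link to the previous block."""
        prev_hash = GENESIS_HASH
        for i, block in enumerate(self.blocks):
            if block["index"] != i or block["prev_hash"] != prev_hash:
                return False
            if block["hash"] != block_hash(i, block["data"], prev_hash):
                return False
            prev_hash = block["hash"]
        return True


ledger = Ledger()
ledger.append("record A")
ledger.append("record B")
print(ledger.is_valid())               # True
ledger.blocks[0]["data"] = "record X"  # tamper with history
print(ledger.is_valid())               # False
```

Because every participant in a real network can run the same check on their own copy, a quietly altered record is detected rather than trusted.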
The internet adopts a centralized trust mechanism, which means you have to trust platforms or companies and believe that they will protect your data from exploitation. But the reality is that this is not going to happen. Blockchain, however, represents a decentralized trust mechanism, which minimizes the amount of trust required from any single actor, whether it is another user, an organization or an institution. Behind blockchain is a set of algorithms that will not peek at or exploit your data. A distributed trust mechanism enabled by the consensus algorithm of a blockchain seems far better than a centralized one.
Unlike the internet, blockchain has developed an incentive-compatible mechanism, under which data owners, algorithm providers, computing service providers and AI companies seeking massive computing resources each achieve their best outcome simply by acting according to their true preferences.
An app runs on the internet, whereas a decentralized app (dapp) runs on a blockchain. So what’s the major difference between apps and dapps? Let me put it this way. An author can receive 10% of a book’s retail price as royalties from a traditional publisher. If he releases the book through an internet platform such as China Literature, a leading online literature platform, he is expected to give 25% of the profits to the platform and keep a 75% share. But if the author publishes his work on a dapp, he gets 100% of the profits because there are no intermediaries. Dapps are based on a decentralized business model, which we call distributed business.
Companies seek to collect data and profit from their databases, so we can’t expect them to exchange data. This adds to the difficulty of collecting massive amounts of data and hinders AI development. But on a blockchain, a distributed ledger that is completely open to anyone, data sharing is feasible because the ledger records all transactions and is copied to all participants.
Digital currencies are used on a blockchain network for data trading and serve as incentives to provide data, algorithms and analytic services. In the context of the internet, WeChat or Alipay might handle the payment. But both of them belong to a separate system and can’t ensure that data providers get the promised profits. On a blockchain network, however, a smart contract, which is a set of coded conditions, is designed to enforce the performance of a contract. A buyer launches a smart contract in which the terms of payment in digital currency are specified. The data provided is then measured and valued by a computer program that all parties have agreed on. Once the terms of the contract have been fulfilled, a payment process is triggered and each party gets his portion of the benefits according to the calculation result. In this context, digital currencies are “programmable money” rather than fixed numbers.
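The payment flow described above can be sketched as ordinary Python. This is a plain in-memory simulation, not code for any real smart-contract platform; the class name, the "number of records" valuation rule and the proportional split are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class DataContract:
    """Toy model of a buyer's contract: escrowed budget plus a payout rule."""
    budget: int                 # tokens locked up by the buyer
    required_records: int       # condition that triggers payment
    contributions: dict = field(default_factory=dict)
    settled: bool = False

    def submit(self, provider: str, n_records: int) -> None:
        """Record a provider's contribution (valued here by record count)."""
        if self.settled:
            raise RuntimeError("contract already settled")
        if n_records <= 0:
            raise ValueError("n_records must be positive")
        self.contributions[provider] = (
            self.contributions.get(provider, 0) + n_records
        )

    def settle(self) -> dict:
        """Pay out proportionally once the agreed condition is met.

        Returns an empty dict if the condition is not yet fulfilled.
        Integer division is used, so any remainder stays with the buyer.
        """
        total = sum(self.contributions.values())
        if self.settled or total < self.required_records:
            return {}
        self.settled = True
        return {p: self.budget * n // total
                for p, n in self.contributions.items()}


contract = DataContract(budget=1000, required_records=10)
contract.submit("alice", 6)
print(contract.settle())   # {} because only 6 of 10 records are in
contract.submit("bob", 4)
print(contract.settle())   # {'alice': 600, 'bob': 400}
```

The payout is computed by the agreed program rather than promised by a counterparty, which is the sense in which the money is "programmable".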
In short, comparing the internet with blockchain makes it clear that blockchain and cryptographic algorithms will be a potent combination for solving data issues. First of all, a blockchain-based database guarantees data authenticity. Unlike the internet, an information net, blockchain is a facts machine that prevents double spending, which means that data cannot be copied at no cost without permission.
Blockchain is an internet of value, where what is exchanged is actual value rather than information. Information is exchanged when we send emails, which can be copied to other receivers. But if we could send value such as bitcoins the way we send emails, say by copying the same bitcoin to ten thousand people, the whole world would be in a mess. Blockchain technology addresses the issue of double spending by implementing a confirmation mechanism. If a user decides to send a bitcoin from his blockchain wallet, the system makes sure that the bitcoin is transferred from his account to the designated receiver and nowhere else. It cannot be copied to a thousand people, let alone ten thousand. The elimination of double spending is the basis for data capitalization, which is infeasible on the internet, where information is copied and disseminated without cost or permission.
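The difference between copying an email and transferring a coin comes down to a balance check before every transfer. The sketch below uses a simple account-balance model held in one process; Bitcoin itself tracks unspent outputs and relies on network-wide confirmation, so treat this only as an illustration with hypothetical names.

```python
class Wallets:
    """Toy account ledger that refuses to spend the same value twice."""

    def __init__(self, balances: dict):
        self.balances = dict(balances)

    def transfer(self, sender: str, receiver: str, amount: int) -> bool:
        """Move `amount` from sender to receiver; reject if funds are missing."""
        if amount <= 0 or self.balances.get(sender, 0) < amount:
            return False  # rejected: nothing left to spend
        self.balances[sender] -= amount
        self.balances[receiver] = self.balances.get(receiver, 0) + amount
        return True


wallets = Wallets({"alice": 1})
print(wallets.transfer("alice", "bob", 1))    # True: the coin moves to bob
print(wallets.transfer("alice", "carol", 1))  # False: alice has already spent it
print(sum(wallets.balances.values()))         # 1: total supply is unchanged
```

Value is moved, never duplicated: after the first transfer the sender no longer holds the coin, so the second attempt fails.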
Blockchain is also a peer-to-peer network that allows each participant to own his data and take part in trading, which makes it resistant to data monopoly.
A decentralized trust mechanism ensures data security, as we mentioned above.
A new incentive mechanism for data collaboration, enabled by programmable currencies, is a perfect match for the advances in cryptography.
Finally, I would like to briefly summarize some of the trends in the fields of blockchain and cryptographic algorithms.
More and more cryptologists in universities are joining blockchain startups. I met a couple of cryptologists from Stanford, MIT, Maryland and Berkeley. They have all joined the industry and work on data protection and collaborative computation.
From our conversations, I noticed a huge shift in the focus of cryptography: about half of the papers and proposals received by the organizers of the next International Cryptology Conference and the Central European Conference on Cryptology deal with multi-party computation. Data privacy has become the hottest and most important topic in the field. The PlatON project that I started also works on combining cryptographic algorithms and blockchain technologies to solve the problems of MPC. We have achieved two-party computation (2PC) and expect to realize three-party computation in 2019 and, ultimately, general MPC.
The combination of blockchain and cryptographic algorithms will strongly promote the development of AI by helping tackle some of the data issues that AI faces right now and by meeting its particular need for data. Data protection, an incentive mechanism for sharing data, determination of data value and proper data management will be ensured, while the results of data computation will be shared among participants. In three to five years, a decentralized and distributed AI platform will appear that no longer relies on data provided by centralized organizations. A scientist will be able to launch a smart contract calling on data providers, algorithm owners and analytic service providers to jointly conduct scientific research. I truly believe that such a distributed platform will come into being in three to five years.