PlatON CIO: The New Age of Big Data
New Big Data Era = Traditional Big Data + Privacy Big Data
As early as 1980, the futurologist Alvin Toffler praised big data in his book “The Third Wave” as “the cadenza of the third wave”. Since about 2009, “big data” has been a popular term in the Internet and information technology industry. According to monitoring statistics, the total amount of data in the world in 2017 was 21.6 ZB (1 ZB is 10^21 bytes, or about one trillion gigabytes), and global data is currently growing at about 40% per year. Global data is expected to reach 40 ZB by 2020. In 2017, the total output value of China’s big data industry reached 370 billion RMB.
By traditional big data we generally mean the data held by Internet giants such as BAT (Baidu, Alibaba and Tencent), which collect large amounts of user-related and business-related data, and by traditional industrial enterprises such as chemical companies, biopharmaceutical companies and hospitals, which have worked deep in their vertical industries and accumulated colossal amounts of domain-specific data.
As the volume of data grew, we paid a great deal of attention to the technology for collecting, storing and processing big data. Hardware developed alongside the processing technology, from CPUs and GPUs to TPUs and other heterogeneous computing hardware. The unit cost of data processing is falling rapidly. At the same time, the technical architecture of data processing has matured quickly and is widely applied across industries.
The computation of traditional big data is relatively closed: companies use existing computing frameworks to run closed operations on limited data. Although this raises no privacy issues, it suffers from rising data collection and processing costs and from a shortage of data. In autonomous driving, for example, no company can collect all the corner cases. Traffic laws and regulations differ from country to country, and so do road conditions. In Europe and the United States, road markings are relatively clear and pedestrians are more law-abiding. China has a large number of second- and third-tier cities where road conditions are relatively poor, sometimes with no visible markings at all, and the mix of traffic on the roads is also different. Autonomous driving systems developed by European and American manufacturers are therefore very difficult to apply on Chinese roads, and their safety is hard to guarantee. Moreover, under legal and regulatory requirements such traffic data is sensitive, so Chinese data is difficult to share directly with foreign manufacturers. Mr. Sun Lilin once suggested that the biggest challenge we all face in the all-digital world is like “the blind men feeling an elephant”: no organization can grasp comprehensive data. Users need multi-dimensional data from multiple organizations, while organizations are reluctant to disclose too much data to users. The difficulty of commercializing sensitive data has also deepened the data shortage that many artificial intelligence companies face.
As more data is collected, the cost of storing and processing it keeps rising, yet the value it brings has not grown significantly with its volume. At the same time, much of the data contains private information, which makes monetizing it on the market risky. As governments around the world strengthen privacy protection, bringing such sensitive data to market will become even more difficult. In particular, Europe recently introduced its most stringent personal data protection regulation, the GDPR. The GDPR imposes very high penalties for corporate violations: a fine of up to 10 million euros or 2% of full-year revenue, whichever is higher. For serious violations, the fine is up to 20 million euros or 4% of full-year revenue, whichever is higher. This means that the cost of violating privacy will rise sharply for enterprises, and a fine could even threaten their survival.
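To make the two fine tiers concrete, the following is a minimal Python sketch of the "fixed amount or percentage of revenue, whichever is higher" rule described above. The function name `gdpr_fine_cap` and the `serious` flag are hypothetical names chosen for illustration, and the result is only the upper bound of a fine, not the amount a regulator would actually impose.

```python
def gdpr_fine_cap(annual_revenue_eur: int, serious: bool = False) -> int:
    """Upper bound of the fine, in euros, for the two tiers described in the text.

    Hypothetical helper for illustration only: the fixed amounts and
    percentages are the ones quoted in the article.
    """
    if serious:
        fixed_eur, percent = 20_000_000, 4
    else:
        fixed_eur, percent = 10_000_000, 2
    # "Whichever is higher": compare the fixed amount with the revenue share.
    revenue_share = annual_revenue_eur * percent // 100
    return max(fixed_eur, revenue_share)


if __name__ == "__main__":
    # A company with 100 million euros of revenue: the fixed amount dominates.
    print(gdpr_fine_cap(100_000_000))
    # A company with 2 billion euros of revenue: the percentage dominates.
    print(gdpr_fine_cap(2_000_000_000, serious=True))
```

For a small company the fixed amount sets the ceiling, while for a large one the percentage of revenue does, which is why the rule threatens enterprises of every size.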
Traditional big data processing technology has inherent flaws when it comes to privacy. Many big data frameworks and technologies were never designed to protect data privacy, nor to make sensitive data available to third parties. Traditional processing works essentially on plaintext. If the data contains sensitive information, it is restricted to internal use and is hard to offer to third parties for additional revenue. As business models become more open and change faster, more and more participants are involved in big data processing, including data producers, data transmission channel providers, data storage providers and cloud computing providers, and the risk of privacy violations has increased dramatically. New processing techniques for sensitive information are becoming a fundamental and effective way to protect data privacy.
Leaks of sensitive and valuable data cause significant losses for enterprises. Many companies do not dare to store sensitive core data on external cloud storage and instead keep it on private clouds inside the enterprise. Examples include the massive compound databases of pharmaceutical companies and the genetic databases of genetic testing companies.
How to monetize a company’s valuable sensitive data on the market at low cost, with low risk and in compliance with regulations, and so generate enough value to offset the huge expense of collecting and storing it, is a problem that every traditional big data company must face. It is also an important direction for the future development of big data.
We call the big data of the past, for which the enterprise was the boundary of computation and value, traditional big data. It is characterized by enterprise-centric collection, storage and computation. Data containing private information is especially difficult to monetize, and doing so requires complicated scrutiny. The usual route is to sign an NDA with complex exemption clauses, carry out complex data pre-processing, and then go through very complicated approval procedures and technical processing before anything can be bought or sold. The cost of monetization is very high, the risk of data loss is uncontrollable, and under the new regulations it may not even be feasible.
The new big data includes privacy big data, which combines with traditional big data to form a complete, closed business loop. In this new framework, all of society’s data can be computed on securely and collaboratively to extract its full value. Enterprises and individuals holding both regular data and sensitive private data will take part in a unified, borderless computing framework that effectively solves the data boundary and privacy protection problems and achieves regulatory compliance.
The new big data era is characterized by lower data usage costs, better privacy protection and good compliance. In particular, for artificial intelligence applications that demand massive amounts of data, it solves the problem that enterprises struggle to obtain high-value sensitive data whose use is restricted by laws and regulations.
The methodology of big data processing will also change fundamentally, from bounded trusted computing to privacy-preserving computing without boundaries, and the data will extend from a limited data set to an unbounded data pool containing the full amount of data. Computation likewise shifts from traditional closed processing to secure multi-party collaborative computing. The new big data era places new requirements on the software, hardware and frameworks that were popular in the traditional big data era. Mr. Sun Lilin recently proposed the “3 rights separation of data” to meet this challenge: a data executor is added between the data owner and the data user, and secure multi-party computation ensures that the data user can make full use of the data without the data or any private information being leaked.
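The separation of owner, executor and user roles can be illustrated with a very small secure-sum sketch. The article does not specify how this is implemented. The Python below is only one possible illustration, assuming additive secret sharing over a public prime modulus, at least two executors that follow the protocol and do not collude, and non-negative inputs whose total stays below the modulus. All function names and the modulus are hypothetical choices made for this example.

```python
import secrets
from typing import List

# Public prime modulus, an assumption of this sketch. Inputs must be
# non-negative and their total must stay below it.
MODULUS = 2**61 - 1


def split_into_shares(value: int, n_executors: int) -> List[int]:
    """Data owner: split a private value into additive shares, one per executor."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_executors - 1)]
    # The last share is chosen so that all shares add up to the value.
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def executor_partial_sum(received_shares: List[int]) -> int:
    """Data executor: add the shares received, one from each owner."""
    return sum(received_shares) % MODULUS


def user_reconstruct(partial_sums: List[int]) -> int:
    """Data user: combine the executors' partial sums into the final total."""
    return sum(partial_sums) % MODULUS


def secure_sum(owner_values: List[int], n_executors: int = 3) -> int:
    """Run the whole flow in one process to show who sees what."""
    all_shares = [split_into_shares(v, n_executors) for v in owner_values]
    # Executor j only ever sees the j-th share from each owner.
    partials = [
        executor_partial_sum([shares[j] for shares in all_shares])
        for j in range(n_executors)
    ]
    return user_reconstruct(partials)


if __name__ == "__main__":
    print(secure_sum([120, 80, 45]))  # the user learns only the total
```

In this sketch no single executor sees an owner's value, only a random-looking share of it, and the data user learns the aggregate result rather than the individual inputs.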
At present, cryptographic methods such as MPC (Secure Multi-Party Computation), HE (Homomorphic Encryption), VC (Verifiable Computation) and SS (Secret Sharing) are being quickly adopted by enterprises and applied to big data processing, especially the processing of sensitive, privacy-related big data. As the name suggests, MPC is a method of computing among multiple parties. An MPC protocol uses cryptography to encrypt and transform each party’s data to protect data privacy, and transforms the algorithm itself into garbled circuits to protect the privacy of the algorithm. Homomorphic encryption encrypts data in advance and then allows correct computation directly on the encrypted data. Most traditional encryption algorithms do not have this property. Verifiable computation allows a computation to be outsourced to a third party and its correctness to be verified quickly. Secret sharing securely splits and recovers secrets such as private data and keys. The combined application of these technologies has greatly accelerated the arrival of the new big data era. It is becoming the engine of that era, effectively raising the ability of enterprises to monetize sensitive data, reducing the risk of violations and improving corporate efficiency.
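The idea of computing correctly on encrypted data can be shown with a toy example. The sketch below uses the Paillier cryptosystem, which supports addition on ciphertexts. Paillier is chosen here purely for illustration and is not named in the article, and the tiny primes make the keys completely insecure. The function names are hypothetical, and Python 3.8 or later is assumed for the modular inverse via `pow`.

```python
import math
import secrets


def keygen(p: int = 293, q: int = 433):
    """Build a toy Paillier key pair from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)   # public key n, private key (lam, mu)


def encrypt(n: int, message: int) -> int:
    """Encrypt 0 <= message < n under the public key n."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n: int, private_key, ciphertext: int) -> int:
    """Recover the plaintext with the private key."""
    lam, mu = private_key
    u = pow(ciphertext, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)


if __name__ == "__main__":
    n, private_key = keygen()
    total = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
    print(decrypt(n, private_key, total))  # sum computed without decrypting inputs
```

The party that calls `add_encrypted` needs only the public key, so in principle a third party could aggregate encrypted values without ever seeing them in plaintext.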
For example, developing an original innovative drug in the pharmaceutical industry now requires a basic investment of more than 1 billion US dollars and several years of research and development. Some drugs, such as penicillin and aspirin, even took decades. In recent years, artificial intelligence with deep learning at its core has developed rapidly and has been applied successfully in the medical field, for example in medical imaging and structural screening. Medical image data is the most sensitive and valuable asset a hospital has. Without corresponding privacy protection measures, these assets are very difficult to commercialize. We can apply privacy-preserving trustless computing technology to the training of artificial intelligence systems on medical images and so make commercialization possible.
Today’s world-renowned drug development companies have large libraries of hit compounds and lead compounds, usually containing millions to tens of millions of compounds. From identifying the target to screening hit compounds, screening lead compounds and selecting candidate compounds, the process requires extensive structural simulations and experiments. Here, privacy-preserving trustless computing technology can be applied to the private databases of drug research and development companies to run joint structural simulations, which improves the probability of discovering effective structures, reduces screening risk and greatly shortens the early-stage screening of hit and lead compounds. This lowers the cost and time of discovery as well as the total investment across the whole drug development life cycle.
Privacy computing can be used not only in medicine and pharmaceuticals but also in a wider range of industries, such as automotive, finance, insurance and the Internet of Things. Monetizing privacy-related private big data will open up a vast market for big data computing, possibly as large as the existing big data computing market, and will be a powerful complement to it.
The overall size of the new big data market will reach trillions per year, and privacy computing technology will extend into various vertical industries. In recent years, companies such as Google and Facebook have successively experienced data breaches, Europe has implemented the GDPR, and other governments are actively promoting privacy protection legislation. The privacy-preserving computing market is therefore expected to grow at an accelerating pace. Under increasingly strict privacy protection regulations, the risk of running a business on traditional big data technology will keep rising, and adapting quickly to the technological revolution and using new, advanced technologies to conduct business will become a critical competitive advantage for enterprises.
About the Author:
Honggang Tang is the CIO of PlatON. He was formerly an investment director and system architect at Alibaba, an architect in Baidu’s Mobile Systems Department, a system architect at Motorola, and a department manager at Tsinghua Tongfang. At Alibaba, he invested in artificial intelligence unicorns such as SenseTime and Cambricon. He also holds 20 granted Chinese and US utility patents.