Can Privacy-Preserving Machine Learning Overcome Data-Sharing Worries?

2021-1-10 08:21:10

Privacy-preserving AI techniques could allow researchers to extract insights from sensitive data if cost and complexity barriers can be overcome. But as the concept of privacy-preserving artificial intelligence matures, data volumes and complexity keep growing too. This year, the size of the digital universe could hit 44 zettabytes, according to the World Economic Forum. That is 40 times more bytes than there are stars in the observable universe. And by 2025, IDC projects that number could nearly double.
More Data, More Privacy Problems
While the explosion in data volume, together with declining computation costs, has driven interest in artificial intelligence, a significant portion of data poses potential privacy and cybersecurity questions. Regulatory and cybersecurity issues concerning data abound. AI researchers are constrained by data quality and availability. Databases that would enable them, for instance, to shed light on common diseases or stamp out financial fraud — an estimated $5 trillion global problem — are difficult to obtain. Conversely, innocuous datasets like ImageNet have driven machine learning advances because they are freely available.
A traditional strategy to protect sensitive data is to anonymize it, stripping out confidential information. “Most of the privacy regulations have a clause that permits sufficiently anonymizing it instead of deleting data at request,” said Lisa Donchak, associate partner at McKinsey.
But the catch is, the explosion of data makes the task of re-identifying individuals in masked datasets progressively easier. The goal of protecting privacy is getting “harder and harder to solve because there are so many data snippets available,” said Zulfikar Ramzan, chief technology officer at RSA.
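To make the re-identification risk concrete, here is a minimal sketch in Python. It assumes a "masked" dataset that has had names stripped but still carries quasi-identifiers (ZIP code, birth year, sex), and a separate public dataset that lists the same fields next to names. All records, field names and helper functions below are invented for illustration; they do not come from the article or from any real dataset.

```python
from collections import Counter

# Hypothetical toy data: names were removed, but quasi-identifiers remain.
masked_records = [
    {"zip": "02139", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1985, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1990, "sex": "M", "diagnosis": "flu"},
    {"zip": "94105", "birth_year": 1972, "sex": "M", "diagnosis": "hypertension"},
]

# Hypothetical public data (for example, a directory) that includes names.
public_records = [
    {"name": "Sam Placeholder", "zip": "02139", "birth_year": 1990, "sex": "M"},
    {"name": "Pat Placeholder", "zip": "02139", "birth_year": 1985, "sex": "F"},
]

# Assumption: these three fields are the ones an attacker can link on.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def quasi_key(record):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def k_anonymity(records):
    """Size of the smallest group sharing the same quasi-identifiers.

    A value of 1 means at least one record is unique on those fields.
    """
    counts = Counter(quasi_key(r) for r in records)
    return min(counts.values())


def reidentify(masked, public):
    """Link public names to masked records that are unique on the quasi-identifiers."""
    counts = Counter(quasi_key(r) for r in masked)
    unique = {quasi_key(r): r for r in masked if counts[quasi_key(r)] == 1}
    matches = []
    for person in public:
        hit = unique.get(quasi_key(person))
        if hit is not None:
            matches.append((person["name"], hit["diagnosis"]))
    return matches


print(k_anonymity(masked_records))
print(reidentify(masked_records, public_records))
```

In this toy data, one person in the public list matches a masked record that is unique on the three fields, so the diagnosis can be tied back to a name. The second person falls into a group of two and stays ambiguous. A real linkage attack would also have to account for people missing from the public data, so a match here is a likely identification, not a proven one. Each additional public dataset adds more fields to link on, which is why the task gets easier as data grows.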
The Internet of Things (IoT) complicates the picture. Connected sensors, found in everything from surveillance cameras to industrial plants to fitness trackers, collect troves of sensitive data. With the appropriate privacy protections in place, such data could be a gold mine for AI research. But security and privacy concerns stand in the way.
Addressing such hurdles requires two things. First, a front-end framework of user controls and rights is needed to protect data coming into a database. “That includes specifying who has access to my data and for what purpose,” said Casimir Wierzynski, senior director of AI products at Intel. Second, the data itself needs sufficient protection, including encryption while it is at rest or in transit. The latter is arguably a thornier challenge.
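The first requirement, specifying who may access data and for what purpose, can be sketched as a small consent record checked before any read. This is a minimal illustration only: the `Grant` and `ConsentPolicy` classes, the grantee names and the purpose strings are all hypothetical, and the sketch does not describe any Intel product or existing framework. It also leaves out the second requirement, encryption at rest and in transit.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass(frozen=True)
class Grant:
    """One permission: a grantee may use the data for one stated purpose."""
    grantee: str
    purpose: str


@dataclass
class ConsentPolicy:
    """Hypothetical per-user policy listing every (grantee, purpose) pair allowed."""
    owner: str
    grants: Set[Grant] = field(default_factory=set)

    def allow(self, grantee: str, purpose: str) -> None:
        """Record that the owner permits this grantee for this purpose."""
        self.grants.add(Grant(grantee, purpose))

    def revoke(self, grantee: str, purpose: str) -> None:
        """Withdraw a permission; does nothing if it was never granted."""
        self.grants.discard(Grant(grantee, purpose))

    def permits(self, grantee: str, purpose: str) -> bool:
        """Deny by default: only an exact (grantee, purpose) match is allowed."""
        return Grant(grantee, purpose) in self.grants


policy = ConsentPolicy(owner="user-1")
policy.allow("research-lab", "disease-study")

print(policy.permits("research-lab", "disease-study"))
print(policy.permits("research-lab", "advertising"))
```

The design choice worth noting is that access is tied to a purpose as well as to an identity, so the same lab that may use the data for a disease study is refused when it asks for another use.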