Bytes

Google Confirms It Scrapes Public Data To Train AI Models

Updated privacy policy makes it clear that your information could be used to train its artificial intelligence models

Published

July 14, 2023

Google has revised its privacy policy to explicitly state that it gathers publicly available data from the internet to train its AI models and services such as Bard and its cloud-hosted products.

The updated statement under the research and development section now reads: “Google uses information to improve our services and to develop new products, features and technologies that benefit our users and the public. For example, we use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard and Cloud AI capabilities.”

As reported by The Register, the quoted text was not visible to non-US users, but the PDF version of Google’s policy explicitly states: “We may collect information that’s publicly available online or from other public sources to help train Google’s AI models and build products and features, like Google Translate, Bard and Cloud AI capabilities.”

The changes spell out the extent of Google’s AI training. Previously, the policy referred only to ‘language models’, with Google Translate as the example. The updated policy broadens the scope to ‘AI models’, covering Bard and other systems.

Speaking to The Register, a Google spokesperson said the policy update does not fundamentally change the way the company trains its AI models. “This latest update simply clarifies that newer services like Bard are also included. We incorporate privacy principles and safeguards into the development of our AI technologies, in line with our AI Principles.”

Ethics of AI data collection

Over the past few years, developers have collected training data for AI systems by extracting information from a wide range of sources, including the internet, photo albums, books, social networks, source code, music, articles and more.

This practice has sparked controversy because the materials involved are often protected by copyright, terms of use and licenses. It has also led to legal disputes, as some creators and celebrities are not thrilled to see their work being replicated by AI systems.

AI developers may contend that their work falls under fair use, arguing that the output generated by the models is a new form of work rather than a direct copy of the original training data. The issue is a subject of intense debate, which we discussed in our interview with Harry Holmwood, CEO of Magicave.

“We’ve got situations where some copying or influence is okay, so long as you don’t go too far in terms of the output being similar to something else,” said Harry. “But we don’t know if that’s a valid argument with AI or not yet and I think we need some legal clarity on this stuff.”

In the past few weeks, AI companies like Stability AI and OpenAI have been sued for scraping images and text from the internet without permission.

“I think you should be very careful as a developer, be cautious about using Midjourney or Stable Diffusion to create your in-game assets at the moment when we don’t know whether that could be infringing someone or not,” said Harry.

As people become more aware of the training processes behind AI models, some internet businesses have begun making changes, including charging developers for access to their data. For instance, platforms like Stack Overflow, Reddit, and Twitter have introduced new fees or updated rules for accessing their content through APIs.


Written By Isa Muhammad

Isa Muhammad is a writer and video game journalist covering many aspects of entertainment media including the film industry. He's steadily writing his way to the sharp end of journalism and enjoys staying informed. If he's not reading, playing video games or catching up on his favourite TV series, then he's probably writing about them.

BeyondGames.biz
