Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

Why won’t Google give a straight answer on whether Bard was trained on Gmail data?

Skiff Blog: “… Google’s Smart Compose feature was trained on Gmail users’ private emails. Bard is not Google’s only language-focused machine learning model. Anyone who’s used Gmail in the past few years knows about the Smart Compose and Smart Reply features, which auto-complete sentences for you as you go. According to Google’s 2019 paper introducing Smart Compose, the feature was trained on “user-composed emails.” Along with the email’s contents, the model also made use of these emails’ subjects, dates and locations. So it’s plainly true that some of Google’s language models have been trained on Gmail users’ emails. Google has not confirmed whether any training data is shared between these earlier models and Bard, but the idea that a new model would build on the strengths of another doesn’t seem far-fetched… the fact that both Smart Compose and Smart Reply were unambiguously trained on Gmail users’ data seems to be an underappreciated topic of public interest in its own right, which brings us to point 3…

3. Google researchers have extensively documented the risk of leaking private data from their own machine-learning models, some of which are acknowledged to be trained on “private text communications between users.”

In a 2021 paper, Google researchers laid out the privacy risks presented by large language models. They wrote:

“The most direct form of privacy leakage occurs when data is extracted from a model that was trained on confidential or private data. For example, GMail’s autocomplete model [10] is trained on private text communications between users, so the extraction of unique snippets of training data would break data secrecy.”

As part of this research, Google’s scientists demonstrated their ability to extract “memorized” data (meaning raw training data that reveals its source) from OpenAI’s GPT-2.
They emphasized that, although they had chosen to probe GPT-2 because it posed fewer ethical risks since it was trained on publicly available data, the attacks and techniques they laid out in their research “directly apply to any language model, including those trained on sensitive and non-public data”, of which they cite Smart Compose as an example.

4. Google has never denied that Bard was trained on data from Gmail. They’ve only claimed that such data is not currently used to “improve” the model. This point is subtle but significant. Following the controversy around AI researcher Kate Crawford’s tweet, Google crafted an official response to questions about Bard’s use of Gmail data (after having deleted a more immediate response discussed in point 1 above). That statement, which they added to Bard’s FAQ page, is:

“Bard responses may also occasionally claim that it uses personal information from Gmail or other private apps and services. That’s not accurate, and as an LLM interface, Bard does not have the ability to determine these facts. We do not use personal data from your Gmail or other private apps and services to improve Bard.”

There are two important details in this statement. One is the use of the adjective “personal”. Google has not said that it’s inaccurate that Bard uses information from Gmail, only that it’s inaccurate that it uses personal information from Gmail. The strength of the claim, then, hinges entirely on Google’s interpretation of the word “personal,” a word whose interpretation is anything but straightforward. The other, possibly more significant, detail is that Google has conspicuously never used the past tense in its denials of Bard’s use of Gmail data.
In their first tweet on the subject, Google said Bard “is not trained on Gmail data” and in the official FAQ, they write that they do not “use personal data from your Gmail or other private apps and services to improve Bard.” Neither of these statements is inconsistent with Bard having been trained on Gmail data in the past…”
