When Mark Zuckerberg first built Facebook, he had a strong opinion about the users of his burgeoning social network, the people handing him troves of their data: they were “dumb fucks.” Yes, that’s a real quote. Almost hard to believe, isn’t it? Zuckerberg kept asking for people’s data, we freely gave it, and Facebook became a behemoth and, eventually, Meta. Soon his company held unfathomable amounts of data, all submitted by the people he’d called dumb fucks. We trained Facebook. The algorithms later implicated in social unrest, even genocide, around the world drew their addictive, inciting power from us. They took the data we gave them and made so much more of it, beyond our intentions. Beyond even their own.
Each of us, little pieces of a whole, building the data fortress that would trap us, none of us knowing the monster we’d create.
AI companies scrape the web. They scrounge for data, sucking up so much that they claim they can’t filter it for hateful content, copyrighted material, or personal data. Perhaps we built monsters out of social networks, but AI takes data without consent; it builds something without our willing or knowing input. As a result, AI companies can’t even keep their AI from spitting out other people’s personal data. Ingest the internet, and eventually someone will find a way to get back out whatever went in. Build something unfathomable, and it will behave unexpectedly at times.
That brings us to the story of a ChatGPT user who found that their account history included requests they never made. Many contained personal data. Immediately, the collective distrust of yet another data hog pointed to an accusatory explanation: OpenAI had leaked other users’ chat histories. But, according to the company, that’s not what happened. OpenAI says the user’s account was accessed from the United States and Sri Lanka nearly simultaneously, suggesting the account was hacked. The user doubts that, but it does seem like a reasonable explanation.
But that’s not the end of the story.
Insecure Chat
The user claims they used a nine-character password with upper- and lowercase letters as well as symbols, and that they didn’t reuse it. However, nine characters is quite short. The minimum I’d suggest is a randomized, unique password of 15 characters or longer. On top of that, I’d recommend using two-factor authentication wherever possible. If you can’t, make your password as long as possible; every added character makes it harder to crack. Still, the user may have done nothing wrong: the password could have been leaked or cracked some other way.
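To put rough numbers on that advice, here’s a quick Python sketch of my own (an illustration, not anything from the report) comparing the entropy of a nine-character random password against a fifteen-character one, and generating a long random password with the standard secrets module:

```python
import math
import secrets
import string

# Character pool matching the article's example: upper- and lowercase
# letters, digits, and symbols (94 characters in total).
POOL = string.ascii_letters + string.digits + string.punctuation

def entropy_bits(length: int, pool_size: int = len(POOL)) -> float:
    """Bits of entropy for a truly random password of the given length."""
    return length * math.log2(pool_size)

print(f"9 characters:  ~{entropy_bits(9):.0f} bits")   # ~59 bits
print(f"15 characters: ~{entropy_bits(15):.0f} bits")  # ~98 bits

def random_password(length: int = 20) -> str:
    """Generate a random password from a cryptographically secure source."""
    return "".join(secrets.choice(POOL) for _ in range(length))

print(random_password())
```

That’s roughly 59 bits versus 98 bits, and that assumes truly random characters; passwords people invent themselves usually have far less.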
OpenAI can collect a large amount of data about you, including your chat histories. But it’s more than that: someone with access to your history could hunt down where you used the output and learn even more about you. ChatGPT isn’t just a fun little tool; you should protect your account like you would any other. However, OpenAI doesn’t offer two-factor authentication for ChatGPT accounts. Since you can’t enable it, use the longest, most random password OpenAI will permit. That’s far from perfect, but OpenAI is not alone in keeping two-factor authentication from users.
Training the Data Miner
Many jumped to the conclusion that OpenAI was leaking other users’ chat histories without any hack. You can’t blame someone for assuming that, though. ChatGPT had already revealed private information when researchers asked it to repeat the same word over and over, showing that data used to train ChatGPT could come back out verbatim. The New York Times is suing for multiple reasons; that’s one of them. Is it so strange to think ChatGPT would use your chat histories to train itself further, then spit your data back out verbatim? AI companies will scrape the web for every bit of data they can find. Is it so strange to think they would also train on the data freely given to them in queries? If that data can leak out, and we know it can, how can we be sure any data given to any AI is safe?
Honestly, I’d be shocked if it hasn’t already happened and simply gone unnoticed because the original query contained no private information. ChatGPT can output text from its dataset verbatim. The only missing piece is knowing whether queries become part of the training set. OpenAI is a data hog that has even gone after copyrighted data and now hides the sources used to train ChatGPT. To think our queries couldn’t possibly be used that way would be gullible.
Secrets go against the basic function of AI. It can’t learn through understanding or genuine extrapolation; it needs massive amounts of data to repeat patterns. Its very nature is to devour data and output text predictions, from anywhere, from anyone. Secrets, hallucinations, facts: what’s the difference, to an AI? It would be hard to discern, and in some cases disadvantageous to do so. An AI doesn’t have to make a moral decision about a response if it lacks the capability to tell the difference. It’s all just data and patterns, fed through a text prediction algorithm.
If a program has no issue revealing other people’s secrets, why would you trust it with your own? If you plan on using generative or chat-based AI, scrub your prompts of anything you wouldn’t want anyone else seeing. You never know how an AI will use your data for its gain. No one does.
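If you want a starting point for that kind of scrubbing, here’s a rough Python sketch of my own. The patterns are hypothetical and nowhere near exhaustive; the point is to redact the obvious stuff before a prompt ever leaves your machine:

```python
import re

# Illustrative patterns only: personal data takes many more forms
# than an email address or a phone number.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(prompt: str) -> str:
    """Replace obvious emails and phone numbers before sending a prompt anywhere."""
    prompt = EMAIL.sub("[EMAIL REDACTED]", prompt)
    prompt = PHONE.sub("[PHONE REDACTED]", prompt)
    return prompt

print(scrub("Ask jane.doe@example.com or call +1 (555) 010-2345 about the contract."))
# Ask [EMAIL REDACTED] or call [PHONE REDACTED] about the contract.
```

Two regexes won’t catch names, addresses, or account numbers, so treat it as a nudge to review what you paste in, not a guarantee.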
Sources:
- Dan Goodin, Ars Technica
- Laura Raphael, Esquire
- Aamir Siddiqui, Android Authority