#6: The Generative AI Balancing Act: Navigating Data Privacy and IP Concerns
Gen AI brings big opportunities but also poses real risks around data privacy and IP rights. Zahed and I discuss how companies can get creator buy-in and safeguard their data.
Generative AI is taking off: OpenAI's API and Python libraries have lowered the barrier, letting developers integrate generative AI capabilities into their apps with just a few lines of code. But should you jump on board? Concerns around data privacy, copyright, and attribution present real challenges. Zahed and I break it down.
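To give a sense of just how low that barrier is, here is a minimal sketch of calling OpenAI's chat API with the openai Python package (the pre-1.0 interface); the API key and prompt are placeholders:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; load from a secret store in practice

# One round-trip to the model: a single user message in, a completion out
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Draft a two-sentence product description for a smart mug."}],
)
print(response["choices"][0]["message"]["content"])
```

That's the whole integration. Which is exactly why the privacy and IP questions below matter: it's easy to start sending data before thinking them through.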
Privacy, Data Confidentiality, IP Rights
First up: privacy. Remember that ChatGPT snafu where some users briefly saw others’ chat history titles? Not a great look.
Understandably, companies worry about keeping data confidential. Additionally, they don't want their info used by AI companies to train models without permission.
OpenAI says they get it. Their new policy: they don’t use API customers’ data for model training unless you opt in. They retain what you submit via the API for 30 days to monitor for abuse. After that, it's deleted unless they're legally required to keep it longer. But some still feel uneasy. Where's the data stored, and who has access? On OpenAI's own systems, with access limited to authorized employees and contractors investigating abuse.
Next up: current copyright law clashes with AI. Artists are angry that models are trained on their work without consent, and they argue that AI-generated content derived from their art infringes their rights. AI proponents counter that incorporating copyrighted content into a larger training dataset may fall under fair use if the generative work is transformative and does not compete with the original.
Legal battles have already emerged. Remember those lawsuits against Midjourney and Stable Diffusion over unauthorized use? Software like GitHub Copilot also allegedly suggests copyrighted code snippets. Copilot has even been known to suggest GPL-licensed code; the GPL requires that any derivative code be licensed under the same terms as the original.
On the other side of the debate, AI creators want copyright protection for their AI-generated work too. In response, the Copyright Office released a statement in March 2023 asserting that only the portions of AI-generated material involving significant human input, alteration, and creative control are protected. Take the AI-assisted comic book "Zarya of the Dawn": the human-written story got copyright protection thanks to “human creative input,” but the AI-generated art, which lacked it, did not.
[Image generated using Stable Diffusion]
AI Dilemma: Should you jump in?
No doubt, it's a legal quagmire. But avoiding AI means falling behind the competition.
So what's the answer? First up, the privacy and security concerns. One approach: use OpenAI's models through a cloud deployment you control. Microsoft Azure OpenAI Service, for example, is hosted in Azure and does not interact with external services operated by OpenAI. Microsoft lists several security guarantees on its website, among them that user prompts and completions, embeddings, and training data will not be available to other customers or to OpenAI, and will not be used to improve OpenAI models or any Microsoft or third-party products or services.
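In practice, switching to Azure OpenAI mostly means pointing the same client at a different endpoint. Here's a minimal sketch with the openai Python package (again, the pre-1.0 interface); the resource endpoint, API key, and deployment name are placeholders you'd replace with your own:

```python
import openai

# Assumptions: an Azure OpenAI resource and a chat model deployment already exist.
openai.api_type = "azure"
openai.api_base = "https://YOUR-RESOURCE-NAME.openai.azure.com/"  # your Azure endpoint
openai.api_version = "2023-05-15"
openai.api_key = "YOUR_AZURE_API_KEY"

response = openai.ChatCompletion.create(
    engine="my-gpt-35-deployment",  # the Azure deployment name, not the model name
    messages=[{"role": "user", "content": "Summarize our data retention policy."}],
)
print(response["choices"][0]["message"]["content"])
```

The application code barely changes; the difference is that requests stay inside your Azure environment rather than going to OpenAI's systems.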
But hold up: do you even need to connect all your data to AI? The usual reason is customization. If you want an LLM to answer questions from employees or customers, you might think you need to retrain it on your company’s data. Makes sense, but companies are rushing to fine-tune too fast. There are alternatives.

First, try prompt engineering. LLMs like ChatGPT are already trained on tons of information and can be good at answering domain-specific questions if you “ask nicely,” i.e., formulate your prompt in a way that helps the LLM better understand your needs.

Another option: Retrieval Augmented Generation (RAG), which retrieves snippets of information relevant to the user’s question and appends them to the prompt before submitting it to the LLM (see the sketch below). You're not handing your entire dataset to the LLM for fine-tuning, just sending targeted info to answer specific questions. There are other tradeoffs, for sure, which I will address in a later post.
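Here's what that RAG pattern can look like in miniature. This is a sketch, not production code: the documents are made up, and retrieval is a brute-force cosine similarity over OpenAI embeddings (in practice you'd use a vector database):

```python
import numpy as np
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Hypothetical internal documents; in practice these come from your
# knowledge base, wiki, support tickets, etc.
docs = [
    "Our refund window is 30 days from the date of purchase.",
    "Support is available Monday through Friday, 9am to 5pm ET.",
]

def embed(text):
    # Embed text with OpenAI's embedding endpoint (pre-1.0 client interface)
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

doc_vectors = [embed(d) for d in docs]

def answer(question):
    q = embed(question)
    # Retrieve the snippet most similar to the question (cosine similarity)
    scores = [q @ v / (np.linalg.norm(q) * np.linalg.norm(v)) for v in doc_vectors]
    context = docs[int(np.argmax(scores))]
    # Append only the retrieved snippet, not the whole corpus, to the prompt
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]

print(answer("How long do I have to return a product?"))
```

The key point: only the retrieved snippet travels to the LLM, nothing is fine-tuned, and the rest of your data never leaves your systems.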
Regarding IP and copyright concerns, consider the three C's:
Consent: Partner with creators. Get their buy-in to train models.
Credit: Accurately attribute contributions. Easier said than done: with today’s technology there is no reliable way to determine which specific training examples contributed most to a given piece of AI-generated art. But it's a problem ML researchers will need to solve as AI spreads. For now, transparency about the training data is crucial.
Compensation: Cut creators in on the upside. Grimes is leading here - users can mimic her voice but she gets a royalty split. Expect more deals like this.
Bottom line: AI concerns are real but sitting it out means losing ground. Dive in and keep these solutions in mind.
(Prachi, one of our readers, asked me to cover the legal and security concerns that LLMs pose. This article is in response to that request.)