As the global race to lead in advanced artificial intelligence (AI) technology intensifies, companies such as OpenAI, Google, and Meta are grappling with a significant challenge: acquiring sufficient data to fuel their AI systems’ growth while remaining ethical and compliant with copyright laws and company policies. To stay competitive and overcome data scarcity issues, these giants have reportedly cut corners, ignored corporate guidelines, and even debated testing the legal waters.
OpenAI found itself in a dire situation late in 2021. After exhausting every reputable English-language text source on the internet while developing its latest AI system, the laboratory needed more data to train the next iteration. In response, OpenAI researchers created a speech recognition tool called Whisper. Whisper could transcribe audio from YouTube videos, yielding new conversational text on which to train the model. Although the tool yielded significant benefits, concerns arose that the practice might violate YouTube’s rules, which restrict the use of its videos for applications that aren’t directly tied to the platform.
Google reportedly took a similar approach, using transcripts of YouTube videos to train its AI models, potentially infringing on the copyrights of creators. That both companies turned to the same source demonstrates the competitive pressure driving them to seek unconventional ways to acquire the vast amounts of data required for AI development.
To catch up with OpenAI, Meta’s business development leaders, engineers, and lawyers met nearly daily in early 2023. During these confidential discussions, they debated paying high fees for full licensing rights to new books or even buying publishing companies like Simon & Schuster. While these strategies could have eased Meta’s data scarcity problem, the alternative of acquiring copyrighted material without permission or compensation for creators raised ethical concerns.
These tactics not only threatened the companies’ reputations but also exposed them to legal risk. For instance, The New York Times filed a lawsuit against OpenAI and Microsoft in December 2023, alleging that they used copyrighted news articles without permission to train AI chatbots. The lawsuit highlighted the larger issue of intellectual property ownership and ethical considerations in AI data collection.
As the pursuit of AI dominance pushes companies to explore questionable tactics, there is an urgent need for clear guidelines on ethical and legal AI data acquisition. Regulators and policymakers must find a balance between innovation and ethics while ensuring that advancements in AI contribute positively to society. To achieve this, tech companies should focus on collaborating with various industries, including publishing, to obtain data access that benefits both parties.
In conclusion, the race to lead the AI industry has pushed some companies toward questionable methods of acquiring the enormous amounts of data required for AI growth, prompting ethical and legal dilemmas. By working collaboratively with industry partners and setting transparent and ethical guidelines for AI data collection, companies can stay competitive while respecting intellectual property rights and maintaining the trust of all stakeholders, resulting in a sustainable and beneficial future for AI technology.