Stack Overflow, Reddit and More To Charge for Data
Stack Overflow is just the latest company to announce that it intends to charge developers to access its data for training AI systems like ChatGPT and Bard.
The bustling community for programmers, home to 20+ million users and approximately 50 million Q&As, will continue to supply some programmers with free data, but those developing AI systems for commercial use will need to start paying in June.
Reddit, Condé Nast, and Twitter are just a few other companies that have plans to charge for AI training data. “Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive,” Stack Overflow’s CEO Prashanth Chandrasekar stresses. “We’re very supportive of Reddit’s approach.”
Last week, Reddit’s CEO Steve Huffman told The New York Times that “crawling Reddit, generating value, and not returning any of that value to our users is something we have a problem with.”
OpenAI, Meta, and Google have been compiling data from the web, including from Q&A platforms like Stack Overflow and Reddit, for free until now.
Elon Musk isn’t happy with how Microsoft and its partner OpenAI use Twitter data to train their AI tools either. “Lawsuit time,” he tweeted. Furthermore, Musk increased the prices for Twitter data to $42,000/month for 50 million tweets, about three times the volume that had been available for free in the past.
Shutterstock, a stock photography site, made a licensing deal with OpenAI. Its largest competitor, Getty, on the other hand, is suing Stability AI over the alleged illegal use of 12 million of its photos.
Chandrasekar explained AI companies are violating Stack Overflow’s Creative Commons license, which demands anyone using data from their platform to “attribute each and every one of the community members whose questions and answers were used to train the model.”
Training large language models (LLMs) with data from expert discussions helps them become more conversational and educated about complex topics like programming. Microsoft is already charging $19/month for its code generator GitHub Copilot. Google also added coding capabilities to its chatbot Bard.
The companies that intend to charge AI developers for their data haven’t released pricing information so far. “We’re working on that as we speak,” Reddit spokesperson Tim Rathschmidt confirmed. Chandrasekar unveiled Stack Overflow’s pricing structure, which is a combination of Reddit’s strategy and their own customers’ suggestions.
Meanwhile, AI companies struggle to reduce the hefty costs associated with AI development. With the growing number of companies demanding compensation for training data, AI developers face even bigger challenges. One of them is finding ways to generate a large enough profit from these technologies to cover the development costs, especially when users can currently access them for free.