ChatGPT – Is it Stealing Your Content?

by Randall Craig
Filed in: Make It Happen Tipsheet, AI, Blog, Content, Data, Technology

By now, pretty much everyone knows about ChatGPT and the various other generative AI tools. On one hand they almost seem magical: ask a question, and they respond with brilliant and reasonable answers. Putting aside that sometimes the AI just makes things up, they are a game changer. But how did they get so smart?

ChatGPT is a “large language model”: the system ingests copious amounts of training data, draws connections within it, does its abracadabra analysis, and then strings together responses based on an algorithm that it develops itself. So the question isn’t really how an AI system becomes so smart, but rather where the training data comes from.

One of these data sets is Google’s C4, which contains the data from ~15 million websites. C4 is used by Google’s T5 and Facebook’s LLaMA. (While OpenAI does not share the data set used to train ChatGPT, it is unlikely to be that different from C4.) Here’s the rub: who gave Google permission to scrape the data from these 15 million websites and use it for this purpose? And when ChatGPT and its ilk use OUR data to construct their answers, where is the attribution?

In a groundbreaking article on this topic, the Washington Post analyzed the data within the Google C4 dataset. It determined how many “tokens” C4 had from each website, and ranked them.
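To make the idea of "tokens per website" concrete, here is a minimal sketch of that kind of count-and-rank. It is not the Washington Post's actual method: the function name, the sample pages, and the shortcut of treating every whitespace-separated word as one token are all assumptions for illustration (real tokenizers split text differently).

```python
from collections import Counter
from urllib.parse import urlparse


def rank_sites_by_tokens(documents):
    """Rank domains by total token count.

    `documents` is an iterable of (url, text) pairs. A "token" here is
    simply a whitespace-separated word, which is an assumption made for
    illustration only.
    """
    totals = Counter()
    for url, text in documents:
        domain = urlparse(url).netloc.lower()
        totals[domain] += len(text.split())
    # Highest token count first; ties are broken alphabetically by domain.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical sample pages, not real C4 data.
    sample = [
        ("https://example.com/a", "one two three"),
        ("https://example.org/b", "one two three four five"),
        ("https://example.com/c", "six seven eight nine"),
    ]
    for rank, (domain, count) in enumerate(rank_sites_by_tokens(sample), start=1):
        print(rank, domain, count)
```

The more text a site contributes, the higher it lands on the list.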

In my particular case, my website was ranked 84,424 (out of 15 million), and had over 230K tokens. In a certain sense, I guess I should be proud of this rank, but on the other hand, it is a ranking of “chumps” whose copyrighted content has been stolen, without their knowledge, without their permission, without payment, and without attribution when the data is used.

This Week’s Action Plan:

Search for your website in the Washington Post’s analysis of the C4 dataset. How did you do?

AI Insight: To prevent your content from being used going forward, it is possible to set up a robots.txt file that prevents your site from being indexed. Unfortunately, if you do this, your site will also be invisible on Google (and Bing). At least at this time, there are no easy answers.
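The trade-off can be seen in a small sketch using Python's standard-library robots.txt parser. The blanket rule and the example.com URL are assumptions for illustration, and "SomeAITrainingBot" is a made-up crawler name. Keep in mind that robots.txt is only a request: it works for crawlers that choose to honour it.

```python
from urllib.robotparser import RobotFileParser

# A blanket rule: every crawler is asked to stay away from every path.
BLANKET_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `robots_txt` lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/post"  # hypothetical page
    # "SomeAITrainingBot" is a hypothetical name; the other two are the
    # user-agent tokens used by Google's and Bing's crawlers.
    for agent in ("SomeAITrainingBot", "Googlebot", "bingbot"):
        print(agent, allowed(BLANKET_ROBOTS_TXT, agent, page))
```

The same rule that turns away a data-collection crawler also turns away the search engines, which is exactly the problem.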


Content Authenticity Statement: 100% original content: no AI was used in creating this content.
