The opt-out myths and realities

  • Writer: Marie-Avril Roux Steinkühler
  • Jan 6
  • 2 min read


The 2019 Directive on copyright and related rights in the digital single market (Directive (EU) 2019/790) allows companies to carry out text and data mining without having to obtain specific licences. The trade-off is an opt-out option for rights holders, which allows them to prevent their works from being used to train artificial intelligence.


However, the directive does not say at any point what form this opt-out mechanism should take. It confers a right without specifying the practical conditions for exercising it.

Let's be pragmatic: this mechanism is virtually ineffective.


In practical terms, robots.txt files are the best defence against scrapers, crawlers, and similar tools. As the extension suggests, these are small text files that tell visiting robots whether they may explore or index a website. They are very easy to access and edit. As an illustration, the two lines below are meant to "prohibit" OpenAI's crawler from browsing your site:


User-agent: GPTBot
Disallow: /


However, a robots.txt file simply expresses the creator's refusal to have their content used as training data. It is not a technical measure that closes off your site, merely a piece of information.

Therefore, a company operating an LLM (Large Language Model) that wishes to ignore this lack of consent can override the opt-out and scrape all of the site's content anyway.
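The purely voluntary nature of robots.txt is easy to demonstrate. The sketch below uses Python's standard-library robots.txt parser: the file only "blocks" a crawler if that crawler chooses to consult it first. The user-agent names and URL are illustrative assumptions, not tokens from any official registry.

```python
# A robots.txt rule only binds clients that choose to check it.
from urllib.robotparser import RobotFileParser

# The two-line opt-out from the article (GPTBot stands in for an AI crawler).
rules = ["User-agent: GPTBot", "Disallow: /"]

parser = RobotFileParser()
parser.parse(rules)

# A compliant crawler asks before fetching and is refused:
print(parser.can_fetch("GPTBot", "https://example.com/article"))   # False

# Any bot not named in the file is allowed by default:
print(parser.can_fetch("OtherBot", "https://example.com/article"))  # True

# And a non-compliant scraper simply never calls can_fetch() at all.
```

Nothing in this check is enforced by the web server: the refusal exists only inside the client that bothers to ask.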

This problem is exacerbated by illegal copies. Let's take a concrete example: a major book publisher implements robots.txt files across its various portals and believes itself to be protected. Those sites will indeed clearly express their lack of consent to text mining. But what about Joe Bloggs, who illegally acquires a protected work and then distributes it on his personal blog without permission? An AI crawler will access the content there and integrate it into its training data, even though this is, in principle, prohibited.

Today, there is no satisfactory way to implement the opt-out right conferred by the European directive. This is all the more alarming given how unsupervised learning, which underpins almost all current AI, works. To put it as simply as possible, the collected data goes into a black box, where it becomes almost inseparable from data that may have been acquired with the rights holders' permission. So even when it is possible to identify precisely which data has been used, it is technically impossible to remove it without disrupting the functioning of the AI in question.


It is nevertheless advisable to opt out wherever possible: some LLM operators do respect this right, and the opt-out provides both evidence and a legal basis for possible legal action.


Waiting for a return to opt-in?


Credits: Kehn Hermano


