Web crawling and data indexing: what legal framework applies?

Automated data collection on the internet is an increasingly common legal issue, driven by the growth of data mining, big data and artificial intelligence.

Ownership and protection of website data

Data on a website belongs to its owner, the person who authorised its publication, or the users who uploaded it. These persons may hold copyright over their content (provided it is original) and the sui generis right of database producers (provided there has been a substantial investment). Copyright protects the expression of the content and its presentation, including the structure and logical organisation of a database where these are original. The sui generis right protects the contents of the database against extraction and reuse. Making data freely available does not reduce the level of protection.

Indexing and copyright

Data indexing by a crawler is generally regarded as a simple technical service that benefits the indexed site by driving traffic to it. French case law has confirmed that such indexing does not in itself constitute infringement (TGI Paris, Adenclassified, 1 February 2011). The Paris Court of Appeal (SAIF v. Google, 26 January 2011) noted that website operators can prevent indexing via their robots.txt file. No specific authorisation is therefore required for indexing.
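As an illustration of the robots.txt mechanism mentioned above, here is a minimal sketch of how a crawler might check a site's rules before indexing a page. It uses Python's standard urllib.robotparser module. The robots.txt content, the bot names (ExampleBot, OtherBot) and the example.com URLs are hypothetical and chosen only for the example. Respecting robots.txt is a technical courtesy and does not by itself settle the legal questions discussed in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a site operator might publish it.
# All crawlers are kept out of /private/; ExampleBot is excluded entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("OtherBot", "https://example.com/listings/1"),
        ("OtherBot", "https://example.com/private/account"),
        ("ExampleBot", "https://example.com/listings/1"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

In a real crawler the rules would be downloaded from the site's own /robots.txt rather than hard-coded, and the check would run before each request.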

Extraction for reuse: a stricter regime

Data collection for reuse is treated differently. Article L342-1 of the French Intellectual Property Code allows the database producer to prohibit the extraction or reuse of a qualitatively or quantitatively substantial part of the contents of a protected database. Article L342-2 extends this to the repeated and systematic extraction or reuse of insubstantial parts where it manifestly exceeds normal use of the database. Extraction or reuse of that kind without the producer's prior authorisation, which typically takes the form of a contract, constitutes infringement. Extraction that is neither substantial nor repeated may be permissible, but the substantiality threshold is assessed on a case-by-case basis.

Website terms and personal data

A website’s terms of use may expressly prohibit the extraction or indexing of its database. Breach of these terms may be sanctioned. Where the collected data constitutes personal data (from social networks, for example), the GDPR requires consent or another lawful basis for processing. For further detail, see the article on GDPR contract compliance. For an overview, see the intellectual property services page.

Conclusion

Web crawling is lawful in principle, but regulated. Simple indexing does not constitute infringement, but substantial extraction from databases and unauthorised collection of personal data are unlawful. If you have questions about the legality of your data collection practices, book a call.
