Another “leak” has recently made the news. According to the Cyber News portal, hackers published a database of over 1.3 million Clubhouse records. Is so-called data scraping a new trend among cybercriminals who want to earn money by selling data?

The story follows the recent listing of a database with over 500 million LinkedIn records, and the half a billion Facebook profiles that aroused huge controversy even earlier. Now Clubhouse joins the group of “scraped” portals.

What is Clubhouse?

It’s one of the fastest-growing social media channels. Clubhouse is an audio-chat-based social networking app. Users can listen in to conversations, interviews, and discussions between interesting people on a variety of subjects – it’s similar to listening to a podcast, except it’s live.
Clubhouse is only open to those who have been invited. You can’t just download it off the app store and create an account. You have to be invited to join by an existing member. Real-world elitism, but online. The application itself works only on iOS.

Increasingly, however, the conversation is less about the app’s innovation or its growing user base and more about doubts over its security.

Database with Clubhouse users, for free

Hacker forums are a goldmine of interesting content. On one of them, a newly registered user posted a link to an SQL database containing 1.3 million records, and it was spotted quite quickly.

The shared data consists primarily of account names, user IDs, photo URLs, Twitter handles, Instagram handles, number of followers, number of accounts followed, account creation time, and the profile of the inviting user.

We can immediately see that this is not a leak of sensitive data. All of this information is freely available and, as in the LinkedIn case, has simply been pulled directly from the API and aggregated. Even so, it is interesting data that various parties could put to use.

With the number of active accounts in the application now estimated at approx. 10 million, a dump of 1.3 million records is quite serious. The data can also include more sensitive information. Cyber News, which published the report, said that it could be used for phishing attacks and identity theft.

Clubhouse: This is not a leak

Clubhouse CEO Paul Davison said that the Cyber News report was false, describing it as “clickbait”. He also said that the platform has not recorded any data breach, and that all the information collected in the SQL database is available to anyone through the Clubhouse API.

APIs have their own rules, as the recent Supreme Court judgment in the Google and Oracle case reminded us. Still, the fact that any user could scrape and aggregate such an enormous amount of data may raise some concerns. Usually, an API is either publicly open or guarded by some kind of key-based authorization, e.g. OAuth. Perhaps this particular API was not protected by rate limiting either, which is a common security pattern. APIs are also usually covered by terms of use or a EULA, which can forbid scraping.

What is web scraping and how does it work?

Web scraping, also known as web data extraction, refers to the automated collection of structured data from the web. Price tracking, news monitoring, lead generation, and market research are only a few of its uses. People and companies who want to take advantage of the huge amount of publicly available data use web data extraction.

Web scraping uses intelligent automation to retrieve hundreds, millions, or even billions of data points from the internet’s almost limitless expanse.
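To make the idea concrete, here is a minimal sketch of the extraction step using only the Python standard library. The HTML fragment, the class names ("product", "name", "price"), and the function name are invented for illustration; a real scraper would first download the page and would have to cope with much messier markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # field whose text we are currently inside
        self.rows = []             # one dict per <li class="product">

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})
        elif tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self.current_field and self.rows:
            self.rows[-1][self.current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


def scrape_products(html):
    """Turn unstructured markup into a list of structured records."""
    parser = ProductParser()
    parser.feed(html)
    return parser.rows
```

Calling scrape_products(SAMPLE_HTML) yields one record per product, which is the "organized data" that scraping tools then aggregate at scale.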

What are the recent trends in web scraping?

  • Cloud-based scraping platforms
  • Applying machine learning to automatically parse data displayed by web pages and extract it into table format with no user intervention, just at the click of a button

So, is data scraping legal?

Web scraping itself is not illegal. But there are some restrictions that hold businesses back from using web data scrapers. Let’s go through them:

1. Terms and conditions – Many websites expressly forbid web scraping in their terms and conditions.

2. Copyright – Scraping and reusing copyrighted content may lead to a claim for copyright infringement.

3. Database rights – These rights are infringed when the whole or a part of a database is extracted without the owner’s consent.

4. Trademarks – Reproducing a website owner’s trademarks without their consent could lead to a claim for trademark infringement or passing off.

5. Data protection – Scraping for information on individuals (in some cases considered as “personal data”), without their knowledge, could infringe data protection laws.

6. Criminal damage – It’s an offense to cause criminal damage to a computer (including damage to data) or to use a computer to gain access to data without proper authorization.

Laws are specific to each country and locality, so legality depends entirely on the jurisdiction. In general, scraping or collecting publicly accessible information is not illegal in itself; if it were, Google could not operate as a company, since its crawler collects data from websites all over the world.

How do you protect against web scraping?

Basically, hindering scraping means making it difficult for scripts and machines to get data from your website without making it difficult for real users and search engines.
Regrettably, this is hard, and you will need to compromise between preventing scraping and degrading accessibility for real users and search engines.

How to prevent scraping

You can use some general methods to detect and deter scrapers.

Monitor your logs and traffic patterns; limit access if you notice unusual activity:
Check your logs regularly, and in case of many similar actions from the same IP address, block or limit access.
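As a rough sketch of that log check, the snippet below counts requests per IP address and flags the noisy ones. It assumes access-log lines in which the client IP is the first space-separated field (as in the common Apache/nginx formats); the function names and the threshold are illustrative, not recommendations.

```python
from collections import Counter


def requests_per_ip(log_lines):
    """Count requests per client IP, assuming the IP is the first field of each line."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


def suspicious_ips(log_lines, threshold=100):
    """Return the IPs that made more than `threshold` requests, sorted alphabetically."""
    counts = requests_per_ip(log_lines)
    return sorted(ip for ip, n in counts.items() if n > threshold)
```

In practice you would run this over a fixed time slice of the log (say, the last minute or hour) so that the threshold means "requests per period" rather than "requests ever".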

Here are a few suggestions:

  • Detect unusual activity:

If you see unusual activity, such as many similar requests from a specific IP address, someone looking at an excessive number of pages, or performing an unusual number of searches, you can prevent access, or show a captcha for subsequent requests.

  • Rate limiting:

Only allow users (and scrapers) to perform a limited number of actions in a certain time – for example, only allow a few searches per second from any specific IP address or user. This will slow down scrapers, and make them ineffective. You could also show a captcha if actions are completed too fast or faster than a real user would.
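One common way to implement this is a sliding-window counter per client. The sketch below keeps everything in memory and is only meant to show the logic; the class name and parameters are hypothetical, and a production setup would more likely rely on the rate-limiting features of a reverse proxy or API gateway, or on a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client key."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                 # injectable so the limiter can be tested
        self.hits = defaultdict(deque)     # client key -> timestamps of recent requests

    def allow(self, client_key):
        now = self.clock()
        window = self.hits[client_key]
        # Forget requests that have fallen out of the time window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return False                   # over the limit: reject or show a captcha
        window.append(now)
        return True
```

The client key can be an IP address, a user ID, or an API key. When allow() returns False, the application can respond with HTTP 429 (Too Many Requests) or fall back to a captcha.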

  • Don’t just monitor or rate limit by IP address – use other indicators too:

If you block or rate limit, don’t just do it on a per-IP address basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users/scrapers include:

  • How fast users fill out forms, and where on a button they click;

You can gather a lot of information with JavaScript, such as screen size/resolution, timezone, etc.; you can use this to identify users.

  • HTTP headers and their order, especially User-Agent.

As an example, if you get many requests from a single IP address, all using the same User-Agent and the same screen size (determined with JavaScript), and the user always clicks on the button in the same manner and at regular intervals, it’s probably a screen scraper, and you should temporarily block similar requests. This way you won’t inconvenience real users on that IP address, e.g. in case of a shared internet connection.
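The indicators above can be combined into a single fingerprint per request. The following is a minimal sketch, assuming the server has the request headers as an ordered list of (name, value) pairs and that screen size and timezone were reported by client-side JavaScript; the function name and the choice of fields are illustrative only.

```python
import hashlib


def fingerprint(headers, screen_size, timezone):
    """Hash header order, User-Agent and client-side signals into one identifier.

    `headers` is an ordered list of (name, value) pairs; `screen_size` and
    `timezone` are strings assumed to be collected with JavaScript.
    """
    header_order = ",".join(name.lower() for name, _ in headers)
    user_agent = {name.lower(): value for name, value in headers}.get("user-agent", "")
    raw = "|".join([header_order, user_agent, screen_size, timezone])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Requests that share a fingerprint can then be rate limited or blocked as a group, while other users behind the same IP address, who will usually have a different fingerprint, are left alone.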

You can also look at the ‘Referer’ header. Usually, web scrapers don’t follow the normal flow of web navigation. Authors of web scrapers often don’t set this header at all, and you can spot its absence in the server logs.
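A simple version of this check might look like the sketch below. It assumes the headers are available as a dictionary and that `own_host` is your site’s hostname; both names are made up for the example. Keep in mind that a missing Referer is also normal for bookmarks, typed-in URLs, and privacy-conscious browsers, so treat the result as one weak signal among several rather than as grounds for blocking.

```python
from urllib.parse import urlsplit


def has_plausible_referer(headers, own_host):
    """Return True if the request carries a Referer that points at our own site."""
    normalized = {name.lower(): value for name, value in headers.items()}
    referer = normalized.get("referer", "")
    if not referer:
        return False  # header missing: typical for naive scrapers, but also for direct visits
    return urlsplit(referer).hostname == own_host
```

The check is most meaningful on deep pages (search results, page 50 of a listing) that real users rarely open without navigating there from another page of the same site.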

You can also identify similar requests even if they come from different IP addresses, which is indicative of distributed scraping (a scraper using a botnet or a network of proxies). If you get a lot of otherwise identical requests from different IP addresses, you can block them. Again, be careful not to block real users unintentionally.
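Detecting this pattern comes down to grouping requests by what they have in common instead of by IP address. The sketch below assumes each request has already been reduced to an (ip, fingerprint) pair, where the fingerprint is any string summarizing the request’s shape (for example a hash of the User-Agent and header order); the function name and the `min_ips` threshold are hypothetical.

```python
from collections import defaultdict


def distributed_fingerprints(requests, min_ips=20):
    """Find fingerprints that show up from at least `min_ips` distinct IP addresses.

    `requests` is an iterable of (ip, fingerprint) pairs.
    Returns a dict mapping each flagged fingerprint to its number of distinct IPs.
    """
    ips_by_fingerprint = defaultdict(set)
    for ip, fp in requests:
        ips_by_fingerprint[fp].add(ip)
    return {
        fp: len(ips)
        for fp, ips in ips_by_fingerprint.items()
        if len(ips) >= min_ips
    }
```

A very common browser configuration can legitimately appear from many addresses, so the threshold needs tuning against your normal traffic before anything is blocked.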

This can be effective against screen scrapers that run JavaScript, as you can get a lot of information from them.

Like it or not – Scraping is practically legal

There is a thin line between the legal and illegal aspects of web scraping. Publicly available APIs are a tasty morsel. Clubhouse is another organization that has been the target of large-scale data scraping, and although the process is far from a “leak”, how such cases should be treated is still very much debatable.

If you need more information about data security – contact us.