Dark Reading is part of the Informa Tech Division of Informa PLC


Operations

5/21/2020
04:55 PM

Web Scrapers Have Bigger-Than-Perceived Impact on Digital Businesses

The economic impact of bot traffic can be unexpectedly substantial, a PerimeterX-commissioned study finds.

Automated bots that collect content, product descriptions, pricing, inventory data, and other public-facing information from websites have a greater economic and performance impact than many organizations might realize, a new study suggests.

Bot mitigation company PerimeterX recently commissioned market intelligence firm Aberdeen Group to look into how web-scraping bots might be affecting the revenues of digital businesses.

The study found bots account for between 40% and 60% of total website traffic in certain industries and can affect businesses in multiple ways, including overloading their infrastructure, skewing analytics data, and diminishing the value of their IP, marketing, and SEO investments. The impact on revenues from such factors is considerable, according to PerimeterX.

"Web scraping hurts your revenue in more ways than you know," says Deepak Patel, security evangelist at PerimeterX. For the e-commerce sector, website scraping can dilute overall annual website profitability by as much as 80%, the study shows.

"For the media sector, the median annual business impact of website scraping is as much as 27% of overall website profitability," Patel adds.

Many organizations don't view web-scraping bots as a security threat because they don't breach the network or exploit a security flaw. However, they do pose a big threat to business logic or proprietary content essential for maintaining a competitive edge.

"Malicious web-scraping bots can steal your exclusive, copyrighted content and images," says Patel, adding that it can also damage a site's SEO rankings when search engines detect pages with duplicate content.

Organizations routinely use web scrapers to look up information on their competitors, to build services based on third-party data, or for a variety of other reasons. The bots scour websites — in much the same way search engine crawlers do — and collect any information the site operator has posted publicly that might be useful to the organization running the bots.
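In essence, a scraper fetches pages and pulls structured data out of their markup. The sketch below, using only the Python standard library, parses product names and prices from a hypothetical HTML snippet; real scrapers would fetch live pages and follow links the way a search-engine crawler does. The class names and markup are assumptions for illustration.

```python
# Minimal sketch of the extraction step of a scraping bot. The HTML
# snippet and the "name"/"price" class names are hypothetical; a real
# bot would fetch live pages (e.g., with urllib) and crawl link graphs.
from html.parser import HTMLParser

SAMPLE_PAGE = """
<div class="product"><span class="name">Widget A</span>
<span class="price">$19.99</span></div>
<div class="product"><span class="name">Widget B</span>
<span class="price">$24.50</span></div>
"""

class PriceScraper(HTMLParser):
    """Collects (name, price) pairs from product listings."""
    def __init__(self):
        super().__init__()
        self._field = None      # which labeled field the parser is inside
        self._current = {}      # fields gathered for the current product
        self.products = []      # accumulated (name, price) tuples

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if "name" in self._current and "price" in self._current:
                self.products.append(
                    (self._current.pop("name"), self._current.pop("price")))

scraper = PriceScraper()
scraper.feed(SAMPLE_PAGE)
# scraper.products now holds the extracted listings
```

Because the bot reads exactly what a browser would render, nothing in this exchange looks like an intrusion — which is why many organizations don't classify it as a security event.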

Though there are some questions over the legality of the practice, numerous products and services are available that allow organizations to scrape another firm's website for information that is available publicly. In a lawsuit involving talent management advisory firm hiQ Labs and LinkedIn, the Ninth Circuit Court of Appeals last year held that the scraping of publicly available data does not violate US computer fraud laws. LinkedIn had wanted hiQ to stop scraping publicly available data from its site, which the latter was using to create analytics tools to help companies deal with employee retention issues.

"As a technical matter, web scraping is simply machine-automated web browsing and accesses and records the same information, which a human visitor to the site might do manually," the Electronic Frontier Foundation had noted in welcoming the appellate court's decision.

Bad Bots
The study shows that while humans and "good bots" — such as those used by search engines — represented a substantial proportion of web traffic, "bad bots" accounted for a significant share as well. Nearly 17% of all traffic on e-commerce websites, for example, consisted of bad bots. On travel sites the proportion was closer to 31%, and on media sites around 9.5%.

Patel says bad bots are bots that crawl websites to perform abusive or malicious actions, including account takeover and content plagiarism. Such bots often mimic human behavior and use multiple IPs to evade detection.
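One behavioral signal defenders look for can be sketched in code. The heuristic below flags clients whose inter-request timing is unusually uniform — scrapers polling on a schedule produce regular gaps, while human browsing is bursty. This is a hedged illustration of behavior-based detection, not how any particular product works; the threshold value is an assumption for the sketch, and real bot-mitigation systems combine many signals (fingerprints, IP reputation, ML models).

```python
# Illustrative heuristic: flag a client whose request timing is
# suspiciously uniform. The 0.15 threshold is an assumed value for
# this sketch, not a production setting.
from statistics import mean, pstdev

def looks_automated(timestamps, cv_threshold=0.15):
    """Return True when the coefficient of variation of the gaps
    between a client's requests falls below the threshold.

    timestamps: sorted request times (in seconds) for one client.
    """
    if len(timestamps) < 5:
        return False          # too little data to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    avg = mean(gaps)
    if avg == 0:
        return True           # many requests in the same instant
    return pstdev(gaps) / avg < cv_threshold

# A scraper polling roughly every 2 seconds vs. a human's irregular clicks
bot_times = [0, 2.0, 4.01, 6.0, 8.02, 10.0]
human_times = [0, 0.8, 5.2, 6.1, 30.4, 31.0]
```

As Patel notes, sophisticated bots defeat exactly this kind of single-signal check by randomizing timing and rotating IPs, which is why detection in practice layers many such heuristics.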

They can also scrape content that other sites have invested substantially in developing — SEO-optimized product descriptions or marketing copy, for instance. For the companies doing the scraping, such content can reduce or even eliminate the need to develop their own. Conversely, for the digital businesses being targeted, web scraping can erode the value of those investments, the study found. Similarly, information that companies must post on their sites — such as pricing or product availability — can give rivals valuable insight for making their own decisions.

Bot traffic can also overload web infrastructure by sending millions of requests to a specific path, such as login or checkout pages, causing a slowdown for users, Patel says. According to him, 80% of account logins originate from bad bots.

"Scraping bots can significantly impact website performance since they have to collect a lot of data quickly," Patel says. On retail sites, for example, the traffic from bots trying to keep pace with new product listings or pricing changes can degrade performance.

Many tools are commercially available that are designed to help digital businesses deal with web scrapers.

"But today's bots, unlike more crude, basic bots of the past, are becoming more adept at mimicking actual users and disguising their true purpose," Patel says. "Hyper-distributed scraping attacks, achieved by using many different user agents, IPs, and [autonomous system numbers] are even more dangerous, resulting in higher volume and higher difficulty of detection."

Jai Vijayan is a seasoned technology reporter with over 20 years of experience in IT trade journalism. He was most recently a Senior Editor at Computerworld, where he covered information security and data privacy issues for the publication.
