Web scraping or web data extraction?

By Jacob Laurvigen on September 20th, 2016




If you are a developer, you might have noticed a change in the wording used for what would normally be described as web scraping. It is now called web research or, as we call it, web data extraction. So why don't we just call it web scraping? Depending on geographic location, the perception of web scraping, and of whether it is a good or a bad thing, ranges from "web scraping is a natural tool for data research" to "this is a grey zone".


What web scraping really is

Web scraping is simply gathering data from the web, and the purposes behind it can be as varied as the reasons people read books, newspapers or any other source of knowledge. What you cannot do, of course, is copy a text, an image or similar and publish, present or re-sell it as your own work. Just as a journalist reads other newspapers, you need to write your own version or simply use the newly acquired knowledge to navigate.

At its core, web scraping is applying robotics to the manual job you already do when you start your computer and open a browser to look for information. What the dexi.io web scraping robots (or web data extraction robots, as we call them) do is automate a string of events, saving you or your organisation from, for example, writing all of this information down manually or pressing F5 all day long to look for changes. So web scraping helps you automate a manual and time-consuming job, and yes, that IS a good thing!
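To make the F5 example concrete, here is a minimal sketch in plain Python of a change monitor. It is only an illustration of the idea, not how dexi.io robots are implemented, and the URL and check interval are placeholders.

```python
# Minimal sketch of the "stop pressing F5" idea: fetch a page on a fixed
# schedule and flag when its content changes. Plain Python for illustration
# only; the URL and interval below are placeholders, not a real target.
import hashlib
import time

import requests

URL = "https://example.com/page-to-watch"   # hypothetical page to monitor
CHECK_INTERVAL_SECONDS = 300                # check every five minutes

def page_fingerprint(url: str) -> str:
    """Download the page and return a hash of its body."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return hashlib.sha256(response.content).hexdigest()

last_seen = page_fingerprint(URL)
while True:
    time.sleep(CHECK_INTERVAL_SECONDS)
    current = page_fingerprint(URL)
    if current != last_seen:
        print("Page changed - time to extract the new data")
        last_seen = current
```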


Classic web scraping examples

Let us start with some classic examples of web scraping and move on from there. First, you have the sales team. What a sales team wants is good leads, and one of the best ways to find them is by looking through the web. If they are cold calling from scratch, they would find a directory, look up the business categories most relevant to their product and start from one end. But instead of copy-pasting names and contact information into Salesforce, they can use a web scraping robot to do the work, import the list into Salesforce and save hours of frustrating work. If they only want to deal with large companies, they can build a dexi.io pipe robot and merge that data with publicly listed companies, e.g. with a robot that looks up the list of companies on the stock market using another web scraping robot. If they want to push this further, they can build a crawler that collects the companies' URLs and yet another crawler that goes into the companies' websites and looks for specific contact information such as email addresses.
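As an illustration of that last step, here is a minimal Python sketch that fetches one page of a company website and pulls out email addresses. The URL is invented for the example, and a real robot would need to respect robots.txt, rate limits and each site's terms of use.

```python
# Rough sketch of the last step above: fetch one page of a company website
# and extract email addresses from its HTML. The URL is a made-up example.
import re

import requests

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def find_emails(url: str) -> set:
    """Return the set of email addresses found in the page's HTML."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return set(EMAIL_PATTERN.findall(response.text))

if __name__ == "__main__":
    for email in sorted(find_emails("https://example-company.com/contact")):
        print(email)
```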


Web scraping and comparison sites!

Another classic use of web scraping worth mentioning is price comparison sites. Their business is to harvest the web for prices on anything from insurance, credit cards and hotels to mobile phones and just about anything else you can think of. There are even price comparison sites that use web scraping to scrape other price comparison sites.
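As a hedged illustration of how such a site might read a single price, here is a short Python sketch using requests and BeautifulSoup. The URL and CSS selector are made up for the example; every shop's markup is different, so the selector has to be adapted per site.

```python
# Illustrative price lookup: parse a product page and read its price element.
# The URL and the ".product-price" selector are invented for this example.
import requests
from bs4 import BeautifulSoup

def get_price(url: str, selector: str = ".product-price"):
    """Fetch a product page and return the text of its price element, if any."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element else None

print(get_price("https://example-shop.com/phones/model-x"))
```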


Is the Fintech industry using web scraping?

Let us then move on to the more sophisticated web scrapers within the fintech industry. The financial industry is one of the most evolved industries when it comes to data harvesting and web scraping. Basically, what a financial institution does is minimise risk and maximise profit, and the only way to do this is to know more than the others trying to do the same. Let me give you an example. If you want to make a sports bet, you first choose a sport you have some basic knowledge about. Then you choose a league or a country where, if anything abnormal were to happen, you would probably know about it. But if you want to minimise your risk, you need to gather more data than your bookmaker, and if you want good odds, you also need to gather more information than anyone else betting on the same match.

The same goes for the financial industry: the more data, the better the odds. That is why one thing is to subscribe to services such as Bloomberg just to ensure you have access to all the basic data, but to get ahead of the competition you need data that no one, or as few as possible, has access to. So financial institutions use web scraping for exactly this, and some have thousands of robots looking for opportunities and abnormalities that will give them a small advantage, help them make fewer mistakes and maximise their clients' profits.


But web scraping is also very helpful for research in general

On the other hand, you have researchers at universities, governments and NGOs whose data scientists use web scraping to do research, build self-driving cars, develop AI, monitor the state of the Earth, monitor national security and so much more.

How do you get started with web scraping?

Dexi.io has developed the most advanced data extraction tool for web scraping, web crawling and data refining. This gives you the power to harvest and process your data and to connect your data engine to any system. You could say that machines need data in order to live, and dexi.io harvests and processes this data so that it CAN be fed to the machines. We are the fuel of the future!


What if you now want to get started with web scraping?

Well, first of all, use the term data research or external data processing. This will go a long way towards avoiding misunderstandings when presenting the project internally in your organisation. Secondly, define the intelligence behind the data, your unique algorithm, and get some assumptions down on paper. This can give you saved time, more accurate information and faster decision making. And remember that what you are about to do may be the first time it has ever been done, so be prepared to fail, learn and evolve. We are just getting started, and only one thing is for sure: if you don't start now, you will be eating with your hands from a palm leaf while everyone else is picking their favourite wine.


Sign up and try it for free!

So as you can see, the use of web scraping can be very simple or extremely sophisticated, but all it actually does is enable one person to do what would normally take a thousand people years to do, a job that is virtually impossible by hand, especially when you take into account that data changes. This is where historical data and algorithms come into use.

Click here to get started!
