eCommerce is the business of selling online – a rapidly growing market that was estimated to be worth $1.7 trillion (€1.5 trillion) in 2015. In the UK last year, Black Friday broke the £1 billion (€1.25 billion) threshold in sales. There are roughly one billion websites in the world, but only 173 million or so are active, and fewer than 1% of these are involved in eCommerce or web sales.

The internet is the new marketplace, but there is a problem: how do businesses selling to online retailers find their sales leads? There is no directory listing all of the online shops out there. By using search engines such as Google and Bing, businesses can find the top 500+ online retailers, but they cannot find critical information about these sites, e.g. the technology the website is using, merchant size or contact data. Companies lose thousands of hours manually crawling the internet and entering information about each online retailer into their CRM system. They also use LinkedIn to generate sales leads, but online retailers do not generally reside there. They live in the world of Instagram, Facebook, Twitter and YouTube, and there are no search engines or data-mining tools available for these platforms.

This is the opportunity that SalesOptimize has identified and tackled head-on. We have created a business-to-business (B2B) search engine: a sales lead generation SaaS platform. We comb through the web looking for online eCommerce sites that would be of interest to our customer base. We do this using web crawlers to download content, and we then analyse that content to understand whether the site is potentially of interest to our clients. In doing this, we need to harvest and search through vast amounts of information – terabytes of data must be downloaded and analysed on a weekly basis. We then need to augment the sites we have found with additional data, such as social media information, web traffic volumes, contacts for the site, and various types of website ranking metrics. We also need to perform company sizing calculations, compute data confidence metrics, and analyse websites to assess whether they are related to other sites, or shared or owned by the same corporate entities. All of this means that the online leads we generate are well prequalified, so very little work is required from our customers. They just need to contact the company to identify whether it is interested in the products or services they provide.

Web crawlers – don’t kill your information source


A web crawler, much like creepy-crawlies in the real world, tends to have a bad name. A badly developed crawler can wreak havoc on a website; some are so bad that they can cause the site to crash, so being mindful of how you behave on other people's sites is very important. Web crawlers need to be very careful not to overload the sites they are analysing. Any good firewall or traffic-monitoring system will kick them out pretty quickly if they do, but a lot of smaller sites do not have this protection, and at worst an overloading crawler could bring the site down. Here at SalesOptimize, we actively minimise our footprint on a site: at most, we try to download a single HTML page from any one site around every 8-10 seconds. In reality, because of the volume we analyse, it can be minutes or even hours between the pages we take from a given site. We spread the load across tens of thousands of websites at a time, and we like to treat each site we contact as if it were one of our own.
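To illustrate the idea, here is a minimal sketch of per-site rate limiting, assuming a simple single-process crawler; the 9-second delay and the function names are illustrative rather than our production code.

```python
import time
import urllib.request
from collections import defaultdict
from urllib.parse import urlparse

# Illustrative politeness window: never hit the same domain more than
# once every few seconds (9s here, in the middle of the 8-10s range).
MIN_DELAY_SECONDS = 9.0

last_fetch = defaultdict(float)  # domain -> time of the last download

def fetch_page(url: str) -> bytes:
    """Download a single HTML page, waiting if its domain was hit recently."""
    domain = urlparse(url).netloc
    elapsed = time.monotonic() - last_fetch[domain]
    if elapsed < MIN_DELAY_SECONDS:
        time.sleep(MIN_DELAY_SECONDS - elapsed)
    last_fetch[domain] = time.monotonic()
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()

# In practice the crawl is interleaved across tens of thousands of domains,
# so each individual site sees far less traffic than even this limit allows.
```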
When it comes to accumulating and managing massive volumes of data, software that can scale across multiple servers and processing nodes is essential. There are many ways to manage scale, but the most effective approach I have come across is to:
  • Break all of your services down into smaller, segmented ones that have no inter-dependencies with other services;
  • Tie those services into a distributed messaging system. Each service should deal with only a single, specific type of message, and should consume messages of that type from a distributed message queue;
  • When a service has finished processing a message, it can chain further actions by persisting other message types to be consumed by other services (a minimal sketch of this pattern follows below).
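As a rough sketch of that pattern, the example below uses Python's in-process queue.Queue as a stand-in for a distributed broker such as Azure Service Bus or RabbitMQ; the queue names and payload fields are purely illustrative.

```python
import json
import queue

# In-process stand-in for a distributed broker (Azure Service Bus, RabbitMQ,
# SQS, ...): one queue per message type, one service per message type.
download_queue = queue.Queue()   # carries "site-download" messages
analyse_queue = queue.Queue()    # carries "site-analyse" messages

def download_worker():
    """A worker dedicated to exactly one message type: 'site-download'."""
    while True:
        body = download_queue.get()                      # blocks until a message arrives
        payload = json.loads(body)
        html = f"<html>fetched {payload['url']}</html>"  # placeholder for the real download
        download_queue.task_done()
        # Chain the next step by publishing a *different* message type,
        # which a separate, independent service will consume.
        analyse_queue.put(json.dumps({"url": payload["url"], "html": html}))

# Usage: download_queue.put(json.dumps({"url": "https://example-shop.com"}))
# then run download_worker() in its own thread or process.
```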
We happen to persist the messages in a database before publication, and we always publish to the distributed messaging system from that data source. So for us, chaining further actions really just means calling a message factory, which persists a set of messages in a database. We track each message from the database to the queue, and we store various nerdy execution metrics per message, e.g. how long it was waiting before processing, how long it took to execute, and whether any errors were encountered during execution. This helps us to understand the workloads we have and where we need additional services spun up.
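The sketch below shows the shape of that idea; we use Microsoft SQL Server in practice, so the in-memory sqlite3 table, the column names and the helper functions here are only stand-ins.

```python
import sqlite3
import time

# In-memory stand-in for our message store; in production the rows live in a
# real database, and a publisher pushes them onto the distributed queue.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    type TEXT, body TEXT,
    created_at REAL, started_at REAL, finished_at REAL,
    error TEXT)""")

def persist_message(msg_type: str, body: str) -> int:
    """The 'message factory': chaining further work is just inserting rows."""
    cur = db.execute(
        "INSERT INTO messages (type, body, created_at) VALUES (?, ?, ?)",
        (msg_type, body, time.time()))
    return cur.lastrowid

def record_execution(msg_id: int, handler) -> None:
    """Wrap processing so queue-wait time, run time and errors are captured."""
    db.execute("UPDATE messages SET started_at = ? WHERE id = ?", (time.time(), msg_id))
    try:
        handler()
        db.execute("UPDATE messages SET finished_at = ? WHERE id = ?", (time.time(), msg_id))
    except Exception as exc:   # keep the failure next to the message for later analysis
        db.execute("UPDATE messages SET finished_at = ?, error = ? WHERE id = ?",
                   (time.time(), str(exc), msg_id))
```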

Execution environment – size does matter


When dealing with scale, the speed of your data pipes, both on internal networks and out to the internet, makes a massive difference. Currently, we run all of our services on Azure. I also have experience with Amazon's environments, and both are equally good in my opinion. Regardless of which environment you choose, having very fast download and upload speeds is essential for good scalability and performance, particularly when millions of distributed messages are flying around while data downloads and uploads are taking place at the same time. Reducing your IO wait times on reads and writes to a minimum is essential in the modern environments in which we work. So my advice would be to make sure you're happy with the performance of the internal and external environments you choose to run in. If you can manage your IO operations efficiently, it will help you to scale with the best of them.

I have found that there is no one database solution to suit everything. Relational, NoSQL, column-store, graph and key-value data stores all make sense in their own use cases. When dealing with scale, each has its place, and understanding their usage is essential. In SalesOptimize, we currently use MongoDB, Microsoft SQL Server and Elasticsearch, along with various ETL formats for shifting data between different environments. We constantly evaluate the types of data we need to store, analyse and report on. We consider production versus back-office analytical services, what is best for front of house (our customer-facing UIs and services), and what works best in the dark rooms where the developers and analysts work.

Because of this, there is no magic answer for which data store to use. Understand the options available to you: how do you choose one over another? Will it scale? Can it be clustered, sharded and partitioned? What is its data retrieval speed like compared to its analytical speed? Are there reporting and analytical tools available so you don't need to develop your own? Answering these questions will help you to choose what's best for the different environments your software needs to work in.
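As a small, hedged illustration of that polyglot approach, the sketch below keeps a canonical site record in a document store and indexes a trimmed copy for search; the connection strings, database and index names are placeholders, and it assumes the pymongo and 8.x elasticsearch Python clients.

```python
from pymongo import MongoClient            # operational/document store
from elasticsearch import Elasticsearch    # full-text search and analytics

# Connection strings, database and index names are placeholders only.
mongo_sites = MongoClient("mongodb://localhost:27017")["leads"]["sites"]
es = Elasticsearch("http://localhost:9200")

def save_site(site: dict) -> None:
    """Keep the canonical record in the document store, and index a trimmed
    copy where fast text search across millions of sites is what matters."""
    mongo_sites.replace_one({"_id": site["domain"]}, site, upsert=True)
    es.index(index="sites",
             id=site["domain"],
             document={"domain": site["domain"],
                       "platform": site.get("platform"),
                       "country": site.get("country")})
```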

Confidence in data gathering


Because you never quite know what you are going to find when analysing a website, you need to build a confidence-based approach into your data-gathering mechanisms. For example, we look for the various shipping companies used by a website. We might search for DHL or UPS, but a match for 'UPS' may also be found in the phrase 'ups and downs'. So how do we know if UPS is actually used as a delivery mechanism? You need to gather additional confidence in your data and place it in the context of other related data. In the example above, capitalisation and related context may matter; a mention of shipping or delivery, or even of other carriers, close to the data you found may (or may not) indicate whether UPS is actually used. Building up additional confidence metrics for every single data point you are interested in is essential to providing relevant data to our customers, and probably to yours too.

We run everything in the cloud, but we initially tried to use cloud hosting providers for our data and analytical stores. This was fine at the start as we were building up data, but we eventually found ourselves landed with a monthly bill of almost €10,000 for one service, purely as a result of the volume of data we had built up. Needless to say, we had to change tack quite quickly. We really would have liked to hand over the management of large clusters of data to third parties; it saves us having to worry about their daily running and management, and frees us up to do what we need to do. Unfortunately, we had to take control of this cost, otherwise it would have become more and more expensive over time. This meant we needed to get up to speed very quickly on clustering, partitioning, sharding and striping large volumes of data, and on making those datasets resilient and scalable.

As your business grows, you may be faced with the same challenges. If so, you must be willing to devote time and resources to tackling them. Having a plan upfront (which we didn't, by the way) will save you time, worry (aka hair) and costs in the long run. As our company grows, we will ultimately have a team dedicated to looking after these operations; for now, however, we as a developer group are looking after them ourselves.
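To make the earlier UPS example concrete, here is a minimal sketch of the kind of confidence scoring we mean; the weights, thresholds and word lists are purely illustrative, not our production values.

```python
import re

SHIPPING_CONTEXT = {"shipping", "delivery", "courier", "dispatch", "tracking"}
OTHER_CARRIERS = {"dhl", "fedex", "royal mail", "an post"}

def ups_confidence(text: str, window: int = 60) -> float:
    """Rough 0-1 score that a mention of 'UPS' refers to the carrier."""
    score = 0.0
    for match in re.finditer(r"\bUPS\b", text):          # capitalisation matters:
        score = max(score, 0.4)                          # 'ups and downs' will not match
        nearby = text[max(0, match.start() - window): match.end() + window].lower()
        if any(word in nearby for word in SHIPPING_CONTEXT):
            score = max(score, 0.8)                      # shipping language close by
        if any(carrier in nearby for carrier in OTHER_CARRIERS):
            score = max(score, 0.9)                      # listed alongside other carriers
    return score

# ups_confidence("Free delivery with UPS or DHL")  -> 0.9
# ups_confidence("life has its ups and downs")     -> 0.0
```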

SalesOptimize - in summary


When dealing with scale and big data, you need to:
  • Choose a software architecture that will allow you to scale. This is probably the most important decision you can make upfront, regardless of the execution environment you choose to run in;
  • Choose an execution environment that allows you to scale up. If you haven’t developed your software with scale in mind, these environments will only get you partially down the road;
  • Evaluate and choose the data sources that best suit your needs. Don’t be alarmed if you need many of them, and different ones for different personas in your company;
  • Remember that the internet is a messy thing to deal with. If you are using it as a data source, you need to build confidence-measuring metrics into the data you harvest;
  • Accept that at some stage of scale you will need to take ownership of your own technical environments. Knowing that, and planning for that eventuality, will ease the transition.
Noel Lysaght is chief technology officer with SalesOptimize. He has been writing, testing and delivering software solutions to customers for nearly 30 years. Lysaght is passionate about seeing the software he has worked on out in the wild, particularly when customers can use it in ways he had never imagined. He enjoys the technical challenges involved in delivering large-scale, customer-critical software and its dependent infrastructures.