Scraping with Python: Scrapy, Beautiful Soup or Selenium?

Scraping is the technique of picking up public information from other web sites and saving it, either to analyse it or simply to organise it by topic. Let's see a few legal examples:

  • Reading product prices from different web sites, which saves you time compared to doing it manually.
  • Converting HTML tables into Excel files.
  • Checking information on the different social media accounts that you own.
  • Downloading pictures from different sites.

All of these tasks can be annoying if you do them manually across a huge list of web sites. If you have some programming knowledge, there are some interesting alternatives to software like Web Scraper.

Let's see a comparison table of scraping libraries in Python, and I'll show some examples of each one as soon as I code them :-)



Scrapy
  Pros: robust, portable, efficient.
  Cons: requires coding knowledge.

Beautiful Soup
  Pros: easy to learn, friendly interface, extensions.
  Cons: inconsistent, big dependencies.

Selenium
  Pros: JavaScript friendly, perfect for test automation, you can run it in interactive (dialog) mode.
  Cons: not really a full web scraper, although it can do similar things.
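
If you want to try them yourself, the three libraries can normally be installed with pip; the package names below are the usual PyPI names (check PyPI if in doubt):

pip install scrapy beautifulsoup4 selenium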

Let's see some examples of these:

Scrapy: extract text from div and span elements.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # pages to scrape
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # each quote on the page lives inside a div with class "quote"
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
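
If you save this spider as quotes_spider.py (the file name is just an example), you can run it without creating a full Scrapy project by using the runspider command and writing the results to a JSON file:

scrapy runspider quotes_spider.py -o quotes.json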


Beautiful Soup: extract the same quotes with requests and Beautiful Soup.
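
Here is a minimal sketch, assuming the requests and beautifulsoup4 packages are installed; it downloads one page of the same quotes site and extracts the same fields as the Scrapy spider:

import requests
from bs4 import BeautifulSoup

# download one page and parse the HTML
response = requests.get('http://quotes.toscrape.com/page/1/')
soup = BeautifulSoup(response.text, 'html.parser')

# each quote on the page lives inside a div with class "quote"
for quote in soup.select('div.quote'):
    print({
        'text': quote.select_one('span.text').get_text(),
        'author': quote.select_one('small.author').get_text(),
        'tags': [tag.get_text() for tag in quote.select('div.tags a.tag')],
    })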
 
Selenium: extract the same quotes by driving a real browser.
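
Here is a minimal sketch, assuming Selenium 4 with Chrome and a matching ChromeDriver available on the system; it drives a headless browser against the same quotes site:

from selenium import webdriver
from selenium.webdriver.common.by import By

# run Chrome in headless mode so no browser window is opened
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

try:
    driver.get('http://quotes.toscrape.com/page/1/')
    # each quote on the page lives inside a div with class "quote"
    for quote in driver.find_elements(By.CSS_SELECTOR, 'div.quote'):
        print({
            'text': quote.find_element(By.CSS_SELECTOR, 'span.text').text,
            'author': quote.find_element(By.CSS_SELECTOR, 'small.author').text,
            'tags': [t.text for t in quote.find_elements(By.CSS_SELECTOR, 'div.tags a.tag')],
        })
finally:
    driver.quit()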


Code Sources:

 https://docs.scrapy.org/en/latest/intro/tutorial.html




Proxies: what they are and how to use them

What is a proxy server:

A proxy server is a computer system that manages internet connections and handles the traffic between two or more points, for example from a client computer to another server. The proxy sits in the middle of the connection and forwards the traffic, offering security, performance and more privacy.

What are the most common cons of using a proxy:

  • Privacy: if all users identify themselves as one, it is difficult for the accessed resource to differentiate between them. This can be a problem, for example when individual identification is required.
  • Abuse: since the proxy accepts requests from many users and responds to them, it may end up doing work on behalf of users it should not serve. You therefore need to control who does and does not have access to your services, which is usually very difficult.
  • Irregularity: the fact that the proxy represents more than one user causes problems in many scenarios, in particular those that presuppose direct communication between one sender and one receiver (such as TCP/IP).
How to set up a proxy configuration on a single system for all users:
You will need to set up global environment variables that affect all users. To do that, edit the file /etc/profile as root and add the entries with the proxy information you would like to use. For example, if you would like to use a proxy for HTTPS and FTP connections, add the following entries:
export https_proxy=http://server-proxy:8080/
export ftp_proxy=http://210.113.232.28:8083/

Follow the next steps to set up a proxy through the terminal on a Linux system. If you have any scraping or crawling program on your system, you will need this. You can use the http_proxy variable:

export http_proxy=http://server-name:port/
If the proxy server requires authentication, do this:

export http_proxy=http://user:password@server-name:port/

If you would like to use HTTPS for more security, set https_proxy like this:

export https_proxy=https://user:password@server-name:port/

To see your current configuration use the following commands:

echo $http_proxy
echo $https_proxy

If you need to delete the configuration, use the unset command.

unset http_proxy
unset https_proxy

There are many proxy server providers out there. Remember that proxy servers may insert your IP address into the request headers, or they can sniff your traffic, so if you handle sensitive information it's important that you trust your proxy supplier.

Remember, if you have a scraper or crawler running on your computer and/or server, it's recommended to use a good proxy to avoid being banned by the target server.
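
If your scraper or crawler is written in Python, you can also pass the proxy directly to the HTTP library instead of relying only on the environment variables. Here is a minimal sketch using the requests package; the proxy address and credentials are placeholders:

import requests

# placeholder proxy address and credentials
proxies = {
    'http': 'http://user:password@server-name:8080',
    'https': 'http://user:password@server-name:8080',
}

# requests also honours the http_proxy/https_proxy environment variables,
# but an explicit proxies dict keeps the configuration in one place
response = requests.get('http://quotes.toscrape.com/', proxies=proxies)
print(response.status_code)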





Web Scraper - How to use the Chrome Extension

Web Scraper is commercial software to pick up public information from web sites. Some of its options require an additional payment, but they offer a very useful free solution as a browser extension for Google Chrome and Firefox. Here is my test with Google Chrome.

You can download the software from the following link:  https://webscraper.io/

In this step-by-step guide you will see the few steps needed to install it and use it to scrape text and links from Amazon, for example to create promoted links with an Amazon affiliate code.

The Steps are:

  1. Installation
  2. First steps
  3. Scraping
  4. Exporting the data


1.- Installation:

It is as easy as clicking a button: go to the download section of the Web Scraper site, look for the pricing chart and click on the free extension. There you will see an install button; after clicking it, the extension will be added to your Google Chrome browser.


Extension details before installing:

Click on the extensions icon in the Google Chrome toolbar to see the extension properties. You will see a list of all the extensions in your browser; just look for Web Scraper and pin it so that you can see it in the menu. As you will see in the next steps, it doesn't matter if it's not visible in the toolbar, because we will open it from the developer tools panel of your browser.

Detail of the pinned extension; I recommend pinning it so you have quicker access to its options.
Your Web Scraper is now installed and running; you don't need to restart the browser or the computer.

2.- First steps 

Open the "Inspect windows" with CTRL+SHIFT+I, you have to look for a new tab called "Web Scrapper", there is a easy configuration that you have to follow before you start scrapping a web site:

First, click on the sub-menu "Create new sitemap" and then "Create".

You will see the next window, where we have to enter the URL of a target web site. Just add a sitemap name and the web site URL; note the yellow mark in the next screenshot, I will explain it later.

Selecting the target: in this test we choose the Amazon site to download the names of and links to different Amazon products. We make a product search until we reach the desired product list; make sure the site has pagination so you can move through all the products.

Then we have to copy the target URL into our sitemap. Pay attention to the last part of the URL: there we have to find the pagination number and replace it with a dynamic range, remember the yellow mark that I mentioned before.

In this example we will use the dynamic range (1-10); remember to check the upper limit before scraping. Web Scraper will step through pages 1 to 10, and if you select 20000 it will try to do that too.
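
As a hypothetical example of what the start URL can look like (the search term and query parameters are just placeholders, the exact URL depends on the site), the page number is replaced by the range in square brackets:

https://www.amazon.com/s?k=laptop&page=2        (a single page)
https://www.amazon.com/s?k=laptop&page=[1-10]   (pages 1 to 10)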

Now we have to create a selector. A selector is the object Web Scraper uses to identify an element of the web site; for example, you can create a selector for "Text Title", "Product Link" and "Picture". In this example we are picking only the name and the link, so we will create two selectors.

Once you create the selector and flag the "multiple" option, you will see that all the matching elements are highlighted on the Amazon page; Web Scraper searches the site for all elements of the same type. In this example you can see that after the second click on an element, all other similar elements are also selected. On Amazon you will notice that promoted products don't have the same properties as normal products, so if you would like to pick them all, you will need to create a specific selector for promoted products as well.

We can create as many selectors as we want; there is no limit.
In this example we have created two selectors, one for text and one for links. Once a selector is created you will see the selector options menu, where you can preview what the selector matches, see the data preview (the list of data that will be scraped), edit it again or delete it.
Before you scrape, you can check the data preview to make sure you are scraping the correct information; this is a recommendation, not a mandatory step.
Now it's time to scrape: just go to the menu option "Sitemap test" (the name of the sitemap we created) and select "Scrape".

After you click the scrape option, the system will ask you for a default wait time between requests. This is to avoid server checks: if the server has any anti-scraping protection, you can raise this number so that the requests look more like a human clicking the mouse.

At the beginning you will see a popup window; be careful not to close it before the scrape ends. Once it finishes, you will see the following report.



It's done. Now we can export the data in different formats; I prefer CSV because you can open it easily in Excel.

4.- Exporting the data:

In the same sitemap window, just go to the menu and click on "Export data as CSV". I like it more than exporting the sitemap, but it depends on your objective.

There are two steps to export the data; if you don't see the popup, you have to click on the "Download now" link.

Your file is now ready in the browser's download folder.

This is the exported data. Now, in Excel, there are no limits to editing this information; for example, you can add your Amazon affiliate code to each link and then publish them on another site, as in the sketch below.
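
As a minimal sketch of that idea in Python, assuming the exported CSV has a column named link and that the affiliate code is passed as a tag query parameter (both the column name and the parameter are assumptions, check your own export and Amazon's affiliate documentation):

import csv

AFFILIATE_TAG = 'your-affiliate-id'  # placeholder affiliate code

with open('amazon_products.csv', newline='', encoding='utf-8') as source, \
     open('amazon_products_tagged.csv', 'w', newline='', encoding='utf-8') as target:
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # append the affiliate tag to each product link (assumed column name: "link")
        separator = '&' if '?' in row['link'] else '?'
        row['link'] = f"{row['link']}{separator}tag={AFFILIATE_TAG}"
        writer.writerow(row)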

And now? What can we do with that information? Imagination has no limits...

Different types of cloaking

Remember that cloaking is dynamically changing the web content served to human users and robots, so you can show HTML content to a bot and images or Flash content to humans, as we talked about in this post.

There are different types of cloaking:

  • User-agent cloaking

    Web sites that use this technique identify the visitor by the user-agent field in the HTTP request. For example, you can identify a human visitor if you can see their browser information; bots don't use regular internet browsers :-). With this simple PHP code you can read the user-agent information:
    <?php
    // print the raw user-agent string sent by the client
    echo $_SERVER['HTTP_USER_AGENT'];
    ?>
  • To identify a bot, you can use code like this:

      if ( strstr( strtolower( $_SERVER['HTTP_USER_AGENT'] ), 'googlebot') )
          {
          // code to execute for bots
          }

    Examples of use (remember that Google will penalise you if it detects this):
    - Change image content to HTML content.
    - Redirect bots or humans to different pages.
    - etc...

  • IP-based cloaking

    This cloaking technique is quite similar to the user-agent one, but rather than checking whether the visitor is a human or a search engine bot, you read the IP address of your visitor and change the behaviour of your site. To detect the IP of your visitor in PHP you can use the following script:

    <?php
    // IP address of the client making the request
    $ip = $_SERVER['REMOTE_ADDR'];
    ?>
    or
    <?php
    // host name of the client, if the web server is configured to resolve it
    $host = $_SERVER['REMOTE_HOST'];
    ?>

        Examples of use (this is not as risky as user-agent cloaking):
        - It is common for geolocated sites: you can redirect your visitor to the correct language if you can geolocate their IP.
        - You can apply IP filtering for development testers: while you are developing a web site, you may want to skip some checks for the IP address you are using.
        - You can identify the IP of your visitor, look it up in your user database to see if it is a returning visitor, and show additional information based on their last visit...
  • HTTP referrer and language header cloaking

You can read the referrer (the page that linked to the current one) using JavaScript or PHP; in JavaScript it is a property of the document the user loaded, not of its parent.

JavaScript example:

<script type='text/javascript'>
document.write(document.referrer);
</script>

PHP example:

<?php echo $_SERVER["HTTP_REFERER"]; ?>


        Another code example, this time reading the Accept-Language header to pick a language-specific page:

        <?php
            // first two letters of the browser's preferred language, e.g. "en"
            $langu = substr($_SERVER['HTTP_ACCEPT_LANGUAGE'], 0, 2);
            // languages we actually have a page for
            $acceptLang = ['de', 'es', 'en'];
            // fall back to English when the browser language is not supported
            $langu = in_array($langu, $acceptLang) ? $langu : 'en';
            require_once "index{$langu}.php";
        ?>

        This header is a useful trick when the server has no other way to determine the language, such as a language-specific URL, the visitor's IP, or an explicit choice made by the user.

  • JavaScript cloaking

    It's not very different from the examples explained before. Remember that JavaScript runs on the client side, so after the server has answered the HTTP request and the HTML is displayed to the user, you can apply JavaScript routines to manipulate the content based on what your human visitor does on the website.

    For example, you can redirect the user by reading the client information with JavaScript. You have more options there, but you are more vulnerable if the user blocks JavaScript or tampers with it.

    <a href="YourAffiliateSiteUrl" onmouseover="func_changeurl('v_code'); return true" onmouseout="window.status=' '; return true">Link Title</a>

    With that code you can change your affiliate link when your human visitor moves the mouse over it; you can do the same using PHP, which is more secure and reliable.