What is it?
Scraping is the method of pulling data down from a website. Your browser naturally does this, presenting the data on the screen for you to view in the way the web designer intended. For example, on this very page the pictures, backgrounds and text are available for you to copy or save to a folder, if you so wished.
Using Python you can code a scraper that will automatically pull down all of that information/data. Coding your own scraper adds extra Python functionality, for example checking the data for keywords or phrases, or manipulating and responding to the downloaded content. To understand scraping better, let's look at Microsoft Excel. It has a simple scraper facility which can be used to pull down data from a website and return it to a list of cells within the spreadsheet. New calculations, or any of the other Excel functions, can then be applied to the downloaded data. See the video on the left.
Excel Scraper
Scraping Pictures from a Webpage
Scraping data and text is all right, but downloading pictures and images is more fun and more interesting. I wanted to create a simple program that would go to a website and move through each page, downloading the images found on each one. A simple piece of code was available that makes use of the requests library to scrape all the images from a webpage and store them in a folder. You will need to ensure that you have created the folder first and referenced it correctly in the code; for example, if the folder is called pics2, the line in the code must read f = open('pics2/%s' % url.split('/')[-1], 'w'). When the program is run it goes to the website, searches for each //img/@src link, follows it and downloads the image it finds at the end of the link. A sketch of this scraper is shown below.
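Below is a minimal sketch of that kind of image scraper, assuming the requests and lxml libraries are installed, that the target address (example.com) is a placeholder, and that the pics2 folder already exists. Note that the images are written in binary ('wb') mode so the files are not corrupted.

    # Rough sketch of an image scraper using requests and lxml (illustrative,
    # not the exact original code). The pics2 folder must already exist.
    import requests
    from lxml import html

    site = 'http://example.com'                    # placeholder website address
    page = requests.get(site)
    tree = html.fromstring(page.content)

    # //img/@src returns the source address of every image on the page
    for url in tree.xpath('//img/@src'):
        if url.startswith('/'):                    # turn relative links into full URLs
            url = site + url
        image = requests.get(url)
        f = open('pics2/%s' % url.split('/')[-1], 'wb')  # 'wb': image data is binary
        f.write(image.content)
        f.close()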
Spidering: Finding the Links within Links
Once you have entered a website address, it would be useful to get all the links from the entire website downloaded and accessible in one easy method. This is called spidering, where the code iterates over the main website address and returns all the links for the pages within the site. This makes scraping more efficient, as you can scrape data from all the pages by entering only the website's home address. It also returns any links that may be hidden from public view. To set this up, Python requires a module called Mechanize:

    sudo easy_install mechanize
This program only uses a basic function of the module; further support, ideas and code can be found at this link: Mechanize Cheatsheet. I adapted the original code to add the found URLs to a list, which are then combined with the original home page web address to create a list of every page link found on the website. A sketch of this spider is shown below.
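Below is a minimal sketch of that kind of spider, assuming Mechanize is installed and using example.com as a placeholder home address (the original program differs in its details).

    # Rough sketch of the spider: collect every link found on the home page and
    # combine any relative links with the home page address.
    import mechanize

    home = 'http://example.com'            # placeholder: the website's home address
    br = mechanize.Browser()
    br.set_handle_robots(False)            # only spider sites you have permission to use
    br.open(home)

    links = []
    for link in br.links():                # every link Mechanize finds on the page
        if link.url.startswith('/'):       # relative link: join it to the home address
            links.append(home + link.url)
        else:
            links.append(link.url)

    print(links)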
Thanks to @mattherveen and @biglesp who gave me the heads-up and code advice.
The Picture Bot
The final part of the program is to combine the spider code with the scraper code to create a program that lets you enter a single website's address; the code then goes through every page in the website and downloads the pictures from that page before following the next link and repeating the same process. This process is legal, although there have been some court cases where data has been collected from competitors' websites and prosecutions are pending. Ensure that you own the website you are spidering or that you have permission first. (TeCoEd takes no responsibility for the use of this code, which is for educational purposes only.) A rough sketch of the combined Picture Bot follows the installation commands below.
Firstly you will need to ensure that your Raspberry Pi is up to date. In the LX Terminal type:

    sudo apt-get upgrade

There are a number of installs for the various scraping programs, so it is best to download them all now in one sitting. If you are using PIP:

    sudo pip install readability-lxml
    sudo pip install requests
    sudo apt-get install python-lxml
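Below is a rough sketch of how the spider and the scraper above could be combined into the Picture Bot, again using a placeholder address and assuming the pics2 folder exists; it is an illustration rather than the exact original program.

    # Rough sketch of the Picture Bot: spider the home page for links, then
    # scrape and save the images from every page found.
    import mechanize
    import requests
    from lxml import html

    home = 'http://example.com'        # only use on a site you own or have permission for

    # Stage 1: spider - build a list of every page linked from the home page
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open(home)
    pages = [home]
    for link in br.links():
        url = home + link.url if link.url.startswith('/') else link.url
        if url.startswith(home):       # stay inside the same website
            pages.append(url)

    # Stage 2: scraper - visit each page and download the images it contains
    for page in pages:
        tree = html.fromstring(requests.get(page).content)
        for src in tree.xpath('//img/@src'):
            if src.startswith('/'):
                src = home + src
            image = requests.get(src)
            with open('pics2/%s' % src.split('/')[-1], 'wb') as f:
                f.write(image.content)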
COMING SOON
Scraping for text and content, scraping for XML data