
What is it?


Scraping is the method of pulling data down from a website. Your browser does this naturally and presents the data on screen in the way the web designer intended. For example, on this very page the pictures, backgrounds and text are all available for you to copy or save to a folder, if you so wished.

Using Python you can code a scraper that automatically pulls down all of this information and data. Coding your own scraper also adds extra Python functionality, for example checking the data for keywords or phrases, or manipulating and responding to the downloaded content.
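
To see the basic idea, here is a minimal sketch (the web address and keyword are assumptions for illustration, not part of the TeCoEd downloads): it pulls a page down with the requests library and checks the raw HTML for a keyword.

# Minimal sketch: fetch a page and check it for a keyword
import requests

url = 'http://example.com'        # hypothetical page to scrape
keyword = 'Raspberry Pi'          # hypothetical phrase to look for

html = requests.get(url).text     # the same HTML your browser would render
if keyword in html:
    print('Found "%s" on %s' % (keyword, url))
else:
    print('"%s" not found' % keyword)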

To understand scraping better, let's look at Microsoft Excel. It has a simple scraper facility which can be used to pull data down from a website and return it to a range of cells within the spreadsheet. New calculations, or any of the other Excel functions, can then be applied to the downloaded data. See the video on the left.

Excel Scraper



Scraping Pictures from a Webpage


Scraping data and text is all right, but downloading pictures and images is more fun and more interesting. I wanted to create a simple program that would go to a website and move through each page, downloading the images found on each page. A simple piece of code was available that makes use of the requests library to scrape all the images from a webpage and store them in a folder. You will need to ensure that you have created the folder first and referenced it correctly in the code; for example, if the folder is called pics2, the line in the code must read f = open('pics2/%s' % url.split('/')[-1], 'w'). When the program is run it goes to the website, finds each img//@src link, follows it and downloads the image it finds at the end of the link.
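
The downloadable file is not reproduced here; below is a rough sketch of the same approach, assuming requests and lxml (both installed further down this page) and a page address of your own. It writes in binary mode ('wb') so the image files save intact.

# Sketch of the image scraper described above (assumed URL; the pics2 folder must already exist)
import requests
import lxml.html

page_url = 'http://example.com'              # hypothetical page to scrape
page = requests.get(page_url)
tree = lxml.html.fromstring(page.content)
tree.make_links_absolute(page_url)           # turn relative src links into full URLs

for src in tree.xpath('//img/@src'):         # every <img src="..."> on the page
    image = requests.get(src)                # follow the link to the image itself
    with open('pics2/%s' % src.split('/')[-1], 'wb') as f:
        f.write(image.content)               # save it, named after the last part of the URL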
Download: Scraping Code (.py)


Spidering: Finding the Links within Links


Once you have entered a website address, it would be useful to have all the links from the entire website downloaded and accessible in one easy step. This is called spidering: the code iterates over the main website address and returns all the links for pages within the site. This makes scraping more efficient, as you can scrape data from all the pages by entering only the website's home address. It also returns any links that may be hidden from public view. To set this up, Python requires a module called Mechanize: sudo easy_install mechanize

This program only uses a basic function of the module; further support, ideas and code can be found at this link: Mechanize Cheatsheet. I adapted the original code to add the found URLs to a list. These are then combined with the original home page web address to create a list of every page link found on the website.
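
As a rough guide, a spider along those lines could look like the sketch below (the home page address is an assumption, and the downloadable files underneath may differ in detail).

# Sketch of the spider: collect every link found on the home page into a list
import mechanize

home = 'http://example.com'                  # hypothetical home page address
br = mechanize.Browser()
br.set_handle_robots(False)                  # optional: stop mechanize refusing pages blocked by robots.txt
br.open(home)

found = []                                   # list of every page link found
for link in br.links():
    found.append(link.absolute_url)          # relative links combined with the home address

for url in found:
    print(url)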
Download: Follow Links (.py)

Download: Add URLs to list (.py)


Thanks to @mattherveen and @biglesp, who gave me the heads-up and code advice.

The Picture Bot


The final part of the project is to combine the spider code with the scraper code to create a program that lets you enter a single website's address; the code then goes through every page in the website, downloads the pictures from that page, then follows the next link and repeats the process. This process is legal, although there have been court cases where data was collected from competitors' websites and prosecutions are pending. Ensure that you own the website you are spidering, or that you have permission first. (TeCoEd takes no responsibility for the use of this code, which is for educational purposes only.)
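
In outline, the two parts combine roughly as in the sketch below (using the modules installed further down; the website address is an assumption and the real Picture-Bot download may differ).

# Sketch of the Picture-Bot: spider the home page, then scrape images from every page found
import mechanize
import requests
import lxml.html

home = 'http://example.com'                  # hypothetical site you own or have permission to scrape

br = mechanize.Browser()
br.set_handle_robots(False)
br.open(home)
pages = [link.absolute_url for link in br.links()]   # every page linked from the home page

for page_url in pages:
    tree = lxml.html.fromstring(requests.get(page_url).content)
    tree.make_links_absolute(page_url)
    for src in tree.xpath('//img/@src'):
        with open('pics2/%s' % src.split('/')[-1], 'wb') as f:
            f.write(requests.get(src).content)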

Firstly, you will need to ensure that your Raspberry Pi is up to date.

In the LX Terminal type:
sudo apt-get update
sudo apt-get upgrade

There are a number of installs for the various scraping programs, so it is best to download them all now in one sitting:

If you are using pip:
sudo pip install readability-lxml
sudo pip install requests
sudo apt-get install python-lxml
Download: Picture-Bot (.py)


COMING SOON
Scraping for text and content, scraping for XML data
Copyright 2020 TeCoEd @dan_aldred