Web Scraping
Screen scraping is a technique in which a computer program extracts data from the display output of another program.
What? Why?
There could be several reasons you would want to scrape an application or website ... mine was to collect quotes.
Why, you might ask? Well, quotes are "open" and available to the public ... so why would I pay for a
list of quotes? They are available on the internet ... except not in list form ...
Ingredients
- Website (The source to scrape)
- Scripting or programming language (I used VB.net here)
- Regular Expressions
Let's start
In this example I'll show you how to scrape a website like RandomQuotes.org. If you surf to the
website, you'll see a random quote and 2 buttons ... it's a rather simple, clean, basic site.
That is good for us, because we can easily find the quote in the source of the page ...
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
...
Scraping ...
So you can find the quote in the source ... now it's up to the computer to find the quote. But ... computers
don't think like humans; they don't spot the quote like our brains do ...
So let's analyse the source code. We'll see that the quote is located between the following HTML tags:
<td align="center"></td>
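The easiest way to extract this quote is to make use of Regular Expressions. The named group Quote
captures everything between the opening and the closing tag:
\<td\salign\=\"center\"\>(?<Quote>.*?)\<\/td\>
Inside a VB.net string literal every double quote has to be doubled, so " is written as "" in the sample.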
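To see the pattern at work without touching the network, here is a minimal sketch that runs it against a hard-coded fragment. The fragment, the module name RegexDemo and the function ExtractQuote are made up for illustration; the real markup of the page may differ.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexDemo
    ' The pattern from the article as a VB.net string literal ("" is one double quote).
    ' Singleline lets .*? run across line breaks inside the cell.
    Private ReadOnly moQuotePattern As New Regex("\<td\salign\=\""center\""\>(?<Quote>.*?)\<\/td\>", RegexOptions.IgnoreCase Or RegexOptions.Singleline)

    ' Returns the text of the first matching cell, or an empty string if there is none.
    Public Function ExtractQuote(ByVal sHTML As String) As String
        Dim oMatch As Match = moQuotePattern.Match(sHTML)
        If oMatch.Success Then
            Return oMatch.Groups("Quote").Value.Trim()
        End If
        Return String.Empty
    End Function

    Sub Main()
        ' Hypothetical page fragment, not copied from the real site.
        Dim sSample As String = "<table><tr><td align=""center"">An example quote.</td></tr></table>"
        Console.WriteLine(ExtractQuote(sSample))
    End Sub
End Module
```

The lazy `.*?` matters here: it stops at the first closing tag it meets instead of running on to the last one on the page.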
VB.net Sample
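Imports System
Imports System.Net
Imports System.IO
Imports System.Text.RegularExpressions

Module ModMain
    ' Captures the text after the centered table cell's opening tag, up to the next "<", in the named group "Quote".
    Private moRegularExpression As New Regex("\<td\salign\=\""center\""\>(?<Quote>.*?)\<", RegexOptions.IgnoreCase Or RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)

    Sub Main()
        For i As Integer = 0 To 1024
            Dim sQuote As String = Quote("http://www.randomquotes.org/")
            Console.WriteLine(i & "." & vbTab & sQuote)
            ' Pause between requests. Sleep is a Shared method, so call it on the class.
            System.Threading.Thread.Sleep(200)
        Next
    End Sub

    ' Downloads the page and returns the first quote found, or an empty string.
    Public Function Quote(ByVal sURL As String) As String
        Dim sHTML As String
        ' Using blocks close the connection and the reader even if an error occurs.
        Using oClient As New WebClient
            Using oStreamReader As New StreamReader(oClient.OpenRead(sURL))
                sHTML = oStreamReader.ReadToEnd()
            End Using
        End Using
        ' Match returns the first hit (matches are counted from index 0).
        Dim oMatch As Match = moRegularExpression.Match(sHTML)
        If oMatch.Success Then
            Return oMatch.Groups("Quote").Value
        End If
        Return String.Empty
    End Function
End Module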
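The sample prints every quote it receives, and a site that serves random quotes can return the same one more than once. To end up with the list of quotes that started all this, the results could be de-duplicated before they are stored. This is a minimal sketch of that idea, not part of the sample above: the module name QuoteListDemo, the function UniqueQuotes and the placeholder strings are all hypothetical, and it assumes a .NET version that includes HashSet(Of T).

```vbnet
Imports System
Imports System.Collections.Generic

Module QuoteListDemo
    ' Keeps the first occurrence of every non-empty quote, in the order seen.
    ' Comparison ignores case, which is an assumption about what counts as "the same quote".
    Public Function UniqueQuotes(ByVal oQuotes As IEnumerable(Of String)) As List(Of String)
        Dim oSeen As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)
        Dim oResult As New List(Of String)
        For Each sQuote As String In oQuotes
            If sQuote Is Nothing Then Continue For
            Dim sClean As String = sQuote.Trim()
            If sClean.Length = 0 Then Continue For
            ' HashSet.Add returns False when the value is already present.
            If oSeen.Add(sClean) Then oResult.Add(sClean)
        Next
        Return oResult
    End Function

    Sub Main()
        ' Placeholder strings standing in for scraped results.
        Dim oScraped As String() = {"First quote", "Second quote", "first quote", ""}
        For Each sQuote As String In UniqueQuotes(oScraped)
            Console.WriteLine(sQuote)
        Next
    End Sub
End Module
```

In the scraper, each value returned by Quote could be appended to a list and passed through UniqueQuotes once the loop has finished.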