Creating Quick Scrapers Using Xpather, Snoopy & PHP

I’ve been working on a bunch of review sites, and because of this I’ve had to gather some data from websites, mostly stuff that’s not included in the data feeds I’m using. I wanted to complement my information with some additional specifications so my users can check out items with ease this coming holiday season, and hopefully get click-happy about purchasing :)

I started out reading a bunch of tutorials on using cURL and PHP to scrape, and even built a little scraper that worked fucking awesome for the page I created it for. It had 21 regexes and used a bunch of curl lines to connect and grab the information I needed. I was so proud, so proud of this monstrosity. So, after testing it repeatedly on one page, I had it run through nearly 200 pages. Guess what: I only got results from 2 pages, because the rest were so different.

Fuck.

So, after spending about a day learning regexes and testing out all this crazy stuff, it was pointless. I was on IRC venting my frustrations when Datarecall jumped in and told me about this awesome program called Xpather. It’s a plugin for Firefox that lets you simply right-click any element on the page and view the full XPath for it.

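I did learn that Firefox will automatically inject TBODY tags into tables when it builds the page, so the XPath that Xpather shows you can contain a tbody step that isn’t in the actual source. You may have to remove those for your scraper to work. Thought I should mention this in case you run into troubles like I did.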
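Here’s a rough standalone sketch of that tbody cleanup, separate from the scraper we build below. The helper name strip_tbody_steps and the sample markup are made up for illustration; the idea is just to drop the tbody steps from a browser-generated path before handing it to PHP’s DOMXPath.

```php
<?php
// Hypothetical helper (not part of Xpather or PHP): strips the /tbody steps
// that a browser-generated XPath contains but the raw HTML source may not.
function strip_tbody_steps($xpath)
{
    // Removes "/tbody" and "/tbody[1]"-style steps, leaving the rest intact.
    return preg_replace('#/tbody(\[\d+\])?(?=/|$)#i', '', $xpath);
}

// Sample markup with no <tbody>, as it might appear in a page's source.
$html = '<html><body><table><tr><td>Weight</td><td>2 kg</td></tr></table></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$fromBrowser = '/html/body/table/tbody/tr/td[2]'; // what the browser would report
$cleaned = strip_tbody_steps($fromBrowser);

$before = $xp->query($fromBrowser)->length; // matches using the browser path
$after = $xp->query($cleaned)->length;      // matches using the cleaned path

echo "$cleaned: before=$before, after=$after\n";
```

If the cleaned path matches and the browser path doesn’t, the injected tbody was the culprit.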

Anyway, with this plugin and Firefox I was now armed with the tools needed to create a new scraper that actually grabbed all the data from the pages I needed.

I ended up using the Snoopy class from SourceForge to make the connections nice and easy, and to simplify a lot of the base coding so I didn’t have to do it myself. If I had more time, maybe I’d write my own, but for now it’s quicker to use something pre-built that does the job. Remember, we’re trying to save time now and in the future, right?

Anyway, after grabbing Snoopy and making sure it connected to the site I wanted, I created some basic code to get me started.

I opened a file called snooper.php and added the following to it:

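<?php
// Only report fatal errors; widen the mask while debugging.
error_reporting(E_ERROR);// | E_WARNING | E_PARSE);
include ('Snoopy.class.php');

// Swap stray Windows-1252 bytes (curly quotes, dashes) and double spaces for plain ASCII.
function convert_smart_quotes($string)
{
$search = array(chr(194),chr(146),chr(147),chr(148),chr(151),chr(174),"  ");
$replace = array("","'",'"','"','-',' ',' ');
return str_replace($search, $replace, $string);
}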

This sets up the error reporting. When you’re scraping you might see a lot of errors, and this tones them down a bit for us. You can always comment it out if you want to see all the errors for debugging or troubleshooting purposes.

Following the error reporting, we include the Snoopy class, which handles the connection to the website we’re going to scrape.

Now, the next lines are optional, but I found that they really helped with some funky characters in my output. That’s what the convert_smart_quotes function does: it takes any fancy characters, double spaces, or whatever you throw at it and converts them to whatever proper character you want. You might not need this, or might want to edit it, but I found it nearly invaluable for getting nice clean data at the end.
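To see what the function actually does, here’s a small standalone sketch you can run on its own (don’t paste it into snooper.php, since it declares the same function again). The sample string is made up, and I’m assuming the last search entry is a double space, as described above.

```php
<?php
// Copy of the helper so this sketch runs on its own.
// Assumption: the final search entry is a double space.
function convert_smart_quotes($string)
{
    $search = array(chr(194), chr(146), chr(147), chr(148), chr(151), chr(174), "  ");
    $replace = array("", "'", '"', '"', '-', ' ', ' ');
    return str_replace($search, $replace, $string);
}

// Made-up sample: Windows-1252 curly quotes, a long dash and a double space.
$raw = "It" . chr(146) . "s " . chr(147) . "fast" . chr(148) . "  " . chr(151) . " really";
$clean = convert_smart_quotes($raw);

echo $clean, "\n";
```

The replacements run in order, so the curly quotes and dash become plain ASCII first and the double space is collapsed last.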

Next we’re going to want to add a few lines to load all our urls from a text file.

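$display = ''; // collects the HTML we print at the end
// Drop the trailing newline from each URL and skip blank lines.
$lines = file('urls.list', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($lines as $line_num => $line)
{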

If you’re quick, you’ll notice that we’re missing a closing brace. That’s fine, we’ll add it later.

What this does is pretty simple: it looks for a file called “urls.list” in the same directory as the script, and for each line of the file it runs your scraper. This is where you’d put your list of URLs, one per line, in a format that looks like this:

url1  
url2  
url3  
etc

So, if you want to test, you can just put one or two URLs in here and then populate it later. It’s really up to you, but it’s nicer to have this part of the overall scraper already done. I’m not going to tell you where or how to get your URL list; gotta leave some things up to you, right?

Now, inside our foreach loop, we’re going to want to add some more code. This is the start of our scraper and the connections it makes:
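If your list is a bit messy, a little cleanup before the loop can save some failed fetches. This is only a sketch: clean_url_list is a name I made up, the "#" comment convention is my own assumption, and the example.com URLs are placeholders standing in for whatever file('urls.list') returns.

```php
<?php
// Hypothetical helper: turns the raw lines of a URL list into clean URLs.
function clean_url_list(array $lines)
{
    $urls = array();
    foreach ($lines as $line) {
        $url = trim($line); // drop the trailing newline and stray spaces
        if ($url === '' || $url[0] === '#') {
            continue; // skip blank lines and (assumed) "#" comment lines
        }
        if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
            $urls[] = $url; // keep only things that look like real URLs
        }
    }
    // Remove duplicates and reindex from zero.
    return array_values(array_unique($urls));
}

// Sample data standing in for file('urls.list'); example.com is a placeholder.
$lines = array(
    "http://example.com/item1\n",
    "\n",
    "not a url\n",
    "http://example.com/item1\n",
    "http://example.com/item2",
);
$urls = clean_url_list($lines);
print_r($urls);
```

You could then loop over $urls instead of the raw lines.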

unset($snooper);  
unset($output);  
$snooper = new Snoopy();

This unsets the snooper variable each time we run through the foreach loop on our URLs, which is important so we don’t get doubled-up information in our output later on. We unset the output variable for the same reason. The next line sets up our snooper with the class we downloaded from SourceForge, no fuss no muss.

Now we’re getting somewhere. We have a script that’s going to tone down error reporting, load up the Snoopy class, and loop through a list of URLs, creating a snooper connection for each one.

Now we need to get some data, and this is where the Xpather plugin for Firefox comes in handy. Go to one of the pages you want to scrape and select an element on the page. For our example, let’s say there’s a table with two rows and we want to grab the information from each one. We might add something like this following the code we added above:

if ($snooper->fetch($line)) {

$dom = new DomDocument();
@$dom->loadHTML($snooper->results); // @ hides warnings about messy HTML
$x = new DomXPath($dom);

// Strip newlines, carriage returns and tabs from each matched cell.
foreach ($x->query("//td[@class='tablerow1']") as $node) {
$row1 = convert_smart_quotes(str_replace(array("\n", "\r", "\t"), "", $node->nodeValue));
$output['row1'][] = $row1;
}
foreach ($x->query("//td[@class='tablerow2']") as $node) {
$row2 = convert_smart_quotes(str_replace(array("\n", "\r", "\t"), "", $node->nodeValue));
$output['row2info'][] = $row2;
}

So, what this does for us is check the source of each page in the URL list for table cells with a class of ‘tablerow1’ or ‘tablerow2’. Obviously you’ll have to change that to whatever is specific to your case.

This will give you the results of everything from row1 and row2 on the page. Now you need to display it, so we’ll add something like this after the new code:
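If you want to try the XPath part without hitting a real site, here’s a standalone sketch that runs the same kind of query against a chunk of HTML I made up. Only the class names come from the example above; the table contents are invented.

```php
<?php
// Sample markup invented for this sketch; no network or Snoopy needed.
$html = '<html><body><table>'
      . '<tr><td class="tablerow1">Weight</td><td class="tablerow2">2 kg</td></tr>'
      . '<tr><td class="tablerow1">Colour</td><td class="tablerow2">Black</td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ hides warnings about sloppy real-world HTML
$x = new DOMXPath($dom);

// Collect the text of every matching cell, in document order.
$output = array('row1' => array(), 'row2info' => array());
foreach ($x->query("//td[@class='tablerow1']") as $node) {
    $output['row1'][] = trim($node->nodeValue);
}
foreach ($x->query("//td[@class='tablerow2']") as $node) {
    $output['row2info'][] = trim($node->nodeValue);
}

print_r($output);
```

Both arrays come back in the same order as the cells appear on the page, so the entry at a given position in one lines up with the same position in the other.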

$display .= "<h2>Row 2 Info</h2><ul>";
foreach ($output['row2info'] as $k => $v) {
// Pair each row 2 value with the row 1 entry at the same position.
$display .= "<li>" . str_replace(array("\n", "\r", "\t"), "", trim($output['row1'][$k]) . trim($v)) . "</li>";
}
$display .= "</ul>";
$display .= "\n";
}

This gives you a heading for row 2, and then an unordered list with each item in row 2 displayed for you, with the matching row 1 entry as the label for each.

You can change how it displays entirely, this just happened to suit the particular need I had for it, which was to grab a table with many rows of specifications followed by the specification detail.
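As one example of changing the display, here’s a standalone sketch that escapes the scraped text before it goes into the HTML, so a stray < or & in a spec doesn’t break your page. The function name render_spec_list, the ": " separator and the sample values are all my own inventions, not part of the script above.

```php
<?php
// Hypothetical helper: pairs labels with values and escapes both for HTML.
function render_spec_list(array $labels, array $values, $heading)
{
    $html = '<h2>' . htmlspecialchars($heading) . '</h2><ul>';
    foreach ($values as $k => $value) {
        // Fall back to an empty label if the two arrays are different lengths.
        $label = isset($labels[$k]) ? $labels[$k] : '';
        $html .= '<li>' . htmlspecialchars(trim($label)) . ': '
               . htmlspecialchars(trim($value)) . '</li>';
    }
    return $html . "</ul>\n";
}

// Made-up sample data standing in for $output['row1'] and $output['row2info'].
$rendered = render_spec_list(array('Weight', 'Size'), array('2 kg', '5 < 10 cm'), 'Specifications');
echo $rendered;
```

In the scraper you’d call it with $output['row1'] and $output['row2info'] and append the result to $display.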

After this, you just have to close the foreach loop we left open earlier and print the result, and you should be good to go.

} // closes the foreach over the URL list
print($display);
?>

You can download the full script here if you don’t want to paste all this crazy stuff into a file, and just upload it along with Snoopy to your server and start playing around with it. Remember you’ll have to create a file called urls.list to work with, and modify the xpath of the items you’re trying to grab using Xpather.

Hope this helps somebody else with scraping.

Comments

Comment by klaas on 2010-05-25 23:24:41 -0500

Thanks for the tbody tip!

Comment by Matt on 2010-05-25 23:28:47 -0500

no problem klaas

Comment by Gary B on 2010-07-30 19:38:43 -0500

I’ve been using methods for this sort of thing for a long time. It’s a great time saver. One thing I suggest: some websites out there have excruciatingly bad HTML that can make the DOM throw up its hands. So, to prevent that, I will use the tidy functions to repair the HTML before I run it to the DOM.

I don’t know how this will display, but here goes:

$tidy_config = array(
'clean' => true,
'output-xhtml' => true,
'show-body-only' => true,
'wrap' => 0,
);
$tidy = new tidy();
$file = $tidy->repairString($pagehtml, $tidy_config, 'latin1');

I then usually run a preg_replace on the result to strip out non-ASCII characters, depending on the purpose.
