Main function to get the data from each page and save all of it as a CSV file that Drupal can use to import with the Feeds module:
function scrape() {
// get array of urls to query.
$url_queries = array();
// go through each letter in the alphabet and figure out how many pages on each letter.
$alphabet = range('a', 'z');
$alphabet[] = 'other';
// base path for every page.
$base_path = 'https://web.archive.org/web/20160410015723/http://aquariusrecords.org/cat/';
foreach ($alphabet as $letter) {
$current_url = $base_path . $letter . '.html';
$total = getPagesForLetter($current_url);
$url_queries[$letter] = $total;
}
$final_list_urls = [];
foreach ($url_queries as $letter => $total_pages) {
$final_list_urls[$letter] = $base_path . $letter . '.html';
for ($i=2; $i <=$total_pages ; $i++) {
$final_list_urls[$letter . $i] = $base_path . $letter . $i . '.html';
}
}
// pre_print($final_list_urls);
// redefine to single url if you want to test single pages for oddly formatted reviews:
// $final_list_urls = [];
// $final_list_urls['w8'] = 'https://web.archive.org/web/20160410015723/http://aquariusrecords.org/cat/w8.html';
// $final_list_urls['w3'] = 'https://web.archive.org/web/20160410015723/http://aquariusrecords.org/cat/w8.html';
// $final_list_urls['f3'] = 'https://web.archive.org/web/20160410015723/http://aquariusrecords.org/cat/f3.html';
$final_csv = [];
foreach ($final_list_urls as $url_index => $query) {
$formatted_reviews = scraper($query);
$csv_arrays = add_reviews_to_csv_string($formatted_reviews, $url_index);
$final_csv = array_merge($final_csv, $csv_arrays);
}
// pre_print($final_csv);
// send response headers to the browser
drupal_add_http_header('Content-Type', 'text/csv');
drupal_add_http_header('Content-Disposition', 'attachment;filename=csvfile.csv');
$fp = fopen('php://output', 'w');
$headers = array('key','artist','title','labelinfo','img','audio','audio-title','body');
fputcsv($fp, $headers, "|", '"');
foreach($final_csv as $line){
fputcsv($fp, $line, "|", '"');
}
fclose($fp);
drupal_exit();
return 'done';
}
This is the initial function I used to parse the HTML on each page into an array of raw reviews. It was tricky because reviews were not separated into different divs, it was just a lot of paragraphs. Thankfully, at the top level of the DOM each review was a paragraph, so I was able to separate them into separate entries in an array, even though each paragraph contained several paragraphs with the actual review body, title, artist, label information, price, and audio samples (all of which were optional fields). It was also easy enough to look for image tags, except that I had to carefully restrict them to only save images with the alt tag of "album cover" because there was a blank image used as a spacer all over the site called "dot.gif". This was an old website :)
function scraper($url) {
// Get all records on one page.
$html = new DOMDocument();
@$html->loadHTMLFile($url);
$reviews = array(); //hold raw review data.
$count = 0; // need this because reviews are stored as paragraph tags. But not the first 4 lol :)
$review_index = 0; // we'll use this to number the reviews as we store them in the $reviews array.
foreach($html->getElementsByTagName('p') as $ptag) {
$count++;
if ($count > 4) {
// at this point this is ONE ALBUM REVIEW.
foreach ($ptag->childNodes as $node) {
// for each thing inside a review...
if ($node->nodeValue) {
// store all paragraphs as a review. Includes variable amounts of data.
// always includes artist, title, label info.
// could include: review body paragraphs, song titles, and extra data such as messages like 'out of print' that we don't want to save in the database. Store it all for now.
// $formatted_reviews will parse this.
$reviews['review' . $review_index][] = $node->nodeValue;
}
}
// Now get all image tags to find album art for current album.
foreach ($ptag->getElementsByTagName('img') as $imgnode) {
// make sure the img tag you're storing is the album cover.
// raw HTML uses a blank 'dot.gif' image as spacer that we don't want to store.
// also, not all albums have album art stored. When you find one store it in the review under 'img'
if ($imgnode->getAttribute('alt') == 'album cover') {
$reviews['review' . $review_index]['img'] = 'https://web.archive.org' . $imgnode->getAttribute('src');
}
}
foreach ($ptag->getElementsByTagName('a') as $link) {
$reviews['review' . $review_index]['audio'][$link->nodeValue] = 'https://web.archive.org' . $link->getAttribute('href');
}
$review_index++;
}
}
// Each entry in $reviews holds an array of paragraph data with variable length including the album art:
/*
[review0] => Array
(
[0] =>
[1] => ACID MOTHERS TEMPLE & THE MELTING PARAISO U.F.O.
[2] => In O To Infinity
[3] => (Important) cd 14.98
[4] => THIS IS CURRENTLY OUT OF PRINT OR OTHERWISE UNAVAILABLE TO US AT THE MOMENT, SO PLEASE DO NOT ORDER IT. SORRY.
[5] => It's really hard to keep up with these guys. We love pretty much everything they do...
[6] => In O To Infinity marks two milestones of sorts for the bands, and for fans, one is the return of Cotton Casino to the fold...
[7] => And as much as we love all the different facets of AMT...
[8] => "In A" is all chants and grunts and weird vocalizations...
[9] => And finally, "In Infinity", which contrary to what we were ...
[10] => Another AMT winner, to add to that dedicated shelf in your house ...
[11] => MPEG Stream: "In O"
[12] =>
[13] => MPEG Stream: "In A"
[14] =>
[img] => https://web.archive.org/web/20150322130420im_/http://aquariusrecords.org/images/amtinotocd.jpg
[audio] => Array
(
["In O"] => https://web.archive.org/web/20150322130420/http://aquariusrecords.org/audio/amtinotoinfino.m3u
["In A"] => https://web.archive.org/web/20150322130420/http://aquariusrecords.org/audio/amtinotoinfina.m3u
)
)
*/
// pre_print($reviews);
// return 'test';
return format_reviews_for_saving($reviews);
}