Description:

Aquarius Records was a famous independent record store in San Francisco, founded in 1970. It was known for promoting underground music, and its site held 40 years of thoughtful, eloquent reviews that were fun to browse when looking for new music. Using the keyword search on the website, you could discover tons of new music by searching not just by album title or artist name, but by the words used in the reviews themselves, such as "psychedelic", "drone", "vintage", "blues", or even the names of other bands you like.

The problem was that when the store was sold to new owners, the website was taken down, and the archived version no longer has a working search function. I wanted to preserve all that hard work and add some extra features by porting the reviews to a Drupal site, with each review stored as a custom content type.

I created a module to export the HTML from over 49,000 archived album reviews into a CSV file so that I could import them into Drupal nodes using the Feeds module. This let me add search functionality using Views exposed filters. I also added a "Favorites List" so that a user can bookmark favorite albums, or a page of search results, to read and share later, and a forum where users can share recommendations or their own music.

This site could help the former owner if he ever wants to publish his work as a book, and it could help the community of underground music lovers continue to grow and find new music.


Code Sample:

The main function gets the data from each page and saves all of it as a CSV file that the Feeds module can import into Drupal:


function scrape() {

  // Build the array of URLs to query.
  $url_queries = array();
  // Go through each letter of the alphabet and figure out how many pages it has.
  $alphabet = range('a', 'z');
  $alphabet[] = 'other';
  // Base path for every page.
  $base_path = 'https://web.archive.org/web/20160410015723/http://aquariusrecords.org/cat/';
  
  foreach ($alphabet as $letter) {
    $current_url = $base_path . $letter . '.html';
    $total = getPagesForLetter($current_url);
    $url_queries[$letter] = $total; 
  }

  $final_list_urls = [];
  foreach ($url_queries as $letter => $total_pages) {
    $final_list_urls[$letter] = $base_path . $letter . '.html';
    for ($i = 2; $i <= $total_pages; $i++) {
      $final_list_urls[$letter . $i] = $base_path . $letter . $i . '.html';
    }
  }

  // pre_print($final_list_urls);

  // Redefine to a single URL if you want to test individual pages with oddly formatted reviews:
  // $final_list_urls = [];
  // $final_list_urls['w8'] = 'https://web.archive.org/web/20160410015723/http://aquariusrecords.org/cat/w8.html';
  // $final_list_urls['w3'] = 'https://web.archive.org/web/20160410015723/http://aquariusrecords.org/cat/w3.html';
  // $final_list_urls['f3'] = 'https://web.archive.org/web/20160410015723/http://aquariusrecords.org/cat/f3.html';

  $final_csv = [];

  foreach ($final_list_urls as $url_index => $query) {
    $formatted_reviews = scraper($query);
    $csv_arrays = add_reviews_to_csv_string($formatted_reviews, $url_index);
    $final_csv = array_merge($final_csv, $csv_arrays);
  }

  // pre_print($final_csv);

  // Send CSV response headers to the browser.
  drupal_add_http_header('Content-Type', 'text/csv');
  drupal_add_http_header('Content-Disposition', 'attachment; filename="csvfile.csv"');

  $fp = fopen('php://output', 'w');

  // Pipe-delimited so that commas in review text don't break parsing; the
  // Feeds importer is configured with the same delimiter.
  $headers = array('key', 'artist', 'title', 'labelinfo', 'img', 'audio', 'audio-title', 'body');
  fputcsv($fp, $headers, "|", '"');
  
  foreach($final_csv as $line){
    fputcsv($fp, $line, "|", '"');
  }

  fclose($fp);
  // drupal_exit() ends the request once the CSV is sent, so nothing after it runs.
  drupal_exit();
}
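
scrape() leans on a few helpers defined elsewhere in the module. For reference, here is a minimal sketch of what getPagesForLetter() does, assuming each letter's first index page contains pager links to its numbered sibling pages (a2.html, a3.html, and so on); the real helper may differ in its details:

function getPagesForLetter($url) {
  // Load the first index page for this letter.
  $html = new DOMDocument();
  @$html->loadHTMLFile($url);

  // Look for pager links like 'a2.html' or 'other3.html' and keep the
  // highest page number; a letter with no pager links has a single page.
  $max_page = 1;
  foreach ($html->getElementsByTagName('a') as $link) {
    if (preg_match('/[a-z]+(\d+)\.html$/', $link->getAttribute('href'), $matches)) {
      $max_page = max($max_page, (int) $matches[1]);
    }
  }
  return $max_page;
}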

This is the initial function I used to parse the HTML on each page into an array of raw reviews. It was tricky because the reviews were not separated into different divs; the page was just a long run of paragraphs. Thankfully, at the top level of the DOM each review was its own paragraph, so I could split them into separate array entries, even though each one contained several inner paragraphs holding the review body, title, artist, label information, price, and audio samples (all optional fields). Finding image tags was easy enough, except that I had to restrict them to images with an alt attribute of "album cover", because a blank spacer image called "dot.gif" was used all over the site. This was an old website :)

function scraper($url) {
  // Get all records on one page.
  $html = new DOMDocument();
  @$html->loadHTMLFile($url);

  $reviews = array(); // Holds the raw review data.
  $count = 0; // Reviews are stored as paragraph tags, but the first 4 paragraphs on each page aren't reviews :)
  $review_index = 0; // Used to number the reviews as we store them in the $reviews array.

  foreach ($html->getElementsByTagName('p') as $ptag) {
    $count++;
    if ($count > 4) {
      // Past the page header, so this top-level <p> tag is ONE ALBUM REVIEW.
      foreach ($ptag->childNodes as $node) {
        // For each node inside a review...
        if ($node->nodeValue) { 
          // Store all paragraphs as a review. Includes variable amounts of
          // data: always artist, title, and label info; possibly review body
          // paragraphs, song titles, and extras such as 'out of print'
          // messages that we don't want in the database. Store it all for
          // now; format_reviews_for_saving() will parse it.
          $reviews['review' . $review_index][] = $node->nodeValue;
        }
      }

      // Now get all image tags to find album art for the current album.
      foreach ($ptag->getElementsByTagName('img') as $imgnode) {
        // Make sure the img tag we're storing is the album cover: the raw
        // HTML uses a blank 'dot.gif' image as a spacer that we don't want
        // to keep. Also, not all albums have album art; when we find one,
        // store it in the review under 'img'.
        if ($imgnode->getAttribute('alt') == 'album cover') {
          $reviews['review' . $review_index]['img'] = 'https://web.archive.org' . $imgnode->getAttribute('src');
        }
      }
      // Grab the links in the review; on these pages they are the audio
      // sample streams, keyed by song title.
      foreach ($ptag->getElementsByTagName('a') as $link) {
        $reviews['review' . $review_index]['audio'][$link->nodeValue] = 'https://web.archive.org' . $link->getAttribute('href');
      }
      $review_index++;
    }
  } 

  // Each entry in $reviews holds an array of paragraph data with variable length including the album art:
  /*
[review0] => Array
    (
        [0] => 


        [1] => ACID MOTHERS TEMPLE & THE MELTING PARAISO U.F.O.
        [2] => In O To Infinity
        [3] => (Important) cd 14.98
        [4] => THIS IS CURRENTLY OUT OF PRINT OR OTHERWISE UNAVAILABLE TO US AT THE MOMENT, SO PLEASE DO NOT ORDER IT. SORRY.
        [5] => It's really hard to keep up with these guys. We love pretty much everything they do...
        [6] => In O To Infinity marks two milestones of sorts for the bands, and for fans, one is the return of Cotton Casino to the fold...
        [7] => And as much as we love all the different facets of AMT...
        [8] => "In A" is all chants and grunts and weird vocalizations...
        [9] => And finally, "In Infinity", which contrary to what we were ...
        [10] => Another AMT winner, to add to that dedicated shelf in your house ...


        [11] => MPEG Stream: "In O"
        [12] => 




        [13] => MPEG Stream: "In A"
        [14] => 




        [img] => https://web.archive.org/web/20150322130420im_/http://aquariusrecords.org/images/amtinotocd.jpg
        [audio] => Array
            (
                ["In O"] => https://web.archive.org/web/20150322130420/http://aquariusrecords.org/audio/amtinotoinfino.m3u
                ["In A"] => https://web.archive.org/web/20150322130420/http://aquariusrecords.org/audio/amtinotoinfina.m3u
            )

    )

  */

  // pre_print($reviews);
  // return 'test';
  return format_reviews_for_saving($reviews);
}
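
For reference, format_reviews_for_saving() is where the pattern matching happens. Below is a simplified sketch based on the raw array shown above: positions 1 through 3 are always artist, title, and label info; 'MPEG Stream:' lines are the audio captions already captured under 'audio'; and store notices like the 'OUT OF PRINT' message get dropped. The real function handles more edge cases than this:

function format_reviews_for_saving($reviews) {
  $formatted = array();
  foreach ($reviews as $key => $review) {
    $item = array(
      'artist' => trim($review[1]),
      'title' => trim($review[2]),
      'labelinfo' => trim($review[3]),
      'img' => isset($review['img']) ? $review['img'] : '',
      'audio' => isset($review['audio']) ? $review['audio'] : array(),
      'body' => '',
    );
    // Everything after the label line is either review body text, an audio
    // caption, or a store notice we don't want to keep.
    for ($i = 4; isset($review[$i]); $i++) {
      $text = trim($review[$i]);
      if ($text === '' || strpos($text, 'MPEG Stream:') === 0) {
        continue; // Blank spacer or audio caption (already stored under 'audio').
      }
      if (strpos($text, 'THIS IS CURRENTLY OUT OF PRINT') === 0) {
        continue; // Store-status message, not part of the review.
      }
      $item['body'] .= '<p>' . $text . '</p>';
    }
    $formatted[$key] = $item;
  }
  return $formatted;
}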


Challenges:

It was challenging to parse the data because the website was clearly very old: the album title, artist, review body, and label information were all stored in paragraph tags with no clear start or end to each review. I couldn't rely on CSS classes, and not every album had audio samples, an image, or a review body, so I had to look carefully for patterns in order to structure the data into fields for each Drupal node.

Working with this much content was also an exciting challenge. I initially tried to save the nodes directly from my module's code, without exporting to a CSV file, but this was too slow and tended to time out on anything more than one page. Each letter of the alphabet had something like 50 pages of reviews, and each page had around 50 reviews on it. Exporting a CSV file, deploying the site, and then using the Feeds module to import all the content into the live version turned out to be a good solution.
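
For context, the direct-save approach I abandoned looked roughly like this (a sketch using the Drupal 7 node API, with album_review as a hypothetical machine name for the custom content type). Running one node_save() per review inside a single page request is what made it time out:

foreach ($formatted_reviews as $review) {
  $node = new stdClass();
  $node->type = 'album_review'; // Hypothetical machine name of the content type.
  $node->language = LANGUAGE_NONE;
  $node->uid = 1;
  $node->status = 1;
  $node->title = $review['artist'] . ' - ' . $review['title'];
  $node->body[LANGUAGE_NONE][0]['value'] = $review['body'];
  $node->body[LANGUAGE_NONE][0]['format'] = 'full_html';
  // Each node_save() fires hooks and writes to several tables; with ~49,000
  // reviews this easily exceeds PHP's execution time limit.
  node_save($node);
}

The CSV-plus-Feeds route avoids the timeout because Feeds imports in batches over multiple requests instead of all at once.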

Technologies Used:
Drupal, PHP