Drupal Migrate with QueryPath (Handy for Lotus Notes, Custom Blogs etc)

Recently I had the chance to work with some really talented guys porting an old (albeit financially successful) site from Lotus Notes to Drupal. "What?" I hear you shout, "Lotus Notes?" Yes.

The main issue we had was porting content over. Sometimes we had a good insider who could write a script to generate nice XML for us to consume, but at other times that resource was not available.

So it was back to using the Drupal Migrate module to build our site content. Having used a much older version of Migrate, I was pleasantly surprised to find the 7.x-2.4 release quite stable, though with a few quirks that I hope to help you with here.

Introducing QueryPath

QueryPath is very similar to jQuery; in fact it is based heavily on it, except that it runs on the server, so you write your queries in PHP instead of JavaScript. Perfect for finding exactly what you want in someone else's HTML!

From the QueryPath page: "QueryPath is a tool for PHP developers. It is designed to make it easier for software developers to work with XML or HTML documents and web services."

Down to it..

Goals
- Migrate Requirement: Scrape a list of articles to use as an 'index'
- Migrate Requirement: Scrape each article
- Content Requirement: Save any images locally and rewrite the path
- Content Requirement: Connect to the correct content fields, including taxonomy, creating terms where they don't exist

Whilst I'm not going to go through the exact code with you, it's pretty trivial. You need to:

  • Include the class in your my_import.info ( files[] = example.inc ) and clear the class registry cache so Drupal picks it up
  • Map your fields in your Migrate class implementation

    Some nice things in 2.4 here: you can, for example, tell Migrate to create the term if it doesn't exist

    $this->addFieldMapping('field_author_reference','author')->arguments(array('create_term' => TRUE));

    Or, if you're scraping from another Drupal site, you can tell it the data is already a matching term ID

    $this->addFieldMapping('field_categories','categories')->arguments(array('source_type' => 'tid'));
  • Scrape a list with QueryPath
    This bit is easy! Just return a list; that list is your array of IDs. Each one is passed to ScrapeItem::getItem($item), where $item is a member of that array. It's best to use something like the URL as the ID, which makes it a no-brainer.
  • Grab each item in the list and use QueryPath to rip out the bits you like!
    Maybe something like:
        $return->title = trim($branch->xpath("//h1[@class='articleHeading']")->text());
        $return->title = html_entity_decode($return->title, ENT_QUOTES, 'UTF-8');
    
  • Implement getCount()
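
To make the two scraping steps concrete, here is a minimal sketch that can be run without Drupal. Note the swap: it uses PHP's built-in DOMDocument and DOMXPath rather than QueryPath, so nothing needs installing. The markup, the URLs and the example_* function names are all made up for illustration; they are not from the original project.

```php
<?php
// Hypothetical index page and article markup, standing in for the legacy site.
$indexHtml = '<ul><li><a href="/news/one.html">One</a></li>'
  . '<li><a href="/news/two.html">Two</a></li></ul>';
$articleHtml = '<html><body>'
  . '<h1 class="articleHeading"> Fish &amp; Chips </h1>'
  . '</body></html>';

/**
 * Returns an XPath helper for an HTML string.
 */
function example_xpath($html) {
  $doc = new DOMDocument();
  // Legacy markup is rarely valid, so suppress parser warnings.
  @$doc->loadHTML($html);
  return new DOMXPath($doc);
}

/**
 * Step 1: build the list of IDs, keyed by URL.
 */
function example_get_list($html) {
  $list = array();
  foreach (example_xpath($html)->query('//a[@href]') as $link) {
    $url = $link->getAttribute('href');
    $list[$url] = $url;
  }
  return $list;
}

/**
 * Step 2: pull the fields you want out of one article.
 */
function example_get_item($html) {
  $item = new stdClass();
  $nodes = example_xpath($html)->query("//h1[@class='articleHeading']");
  // Take the first matching heading, or an empty title if there is none.
  $item->title = $nodes->length ? trim($nodes->item(0)->textContent) : '';
  return $item;
}

$list = example_get_list($indexHtml);
$item = example_get_item($articleHtml);
```

In a real import the two functions would live in your list and item classes, and the HTML would come from fetching each URL in the list rather than from hard-coded strings.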

If all goes well, you should have a really nice class ready in the Migrate interface control panel. One side note: the base MigrateList class does not support the highwater feature, and Migrate will double-check all existing items before continuing. That means it runs ::getItem(...) for every existing import each time you run the migration. To avoid this, remove the already processed items from your array in getList/getCount:

    $list = myArrayOfThingsKeyedByURL();

    $result = db_select('migrate_map_examplemigrate')
      ->fields('migrate_map_examplemigrate', array('sourceid1'))
      ->condition('needs_update', 0)
      ->execute();

    foreach($result as $value) {
      if(array_key_exists($value->sourceid1, $list)) {
        unset($list[$value->sourceid1]);
      }
    }
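
If you want to check the filtering logic away from the database, the same idea can be sketched with plain arrays. This is only an illustration: example_pending_items() and the sample data are hypothetical, and $imported stands in for the sourceid1 values the query above returns (in Drupal 7 you could collect them with ->execute()->fetchCol()).

```php
<?php
/**
 * Drops already imported IDs from a URL-keyed list.
 *
 * @param array $list
 *   Items keyed by source ID (the URL).
 * @param array $imported
 *   Source IDs already migrated and not flagged for update.
 *
 * @return array
 *   The entries of $list that still need importing.
 */
function example_pending_items(array $list, array $imported) {
  // array_flip() turns the ID values into keys so array_diff_key() can
  // compare them against the keys of $list.
  return array_diff_key($list, array_flip($imported));
}

// Hypothetical data standing in for the scraped list and the map table rows.
$list = array(
  '/news/one.html' => 'One',
  '/news/two.html' => 'Two',
  '/news/three.html' => 'Three',
);
$imported = array('/news/one.html', '/news/three.html', '/news/gone.html');

$pending = example_pending_items($list, $imported);
```

IDs that are in the map table but no longer in the scraped list (like '/news/gone.html' here) are simply ignored, which matches what the array_key_exists() check does in the loop.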

Have fun!