How to harvest WordPress (or other RSS feeds) into Primo

Created By: Deborah Fitchett

Created on: 2/18/2019

We know students sometimes search in Primo for content that's actually on our website, so we wanted to index our website content in Primo. Luckily our website runs on WordPress. WordPress doesn't provide the OAI feed that Primo harvests, but it does have an RSS feed, so we wrote some PHP code to convert the RSS feed into a basic OAI format. The code is available as RSS2OAI on GitHub, and you can see an example record on Primo.
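
As a rough illustration of the kind of conversion involved, here is a minimal Python sketch that maps one RSS item to an oai_dc record. It is not the RSS2OAI code (which is PHP). The field mapping, the function name rss_item_to_oai_dc and the sample feed are all assumptions made for illustration; check the RSS2OAI source for what it actually does.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# Keep the conventional prefixes when the record is serialised.
ET.register_namespace("dc", DC_NS)
ET.register_namespace("oai_dc", OAI_DC_NS)


def rss_item_to_oai_dc(item):
    """Map one RSS <item> element to an <oai_dc:dc> element.

    The mapping below is an assumed one, chosen for illustration only.
    """
    record = ET.Element(f"{{{OAI_DC_NS}}}dc")

    def add(dc_name, text):
        # Skip fields that are missing or empty in the feed.
        if text:
            ET.SubElement(record, f"{{{DC_NS}}}{dc_name}").text = text.strip()

    add("title", item.findtext("title"))
    add("identifier", item.findtext("link"))
    add("date", item.findtext("pubDate"))
    add("creator", item.findtext(f"{{{DC_NS}}}creator"))
    # Two descriptions: the short excerpt first, then the full content.
    add("description", item.findtext("description"))
    add("description", item.findtext(f"{{{CONTENT_NS}}}encoded"))
    for category in item.findall("category"):
        add("subject", category.text)
    return record


# Hypothetical sample feed with a single item.
SAMPLE_RSS = """<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Library</title>
    <item>
      <title>Opening hours</title>
      <link>https://www.yoursite.com/hours</link>
      <pubDate>Tue, 03 Jan 2017 21:15:00 +0000</pubDate>
      <dc:creator>Library staff</dc:creator>
      <category>Services</category>
      <description>When we are open.</description>
      <content:encoded>Full text of the opening hours page.</content:encoded>
    </item>
  </channel>
</rss>"""

channel = ET.fromstring(SAMPLE_RSS).find("channel")
records = [rss_item_to_oai_dc(item) for item in channel.findall("item")]
print(ET.tostring(records[0], encoding="unicode"))
```

A real OAI response also wraps each record in the OAI-PMH envelope (header, datestamp and so on), which this sketch leaves out.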

To use the code

In WordPress:

You probably want to include pages as well as posts. In this case you'll need to install the plugin "RSS Includes Pages".

On your own PHP-enabled web server (we use PHP version 5.6.35):

Save index.php
Configure it:
- The URL of your WordPress site's RSS feed
- Your contact email address
- (Optional) A text statement for the dc.rights field, e.g. if you want to mark it as open access, or as all rights reserved, or something else
- How you want to deal with $from dates (see the README for more background on this). If your WordPress site is just a blog and you don't edit old posts, then $useBuildDate should be false. But if you edit WordPress pages and want Primo to have the most recent version, then $useBuildDate should be true.
- Whether your new OAI feed will be accessed over https (recommended)
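
If you're not sure which $useBuildDate setting suits your site, it can help to look at the dates your feed actually exposes. The Python sketch below reads only two standard RSS 2.0 elements, the channel's lastBuildDate and each item's pubDate. The function name feed_dates and the sample feed are hypothetical, and the sketch says nothing about how RSS2OAI itself uses these dates (see its README for that).

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def feed_dates(rss_xml):
    """Return (lastBuildDate, [item pubDates]) parsed as datetime objects."""
    channel = ET.fromstring(rss_xml).find("channel")
    build_text = channel.findtext("lastBuildDate")
    build_date = parsedate_to_datetime(build_text) if build_text else None
    pub_dates = [
        parsedate_to_datetime(item.findtext("pubDate"))
        for item in channel.findall("item")
        if item.findtext("pubDate")
    ]
    return build_date, pub_dates


# Hypothetical feed: one old page, but a recent build date.
SAMPLE = """<rss version="2.0"><channel>
<title>Library</title>
<lastBuildDate>Mon, 18 Feb 2019 02:00:00 +0000</lastBuildDate>
<item>
  <title>Opening hours</title>
  <pubDate>Tue, 03 Jan 2017 21:15:00 +0000</pubDate>
</item>
</channel></rss>"""

build_date, pub_dates = feed_dates(SAMPLE)
newest = max(pub_dates)
print("Feed built:", build_date.isoformat())
print("Newest item published:", newest.isoformat())
print("Build date is later than newest item:", build_date > newest)
```

A build date well after the newest pubDate may indicate that older content has been edited since it was published, which is the situation described above for $useBuildDate = true.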

In Primo Back Office:

Add a new source (Local Data > Data Sources > Add a New Data Source) using
- system=Other
- format=DC
- file-splitter=OAI splitter
- record path=oai_dc:dc
- character set=UTF-8
Create a normalisation set (Local Data > Normalization Sets). It's easiest to duplicate an existing OAI-based normalisation set if you have one; otherwise just make sure you at least deal with:
- dc:type
- dc:identifier
- dc:date
- dc:title
- dc:creator
- dc:subject
- dc:description - note that there'll be two copies of dc:description. The first is a short 'blurb' which you may want to display on the details tab; the second is the full text of the post/page, which you may want to index for searching.
- dc:publisher
- dc:rights (if you didn't leave this blank).
- We additionally created a links:thumbnail rule, and a ranking:booster1 rule.
Create a line in the delivery mapping table "GetIT! Link 2 Configuration" (General > Mapping Tables) for your data source code. This stops the GetIt link from appearing on your records. The fields in this line should be:
- Online Resource
- not_restricted
- display
- linktorsrc
Deploy your changes (Deploy & Utilities > Deploy All).
Create a new regular pipe (Publishing > Create New Pipe) with
- harvesting method=OAI
- metadata format=oai_dc
- server=https://www.yoursite.com/path/to/index.php
Run your pipe, wait 12-24 hours for indexing, and test.
When you're happy, schedule your pipe to run daily (Publishing > Scheduler).
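
Before running the pipe, it can be useful to check what your OAI feed returns. The Python sketch below builds a standard OAI-PMH ListRecords request URL for the oai_dc metadata format and summarises a response, including the two dc:description values mentioned above. The function names and the sample response are hypothetical, and the exact request Primo sends to your server may differ from the one shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, from_date=None):
    """Build an OAI-PMH ListRecords URL requesting oai_dc records."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if from_date:
        params["from"] = from_date  # e.g. "2019-02-18"
    return base_url + "?" + urlencode(params)


def summarise(response_xml):
    """Return each record's title and its dc:description values, in order."""
    root = ET.fromstring(response_xml)
    summaries = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        title = record.findtext(f".//{{{DC_NS}}}title")
        descriptions = [
            d.text or "" for d in record.iter(f"{{{DC_NS}}}description")
        ]
        summaries.append({"title": title, "descriptions": descriptions})
    return summaries


# Hypothetical response, trimmed to the parts this sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>https://www.yoursite.com/hours</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Opening hours</dc:title>
          <dc:description>Short blurb</dc:description>
          <dc:description>Full text of the page</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://www.yoursite.com/path/to/index.php")
print(url)
for summary in summarise(SAMPLE_RESPONSE):
    blurb, full_text = summary["descriptions"][:2]
    print(summary["title"], "|", blurb, "|", full_text)
```

You can paste the printed URL into a browser to see the live response from your own server, then compare the record count and descriptions with what appears in Primo after indexing.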

This feed doesn't deal with pages that have been deleted from your WordPress site, so from time to time you may also need to run a delete-and-reload pipe to keep things tidy.
