PressForward and Ethical Content Scraping

What is PressForward?

Roughly speaking, PressForward is a back-end WordPress plugin developed by the Center for History and New Media that allows users to aggregate, curate, and redistribute web content pulled from RSS or Atom feeds, or through PressForward’s bookmarklet tool. Once a site-runner has added their desired feeds to the plugin, or has marked content for rehosting through the bookmarklet, they can review specific pieces, add metadata, format them for WordPress, add any categories or tags they wish, and finally publish the content on their blog.

Screenshot of the PressForward Dashboard, taken from the PressForward User Manual.

Aggregate

There are a few ways to start collecting content for rehosting through PressForward, but let’s start with web feeds. With RSS (Really Simple Syndication, or alternatively Rich Site Summary), a website publishes a small text file (in XML) that lists its latest posts. A user who wants to “subscribe” to that site adds the file’s address to a feed reader program like Feedly or The Old Reader, which then checks the file regularly and builds a feed of automatically aggregated content. So instead of visiting a bunch of blogs individually, you could just have posts from all of them pulled into the feed reader to create your own newsfeed. Atom, on the other hand, is a more recently created alternative format to RSS. Linking these feeds to PressForward creates a feed of content within your WordPress site (visible only to you), from which you can begin to select specific content for rehosting.
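To make the idea of a feed a little more concrete, here is a minimal sketch in Python of what a feed reader does with an RSS file: it reads the XML and pulls out the title, link, and summary of each post. The sample feed, its example.org URLs, and the function name `read_feed_items` are all made up for illustration. This is not PressForward’s own code, just a rough picture of the general technique.

```python
import xml.etree.ElementTree as ET

# A tiny, invented RSS 2.0 feed. A real reader would download this
# text from the address the user subscribed to.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example History Blog</title>
    <link>https://example.org/</link>
    <item>
      <title>First post</title>
      <link>https://example.org/first-post</link>
      <description>A short summary of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.org/second-post</link>
      <description>A short summary of the second post.</description>
    </item>
  </channel>
</rss>"""


def read_feed_items(rss_text):
    """Return one dict per <item> in an RSS 2.0 document (hypothetical helper)."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in read_feed_items(SAMPLE_RSS):
        print(entry["title"], "->", entry["link"])
```

Note that each entry keeps its link back to the original post, which is the piece of metadata that matters most for the attribution questions discussed later.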

The second key way to collect content is through PressForward’s “Nominate This” bookmarklet. While RSS feeds pull content from designated sites as it is published, “Nominate This” allows for a more intentional selection of specific content from specific sites. Say you found a cool blog post on a site you have not added to your feeds, or for which RSS is not available. In this case you can just click the “Nominate This” button on your browser’s toolbar and send the selected content to your WordPress Drafts section manually. If the site does have an RSS feed that you have not yet subscribed to, the tool also offers you the option to subscribe.

Nominate This bookmarklet in action. In this instance the old homepage of the Center for History and New Media website has been pulled for republication. Image taken from the “Installing and Using the Nominate This Bookmarklet” section of the PressForward User Manual.

Curate

Once you’ve got your feeds set up and articles from other sources nominated, it’s time to curate. At this stage you can start picking out content from your feeds for republication on your blog. There are two key panels to use here: the “All Content” panel and the “Nominated” panel. The former contains all of the content pulled from your RSS feeds that is pending review and nomination; the latter contains content that you have marked for republication. At either stage you can use the Reader View option to open the content and check for readability and any errors in the text or formatting before sending it over to either the Nominated panel or a WordPress draft.

Redistribute

Now that you’ve sent content over to the Drafts section, all that remains is formatting and editing the post and publishing it to your blog like any other post. This brings us to the overarching goal of the plugin: to disseminate scholarship, blogs, digital projects, etc. to a wider audience by allowing bloggers and site runners to curate their own informal journals, so to speak. Unlike content-scrapers, which have a less than stellar reputation among digital content creators, PressForward is not intended to be a platform by which people can collect and republish content to their sites in an unethical drive to increase their own site traffic (and ad revenue) by rehosting others’ unattributed work. Yet when you get down to brass tacks, I don’t really think it’s all that far off from such tools.

PressForward does a few things to encourage responsible aggregation and republication: it “offers the option to auto-redirect back to the original source,” it “retains detailed metadata about each aggregated post,” and “the original author’s name will appear with a republished post if you use WordPress default themes such as Twenty Fourteen.” The FAQs also emphasize that author consent should be sought before republishing. Reading through the plugin’s Manual and FAQs, I noticed that there are a lot of “ifs” involved when it comes to the display of metadata. If users want to display more metadata, they have to use Custom Fields. If users have the overwrite author option enabled (it is by default but can be shut off), the author of the original post will be displayed on the rehosting site. Links to the original post are contained in the new Draft post, but can be deleted if the user chooses to do so. None of these options seems to impose a strict requirement that users include metadata in their final posts. If a user does not “use default themes,” will the metadata still appear?

I don’t mean to be overly critical of PressForward in this respect, especially as there are far easier ways to go about plagiarism, and chances are digital humanities scholars aren’t exactly the same level of target for content-scrapers as, say, artists or tech reviewers. But I do think the conversation surrounding the ethics of content-scraping and rehosting is an interesting one to have, especially if we are talking about shifts in the landscape of scholarly publication. While scholars may not be producing their content for ad revenue as other types of digital producers may be, is it ethical for a “big” blog like Digital Humanities Now (which does actually publish a full list of the feeds they are subscribed to) to pull content (and views) away from the original authors’ pages? Is rehosting really all that different from linking to a blog post as a form of citation (I think it is)? While it could certainly be argued that there are philosophical differences in the motivations behind publishing a scholarly article and a swing cover of Nirvana’s “Smells Like Teen Spirit,” shouldn’t scholars still have a right to their labors?

10 Replies to “PressForward and Ethical Content Scraping”

  1. Thanks for the great post, Sean! I came across a post on the PressForward website about how the software can be used by grad students. The author cautions grad students to only display a passage from the original post and to include a link back to the original article because “you want to attract more traffic to the original post; not draw traffic away from it.” The problem is that the onus is on the user. While some scholars may use it to share and comment on scholarship in their field in order to engage in the literature, others may not be so conscientious about intellectual property.

    One aspect of this software that I find interesting is how to use PressForward to collect your own intellectual property in one place. As someone thinking about establishing my own online presence one day, this tool is great in that regard. As the article states, it’s also great to have all your work in one place for potential employers, peers, or collaborators to see!

    For the full post, see: https://pressforward.org/using-pressforward-as-a-graduate-student/

    1. Agreed, I feel like this is a situation in which the program should be more forceful in its insistence that users adhere to ethical sourcing. PressForward certainly does make an effort to encourage such behavior, but I just don’t see why it should even be an option for a user to choose whether and how metadata about the source they’re rehosting is displayed.

      Interesting article, thanks for sharing! That’s a really creative way to use the program, Erica! I had not even considered using it as a means to consolidate your own online portfolio.

  2. Very interesting perspective Sean. This reminds me of when Yahoo was criticized a few years ago for a rehosting issue that many considered just plain plagiarism. On one hand, I can see rehosting as a great benefit to sharing scholarly content; however, how do you make sure traffic still flows to the original source?

    1. I think doing something similar to what Kotaku does is a good start, though I’m pretty sure Kotaku does what it does because the sites it rehosts content from are all part of the same corporate media network (not sure about this). When you read through its homepage, it appears as if all of the content is being hosted directly by Kotaku itself. However, when you click to read more about a post, you are sometimes redirected to the original site of publication, like Gizmodo or The A.V. Club. PressForward’s auto-redirect feature does something similar but can be turned off by the user. Striking a middle ground in this way, with the redirect always on, seems like the best compromise here. It would allow rehosted posts to appear seamlessly in your blog’s content feed while forcing a redirect to the original site of publication in order to prevent content leeching. If your goal is simply the dissemination of scholarship, the fact that it’s not “really” being hosted on your blog should be a non-issue.

  3. There’s no question that scholars should own the fruits of their labor! It seems to me that the arguments in favor of strict, ethical content-scraping may actually benefit the field in ways complementary to Fitzpatrick’s suggestions for reforming academic publishing. Content-scraping blogs such as Arts and Letters Daily (https://www.aldaily.com/) gather both works of the news press as well as scholarship and have wide readership. If what can ethically be fed, at least in the short term, from academic presses and databases such as ProQuest and JSTOR were aggregated on an easy-to-navigate site, people might be more inclined to read book chapters or journal articles accompanied by other, non-academic works. This way, scholars gain both credit for their work and recognition of it from a wider audience than those with access to the digital spaces behind paywalls.

    1. *gather both news press as well as scholarly writing that is published in popular venues with wide readership.

    2. I think you make a great point about the pairing of academic works with non-academic works, and I definitely agree that doing so could be a great way to disseminate a wide variety of content for users to explore at a rate they are comfortable with. Your comment got me thinking, however, about whether or not academic articles and book chapters, as a form of writing, are conducive to content scraping for this objective. Are these types of scholarly products compatible with the blog form?

      In technical terms, the primary obstacle here, at least if we’re using PressForward, seems to be the paywall and academic databases’ preference for only uploading articles as PDFs. Obviously, as you pointed out with Arts and Letters, the machinery is out there to do so; I’m just not sure that PressForward can do it. The crux of it is that I am not sure whether PressForward’s Nominate This bookmarklet is compatible with PDFs. If it is, then this is a non-issue, but if it is not, we’ve got a bit of a problem. Based upon my reading of the User Manual, PressForward seems to be primarily geared towards pulling full-text content from a single web page, while academic databases like ProQuest and JSTOR often tuck the actual content away behind a downloadable PDF link. Nominating the web page hosting the PDF file would only give you the text of the page, which isn’t the actual content we want to pull. If the Nominate This bookmarklet could pull the full text from a PDF file, then we’re in business. But at that point wouldn’t it have been easier to just download the file and upload it as an attachment to, say, a blog post about that article? That said, I don’t know what the copyright implications of doing so are.

      There’s also the issue of translating a full-blown academic article or book chapter to the blog form. I can only speak for myself in this regard, but I personally have a great deal of trouble focusing on blog posts that translate into 10-plus page documents. Something about the lack of pages to orient myself makes it incredibly difficult for me to stay on track and actively consume the information being presented. Seeing a full reproduction of a book chapter or lengthy article as one long, unbroken chain of text is the surest way to make me lose any sense of focus.

  4. I think this is a super interesting concept and certainly one worth pursuing. Putting aside the ethical concerns for the moment, I think projects like this can be a valuable tool to share historical scholarship far and wide. There are so many great articles or book chapters that general audiences and non-university students could gain a lot from but will never read because they are too hard to find. But if there were a blog that people could visit that periodically posted really interesting articles about history, then I think historical articles would get much wider attention. This in turn would help make the public conversation around history more nuanced.

    I know I read more journalism when people I trust are posting the articles to their Twitter or their blog and I think finding an ethical way to do that for historical articles and book chapters would be a great tool.

    1. Definitely agree! I think tools like PressForward have a lot of potential to help put academics’, and especially us public historians’, calls for greater public access to historical knowledge into practice. The one thing I’ve kind of realized in reading through everyone’s comments and revisiting the PressForward page today, though, is that I’m actually not super sure what the copyright implications are in disseminating articles and book chapters in this manner. Would it fall under fair use since it’s for educational purposes?

  5. Okay, so I’m going to date myself here, but back in the old BBS installs you could essentially put a character limit on a quote block. This forced users to be both sensitive to the limits of the server and cognizant of the fact that they may end up plagiarizing when they copy and paste whole articles. Nowadays, nobody uses this kind of functionality, and some major websites tell users to only quote the first paragraph or else the moderators will get you.

    I’m of two minds on this: something like PressForward would definitely benefit from the option to limit excerpt size with an automatic “(read more)” linkback, but on the other hand, moderation and self-moderation are important tools for any (dating myself again here) netizen. Emphasis on careful excerpting length is a lost art in these kinds of apps. Aggregation is apparently the wave of the future.
