Web scraping based on Feeds


Web scraping is the process of automatically collecting information from the World Wide Web. Current web scraping solutions range from ad hoc approaches that require human effort to fully automated systems that can convert entire websites into structured information, within certain limitations.

Feeds is a module used to import or aggregate data as nodes, users, taxonomy terms, or simple database records. Common use cases include:
• Import or aggregate RSS/Atom feeds
• Import or aggregate CSV files
• Import or aggregate OPML files

We can also use it to import or aggregate web data that has no RSS feed. In this session, I will share my experience of collecting information from the internet with Feeds, including:

  • 1, Installing the related modules: Feeds, Feeds crawler, Feeds xpathparser, Feeds selfnode processor, and Feeds smartparser.
  • 2, Importing RSS through Feeds, and importing full article content through Feeds selfnode processor and Feeds smartparser.
  • 3, Importing HTML list pages without RSS through Feeds xpathparser and Feeds crawler, and importing full article content through Feeds selfnode processor and Feeds smartparser.
  • 4, Using crontab to import web data periodically.
  • 5, Handling common exceptions when importing web data.
  • 6, Saving remote images and files from the main content to the local server.
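
To illustrate the idea behind step 3, here is a minimal sketch of pulling article titles and links out of a list page with an XPath expression. It is written in Python only for illustration and is not how Feeds xpathparser is implemented. The sample page, the "news" class name, the example.com URL, and the extract_items function are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed list page. A real page would be fetched over HTTP
# and may need an HTML-tolerant parser instead of a strict XML parser.
SAMPLE_LIST_PAGE = """
<html>
  <body>
    <ul class="news">
      <li><a href="/article/1">First article</a></li>
      <li><a href="/article/2">Second article</a></li>
    </ul>
  </body>
</html>
"""


def extract_items(page_source, base_url):
    """Return a list of (title, absolute_url) pairs found on a list page."""
    root = ET.fromstring(page_source)
    items = []
    # ElementTree supports a limited XPath subset. This expression selects
    # every <a> inside an <li> of the <ul class="news"> element.
    for link in root.findall(".//ul[@class='news']/li/a"):
        title = (link.text or "").strip()
        # Resolve relative links against the address of the list page.
        url = urljoin(base_url, link.get("href", ""))
        items.append((title, url))
    return items


items = extract_items(SAMPLE_LIST_PAGE, "http://example.com/news")
for title, url in items:
    print(title, url)
```

Each extracted URL would then be the starting point for fetching the full article content, which is the part that Feeds selfnode processor and Feeds smartparser handle in the Drupal workflow.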

Speaker's Real-World Experience
1, Author of several Feeds-related modules, such as Field collection feeds, Feeds Tamper: Search term by field, and Feeds Smart Parser.
Author and maintainer of several other modules, such as Field validation, Image URL formatter, User picture field, and Static page.

2, Worked on several real projects that use Feeds to import data.

3, I gave a similar presentation in Chinese at DrupalCampChina 2014, which was very useful for Chinese Drupalers. You can find it here.

Speaker's shortcoming
I am not a native English speaker, and this will be my first time speaking publicly in English. But I think I can overcome this by:
1, Preparing thoroughly.
2, Demonstrating each step with a concise explanation instead of talking a lot.


Session Track

Site Building

Experience Level


Drupal Version