# web2atom

A simple utility that parses a web page containing multiple items and turns it into an Atom feed.

Requirements:

* Perl

Get it:

    git clone git://gopher.icu/web2atom

## Command-line usage

    curl -s 'http://greatsite.com' | web2atom -p "great_site"

## Sfeed integration

This utility was designed to work with sfeed[1]. To enable this, modify ~/.sfeed/sfeedrc to look like the following:

```
feeds() {
	# The third argument is used here as the web2atom profile name.
	feed 'Ebay - ZX Spectrum' 'https://www.ebay.co.uk/sch/i.html?_nkw=zx+spectrum&_sop=10&rt=nc&LH_PrefLoc=1' 'ebay_watcher'
	#feed 'Another great site' 'http://greatsite.com' 'great_site'
}

parse() {
	if [ "$3" = "ebay_watcher" ]; then
		web2atom -p "$3" | sfeed
	#elif [ "$3" = "great_site" ]; then
	#	web2atom -p "$3" | sfeed
	else
		sfeed "$3"
	fi
}
```

The above is a working example. You will of course wish to tailor it to your own purposes.

## web2atom

Inside the program you will find a list of profiles:

```
my %profiles = (
	'ebay_watcher' => {
		itm => '<li class="s-item s-item__pl-on-bottom s-item--watch-at-corner".*?>.*?<\/div><\/div><\/li>',
		dmap => {
			## Default Atom
			link => 'class=s-item__link href=(.*?)\?.*?>',
			title => '<h3 class=\"?s-item__title.*?>(?:<span.*?<\/span>)?(.*?)<\/h3>',
			published => 's-item__listingDate"><span class=BOLD>(.*?)<\/span><\/span>',
			content => '',
			## Custom fields used in applyCustomFormatting()
			price => '<span class=s-item__price>(.*?)<\/span>',
			postage => '<span class="s-item__shipping s-item__logisticsCost">(.*?) postage<\/span>',
			buyprice => '<span class=s-item__price>(.*?)<\/span>',
			buyitnow => '<span class="s-item__dynamic s-item__buyItNowOption">(.*?)<\/span>'
		}
	}
);
```

You must create a profile such as 'ebay_watcher' for each site you wish to parse. The profile must exist in the program itself, and its name must also appear in the feeds() and parse() functions in ~/.sfeed/sfeedrc, as in the example above.

It is important that the 'itm' regular expression matches one complete item at a time, since the field expressions in 'dmap' capture their values from within each item.

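
To make the relationship between 'itm' and 'dmap' concrete, here is a minimal, self-contained sketch. It is not web2atom's actual code: the profile, the sample HTML and the extract_entries() helper are all hypothetical and only illustrate the idea of first splitting a page into items and then capturing fields inside each item.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical profile for illustration only; not one shipped with web2atom.
my %profile = (
	itm  => '<li class="item">.*?<\/li>',
	dmap => {
		link  => '<a href="(.*?)">',
		title => '<h3>(.*?)<\/h3>',
	},
);

# Made-up sample page containing two items.
my $html = <<'HTML';
<ul>
<li class="item"><a href="http://example.com/1"><h3>First</h3></a></li>
<li class="item"><a href="http://example.com/2"><h3>Second</h3></a></li>
</ul>
HTML

# Split the page into items, then capture each field within its item.
sub extract_entries {
	my ($page, $p) = @_;
	my $itm = $p->{itm};
	my @entries;
	# /s lets '.' match newlines; /g steps through every item on the page.
	while ($page =~ /($itm)/sg) {
		my $item  = $1;
		# Assumed here: the whole matched item becomes the entry content.
		my %entry = (content => $item);
		for my $field (sort keys %{ $p->{dmap} }) {
			my $re = $p->{dmap}{$field};
			next if $re eq '';	# skip fields with a blank expression
			($entry{$field}) = $item =~ /$re/s;
		}
		push @entries, \%entry;
	}
	return @entries;
}

my @entries = extract_entries($html, \%profile);
print "$_->{title} <$_->{link}>\n" for @entries;
```

If 'itm' matched too little, the field expressions would find nothing inside the item; if it matched too much, fields from neighbouring items could bleed into one entry.
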
*Note:* Setting the DEBUG constant to 1 is useful while experimenting. The Firefox developer tools help when writing the regular expressions: use them to view the HTML layout of a page and select the containers you are interested in.

## Default fields

Treat the 'Default Atom' section as the required fields for a basic Atom feed. The content regular expression is left blank because it is populated later in the script with whatever is matched by the 'itm' regular expression.

## Custom fields

The 'Custom fields' are entirely optional and allow a great deal of flexibility. You can use the applyCustomFormatting() function to manipulate any of the captured fields. The example below shows how dates can be manipulated and how the title can be amended to include other data:

```
sub applyCustomFormatting {
	my ($src, @entry) = @_;

	foreach my $entry (@entry) {
		if ($src =~ /^ebay/) {
			# Fix and format date: append the current year to the
			# captured date, then reformat it (--date needs GNU date)
			my ($date, $time) = split(/ /, $entry->{'published'});
			$date .= qx#echo -n `date +"-%Y"`#;
			$entry->{'published'} = qx#echo -n `date --date="$date $time" +"%a, %d %b %Y %T %z"`#;

			# Append stuff to title
			$entry->{'title'} = "[BID $entry->{'price'} $entry->{'postage'}] - $entry->{'title'}";
			if ($entry->{'buyitnow'}) {
				$entry->{'title'} = "[BUY $entry->{'buyprice'}]$entry->{'title'}";
			}
		}
	}
}
```

[1](https://codemadness.org/sfeed.html)