
# web2atom 

A simple utility that parses a web page containing multiple items
and turns them into an Atom feed.

Requirements:
  * Perl

Get it: git clone git://gopher.icu/web2atom


## Command-line usage
Pipe the page into web2atom and select a profile with -p:

    curl -s 'http://greatsite.com' | web2atom -p "great_site"


## Sfeed integration
This utility was designed to work with sfeed[1].

To enable this, modify ~/.sfeed/sfeedrc to look like the following:
```
feeds() {
	feed 'Ebay - ZX Spectrum' 'https://www.ebay.co.uk/sch/i.html?_nkw=zx+spectrum&_sop=10&rt=nc&LH_PrefLoc=1' 'ebay_watcher'
	#feed 'Another great site' 'http://greatsite.com' 'great_site'
}

parse() {
	if [ "$3" = "ebay_watcher" ]; then
		web2atom -p "$3" | sfeed
	#elif [ "$3" = "great_site" ]; then
	#	web2atom -p "$3" | sfeed
	else
		sfeed "$3"
	fi
}

```

The above is a working example. You will of course wish to tailor
this for your own purposes.


## web2atom
Inside the program you will find a list of profiles:
```
my %profiles = (
	'ebay_watcher' => {
		itm  => '<li class="s-item s-item__pl-on-bottom s-item--watch-at-corner".*?>.*?<\/div><\/div><\/li>',
		dmap =>	{	
			## Default Atom
			link      => 'class=s-item__link href=(.*?)\?.*?>',
			title     => '<h3 class=\"?s-item__title.*?>(?:<span.*?<\/span>)?(.*?)<\/h3>',
			published => 's-item__listingDate"><span class=BOLD>(.*?)<\/span><\/span>',
			content   => '',
			
			## Custom fields used in applyCustomFormatting()
			price     => '<span class=s-item__price>(.*?)<\/span>',
			postage   => '<span class="s-item__shipping s-item__logisticsCost">(.*?) postage<\/span>',
			buyprice  => '<span class=s-item__price>(.*?)<\/span>',
			buyitnow  => '<span class="s-item__dynamic s-item__buyItNowOption">(.*?)<\/span>'
		}
	}
);
```

You must create a profile such as 'ebay_watcher' for each site you
wish to parse. The profile must exist in the program itself, and its
name must also be used in the feeds() and parse() functions in
~/.sfeed/sfeedrc, as in the example above. It is important that the
'itm' regular expression encompasses each whole item you wish to capture.
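
To illustrate how an 'itm' expression and a 'dmap' might work together,
here is a minimal sketch. It is not the actual web2atom code: the
`extract_items` sub, the `%demo` profile and the sample HTML are all
hypothetical, and it assumes every 'dmap' pattern has exactly one
capture group.

```perl
use strict;
use warnings;

# Hypothetical profile, shaped like the entries in %profiles.
my %demo = (
	itm  => '<li class="item">.*?<\/li>',
	dmap => {
		link    => '<a href="(.*?)">',
		title   => '<h3>(.*?)<\/h3>',
		content => '',
	},
);

# Return one hash ref per 'itm' match, holding every dmap field that matched.
sub extract_items
{
	my ($profile, $html) = @_;
	my @entries;

	# /s lets '.' span newlines; /g walks over every item on the page.
	while ($html =~ /($profile->{itm})/sg)
	{
		my $item  = $1;
		my %entry = (content => $item);

		foreach my $field (sort keys %{ $profile->{dmap} })
		{
			my $re = $profile->{dmap}{$field};
			next if $re eq '';    # blank patterns (e.g. content) are skipped
			$entry{$field} = $1 if $item =~ /$re/s;
		}
		push @entries, \%entry;
	}
	return @entries;
}

my $html = '<ul><li class="item"><a href="http://greatsite.com/1"><h3>First</h3></a></li>'
         . '<li class="item"><a href="http://greatsite.com/2"><h3>Second</h3></a></li></ul>';

my @entries = extract_items(\%demo, $html);
printf "%s -> %s\n", $_->{title}, $_->{link} for @entries;
```

If the 'itm' expression stops short of the item's closing tag, field
patterns that sit near the end of the item would silently fail to
match, which is why it should cover the whole item.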

*note* Setting the DEBUG constant to 1 is useful while
experimenting. Firefox developer tools also help when writing the
regular expressions: use them to view the HTML layout of a page and
select the containers you are interested in.
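
As a rough idea of what a debug switch can offer while you tune a
pattern, the sketch below reports each match attempt on STDERR. How
web2atom itself uses DEBUG is not shown here; `debug_match` is a
hypothetical helper and the constant is only assumed to resemble the
one in the program.

```perl
use strict;
use warnings;

# Assumed to resemble the DEBUG constant in web2atom; set to 1 while experimenting.
use constant DEBUG => 0;

# Hypothetical helper: try a pattern and, when DEBUG is on, report the result
# on STDERR so the feed written to STDOUT stays clean.
sub debug_match
{
	my ($field, $re, $text) = @_;
	my ($value) = $text =~ /$re/s;    # undef when the pattern does not match
	warn sprintf("%-10s %s\n", $field, defined $value ? "=> $value" : 'NO MATCH')
		if DEBUG;
	return $value;
}

my $title = debug_match('title', '<h3>(.*?)<\/h3>', '<h3>ZX Spectrum 48K</h3>');
```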


## Default fields
Treat the 'Default Atom' section as the required fields for a basic
Atom feed. The 'content' regular expression is left blank because the
script later fills it with whatever the 'itm' regular expression
matched.
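
The sketch below shows one way the default fields could be turned into
an Atom entry, with the matched item used as the content. It is an
illustration only: `xml_escape` and `atom_entry` are hypothetical names,
not functions from web2atom, and a fully valid Atom entry would also
need `<id>` and `<updated>` elements.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text and attributes.
sub xml_escape
{
	my ($s) = @_;
	$s = '' unless defined $s;
	$s =~ s/&/&amp;/g;
	$s =~ s/</&lt;/g;
	$s =~ s/>/&gt;/g;
	$s =~ s/"/&quot;/g;
	return $s;
}

# Hypothetical helper: render one captured entry as an Atom <entry>.
sub atom_entry
{
	my ($e) = @_;
	return join "\n",
		'<entry>',
		'  <title>' . xml_escape($e->{title}) . '</title>',
		'  <link href="' . xml_escape($e->{link}) . '"/>',
		'  <published>' . xml_escape($e->{published}) . '</published>',
		# The whole matched item becomes the entry content.
		'  <content type="html">' . xml_escape($e->{content}) . '</content>',
		'</entry>',
		'';
}

print atom_entry({
	title     => 'ZX Spectrum 48K & games',
	link      => 'http://greatsite.com/1',
	published => 'Tue, 12 Mar 2024 14:05:00 +0000',
	content   => '<li class="item">ZX Spectrum 48K</li>',
});
```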


## Custom fields
The 'Custom fields' are entirely optional and allow a great deal of
flexibility. You can use the applyCustomFormatting() function to
manipulate any of the captured fields.
The example below shows how dates can be manipulated and how the
title can be amended to include other data:

```
sub applyCustomFormatting
{
	my ($src, @entry) = @_;

	foreach my $entry (@entry)
	{
		if ($src =~ /^ebay/)
		{
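			# Fix and format date: 'published' has no year, so append the
			# current one, then let date(1) reformat it (needs --date support)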
			my ($date, $time) = split(/ /, $entry->{'published'});
			$date .= qx#echo -n `date +"-%Y"`#;
			$entry->{'published'} = qx#echo -n `date --date="$date $time" +"%a, %d %b %Y %T %z"`#;

			# Append stuff to title
			$entry->{'title'} = "[BID $entry->{'price'} $entry->{'postage'}] - $entry->{'title'}";
			if ($entry->{'buyitnow'})
			{
				$entry->{'title'} = "[BUY $entry->{'buyprice'}]$entry->{'title'}";
			}
		}
	}
}
```
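
The date handling above shells out to a `date` that supports `--date`,
such as GNU date. Where that is not available, something like the
following pure-Perl sketch may serve instead. It rests on assumptions
the source does not confirm: that the captured value looks like
`12-Mar 14:05`, and that the time can be treated as UTC.
`format_published` is a hypothetical name.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my @DAYS   = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
my %MONTH_INDEX;
@MONTH_INDEX{@MONTHS} = (0 .. 11);

# Hypothetical helper: turn "12-Mar 14:05" plus a year into
# "Tue, 12 Mar 2024 14:05:00 +0000". Unparseable input is returned unchanged.
sub format_published
{
	my ($published, $year) = @_;
	$year = (gmtime)[5] + 1900 unless defined $year;    # default: current year

	my ($mday, $mon, $hour, $min) =
		$published =~ /^(\d{1,2})-([A-Za-z]{3})\s+(\d{1,2}):(\d{2})$/
		or return $published;
	return $published unless exists $MONTH_INDEX{$mon};

	# timegm dies on impossible dates (e.g. 31-Feb), so guard the call.
	my $epoch = eval { timegm(0, $min, $hour, $mday, $MONTH_INDEX{$mon}, $year) };
	return $published unless defined $epoch;

	my @t = gmtime($epoch);
	return sprintf('%s, %02d %s %04d %02d:%02d:%02d +0000',
		$DAYS[$t[6]], $t[3], $MONTHS[$t[4]], $t[5] + 1900, $t[2], $t[1], $t[0]);
}

print format_published('12-Mar 14:05', 2024), "\n";
```

Avoiding the shell also means the scraped text is never passed to a
command line.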


[1](https://codemadness.org/sfeed.html)