Needing some help


atomic22
May 28th, 2009, 05:09 PM
Hello
I'm trying to scrape a site for some titles, IDs, and other information.
I've figured out the path to the FLV and I know how to get it to work in a player. Now I just need to figure out how to scrape the site for the information, and I hope that will bring me one step closer.
Can anyone give me some pointers on how to scrape a site?

I read the Plugin Creation Tutorial by Voinage and tried to use this:

match=re.compile('<a href="#" onClick="playerArticleID(.+?); return true;"><img height=".+?" alt=".+?" hspace=".+?" src="(.+?)" width=".+?" align=".+?" border=".+?" /><br>\r\n (.+?)</a>').findall(link)
but I can't get it to work. More specifically, nothing is returned.

Any help would be really appreciated.

xmcnuggetx
May 28th, 2009, 07:38 PM
Personally, I would suggest using Yahoo Pipes and scraping from there. It's a bit of a different interface, but if something changes you could just update the pipe and users wouldn't need to update their plugin.

You can check this thread:
http://forum.boxee.tv/showthread.php?t=8082

and view the source of any of the examples.

ameno
May 28th, 2009, 08:29 PM
I personally use a PHP proxy on my site and the preg_match function to scrape.

The issue you are having with your regular expression is that you are not giving it enough to work with.

For this one, my regular expression would look something like:

"/onClick=\"playerArticleID([^;]+).*src=\"([^\"]+).* ([^<]+)/"

I didn't look at the documentation, so I may not have properly escaped the semicolon in the first match or the double quotes in the second. But when done right with preg_match, this would return an array where indexes 1, 2, and 3 are the matches that you want.
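
Since the original attempt was in Python, here is a rough Python sketch of the same idea. It is a non-greedy variant of the pattern above, not a direct translation: it escapes the parentheses so that group 1 holds only the ID, and it uses `.*?` so that several entries on one page do not run together. The sample markup, the `example.com` URL, and the name `scrape_items` are assumptions for illustration, since the real page's HTML is not shown in this thread.

```python
import re

# Hypothetical sample markup; the real page's HTML is not shown in the thread.
SAMPLE_HTML = (
    '<a href="#" onClick="playerArticleID(12345); return true;">'
    '<img height="90" alt="thumb" hspace="2" '
    'src="http://example.com/thumb.jpg" '
    'width="120" align="left" border="0" /><br>\r\n Example Title</a>'
)

# Loose pattern: only anchor on the parts we need instead of every attribute.
PATTERN = re.compile(
    r'onClick="playerArticleID\(([^)]+)\);'  # group 1: ID (parentheses escaped)
    r'.*?src="([^"]+)"'                      # group 2: image URL
    r'[^>]*>\s*(?:<br\s*/?>)?\s*([^<]+)',    # group 3: title text before </a>
    re.DOTALL,                               # let . match across \r\n
)


def scrape_items(html):
    """Return a list of (article_id, image_url, title) tuples found in html."""
    return [(article_id, src, title.strip())
            for article_id, src, title in PATTERN.findall(html)]


if __name__ == "__main__":
    for item in scrape_items(SAMPLE_HTML):
        print(item)
```

If `findall` still comes back empty against the real page, it may help to print a slice of the downloaded HTML and compare it character by character with the pattern, since attribute order and whitespace often differ from what the browser's source view suggests.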

Then you can either pop those into an RSS feed on the fly, store them in a DB and use the DB to populate RSS (my personal method), or return them as a response to a urllib "API call".
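
As a rough illustration of the "RSS feed on the fly" option, the sketch below turns scraped (ID, image URL, title) tuples into a minimal RSS 2.0 document using Python's standard library. The function name `build_rss`, the feed title and link, and the choice to carry the image URL in an `enclosure` element are all assumptions made for the example; a real feed would use whatever elements the consuming player expects.

```python
import xml.etree.ElementTree as ET


def build_rss(items, feed_title="Scraped videos", feed_link="http://example.com/"):
    """Build a minimal RSS 2.0 string from (article_id, image_url, title) tuples.

    feed_title and feed_link are placeholder values for this sketch.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Generated from scraped page data"

    for article_id, image_url, title in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # The scraped ID is not a URL, so mark the guid as non-permalink.
        ET.SubElement(item, "guid", isPermaLink="false").text = article_id
        # Assumption: expose the image URL as an enclosure; length is unknown.
        ET.SubElement(item, "enclosure", url=image_url,
                      type="image/jpeg", length="0")

    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(build_rss([("12345", "http://example.com/thumb.jpg", "Example Title")]))
```

Building the XML with ElementTree instead of string concatenation means titles containing `&` or `<` are escaped automatically.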

atomic22
May 28th, 2009, 10:49 PM
Thank you both. I'll take a look at each suggestion.