TITLE: Content Grabber
AUTHOR: Benjamin Mullin
ABSTRACT:
This program will automate gathering information from multiple web pages that
do not offer RSS feeds. The user will pick which content appears on their
personal page.
WHAT:
This application will run as a web page. From this page, you will submit the
URL of a webpage that you want to extract content from. That page's content,
both text and images, will be broken down into items, and each item will be
displayed alongside a checkbox. Each checked box marks an item to be extracted
from the page. All of the grabbed content will then be displayed together on
one page.
WHY:
Every morning, I read a handful of daily bulletin sites. Specifically, I grab
content from my.yahoo.com, www.sfexaminer.com, www.surfpulse.com, and
www.blakestah.com. I want to have most of that information on one page.
HOW:
To reduce the content to its basics, just text and images, I will write an
XSLT stylesheet that attempts to filter out all markup tags. A form backed by
servlet pages will then let the user pick the content they want. Then a DTD
will be created for each page that content is grabbed from. There will also be
ways to change your personal page. I may build on the Wahoo! project.
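As a rough illustration of the XSLT filtering step, the sketch below applies a
small stylesheet through the JDK's built-in javax.xml.transform API and emits
one line per text item or image. The class name ContentGrabberSketch, the
stylesheet, and the "TEXT:" / "IMAGE:" labels are assumptions made for this
example, not part of the planned design. It also assumes the input page is
already well-formed XML with no namespace or DOCTYPE, which real pages often
are not, so a clean-up step would likely be needed first.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ContentGrabberSketch {

    // Hypothetical stylesheet: drops markup, keeps text nodes and image URLs.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      // Skip script and style elements together with their contents.
      + "<xsl:template match=\"script|style\"/>"
      // Emit the src attribute of every image as its own item.
      + "<xsl:template match=\"img\">IMAGE: <xsl:value-of select=\"@src\"/>"
      + "<xsl:text>&#10;</xsl:text></xsl:template>"
      // Ignore whitespace-only text nodes.
      + "<xsl:template match=\"text()\"/>"
      // Emit every other text node as its own item, whitespace collapsed.
      + "<xsl:template match=\"text()[normalize-space(.) != '']\">TEXT: "
      + "<xsl:value-of select=\"normalize-space(.)\"/>"
      + "<xsl:text>&#10;</xsl:text></xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet over a well-formed page and returns the item list.
    static String extract(String page) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(page)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not transform page", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page, used only to exercise the sketch.
        String page = "<html><head><title>Surf</title><style>p{}</style></head>"
            + "<body><p>Waves are big today.</p><img src=\"swell.png\"/></body></html>";
        System.out.print(extract(page));
    }
}
```

Each emitted line could then become one item next to a checkbox in the
selection form. Splitting on individual text nodes is only one possible
definition of an "item"; grouping by paragraph or table cell may work better
for some sites.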
QUESTIONS:
Do you think this will be challenging enough? What advice do you have?