Sscoop is a program to crawl through a scoop-based web site, downloading stories and comments into a local database for off-line reading or presentation in a different format. Sscoop can also generate one such alternate format, a time-sorted linear list of recent comments and/or stories; this makes it easier, in my opinion, to keep up with activity on a busy site.
Sscoop is written in perl 5 and requires a few modules. It relies on GDBM; an interface to that is, as far as I know, installed by default when perl is built, at least on linux systems. You also need Time::Local and LWP from the CPAN, and LWP has some prerequisites of its own. To check whether everything is in place, run the command "perl -c sscoop". This does a "dry-run" compilation of the program without executing it. If it reports "sscoop syntax OK", you're all set to go. I will try to put together a list of the required CPAN modules, along with some instructions on how to install everything.
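If "perl -c sscoop" complains, it can help to probe each module on its own to see which one is missing. The sketch below does that; it assumes that GDBM_File is the GDBM interface sscoop loads (the usual perl module for GDBM, but I have not confirmed that here), and check_module is just a made-up helper name.

```shell
#!/bin/sh
# Probe the perl modules named above, one at a time.
# Assumption: GDBM_File is the GDBM interface that sscoop uses.
# check_module is a hypothetical helper, not part of sscoop.
check_module() {
    # "perl -MName -e 1" loads the module and runs an empty program;
    # it exits non-zero if the module cannot be loaded.
    if perl -M"$1" -e 1 2>/dev/null; then
        echo "ok       $1"
    else
        echo "MISSING  $1"
    fi
}

for module in GDBM_File Time::Local LWP; do
    check_module "$module"
done
```

Any line that starts with MISSING names a module to install before running sscoop.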
The first time you run the program, it will spend several hours (depending on the size of the site) downloading lots of stories, diary entries, and a great many comments. That is just elapsed time, however: the program tries to be polite, so it sleeps between accesses to the website. On later runs it fetches more or less only new material, so it runs much faster: in my tests, it has taken about five or six minutes (again, depending on the site). That, too, is elapsed time, not CPU time; CPU time will be a fraction of a second.
I wrote sscoop to keep up with traffic on IP-wars.net, and that is the only site against which I've tested it. I have very briefly examined the HTML returned by one or two other scoop-based websites, and did not immediately see any differences that would break sscoop; but no guarantees. If you do find anything that breaks the program, let me know. There is one thing that's specific to IP-wars.net, and that is the name under which an anonymous poster gets listed: at IP-wars.net, such posters are listed as "Potential Recruit", while at many other sites they're referred to as "Anonymous Coward" or simply as "anonymous". If you point sscoop at a different site, you'll need to find out which name that site uses.
Here is the tarball containing sscoop and this readme file. Below is the help message which sscoop prints out if you invoke it as "sscoop --help". Farther down is a sample output.
Important: if you have used the previous version of sscoop, be aware that I've changed the format of the database, so you won't be able to use your old database. This wasn't completely arbitrary: I did it for greater flexibility in storing stuff. I do have a conversion script, which I'll be happy to hand out to anyone who wants it. But one of the changes I made was to store ratings information, which wasn't in the old version of the database; so the first thing that'll get done after a conversion will be to re-fetch everything, and thus you might as well just start over. Hopefully that won't happen again.
Enjoy!
Uwe Hollerbach
<korg@korgwal.com>
Copyright 2004, Uwe Hollerbach
Available under GPL v2. Share and enjoy!
This is sscoop version 0.1.0, a program for downloading a scoop-based website, sticking it into a local database, and organizing and displaying the downloaded data.
Usage:
./sscoop --help
    print this message, then exit

./sscoop --init website db-file [site-name]
    set up a new database "db-file" for a website: "website" should be specified just as the hostname of the computer, for example "www.ip-wars.net". site-name is an optional nick-name, for example "ip-wars" which to use when generating HTML.

./sscoop [--cached] [--save] [--log log-file] db-file
    get new stuff, optionally looking in cache directory first, and optionally saving downloaded stuff into cache directory; optionally write status messages to log file

./sscoop --dump db-file
    print author, title, etc. for every entry in db, then exit

./sscoop --html SPEC [--recent-first] [--no-anon] [--summary] db-file HTML-file
    generate an HTML page with stories and comments selected according to SPEC, and write it to HTML-file. By default, write stuff oldest-first, ie, at the top of the page, but if --recent-first is specified, reverse that. Optionally, generate just a summary, ie, leave out the actual text of the story or comment. Optionally also leave out postings by anonymous posters.
SPEC is a string of the form "$name:$interval" where $name is the name of some section, ie, "diary" or "legaldocs" or "comment", etc, or "story" to show everything except comments, and $interval is some amount of time prior to the present from which you want items: ie, '3d' means "stuff from the last three days", or '2w' means "stuff from the last two weeks". Suffixes recognized are 'h', 'd', 'w', 'm', and 'y', meaning hours, days, weeks, months, or years, respectively; if no suffix is given, it means minutes.
If you want stuff from all sections (including comments), leave out the "$name:" part of the spec: ie, if SPEC is just '3d' it means "everything in all sections in the last three days". If you want everything in one particular section, leave out the ":$interval" part of the spec: ie, if SPEC is just "comment" it means "all comments from the beginning of time onward".
Alternately, if you specify SPEC as "author:$name:$interval", it means to get everything by the particular author named $name in the specified interval, or if you specify SPEC as "rating:$value:$interval", it means to get every comment rated at $value or higher in the specified interval. These last two are still under development; it would be nice to leave out parts of the spec, but that isn't implemented yet.
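The suffix rules for $interval can be sketched as a small shell function. This only illustrates the rules described above and is not code taken from sscoop: interval_to_minutes is a made-up name, and the lengths used for a month (30 days) and a year (365 days) are my assumptions, since the help text does not say how sscoop defines them.

```shell
#!/bin/sh
# Convert an interval such as "3d" or "2w" into minutes, following the
# suffix rules in the help text. Hypothetical helper, not part of sscoop.
# Assumptions: a month is 30 days and a year is 365 days.
interval_to_minutes() {
    interval=$1
    case $interval in
        *h) factor=60 ;;                  # hours
        *d) factor=$((60 * 24)) ;;        # days
        *w) factor=$((60 * 24 * 7)) ;;    # weeks
        *m) factor=$((60 * 24 * 30)) ;;   # months (assumed 30 days)
        *y) factor=$((60 * 24 * 365)) ;;  # years (assumed 365 days)
        *)  factor=1 ;;                   # no suffix: minutes
    esac
    # Strip the one-letter suffix, if any, to leave the count.
    if [ "$factor" -ne 1 ]; then
        count=${interval%?}
    else
        count=$interval
    fi
    # Reject anything that is not a plain non-negative integer.
    case $count in
        ''|*[!0-9]*) echo "bad interval: $interval" >&2; return 1 ;;
    esac
    echo $((count * factor))
}

interval_to_minutes 3d    # three days: prints 4320
interval_to_minutes 2w    # two weeks: prints 20160
interval_to_minutes 90    # no suffix, so ninety minutes: prints 90
```

A full SPEC such as "comment:3d" would first have to be split at the colon; the sketch covers only the interval part.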
Examples:
set up a new database, pointing it at www.ip-wars.net, and giving that site the nickname ip-wars (this'll show up in the HTML page that gets generated)
    ./sscoop --init www.ip-wars.net database.db ip-wars

update the database
    ./sscoop --log ip-wars.log database.db

get all stories (ie, non-comments), most recent first, and write them into a file called "stories.html"
    ./sscoop --recent-first --html story database.db stories.html

get all comments from the past three days, leaving out anonymous posters, and write them to standard output, most recent first
    ./sscoop --recent-first --no-anon --html comment:3d database.db -

By default, the program will fetch new stuff from the website and update its database.
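The update step and the HTML step from the examples above can be combined into one small wrapper. This is only a sketch built from the flags documented in the help message; the function name refresh_site, the SSCOOP variable, and the log-file naming are placeholders of mine, not something sscoop provides.

```shell
#!/bin/sh
# Update the local database, then regenerate a page of recent entries.
# Hypothetical wrapper; SSCOOP and the file names are placeholders.
SSCOOP=${SSCOOP:-./sscoop}

refresh_site() {
    db=$1      # database file, for example database.db
    spec=$2    # selection, for example comment:3d
    page=$3    # HTML file to write
    # Fetch new stories and comments, writing status messages to a log
    # file named after the database (database.db -> database.log).
    "$SSCOOP" --log "${db%.db}.log" "$db" || return 1
    # Write the selected entries, most recent first, without anonymous posts.
    "$SSCOOP" --recent-first --no-anon --html "$spec" "$db" "$page"
}
```

For example, "refresh_site database.db comment:3d comments.html" would update database.db and then write the last three days of comments to comments.html.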
All command-line flags can be written with a single '-' instead of two, and they can be abbreviated (up to the point where they're unique: since 'html' and 'help' both start with 'h', you'd need to write '-he' for help, and '-ht' for HTML).
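The abbreviation rule amounts to unique-prefix matching against the list of flags. The sketch below shows the idea in shell; it is not how sscoop itself parses its options (that happens inside the perl program), resolve_flag is a made-up name, and the flag list is simply copied from the help message.

```shell
#!/bin/sh
# Resolve an abbreviated flag to its full name by unique-prefix matching.
# Hypothetical helper for illustration; sscoop does its own option parsing.
# The flag names are the ones listed in the help message.
resolve_flag() {
    wanted=${1#-}        # drop one leading '-'
    wanted=${wanted#-}   # drop a second one, if present
    [ -n "$wanted" ] || { echo "empty flag" >&2; return 1; }
    match=""
    count=0
    for flag in help init cached save log dump html recent-first no-anon summary; do
        case $flag in
            "$wanted"*) match=$flag; count=$((count + 1)) ;;
        esac
    done
    if [ "$count" -eq 1 ]; then
        echo "--$match"
    elif [ "$count" -eq 0 ]; then
        echo "unknown flag: $1" >&2
        return 1
    else
        echo "ambiguous flag: $1" >&2
        return 1
    fi
}

resolve_flag -ht     # prints --html
resolve_flag --he    # prints --help
resolve_flag -r      # prints --recent-first
```

With this list, "-h" is rejected as ambiguous (help or html), and so is "-s" (save or summary).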
If you want to start over, just delete the database file.
Here is an example of the HTML produced by the program: this is a list of the comments posted to www.ip-wars.net in the last two hours, as of 4:25 pm EST on Saturday the 1st of January, 2005. After the local database was updated by running "sscoop database.db", this output was generated with the command "sscoop -html 2h database.db foo.html" and then foo.html was cut'n'pasted into this file.
The site: ip-wars
Last local db update at Sat Jan 1 16:14:09 2005
Last ratings check at Mon Dec 27 16:05:37 2004
Entries since Sat Jan 1 14:25:41 2005
All Sections
Future of Open-Source Software ( none / 1 ) by br3n at Sat Jan 1 14:37:37 2005
http://www.linuxinsider.com/story/Future-of-Open-Source-Software-39284.html
Knight Ridder/Tribune Business News
Linux may not be ready for high-end transaction systems or "earned in blood" reliability, but it is certainly ready for mission-critical firewall and Web applications. Linux has become big business for a number of companies, and it continues to grow in terms of economic and technological importance.
br3n
[ Reply to This ]