wiki:wk2txt

wk2txt

A Website to Text Converter

DESCRIPTION

wk2txt is a command line tool based on the Webkit Engine to get the text content of a websites as seen by current webbrowsers which includes the evaluation of javascript code present in many modern webpages.

wk2txt can process a whole list of URLs, which can be built by using the "--input" option multiple times or by passing the filename of a list of URLs to the "--urlfile". This file is expected to have 1 URL per line.

The contents of the webpages is written to one file per webpage. The files are stored in the directory specified by the "--output" option. The filename is derived from the URL, it is the SHA1 hash code for the URL.

By default, wk2txt dumps the contents and the meta data of the webpage in plain text format. If the "--xml" option is used, content and meta data is dumped in XML format.

SYNOPSIS

wk2txt *options*

Options:
   --input URL               convert *URL* to text
   --urlfile file            convert URLs in *file* to text
   --output directory        save output in directory *directory*
   --xml                     dump in XML format
   --timeout inverval        timeout after *interval* seconds (NOT IMPLEMENTED)
   --debug                   print diagnostic information
   --help                    print help on options
   --version                 print version number

AUTHORS

Christian Plessl <christian.plessl@..remove..oetiker.ch>

Roman Plessl <roman.plessl@..remove..oetiker.ch>

SOURCE CODE

Download and get the source code from github http://github.com/rplessl/wk2txt

Last modified 10 years ago Last modified on Feb 7, 2010, 9:17:26 PM