
How to dump all the text of a website

Recently I wanted to get all the text from a website to prepare a translation. I found a few scattered tips, but here is the easy solution on a Linux system.
Simply use lynx:

lynx -crawl -traversal http://website.de

(Just replace website.de with the address of the site whose text you want.)
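
As far as I can tell, lynx writes its crawl output into the directory it is started from, so it can help to run the crawl in a directory of its own. A minimal sketch, assuming a POSIX shell; the function name crawl_texts and the directory name used below are made up for this example:

```shell
# crawl_texts is a hypothetical helper, not part of lynx.
# Usage: crawl_texts URL OUTPUT_DIR
crawl_texts() {
    url=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$out_dir" && lynx -crawl -traversal "$url" )
}
```

Calling crawl_texts http://website.de site-texts should then leave the crawl output in site-texts instead of scattering it over wherever you happen to be.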

To get all HTML pages (no images etc.), you can use wget:
wget -r -k -L -A htm,html http://website.de
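
For readability, the same wget call can be spelled out with long options. This is only a sketch: mirror_html is a made-up function name, and the long options are the GNU wget equivalents of -r, -k, -L and -A.

```shell
# mirror_html is a hypothetical wrapper, not part of wget.
# Usage: mirror_html URL
mirror_html() {
    # --recursive      (-r) follow links and download recursively
    # --convert-links  (-k) rewrite links so the local copy can be browsed offline
    # --relative       (-L) follow relative links only
    # --accept         (-A) keep only files with these suffixes
    wget --recursive --convert-links --relative --accept htm,html "$1"
}
```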

With lynx, the text of each page is saved as a .dat file, so you can start reading or translating offline.
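
For translating, one big file is often handier than many small ones. The following is a rough sketch that glues the page dumps together. It assumes the dumps are named lnk*.dat (the naming I have seen from lynx; adjust the pattern if yours differ), and merge_lynx_dumps is a made-up helper name:

```shell
# merge_lynx_dumps is a hypothetical helper, not part of lynx.
# Usage: merge_lynx_dumps SOURCE_DIR OUTPUT_FILE
# Concatenates all lnk*.dat files in SOURCE_DIR into OUTPUT_FILE,
# putting a header line with the file name before each page.
merge_lynx_dumps() {
    src_dir=$1
    out_file=$2
    : > "$out_file"                  # start with an empty output file
    for f in "$src_dir"/lnk*.dat; do
        [ -e "$f" ] || continue      # skip the literal pattern if nothing matched
        printf '===== %s =====\n' "$(basename "$f")" >> "$out_file"
        cat "$f" >> "$out_file"
        printf '\n' >> "$out_file"
    done
}
```

For example, merge_lynx_dumps . all-texts.txt collects the dumps from the current directory into all-texts.txt; the header lines make it easy to find the original page file again later.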