Move over, wget! Mirroring sites with httrack
August 23, 2007
Wget is great; I use it all the time for simple and *ahem* “bulk” downloads. But when you're after the spirit of a web page, httrack seems to do a much more thorough job. Turning a dynamic site into static content has never been easier.
I didn't want to believe that wget could be bested, but try as I might, it just wasn't working. When I tried to mirror a dynamic site for a client, the CSS was missing from the copy, and spending developer time on reconstructing a stylesheet and fixing links just wasn't going to happen.
After tweaking my wget commands and re-downloading a couple of times, I decided it was time to try another tactic. That's when I stumbled upon httrack, wget's big brother. Httrack is licensed under the GPL as well, so there's no need to feel dirty by running a proprietary solution. Let the content you download handle that for you…
httrack has many options, but Fred Cohen's suggested command worked great for me on the first pass:
httrack "http://www.example.com/" -O "./www.example.com" "+*.example.com/*" -v
httrack --help will give you all the available options, but honestly, you probably won't need them.
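If you want to reuse that command for other sites, here is a minimal sketch of a wrapper. The function name mirror_cmd is my own invention, not part of httrack, and stripping a leading "www." to get the domain for the filter is an assumption that only fits simple hostnames. It just prints the command so you can look it over before running it.

```shell
#!/bin/sh
# mirror_cmd is a hypothetical helper: it builds the httrack command
# from the post for a given start host and prints it for review.
mirror_cmd() {
    host="$1"
    # Assumption: the site's domain is the host minus a leading "www."
    domain="${host#www.}"
    # -O       output directory for the mirror, logs and cache
    # "+..."   filter that allows every subdomain and path of the domain
    # -v       verbose, log progress to the screen
    printf 'httrack "http://%s/" -O "./%s" "+*.%s/*" -v\n' \
        "$host" "$host" "$domain"
}

# Print the command for the example site used above.
mirror_cmd www.example.com
```

Called with www.example.com, it prints the same command line shown above; copy it into your shell once it looks right.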
October 11th, 2007 at 10:10 pm
[...] useful tool for mirroring websites is httrack. I blogged about it a couple of weeks ago here. Bookmark! blog entries Permalink Comments [...]
November 19th, 2007 at 6:28 pm
I’ve tried that on a number of sites and it doesn’t work. Admittedly I’m using it to download .jpg or .gif files from some people’s sites, but I do it right; I set the delay to 20 kbps so I don’t suck up their bandwidth.
Still doesn’t work, tho, even if you turn robots off.