Move over, wget! Mirroring sites with httrack

Date August 23, 2007

Wget is great; I use it all the time for simple and *ahem* “bulk” downloads. But when you're after the spirit of a web page, httrack seems to do a much more thorough job. Turning a dynamic site into static content has never been easier.

I didn't want to believe that wget could be bested, but try as I might, it just wasn't working. CSS style elements were missing when I tried to mirror a dynamic site for a client, and reconstructing a stylesheet and spending developer time on fixing links just wasn't going to happen.

After tweaking my wget commands and re-downloading a couple of times, I decided it was time to try another tactic. That's when I stumbled upon httrack, wget's big brother. Httrack is licensed under the GPL as well, so there's no need to feel dirty about running a proprietary solution. Let the content you download handle that for you…

httrack has many options, but Fred Cohen's suggested command worked great for me on the first pass:

httrack "http://www.example.com/" -O "./www.example.com" "+*.example.com/*" -v

httrack --help will give you all the available options, but honestly, you probably won't need them.
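
If you mirror more than one site, it can help to generate that command rather than retype it. Here's a rough sketch, not anything shipped with httrack: mirror_cmd is a name I made up, and stripping a leading "www." to build the filter is my own assumption about how the hostname maps to the domain. It only prints the command so you can look it over first.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of httrack): print the mirroring
# command from this post for a given hostname.
mirror_cmd() {
  local host="$1"               # e.g. www.example.com
  local domain="${host#www.}"   # assumption: drop a leading "www." for the filter
  # -O sets the output directory, the "+..." pattern is a filter that
  # allows every subdomain of the site, and -v prints verbose progress.
  printf 'httrack "http://%s/" -O "./%s" "+*.%s/*" -v\n' "$host" "$host" "$domain"
}

# Print the command for review; nothing is downloaded here.
mirror_cmd www.example.com
```

Once the printed line looks right, paste it into your shell to start the mirror.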

2 Responses to “Move over, wget! Mirroring sites with httrack”

  1. wget: some quick tips » Tip o’ the Day said:

    [...] useful tool for mirroring websites is httrack. I blogged about it a couple of weeks ago here. [...]

  2. Ten Sigh said:

    I’ve tried that on a number of sites and it doesn’t work. Admittedly I’m using it to download .jpg or .gif files from some people’s sites, but I do it right; I set the delay to 20 kbps so I don’t suck up their bandwidth.

    Still doesn’t work, tho, even if you turn robots off.
