The other day I discovered an awesome command line tool, HTTrack, that lets you effectively download a website and save it locally.
I used it to download a website of mine that runs on WordPress and convert it to a static backup of the site (images and all).
Download and install
Ubuntu
Installing on Ubuntu is cake
> sudo apt-get install webhttrack
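If you want a quick sanity check that the package actually landed (purely optional, and just my own habit), you can ask dpkg:
> dpkg -l | grep -i httrack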
Cygwin
When looking into how to do this I found this page http://sourceforge.net/projects/cygwin-ports/
[3], which led me to the cygwin-ports how-to page [4].
From Cygwin, run setup.exe, but use the -K flag to point to the Cygwin Ports project.
> cygstart -- /cygdrive/c/cygwin64/setup-x86_64.exe -K http://cygwinports.org/ports.gpg
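The /cygdrive/c/cygwin64/setup-x86_64.exe path is just where my installer happens to live; if yours is somewhere else, confirm the path first (cygpath is handy for translating the Windows path):
> cygpath -u "C:\cygwin64\setup-x86_64.exe"
> ls -l /cygdrive/c/cygwin64/setup-x86_64.exe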
When you get to "Choose a Download Site", add ftp://ftp.cygwinports.org/pub/cygwinports, then click Add and Next.
Doh! Error:
unable to get setup.ini from ftp://ftp.cygwinports.org/pub/cygwinports
… what did I do wrong? Oh, I had a space in the URL. Once I removed that it worked.
Cool, now I can just search for httrack and it shows up :)
Start a new Cygwin window and check if it's installed.
> which httrack
How does it work?
Well, for a more advanced explanation check out https://www.httrack.com/html/filters.html
[2].
Basically, you give it a start location; for example, look at this command.
> httrack -v "http://www.whiteboardcoder.com/2015/08/" -O whiteboardcoder
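When it's done, the mirror ends up under the directory named by -O. The layout below is what I'd expect from httrack's usual output structure, not something pasted from an actual run:
> ls whiteboardcoder/
> ls whiteboardcoder/www.whiteboardcoder.com/2015/08/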
It will then go to http://www.whiteboardcoder.com/2015/08/index.html,
copy the page, and start copying images and other pages that the first starter
page links to. It's smart: it won't
start copying pages outside the URL you gave it. In fact, unless you tell it otherwise, it
will only drill down, not up. So… even
if it has a link to http://www.whiteboardcoder.com it won't copy
http://www.whiteboardcoder.com/index.html.
It won't even copy other subdomains, e.g. http://other-subdomain.whiteboardcoder.com/
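If you did want to pull in one of those out-of-scope pages without mirroring everything, my understanding is you can whitelist it explicitly with a + filter, something like this (untested sketch):
> httrack -v "http://www.whiteboardcoder.com/2015/08/" -O whiteboardcoder "+http://www.whiteboardcoder.com/index.html"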
You can define how you want to filter and even what you want to filter out.
For me, I want to copy all of *.whiteboardcoder.com. Here is the command that would do that.
> httrack -v "http://www.whiteboardcoder.com" -O whiteboardcoder "+*.whiteboardcoder.com/*"
This is the filter:
"+*.whiteboardcoder.com/*"
It matches all subdomains and every file (that the site links to… as
long as they are part of the same domain).
If you want to get detailed about what you skip and what you
grab, see https://www.httrack.com/html/filters.html
[2]. It may be of some value to you… for
example, if you want to skip all .zip files you could do so with
"-*.zip" (full command sketched below).
Let me run this with the time command to see how long it
takes to download my entire blog (which is hosted at blogger.com).
> time httrack -v "http://www.whiteboardcoder.com" -O whiteboardcoder "+*.whiteboardcoder.com/*"
It took almost 1.5 hrs to download my site.
Now I have a nice self-contained version of my website on my
desktop.
External links will still take me to other URLs, but clicking from page to page just loads files
locally from my hard drive.
And the original URL is preserved in the folder
structure. Nice.
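If you're curious how big the mirror ended up, or just want to open it, a couple of ordinary shell commands will do (the directory name comes from the -O option above, and httrack drops a top-level index.html into it, at least in my runs):
> du -sh whiteboardcoder
> xdg-open whiteboardcoder/index.html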
References
[1] HTTrack main site
https://www.httrack.com/
Accessed 10/2015
[2] Filters
https://www.httrack.com/html/filters.html
Accessed 10/2015
[3] cygwin-ports
http://sourceforge.net/projects/cygwin-ports/
Accessed 10/2015
[4] cygwin-ports how to run
Accessed 10/2015