The other day I discovered an awesome command line tool, HTTrack, that lets you effectively download a website and save it locally.
I used it to download a website of mine that runs on WordPress and convert it to a static backup of the site (images and all).
Download and install
Ubuntu
Installing on Ubuntu is cake
> sudo apt-get install webhttrack
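If you want a quick sanity check that the package actually landed (purely optional, and just my own habit), you can ask dpkg:
> dpkg -l | grep -i httrack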
Cygwin
When looking into how to do this I found this page http://sourceforge.net/projects/cygwin-ports/
[3], which led me to the cygwin-ports how-to page [4].
From Cygwin, run setup.exe, but use the -K flag to point to the Cygwin Ports project.
> cygstart -- /cygdrive/c/cygwin64/setup-x86_64.exe -K http://cygwinports.org/ports.gpg
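The /cygdrive/c/cygwin64/setup-x86_64.exe path is just where my installer happens to live; if yours is somewhere else, confirm the path first (cygpath is handy for translating the Windows path):
> cygpath -u "C:\cygwin64\setup-x86_64.exe"
> ls -l /cygdrive/c/cygwin64/setup-x86_64.exe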
When you get to "Choose a Download Site", add ftp://ftp.cygwinports.org/pub/cygwinports, then click Add and Next.
Doh! Error:
unable to get setup.ini from ftp://ftp.cygwinports.org/pub/cygwinports
… what did I do wrong? Oh, I had a space in the URL. Once I removed that it worked.
Cool, now I can just search for httrack and it shows up :)
Start a new Cygwin window and check if it's installed.
> which httrack
How does it work?
Well, for a more advanced explanation check out https://www.httrack.com/html/filters.html
[2].
Basically, you give it a start location; for example, look at this command.
> httrack -v "http://www.whiteboardcoder.com/2015/08/" -O whiteboardcoder
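When it's done, the mirror ends up under the directory named by -O. The layout below is what I'd expect from httrack's usual output structure, not something pasted from an actual run:
> ls whiteboardcoder/
> ls whiteboardcoder/www.whiteboardcoder.com/2015/08/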
It will then go to http://www.whiteboardcoder.com/2015/08/index.html,
copy the page, and start copying images and other pages that the first starter
page links to. It's smart: it won't
start copying pages outside the URL you gave it. In fact, unless you tell it otherwise, it
will only drill down, not up. So… even
if it has a link to http://www.whiteboardcoder.com it won't copy
http://www.whiteboardcoder.com/index.html.
It won't even copy other subdomains, e.g. http://other-subdomain.whiteboardcoder.com/
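If you did want to pull in one of those out-of-scope pages without mirroring everything, my understanding is you can whitelist it explicitly with a + filter, something like this (untested sketch):
> httrack -v "http://www.whiteboardcoder.com/2015/08/" -O whiteboardcoder "+http://www.whiteboardcoder.com/index.html"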
You can define how you want to filter and even what you want to filter out.
For me, I want to copy all of *.whiteboardcoder.com. Here is the command that would do that.
> httrack -v "http://www.whiteboardcoder.com" -O whiteboardcoder "+*.whiteboardcoder.com/*"
This is the filter:
"+*.whiteboardcoder.com/*"
It matches all subdomains and every file (that the site links to… as
long as they are part of the same domain).
If you want to get detailed about what you skip and what you
grab, see https://www.httrack.com/html/filters.html
[2]. It may be of some value to you… for
example, if you want to skip all .zip files you could do so with
"-*.zip" (full command sketched below).
Let me run this with the time command to see how long it
takes to download my entire blog (which is hosted at blogger.com).
> time httrack -v "http://www.whiteboardcoder.com" -O whiteboardcoder "+*.whiteboardcoder.com/*"
It took almost 1.5 hrs to download my site.
Now I have a nice self-contained version of my website on my
desktop.
External links will still take me to other URLs, but clicking from page to page just loads files
locally from my hard drive.
And the original URL is preserved in the folder
structure. Nice.
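If you're curious how big the mirror ended up, or just want to open it, a couple of ordinary shell commands will do (the directory name comes from the -O option above, and httrack drops a top-level index.html into it, at least in my runs):
> du -sh whiteboardcoder
> xdg-open whiteboardcoder/index.html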
References
[1] HTTrack main site
https://www.httrack.com/
Accessed 10/2015
[2] Filters
https://www.httrack.com/html/filters.html
Accessed 10/2015
[3] cygwin-ports
http://sourceforge.net/projects/cygwin-ports/
Accessed 10/2015
[4] cygwin-ports how to run
Accessed 10/2015