It can sometimes be useful to download a complete website, and the HTTrack tool is ideal for this. Think before you start, though, because it generates a lot of extra data traffic for the provider hosting the site you are going to download.
Downloading a website used to be a way to always have it at hand locally. In our part of the world that need has largely disappeared: gone are the days of dial-up modems and slow, unreliable connections. But if you are going on vacation or a work trip to faraway places, a reliable internet connection is a lot less of a given. That can be a reason to download a site you definitely need as reference material for work or study before you leave.
A much more interesting scenario is downloading orphaned but historically interesting websites with unique content. More and more of these are disappearing for good, for example because the owner dies or a hosting provider cleans up old 'junk'.
History disappears little by little. If you have a specific hobby for which you rely on sites that clearly have not changed in years, the chance of them unexpectedly disappearing increases. Such a scenario justifies a download.
Ideal for on the go
In principle, do not randomly download current, well-maintained websites. That costs the provider a lot of data traffic, and the downloaded material is only a snapshot in time that becomes obsolete very quickly. Focus on those really special sites, or on those rare occasions when you need such a snapshot in an area without internet access.
For downloading a site, the free tool HTTrack – incidentally a veteran that has been around for many years – is best suited. It is available in variants for Linux and Windows. The Linux variant in particular is interesting: downloading a large site can take days to sometimes more than a week, because speeds are often limited on the website's side.
This makes the Windows variant somewhat clumsy, because you have to leave a PC on for days and nights at a time. That is not only a waste of energy, but the chance that Windows restarts in the meantime because of some update is also quite high.
So you are better off installing a virtual Linux system on your NAS (if you have that option) and installing HTTrack on it. That machine is usually on 24/7 anyway, so a nice job on the side will not cost you extra energy. Or think of a Raspberry Pi with a somewhat generously sized SD card.
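If you run the command-line version of HTTrack over SSH on such an always-on machine, it helps to start it inside a terminal multiplexer so the mirror keeps running after you log out. A minimal sketch, assuming tmux is installed; the session name is just an example:

  tmux new -s mirror        # start a detachable session on the always-on machine
  # ... start the long-running HTTrack job inside this session (examples further on) ...
  # detach with Ctrl-b d and log out; the job keeps running in the background
  tmux attach -t mirror     # later: reattach to check on progress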
Debian as an example
We start with the Linux variant (but if you use Windows you can follow along too!), on a Debian machine running as a VM on a Synology NAS. To install HTTrack, start the Software app (click on Activities at the top left and then on the shopping-bag icon in the dock that opens). In the window that opens, click on the magnifying glass at the top left and enter the search term 'httrack' in the appropriate field.
Click on WebHTTrack Website Copier and install the tool using the appropriate button on the page that opens. You can then close the software store again.
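If you prefer the terminal, the same software is available from the Debian repositories. A short sketch, assuming the package names httrack (command-line tool) and webhttrack (browser interface), which is how Debian currently ships them:

  # install both the command-line tool and the browser-based interface
  sudo apt update
  sudo apt install httrack webhttrack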
The Linux version can then be found in your application overview, which you reach in Debian by clicking the dotted grid button at the bottom of the dock. Your browser is then launched. That is exactly the essential difference between the Windows and Linux versions: the Windows version is a traditional .exe program, while the Linux version is displayed in your browser via a local web server.
The functionality is the same, however, with the only downside being that the Windows version looks somewhat dated. But both work flawlessly, and that is what matters, of course. Press the Next button – at the bottom right of the screen – to get started.
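You can also start the browser interface from a terminal. A small sketch, assuming the Debian webhttrack package from the previous step; the command starts the local web server and tries to open your default browser on it:

  # launch the WebHTTrack interface (local web server plus browser window)
  webhttrack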
Getting started
As an example we take a copy of https://www.landley.net (please do not actually mirror this site locally, it merely serves as an example here). In any case, a site with a lot of historically interesting information, especially under Computer History.
To get started, enter a name for your project under New project name (the Dutch language module of the tool is not entirely error-free). This can – logically – be the name of the site, possibly with a relevant keyword. You can essentially leave the other settings as they are. Click on Next.
Copy the URL of the website to be mirrored locally and paste it into the URL box. Then click on Define settings. There are many options; below is advice on the settings that lead to the best result.
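For the record, the command-line version takes the URL and the destination folder directly as arguments; the project name simply becomes a subfolder. A hedged sketch with a placeholder URL and output path (the options discussed below can be added as extra flags):

  # mirror a site into a project folder of your choosing
  httrack "https://www.example.org/" -O ~/websites/example-mirror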
Optimal settings
Leave the settings under Links, Structure, MIME types, Browser ID, Filters, Protocol, Index, and Cache and proxy at their default values. You only need to tinker here if you have very specific wishes, which will rarely if ever be necessary. Then go to Expert. Experience shows that there, under Global travel mode, the option Stay on the same domain often produces better (read: more complete) results than the default option Stay on the same address (default). If you do not get the complete site (or just the part you want), this is the place to experiment a bit.
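In the command-line version this travel mode corresponds to single-letter flags; a small sketch with a placeholder URL (flag spellings per the httrack documentation, so double-check with httrack --help):

  # default behaviour: stay on the same address
  httrack "https://www.example.org/" -O ~/websites/example-mirror -a
  # often more complete: stay on the same principal domain
  httrack "https://www.example.org/" -O ~/websites/example-mirror -d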
Pay attention to the maximum
At the top, click on Flow Control; here too it is in principle best to leave the settings as they are. Optionally you can raise the number of connections under N# connections somewhat. That may make mirroring a bit faster, but it also puts a higher load on the server the site is hosted on.
Let's try to keep it a bit civilized and not enter something like '100' here... Click on Limits and set the option Maximum external depth to 0. If you don't, you run the risk of downloading the entire internet. That will not only take a long time, but also – understatement – a lot of local storage space.
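On the command line, the equivalent options might look like this (hedged sketch, placeholder URL; flag spellings per the httrack documentation):

  # a modest number of simultaneous connections, and external depth 0
  # so only pages of the target site itself are fetched
  httrack "https://www.example.org/" -O ~/websites/example-mirror -c4 -%e0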
The Maximum transfer speed you can increase at your own discretion, for example to half of your available download bandwidth. Take a look at the other settings; you may find something to your liking there.
On very large and complex sites, it is not a bad idea to limit the Maximum depth somewhat, for example to five levels. You may not get the complete site that way, but you will get the most important parts, which saves download time and especially storage space. By the way, an empty field means 'no limit'.
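The same two limits can be passed as flags to the command-line tool; a hedged sketch (placeholder URL, example values):

  # limit the mirroring depth to five levels and cap the rate at roughly 100 KB/s
  # (the rate is given in bytes per second; adjust it to your own bandwidth)
  httrack "https://www.example.org/" -O ~/websites/example-mirror -r5 -A100000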
Robot
Many websites – if not all, nowadays – contain a text file indicating that the site may not be copied by robots. This is meant to prevent all kinds of automatic tools and crawlers from downloading entire sites and eating up all the bandwidth. In principle, HTTrack also adheres to this rule. The result is easy to guess: little or nothing is downloaded from the site you want to copy.
In other words: we now have to be rather rude and ignore that rule (or rather, wish). Click on Spider and set the robots.txt rule to the option No robots.txt rules. In principle you are now done with the settings; click OK to apply all the above adjustments at once.
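The command-line counterpart is the robots flag; a short, hedged sketch (placeholder URL):

  # -s0: ignore robots.txt entirely – use this with restraint
  httrack "https://www.example.org/" -O ~/websites/example-mirror -s0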
Handy index page
Click on Next and then Start, after which the download (local mirroring) of the site begins. As mentioned, this can take quite some time with larger sites, so there is little point in sitting and watching it. When the mirroring is complete, you will automatically see a notification and the browser window can be closed.
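For reference, the command-line equivalent of the configuration built up above, started in the detachable tmux session suggested earlier, might look roughly like this (placeholder URL and output folder; flag spellings per the httrack documentation, so verify with httrack --help):

  tmux new -s mirror
  # inside the session:
  # -d same principal domain, -r5 depth five, -%e0 no external links,
  # -c4 four connections, -A100000 ~100 KB/s cap, -s0 ignore robots.txt
  httrack "https://www.example.org/" -O ~/websites/example-mirror -d -r5 -%e0 -c4 -A100000 -s0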
You will find the site in the folder chosen in the very first step; under Debian this is by default the home folder (Personal folder) of your Debian account. In it you will find a websites folder containing the downloaded sites.
It is nice that HTTrack creates a handy index page. Double-click the file index.html in the websites folder and you will see links to all the underlying downloaded sites.
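From a terminal you can open that overview page directly in your default browser; a small sketch assuming the default websites folder in your home directory:

  # open the generated overview page with the system's default browser
  xdg-open ~/websites/index.html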