The original goal of archiving services was to keep some sort of historical record. As we use books less and less and convert everything into digital form, and as services and websites open and close, there is a real risk of losing extremely valuable records of modern human history.
But in recent years, there has been a shift in how and why internet users archive sites: archiving now serves less altruistic and more hostile purposes, as a political and harassment tool.
Several political subreddits now use moderation bots to replace links to “enemy” sites with archived clones in order to deny revenue to those sites.
The phrase “Anything you say can, and will, be used against you” isn’t limited to law enforcement; anything you say can effectively haunt you forever.
Archive sites have stopped following robots.txt directives, and now also conceal their user-agent to avoid detection. Why would you do this if you were running an honest service?
Some of them also refuse to remove archived documents unless they are forced to, and plenty of unsavory individuals use that policy to guarantee that information that can be used to harass and hurt other people stays online.
As explained earlier, archive sites can take away your visitors despite the fact that you produced the content.
You wrote a great article? That’s great, but you will never know about it if an archive is collecting the page views instead of you.
Only instead of Big Brother, it’s internet trolls and social justice warriors doing the watching, and there is no such thing as a statute of limitations with them.
Because archiving services now masquerade as regular users, you don’t have a lot of options for blocking them.
You can check archive.ethernia.net/ip, where I maintain a (relatively) up-to-date list of IP addresses known to be used by the two most popular archiving services.
Archive.org also allows you to opt out of their archiving system if you want to, and the IP ranges they own are clearly referenced in the WHOIS database.
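If you want to check an address yourself, a quick WHOIS lookup will usually tell you who owns it and which range it belongs to. A minimal sketch (the address is a placeholder, and the exact field names depend on which regional registry answers):

whois XXX.XXX.XXX.XXX | grep -iE 'orgname|netname|cidr|inetnum'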
As for manually collecting them, it’s basically a case of creating a specific URL just for them, using their crawling API/interface to ask for an archive of it, and checking your server logs to see which addresses try to reach that URL.
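For example, assuming your bait URL is /honeypot-for-archives (an arbitrary name) and your access log is in the default nginx location, something like this will list every unique address that requested it:

grep "/honeypot-for-archives" /var/log/nginx/access.log | awk '{print $1}' | sort -u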
Do keep in mind that with IPv6, most servers have access to huge IP ranges; if you notice that the requests always come from the same general prefix, it might be worth blocking the entire /64 subnet they are part of (see the ip6tables example below).
That’s the simplest method for blocking. The main issue with it is that it’s not going to create any logs that you can monitor, and to the archive site, your server simply doesn’t exist. Simple, effective, but not so fun. If you’ve used iptables before, the syntax is pretty simple:
iptables -A INPUT -s XXX.XXX.XXX.XXX -j DROP (block a single address)
iptables -A INPUT -s XXX.XXX.XXX.0/24 -j DROP (block an entire IPv4 /24 subnet)
iptables -A INPUT -m iprange --src-range XXX.XXX.XXX.XXX-XXX.XXX.XXX.YYY -j DROP (block a range of addresses)
I believe it is the same syntax with ip6tables.
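For instance, to drop the whole /64 mentioned earlier, a rule along these lines should do it (the prefix below is a documentation placeholder; substitute the range you actually observed):

ip6tables -A INPUT -s 2001:db8:1234:5678::/64 -j DROP (block an entire IPv6 /64 subnet)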
NGINX is a really nice web server; it’s also what I happen to use, so that’s what this quick tutorial will focus on.
Your server needs the GEO module (https://nginx.org/en/docs/http/ngx_http_geo_module.html); with it, we will be able to classify visitors based on IP address ranges.
You need to create a .conf file in the nginx config folder, global/ip-bans.conf for example. Here is the syntax:
geo $is_bad_bot {
    default 0;
    XXX.XXX.XXX.XXX 1;
    XXX.XXX.XXX.0/24 1;
}
Note: I also have a geo config file available here: http://archive.ethernia.net/nginx-geo.conf
You can also use IPv6 entries in this file; they work exactly the same way.
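For example, lines like these inside the same geo block would flag a single IPv6 address and a /64 range (the addresses are documentation-prefix placeholders):

    2001:db8::1 1;
    2001:db8:1234:5678::/64 1;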
You then need to include that file in the “http” section of your nginx.conf file, with something like: include global/ip-bans.conf;
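In context, that part of nginx.conf would look roughly like this (only the relevant line is shown, everything else is up to your existing setup):

http {
    include global/ip-bans.conf;
    # ... the rest of your http-level configuration ...
}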
From this point on, you can refer to the $is_bad_bot variable in your nginx server definitions. Here are a few examples:
Return a 403 “Forbidden” (or you can use 404 for a simple “not found”); the main issue with this approach is that it makes it very clear that there is a problem.
if ($is_bad_bot) {
    return 403;
}
Redirect them somewhere else (and yes, if you are cheeky, why not themselves?).
if ($is_bad_bot) {
    return 301 https://google.com;
}
Make them wait forever: a 202 means that the request has been accepted but processing has not yet been completed. This seems to throw them for a loop, and because they keep retrying, they expose more of their IP addresses.
if ($is_bad_bot) {
    return 202;
}
I’ll add more interesting methods in the future.