Building Web Projects

with Server-Side Includes, a little Perl, and some JavaScript

A More Useful 404

Published: May 30, 2024
Revised: September 18, 2024
Note: I originally published this in "A List Apart." This article contains updates as well as new sections, "Added Benefits" and "Other Uses."

Problem:

Encountering 404 errors on the internet is not new. Often, developers have provided custom 404 pages in an effort to make the experience a little less frustrating. However, for a custom 404 page to be truly useful, it should not only provide relevant and specific information to the user, but should also provide immediate feedback to the developer so that when possible, the problem can be fixed.

To accomplish this, I developed a custom 404 page that uses server-side includes to execute a Perl script. The 404 page is built to have the look and feel of the website in order to provide consistency while the Perl script does the processing to determine the cause of the 404 error and takes the appropriate action.

Overall design

To provide useful and specific information to the user, it is necessary to define the possible causes of a 404 error. Here are four possible causes:

Case 1. The user mistyped the URL or followed an out-of-date bookmark. These are grouped together because we'll see that it's not possible to distinguish one from the other.

Case 2. The user encountered a 404 error as a result of a broken link on one of my pages on the website.

Case 3. The 404 error was the result of a broken link returned by a search engine.

Case 4. The 404 error was caused by a broken link on another website, but not a search engine.

In each of these cases, the user is provided information about the specific cause of the error. If the broken link is either on my website or someone else's website, but not returned via a search engine, the Perl script sends me an email message telling me about the broken link including the URL of the page that the link is on and the page the user was trying to reach.

Custom 404 page

A common use of SSI is to include snippets of static HTML, maybe a header and footer, in order to share these elements throughout the site. SSI pages, which typically have a .shtml extension, are processed by the server prior to being sent to the browser.

When an SSI directive such as <!--#include virtual="/inc/header.html" --> is encountered in the .shtml file, the server will replace that line with the contents of the file specified. (Note, you can also include a .shtml file which will also be processed by the server.)

However, in addition to this rather simple function, SSI is capable of executing programs such as Perl scripts. In this case, the output generated by the Perl script is sent to the browser.

Since I wanted my custom 404 page to provide specific information to the user as well as send information to me, my custom 404 page is a .shtml file in which I use SSI to execute a Perl script that does all of the work. For my site, the SSI directive looks like this.

<!--#include virtual="/cgi-bin/404.pl" -->

The code for 404.pl is shown toward the end of this article.
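Whatever the script does internally, it has to behave like any other CGI program: it prints a Content-type header, then a blank line, then the HTML fragment that the server splices into the page. Here is a minimal sketch of that outer shape. The helper names escape_html, build_fragment, and cgi_response are hypothetical and are not taken from the actual 404.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape characters that are special in HTML so a
# strange request path cannot inject markup into the error page.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: the HTML fragment that ends up inside 404.shtml.
sub build_fragment {
    my ($requested) = @_;
    my $safe = escape_html($requested);
    return "<p>Sorry, but the page you were trying to get to, "
         . "$safe, does not exist.</p>\n";
}

# A CGI program emits its headers, a blank line, and then the body.
sub cgi_response {
    my ($requested) = @_;
    return "Content-type: text/html\n\n" . build_fragment($requested);
}

print cgi_response($ENV{'REQUEST_URI'} // '(unknown)');
```

Building the response as a string before printing it keeps the logic easy to check from the command line without a web server.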

Enabling custom 404 pages

In order to use SSI, the web server needs to be configured. This can be done by either using a .htaccess file, or modifying the Apache httpd.conf file.

First, in order to have Apache serve up my specific 404 page when a 404 error is encountered, I add the ErrorDocument directive to the httpd.conf file, or the .htaccess file. It looks like this.

ErrorDocument 404 /errorpages/404.shtml

The 404.shtml page contains the SSI directive shown above, <!--#include virtual="/cgi-bin/404.pl" -->. The code for 404.shtml is shown toward the end of this article.

Second, in order to tell Apache to execute CGI scripts, I need to make sure the httpd.conf file has the ExecCGI parameter added to the Options directive, or I can just add

Options +ExecCGI

to the .htaccess file.

Perl script

The Perl script does all of the processing in order to determine the appropriate action. To identify the source of the 404 error, the Perl script reads the HTTP_REFERER environment variable, which contains the URL of the page that the user just came from. I realize that there are no guarantees that this is accurate because it can be faked, but this isn't really a concern for this application.

In general, the Perl code performs the following steps.

  1. Check HTTP_REFERER to determine the source of the 404 error.
  2. Display the appropriate message to the user.
  3. Send me an email message, if needed for the particular error.
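The first and third steps can be sketched as two small functions. This is only an illustration of the decision logic under the four cases defined earlier. The function names and the case labels ('typed', 'internal', 'search', 'external') are hypothetical, and which cases trigger an email is a matter of preference.

```perl
use strict;
use warnings;

# Classify a 404 by its referrer. Returns one of:
#   'typed'    - no referrer (mistyped URL or out-of-date bookmark)
#   'internal' - referrer is a page on this site
#   'search'   - referrer matches a known search engine
#   'external' - referrer is some other site
sub classify_404 {
    my ($referrer, $server_name, @search_engines) = @_;

    return 'typed'    if !defined $referrer || $referrer eq '';
    return 'internal' if index($referrer, $server_name) >= 0;

    for my $engine (@search_engines) {
        next if $engine eq '';    # an empty entry would match everything
        return 'search' if index($referrer, $engine) >= 0;
    }
    return 'external';
}

# In this sketch only the two broken-link cases produce an email.
sub needs_email {
    my ($case) = @_;
    return $case eq 'internal' || $case eq 'external';
}
```

Checking for an undefined referrer first also avoids "uninitialized value" warnings when the browser sends no Referer header at all.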

Case 1: Mistyped URL or out-of-date bookmark

In the case of a mistyped URL or an out-of-date bookmark, the HTTP_REFERER will be blank. In Perl, I check for this using the following.

if (length($ENV{'HTTP_REFERER'}) == 0)

The Perl script then displays a message to the user in the custom 404 page saying what the problem is. In the messages displayed to the user, as well as any email messages, I provide the URL of the requested page using:

my $requested = "$ENV{'SERVER_SCHEME'}://$ENV{'SERVER_NAME'}$ENV{'REQUEST_URI'}";

As noted above, HTTP_REFERER can be faked, and some browsers or privacy settings leave it out entirely, so treating a blank value as a mistyped URL or an out-of-date bookmark is not perfect.

Case 2: Broken link on my website

When HTTP_REFERER is not blank, I then check it to see if it refers to my site, somebody else's site, or a search engine. If it contains my domain name, then I know the user followed a link from one of my pages. The Perl I use to check for this is,

if ((index($ENV{'HTTP_REFERER'}, $ENV{'SERVER_NAME'}) >= 0))

The index function returns the position of SERVER_NAME within the HTTP_REFERER string, or -1 if it is not found. If the result is >= 0, I know that the user was on a page on my site.
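One caveat with a plain substring check is that it matches the domain anywhere in the referrer, including the query string of a page on another site. A stricter variant compares only the host portion. The sketch below is one way to do that. The helper names referrer_host and is_own_site are hypothetical, and the pattern handles only absolute http and https URLs.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the host out of an absolute http(s) URL.
# Returns undef when the value does not look like one.
sub referrer_host {
    my ($url) = @_;
    return undef unless defined $url;
    # Skip the scheme and any "user@" part, then capture up to the
    # first "/", ":", "?" or "#".
    if ($url =~ m{\Ahttps?://(?:[^/\@]*\@)?([^/:?#]+)}i) {
        return lc $1;
    }
    return undef;
}

# True only when the referrer's host is exactly the server name.
sub is_own_site {
    my ($referrer, $server_name) = @_;
    my $host = referrer_host($referrer);
    return defined $host && $host eq lc $server_name;
}
```

With this check, a referrer such as https://www.somedomain.com/?from=www.mydomain.com is treated as another site, whereas the substring test would count it as my own.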

In this case, I present a message to the user stating that I have a broken link on my page. However, rather than ask the user to send me an email telling me this, the Perl script sends me an email containing all of the necessary information. At the same time, I let the user know that an email has just been sent and the broken link will be corrected shortly.

In the email message, I set the subject of the message to clearly identify that there is a broken link on my site and provide the domain name using $ENV{'SERVER_NAME'}. This allows me to use this script on multiple sites but simplifies the sorting of any incoming messages. The body of the email tells me the URL of the page the user was on, as well as the URL of the requested page.
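The message itself is plain text: a few header lines, a blank line, and the body. Here is a minimal sketch of composing it and handing it to a local mail program. The function names, the address parameters, and the sendmail path are all assumptions for illustration. The actual script may send mail differently.

```perl
use strict;
use warnings;

# Build the complete message (headers, blank line, body) as one string.
# The 'to' and 'from' values are supplied by the caller.
sub compose_broken_link_email {
    my (%arg) = @_;
    my $subject = "Broken link on my site, $arg{server_name}.";
    my $body =
          "BROKEN LINK ON MY SITE\n\n"
        . "There appears to be a broken link on my page, $arg{referrer}. "
        . "Someone was trying to get to $arg{requested} from that page.\n";

    return "To: $arg{to}\n"
         . "From: $arg{from}\n"
         . "Subject: $subject\n"
         . "\n"
         . $body;
}

# Hand the message to a sendmail-compatible program, which reads the
# recipients from the headers when given -t. The default path is an
# assumption; adjust it for your server.
sub send_email {
    my ($message, $sendmail) = @_;
    $sendmail ||= '/usr/sbin/sendmail';
    open(my $mail, '|-', $sendmail, '-t') or return 0;
    print {$mail} $message;
    return close($mail) ? 1 : 0;
}
```

Putting the server name in the subject is what makes it possible to sort messages by site when the same script runs on several domains.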

Case 3: Broken link returned from a search engine

To determine if the user came from a search engine results page, I check HTTP_REFERER against a list of search engine domains. This list is stored in a simple text file that the Perl script reads. By using an external file containing a list of URLs I can update the list at any time and not have to modify the Perl.

Here are the Perl snippets for this case:

my $SEARCHENGINE = "false";
open(FILE, "searchengines.txt") or die "cannot open the file";
while (<FILE>) {
   $_ =~ s/\s+$//; # Remove trailing whitespace
   next if $_ eq ""; # Skip blank lines, which would match every referrer
   if (index($referrer, $_) >= 0) {
      $SEARCHENGINE = "true";
   }
}
close(FILE);

then,

if ($SEARCHENGINE eq "true")

In this case, I let the user know that the search engine returned an old link. Since there really isn't anything I can do about it, I don't need an email message. However, I may want one just so I know about it.

Case 4: Broken link on somebody else's website

If the 404 was not the result of any of the three previous situations, then I know it was caused by a broken link on somebody else's page. So again, the Perl script displays the appropriate information to the user and sends me an email message. I can then go to the page with the broken link and if the page owner has provided contact information, I'm able to notify them of the problem.

Finally

Implementing this custom 404 page improves the usability of my site by helping the user, and it keeps me informed of broken links. The summary below shows the four cases discussed along with the message displayed to the user and any email message that is sent.

Case 1. Mistyped URL or out-of-date bookmark

Message to user:

Sorry, but the page you were trying to get to, https://www.mydomain.com/no-such-page.shtml, does not exist.

It looks like this was the result of either

  • a mistyped address,
  • or an out-of-date bookmark in your web browser.

You may want to try searching this site or using our site map to find what you were looking for.

Email message: No

Case 2. Broken link on one of my pages

Message to user:

Sorry, but the page you were trying to get to, https://www.mydomain.com/no-such-page.shtml, does not exist.

Apparently, we have a broken link on our page. An email has just been sent to the person who can fix this and it should be corrected shortly. No further action is required on your part.

Email message:

From: The www.mydomain.com 404 script

Subject: Broken link on my site, www.mydomain.com.

Message:
BROKEN LINK ON MY SITE

There appears to be a broken link on my page, https://www.mydomain.com/badlink.shtml. Someone was trying to get to https://www.mydomain.com/no-such-page.shtml from that page. Why don't you take a look at it and see what's wrong?

Case 3. Broken link on a search engine results page

Message to user:

Sorry, but the page you were trying to get to, https://www.mydomain.com/no-such-page.shtml, does not exist.

It looks like the search engine has returned a link to an old page. These old links should eventually be removed from their indexes but since these are automatically generated there is no one to contact to try to correct the problem.

You may want to try searching this site or using our site map to find what you were looking for.

Email message: Optional. An email message is not needed because there isn't much I can do about the broken link, but I may go ahead and have the script send me one just so I know about it.

Case 4. Broken link on somebody else's page

Message to user:

Sorry, but the page you were trying to get to, https://www.mydomain.com/no-such-page.shtml, does not exist.

Apparently, there is a broken link on the page you just came from. We have been notified and will attempt to contact the owner of that page and let them know about it.

You may want to try searching this site or using our site map to find what you were looking for.

Email message:

From: The www.mydomain.com 404 script

Subject: Broken link on somebody else's site.

Message:
BROKEN LINK ON SOMEBODY ELSE'S SITE

There appears to be a broken link on the page, https://www.somedomain.com/badlink.shtml. Someone was trying to get to https://www.mydomain.com/no-such-page.shtml from that page. Why don't you take a look at it and see if you can contact the page owner and let them know about it?

Added Benefits

Missing Resources

One of the added benefits of this code is that it allows you to find missing items (e.g., images) on your webpages. Let's say you've just moved your site to production. As you click through it, if you come across a page with a missing image, for example, the server will generate a 404 error and this code will send you an email message (assuming you're the recipient identified in the code). Now, I realize that missing images are often obvious, but this is not always the case, and this code will tell you which ones are missing.

Hacking Attempts

Another benefit of this code I've discovered is that when there's a brute force attack on my website, I receive emails showing which files the attacker is trying to access and the IP address the attack is coming from. Many times I've had attackers hit my site anywhere from 5 to 200+ times in the span of a few seconds or minutes, indicating that they're using a script to try to find a way in.

When this happens, I'll often block that IP, or a range of IPs if it's from China, Russia, or other places whose traffic I don't care about. This is easy to do in Apache. Here's an example of a block of code that you can place in the .htaccess file.

<RequireAll>
   Require all granted

   # This will block a single IP
   # No need to do this if I block the range including this as shown below.
   Require not ip 1.61.33.44

   # The following blocks the range of IP addresses, 1.56.0.0 - 1.63.255.255
   Require not ip 1.56.0.0/13
</RequireAll>
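The /13 in that last rule means the first 13 bits of the address are fixed, which leaves 19 bits free, or 524,288 addresses. If you want to double-check which addresses a CIDR block covers before blocking it, the arithmetic is short. This is a sketch for IPv4 only, with no input validation, and the function names are hypothetical.

```perl
use strict;
use warnings;

# Convert a dotted-quad IPv4 address to a 32-bit integer.
sub ip_to_int {
    my ($ip) = @_;
    my @octets = split /\./, $ip;
    return ($octets[0] << 24) | ($octets[1] << 16) | ($octets[2] << 8) | $octets[3];
}

# Convert a 32-bit integer back to dotted-quad form.
sub int_to_ip {
    my ($n) = @_;
    return join '.', ($n >> 24) & 255, ($n >> 16) & 255, ($n >> 8) & 255, $n & 255;
}

# Return the first and last address covered by a CIDR block.
sub cidr_range {
    my ($cidr) = @_;
    my ($ip, $bits) = split m{/}, $cidr;
    # The host mask has a 1 in every bit that is free to vary.
    my $host_mask = $bits == 32 ? 0 : (1 << (32 - $bits)) - 1;
    my $first = ip_to_int($ip) & (0xFFFFFFFF ^ $host_mask);
    return (int_to_ip($first), int_to_ip($first | $host_mask));
}

my ($first, $last) = cidr_range('1.56.0.0/13');
print "$first - $last\n";   # prints: 1.56.0.0 - 1.63.255.255
```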

One of the things I quickly discovered after implementing my 404 code is that most hacking attempts are attempts to access WordPress-specific files. I would sometimes get 20 to 200+ emails showing me that a specific IP was trying to access various WordPress files. To me, that indicates that attackers expect these files to have vulnerabilities and are trying to exploit them. Now, I don't use WordPress, although I experimented with it several years ago. What I discovered is that it took me longer to figure out how to do something in WordPress than to just write the code myself. That, coupled with the fact that hackers are targeting WordPress files, was enough to make me avoid it. Yes, I know there are lots of good things about WordPress; it just wasn't for me.

The other thing I'll often do when that happens is to contact the abuse email address for the IP (found using a whois search) and send them the list of access log entries showing these attempts. More than once I've received replies from the owner of the IP thanking me and telling me that they have blocked the offending user. I don't do this if the owner is in China or Russia because I'm sure they don't care.
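Pulling those log entries together is easy to script. Here is a minimal sketch, assuming the access log uses Apache's common or combined format, where the client IP is the first field and the status code follows the quoted request line. The function name and the log path in the usage comment are hypothetical.

```perl
use strict;
use warnings;

# Return the log lines in which the given IP received a 404.
sub entries_404_for_ip {
    my ($ip, @lines) = @_;
    my @hits;
    for my $line (@lines) {
        # Layout: IP ident user [time] "request" status size ...
        # The request pattern allows backslash-escaped quotes.
        next unless $line =~ /\A(\S+) \S+ \S+ \[[^\]]+\] "(?:[^"\\]|\\.)*" (\d{3}) /;
        push @hits, $line if $1 eq $ip && $2 eq '404';
    }
    return @hits;
}

# Usage sketch (the log path is a placeholder):
#   open(my $fh, '<', '/var/log/apache2/access.log') or die $!;
#   print entries_404_for_ip('1.61.33.44', <$fh>);
```

Lines that don't match the expected layout are skipped rather than guessed at, so the output contains only entries that can be sent along as they are.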

See it in action

Example from entering a bad URL

Want to see this in action? Simply modify the URL to this page by adding something to the end of it.

For example, I entered https://www.buildingwebprojects.com/no-such-page.shtml. The browser then displayed the following message.

Webpage Display

Sorry, but the page you were trying to get to, https://www.buildingwebprojects.com/no-such-page.shtml, does not exist.

It looks like this was the result of either:

  • a mistyped address,
  • or an out-of-date bookmark in your web browser.

Just access the menu to see all available articles.

Then, the system sends me an email. Here's what that looks like.

Email Message

Sent: Mon, Apr 1, 2024 at 09:04:10 AM;

Bad URL entered.

Someone at 11.22.33.44 (11.22.33.44) was trying to get to http://buildingwebprojects.com/articles/page-dependent-content/dfghj after either entering the URL, following an out-of-date bookmark, or something else (i.e. there was no HTTP_REFERER).

This is an automatically generated email message. It won't do you any good to reply to it because no one will ever see your message.

Example from someone else's site

If the user followed a link to the bad URL they'll see the following message. (You can see this for yourself simply by creating a link to the bad URL from a webpage of your own.)

Webpage Display

Sorry, but the page you were trying to get to, http://buildingwebprojects.com/articles/page-dependent-content/dfghj, does not exist.

Apparently, there is a broken link on the page you just came from. We have been notified and will attempt to contact the owner of that page and let them know about it.

And here's the email I received.

Email Message

Sent: Mon, Apr 1, 2024 at 09:15:07 AM;

BAD LINK ON SOMEBODY ELSE'S SITE

There appears to be a broken link on the page, http://bwp-dev/. Someone at 111.222.333.444 was trying to get to http://dev.buildingwebprojects.com/articles/page-dependent-content/dfghj from that page. Why don't you take a look at it and see if you can contact the page owner and let them know about it?

This is an automatically generated email message. It won't do you any good to reply to it because no one will ever see your message.

Other Uses

I recently found another use for this code. Here's the situation.

I had discovered a problem with someone else's website, actually several problems, and I took a series of screenshots planning to notify the site owner. However, if I attached all of those images to an email message, the message would've become excessively large. So instead, I created a simple webpage where I posted those images along with some brief text explaining the problem. Then, I sent a short email message to the site owner explaining that I had noticed some issues with the site and had posted some screenshots, pointing them to the webpage I had built.

Well, I waited, assuming that I would receive a reply from the owner, at least acknowledging that they had received my note and were looking into the problem, but I never did. In fact, after checking my server logs, it turned out that no one other than myself had ever accessed the page showing the problems.

Well, I guess I was a little surprised that this company (a government contractor) had no interest in reading emails (which I assume they didn't because they never checked the webpage I had built), and therefore no interest in creating a quality product. Since my code will generate emails when 404s are encountered, I decided to intentionally have the page generate a 404 and then I would receive an email and I would know that the page had been viewed.

So, here's what I did. I added a link to a non-existent image on the webpage I had built. And to prevent a broken image symbol from appearing on the webpage I applied a little CSS to hide it. Here's the code.

<img src="blank.gif" style="visibility: hidden;">

I can put this anywhere in the file and it won't affect the displayed page.

Code

Here's the 404.shtml file. I've stripped out the unnecessary pieces, showing only what's important. You'll want to modify the extra HTML to produce the look and feel of your site.

<!DOCTYPE html>
<html lang="en">
<head>
      <!--#include virtual="/inc/head.shtml" -->
      <title>Building Web Projects - with Perl, SSI, JS, and some jQuery</title>
      <style>
         .error-404-message {
            margin-left: 6px;
         }
         .sorry {
            margin-top: 100px;
            margin-bottom: 20px;
         }
         header a,
         header a:hover {
            color: #e0e0e0;
            text-decoration: none;
         }
      </style>
</head>
<body>

   <header>
      <h1><a href="/">Building Web Projects</a></h1>
      with Server Side Includes, a little Perl, and some JavaScript
   </header>

   <div class="error-404-message">
      <!--#include virtual="/cgi-bin/404/404.pl" -->
   </div>
   
   <!--#include virtual="/inc/footer.shtml" -->
</body>
</html>

Here's the 404.pl file.

Summary

To summarize, here's how everything is put together.