In Touch With .htaccess

(I’m actually not sure why I have named this post that, but it has stuck in my brain and refuses to budge. Also I seem to have a four-word title thing going on since moving to Webby, so I’m going with it for now.)

One of the many, many things I am loving about Webby is getting to play with .htaccess files. They are such brilliant little utilities to have at your disposal. I actually still find it a novelty to upload files and directories to a server and then see those pages in my web browser, and getting to do fun tricks with .htaccess is just icing on a very yummy cake. I’m not going anti-web-framework or anything, really I’m not, but it sure is refreshing to get back to basics.

.htaccess vs. Sitewide Config

According to Apache, you should avoid using .htaccess files as much as possible and instead put your instructions into the main server configuration file. They have some good arguments, but I think .htaccess files make a lot more sense for the uses I’m describing. (Of course, it’s a moot point since I am on shared hosting and don’t have access to my main server configuration file.)

I like .htaccess files since they are utterly immediate. You put them right where you need their functionality. Your five lines of code aren’t lost in a 4,000-line conf file in some directory you can never find. You can see it, change it, version it. You won’t forget that it’s there. If you delete the directory, you delete the configuration along with it. It will always be deployed, automatically, at the same time as the rest of your site. It’s a perfect fit with Webby.

Redirects

I set up my first .htaccess file to do some permanent redirects. I made a big mistake with my old blog. I displayed the full text of articles on lots of pages throughout the site, like tag pages and monthly and yearly archive pages, but I didn’t put NOINDEX tags on these pages. So, search engines had no way of knowing which was the actual permanent location of the blog post content, and which was just a /tagged/with/buzzword page. Unfortunately, in many cases the tag page ended up being much more prominent on Google than the actual blog post. I’m guessing this was because I was silly enough to put tag clouds in my sidebar, so these /tagged/with pages had lots of internal links pointing to them, hence they looked more important.

I have now decided that tags are a total waste of time anyway. Nobody needs tags to find your content. That’s what search is for. Google knows how to semantically parse your content much better than you do. It will find the important words. If you really want to have tags on your blog then go bookmark all your blog posts in del.icio.us. Or, better yet, go and see what other people have tagged your posts with. Those are the tags that really matter anyway, the ones your readers assign.

Getting back to the point, if you have any of your content duplicated anywhere on your site, then NOINDEX tags are your friend. Only let the search engines see one copy of your content. Everyone will be much happier that way. Since I didn’t do this, and I still get (relatively) plenty of inbound traffic to some of these /tagged/with pages, the first .htaccess file I set up was along these lines:

Redirect 301 /tagged/with/pictobrowser http://ananelson.com/said/on/2007/12/17/screencasts-are-so-2005
Redirect 301 /tagged/with/skitch http://ananelson.com/said/on/2007/12/17/screencasts-are-so-2005
Redirect 301 /tagged/with/loldocs http://ananelson.com/said/on/2007/12/17/screencasts-are-so-2005

This is the sort of thing I probably would put into a sitewide conf file if I had access to it. These redirects can be safely forgotten about: I’ll probably take them out eventually, but they won’t do any harm if they stay there indefinitely. I redirected the more popular tag pages directly to the blog posts they refer to, then I added a catch-all to redirect any other pages to the list of all my old blog posts:

RedirectMatch 303 ^/tagged http://ananelson.com/said/on

As with any regular-expression-based scenario, make sure catch-alls go AFTER everything else that their pattern might match.
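
Put together, the redirect section of the file might be laid out like this. It is only a sketch assembled from the directives shown above, and it assumes Apache checks these mod_alias redirects in the order they appear and stops at the first match:

```apache
# Specific tag pages first: each popular tag goes straight to its post.
Redirect 301 /tagged/with/pictobrowser http://ananelson.com/said/on/2007/12/17/screencasts-are-so-2005
Redirect 301 /tagged/with/skitch http://ananelson.com/said/on/2007/12/17/screencasts-are-so-2005
Redirect 301 /tagged/with/loldocs http://ananelson.com/said/on/2007/12/17/screencasts-are-so-2005

# Catch-all last: any other URL starting with /tagged goes to the archive list.
# If this line came first, it would also match the three paths above.
RedirectMatch 303 ^/tagged http://ananelson.com/said/on
```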

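By the way, if you graduate from RedirectMatch to mod_rewrite’s RewriteRule for fancy regular expression redirects, you can temporarily increase the verbosity of Apache’s mod_rewrite logging to help you debug them. (RedirectMatch itself belongs to mod_alias, so it won’t show up in the rewrite log, and the logging directives have to go in the main server configuration rather than in a .htaccess file.) Just don’t leave it on a high setting unless you are impressed by the size of really large log files.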
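
For the record, here is roughly what turning the rewrite log up looks like. Treat it as a sketch with a few assumptions: it goes in the main server or virtual host configuration (not in a .htaccess file), the log path is a made-up example, and it traces mod_rewrite’s RewriteRule processing rather than mod_alias directives like RedirectMatch.

```apache
# Apache 2.4: per-module log levels; trace output goes to the normal error log.
LogLevel alert rewrite:trace3

# Apache 2.2 and earlier used dedicated directives instead
# (the path below is a hypothetical example):
# RewriteLog "/var/log/apache2/rewrite.log"
# RewriteLogLevel 3
```

Set the level back down once you have found the problem.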

Custom Directory Listings

I have a /tmp directory which I am experimenting with at the moment. It’s, well, a place for me to dump temporary files. I knew, of course, that Apache will simply display the contents of a directory unless you tell it not to. I didn’t know until yesterday that you can customize the way it does this. You can even add your own stylesheets so that the directory contents list looks like the rest of your site. So, of course I had to play with this! I have a .htaccess file in /tmp which says:

IndexOptions +SuppressHTMLPreamble +IgnoreCase NameWidth=*
HeaderName apache-index-header.html
ReadmeName apache-index-footer.html

The +SuppressHTMLPreamble option lets you define your own HTML headers (and therefore stylesheets); otherwise you are stuck with HTML 3! With HeaderName you name a file which should include the opening <html> and <body> tags, and with ReadmeName a file which should include the closing </body> and </html> tags. Apache will stick your directory contents in the middle.

Restricting Access

Another obvious use for .htaccess files is to restrict access to places you don’t want people to go. Sometimes, you just don’t want a directory’s contents to be listed but the items in that directory still need to be accessible. I have a /bin directory which contains some files, like the comment submission form, which need to be accessible, but I don’t really want people snooping in my /bin directory. (Not for security, just aesthetics.) So, there’s a .htaccess file in there with Options -Indexes. I have some other directories which contain files that are only going to be used via the file system on the server. These I can restrict access to completely with deny from all. It’s convenient for me to deploy these files using Webby, and with deny from all they are neatly hidden.
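
Spelled out as files, those two cases might look like the following. This is a sketch: the first two directives are the ones mentioned above, the Require line is an assumption for servers running Apache 2.4 (where the older syntax is deprecated), and all of it assumes the host allows these overrides in .htaccess.

```apache
# --- /bin/.htaccess ---
# Files stay reachable by URL, but Apache won't generate a listing for the directory.
Options -Indexes

# --- .htaccess in a directory that is only used via the file system ---
# Apache 2.2 syntax, as used in this post: refuse every HTTP request.
Deny from all
# Apache 2.4 equivalent (use instead of the line above):
# Require all denied
```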

Drafts in Webby

A nice way to work on drafts of blog posts in Webby is to add a .htaccess file to the directory containing your post with a deny from all statement in it. (I put each post in its own directory with the post in an index.html file.) In this way the blog post is invisible on your website, so you can continue to deploy your site as normal. Locally, Webby’s built-in heel server ignores .htaccess files, so you will still be able to see the drafts yourself when you autobuild. You need to combine this with some way to prevent the blog post from being mentioned in your RSS feed or on your archive pages. I stick index: false in the metadata until I am ready to publish (see this thread on the Webby mailing list) and I comment out the created_at timestamp. In fact I do this in my blog template, so blog posts are “unpublished” by default.

#####
# Uncomment created_at time,
# delete index: false
# and delete the .htaccess file
# when ready to post this article.
created_at: # <%= Time.now.to_y %>
index: false
#####

Those of you who got a feedful of lorem from me last night will, no doubt, appreciate this precaution. Sorry about that, and props to Fintan who took the time to reply in Latin.

You put the feed where?

When I was setting up my new Atom feed I noticed that WordPress put the old feed at http://ananelson.com/feed/. A directory. Seriously, mod_rewrite has a lot to answer for. .htaccess to the rescue, though. I was able to solve this by creating a directory feed/ and putting a .htaccess file in it containing

DirectoryIndex blog.xml

so that requests for /feed/ fetch the blog.xml file. But please, this is just wrong, update your subscriptions to the new feed.

Mime Trouble

You may have noticed the source code files in the sidebar. They will get an explanatory post of their own soon. There is lovely Pygments syntax highlighting [1] in my blog posts, but in the sidebar I want people to be able to click and just view the full plain source in their browsers. (If you would rather download, that’s what the .tgz is for.) Unfortunately, some file extensions won’t display in the browser; they will download and try to open themselves with really bizarre applications. QuickTime for Ragel files?

But, as you may have guessed, this can be fixed with a .htaccess file! All we need to do is:

AddType text/plain .sh
AddType text/plain .rl
AddType text/plain .php

For whichever file extensions we want. I put this .htaccess file in my blog/ directory so it only affects that directory and its subdirectories, and I can have working files with these extensions elsewhere in my site.

Right, that’s all. I’m going off to play with mod_expires now.

1 As used by GitHub. Yes, that’s right, they use a syntax highlighter written in Python.