Text Formats 101: How Drupal filters work

Drupal’s text formatting and filtering system is one of its most critical components. The system performs double duty: it makes content editing for the web more secure, by allowing us to filter out dangerous HTML tags and the like from content posted by untrusted users, and more approachable, by allowing mere mortals to avoid writing content in pure HTML and performing the required conversions for them. Despite its importance, the system tends not to get much hype or limelight when Drupal is discussed. An unglamorous workhorse, the text format system is designed to be something that you set and then forget.

Yes, even if you use it every day, the filtering system can be easy to forget about. All the more reason that you should pay close attention to how your site’s filters are set up, and maybe even review them once in a while to make sure they’re still configured in a manner matching how you want your site to be used.

[NB: "Text formats" is the name under which the filters are managed in Drupal 7; in Drupal 6 and earlier versions, they were called "input formats." Otherwise the same basic concepts apply.]

Let’s go over the inner workings of the formatting system.

Filters work on output

The first thing to know is a fundamental Drupal concept: it doesn’t modify content on input, only on output. That means that content enters the database exactly as it was entered, and it’s only on output - usually just before being sent to a browser - that the content is modified for safety and convenience.


Diagram of content filtering workflow

Content to be filtered enters the database directly. The filtering only happens between retrieval from the database and display (typically in a browser). Graphic: Jen Schultes


You may worry that this means that potentially unsafe content is making it to your database (though Drupal does do the necessary filtering to nullify database SQL injection attacks as the data is saved), but consider that if Drupal were filtering content before inserting it in the database, a misconfigured filter could potentially mangle your data beyond recognition, and there’d be no way to get it back. With a filter-on-output approach, if a misconfigured filter mangles your data on output, the original data is still safe and sound in the database, so all you have to do is correct the misconfiguration and you’re good to go again. However, the downside to that is that you may have insecure content in your database right now, ready to possibly be exposed to the world, if you don’t have your filters configured properly to filter it out.
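The filter-on-output idea can be sketched in a few lines of Python. This is only an illustration, not Drupal's code: the `database` dictionary, the `render` function, and both filter functions are hypothetical names invented for the example.

```python
import html

# Hypothetical in-memory "database": the raw text is stored exactly as submitted.
database = {"comment:1": "Hello <script>alert('x')</script> world"}


def misconfigured_filter(text: str) -> str:
    """A broken filter that throws away everything after the first '<'."""
    return text.split("<", 1)[0]


def escape_filter(text: str) -> str:
    """A corrected filter that escapes HTML instead of truncating the text."""
    return html.escape(text, quote=False)


def render(key: str, text_filter) -> str:
    """Filtering happens here, on output; the stored value is never changed."""
    return text_filter(database[key])


broken = render("comment:1", misconfigured_filter)  # output is mangled
fixed = render("comment:1", escape_filter)          # same stored data, new filter
```

Because `render` never writes back to `database`, swapping the broken filter for the corrected one is enough to get sensible output again from the same stored text.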

Different filter sets for different purposes, available to different roles

The formatting system allows you, as the site administrator, to set up text formats for content to be run through.


Diagram of input formats with sequences of filters

Input formats consist of instances of text filters in a particular sequence. Each filter instance may also have distinct settings affecting its behavior. Graphic: Jen Schultes


Each format is a series of text filters in a particular sequence, and each instance of a filter in a format has its own configuration settings, if the filter has any. Each filter takes text as input, alters it, then passes the altered text on to the next filter in the format’s sequence, if any. Finally, the standard Drupal permissions system allows you to dictate which roles’ members are allowed to use which text formats. Thus, when creating and configuring text formats, there are four points of configuration to keep in mind:

  • The presence (or absence) of a filter in a format;
  • The order of the filter instances in the format;
  • The configuration settings (if applicable) for each filter instance in a format;
  • The permissions configuration for the format (who has access to use what format).
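To make those four points concrete, here is a minimal Python model of a text format. It is a sketch under assumptions, not Drupal's actual data structures: `FilterInstance`, `TextFormat`, and the `trim` and `wrap` filters are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class FilterInstance:
    """One filter as it appears in one format, with its own settings."""
    name: str
    process: Callable[[str, Dict[str, str]], str]
    settings: Dict[str, str] = field(default_factory=dict)


@dataclass
class TextFormat:
    """An ordered sequence of filter instances plus the roles allowed to use it."""
    name: str
    filters: List[FilterInstance]
    allowed_roles: Set[str]

    def apply(self, text: str) -> str:
        # Each filter receives the output of the previous one.
        for instance in self.filters:
            text = instance.process(text, instance.settings)
        return text

    def usable_by(self, roles: Set[str]) -> bool:
        # A user may use the format if any of their roles is allowed.
        return bool(self.allowed_roles & roles)


def trim(text: str, settings: Dict[str, str]) -> str:
    """Hypothetical filter with no settings: strip surrounding whitespace."""
    return text.strip()


def wrap(text: str, settings: Dict[str, str]) -> str:
    """Hypothetical filter with one setting: the tag to wrap the text in."""
    tag = settings.get("tag", "p")
    return f"<{tag}>{text}</{tag}>"


demo = TextFormat(
    name="Demo format",
    filters=[FilterInstance("trim", trim), FilterInstance("wrap", wrap, {"tag": "div"})],
    allowed_roles={"content editor"},
)
reordered = TextFormat("Reordered", list(reversed(demo.filters)), {"content editor"})

in_order = demo.apply("  hello  ")
out_of_order = reordered.apply("  hello  ")
```

Presence, order, settings, and permissions each show up as a separate knob here. Reversing the two filters changes the result, because `trim` then runs on text that already starts and ends with a tag.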


Barista making filtered coffee

Just as a paper filter helps turn water and ground beans into something greater than the sum of its parts, text filters can be greatly beneficial for the security and usefulness of your site. Photo: David Sifry @ Flickr http://www.flickr.com/photos/dsif...


Example

Let’s look at a theoretical site’s text formats to see how these interact. Let’s say our site has content editors who like to mark up content using either straight HTML, or the Markdown markup language. For them, we’ve created a text format called “Markdown format” which uses the Markdown Drupal module’s filter to convert Markdown to HTML, then we use the core “Correct faulty and chopped off HTML” filter to correct any minor HTML errors that might have passed through. We trust our content editors pretty strongly, so we don’t use the core “Limit allowed HTML tags” filter to limit the HTML that they can use. However, we do use the permissions system to make sure that only users who have our “Content editor” role can use this text format.

Our theoretical site also has a forum on it which registered users can post to, and like many other forums on the web, we want it to support the popular BBCode markup language so that users can easily add links, images, and simple styling to their posts. However, we don’t totally trust the people posting on our forum, so we want to limit the HTML they can use to a pre-defined whitelist of tags. So we create another text format and call it “BBCode format.” First, it uses the filter provided by the Bbcode (sic) Drupal module to convert BBCode to HTML. Just like with our previous format, we’ll also use the “Correct faulty and chopped off HTML” filter to correct minor HTML errors. However, since we want to limit the HTML that the user can use, we’ll add the core “Limit allowed HTML tags” filter to the end of this format.

“Limit allowed HTML tags” is an example of a filter where how you configure its settings is crucial, and it may take some experimentation to get the correct result. Its settings include an “Allowed HTML tags” field which lets you list which HTML tags to permit. You want to permit the HTML tags that correspond to the BBCode tags this format’s BBCode filter converts; otherwise, the BBCode filter will convert some BBCode to HTML, just to have that HTML be stripped out by this filter. However, we don’t want to permit tags which allow these untrusted users to do potentially dangerous things, like <script>, or even just annoying things, like <blink>.
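The interaction between the two filters can be sketched roughly as follows. This Python is a deliberately naive stand-in, not the Bbcode module or Drupal's tag filter: the `BBCODE_TO_HTML` mapping, `bbcode_to_html`, and `limit_tags` are hypothetical, and a regular expression like this one is not a safe way to sanitize HTML on a real site.

```python
import re

# Hypothetical, tiny BBCode vocabulary: BBCode tag -> HTML tag.
BBCODE_TO_HTML = {"b": "strong", "i": "em"}

# Matches simple opening and closing tags and captures the tag name.
TAG_PATTERN = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def bbcode_to_html(text: str) -> str:
    """Convert [b]...[/b] and [i]...[/i] into their HTML equivalents."""
    for bb, tag in BBCODE_TO_HTML.items():
        text = text.replace(f"[{bb}]", f"<{tag}>").replace(f"[/{bb}]", f"</{tag}>")
    return text


def limit_tags(text: str, allowed: set) -> str:
    """Drop every tag whose name is not in the allowed set; keep the text."""
    def keep_or_drop(match: re.Match) -> str:
        return match.group(0) if match.group(1).lower() in allowed else ""
    return TAG_PATTERN.sub(keep_or_drop, text)


source = "[b]Hi[/b] [i]there[/i]<script>alert(1)</script>"
converted = bbcode_to_html(source)

# Allowed list forgets <strong>: the BBCode filter's work is thrown away.
too_strict = limit_tags(converted, {"em"})

# Allowed list matches the tags the BBCode filter emits; <script> is still dropped.
matched = limit_tags(converted, {"strong", "em"})
```

In `too_strict` the bold markup disappears even though the user wrote valid BBCode, which is the mismatch described above. `matched` keeps both converted tags and still removes the `<script>` tag.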

This format also gives an example of why the order of filters is an important element of configuration. Note that we have the HTML filtering… er… filter after the HTML correcting filter. This is because the HTML filtering filter expects its input to already be correctly-formatted HTML; it may be possible to “trick” it using incorrectly formatted HTML. If we have the HTML filtering filter first, a user may be able to pass malicious but ill-formed HTML into the system that the HTML filtering filter can’t recognize and strip out, but which the HTML correcting filter then converts to correctly-formed, malicious HTML. Thus, it’s generally a good idea to place the HTML filtering filter as the last filter in the filter order.

Finally, our hypothetical site will occasionally take Webform submissions from unregistered users whom we don’t want using HTML at all, but we would like the line breaks in their submissions to still come through when displayed. For them, we have created a “Plain text format” which uses the core “Display any HTML as plain text” filter followed by the core “Convert line breaks into HTML” filter. The former makes sure that any HTML in the submission is escaped, and the latter then converts line breaks into HTML <br /> and <p> tags. In the case of this format, neither filter has any settings that we need to deal with, but the order of the filter instances is again vital; if we have the line break inserting filter before the HTML escaping filter instead of after, then the former will add line break HTML tags to the text, just to have them escaped away by the latter. Not the desired result…
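The ordering problem is easy to reproduce with two toy filters. This Python sketch only imitates the idea: `escape_html` and `convert_line_breaks` are hypothetical stand-ins for the two core filters, and the line break filter here is simplified to emit only <br /> tags, never <p> tags.

```python
import html


def escape_html(text: str) -> str:
    """Stand-in for an escaping filter: show any HTML as plain text."""
    return html.escape(text, quote=False)


def convert_line_breaks(text: str) -> str:
    """Stand-in for a line break filter: add a <br /> before each newline."""
    return text.replace("\n", "<br />\n")


submission = "I <3 Drupal\nSee you soon"

# Escape first, then add line breaks: the <br /> survives as real markup.
right_order = convert_line_breaks(escape_html(submission))

# Add line breaks first, then escape: the <br /> itself gets escaped.
wrong_order = escape_html(convert_line_breaks(submission))
```

In `right_order` the browser receives a real line break tag. In `wrong_order` the visitor would see the literal text of the tag on the page instead.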

Settings affect everything retroactively

Let’s follow up with a couple more points about text filters before we finish up. First, you should keep in mind that any changes you make to text formats will have an effect on all current content using that format. This sounds obvious, but it means that you need to consider those effects when making drastic changes to your filters’ configurations. For example, if we’re finding that the Markdown filter in our Markdown text format in the previous example is just getting in the way and we want the content editors to just use straight HTML in their posts from now on, we can remove the Markdown filter from that format - but now all currently-existing content using that format which did use Markdown is not going to display correctly. If you find yourself in need of making large changes to a current text format but have a lot of content which already uses that format, it may be a better idea instead to create a new text format which reflects the changes you want to make to the old one, then have new content use that new format while old content continues to use the old one. (You could use the permissions system to make the older text format inaccessible to your content editors so that they can’t mistakenly create new content with the old format, but keep in mind that this will render them unable to edit the older content that uses the older format.)

Never use PHP evaluator

Finally, here’s one more tip: Never, ever, ever use the “PHP evaluator” filter or the “PHP code” format. This filter is provided, and this format is automatically created, when you enable the “PHP filter” module which comes with core, and it allows you to enter snippets of PHP code in content. It’s the fastest way to solve a problem where, for example, the client wants a copyright notice with the current year to appear in a block in the footer or something, but it’s just the wrong way to go about it; you’re better off creating a custom module which provides a block which does this (reaching the requisite level of Drupal development skill first, but don’t panic - custom blocks in Drupal are one of the easiest and most basic development tasks you can do). If creating a module is more work, then why is it better than using the PHP evaluator filter? For several reasons:

  • Running code through the filter is much slower than running code in a file (as in a module), particularly once PHP caches such as XCache and APC come into play. For many cases where you just need to do a quick date calculation or something like that, you probably won’t notice the difference in human terms, but on a busy site, it will add up.
  • It massively increases the attack surface of your site. Anyone who is able to use this filter, intentionally or accidentally, will basically be able to do anything they want to your site, your database, and your entire server. Thus, if you disregard my advice and continue to use this filter, you should at least quadruple-check your permissions settings to ensure that nobody you don’t trust to the utmost has access to a format using this filter.
  • In the case that an error occurs in code which is being run through the PHP filter, annoying white screens of death may result.
  • Code in a module is easier to localize, since utilities which look for strings which can be localized will look for them in code files, not in the database.
  • Finally, with all these other reasons taken into account, using the PHP filter is just lazy; it’s best to get in the habit and stay in the habit of writing modules for these sorts of tasks.


Colored camera lens filters

Photo: aslakr @ Flickr http://www.flickr.com/photos/asla...


And that’s all that your typical Drupal system administrator should know about the Drupal text formatting system. It’s a lot of information, to be sure, but for the convenience and especially the security of your site, there’s a lot to keep in mind. So go forth and harness Drupal's text formats and text filtering system to improve the security and usability of your site.
