Junkbuster User Manual
Junkbuster can use "regular expressions" in various config files, provided that support for "pcre" (Perl Compatible Regular Expressions) is compiled in, which is the default. The configuration directives do not require regular expressions, but they can be used to increase flexibility by matching a pattern with wildcards against URLs.
If you are reading this, you probably don't understand what "regular expressions" are, or what they can do. So this will be a very brief introduction only. A full explanation would require a book ;-)
A "regular expression" is a way of matching one character string against another to see whether it matches or not. One of the two is a literal string of readable characters (letters, numbers, etc.), and the other is a pattern: a string of literal characters combined with wildcards and other special characters, called metacharacters. The "metacharacters" have special meanings and are used to build the pattern to be matched against. Perl Compatible Regular Expressions are an enhanced form of the regular expression language, with backward compatibility.
To make a simple analogy, we do something similar when we use wildcard characters when listing files with the dir command in DOS. *.* matches all filenames. The "special" character here is the asterisk, which matches any and all characters. We can be more specific and use ? to match just individual characters. So "dir file?.txt" would match "file1.txt", "file2.txt", etc. We are pattern matching, using a similar technique to "regular expressions"!
Regular expressions do essentially the same thing, but are much, much more powerful. There are many more "special characters" and ways of building complex patterns. Let's look at a few of the common ones, and then some examples:
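The wildcard analogy can be tried out without DOS. The following is a minimal sketch that assumes Python's standard `fnmatch` module as a stand-in for the `dir` command; the file names are made up for illustration, and shell-style wildcards differ from DOS in small ways (here "*.*" requires a literal dot in the name).

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing, used only for illustration.
names = ["file1.txt", "file2.txt", "file10.txt", "notes.doc"]

# "?" stands for exactly one character, so "file10.txt" is not selected.
single = [n for n in names if fnmatchcase(n, "file?.txt")]

# "*" stands for any run of characters (including none).
anything = [n for n in names if fnmatchcase(n, "*.*")]

print(single)
print(anything)
```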
. - Matches any single character, e.g. "a", "A", "4", ":", or "@".
? - The preceding character or expression is matched ZERO or ONE times. Either/or.
+ - The preceding character or expression is matched ONE or MORE times.
* - The preceding character or expression is matched ZERO or MORE times.
\ - The "escape" character denotes that the following character should be taken literally. This is used where one of the special characters (e.g. ".") needs to be taken literally and not as a special metacharacter.
[] - Characters enclosed in brackets will be matched if any of the enclosed characters are encountered.
() - Parentheses are used to group a sub-expression, or multiple sub-expressions.
| - The "bar" character works like an "or" conditional statement. A match is successful if the sub-expression on either side of "|" matches.
s/string1/string2/g - This is used to rewrite strings of text. "string1" is replaced by "string2" in this example. The trailing "g" means "global": every occurrence is replaced, not just the first.
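Each of the metacharacters above can be tried in isolation. The sketch below assumes Python's `re` module, which is not PCRE itself but treats these particular constructs the same way; the helper `full` and the sample words are hypothetical and chosen only for illustration.

```python
import re

def full(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "dot":            full(r"a.c", "a@c"),          # "." is any single character
    "question":       full(r"colou?r", "color"),    # "u" zero or one time
    "plus":           full(r"go+gle", "gooogle"),   # "o" one or more times
    "plus_needs_one": full(r"go+gle", "ggle"),      # no "o" at all, so no match
    "star":           full(r"go*gle", "ggle"),      # "o" zero or more times
    "escape":         full(r"a\.c", "a@c"),         # "\." is a literal dot only
    "brackets":       full(r"[abc]at", "cat"),      # any one of a, b, c
    "group_or":       full(r"(cat|dog)s?", "dogs"), # grouping plus "or"
}
for name, result in checks.items():
    print(name, result)

# re.sub replaces every occurrence by default, like the trailing "g"
# in s/string1/string2/g.
rewritten = re.sub(r"string1", "string2", "string1 and string1")
print(rewritten)
```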
These are just some of the ones you are likely to use when matching URLs with Junkbuster, and are a long way from a definitive list. This is enough to get us started with a few simple examples which may be more illuminating:
/.*/banners/.* - A simple example that uses the common combination of "." and "*" to denote any character, zero or more times. In other words, any string at all. So we start with a literal forward slash, then our regular expression pattern (".*"), another literal forward slash, the string "banners", another forward slash, and lastly another ".*". We are building a directory path here. This will match any file whose path has a directory named "banners" in it, below the leading part of the path. The ".*" matches any characters, and this could conceivably be more forward slashes, so it might expand into a much longer looking path. For example, this could match: "/eye/hate/spammers/banners/annoy_me_please.gif", or just "/ads/banners/annoying.html", or an almost infinite number of other possible combinations, just so it has "/banners/" in the path somewhere after the leading slash. (Read strictly, the expression needs one slash before the first ".*" and another before "banners", so a path that begins directly with "/banners/" is not covered. The first ".*" may be empty, though, so "//banners/annoying.html" is.)
And now something a little more complex:
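The banner pattern can be checked against a few sample paths. This is only a sketch: it assumes Python's `re` module and assumes the pattern is applied from the start of the path (`match`), which may differ in detail from how Junkbuster applies it. The name `BANNERS` and the second and third paths are made up for illustration.

```python
import re

# The pattern from the text, compiled unchanged.
BANNERS = re.compile(r"/.*/banners/.*")

paths = [
    "/eye/hate/spammers/banners/annoy_me_please.gif",  # deep path
    "/ads/banners/",                                   # trailing ".*" may be empty
    "/images/logo.gif",                                # no "banners" directory
]

# match() anchors the pattern at the start of the path.
results = {p: BANNERS.match(p) is not None for p in paths}
for path, hit in results.items():
    print(hit, path)
```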
/.*/adv((er)?ts?|ertis(ing|ements?))?/ - We have several literal forward slashes again ("/"), so we are building another expression that is a file path statement. We have another ".*", so we are matching against any conceivable sub-path, just so it matches our expression. The only true literal that must match our pattern is adv, together with the forward slashes. What comes after the "adv" string is the interesting part.
Remember the "?" means the preceding expression (either a literal character or anything grouped with "(...)" in this case) can exist or not, since this means either zero or one match. So "((er)?ts?|ertis(ing|ements?))" is optional, as are the individual sub-expressions "(er)" and the two "s" characters that are followed by "?". The "|" means "or". We have two of those. For instance, "(ing|ements?)" can expand to match either "ing" OR "ements?" (that is, "ement" or "ements"). What is being done here is an attempt at matching as many variations of "advertisement", and similar, as possible. So this would expand to match just "adv", or "advert", or "adverts", or "advertising", or "advertisement", or "advertisements". You get the idea. But it would not match "advertizements" (with a "z"). We could fix that by changing our regular expression to: "/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/", which would then match either spelling.
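The expansions listed above can be verified mechanically. The following sketch assumes Python's `re` module as a stand-in for pcre; the helper `hits`, the constant names, and the made-up path "/site/.../pic.gif" are hypothetical and exist only to wrap each directory name in slashes.

```python
import re

# Both patterns are copied from the text.
ADV_DIR = re.compile(r"/.*/adv((er)?ts?|ertis(ing|ements?))?/")
ADV_DIR_Z = re.compile(r"/.*/adv((er)?ts?|erti(s|z)(ing|ements?))?/")

def hits(pattern, directory):
    """True if a made-up path containing `directory` matches `pattern`."""
    return pattern.match(f"/site/{directory}/pic.gif") is not None

words = ["adv", "advert", "adverts",
         "advertising", "advertisement", "advertisements"]
matched = [w for w in words if hits(ADV_DIR, w)]
print(matched)

# The "z" spelling is only caught by the second pattern.
z_first = hits(ADV_DIR, "advertizements")
z_second = hits(ADV_DIR_Z, "advertizements")
print(z_first, z_second)

# A directory that merely starts with "adv" is left alone.
unrelated = hits(ADV_DIR, "advocate")
print(unrelated)
```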
/.*/advert[0-9]+\.(gif|jpe?g) - Again another path statement with forward slashes. Anything in the square brackets "[]" can be matched. This is using "0-9" as a shorthand expression to mean any digit zero through nine. It is the same as saying "0123456789". So any digit matches. The "+" means one or more of the preceding expression must be included. The preceding expression here is what is in the square brackets: in this case, any digit zero through nine. Then, at the end, we have a grouping: "(gif|jpe?g)". This includes a "|", so this needs to match the expression on either side of that bar character. A simple "gif" on one side, and the other side will in turn match either "jpeg" or "jpg", since the "?" means the letter "e" is optional and can be matched once or not at all. So we are building an expression here to match a GIF or JPEG image file. It must include the literal string "advert", then one or more digits, and a "." (which is now a literal, and not a special character, since it is escaped with "\"), and lastly either "gif", or "jpeg", or "jpg". Some possible matches would include: "//advert1.jpg", "/nasty/ads/advert1234.gif", "/banners/from/hell/advert99.jpg". It would not match "advert1.gif" (no leading slash), or "/adverts232.jpg" (the expression does not include an "s"), or "/advert1.jsp" ("jsp" is not in the expression anywhere).
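The matching and non-matching examples from the text can be run as a quick check. As before, this is a sketch that assumes Python's `re` module and start-of-path matching; the variable names are hypothetical.

```python
import re

# The pattern from the text, compiled unchanged.
ADVERT_IMG = re.compile(r"/.*/advert[0-9]+\.(gif|jpe?g)")

should_match = [
    "//advert1.jpg",
    "/nasty/ads/advert1234.gif",
    "/banners/from/hell/advert99.jpg",
]
should_not_match = [
    "advert1.gif",      # no leading slash
    "/adverts232.jpg",  # the "s" is not part of the expression
    "/advert1.jsp",     # "jsp" is not one of the listed extensions
]

matched = [p for p in should_match if ADVERT_IMG.match(p)]
rejected = [p for p in should_not_match if not ADVERT_IMG.match(p)]

print(matched)
print(rejected)
```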
s/microsoft(?!\.com)/MicroSuck/i - This is a substitution. "MicroSuck" will replace any occurrence of "microsoft". The "i" at the end of the expression means ignore case. The "(?!\.com)" means the match should fail if "microsoft" is followed by ".com" (the "." is escaped so that it is taken literally). In other words, this acts like a "NOT" modifier. In case this is a hyperlink, we don't want to break it ;-).
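The effect of the "NOT" part is easiest to see on a sample sentence. The sketch below assumes Python's `re` module, where `re.IGNORECASE` plays the role of the trailing "i" and `sub` replaces every occurrence on the line; the sample text is made up. The dot is written as "\." here so that it only stands for a literal "." in front of "com".

```python
import re

# Negative lookahead: match "microsoft" only when ".com" does NOT follow.
pattern = re.compile(r"microsoft(?!\.com)", re.IGNORECASE)

# Hypothetical page text containing both a plain mention and a hyperlink.
text = "Visit Microsoft at http://www.microsoft.com/ for Microsoft news."

result = pattern.sub("MicroSuck", text)
print(result)
```

The hyperlink survives untouched, while both free-standing mentions are rewritten regardless of their capitalization.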
We are barely scratching the surface of regular expressions here so that you can understand the default Junkbuster configuration files, and maybe use this knowledge to customize your own installation. There is much, much more that can be done with regular expressions. Now that you know enough to get started, you can learn more on your own :/
More reading on Perl Compatible Regular expressions: http://www.perldoc.com/perl5.6/pod/perlre.html