(Ir)Regular Expression

I spent almost half of yesterday trying to figure out how to capture a substring from each line in a list of text. The text is the result of querying products from the database. Each line contains a product name and a color code, and many of them also have the color name (in parentheses).

ALEXIS 002
BIG EASY 011
VENETO 012
AMAZING 200 (AURORA)
BABY WOOL 201 (ALPINE MEADOW)
FISHERMAN'S WOOL 098 (NATURAL)
HOMETOWN USA 108 (WASHINGTON DENIM)
JIFFY THICK & QUICK 207 (GREEN MOUNTAINS)
LION COTTON 098 (NATURAL)
LION SUEDE 178
MARTHA EXTRA SOFT WOOL (GERBERA DAISY)
MARTHA LOFTY WOOL BLEND 509 (BALLPOINT BLUE)
VANNA'S CHOICE BABY 100 (ANGEL WHITE)
WOOL EASE THICK & QUICK 099 (FISHERMAN)

So what I want is everything before the number. The first thing I thought of was a regular expression, so I started crafting a pattern.

First I came up with this one.

^(\D+)

It should just work, I thought: capture one or more non-digit characters. Well, it did for most of the lines, except for line 11. There it returned the whole string, parentheses included.

ALEXIS
BIG EASY
VENETO
AMAZING
BABY WOOL
FISHERMAN'S WOOL
HOMETOWN USA
JIFFY THICK & QUICK
LION COTTON
LION SUEDE
MARTHA EXTRA SOFT WOOL (GERBERA DAISY)
MARTHA LOFTY WOOL BLEND
VANNA'S CHOICE BABY
WOOL EASE THICK & QUICK
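
To see that behavior outside of Rubular, here is a minimal Ruby sketch. Ruby is my assumption (Rubular is a Ruby regex tester), and the names SAMPLE_LINES and leading_non_digits are made up for illustration. Note that the raw capture keeps the trailing space before the number.

```ruby
# A few lines taken from the product list above.
SAMPLE_LINES = [
  "ALEXIS 002",
  "AMAZING 200 (AURORA)",
  "MARTHA EXTRA SOFT WOOL (GERBERA DAISY)"
].freeze

# Return the first capture group of ^(\D+), or nil when nothing matches.
# (Hypothetical helper, not part of any library.)
def leading_non_digits(line)
  match = line.match(/^(\D+)/)
  match && match[1]
end

SAMPLE_LINES.each { |line| p leading_non_digits(line) }
# "ALEXIS "
# "AMAZING "
# "MARTHA EXTRA SOFT WOOL (GERBERA DAISY)"
```

The last line has no digit anywhere, so \D+ never hits anything that stops it.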

Huh? Why? Then I thought, maybe it is because that line does not have the digits to separate the name from the rest. So I modified the pattern to look for optional digits.

/^(\D+)(\d{3})?(.*)/

Still not working (same result as above).
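
Looking at all three groups shows why nothing changed. This is a sketch under the same assumption that the pattern runs in Ruby; OPTIONAL_DIGITS and split_groups are illustrative names, not anything from the original query code.

```ruby
OPTIONAL_DIGITS = /^(\D+)(\d{3})?(.*)/

# Return the three capture groups as an array.
# A group that did not take part in the match comes back as nil.
# (Hypothetical helper for illustration.)
def split_groups(line)
  match = line.match(OPTIONAL_DIGITS)
  match ? match.captures : []
end

p split_groups("AMAZING 200 (AURORA)")
# ["AMAZING ", "200", " (AURORA)"]

p split_groups("MARTHA EXTRA SOFT WOOL (GERBERA DAISY)")
# ["MARTHA EXTRA SOFT WOOL (GERBERA DAISY)", nil, ""]
```

On the line without a color code, the greedy \D+ takes the entire line first, so the optional digit group matches nothing and the last group is left with an empty string.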

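After many tries, I RTFM and found out that \D matches any non-digit character (so it includes spaces and symbols such as parentheses), while \w matches only word characters (a-z, A-Z, 0-9 and _, but not symbols). Since line 11 has no digits at all, \D+ was free to run all the way to the end of the line. So I changed the pattern.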

/^(\w+[^\d|\(]*)/

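That’s \w+ (one or more word characters) followed by [^\d|\(]* (zero or more characters that are not a digit or an opening parenthesis). Inside the brackets the | is a literal pipe rather than "or", so it is harmless here but not actually needed. And BOOM!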

ALEXIS
BIG EASY
VENETO
AMAZING
BABY WOOL
FISHERMAN'S WOOL
HOMETOWN USA
JIFFY THICK & QUICK
LION COTTON
LION SUEDE
MARTHA EXTRA SOFT WOOL
MARTHA LOFTY WOOL BLEND
VANNA'S CHOICE BABY
WOOL EASE THICK & QUICK

It worked.
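
For completeness, a small Ruby sketch of the final pattern in use. Again, Ruby is an assumption, and NAME_PATTERN and product_name are names I made up here. The capture still ends with the space that sits before the number or the parenthesis, so the sketch strips it.

```ruby
NAME_PATTERN = /^(\w+[^\d|\(]*)/

# Extract the product name: everything before the color code or,
# when there is no code, before the opening parenthesis.
# Returns nil if the line does not start with a word character.
# (Hypothetical helper for illustration.)
def product_name(line)
  match = line.match(NAME_PATTERN)
  match && match[1].strip
end

[
  "ALEXIS 002",
  "FISHERMAN'S WOOL 098 (NATURAL)",
  "JIFFY THICK & QUICK 207 (GREEN MOUNTAINS)",
  "MARTHA EXTRA SOFT WOOL (GERBERA DAISY)"
].each { |line| puts product_name(line) }
# ALEXIS
# FISHERMAN'S WOOL
# JIFFY THICK & QUICK
# MARTHA EXTRA SOFT WOOL
```

One caveat: \w also matches digits, so a product name that begins with a number would behave differently. None of the lines in my list do.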

Note:

  1. The first ^ means beginning of line, but the second ^, as the first character inside [ ], means "not": it negates the character class.
  2. I use Rubular to help form the pattern visually.
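
The two meanings of ^ from note 1 can be checked with a tiny Ruby sketch (Ruby assumed, variable names made up for illustration). It also shows that | inside brackets is just a literal character.

```ruby
# ^ outside brackets: anchor at the beginning of a line.
starts_with_wool = /^WOOL/
p starts_with_wool.match?("WOOL EASE")  # true
p starts_with_wool.match?("BABY WOOL")  # false

# ^ as the first character inside brackets: negate the class.
# This class matches any single character that is NOT a digit,
# a literal pipe, or an opening parenthesis.
not_digit_pipe_or_paren = /[^\d|\(]/
p "A1|(b)".scan(not_digit_pipe_or_paren)  # ["A", "b", ")"]
```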