I spent almost half of yesterday trying to figure out how to capture a substring from each line in a list of text. The text is the result of querying products from the database. The lines contain a product name, a color code, and many of them also have the color name (in parentheses).
ALEXIS 002
BIG EASY 011
VENETO 012
AMAZING 200 (AURORA)
BABY WOOL 201 (ALPINE MEADOW)
FISHERMAN'S WOOL 098 (NATURAL)
HOMETOWN USA 108 (WASHINGTON DENIM)
JIFFY THICK & QUICK 207 (GREEN MOUNTAINS)
LION COTTON 098 (NATURAL)
LION SUEDE 178
MARTHA EXTRA SOFT WOOL (GERBERA DAISY)
MARTHA LOFTY WOOL BLEND 509 (BALLPOINT BLUE)
VANNA'S CHOICE BABY 100 (ANGEL WHITE)
WOOL EASE THICK & QUICK 099 (FISHERMAN)
So what I want is everything before the number. The first thing I thought of was a regular expression, so I started crafting the regex pattern.
First I came up with this one.
It should just work, I thought: capture one or more non-digit characters. Well, it did for most of the lines, except for line 11, where it returned the whole string, parentheses included.
ALEXIS
BIG EASY
VENETO
AMAZING
BABY WOOL
FISHERMAN'S WOOL
HOMETOWN USA
JIFFY THICK & QUICK
LION COTTON
LION SUEDE
MARTHA EXTRA SOFT WOOL (GERBERA DAISY)
MARTHA LOFTY WOOL BLEND
VANNA'S CHOICE BABY
WOOL EASE THICK & QUICK
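The first pattern itself is not shown here, so this is only a minimal sketch of what the description suggests. It assumes Ruby (because the post mentions Rubular) and a pattern along the lines of ^\D+; the names FIRST_TRY, FIRST_SAMPLE and first_try are made up for illustration. It shows why line 11 comes back whole: \D matches anything that is not a digit, parentheses included.

```ruby
# A few of the product lines from the list above.
FIRST_SAMPLE = [
  "ALEXIS 002",
  "AMAZING 200 (AURORA)",
  "MARTHA EXTRA SOFT WOOL (GERBERA DAISY)"
].freeze

# Assumed first attempt: one or more non-digit characters from the start.
FIRST_TRY = /^\D+/

# String#[] with a Regexp returns the matched text, or nil if nothing matches.
def first_try(line)
  line[FIRST_TRY]
end

FIRST_SAMPLE.each { |line| puts first_try(line).inspect }
# Lines with a color code stop at the first digit (note the trailing space),
# but the line without one is matched to the end, parentheses and all.
```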
Huh? Why? Then I thought, maybe it is because that line does not have a digit to separate the string. So I modified the pattern to look for optional digits.
Still not working. (same result as above)
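The second pattern is not shown either. As a rough guess at its shape, assuming Ruby again and a capture group followed by optional digits (SECOND_TRY and second_try are hypothetical names), the sketch below shows why nothing changes: \D+ has already swallowed the parentheses before the optional \d* gets a chance to match.

```ruby
# Assumed second attempt: capture the non-digits, then allow optional digits.
SECOND_TRY = /^(\D+)\d*/

# String#[] with a Regexp and a group number returns that capture group.
def second_try(line)
  line[SECOND_TRY, 1]
end

puts second_try("LION SUEDE 178").inspect
puts second_try("MARTHA EXTRA SOFT WOOL (GERBERA DAISY)").inspect
# The second call still returns the full line, because \d* happily matches
# zero digits after \D+ has consumed everything up to the end.
```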
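After many tries, I RTFM and found out that \D will match any non-digit (meaning it includes symbols such as parentheses), while \w will match any word character (meaning letters, digits and _, but not symbols). So I changed the pattern. In the end I did not go with \w+ (one or more word characters), which would stop at the first space, but with [^\d|\(]* (anything that is not a digit or an opening parenthesis). Note that inside square brackets the | is a literal pipe character rather than "or", so [^\d\(]* would behave the same on this data. And BOOM!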
ALEXIS
BIG EASY
VENETO
AMAZING
BABY WOOL
FISHERMAN'S WOOL
HOMETOWN USA
JIFFY THICK & QUICK
LION COTTON
LION SUEDE
MARTHA EXTRA SOFT WOOL
MARTHA LOFTY WOOL BLEND
VANNA'S CHOICE BABY
WOOL EASE THICK & QUICK
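Here is a small sketch of the working approach, assuming Ruby and assuming the final pattern is the negated class anchored at the start of the line. NAME_PATTERN and product_name are illustrative names, and the call to strip is my own addition: the class also matches the space just before the digit or parenthesis, so the raw match carries a trailing space.

```ruby
# Negated character class: stop at the first digit, pipe, or "(".
NAME_PATTERN = /^[^\d|\(]*/

# Return the product name without the color code or color name.
def product_name(line)
  # The * quantifier always matches (possibly an empty string), so this is
  # never nil; strip removes the space left before the digit or "(".
  line[NAME_PATTERN].strip
end

[
  "ALEXIS 002",
  "JIFFY THICK & QUICK 207 (GREEN MOUNTAINS)",
  "MARTHA EXTRA SOFT WOOL (GERBERA DAISY)",
  "VANNA'S CHOICE BABY 100 (ANGEL WHITE)"
].each { |line| puts product_name(line) }
# ALEXIS
# JIFFY THICK & QUICK
# MARTHA EXTRA SOFT WOOL
# VANNA'S CHOICE BABY
```

Symbols such as & and ' survive because the class only excludes what it lists, which is exactly what \w+ could not do.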
- The first ^ means beginning of line, but the second ^, the one inside the square brackets, means "not": it negates the character class.
- I used Rubular to help build the pattern visually.