How can I match exact words in text using re.compile?

Keywords: python regex nlp

Question: 

I want to build a pattern (with re.compile) that matches exact words such as these:

[aether, altitude, aphelion, west]

The pattern should capture either the word alone or the word followed by punctuation, in a way that I can use in spaCy. I tried this, but it does not work:

regex_patterns = [

re.compile(r'aether?,|altitude?,|aphelion?,|apside?,|apsis?,|ascension?,|autumnal equinox?,|east?.|eastward?,|eclipse?,|ecliptic?,|elliptical?,|epicycle?,|equinoctical?,|exquinox?,|fixed star?,|latitude?,|longitude?s|mean ecliptic?,|meridian?,|mobile star?,|node?,|nodes?,|north?,|octant?,|orbit?,|\borbital?,|\bparallax?,|\brays?,|\bretrograde?,|rise?,|sidereal?,|sidereal position?,|solstice?,|south?,|star?,|vernal equinox?,|west?,')
]

It would be nice if the regex captured both 'word' and 'word,' (word + punctuation), as in this sentence:

"west, can take a look"

The result should be:

west,

Answers: 

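To match specific words, we can start with an alternation wrapped in word boundaries and followed by an optional comma:

\b(aether|altitude|aphelion|apside|apsis|ascension|autumnal equinox|eastward|east|eclipse|ecliptic|elliptical|epicycle|equinoctical|exquinox|fixed star|latitude|longitudes?|mean ecliptic|meridian|mobile star|nodes?|north|octant|orbital|orbit|parallax|rays|retrograde|rise|sidereal position|sidereal|solstice|south|star|vernal equinox|west)\b,?

and then modify it as needed. Alternatives are tried from left to right and the first one that succeeds wins, so a longer phrase such as sidereal position is listed before its prefix sidereal. The \b anchors on both sides keep the pattern from matching inside longer words (for example, west inside western).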
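
For a long word list, it may be easier to build the alternation from a Python list than to maintain it by hand. The following is only a sketch under some assumptions: WORDS is a shortened, hypothetical sample of the list in the question, and build_pattern is an illustrative helper name, not part of the re module.

```python
import re

# Hypothetical, shortened word list; extend it with the remaining terms.
WORDS = ["east", "eastward", "orbit", "orbital", "sidereal", "sidereal position", "west"]

def build_pattern(words):
    """Compile one regex matching any of the words, optionally followed by a comma."""
    # Longest first, so "sidereal position" is tried before its prefix "sidereal".
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    # \b on both sides rejects matches inside longer words such as "western".
    return re.compile(r"\b(?:" + alternation + r")\b,?")

pattern = build_pattern(WORDS)
print(pattern.findall("west, can take a look"))
print(pattern.findall("the eastward orbital path, in sidereal position"))
```

The group is non-capturing so that findall returns the whole match, including the trailing comma when there is one.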

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"\b(aether|altitude|aphelion|apside|apsis|ascension|autumnal equinox|eastward|east|eclipse|ecliptic|elliptical|epicycle|equinoctical|exquinox|fixed star|latitude|longitudes?|mean ecliptic|meridian|mobile star|nodes?|north|octant|orbital|orbit|parallax|rays|retrograde|rise|sidereal position|sidereal|solstice|south|star|vernal equinox|west)\b,?"

test_str = ("aether\n"
    "altitude\n"
    "aphelion\n"
    "apside\n"
    "apsis\n"
    "ascension\n"
    "autumnal equinox\n"
    "east?.\n"
    "eastward\n"
    "eclipse\n"
    "ecliptic\n"
    "elliptical\n"
    "epicycle\n"
    "equinoctical\n"
    "exquinox\n"
    "fixed star\n"
    "latitude\n"
    "longitude\n"
    "longitudes\n"
    "mean ecliptic\n"
    "meridian\n"
    "mobile star\n"
    "node\n"
    "nodes\n"
    "north\n"
    "octant\n"
    "orbit\n"
    "orbital\n"
    "parallax\n"
    "rays\n"
    "retrograde\n"
    "rise\n"
    "sidereal\n"
    "sidereal position\n"
    "solstice\n"
    "south\n"
    "star\n"
    "vernal equinox\n"
    "west")

matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    # Report the full match and its character span.
    print("Match {} was found at {}-{}: {}".format(match_num, match.start(), match.end(), match.group()))

    # Groups are numbered from 1; group 0 is the whole match.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(group_num, match.start(group_num), match.end(group_num), match.group(group_num)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Try this regex:

'(word|other|foo|bar)+[\,\.]?'

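This matches "word" and "word,", "foo" and "foo.", and each of the other words with or without a trailing "," or "."; add further punctuation characters to the character class as needed. Note that the + after the group also lets back-to-back repeats such as "foofoo" match as one hit, so drop it if that is not wanted. The backslashes inside the character class are optional.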
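
As a rough illustration, here is a variant of that pattern applied to a sample sentence. It is a sketch, not the answer's exact regex: it assumes whole-word matches are wanted, so it adds \b anchors, drops the +, and uses a non-capturing group so that findall returns the full match. The sample text is made up.

```python
import re

# Placeholder words from the answer above. [,.]? allows one trailing
# comma or period; \b keeps "foo" from matching inside "foobar".
pattern = re.compile(r"\b(?:word|other|foo|bar)\b[,.]?")

text = "word, then foo. and bar but not foobar"
print(pattern.findall(text))
```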