Regular expression query

Question

I'm writing a list of regular expressions to identify company names from text.

This is what the text looks like-->

Summer Intern
Genisup India Pvt. Ltd., Hosur, Tamil Nadu
 June 2021 – Aug 2021 1⁄2 Remote
• Internship on the topic NLP: Topic Modeling to assign the
theme or topic for any news article on internet using Machine
Learning techniques.
• Worked on proxy rotation and Web Scraping
• Performed LDA Topic Modeling on "The Hindu" news articles
and obtained precision score of 0.906.
Intern Trainee
VUGS Technologies Pvt. Ltd., Agra, Uttar Pradesh
 May 2021 – June 2021 1⁄2 Remote
• Built an OCR using Pytesseract and NER Text Classification
Model to categorize detected text into Name, E-mail Address,
Phone number and Date using NLTK,SpaCy and BERT
• Created an OCR for Handwritten text [A-Z, 0-9] using CNN
architecture
• Built a Face Recognition Model and Face Mask Detection
Model using OpenCV and Haar Cascade Classifier.

Expected output-->

['Genisup India Pvt. Ltd.' 'VUGS Technologies Pvt. Ltd.']

Observed output-->

['Genisup India Pvt. Ltd.' 'S Technologies Pvt. Ltd.']

Why isn't "VUGS" getting printed completely?

What I have tried:

import re
import numpy as np

sub_patterns = ['[A-Z][a-z]* [A-Z][a-z]* Private Limited','[A-Z][a-z]* [A-Z][a-z]* Pvt. Ltd.','[A-Z][a-z]* [A-Z][a-z]* Inc.',
'[A-Z][a-z]* [A-Z][a-z]* Corporation', '[A-Z][a-z]* [A-Z][a-z]* Inc.', '[A-Z][a-z]* [A-Z][a-z]* Technologies', '[A-Z][a-z]* [A-Z][a-z]* Company', '[A-Z][a-z]* [A-Z][a-z]* Solutions',
'[A-Z][a-z]* [A-Z][a-z]* Services']
pattern = '({})'.format('|'.join(sub_patterns))
comp = re.findall(pattern, text)
comp_name = np.array(comp)
comp_un=np.unique(comp_name)
print(comp_un)

Posted 14-Sep-22 4:24am

Apoorva 2022

Updated 14-Sep-22 4:50am

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

jsc42 · Accepted Answer · 2022-09-14T04:50:00

1. VUGS is not printed completely because your pattern assumes that each word of a company name is one capital letter followed by zero or more lower case letters ([A-Z][a-z]*). VUGS is all capitals, so the earliest point at which the pattern can start matching is its final S.
2. You are assuming that every company name consists of exactly two words before the suffix and contains no non-alphabetic characters. This means you would miss IBM (one word) and 3M (starts with a digit).
3. You are using '.' (as in Pvt.) as if it were a literal character. In a regular expression it is not a literal; it matches any character. Use \\. to match a literal full stop.
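
The three points can be checked with a short experiment. This is only a sketch: the `relaxed` pattern is an assumed illustration of point 1, not the final answer, and the sample strings "IBM Inc." and "PvtX LtdY" are made up for the demonstration. The patterns are written as Python raw strings with a single backslash.

```python
import re

line = "VUGS Technologies Pvt. Ltd., Agra, Uttar Pradesh"

# Point 1: the original word pattern allows only ONE capital letter per word,
# so the match cannot begin before the last letter of "VUGS".
original = "[A-Z][a-z]* [A-Z][a-z]* Pvt. Ltd."
first_hit = re.search(original, line).group(0)
print(first_hit)    # S Technologies Pvt. Ltd.

# Assumed illustration: letting each word be any run of letters and digits
# lets the match begin at the V.
relaxed = r"[A-Za-z0-9]+ [A-Za-z0-9]+ Pvt\. Ltd\."
relaxed_hit = re.search(relaxed, line).group(0)
print(relaxed_hit)  # VUGS Technologies Pvt. Ltd.

# Point 2: a fixed two-word prefix finds nothing for a one-word name.
one_word_hit = re.search("[A-Z][a-z]* [A-Z][a-z]* Inc.", "IBM Inc.")
print(one_word_hit)  # None

# Point 3: an unescaped '.' matches any character, not just a full stop.
unescaped_matches = re.search("Pvt. Ltd.", "PvtX LtdY") is not None
escaped_matches = re.search(r"Pvt\. Ltd\.", "PvtX LtdY") is not None
print(unescaped_matches, escaped_matches)  # True False
```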

On the assumption that every company name starts at the beginning of a line, consists only of alphanumerics and spaces, and has at least one word before the suffix, your regular expression could look like the one below (but beware, I have not tested it):

^[A-Za-z0-9 ]*[A-Za-z0-9] (Private Limited|Pvt\\. Ltd\\.|Inc\\.|Corporation|Technologies|Company|Solutions|Services)

NOTE: Wherever you see \\ in the text above, type only one backslash. I cannot get CodeProject to display a single backslash character, so I have had to double them.
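
For anyone who wants to try the expression in Python, here is a minimal sketch under the same assumptions as above. Two details are my additions rather than part of the expression as posted: `re.MULTILINE` is needed so that `^` matches at the start of every line, and the suffix group is made non-capturing with `(?:...)` because `re.findall` would otherwise return only the suffix. The sample `text` is shortened from the question, and the names `SUFFIXES` and `COMPANY` are just illustrative.

```python
import re

# Shortened sample of the text from the question.
text = """Summer Intern
Genisup India Pvt. Ltd., Hosur, Tamil Nadu
Intern Trainee
VUGS Technologies Pvt. Ltd., Agra, Uttar Pradesh
"""

# Raw strings, so each regex escape needs only one backslash.
SUFFIXES = (r"(?:Private Limited|Pvt\. Ltd\.|Inc\.|Corporation|"
            r"Technologies|Company|Solutions|Services)")

# MULTILINE makes ^ match at the start of each line, not just of the string.
COMPANY = re.compile(r"^[A-Za-z0-9 ]*[A-Za-z0-9] " + SUFFIXES, re.MULTILINE)

# With no capturing group, findall returns the whole match for each hit.
companies = sorted(set(COMPANY.findall(text)))
print(companies)
# ['Genisup India Pvt. Ltd.', 'VUGS Technologies Pvt. Ltd.']
```

Because the prefix is greedy, a name such as "VUGS Technologies Pvt. Ltd." is matched up to "Pvt. Ltd." rather than stopping early at "Technologies". Names that do not start a line, or that contain characters such as "&" or "-", would still be missed under these assumptions.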