I'm writing a list of regular expressions to identify company names from text.

This is what the text looks like:
Summer Intern
Genisup India Pvt. Ltd., Hosur, Tamil Nadu
 June 2021 – Aug 2021 1⁄2 Remote
• Internship on the topic NLP: Topic Modeling to assign the
theme or topic for any news article on internet using Machine
Learning techniques.
• Worked on proxy rotation and Web Scraping
• Performed LDA Topic Modeling on "The Hindu" news articles
and obtained precision score of 0.906.
Intern Trainee
VUGS Technologies Pvt. Ltd., Agra, Uttar Pradesh
 May 2021 – June 2021 1⁄2 Remote
• Built an OCR using Pytesseract and NER Text Classification
Model to categorize detected text into Name, E-mail Address,
Phone number and Date using NLTK,SpaCy and BERT
• Created an OCR for Handwritten text [A-Z, 0-9] using CNN
• Built a Face Recognition Model and Face Mask Detection
Model using OpenCV and Haar Cascade Classifier.

Expected output:
['Genisup India Pvt. Ltd.' 'VUGS Technologies Pvt. Ltd.']

Observed output:
['Genisup India Pvt. Ltd.' 'S Technologies Pvt. Ltd.']

Why isn't "VUGS" getting printed completely?

What I have tried:

import re
import numpy as np

# `text` holds the resume excerpt shown above.
sub_patterns = [
    '[A-Z][a-z]* [A-Z][a-z]* Private Limited',
    '[A-Z][a-z]* [A-Z][a-z]* Pvt. Ltd.',
    '[A-Z][a-z]* [A-Z][a-z]* Inc.',
    '[A-Z][a-z]* [A-Z][a-z]* Corporation',
    '[A-Z][a-z]* [A-Z][a-z]* Technologies',
    '[A-Z][a-z]* [A-Z][a-z]* Company',
    '[A-Z][a-z]* [A-Z][a-z]* Solutions',
    '[A-Z][a-z]* [A-Z][a-z]* Services',
]
pattern = '({})'.format('|'.join(sub_patterns))
comp = re.findall(pattern, text)
comp_name = np.array(comp)
Updated 14-Sep-22 5:50am

1 solution

1. VUGS is cut off because each pattern assumes that a word is one capital letter followed by zero or more lower-case letters ([A-Z][a-z]*). VUGS is all capitals, so the match can only begin at its final S, which gives "S Technologies Pvt. Ltd."
2. The patterns also assume that every company name has exactly two words before the suffix and contains no non-alphabetic characters. They would therefore miss IBM (one word) and 3M (starts with a digit).
3. You are using '.' (as in Pvt.) as if it were a literal character. In a regular expression an unescaped '.' matches any character; use \. to match a literal full stop.

Assuming that every company name starts at the beginning of a line, consists only of alphanumeric characters and spaces, and has at least one word before the suffix, your regular expression could look like this (but beware, I have not tested it):

^[A-Za-z0-9 ]*[A-Za-z0-9] (Private Limited|Pvt\. Ltd\.|Inc\.|Corporation|Technologies|Company|Solutions|Services)
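
For what it's worth, here is a minimal Python sketch of how the suggested pattern could be wired into re.findall. It makes two adjustments that are my assumptions, not part of the answer above: the suffix group is non-capturing, because re.findall returns only the group contents when a pattern has a capturing group, and re.MULTILINE is set so that ^ matches at the start of every line, not only at the start of the string. The names SUFFIXES, COMPANY_RE, OLD_PATTERN, find_companies and sample_text are made up for illustration, and sample_text is a shortened stand-in for the text in the question.

```python
import re

# Suffixes taken from the question; dots are escaped so they match literally.
SUFFIXES = [
    r"Private Limited", r"Pvt\. Ltd\.", r"Inc\.", "Corporation",
    "Technologies", "Company", "Solutions", "Services",
]

# (?:...) is non-capturing, so re.findall returns the whole match.
# re.MULTILINE lets ^ match at the start of every line.
COMPANY_RE = re.compile(
    r"^[A-Za-z0-9 ]*[A-Za-z0-9] (?:{})".format("|".join(SUFFIXES)),
    re.MULTILINE,
)

# One of the original sub-patterns, kept only to reproduce the problem.
OLD_PATTERN = r"[A-Z][a-z]* [A-Z][a-z]* Pvt. Ltd."

# Shortened, hypothetical stand-in for the text in the question.
sample_text = """Summer Intern
Genisup India Pvt. Ltd., Hosur, Tamil Nadu
• Worked on proxy rotation and Web Scraping
Intern Trainee
VUGS Technologies Pvt. Ltd., Agra, Uttar Pradesh
• Created an OCR for Handwritten text [A-Z, 0-9] using CNN
"""


def find_companies(text):
    """Return company names that start a line and end in a known suffix."""
    return COMPANY_RE.findall(text)


# The old pattern can only start matching at the last capital of "VUGS".
print(re.findall(OLD_PATTERN, "VUGS Technologies Pvt. Ltd."))
# ['S Technologies Pvt. Ltd.']

print(find_companies(sample_text))
# ['Genisup India Pvt. Ltd.', 'VUGS Technologies Pvt. Ltd.']
```

Because the pattern is anchored at the start of a line, any line that begins with ordinary words and happens to contain one of the suffixes (for example "Worked at Acme Company") would be matched from its first word, so the results still need checking against real data.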

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
