Complex text cleaning using regex in Python/Notepad++

Since I am new to regex, this question might seem silly. I have a huge text file in the Devanagari script (nearly 70k lines). Most of the file is in Devanagari, except for the word tags, which are in English alphanumerics. The file typically contains four types of data. I will show examples of each.
Type 1 -

<तत्-विवरणे>T6

Type 2-

<<ज्ञान-उत्पत्ति>T6-हेतु त्वेन>T6

<<<<देव-आदि>Bs6-स्थान>T6-प्राप्ति>T6-हेतुः>T6

<<<यद्-राज्य>K1-सुख>T6-लोभेन>T6

There can be more variations of Type 2, including longer compounds such as this one:

<<<<<<<अतीत-<न-अन्तर>Tn>K1-अध्याय>K1-अन्त>T6-उक्त>T7-श्लोक>K1-अर्थ>T6-न्यायेन>T6

Type 3-

<शुभ-<न-शुभे>Tn>Di

Type 4-

<ज्ञान-ऐश्वर्य-शक्ति-बल-वीर्य-तेजोभिः>Di

Now typically each line has a combination of these kinds of data. Let me show a few examples.

यदि हि ज्ञेयस्य <देह-आदेः>Bs6 क्षेत्रस्य धर्माः <<सुख-दुःख-मोह-इच्छा>Di-आदयः>Bs6 ज्ञातुः भवन्ति  तर्हि  'ज्ञेयस्य क्षेत्रस्य धर्माः केचित् आत्मनः भवन्ति <अविद्या-अध्यारोपिताः>T3  <<जरा-मरण>Di-आदयः>Bs6 तु न भवन्ति' इति <विशेष-हेतुः>K1 वक्तव्यः

एवम् च सति  <सर्व-क्षेत्रेषु>K1 अपि सतः भगवतः क्षेत्रज्ञस्य ईश्वरस्य <<संसारि त्व-गन्ध>T6-मात्रम्>S अपि न आशङ्क्यम्

What I want to accomplish is to clean the lines so that only Type 1 data is retained in its original form, while the other kinds of data are converted into running text without the angle brackets, hyphens, and English tags. If that were done properly, the two lines above should look like the following:

यदि हि ज्ञेयस्य <देह-आदेः>Bs6 क्षेत्रस्य धर्माः सुख दुःख मोह इच्छा आदयः ज्ञातुः भवन्ति  तर्हि  'ज्ञेयस्य क्षेत्रस्य धर्माः केचित् आत्मनः भवन्ति <अविद्या-अध्यारोपिताः>T3 जरा मरण आदयः तु न भवन्ति' इति <विशेष-हेतुः>K1 वक्तव्यः

एवम् च सति  <सर्व-क्षेत्रेषु>K1 अपि सतः भगवतः क्षेत्रज्ञस्य ईश्वरस्य संसारि त्व गन्ध मात्रम् अपि न आशङ्क्यम्

How could I do this using Python and regex, or alternatively using regex in Notepad++?

What I have tried:

I have worked out some regular expressions that can find the data types in the file. I usually locate Type 2 by using the run of angle brackets as anchors. For Type 3 data, this expression works:

(?<=>)([a-zA-Z0-9]+?)(>)

For Type 4 data, and also for cases of Type 1 and Type 2 data, this expression works:

<(?=(?:[^>]*-){2,})[^>]+>

My initial attempt used these expressions to find the compounds and then strip the hyphens, brackets, and tags from them, but I could not make it work line by line as I read the file, and I could not keep the Type 1 compounds intact. Please suggest a good way to do this. Any help will be appreciated. I am using Python. The logic of the code I tried was:

open file
read line
find type 2,3 and 4 compounds
strip them off the unnecessary items

but with this approach the Type 1 compounds also get edited. Please suggest a workaround, or other methods using Notepad++.

Posted 27-May-21 8:07am

adideva98

Updated 29-May-21 5:28am

v2

Comments

Richard MacCutchan 29-May-21 11:36am

See my suggested solution below.

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Richard MacCutchan · Answer 1 · 2021-05-27T22:48:00

The problem with regex is that it acts on every match it finds and cannot skip others that look similar. So a pattern that removes a single < character will also change all the compounds that have more than one.
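To illustrate the point, here is a small sketch of what a blanket substitution does. The function name naive_clean and the ASCII sample line are made up for the illustration; the sample stands in for a Type 1 compound followed by a Type 2 compound.

```python
import re


def naive_clean(line: str) -> str:
    """Strip every '<', every '>' plus its ASCII tag, and every hyphen.

    Hypothetical helper: it applies the same substitution everywhere,
    so it cannot leave Type 1 compounds alone.
    """
    stripped = re.sub(r"<|>[A-Za-z0-9]*", "", line)
    return stripped.replace("-", " ")


# ASCII stand-ins: "<a-b>T6" plays the role of Type 1,
# "<<c-d>K1-e>T6" plays the role of Type 2.
sample = "<a-b>T6 <<c-d>K1-e>T6"
print(naive_clean(sample))
```

The Type 1 compound loses its brackets and tag along with everything else, which is the behaviour described in the question.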

You have a couple of possibilities. Find the longest pattern that is unique and will not match any of the others, and do the conversion one pattern at a time. After each conversion, save the file and check that the remaining patterns match what you expect. Alternatively, write actual code to parse the text and remove the elements that you do not want.
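A minimal sketch of the parsing option follows. It is not a tested solution for the full 70k-line file, and it rests on assumptions inferred from the examples in the question: a Type 1 compound has one pair of brackets and exactly one hyphen, tags are ASCII letters and digits directly after each ">", and brackets are balanced within a line. The names clean_line, flatten, and is_type1 are hypothetical.

```python
import re

# A closing bracket together with the ASCII tag that follows it.
TAG = re.compile(r">[A-Za-z0-9]*")


def flatten(compound: str) -> str:
    """Drop brackets and tags, and turn hyphens into spaces."""
    text = TAG.sub("", compound)
    return text.replace("<", "").replace("-", " ")


def is_type1(compound: str) -> bool:
    """Assumption: Type 1 means no nesting and exactly one hyphen."""
    return compound.count("<") == 1 and compound.count("-") == 1


def clean_line(line: str) -> str:
    out = []
    i, n = 0, len(line)
    while i < n:
        if line[i] != "<":
            out.append(line[i])
            i += 1
            continue
        # Walk forward to the bracket that closes this top-level compound.
        depth, j = 0, i
        while j < n:
            if line[j] == "<":
                depth += 1
                j += 1
            elif line[j] == ">":
                depth -= 1
                j = TAG.match(line, j).end()  # skip ">" and its tag
                if depth == 0:
                    break
            else:
                j += 1
        compound = line[i:j]
        if depth != 0 or is_type1(compound):
            out.append(compound)  # unbalanced or Type 1: leave untouched
        else:
            out.append(flatten(compound))
        i = j
    return "".join(out)


if __name__ == "__main__":
    for sample in ("<तत्-विवरणे>T6", "<शुभ-<न-शुभे>Tn>Di"):
        print(clean_line(sample))
```

Because the decision is made once per top-level compound, a two-part compound nested inside a longer one is flattened with its parent, while a standalone one is kept. To process a file, call clean_line on each line, opening the file with encoding="utf-8".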