I am new to regex, so this question may seem basic. I have a large text file (nearly 70k lines) in the Devanagari script. Almost all of the file is Devanagari; the only exceptions are the word tags, which use English alphanumerics. The file typically contains four types of data, shown in the examples below.
Type 1 -
<तत्-विवरणे>T6

Type 2-
<<ज्ञान-उत्पत्ति>T6-हेतु त्वेन>T6
<<<<देव-आदि>Bs6-स्थान>T6-प्राप्ति>T6-हेतुः>T6
<<<यद्-राज्य>K1-सुख>T6-लोभेन>T6


Type 2 has more variations, including longer compounds such as this one:
<<<<<<<अतीत-<न-अन्तर>Tn>K1-अध्याय>K1-अन्त>T6-उक्त>T7-श्लोक>K1-अर्थ>T6-न्यायेन>T6

Type 3-
<शुभ-<न-शुभे>Tn>Di

Type 4-
<ज्ञान-ऐश्वर्य-शक्ति-बल-वीर्य-तेजोभिः>Di


Each line typically contains a combination of these types. Here are two example lines.
यदि हि ज्ञेयस्य <देह-आदेः>Bs6 क्षेत्रस्य धर्माः <<सुख-दुःख-मोह-इच्छा>Di-आदयः>Bs6 ज्ञातुः भवन्ति  तर्हि  'ज्ञेयस्य क्षेत्रस्य धर्माः केचित् आत्मनः भवन्ति <अविद्या-अध्यारोपिताः>T3  <<जरा-मरण>Di-आदयः>Bs6 तु न भवन्ति' इति <विशेष-हेतुः>K1 वक्तव्यः

एवम् च सति  <सर्व-क्षेत्रेषु>K1 अपि सतः भगवतः क्षेत्रज्ञस्य ईश्वरस्य <<संसारि त्व-गन्ध>T6-मात्रम्>S अपि न आशङ्क्यम्

I want to clean the lines so that only Type 1 data is retained in its original form, while all other types are converted into running text without the angle brackets, hyphens, and English tags (each hyphen becomes a space). Done properly, the two lines above should look like the following:

यदि हि ज्ञेयस्य <देह-आदेः>Bs6 क्षेत्रस्य धर्माः सुख दुःख मोह इच्छा आदयः ज्ञातुः भवन्ति  तर्हि  'ज्ञेयस्य क्षेत्रस्य धर्माः केचित् आत्मनः भवन्ति <अविद्या-अध्यारोपिताः>T3 जरा मरण आदयः तु न भवन्ति' इति <विशेष-हेतुः>K1 वक्तव्यः

एवम् च सति  <सर्व-क्षेत्रेषु>K1 अपि सतः भगवतः क्षेत्रज्ञस्य ईश्वरस्य संसारि त्व गन्ध मात्रम् अपि न आशङ्क्यम्

How can I do this with Python and regex, or alternatively with regex in Notepad++?

What I have tried:

I have worked out some regular expressions that find these data types in the file. I usually locate Type 2 by using the run of angle brackets as an anchor. For Type 3 data, this expression works:

(?<=>)([a-zA-Z0-9]+?)(>)

For Type 4 data, and also in cases of Type 1 and Type 2 data, this expression works:
<(?=(?:[^>]*-){2,})[^>]+>


My initial attempt used these expressions to find the compounds and then strip the hyphens, brackets, and tags from each one. I could not make it work line by line as I read the file, and I could not keep the Type 1 compounds intact. Please suggest a good way to do this; any help will be appreciated. I am using Python. The logic of the code I tried was:

open file
read line
find type 2,3 and 4 compounds
strip them off the unnecessary items

But when I do this, Type 1 compounds also get edited. Please suggest a workaround, or another method using Notepad++.
Updated 29-May-21 5:28am (v2)
Comments
Richard MacCutchan 29-May-21 11:36am
   
See my suggested solution below.

1 solution

The problem with regex is that it acts on every match it finds; it cannot ignore matches that merely look similar to the ones you want. A pattern that removes a single < character, for example, will also alter every compound that has more than one.
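
To make that concrete, here is a small sketch of the naive approach. The function name naive_strip is made up for illustration, the sample line uses ASCII letters as stand-ins for the Devanagari words, and the tag is assumed to be a run of English alphanumerics directly after each >.

```python
import re

def naive_strip(line):
    """Strip every tag, bracket and hyphen, wherever it appears (hypothetical helper)."""
    # Drop each '>' together with the English tag that follows it.
    line = re.sub(r">[A-Za-z0-9]+", "", line)
    # Drop the opening brackets and turn hyphens into spaces.
    return line.replace("<", "").replace("-", " ")

if __name__ == "__main__":
    # "<a-b>T6" stands in for a Type 1 compound, "<<c-d>K1-e>T6" for a Type 2 compound.
    print(naive_strip("<a-b>T6 <<c-d>K1-e>T6"))
```

The substitutions have no notion of which compound a bracket belongs to, so the Type 1 compound is flattened along with the Type 2 one.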

You have a couple of possibilities. The first is to find the longest pattern that identifies one type uniquely and does not match any of the others, then do the conversion one pattern at a time. After each conversion, save the file and check that the remaining patterns match what you expect. The second is to write actual code that parses the text and removes the elements you do not want.
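
The parsing option could look roughly like the sketch below. It is not a tested solution for the full 70k-line file, and it rests on assumptions inferred from the examples in the question: a Type 1 compound is a single, non-nested <...> group with exactly one hyphen; every tag is a run of English alphanumerics directly after a >; and the brackets are balanced. The names find_compounds, flatten and clean_line are invented for this sketch.

```python
import re

TAG = re.compile(r"[A-Za-z0-9]+")
# Assumed shape of Type 1: one flat group, exactly one hyphen, then a tag.
TYPE1 = re.compile(r"<[^<>-]+-[^<>-]+>[A-Za-z0-9]+")

def find_compounds(line):
    """Yield (start, end) for each outermost <...>Tag group in the line."""
    depth, start, i = 0, 0, 0
    while i < len(line):
        ch = line[i]
        if ch == "<":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ">" and depth > 0:
            depth -= 1
            if depth == 0:
                # The tag, if any, belongs to the compound that just closed.
                m = TAG.match(line, i + 1)
                end = m.end() if m else i + 1
                yield start, end
                i = end
                continue
        i += 1

def flatten(compound):
    """Remove tags and brackets, and replace hyphens with spaces."""
    text = re.sub(r">[A-Za-z0-9]*", "", compound)
    return text.replace("<", "").replace("-", " ")

def clean_line(line):
    """Keep Type 1 compounds as they are; flatten every other compound."""
    out, pos = [], 0
    for start, end in find_compounds(line):
        out.append(line[pos:start])
        chunk = line[start:end]
        out.append(chunk if TYPE1.fullmatch(chunk) else flatten(chunk))
        pos = end
    out.append(line[pos:])
    return "".join(out)

if __name__ == "__main__":
    print(clean_line("<तत्-विवरणे>T6 <शुभ-<न-शुभे>Tn>Di"))
```

Tracking bracket depth is what lets the code treat each outermost compound as one unit, so the Type 1 check is made on the whole compound before anything is stripped. Text outside the compounds, including whitespace, is passed through unchanged. To process the file, call clean_line on each line as it is read and write the result to a new file.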
   

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



