I am new to regex, so this question might seem basic. I have a huge text file in the Devanagari script (nearly 70k lines). Most of the file is in Devanagari, except for the word tags, which are in English alphanumerics. The file typically contains four types of data; examples of each follow.
Type 1 -
<तत्-विवरणे>T6
Type 2 -
<<ज्ञान-उत्पत्ति>T6-हेतु त्वेन>T6
<<<<देव-आदि>Bs6-स्थान>T6-प्राप्ति>T6-हेतुः>T6
<<<यद्-राज्य>K1-सुख>T6-लोभेन>T6
Type 2 has further variations, since the compounds can be longer and more deeply nested, for example:
<<<<<<<अतीत-<न-अन्तर>Tn>K1-अध्याय>K1-अन्त>T6-उक्त>T7-श्लोक>K1-अर्थ>T6-न्यायेन>T6
Type 3 -
<शुभ-<न-शुभे>Tn>Di
Type 4 -
<ज्ञान-ऐश्वर्य-शक्ति-बल-वीर्य-तेजोभिः>Di
Each line typically contains a combination of these types. Here are two examples.
यदि हि ज्ञेयस्य <देह-आदेः>Bs6 क्षेत्रस्य धर्माः <<सुख-दुःख-मोह-इच्छा>Di-आदयः>Bs6 ज्ञातुः भवन्ति तर्हि 'ज्ञेयस्य क्षेत्रस्य धर्माः केचित् आत्मनः भवन्ति <अविद्या-अध्यारोपिताः>T3 <<जरा-मरण>Di-आदयः>Bs6 तु न भवन्ति' इति <विशेष-हेतुः>K1 वक्तव्यः
एवम् च सति <सर्व-क्षेत्रेषु>K1 अपि सतः भगवतः क्षेत्रज्ञस्य ईश्वरस्य <<संसारि त्व-गन्ध>T6-मात्रम्>S अपि न आशङ्क्यम्
What I want is to clean the lines so that only Type 1 data is retained in its original form, while all other types are converted into running text, with the angle brackets and English tags removed and the hyphens replaced by spaces. Done properly, the two lines above should look like the following:
यदि हि ज्ञेयस्य <देह-आदेः>Bs6 क्षेत्रस्य धर्माः सुख दुःख मोह इच्छा आदयः ज्ञातुः भवन्ति तर्हि 'ज्ञेयस्य क्षेत्रस्य धर्माः केचित् आत्मनः भवन्ति <अविद्या-अध्यारोपिताः>T3 जरा मरण आदयः तु न भवन्ति' इति <विशेष-हेतुः>K1 वक्तव्यः
एवम् च सति <सर्व-क्षेत्रेषु>K1 अपि सतः भगवतः क्षेत्रज्ञस्य ईश्वरस्य संसारि त्व गन्ध मात्रम् अपि न आशङ्क्यम्
How can I do this with Python and regex, or alternatively with regex in Notepad++?
What I have tried:
I have worked out some regular expressions that find these types in the file. I usually locate Type 2 by using the run of opening angle brackets as an anchor. For Type 3 data, this expression works:
(?<=>)([a-zA-Z0-9]+?)(>)
For Type 4 data, and also in cases of Type 1 and Type 2 data, this expression works:
<(?=(?:[^>]*-){2,})[^>]+>
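A quick way to see what each expression actually picks up is to run both over the sample compounds and print the matched text. The snippet below is only a sketch for inspection; the names `TAG_BEFORE_CLOSE`, `MULTI_HYPHEN`, `samples` and `show_matches` are made up for illustration and are not part of any library.

```python
import re

# The two expressions from the question, copied unchanged.
TAG_BEFORE_CLOSE = re.compile(r"(?<=>)([a-zA-Z0-9]+?)(>)")
MULTI_HYPHEN = re.compile(r"<(?=(?:[^>]*-){2,})[^>]+>")

# Sample compounds taken from the Type 3 and Type 4 examples above.
samples = {
    "type 3": "<शुभ-<न-शुभे>Tn>Di",
    "type 4": "<ज्ञान-ऐश्वर्य-शक्ति-बल-वीर्य-तेजोभिः>Di",
}


def show_matches(pattern, text):
    """Return the full text of every match of `pattern` in `text`."""
    return [m.group(0) for m in pattern.finditer(text)]


for label, text in samples.items():
    print(label)
    print("  tag before '>':", show_matches(TAG_BEFORE_CLOSE, text))
    print("  two or more hyphens:", show_matches(MULTI_HYPHEN, text))
```

Printing the whole match, rather than the capture groups, makes it easier to see exactly which part of a compound each expression covers.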
My initial attempt was to use these expressions to find the compounds and then strip the hyphens, brackets, and tags from them, but I could not make it work line by line as I read the file, and I could not keep the Type 1 compounds intact. Please suggest a good way to do this. Any help will be appreciated. I am using Python. The logic of the code I tried was:
open file
read line
find type 2,3 and 4 compounds
strip them off the unnecessary items
With this approach, however, the Type 1 compounds also get edited. Please suggest a workaround, or another method using Notepad++.
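The steps above could be sketched in Python so that Type 1 compounds are skipped rather than stripped. This is a minimal sketch, not a tested solution for the whole file, and it rests on several assumptions: a Type 1 compound is taken to be a top-level `<...>` group with no nested brackets and exactly one hyphen; tags are ASCII letters and digits directly after a `>`; and brackets are balanced within a line. Because Python's built-in `re` module has no recursive patterns, the nesting is tracked with a simple depth counter instead of one big regex. The names `clean_line` and `TAG` are made up for illustration.

```python
import re

# Assumption: a tag is a run of ASCII letters/digits right after a '>'.
TAG = re.compile(r"[A-Za-z0-9]+")


def clean_line(line):
    """Keep Type 1 compounds as they are; flatten every other compound."""
    out = []
    i, n = 0, len(line)
    while i < n:
        if line[i] != "<":
            out.append(line[i])
            i += 1
            continue
        # Walk forward to the '>' that closes this top-level '<'.
        depth, j = 0, i
        while j < n:
            if line[j] == "<":
                depth += 1
            elif line[j] == ">":
                depth -= 1
                if depth == 0:
                    break
            j += 1
        if j >= n:
            # Unbalanced brackets: leave the rest of the line untouched.
            out.append(line[i:])
            break
        tag = TAG.match(line, j + 1)
        end = tag.end() if tag else j + 1
        compound = line[i:end]
        inner = line[i + 1:j]
        if "<" not in inner and inner.count("-") == 1:
            # Assumed Type 1: flat and exactly one hyphen, so keep it.
            out.append(compound)
        else:
            # Types 2, 3 and 4: drop '>TAG' and '<', turn hyphens into spaces.
            flat = re.sub(r">[A-Za-z0-9]*", "", compound)
            out.append(flat.replace("<", "").replace("-", " "))
        i = end
    return "".join(out)


print(clean_line("<तत्-विवरणे>T6"))
print(clean_line("<शुभ-<न-शुभे>Tn>Di"))
```

Under these assumptions the function can be applied to each line while reading the file. If Type 1 needs a different definition (for example, only certain tags), the single `if` condition is the place to change it.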