Hello all. This is another question related to Wikipedia. Wikipedia has a feature called a wikilink, which displays one title but links to another article. For example, "[[metro station|station]]" is displayed as "station" but takes you to the article "metro station", and "[[Wyn Jones (rugby union)|Wyn Jones]]" takes you to "Wyn Jones (rugby union)".

I am doing a find-and-replace task using the following regex:
Match m = Regex.Match(ArticleText, @"\[\[[W]yn Jones]]");
if (m.Success) ArticleText = ArticleText.Replace("Wyn Jones", "Wyn Jones (rugby union)");


With the regex above, my tool (AutoWikiBrowser, AWB) replaces only "Wyn Jones" with "Wyn Jones (rugby union)".

What I want to do is convert instances like [[Wyn Jones|W. Jones]] to "[[Wyn Jones (rugby union)]]". For that particular example, I can use something similar to:
Match m = Regex.Match(ArticleText, @"\[\[[W]yn Jones\|W\. Jones\]\]");
if (m.Success) ArticleText = ArticleText.Replace("[[Wyn Jones|W. Jones]]", "[[Wyn Jones (rugby union)]]");


But how do I replace whatever is between "|" and "]]"? In this example it is "W. Jones", but in the next it might be "Wyn J." I think we could use a lookahead such as
abc(?=xyz)
but I am not sure how. Basically, we have to tell the program to look for "[[Wyn Jones" followed by "|" and replace everything up to "]]".

Any feedback will be appreciated a lot.

What I have tried:

The following is the entire code I have written. Note that the first 17 lines of code are built into AWB itself, so my code begins at line 18. I also ran the first module on its own, without the second, and it worked fine; apparently module 2 doesn't have any effect. Module 1 changed [[Wyn Jones]] to [[Wyn Jones (rugby union)]], while [[Wyn Jones (rugby union)]], [[Wyn Jones (rugby union)|Wyn Jones]], and [[Wyn Jones (rugby union)|rugby player]] all remained unchanged. All of this is intended, but [[Wyn Jones|rugby player]] remained unchanged as well. It should have been replaced with [[Wyn Jones (rugby union)|Wyn Jones]].

In very simple words: I want to find every [[Wyn Jones]] and every [[Wyn Jones|(any display text)]], and replace each with "[[Wyn Jones (rugby union)|Wyn Jones]]".

Currently, [[Wyn Jones]] is being handled, but not the piped variants. To achieve that, we first have to find "[[Wyn Jones|", and then tell the program to replace "[[Wyn Jones|", "]]", and whatever comes between the two with "[[Wyn Jones (rugby union)|Wyn Jones]]". To put it in even simpler terms: we have to find term A, and then replace term A, term B, and whatever comes between them with term C.

AWB also supports programs in C#, so this does not have to be done purely in regex; a C# program would work as well.



public string ProcessArticle(string ArticleText, string ArticleTitle, int wikiNamespace, out string Summary, out bool Skip)
{
	Skip = false;
	Summary = "fixed disamb link(s)";

	ArticleText = CustomModule1(ArticleText, ArticleTitle, wikiNamespace, ref Summary, ref Skip);
	ArticleText = CustomModule2(ArticleText, ArticleTitle, wikiNamespace, ref Summary, ref Skip);
//25
	return ArticleText;
}

public string CustomModule1(string ArticleText, string ArticleTitle, int wikiNamespace, ref string Summary, ref bool Skip)
{
	if (!Skip)
	{
		Summary = "fixed disamb link(s)";
//34
		Match m = Regex.Match(ArticleText, @"\[\[[W]yn Jones]]");
		if (m.Success) ArticleText = ArticleText.Replace("[[Wyn Jones]]", "[[Wyn Jones (rugby union)]]");
		else ArticleText += "";
	}
	return ArticleText;
}
//41 apparently custom module 2 doesn't have any effect on this.
public string CustomModule2(string ArticleText, string ArticleTitle, int wikiNamespace, ref string Summary, ref bool Skip)
{
	if (!Skip)
	{
		Summary = "fixed disamb link(s)";
//47
		Match m = Regex.Match(ArticleText, @"\[\[[W]yn Jones\|");
		if (m.Success) ArticleText = Regex.Replace(ArticleText, @"[[Wyn Jones\[[Wyn Jones|, ]]}", "[[Wyn Jones|Wyn Jones (rugby union)]]", RegexOptions.IgnoreCase);
	}
	return ArticleText;
}
Updated 29-Aug-20 13:20pm

1 solution

Try this:
(?<=\[\[Wyn Jones)\|.*?(?=\]\])

That treats the name you want as an "ignored prefix" (a lookbehind) and the terminator as an "ignored suffix" (a lookahead), so only the '|' and the text after it, up to but not including the first "]]", is matched and replaced.
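In C#, the pattern above could be applied with Regex.Replace roughly as follows. This is only a minimal sketch, not AWB-specific code: the class name WikiLinkFixer and both method names are hypothetical, and the second pattern (WholeLink) is an assumed variant, not part of the answer, that also covers the bare [[Wyn Jones]] link.

```csharp
using System.Text.RegularExpressions;

public static class WikiLinkFixer
{
    // Pattern from the answer: matches "|display text" only when it directly
    // follows "[[Wyn Jones" and is followed by "]]".
    private static readonly Regex PipedLabel =
        new Regex(@"(?<=\[\[Wyn Jones)\|.*?(?=\]\])");

    // Assumed variant (not from the answer): matches the whole link,
    // with or without a piped label.
    private static readonly Regex WholeLink =
        new Regex(@"\[\[Wyn Jones(?:\|[^\]|]*)?\]\]");

    public static string FixPipedLinks(string articleText)
    {
        // The lookbehind and lookahead are not consumed, so "[[Wyn Jones"
        // and "]]" stay in place; only the "|label" part is swapped out.
        return PipedLabel.Replace(articleText, " (rugby union)|Wyn Jones");
    }

    public static string FixAllLinks(string articleText)
    {
        // Replaces the entire link, so the full target text is written out.
        return WholeLink.Replace(articleText, "[[Wyn Jones (rugby union)|Wyn Jones]]");
    }
}
```

Links that already point to "Wyn Jones (rugby union)" are left alone by both patterns, because the text after "[[Wyn Jones" is then a space rather than "|" or "]]".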
   
Comments
usernamekiran 29-Aug-20 22:50pm
   
Hi. With your code, I created this:
public string ProcessArticle(string ArticleText, string ArticleTitle, int wikiNamespace, out string Summary, out bool Skip)
	{
		Skip = false;
		Summary = "test";
			Match m = Regex.Match(ArticleText, @"\[\[[W]yn Jones]]");
			if (m.Success) ArticleText = Regex.Replace("(?<=\[\[Wyn Jones)\|.*?(?=\]\])", "[[Wyn Jones (rugby union)]]");
			else ArticleText += "";

		return ArticleText;
	}

But it gives me a "[CS1009] Unrecognized escape sequence" error wherever we used \ to escape. I also tried:
public string ProcessArticle(string ArticleText, string ArticleTitle, int wikiNamespace, out string Summary, out bool Skip)
	{
		Skip = false;
		Summary = "test";

			Match m = Regex.Match(ArticleText, @"\[\[[W]yn Jones]]");
			if (m.Success) ArticleText = ArticleText.Replace("(?<=\[\[Wyn Jones)\|.*?(?=\]\])", "[[Wyn Jones (rugby union)]]");
			else ArticleText += "";            
		return ArticleText;
	}

It gave the same error. Surprisingly, the AWB regex page gives the same method for escaping: https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Regular_expression

What am I doing wrong?
usernamekiran 29-Aug-20 23:40pm
   
I also tried this code (and with "Regex.Replace" as well); still the same error:
public string ProcessArticle(string ArticleText, string ArticleTitle, int wikiNamespace, out string Summary, out bool Skip)
	{
		Skip = false;
		Summary = "test";

			Match m = Regex.Match(ArticleText, @"\[\[[W]yn Jones]]");
			if (m.Success) ArticleText = ArticleText.Replace("(?<=\[\[Wyn Jones)\|.*?(?=\]\])", "[[Wyn Jones (rugby union)]]", RegexOptions.IgnoreCase);												
			else ArticleText += "";

		return ArticleText;
	}
usernamekiran 30-Aug-20 0:01am
   
The following code compiled, but it made no changes at all to the examples mentioned above.
public string ProcessArticle(string ArticleText, string ArticleTitle, int wikiNamespace, out string Summary, out bool Skip)
	{
		Skip = false;
		Summary = "test";
			Match m = Regex.Match(ArticleText, @"\[\[[W]yn Jones]]");
			if (m.Success) ArticleText = ArticleText.Replace(@"(?<=[[Wyn Jones)\|.*?(?=]])", "[[Wyn Jones (rugby union)]]");
			else ArticleText += "";

		return ArticleText;
	}
OriginalGriff 30-Aug-20 2:30am
   
So you didn't bother reading what I said too closely then?
'[', ']', and '|' are all special characters in regular expressions, and as such need to be escaped if you want to match them as text ...
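A short sketch of how the escaping works in C# may help here. The CS1009 error comes from the C# compiler, not from the regex engine: in a regular string literal, sequences such as \[ are not valid C# escapes, so the pattern has to be written as a verbatim string (@"...") or with doubled backslashes. Also, string.Replace treats its first argument as literal text, so a regex pattern needs Regex.Replace, which takes the input text as its first argument. The class and member names below are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class EscapingDemo
{
    // Verbatim string: backslashes reach the regex engine unchanged.
    public const string VerbatimPattern = @"(?<=\[\[Wyn Jones)\|.*?(?=\]\])";

    // Regular string: each backslash must be doubled to avoid CS1009.
    public const string DoubledPattern = "(?<=\\[\\[Wyn Jones)\\|.*?(?=\\]\\])";

    // Regex.Escape turns literal text into a pattern that matches it verbatim.
    public static string EscapedPrefix()
    {
        return Regex.Escape("[[Wyn Jones|");
    }

    // Regex.Replace interprets the pattern as a regular expression.
    public static string ReplaceWithRegex(string text)
    {
        return Regex.Replace(text, VerbatimPattern, "");
    }

    // string.Replace searches for the pattern text literally, so it finds
    // nothing in ordinary article text and returns the input unchanged.
    public static string ReplaceLiteral(string text)
    {
        return text.Replace(VerbatimPattern, "");
    }
}
```

The two pattern constants hold identical text; the verbatim form is usually easier to read when a pattern contains many backslashes.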
OriginalGriff 30-Aug-20 2:32am
   
Do yourself a favour, and get a copy of Expresso - it's free, and it examines and generates regular expressions.
Feed your article text into it as the text data, and try the replace strings.
See what happens, and it can generate C# code you can paste directly into your app when you get it working.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



