In this program I pass a Wikipedia URL to my text-extraction logic. The page is fetched without trouble, but the "for" loops that run after the extraction take far too long to execute.
The same logic runs quickly in a Python program.

How can I reduce the execution time?


import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextExtraction1 
{
	static TextExtraction1 fj;
	public String toHtmlString(String url) throws IOException 
	{
		StringBuilder sb = new StringBuilder();
		   for(Scanner sc = new Scanner(new URL(url).openStream()); sc.hasNext(); )
		      sb.append(sc.nextLine()).append('\n');
		   return sb.toString();
	}
	
	static int search(String key,String target)
	{
		int count=0;
		Pattern p=Pattern.compile(key);
		Matcher m=p.matcher(target);
		while(m.find()){count++;}
		return count;
	} 

	String extractText(String s) throws IOException
	{
				 
		String h1 = fj.toHtmlString(s); 
        System.out.println("extracted \n\n");
        int i2=0;
        String h2[] = h1.split("\n");
        String html="";
        long start = System.currentTimeMillis();
        
        for(String h3:h2)
        {	//bw.write(h3);bw.newLine();
        		html += h3;
                html += ""; //iu=iu+1;               	
        }
        long end = System.currentTimeMillis();
        System.out.println(++i2+" th loop end in "+(end-start)/1000+" seconds");
        boolean capture = true;
        String filtered_text = "";
        
        String html_text[] = html.split("<");
        String h_text[];//System.out.println("kyhe1");
        
        
        start = System.currentTimeMillis();
        for(String h:html_text)
        {
        	h = "<" + h;
        	h_text = h.split(">");
        	for(String w :h_text)
        	{
        		if(w.length()>0)	{	if(w.substring(0, 1).equals("<")){w +=">";}	}
        		if(search("</script>",w)>0){capture=true;}
        		else if(search("<script",w)>0){capture=false;}
        		else if(capture){filtered_text += w;     filtered_text += "\n";}
        	}
        }
       // System.out.println("kyhe1");
        end = System.currentTimeMillis();
        html_text = filtered_text.split("\n");
        
        System.out.println(++i2+" th loop end in "+(end-start)/1000+" seconds");
        return html_text[0];
	}
	
		
	public static void main(String []args)throws IOException 
	{
		fj = new TextExtraction1();
		System.out.println(fj.extractText("https://en.wikipedia.org/wiki/Varanasi"));
	}
}



The same code in Python is much faster:


import urllib2
import re
import sys
def get_text(f1):                #(f1)
    h1 = f1.read()        #f1.read()
    html = ''                # h3 is a string 
    h2 = h1.split('\n')
    f= open("guru99.txt","w+")
    
    for h3 in h2:
        html += h3
        html += ' '
        
           
    capture = True
    filtered_text = ''
    html_text = html.split('<')
   
    i=0
    for h in html_text:
        h = '<' + h
        h_text = h.split('>')
        
        for w in h_text:           
            if w:
                if w[0] == '<':
                    w += '>'
                    
            if re.search(r'</script>', w):
                capture = True                
            elif re.search(r'<script', w):
                capture = False                
            else:
                if capture:
                    filtered_text += w
                    filtered_text += '\n'
   
def get_url_text(url):
    
    try :
        f = urllib2.urlopen(url)
    except (urllib2.HTTPError,urllib2.URLError) :
        return '\n'
    else:
        return get_text(f)
def main():
    get_url_text(sys.argv[1])
if __name__ == "__main__": main()


What I have tried:

I just converted the "for" loop into a while loop:


String h3="";int i3=0;
        while(i3<h2.length)
        {	//bw.write(h3);bw.newLine();
        		h3=h2[i3];
        		html += h3;
                html += "";i3++; //iu=iu+1;               	
        }
Posted
Updated 25-Jan-17 9:18am

You should try to optimise the Java code.

The best optimisation can be achieved by avoiding dynamic object creation inside loops.

An example:
Python
# Python
if w:
    if w[0] == '<':
        w += '>'

Java
// Java
if(w.length()>0)
{	
    if(w.substring(0, 1).equals("<"))
    {
        w +=">";
    }	
}

Here, substring() creates a new String object dynamically and then performs a string comparison.
Why not just use String.charAt() and perform a character comparison instead?
Java
if(w.length()>0)
{	
    if(w.charAt(0) == '<')
    {
        w += ">";
    }	
}

Another optimisation might be to store the regex search Patterns in class or static members, so that Pattern.compile() does not have to be executed on every call to search().
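
A minimal sketch of that idea, assuming the two keys stay fixed as in the question ("<script" and "</script>"). The class and method names (ScriptTagMatcher, opensScript, closesScript) are made up for illustration and do not appear in the original program:

```java
import java.util.regex.Pattern;

public class ScriptTagMatcher {
    // Compiled once when the class is loaded, then reused for every fragment.
    private static final Pattern SCRIPT_OPEN = Pattern.compile("<script");
    private static final Pattern SCRIPT_CLOSE = Pattern.compile("</script>");

    // True if the fragment contains the start of a script tag.
    static boolean opensScript(String w) {
        return SCRIPT_OPEN.matcher(w).find();
    }

    // True if the fragment contains a closing script tag.
    static boolean closesScript(String w) {
        return SCRIPT_CLOSE.matcher(w).find();
    }

    public static void main(String[] args) {
        System.out.println(opensScript("<script type=\"text/javascript\">"));
        System.out.println(closesScript("</script>"));
        System.out.println(opensScript("<p>"));
    }
}
```

Only a yes/no answer is needed in the loop, so a single find() replaces counting all matches as search() does.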
 
 
Comments
Afzaal Ahmad Zeeshan 25-Jan-17 15:51pm    
Not sure, and too lazy to Google, but can't you apply indexers in Java to get the character at that index in String objects? Just curious. :-)
Jochen Arndt 26-Jan-17 2:57am    
I would have to Google it too.

An indexer would be better when iterating over the characters because it avoids the bounds checking. Here it would require splitting the loop in two, one part to process characters and one to process substrings.
Afzaal Ahmad Zeeshan 26-Jan-17 6:58am    
Yup, I also looked around and found only charAt available in Java API, whereas they could write an interface that allows such possibility.

Your answer was good, and my comment was just another quick question to you only, nothing about the post. 5ed for that. :-)

There is a tool that shows you where a program spends its time: it is called a profiler.
See "Profiling (computer programming)" on Wikipedia.
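
Short of a real profiler, a rough timing of the suspect code can already point in the right direction. The sketch below is only an illustration: the class name ConcatTiming, both helper methods, and the test data are assumptions, not part of the question's program. It times a += loop against a StringBuilder loop with System.nanoTime(); the numbers will vary from machine to machine.

```java
import java.util.Arrays;

public class ConcatTiming {
    // Joins the parts with String +=, as the question's loops do.
    static String joinWithPlus(String[] parts) {
        String s = "";
        for (String p : parts) {
            s += p; // copies everything accumulated so far on each pass
        }
        return s;
    }

    // Joins the parts by appending to one StringBuilder.
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up input: 5000 identical fragments.
        String[] parts = new String[5000];
        Arrays.fill(parts, "<div class=\"x\">some text</div>");

        long t0 = System.nanoTime();
        String a = joinWithPlus(parts);
        long t1 = System.nanoTime();
        String b = joinWithBuilder(parts);
        long t2 = System.nanoTime();

        System.out.println("+= loop:       " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("StringBuilder: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("same result:   " + a.equals(b));
    }
}
```

Note that the question's own timing divides by 1000 with integer arithmetic, so anything under one second is reported as 0 seconds; printing milliseconds gives a more useful reading.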

You should try to use StringBuilder every time you have to concatenate strings.
Note that
Java
filtered_text += w;     filtered_text += "\n";

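is slower than the following, because every += on a String copies the whole accumulated text, and two statements copy it twice: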
Java
filtered_text += w + "\n";
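
Going one step further, the whole filtering loop can append to a single StringBuilder instead of rebuilding filtered_text on every pass. This is a sketch under a few assumptions: the class name ScriptFilter and the method filter are invented for illustration, and String.contains() is swapped in for the regex search because both keys are plain literals. That substitution is mine, not something either answer proposes.

```java
public class ScriptFilter {
    // Same splitting logic as the question's second loop, but the result
    // is accumulated in a StringBuilder rather than with String +=.
    static String filter(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean capture = true;
        for (String h : html.split("<")) {
            h = "<" + h;
            for (String w : h.split(">")) {
                // Restore the '>' that split() removed from tag fragments.
                if (!w.isEmpty() && w.charAt(0) == '<') {
                    w += ">";
                }
                if (w.contains("</script>")) {
                    capture = true;   // script block ended
                } else if (w.contains("<script")) {
                    capture = false;  // script block started
                } else if (capture) {
                    out.append(w).append('\n');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello</p><script>var x;</script><b>World</b>";
        System.out.print(filter(html));
    }
}
```

The first loop in the question, which joins the lines back into html, can be handled the same way with one StringBuilder and a final toString().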
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


