
Adapting GRML

Toby Jacob Rhodes, 7 Oct 2004
Convert an HTML web page to GRML.

Introduction

This article introduces the process of adapting (or converting) a web page from one markup language to another, specifically an HTML web page to GRML. Two examples are provided. The first demonstrates how to extract hyperlinks from an HTML web page and convert them to GRML. The second does the same with images. Both examples require server-side processing; here, IIS, Active Server Pages (ASP), and Perl are used.

Background

Some experience with ASP and Perl is recommended. Perl's regular expression support is used to extract the hyperlinks and images from the web page. Any server-side scripting environment can do this, including .NET, CGI, or PHP; Perl and ASP are simply the ones used for this article. While Perl is required, the server-side scripting language specifically used is PerlScript. To use PerlScript, download a Perl interpreter; ActivePerl is one that works with IIS. If you have not done so already, read Introducing GRML and Using GRML. These articles explain what GRML is and how it is used.

What is an Adapter?

An adapter is generally defined as...

an object that converts the original interface of a component to another interface.

For the purposes of this article, the definition of an adapter is...

server-side processing or scripting that converts one markup language to another.

This definition describes converting HTML to GRML using ASP. The adapter object is the ASP scripting, the original interface is HTML, and the other interface is GRML.

Whatever is being converted, an adapter needs to read from the original interface and write to the other interface. In other words, an adapter needs an interface reader and an interface writer. An HTML to GRML adapter requires an HTML reader and a GRML writer.
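The reader/writer split can be sketched in a few lines of plain Perl. This is only an illustration, not the adapter used later in this article: the names read_html_links, write_grml, and adapt are hypothetical, the regular expression handles only simple quoted href attributes, and the GRML tags are the ones that appear in the scripts below.

```perl
use strict;
use warnings;

# Hypothetical HTML reader: pulls (link, title) pairs out of an HTML string.
sub read_html_links {
    my ($html) = @_;
    my @links;
    while ($html =~ m{<a\s[^>]*?href\s*=\s*["']([^"']+)["'][^>]*>(.*?)</a>}gis) {
        push @links, { link => $1, title => $2 };
    }
    return @links;
}

# Hypothetical GRML writer: emits a <link> and <title> line per record.
sub write_grml {
    my (@records) = @_;
    my $out = "<result>\n";
    for my $r (@records) {
        $out .= "<link>$r->{link}\n<title>$r->{title}\n";
    }
    return $out . "</result>\n";
}

# The adapter is the reader feeding the writer.
sub adapt {
    my ($html) = @_;
    return write_grml(read_html_links($html));
}

print adapt('<a href="http://example.com/">Example</a>');
```

Keeping the two halves separate means either one could be swapped, for example a different reader for another source markup, without touching the other.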

Adapting hyperlinks and images

HTML does not describe many elements of its content. For example, there is no way to tell one block of text from another by its attributes. However, not all HTML content lacks description: HTML has specific tags for hyperlinks and images.

Using a specific tag to identify content makes it possible to create a script that reads only those tags. When found, the unwanted tag elements are removed, leaving only the content. The script then writes the content in the new format or markup language. This is how it adapts HTML hyperlinks or images to GRML.
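As a rough illustration of the "remove the unwanted tag elements" step, the sketch below reduces the inner HTML of a tag to plain text. The helper name clean_content is hypothetical, and the list of entities handled is deliberately short; it is a sketch under those assumptions rather than a complete HTML cleaner.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a fragment of HTML to its text content.
sub clean_content {
    my ($text) = @_;
    $text =~ s/<[^>]+>//g;      # drop nested tags such as <b> or <font>
    $text =~ s/&nbsp;/ /g;      # decode a few common entities
    $text =~ s/&quot;/"/g;
    $text =~ s/&amp;/&/g;       # decode &amp; last to avoid double decoding
    $text =~ s/\s+/ /g;         # collapse runs of whitespace
    $text =~ s/^ | $//g;        # trim a leading or trailing space
    return $text;
}

print clean_content('<b>News &amp; <font color="red">Events</font></b>'), "\n";
```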

The Hyperlink adapter

The <a href=> tag allows an HTML web browser to identify which text is a hyperlink. This tag is the basis for the hyperlink adapter, which extracts all hyperlinks from an HTML web page and converts them to GRML.

Below is an example of an HTML to GRML hyperlink adapter, using PerlScript:

<%@ Language="PerlScript" %>
<HTML>
<center>
<form action=links.asp>
URL to extract: <input type=text name=url1 length=60>
<input type=submit>
</form>
</center>
<!--
<grml>
<edit url1>
<title>Enter URL:
</edit>
<%
use HTML::LinkExtor;
use URI::URL;
use LWP;

my ($url, $html);

# Parsing the Request
$url = $Request->QueryString("url1")->Item();

$Response->Write("<submit>\n");
$Response->Write("<location>GRMLBrowser.com/links.asp\n");
$Response->Write("</submit>\n");
$Response->Write("<edit url1>\n");
$Response->Write("<text>$url\n");
$Response->Write("</edit>\n");

if ($url eq "")
{
        $Response->Write("</GRML>\n");
}
else
{
        if ($url !~ m{^https?://}i)   # add a scheme only when none was given
        {
            $url = "http://". $url;
        }
}

# Constructing the Request

# Retrieving the Response/Resultset
#    - Filtering the Resultset (optional)
my $ua = LWP::UserAgent->new(agent => "Mozilla 4.0");
my $request  = HTTP::Request->new('GET', $url);
my $response = $ua->request($request);

unless ($response->is_success)
{
    $Response->Write($response->error_as_HTML . "\n");
    exit(1);
}

my $res = $response->content(); # content without HTTP header

$Response->Write("<column>\n");
$Response->Write("<Title>\n");
$Response->Write("<Request>\n");
$Response->Write("<link>\n");
$Response->Write("</column>\n");

$Response->Write("<result>\n");

$res =~ s/\n/ /gsi;

while($res =~ m|href=(.+?)>(.*?)</A>|gsi)   ## that's all ...
{
    my $temp_link = $1;
    my $temp_item = $2;
    
    $temp_link =~ s/\'//gsi;
    $temp_link =~ s/\"//gsi;
    $temp_link =~ s/ (.*)//gsi;
    $temp_link =~ s/<b>//gsi;
    $temp_link =~ s/<\/b>//gsi;
    $temp_link =~ s/&amp;/\&/gsi;
    $temp_link =~ s/\n(.*)//gsi;
    $temp_item =~ s/<b>//gsi;
    $temp_item =~ s/<\/b>//gsi;
    $temp_item =~ s/<(.+?)>//gsi;
    $temp_item =~ s/<\/font>//gsi;
    $temp_item =~ s/&amp;/\&/gsi;
    $temp_item =~ s/&nbsp;/ /gsi;
    $temp_item =~ s/&quot;/\"/gsi;
    $temp_item =~ s/\n(.*)//gsi;
    $temp_item =~ s/\n/  /gsi;
    $temp_item =~ s/  (.*)//gsi;
    $temp_item =~ s/   (.*)//gsi;
   

    if ($temp_item !~ /img src=/)
    {
        if ($temp_link !~ /\Q$url\E/ && $temp_link !~ /\/\//)   # \Q...\E: match the URL literally
        {
            $temp_link = $url . "\/" . $temp_link;
        }

        $temp_item =~ s/\n//gsi;
        $temp_link =~ s/\n//gsi;

        $Response->Write("<link>$temp_link\n");
        $Response->Write("<title>$temp_item\n");    
    }

    $Response->Write("<request>$url\n");
    $Response->Write("\n\n");
}

$Response->Write("</result>\n");
$Response->Write("</GRML>\n");
%>
-->
</html>

The above code creates an HTML form that accepts a URL, then extracts all the hyperlinks from the web page at that URL. The hyperlinks (and their titles) are formatted using GRML. To view GRML, a GRML web browser is required (such as Pioneer Report MDI).

Most of the server-side scripting acts as the HTML reader. Only the following lines act as the GRML writer:

  • $Response->Write("<column>\n");
  • $Response->Write("<Title>\n");
  • $Response->Write("<Request>\n");
  • $Response->Write("<link>\n");
  • $Response->Write("</column>\n");
  • $Response->Write("<result>\n");
  • $Response->Write("<link>$temp_link\n");
  • $Response->Write("<title>$temp_item\n");
  • $Response->Write("<request>$url\n");
  • $Response->Write("</result>\n");

Only the three lines that write the <link>, <title>, and <request> values format the hyperlinks using GRML. The remaining lines set up the columns and the result block in the browser window of a GRML web browser and do not use the adapted HTML hyperlinks.
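The script above also prepends http:// to the submitted URL and joins relative links to it. Those two steps can be sketched in isolation as follows. The helper names normalize_url and absolutize are hypothetical, and the joining rule is the same simple one the script uses (it treats the base URL as a directory and does not handle paths such as ../); the URI module's new_abs method is a more thorough option for real pages.

```perl
use strict;
use warnings;

# Hypothetical helper: trim the submitted value and add a scheme if missing.
sub normalize_url {
    my ($url) = @_;
    $url = '' unless defined $url;
    $url =~ s/^\s+|\s+$//g;
    return '' if $url eq '';
    $url = "http://$url" unless $url =~ m{^https?://}i;
    return $url;
}

# Hypothetical helper following the script's rule: a link that already
# contains "//" is left alone; anything else is appended to the base URL.
sub absolutize {
    my ($base, $link) = @_;
    return $link if $link =~ m{//};
    $base =~ s{/+$}{};    # avoid a doubled slash at the join
    $link =~ s{^/+}{};
    return "$base/$link";
}

my $base = normalize_url(' example.com ');
print absolutize($base, 'about.html'), "\n";
print absolutize($base, 'http://other.example/x'), "\n";
```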

To see the above in action, go to Hyperlink adapter or copy the above script to a file and host it from a local web server. Once the web page is displayed, enter a URL and press the 'Submit' button. It displays all the hyperlinks extracted from the HTML web page formatted in GRML.

After adapting hyperlinks from HTML to GRML, this is how it appears in a GRML web browser (using Pioneer Report MDI):

The Image adapter

Using the <img src=> tag, a script is able to find and extract images from HTML. By reading this tag and removing unwanted tag elements, the HTML images are converted to GRML. The following script demonstrates this:

<%@ Language="PerlScript" %>
<center> 
<form action=translate.asp> 
URL to translate: <input type=text name=url1 length=60> 
<input type=submit> 
</form> 
</center> 

<!-- 
<grml> 
<edit url1> 
<title>Enter URL: 
</edit> 

<% 
use HTML::LinkExtor; 
use URI::URL; 
use LWP;

my ($url, $html);

# Parsing the Request
$url = $Request->QueryString("url1")->Item();

if ($url eq "")
{
    $Response->Write("</GRML>\n");
}
else
{
    if ($url !~ m{^https?://}i)   # add a scheme only when none was given
    {
        $url = "http://" . $url;
    }
}

$Response->Write("### URL ###\n\n");
$Response->Write("The url is: $url\n\n");

# Constructing the Request

# Retrieving the Response/Results
#    - Filtering the Results (optional)
my $ua = LWP::UserAgent->new(agent => "my agent V1.00");
my $request  = HTTP::Request->new('GET', $url);
my $response = $ua->request($request);

unless ($response->is_success)
{
    $Response->Write($response->error_as_HTML . "\n");
    exit(1);
}

my $res = $response->content(); # content without HTTP header

my @imgs  = ();
my @hrefs = ();

# Make the parser.  Unfortunately, we don't know the base yet
# (it might be different from $url)
my $p = HTML::LinkExtor->new(\&callback);

$p->parse($res);

# Expand all image URLs to absolute ones
my $base = $response->base;

@imgs = map { $_ = url($_, $base)->abs; } @imgs;

$Response->Write("<column>\n"); 
$Response->Write("<image>\n"); 
$Response->Write("<link>\n"); 
$Response->Write("</column>\n\n"); 

$Response->Write("<result>\n"); 

foreach (@imgs)
{
    $Response->Write("<image>$_\n");
}

$Response->Write("\nLinks:\n");

foreach (@hrefs)
{
    my $temp = $_;

    if ($temp !~ /\Q$url\E/ && $temp !~ /\/\//)   # \Q...\E: match the URL literally
    {
        $temp = $url . "\/" . $temp;
    }

    $Response->Write("<link>$temp\n");
}

sub callback
{
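     my($tag, %attr) = @_;

     # Keep only the attribute that holds the target URL for each tag
     push(@imgs , $attr{src})  if $tag eq 'img' && defined $attr{src};
     push(@hrefs, $attr{href}) if $tag eq 'a'   && defined $attr{href};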
}

%>
</result>
</GRML>
-->

The above script is used as an HTML reader, except for the lines used to build the columns and each result. These lines are the GRML writer:

  • $Response->Write("<column>\n");
  • $Response->Write("<image>\n");
  • $Response->Write("<link>\n");
  • $Response->Write("</column>\n\n");
  • $Response->Write("<result>\n");
  • $Response->Write("<image>$_\n");
  • $Response->Write("<link>$temp\n");
  • $Response->Write("\n");
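The image script above relies on HTML::LinkExtor to find the tags. For comparison, a rough regex-only sketch of the same reader/writer idea is shown below. The helper name extract_image_sources is hypothetical, and the pattern is approximate (for example, it would also pick up an attribute such as data-src), so a real parser remains the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the src attribute of every <img> tag,
# whether the value is double-quoted, single-quoted, or unquoted.
sub extract_image_sources {
    my ($html) = @_;
    my @srcs;
    while ($html =~ m{<img\b[^>]*?\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))}gi) {
        push @srcs, defined $1 ? $1 : defined $2 ? $2 : $3;
    }
    return @srcs;
}

# Reader output fed straight into GRML-style <image> lines.
my $html = '<p><img src="a.png" alt="A"><IMG alt=B SRC=b.gif></p>';
print "<result>\n";
print "<image>$_\n" for extract_image_sources($html);
print "</result>\n";
```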

Once the image content has been adapted to GRML, this is how it looks in a GRML web browser (using Pioneer Report MDI):

Conclusion

Converting HTML to GRML is possible when using an adapter. Only content with identifiable tags is adaptable from one markup language to another. In the case of HTML, there are tags to identify hyperlinks and images.

The examples show how to convert HTML hyperlinks or images to GRML. The adapter consists of an HTML reader and a GRML writer. Using this adapter, a web page viewed with an HTML web browser is also viewable using a GRML web browser.

Latest changes

  • 09/03/04
    • Using GRML v1.2 in code samples.
  • 10/08/04
    • Using GRML v2.3 in code samples. Pioneer Report MDI 3.64 uses GRML v1.2 while all other GRML web browsers use v2.3.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Toby Jacob Rhodes
Web Developer
United States
Developing with MFC for a couple of years now. Working at getting my new web browsers just right.
 
My website is at GRML web browsers.
 
Downloads:
Pioneer Report MDI (GRML/CSV/delimited web browsers)
 
 
I enjoy Memphis, TN and it is great coz there are absolutely no major sports teams (well, except for the Grizzlies).

Last Updated 8 Oct 2004
Article Copyright 2004 by Toby Jacob Rhodes