|
|
Many thanks for the pointer.
I have now solved the issue. It turned out that the decode routine on the server had a bug when handling certain values. This has now been resolved and all works well.
|
|
|
|
|
With so many popup blockers in use, some sites are able to detect that a popup blocker is active and ask us to deactivate it.
Some sites are even able to override the blocker and show a popup. Experts-Exchange (without logging on) is able to show a popup. Rediff.com too.
Any clue on what JavaScript trick is being used?
Vasudevan Deepak Kumar
Personal Web: http://www.lavanyadeepak.tk/
I Blog At: http://deepak.blogdrive.com/
|
|
|
|
|
Dear all,
Does anyone know how to extract text nodes from a web page? By text nodes I mean "free text", i.e. text without tags.
By the way, I am using the web browser control.
Thank you
llp00na
|
|
|
|
|
Just use the innerText property of the body element. You don't say what language you use so it's hard to be any more specific.
Steve
|
|
|
|
|
oh sorry,
I am using Visual C++
llp00na
|
|
|
|
|
I am also confused! How is a text node, i.e. some text without a tag, represented in the DOM?
llp00na
|
|
|
|
|
please consider the following node:
<p> this is some text </p>
innerText will get me "this is some text", but it does not help decide whether the node is a text node or not. In this case it is not, as it has a tag <p>.
Your suggestion works in cases similar to:
<td>
some text
<a> first link </a>
<a> second link </a>
</td>
In this example "some text" is a text node. Clearly it's a child node of <td> and has no tag.
I hope I understand the problem correctly, because it's confusing me. Do you have any views on that?
llp00na
-- modified at 5:39 Thursday 2nd March, 2006
|
|
|
|
|
Try this:
--------
<html>
<head>
<title>Test</title>
<script>
function Loaded()
{
alert(TheBody.innerText);
}
</script>
</head>
<body id="TheBody" onload="Loaded();">
The quick <b>brown</b> fox <b><i>jumps</i></b> over the lazy fox.
</body>
</html>
--------
This HTML will get all the text - even text inside child tags. Is this what you want?
Steve
|
|
|
|
|
Thanks for your reply, Steve.
Unfortunately it's not what I am trying to do.
I am trying to assess whether an HTML element is a text node (an element without a tag), but I don't know how to achieve that.
llp00na
|
|
|
|
|
As I understand it, an HTML element is never a text node, although it can contain text nodes. For example the following HTML:
<p>Text</p>
Could be viewed as follows:
<p><text_node>Text</text_node></p>
Where <text_node> indicates a text node.
Another example. The following HTML:
<p><b>T</b>ext</p>
Could be viewed as follows:
<p><b><text_node>T</text_node></b><text_node>ext</text_node></p>
Steve
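The view described above can be sketched in portable C++. This is only an illustration: `Node`, `Text`, `Element` and `Dump` are hypothetical names invented for the sketch and are not part of MSHTML. It builds the tree for `<p><b>T</b>ext</p>` by hand and prints it in the `<text_node>` notation used in the post.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM node; not an MSHTML type.
struct Node {
    bool isText;                 // true for a text node, false for an element
    std::string value;           // tag name for elements, characters for text nodes
    std::vector<Node> children;  // always empty for text nodes
};

// Helper: build a text node.
Node Text(const std::string& s) { return Node{true, s, {}}; }

// Helper: build an element node with the given children.
Node Element(const std::string& tag, std::vector<Node> kids) {
    return Node{false, tag, std::move(kids)};
}

// Serialise the tree, wrapping every text node in <text_node>...</text_node>.
std::string Dump(const Node& n) {
    if (n.isText) return "<text_node>" + n.value + "</text_node>";
    std::string out = "<" + n.value + ">";
    for (const Node& child : n.children) out += Dump(child);
    return out + "</" + n.value + ">";
}
```

Calling `Dump(Element("p", {Element("b", {Text("T")}), Text("ext")}))` yields the second representation shown in the post: the "T" sits under `b`, while "ext" is a direct child of `p`.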
|
|
|
|
|
I see, so are you implying that any HTML element should always have a tag?
In cases where the user does not specify a tag, one gets assigned by the web browser!
Okay, in your representation
<p><b><text_node>T</text_node></b><text_node>ext</text_node></p>
(T) has two tags, "b" and "text_node". The second element (ext) has only one tag, "text_node". Now is there any way to spot this difference? Because that's what I am trying to do.
Thank you
llp00na
-- modified at 7:14 Thursday 2nd March, 2006
|
|
|
|
|
If you use the "normal" interfaces such as IHTMLElement you can't see this directly. If you use the IHTMLDOMNode family of interfaces, the structure I mentioned previously can be seen. Text nodes will implement the IHTMLDOMTextNode interface.
Steve
|
|
|
|
|
Thanks Steve.
I will look into that. Is there any way to get the IHTMLDOMTextNode from the IHTMLElement interface?
llp00na
|
|
|
|
|
Thanks for your valuable advice, Steve.
I think I can use IHTMLElement4::getAttributeNode to get IHTMLDOMAttribute2. The nodeType of this interface will allow me to determine the type of the node, i.e. text node or element node.
But I am not quite sure what they mean by text node and element node!
Do you have any idea?
llp00na
-- modified at 11:28 Thursday 2nd March, 2006
|
|
|
|
|
I don't believe you can get an IHTMLDOMTextNode from an IHTMLElement interface because the basic philosophy is different. The IHTMLElement family doesn't include text nodes but exposes the text info using the IHTMLElement::get_innerText method. The IHTMLDOMTextNode exposes text nodes W3C style. You can get an IHTMLDOMNode from an IHTMLElement however, by calling QueryInterface I believe. Can you explain your problem in more detail? Perhaps then I can give some concrete advice.
Steve
|
|
|
|
|
Thanks for your reply and infinite patience.
I am parsing an HTML DOM tree and trying to extract text with no tags (let's call it a text node or free text).
The way I approached this problem is as follows:
1. I have embedded an Internet Explorer web browser (the usual way, by adding a Microsoft Web Browser control to my MFC application).
2. I am retrieving the IHTMLDocument2 interface from the IWebBrowser2 interface.
3. I have then retrieved the body element from the IHTMLDocument2.
4. I am looping through the HTML elements of the body and assessing whether a specific element is a text node (free text) or not.
What do you think?
Thank you again, Steve.
llp00na
|
|
|
|
|
I'm not quite sure what you mean - all text will be inside some tag. Perhaps it will be easier if you describe the desired effect by example. ie.
<html>
<head>
<title>The title</title>
</head>
<body>
Some text.
<p>Some text.</p>
<p><b>S</b>ome <b>t</b>ext.</p>
<span>Some <b>text</b></span>
</body>
</html>
What text should be returned in this case? Or perhaps you have a better example.
Steve
|
|
|
|
|
thanx steve;
in your example, i should return the following:
<html>
<head>
<title>The title</title>
</head>
<body>
Some text. -------->> Should return this
<p>Some text.</p>
<p>
<b>S</b>
ome ------------->> Should return this
<b>t</b>
ext. ------------>> Should return this
</p>
<span>
Some ----------->> Should return this
<b>text</b>
</span>
</body>
</html>
Hopefully its much clearer now.
llp00na
-- modified at 10:12 Friday 3rd March, 2006
|
|
|
|
|
I'm still a bit confused here. In the example:
<html>
<head>
<title>The title</title>
</head>
<body>
Some text. -------->> Should return this
<p>Some text.</p> <<----- STEVE: We don't return this?
<p>
<b>S</b>
ome ------------->> Should return this <<----- STEVE: But we do return this?
<b>t</b>
ext. ------------>> Should return this
</p>
<span>
Some ----------->> Should return this
<b>text</b>
</span>
</body>
</html>
Can you explain the bits marked with "STEVE"?
For this bit:
<p>Some text.</p> <<----- STEVE: We don't return this?
If I modified it as follows:
<p><b>new bit</b> Some text.</p> <<----- STEVE: We don't return this?
What difference would it make?
Steve
|
|
|
|
|
Thank you steve,
quoted:
"If I modified it as follows:
<p><b>new bit</b> Some text.</p> <<----- STEVE: We don't return this?"
In this case:
<P>
<b> new bit </b>
Some text . ------------->> should return this
</p>
The text I want to return is the one which does not have an explicit tag
in your example:
<P>
<b> new bit </b> ----> represents the first child of the element P
Some text . ----> represents the second child of the element P
</p>
As you can see, child one and child two are siblings right? and they are both children of element/node <P>.
Now, the first child "new bit" has got an explicit tag (being <b>), however the second child "Some text." does not have an explicit tag.
As a general assumption, let us consider X as an element/node of a DOM tree. If X has got MORE than one child, then return all textual children with no explicit tags.
I hope I made the problem understandable and I am sorry for my inaccurate explanations.
llp00na
-- modified at 11:41 Saturday 4th March, 2006
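The rule stated above (if X has more than one child, return its textual children that have no explicit tag) can be sketched as a recursive walk. This is a portable illustration under stated assumptions, not MSHTML code: `Node`, `Text`, `Element` and `CollectFreeText` are hypothetical names, and the tree is built by hand rather than parsed from HTML.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DOM node; not an MSHTML type.
struct Node {
    bool isText;                 // true for a text node, false for an element
    std::string value;           // tag name for elements, characters for text nodes
    std::vector<Node> children;  // always empty for text nodes
};

Node Text(const std::string& s) { return Node{true, s, {}}; }

Node Element(const std::string& tag, std::vector<Node> kids) {
    return Node{false, tag, std::move(kids)};
}

// Appends to `out` every text child whose parent has more than one child,
// then recurses into the element children so nested cases are covered too.
void CollectFreeText(const Node& parent, std::vector<std::string>& out) {
    for (const Node& child : parent.children) {
        if (child.isText) {
            if (parent.children.size() > 1) out.push_back(child.value);
        } else {
            CollectFreeText(child, out);
        }
    }
}
```

Applied to the body from the earlier example, the walk collects "Some text.", "ome ", "ext." and "Some ", and skips the text inside `<p>Some text.</p>` because that `<p>` has a single child. A real parser may also produce whitespace-only text nodes, which this sketch does not filter.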
|
|
|
|
|
Ok, what about this case:
-------------------------
<html>
<head>
<title>The title</title>
</head>
<body>
Some text.
<p>Here is <b><i>s</i>ome</b> text.</p> <--- Do we return the text "ome"?
<p>
<b>S</b>
ome
<b>t</b>
ext.
</p>
<span>
Some
<b>text</b>
</span>
</body>
</html>
Steve
-- modified at 21:01 Saturday 4th March, 2006
|
|
|
|
|
Thanx steve:
<body>
Some text.
<p>
Here is -----> should return this
<b>
<i>s</i>
ome ------> should return this
</b>
text. -------> should return this as well
</p> <--- Do we return the text "ome"?
<p>
<b>S</b>
ome -----> return this
<b>t</b>
ext. ----> return this
</p>
<span>
Some ---> return this
<b>text</b>
</span>
</body>
llp00na
|
|
|
|
|
Here's some code which doesn't do what you want, but contains all the elements you need. You should be able to alter it to suit your needs. It uses straight COM without any smart pointers. Normally I'd use ATL smart pointers but I thought this way it will work "out of the box"; you can alter it to use your favorite smart pointers. The code will be considerably smaller if you alter it to use some kind of smart pointer. Here goes:
-------------
void CMFCWebBrowserDlg::OnButton1()
{
  IDispatch *pDisp = m_Browser.GetDocument();
  if ( pDisp != NULL )
  {
    IHTMLDocument2 *pDoc2;
    HRESULT hr = pDisp->QueryInterface(&pDoc2);
    if ( SUCCEEDED(hr) )
    {
      ASSERT(pDoc2);
      IHTMLElement *pBodyElement;
      hr = pDoc2->get_body(&pBodyElement);
      if ( SUCCEEDED(hr) && pBodyElement )
      {
        // Switch to the W3C-style node view of the body.
        IHTMLDOMNode *pDOMNode;
        hr = pBodyElement->QueryInterface(&pDOMNode);
        if ( SUCCEEDED(hr) )
        {
          ASSERT(pDOMNode);
          IDispatch *pChildrenDisp;
          hr = pDOMNode->get_childNodes(&pChildrenDisp);
          if ( SUCCEEDED(hr) )
          {
            ASSERT(pChildrenDisp);
            IHTMLDOMChildrenCollection *pChildrenCollection;
            hr = pChildrenDisp->QueryInterface(&pChildrenCollection);
            if ( SUCCEEDED(hr) )
            {
              ASSERT(pChildrenCollection);
              long NumItems;
              if ( SUCCEEDED(pChildrenCollection->get_length(&NumItems)) )
              {
                // Only the direct children of the body are visited here.
                for ( long i=0; i<NumItems; ++i )
                {
                  IDispatch *pItemDisp;
                  hr = pChildrenCollection->item(i, &pItemDisp);
                  if ( SUCCEEDED(hr) && pItemDisp!=NULL )
                  {
                    // Is it a text node?
                    IHTMLDOMTextNode *pTextNode;
                    hr = pItemDisp->QueryInterface(&pTextNode);
                    if ( SUCCEEDED(hr) )
                    {
                      // It's a text node so get the text.
                      BSTR bstrText;
                      hr = pTextNode->get_data(&bstrText);
                      if ( SUCCEEDED(hr) )
                      {
                        MessageBoxW(m_hWnd, bstrText, L"Text node", MB_OK);
                        SysFreeString(bstrText);
                      }
                      // Release the reference obtained by QueryInterface.
                      pTextNode->Release();
                    }
                    else
                    {
                      // It's not a text node.
                      IHTMLElement *pElem;
                      hr = pItemDisp->QueryInterface(&pElem);
                      if ( SUCCEEDED(hr) )
                      {
                        ASSERT(pElem);
                        BSTR bstrOuter;
                        hr = pElem->get_outerHTML(&bstrOuter);
                        if ( SUCCEEDED(hr) )
                        {
                          MessageBoxW(m_hWnd, bstrOuter, L"Element", MB_OK);
                          SysFreeString(bstrOuter);
                        }
                        pElem->Release();
                      }
                    }
                    pItemDisp->Release();
                  }
                }
              }
              pChildrenCollection->Release();
            }
            pChildrenDisp->Release();
          }
          pDOMNode->Release();
        }
        pBodyElement->Release();
      }
      pDoc2->Release();
    }
    pDisp->Release();
  }
}
-------------
Steve
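The smart-pointer remark above can be illustrated with a small portable sketch. The assumptions are loud ones: `FakeUnknown` and `ScopedRef` are hypothetical names invented here, `FakeUnknown` only counts references and is not a real COM object, and `ScopedRef` is far simpler than ATL's `CComPtr`. The point is just that the destructor performs the `Release()` call, so the manual calls at the end of each block are no longer needed.

```cpp
// Hypothetical stand-in for a COM object: it only counts references.
struct FakeUnknown {
    long refs = 1;                     // starts at 1, like an object handed to its caller
    long AddRef()  { return ++refs; }
    long Release() { return --refs; }  // a real COM object would delete itself at 0
};

// Minimal owning wrapper: releases the held reference when it goes out of scope.
template <typename T>
class ScopedRef {
public:
    explicit ScopedRef(T* p) : p_(p) {}       // adopts an existing reference, no AddRef
    ~ScopedRef() { if (p_) p_->Release(); }   // the one place Release is called

    // Copying is disabled so the reference cannot be released twice.
    ScopedRef(const ScopedRef&) = delete;
    ScopedRef& operator=(const ScopedRef&) = delete;

    T* operator->() const { return p_; }
    T* get() const { return p_; }

private:
    T* p_;
};
```

With a wrapper of this kind, a pointer obtained from a call such as QueryInterface would be handed to the wrapper once, and every early exit from the enclosing block would release it automatically.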
|
|
|
|
|
thanx very much steve,
you have been really very helpful
llp00na
|
|
|
|