Web services are still young and maturing, but they will likely play an important role in the future of distributed computing. In this paper, I introduce a tool that lets you search for web services through UDDI registry servers. The point of the tool is to deliver business value, which here means easy discovery of web services and the ability to view their WSDL as a tree.
Before building this tool, I worked through MSDN and CodeProject to learn how to discover a web service from UDDI and how to parse WSDL and XML schema. Combined with some of my previous work on text processing, the result is a program with the following features:
- Discover web services via UDDI server.
- N-grams TF-IDF based post-filtering.
- Parsing the Web Services Description Language (WSDL) document and displaying it on a hierarchical tree.
Preparing the ground
If you are not familiar with WSDL, XML schema, XML namespaces and UDDI, please do some background reading to get a feel for the problem.
Web service basics
First, I want to start with a short introduction to web services. A web service consists of a service provider and multiple consumers, following the client-server architecture. Each web service uses a custom communication protocol for clients to access the servers. The most common access pattern for a web service consists of requests and responses. The client sends the server a request message that specifies the operation to be performed and all the information needed to perform it. The server performs the specified operation and replies with a response message. The actions carried out by the server might result in permanent changes to the state of the server.
Essentially, web services provide RPC-like interfaces to the client. For example, MyInfor is a web service that allows users to maintain information such as the names, addresses and phone numbers of their contacts. The MyInfor web service exports operations to insert, delete, replace and query portions of this contact information. Each of these operations takes input parameters (such as the query string) and produces an output (a query response or success status) while making permanent state changes at the server. Each web service provides its own custom interface, which can be vastly different from those provided by other web services. For example, a travel web service would provide operations to search for airfares, reserve and buy tickets, and look up itineraries.
WSDL, XML schema
Before going further, you should have a basic understanding of WSDL, XML namespaces and XML schema. WSDL is a document written in XML. It provides a way for service providers to describe the basic format of web service requests over different protocols. The WSDL specification defines the following major elements of networked XML-based services:
- Types provides data type definitions used to describe the messages exchanged.
- Message represents an abstract definition of the data being transmitted or communicated.
- Operation is an abstract description of an action supported by the service. Each operation refers to an input message and output message.
- Port Type is a set of operations supported by one or more endpoints.
- Binding specifies concrete protocol and data format specification for the operations and messages defined by a particular port type.
- Port specifies an address for a binding or the URL where the web service is listening.
- Service is a collection of related ports.
Typically, information common to a certain category of business services, such as message formats, port types and protocol bindings, is included in the reusable portion, while information pertaining to a particular service endpoint is included in the service implementation definition portion.
XML schema is an XML-based alternative to DTD. An XML schema describes the structure of an XML document, and it is used to validate whether an XML document conforms to the defined structure. In a WSDL, XML schema is used to define the data type structure.
Discovering web services via UDDI server
A UDDI server is a web services registry. It contains the following information:
- Businesses and other service providers.
- The services they expose.
- Binding information (locations) of those services.
- Interfaces supported by those services.
Looking at the UDDI mechanism we find that it is a web service, which exposes information about other web services.
One of the major purposes of UDDI is to provide an API for publishing and retrieving information about web services. The operations are invoked through SOAP calls to the methods that the registry exposes. Common uses of UDDI are:
- Public UDDI registries hosted by large companies such as Microsoft and IBM. Anyone can get an account on these servers and look for a web service to invoke in their own development. Companies that have built web services most likely use these public registries.
- Industry registries that are run separately for performance or security reasons, for example in the chemical or finance sectors.
- Private registries that large companies run on their intranet so that generic building blocks for corporate applications can be exposed throughout the company.
For more details about UDDI, please refer to the UDDI specifications.
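To make the SOAP-based inquiry concrete, here is a minimal Python sketch that only builds (and does not send) a find_service request. It assumes the UDDI version 2 inquiry API (namespace urn:uddi-org:api_v2, generic="2.0"); the helper name build_find_service and the sample keyword are made up for illustration and are not part of the tool.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
UDDI_NS = "urn:uddi-org:api_v2"  # assumption: UDDI version 2 inquiry API


def build_find_service(name_pattern, max_rows=50):
    """Return a SOAP 1.1 envelope (as text) wrapping a UDDI find_service inquiry."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    find = ET.SubElement(
        body,
        f"{{{UDDI_NS}}}find_service",
        {"generic": "2.0", "maxRows": str(max_rows)},
    )
    # The name element carries the (possibly partial) service name to search for.
    name = ET.SubElement(find, f"{{{UDDI_NS}}}name")
    name.text = name_pattern
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_find_service("Weather%")
print(request_xml)
```

The resulting text would be posted to the inquiry URL of the chosen registry. In the tool itself this plumbing is handled by the Microsoft UDDI SDK.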
Searching for the web service(s)
The problem of searching for a web service involves two steps:
- Discover the web service advertisement information from a UDDI server.
- Select the most appropriate web service from the list of potential services.
At the basic level, the UDDI API provides only a simple keyword search on the "web service name" (or TModel name) advertised in UDDI registries. However, some valuable information may not be included in the name: information about a web service is spread across the advertised UDDI description, the description in the web service itself, its methods and so on.
Unfortunately, the result returned by the UDDI server may be huge. Many of the web services found have an access point like "http://localhost/abc...", which is useless information. Users may need to visit hundreds of entry points to find the appropriate services.
In addition to the simple keyword search, the tool now provides a post-filtering query feature to help find a relevant advertisement among the retrieved ones. In this first part of the work, I used the data directly from the UDDI registry and did not use the WSDL files as a source of evidence for the search. That is a must for the next step, because these XML files carry all the information needed to describe a web service. The task is made easier by the availability of a local repository of service advertisement information and descriptions.
The filtering algorithm is based on the vector space model proposed by Salton. The main idea is that the documents and the query are each represented as a K-dimensional vector, where K is the number of distinct words extracted from the document collection. Each word is assigned a weight that reflects its importance within the document. The weight is calculated from the word's frequency in the document (TF) and its distribution across the collection of documents (IDF). The idea behind IDF weighting is that words used frequently throughout the collection say little about any single document, so they receive lower weights than rarer words. The similarity of two documents is the cosine of the angle between their TF-IDF vectors. This value lies between 0 and 1 and is used to rank the search results. For more detail about TF-IDF, please refer to the term-weighting references listed at the end.
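The ranking idea can be sketched in a few lines of Python. This is a minimal illustration, not the tool's code: it tokenizes on whitespace, uses a smoothed IDF variant (log((1 + N) / (1 + df)) + 1) chosen here so that terms occurring in every document keep a small weight, and all function names and sample descriptions are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stop word removal)."""
    return text.lower().split()


def tf_idf_vectors(token_lists):
    """Return one {term: weight} dictionary per document."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, descriptions):
    """Sort descriptions by decreasing similarity to the query."""
    token_lists = [tokenize(d) for d in descriptions] + [tokenize(query)]
    vectors = tf_idf_vectors(token_lists)
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, vec), desc)
              for vec, desc in zip(vectors, descriptions)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


descriptions = [
    "weather forecast service",
    "stock quote service",
    "weather and climate data",
]
ranked = rank("weather forecast", descriptions)
for score, description in ranked:
    print(f"{score:.3f}  {description}")
```

Because all weights are non-negative, the cosine stays between 0 and 1, and a description that shares no term with the query scores exactly 0.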
Here, I treat the terms in the UDDI description of each retrieved service advertisement (the concatenation of all advertised information about a web service) as a bag of words, and use the TF-IDF measure to compute the similarity between two such bags. The pre-processing step includes word stemming (suffix removal) and stop word removal (removing frequent, insignificant words), which can help improve the accuracy of the ranking. Because the descriptions are highly compact, I decided to use the n-gram method to extract the vocabulary (the distinct terms). N-grams are a language-neutral representation: they work better for languages other than English, where rule-based stemming algorithms (such as Porter stemming for English) have not been shown to work well. N-grams have also been shown to work well for short text matching, including spelling errors, acronyms and name matching. The disadvantages of an n-gram based token system are that it runs slower and uses more disk space than a word-based token system. Typical n-gram lengths are 2 or 3, but for some documents 6 or 7 is used.
After the pre-processing step, each description is split into n-grams rather than into words as in other information retrieval systems. In my preliminary evaluation, the tri-gram (n = 3) and quad-gram (n = 4) based systems returned better results than the word-based token system.
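As a small illustration of n-gram tokenization, the Python sketch below splits a string into overlapping character n-grams. It assumes character-level n-grams (which fits the tolerance to spelling errors described above); the function names are hypothetical and the normalization is deliberately simple.

```python
def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams after basic normalization."""
    normalized = " ".join(text.lower().split())  # lowercase, collapse whitespace
    if len(normalized) <= n:
        return [normalized] if normalized else []
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def shared_ngrams(a, b, n=3):
    """Return the set of n-grams that two strings have in common."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


print(char_ngrams("Weather"))
print(sorted(shared_ngrams("weather", "wheather")))
```

A misspelled word still shares most of its tri-grams with the correct spelling, whereas a word-based system would treat the two as unrelated terms. The resulting n-grams can be fed to a TF-IDF weighting step in place of words.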
Parsing the web service description (WSDL)
Each web service has an associated service description (WSDL) that describes its abstract interface and the concrete implementation functionality. The service description will be parsed for all major content elements like the type definitions, elements, operations etc. These elements will be modeled on a tree.
The application starts parsing at the implementation level of the WSDL (service, port, binding) and works up to the interface level (portType, operation, message). The information in the service, port and binding elements is used to determine the style of the parsed messages.
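The parsing order described above can be sketched with Python's standard XML parser. The embedded WSDL 1.1 document, its names and the function walk_wsdl are made up for illustration; the sketch also resolves QNames by simply dropping the prefix, which a real parser should not do.

```python
import xml.etree.ElementTree as ET

NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "soap": "http://schemas.xmlsoap.org/wsdl/soap/",
}

SAMPLE_WSDL = """<definitions name="HelloService" targetNamespace="urn:example:hello"
    xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:tns="urn:example:hello">
  <message name="SayHelloRequest"><part name="firstName" type="xsd:string"/></message>
  <message name="SayHelloResponse"><part name="greeting" type="xsd:string"/></message>
  <portType name="HelloPortType">
    <operation name="sayHello">
      <input message="tns:SayHelloRequest"/>
      <output message="tns:SayHelloResponse"/>
    </operation>
  </portType>
  <binding name="HelloBinding" type="tns:HelloPortType">
    <soap:binding style="rpc" transport="http://schemas.xmlsoap.org/soap/http"/>
  </binding>
  <service name="HelloService">
    <port name="HelloPort" binding="tns:HelloBinding">
      <soap:address location="http://example.com/hello"/>
    </port>
  </service>
</definitions>"""


def local_name(qname):
    """Drop the prefix of a QName such as 'tns:HelloBinding' (simplification)."""
    return qname.split(":")[-1]


def walk_wsdl(wsdl_text):
    """Walk service -> port -> binding -> portType and collect operation names."""
    root = ET.fromstring(wsdl_text)
    bindings = {b.get("name"): b for b in root.findall("wsdl:binding", NS)}
    port_types = {p.get("name"): p for p in root.findall("wsdl:portType", NS)}
    found = []
    for service in root.findall("wsdl:service", NS):
        for port in service.findall("wsdl:port", NS):
            binding = bindings[local_name(port.get("binding"))]
            port_type = port_types[local_name(binding.get("type"))]
            soap_binding = binding.find("soap:binding", NS)
            # WSDL 1.1 treats a missing style attribute as "document".
            style = soap_binding.get("style", "document") if soap_binding is not None else None
            operations = [op.get("name") for op in port_type.findall("wsdl:operation", NS)]
            found.append((service.get("name"), port.get("name"),
                          binding.get("name"), style, port_type.get("name"), operations))
    return found


for entry in walk_wsdl(SAMPLE_WSDL):
    print(entry)
```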
A web service node contains the PortType collection. A PortType (like a class) corresponds to a set of one or more operations, where an operation (like a method) defines a specific input/output that must correspond to the name of a message defined earlier in the WSDL document. If an operation specifies just an input, it is a one-way operation. An input followed by an output is a request-response operation, an output followed by an input is a solicit-response operation, and a single output is a notification.
The input/output parameters of an operation use a message as their type. An operation may have an input, an output, or both.
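The four WSDL 1.1 operation kinds depend only on which of input and output are present and in what order. A minimal Python sketch follows; the function name and the list-of-strings representation are assumptions made for illustration.

```python
def operation_kind(children):
    """Classify a WSDL 1.1 operation from the order of its input/output children.

    children lists the child element names in document order,
    for example ["input", "output"].
    """
    order = tuple(name for name in children if name in ("input", "output"))
    kinds = {
        ("input",): "one-way",
        ("input", "output"): "request-response",
        ("output", "input"): "solicit-response",
        ("output",): "notification",
    }
    return kinds.get(order, "unknown")


print(operation_kind(["input", "output"]))
print(operation_kind(["documentation", "output"]))
```

Other children such as documentation or fault are ignored, since they do not affect the classification.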
Message is a collection of parts
- The message is not a real thing by itself, but just a bag of things.
- Each part represents one "thing" to be sent or received.
- Don't think of RPC parameters - a medical record being sent to a doctor may consist of some XML document (contained in the SOAP envelope) and lots of other stuff like XRay images etc. as attachments. Each of these would be modeled as a part.
- In other scenarios it may be multiple XML documents or elements from different namespaces (purchase order + vcard).
Each message part uses XML schema to define its type.
Defining a message part
- XML schema lets you define one of these:
- Named, complex type.
- An element, where the element is typed by pointing to a named complex type or by defining an anonymous type just for that.
- XML schema also provides a set of built-in types.
- WSDL 1.1 lets you point to named types (type=) or named elements (element=) to declare a message part.
A binding corresponds to a PortType implemented using a particular protocol such as SOAP or CORBA. The type attribute of the binding must correspond to the name of a PortType that was defined earlier in the WSDL document. If the service supports more than one protocol, the WSDL includes a binding for each.
The input/output parameter trees of an operation are created based on the operation's style (like Document or RPC) that follows the SOAP binding rules. These trees are a kind of XML schema trees and look like the XML SOAP messages of the consuming web service.
The WSDL(1.1) SOAP binding rules are:
<soap:body> is what defines how the body is built.
- If a part is described using type=, then one needs a way to generate an XML element out of it.
- Basically, some mechanism to do the equivalent of an XML schema global element declaration is needed.
- In SOAP RPC case, SOAP RPC rules tell you how to make those into elements.
- Name based on part name.
- Content based on encoded or not. In other cases, no mechanism is given.
- If a part is described using element=, then the XML element is given.
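The rules above can be condensed into a small decision function. This Python sketch follows only the rules as summarized in this section and ignores the encoded/literal distinction; the dictionary layout for a part and the function name are hypothetical.

```python
def body_element_for_part(part, style):
    """Return the name of the element that represents a message part in soap:body.

    part is a dict with a "name" key and either an "element" or a "type" key.
    Returns None when the rules give no mechanism to build an element.
    """
    if "element" in part:
        # element=: the XML element is given directly.
        return part["element"]
    if style == "rpc":
        # type= under SOAP RPC: the element is named after the part.
        return part["name"]
    # type= in other cases: no mechanism is given.
    return None


print(body_element_for_part({"name": "order", "element": "po:PurchaseOrder"}, "document"))
print(body_element_for_part({"name": "firstName", "type": "xsd:string"}, "rpc"))
print(body_element_for_part({"name": "firstName", "type": "xsd:string"}, "document"))
```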
Modeling XML schemas on a tree
Each element or attribute of the schema is translated into a tree node. This implementation supports all XML schema elements, as well as the extension elements (arrays) that come from the SOAP definition. Parsing an XML schema is a complex recursive operation that visits every particle of the schema.
Modeling an XML schema as a tree requires an exhaustive traversal because of the complexity of the XML schema structure. During this task, all schema constructs are handled so that the schema is modeled as a full tree, for example a complexType that derives from another complexType by extension or restriction. The following definitions are also taken into account:
- Reference types definition:
A reference definition is a mechanism that keeps a schema simple by sharing common segments or types. When transforming this structure into a tree, I chose to duplicate the shared segment under each node that refers to it. Therefore, you don't need to care about reference types when reading the tree.
- Preventing infinite recursive definition:
This happens when a leaf element refers to one of its ancestors (e.g. a class that has a member which is an instance of itself). Such a definition also breaks the tree structure, and it must be handled differently from reference types, otherwise the parser falls into an infinite recursive loop. In this case, the tool shows the node that refers to its ancestor only down to a predefined depth.
- Handling the namespace prefix:
Each node will be associated with a prefix if the schema definition specifies that the form of this node (element/attribute) is qualified.
- Displaying multiple occurring data:
Elements that appear more than once in an XML document are displayed in duplicate on the tree.
The XML schema tree looks very much like an XML instance document of that XML schema.
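The two tree-building decisions described above, duplicating shared types and cutting off recursive definitions, can be illustrated on a toy model. In this Python sketch a schema is reduced to a dictionary that maps a type name to its (child name, child type) pairs; that model, the Node class, build_tree and the max_repeat parameter are all assumptions made for illustration, not the tool's data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    type_name: str
    children: list = field(default_factory=list)


def build_tree(name, type_name, types, ancestors=(), max_repeat=1):
    """Expand a named type into a tree, copying shared types under each user."""
    node = Node(name, type_name)
    if type_name not in types:
        return node  # built-in type: leaf node
    if ancestors.count(type_name) >= max_repeat:
        return node  # recursive definition: show the node but stop expanding
    for child_name, child_type in types[type_name]:
        node.children.append(
            build_tree(child_name, child_type, types,
                       ancestors + (type_name,), max_repeat))
    return node


def render(node, depth=0):
    """Return the tree as a list of indented text lines."""
    lines = ["  " * depth + f"{node.name} ({node.type_name})"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


TYPES = {
    "Person": [("name", "xsd:string"), ("home", "Address"),
               ("work", "Address"), ("manager", "Person")],
    "Address": [("city", "xsd:string")],
}

tree = build_tree("person", "Person", TYPES)
print("\n".join(render(tree)))
```

Here the Address type is expanded under both home and work, while manager, which refers back to Person, is shown without children. Raising max_repeat lets a recursive type be expanded further before the cut-off.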
How to use this tool
This tool is built on .NET 2.0, and references the Microsoft UDDI SDK package.
The UDDI explorer
This form features advanced searching for web services via UDDI servers. It may help your application find a potential partner. I started by modifying the original sample class "Discovery web service via UDDI" from MSDN.
Building on that simple sample, I added new features such as extended search parameters, a search thread, and parsing and viewing of WSDL and XML schema. Some well-known UDDI servers, such as those of Microsoft and IBM, are already in the UDDI server URL repository.
In my opinion, these parameters are the most effective way to narrow the search. The search parameters now include:
- Web service name
This optional collection of string values represents one or more partial service names qualified with xml:lang attributes. Any BusinessService data contained in the specified BusinessEntity with a matching partial name value is returned.
- The wildcard character % may be used to stand for any number of characters. A keyword may contain at most five % characters, e.g. "Hello%Kofax%". Note: this wildcard is supported by the UDDI API by default.
- The "OR" operand may be used to perform a logical OR search, e.g. "Hello OR Weather". A returned result needs to match only one of "Hello" or "Weather". This is equivalent to the "Or all key words" function.
- Business provider name
The names of business providers that have been registered in the selected UDDI server. If more than one business provider matches the name, the search returns all services matching the service name that belong to any of them (logical OR). If the textbox is empty (null), the search covers all providers. Example: search with the business name "simpleTron", at this URL, with a blank service name and a blank TModel.
- Compliance TModels name
The names of the TModels that have been registered by the services in the selected UDDI server. The search looks for all business structures that contain BindingTemplate structures with fingerprint information that matches the TModel name. If more than one TModel matches the name, only the BusinessService structures that contain BindingTemplate structures with fingerprint information matching all of the specified TModel keys are kept (logical AND only).
- Case sensitive and match exact name
These are used to sort the results and to control the keyword matching: case sensitive/insensitive, use of wildcards and exact match.
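To illustrate how the keyword options combine, the Python sketch below matches service names against a query locally. It is a client-side illustration only, not the UDDI server's algorithm or the tool's code: it assumes that % stands for any run of characters, that a non-exact match is a prefix match, that " OR " separates alternative keywords, and it enforces the five-wildcard limit mentioned above. The function names are hypothetical.

```python
import re


def keyword_to_regex(keyword, case_sensitive=False, exact=False):
    """Translate a keyword with % wildcards into a compiled regular expression."""
    if keyword.count("%") > 5:
        raise ValueError("a keyword may contain at most five % wildcards")
    body = ".*".join(re.escape(piece) for piece in keyword.split("%"))
    if not exact:
        body += ".*"  # assumption: non-exact search behaves as a prefix match
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile("^" + body + "$", flags)


def name_matches(name, query, case_sensitive=False, exact=False):
    """True if the name matches at least one of the OR-separated keywords."""
    keywords = [k.strip() for k in query.split(" OR ") if k.strip()]
    return any(keyword_to_regex(k, case_sensitive, exact).match(name)
               for k in keywords)


for candidate in ["HelloKofaxService", "WeatherService", "StockQuote"]:
    print(candidate,
          name_matches(candidate, "Hello%Kofax%"),
          name_matches(candidate, "Hello OR Weather"))
```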
Stop the search
Sometimes a search may take a long time on the server for whatever reason; in that case you can stop the search thread by clicking the stop button.
The search result
The search result lists the web services that match the criteria. Each result is a web service. A web service is translated into a node that has two children:
- The first one is named "Registry information". It contains information about the service that is retrieved from the UDDI registry (such as the URL, author contact and binding info).
- The other node is named "WSDL Description" and contains the WSDL tree. Double-clicking this node retrieves and parses the WSDL of the current service.
Post-filtering the result
After getting the results from the UDDI server, you can enter a new query (in the web service name text box) and click the "Filter" button to sort the list. The ranking module then takes the query and sorts the results in descending order of their estimated similarity to it. The most relevant results are expected to appear at the top of the list.
Viewing WSDL with tree
The WSDL view displays the selected WSDL on a tree. The view displays service properties such as the name, location (URL), documentation (annotation), the port types, and the collection of operations offered by the service together with their parameters. Depending on the complexity of the WSDL and the server load, the tree view may take a few seconds to render. Move your cursor over any operation element to view the input/output parameters associated with it.
I hope to have more opportunities to study this topic in depth. As a next step, I would like to use the WordNet dictionary to enrich the semantics of the search. Building a web service corpus to store the search and computation results (TF-IDF weights, vector space) may also be helpful for future extensions. I would also like to study DAML-S, which is said to be fairly similar to WSDL but supports the specification of semantic information in RDF format. Migrating to the Mono framework would be an interesting task as well. The tool also needs careful testing of its performance and of its precision/recall.
This tool is free to use. Please do not remove or replace any of the source copyright notes and author lines.
This article is short because I have not explained the methods in detail, so if you have any comments or questions regarding this tool, please drop me a line. Thank you.
References and links
- A new approach to UDDI and WSDL
- "An algorithm for suffix stripping", M. Porter, 1980.
- Caching XML Web Services for Mobility
- Fast string matching using an n-gram algorithm.
- Introduction to mono - Your first mono application.
- Microsoft Universal Description, Discovery, and Integration (UDDI).
- N-Gram based Text Categorization
- Programming Microsoft .NET XML Web Services, 2003, Foggon, D., Maharry, D., Ullman, C., Watson, K.
- Stemming and its effects on TF-IDF Ranking
- The UDDI's specifications
- The Mono framework community.
- Term-weighting approaches in automatic text retrieval
- The Bazaar free icons
- Understanding WSDL in a UDDI registry
- Understanding the WSDL 1.1's design, S. Weerawarana, IBM Research, 2002.
- XML Schema Tutorial
- XML Namespaces by Example
- XSD Editor tool
- W3C, OWL Web ontology language overview.
- W3C WSDL Activities
- Wiki Stop word list resources.
- Web Services Description Language (WSDL) Explained
History
- 11/19/2005: Migrated to VS.NET 2005; fixed bugs caused by UI modification on different threads.
- 11/25/2005: Fixed some minor bugs.
- 12/04/2005: Integrated the n-grams based TF-IDF similarity post-filtering.
- 12/10/2005: Added more about query post-filtering to the paper.