Applications are getting more advanced. Instead of simply inputting and outputting data, they are leveraging complex systems to enrich that content for more advanced decision-making. At Fixate, we are building such an application. My organization produces a lot of content, and one thing we pride ourselves on is that it is timely and relevant. No fluff. This is not an easy standard to maintain, and today, it’s 100% manual. This is where we can put our content to work: by analyzing all the content we have created, along with similar content on the web, we can curate future topics and produce more in-depth, higher-quality articles. We are considering using Watson Content Hub (WCH) to do all this.
Is it really a CMS?
One of WCH’s key value propositions is that it’s a content management system (CMS). But is it really? When I hear that term, I think of something very static: a place where you upload content, store it for a while, use it from time to time, and delete it. If you are using Watson Content Hub in this way, you are missing out.
Watson Content Hub, to me, is a content intelligence platform. Yes, you can store content, and storing it there has some value. But the real value is the analysis of the content once it is "published" in the hub. This is where our primary use case is. And for the developer, because there is an API, the intelligence of the platform is extremely valuable.
We are going to fit WCH into our content pipeline. As we deploy completed assets, we will programmatically upload them to the hub. They will be analyzed, and the analysis results will be aggregated and presented to content authors for future content creation.
I’m not sure our use case is a typical one. Most organizations will leverage a library of content to author other content in the system or pages. You can do this via the API as well, and then integrate the pages easily into your application. This is a possibility for us in the future, but right now, the analysis is the key aspect.
Using the Content Hub
Our application is written in PHP, and we are using the Unirest library to post requests and parse the JSON we get back from the Content Hub. The API for WCH is REST-based, and all content is delivered as JSON. In the future, we are considering migrating from a hosted PHP application to serverless code running on Azure Functions.
Within a day, we were able to test WCH, integrate it into our app, and get a really good sense of how we could use it. To get started, you first create a 30-day trial. The UX has a very simple design, and there is a really nice tutorial that explains the core components: assets, content, and websites. For the purposes of our application, the most important objects we will work with are assets, because the content has already been created and will be uploaded in PDF form.
In the user interface, you will want to spend some time on your setup. For us, it was very important to create a taxonomy in advance. While you can create a taxonomy programmatically with the Categories functions, it was important for us to be consistent with our taxonomies and leave them untouched in our application. For our application, the category/term will be determined by the name of the directory in our Drive account where the article archive is saved. While the categories are not immediately beneficial, over time they will be key to maintaining content in the CMS, and to some new functionality we plan to implement.
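As a rough illustration of deriving the category from the directory name, here is a minimal sketch. It is written in Python for brevity (our application itself is PHP), and the taxonomy terms, the path layout, and the function name `category_for` are all hypothetical, not taken from our code or from the WCH API.

```python
from pathlib import PurePosixPath

# Hypothetical taxonomy terms, assumed to have been created ahead of time
# in the WCH user interface. The application only reads them.
KNOWN_CATEGORIES = {"DevOps", "Security", "Cloud"}


def category_for(archive_path: str) -> str:
    """Return the category implied by the article's parent directory name."""
    category = PurePosixPath(archive_path).parent.name
    if category not in KNOWN_CATEGORIES:
        # Refuse to invent new terms so the predefined taxonomy stays untouched.
        raise ValueError(f"{category!r} is not a term in the predefined taxonomy")
    return category
```

Rejecting unknown directory names, rather than creating categories on the fly, is one way to keep the taxonomy consistent with what was set up in the interface.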
We also created a content type specific for our content. The only difference between ours and the "Sample Article" type was that we excluded any images.
You also need to set up your users and roles. We have only one user, which is the application itself, and following a least-privilege approach, its role is set to Editor. Now, we can start using the API to integrate our application into the platform.
There are two endpoints for the API—authoring, and delivery. For our purposes, everything we needed was in authoring.
Integration with WCH was not difficult at all. The hardest part is authentication. Today, they do not use API tokens, but I understand from the product team this is coming soon. We are using the basicauth method, which means that you have to generate a cookie on your host with an access token. (As our development environment grows and we deploy our application across environments, this will be difficult. And if we finally make the move to serverless code, it might be a blocker.) But once we had access, the process was simple: upload content, wait for analysis, then retrieve tags.
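To make the cookie-based flow concrete, here is a minimal sketch using only the Python standard library (our application itself is PHP with Unirest). The login URL is a placeholder and the function names are hypothetical. The only parts I am sure of are generic HTTP: a Basic `Authorization` header is the base64 of `username:password`, and a cookie jar holds whatever cookie the server sets.

```python
import base64
import http.cookiejar
import urllib.request

# Placeholder only: the real login URL depends on your WCH tenant.
LOGIN_URL = "https://example.invalid/login"


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_session(username: str, password: str):
    """Return (opener, login_request, cookie_jar) without sending anything."""
    jar = http.cookiejar.CookieJar()
    # The opener stores cookies from responses and replays them on later calls.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    request = urllib.request.Request(
        LOGIN_URL,
        headers={"Authorization": basic_auth_header(username, password)},
    )
    return opener, request, jar
```

Calling `opener.open(request)` would perform the login, and later requests made through the same opener would carry the cookie. That cookie lives with the process that logged in, which is why sharing it across environments or short-lived serverless invocations is awkward.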
The request URL for upload/create is:
The analyze, notify, and auto-curate parameters all needed to be set to true for us. The reason is that the tags are the most important bit of information we need. Analyze tells the platform to extract tags from the body of the asset, and notify tells the application when it is done. This is critical, because as soon as analysis completes, the tags are extracted and stored in a separate database. And auto-curate automatically accepts all tags. You cannot use auto-curate without having analyze set to true.
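The dependency between these flags is easy to get wrong, so here is a small sketch of building the query string with a guard for it. It is Python rather than our PHP, the function name is hypothetical, and the exact spelling of the query keys is an assumption based on the parameter names used in this article, so check the API documentation before relying on them.

```python
from urllib.parse import urlencode


def upload_query(analyze: bool = True, notify: bool = True,
                 auto_curate: bool = True) -> str:
    """Build the query string for an upload request.

    The key names below mirror the names used in this article and are an
    assumption, not verified against the API reference.
    """
    if auto_curate and not analyze:
        # Auto-curate accepts tags that analysis produces, so it needs analyze.
        raise ValueError("auto-curate requires analyze to be true")
    flags = {"analyze": analyze, "notify": notify, "auto-curate": auto_curate}
    return urlencode({name: str(value).lower() for name, value in flags.items()})
```

With the defaults, `upload_query()` produces `analyze=true&notify=true&auto-curate=true`, which would be appended to the upload URL.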
The field values returned are what give us the ID of the asset added so we can reference it when we retrieve the categories. Retrieval is simple. It’s a little strange that retrieval is still under the authoring API, but the URL is as follows:
All we care about are the categories and the date range of the assets we pull, because we only look back at the previous two months of results. Here is an example of the result from an uploaded asset (omitting a lot of metadata for the sake of space).
"name": "Continious Deployment for Docker Apps to Kubernetes.pdf",
"fileName": "Continious Deployment for Docker Apps to Kubernetes.pdf",
"name": "Kubernetes Cluster",
"name": "Cloud services",
"Google Cloud Platform",
"Google Container Registry",
"Codeship Docker Platform",
"Docker image push",
"set Docker image",
"Codeship Docker workflow",
"functioning Kubernetes Deployment",
"previously defined Deployment",
"automatic Kubernetes Deployment",
"Kubernetes Deployment update",
"title Continuous\u00c2 Deployment",
"powerful automated deployment",
"\u00e2\u20ac\u2039Codeship Docker Platform\u00e2\u20ac\u2039",
"Google Cloud service",
"Google Cloud Registry",
"Google Cloud environment",
"encrypted environment file",
"Google Cloud services",
"Deployment update command",
"Google Cloud Key",
"Google Project ID",
"Google Authentication Email",
"registry push step",
"\u00e2\u20ac\u2039built-in push steps\u00e2\u20ac\u2039",
"complex container architecture",
"previously defined gcr_dockercfg",
"actual Kubernetes interactions",
"previously defined google_cloud_deployment",
"Container Registry Pushing",
"relatively new way",
"Container Registry URL",
"smaller development project"
Very cool! One of the most interesting things you will see is that the system tries to generate "concepts." In the above document, it picked the concept of "cloud computing," which is a decent high-level assignment. After testing on 50 documents, we found that the concepts tended to be broad. The other interesting thing about concepts is that when they were correct, they were very correct, and when they were wrong, they were wrong big. There was no middle ground.
In the future, the concepts are the killer feature that will feed directly, without any modification, to our content contributors. Tags and keywords are essentially the same thing, but tags are more static, higher-confidence entities. There are generally fewer tags than keywords, because keywords are any entities extracted from the document, and thus all over the place. We utilize both, but the weight we put on keywords is far less than the weight we put on tags.
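Here is a minimal sketch of how tags and keywords could be weighted differently when aggregating results, combined with the two-month look-back mentioned earlier. It is Python rather than our PHP, and the weights, the 60-day window, and the asset field names (`created`, `tags`, `keywords`) are illustrative assumptions, not our production values or the WCH response schema.

```python
from collections import Counter
from datetime import date, timedelta

# Illustrative weights: only the ordering (tags above keywords) comes from the text.
TAG_WEIGHT = 1.0
KEYWORD_WEIGHT = 0.2
LOOKBACK = timedelta(days=60)  # rough stand-in for "the previous two months"


def score_terms(assets, today: date) -> Counter:
    """Sum weighted tag and keyword occurrences over recent assets."""
    scores = Counter()
    for asset in assets:
        if today - asset["created"] > LOOKBACK:
            continue  # older than the look-back window, ignore
        for tag in asset.get("tags", []):
            scores[tag] += TAG_WEIGHT
        for keyword in asset.get("keywords", []):
            scores[keyword] += KEYWORD_WEIGHT
    return scores
```

Calling `score_terms(assets, date.today()).most_common(10)` would then give a short ranked list to show content authors.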
And for tags, one of the other great features is the ability to tell the entity type. The accuracy was not great, but my hope is that it learns as documents are added. We could also build in a manual acceptance process: without the auto-curate parameter, the keywords and tags would come across as suggested rather than accepted. But for now, precision is not the most important thing. With the volume of content we are able to feed the system, the level of accuracy is sufficient.
When I created my trial, the websites functionality had to be enabled, and I did not end up testing it. However, if we decide to make Watson Content Hub the final destination of our content, I could see sites being useful for quick viewing of articles.
During my use of the platform, there were a few times where the content could not be retrieved, but it was intermittent, so performance overall was good. Here’s the functionality I’m wishing for:
- Auto-populate categories based on tags: Where there is a match between a tag and a category/taxonomy term, it would be great to auto-populate the category. This would make it even more useful as a CMS for us. Yet, false positives could be risky.
- Disambiguation: I wish there was some disambiguation based on categories, or a custom dictionary. A lot of tags and keywords repeat in slight variations across a document.
- Clarify terminology: For example, the difference between "taxonomies" and "categories" is not 100% clear. It seems they are the same, but via the API, you do not have access to "taxonomies."
- Better authentication: I would like them to start using API keys like they do across other products to make authentication easier.
- Better documentation search: The documentation in the API Explorer is solid, especially if you are learning the product in a linear way. I generally do not learn APIs in that way, so search is key. And for the documentation, if you do a search, you get results across various products, not just WCH, which is time-consuming and not always accurate.
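Until something like the disambiguation item exists in the platform, near-duplicate terms can be folded together on the client side. The sketch below is Python rather than our PHP and is a crude workaround under stated assumptions, not a WCH feature: it assumes sequences like `\u00e2\u20ac\u2039` in the sample result are UTF-8 bytes mis-decoded as Windows-1252, and it folds plurals by simply dropping a trailing "s". All function names are hypothetical.

```python
from collections import defaultdict

# Zero-width characters to delete after repair (maps code point -> None).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def repair(term: str) -> str:
    """Undo cp1252-decoded UTF-8 where possible, then tidy whitespace."""
    try:
        term = term.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass  # not this kind of mojibake, keep the text as it is
    return " ".join(term.translate(ZERO_WIDTH).split())


def key_for(term: str) -> str:
    """Case-insensitive grouping key with a naive plural fold on the last word."""
    words = repair(term).lower().split()
    if words:
        last = words[-1]
        if len(last) > 3 and last.endswith("s") and not last.endswith("ss"):
            words[-1] = last[:-1]
    return " ".join(words)


def group_variants(terms):
    """Group cleaned terms that share the same key."""
    groups = defaultdict(list)
    for term in terms:
        groups[key_for(term)].append(repair(term))
    return dict(groups)
```

Applied to the sample result, this would put "Google Cloud service" and "Google Cloud services" in one group, and the two "Codeship Docker Platform" variants in another. A custom dictionary on the platform side would still be a better fix than this kind of string folding.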
WCH suffers a little because it is not aimed directly at application developers. The product still seems focused on users of the web interface rather than the API, and because use over the API is our primary use case, I get the sense that our kind of usage is not a priority. The UI/UX, however, is simple to use and very useful for quick tests. None of this introduced technical issues that I found. It is just not clear how developers should leverage the product and what the key use cases are. Because we treat it 100% as a black box, ours might not be a great use case.
The term "CMS" does not do WCH justice. Yes, you can store content there, but its power is how the analysis elevates the content. I come from an NLP and content management platform background, and it was clear to me that WCH’s real value over other content management systems was the built-in intelligence. Because of this, and the following additional reasons, we are considering integrating the platform fully into our application:
- Integration is simple.
- The prospect of greater integration with Watson intelligence solutions.
- While using the solution, I was blown away by the performance of analysis on the documents. An approximately 850-word document took between 1 and 2 seconds to analyze.
WCH will save us from integrating an open source NLP library manually (where we would have to become domain experts), which reduces both the risk and the time involved in adding intelligence to our application. In less than a day, we had a prototype integrated into our code base that added tremendous value to our content delivery pipeline.
For more info on WCH, IBM has a dedicated developer page on it here.