Thursday, 12 July 2007

Custom Search Engine Builder

I have been using my Google account for quite a while and only recently came across COOP and Custom Search Engines. The whole CSE thing seems a great idea, as I can see myself wanting to customize the Google search engine. Funny thing is, the whole CSE definition process employs tagging techniques. This made me think that it would be great to have a tool that lets me use my own tagged data stored on Delicious to drive the definition of a Google Custom Search Engine. I thought of coding a prototype to do that, and here are the results of my work.

General Info
Delicious lets users accumulate bookmarks of their favourite internet sites and categorize them by tagging them with custom labels.

Google Custom Search Engine (CSE) provides utilities that let users predefine their own search domains, leading to more accurate and relevant searches. This is done by specifying a list of sites that should be included or excluded when performing searches with the standard Google search technology.

The idea
Since a CSE definition involves providing internet site details, the idea is to extract the information collected on Delicious, transform it and feed it into the CSE.
As you can see, what is being proposed here is a simple tool for integrating public data between the two service providers. Let’s call it CSEB for CSE Builder.

How does the CSEB work and how do I use it?
When using the app, here are the steps to follow:
  1. Provide your Delicious user id
  2. Configure which tags to include or exclude from CSE
  3. Provide your CSE inclusion and exclusion labels
  4. Generate annotations
  5. Provide your CSE details
  6. Generate the context
  7. Upload your annotations and context into the CSE

In case you want to see the application in action, check the Winked demo detailing the steps and the outcome of each action.

Here is the Flash movie:

or, if you are reading this in an RSS reader, try this link.

Design aspects
APIs
The Delicious API allows the execution of various kinds of RESTful queries. This way it is possible to retrieve information about tags and bookmarks/sites, which can be fed to the client in RSS or JSON format.
CSE provides an end-user-controlled way of uploading files containing site information (called annotations) and configuration information (called context).
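
To make the annotations step more concrete, here is a minimal sketch in plain Java of how tagged bookmarks could be turned into an annotations file. It is not the actual CSEB code: the class and method names are hypothetical, the label values are placeholders, and the Annotations/Annotation/Label element names reflect the CSE annotations format as I understand it, so check them against the CSE documentation before relying on them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: turns tagged bookmarks into CSE annotation XML. */
final class AnnotationSketch {

    /** A bookmark as it might arrive from Delicious: a URL plus its tags. */
    static final class Bookmark {
        final String url;
        final Set<String> tags;

        Bookmark(String url, String... tags) {
            this.url = url;
            this.tags = new HashSet<String>(Arrays.asList(tags));
        }
    }

    /** Escapes the characters that are unsafe inside an XML attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Returns true when the two tag sets have at least one tag in common. */
    static boolean sharesAny(Set<String> a, Set<String> b) {
        for (String tag : a) {
            if (b.contains(tag)) return true;
        }
        return false;
    }

    /**
     * Emits one annotation per bookmark that carries a selected tag.
     * Assumption: when a bookmark matches both sets, exclusion wins.
     */
    static String toAnnotations(List<Bookmark> bookmarks,
                                Set<String> includeTags, Set<String> excludeTags,
                                String includeLabel, String excludeLabel) {
        StringBuilder xml = new StringBuilder("<Annotations>\n");
        for (Bookmark b : bookmarks) {
            String label;
            if (sharesAny(b.tags, excludeTags)) {
                label = excludeLabel;
            } else if (sharesAny(b.tags, includeTags)) {
                label = includeLabel;
            } else {
                continue; // bookmark has none of the selected tags
            }
            xml.append("  <Annotation about=\"").append(escape(b.url)).append("\">\n")
               .append("    <Label name=\"").append(escape(label)).append("\"/>\n")
               .append("  </Annotation>\n");
        }
        return xml.append("</Annotations>\n").toString();
    }

    public static void main(String[] args) {
        List<Bookmark> bookmarks = new ArrayList<Bookmark>();
        bookmarks.add(new Bookmark("http://gwt.example/docs", "gwt", "java"));
        bookmarks.add(new Bookmark("http://spam.example/", "gwt", "spam"));
        System.out.print(toAnnotations(bookmarks,
                new HashSet<String>(Arrays.asList("gwt")),
                new HashSet<String>(Arrays.asList("spam")),
                "_cse_include", "_cse_exclude")); // placeholder labels
    }
}
```

The sketch passes bookmark URLs through unchanged; a real tool would probably turn them into site patterns before writing them into the about attribute.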

Application
Since I neither host nor own any internet server, I thought of embedding all the functionality within client-side JavaScript. This way the prototype can also be distributed easily, without the need for large downloads, installations etc. The downside is that such an approach requires nearly all processing to happen within the end user's browser.

Platform
I am not a JavaScript guru. After the Dr. Dobb's review I should probably have chosen YUI, but I decided to leverage my Java skills and employed GWT to compile my Java code into JavaScript, which can then be executed within the most common supported browsers.


Development Environment Setup
GWT integrates really well into Eclipse I must say. This speeds up the development process tremendously as you can debug your Java code, while driving the generated web UI in the hosted browser – perfect!

So, I have downloaded Europa (the largest open source project release to date). During the setup of the development environment I have defined two run configurations and made them persist into files for re-usability. See the Shared file edit box under Save as on the Common tab.

First one is hosted debug. In the run configuration dialog-box I have:
  • chosen java application configuration type
  • chosen com.google.gwt.dev.GWTShell for main class
  • selected arguments tab and within program arguments edit box typed
    -port 9999 -logLevel DEBUG -noserver ${project_loc}\www\org.cse.csebuilder\csebuilder.html
    where: 9999 is my port of choice, since I am running many services on my poor box and the standard ports get allocated pretty quickly
Second one is for unit testing. Again in the run configuration dialog-box I have:
  • chosen JUnit run configuration type
  • selected run all tests... option for JUnit 3
  • selected arguments tab and within VM arguments edit box typed:
    -Dgwt.args="-out www-test"
    where: www-test is my directory of choice for storing unit test output
As it is good practice to eradicate the warnings from your project, and since GWT does not yet support Java 5, I have set up the project compiler settings to use the Java 1.4 compliance level and switched to JUnit 3.8. But before linking the external dependencies I have defined two Java build path variables, for GWT and JUnit respectively.

GWT generates the Eclipse project without much hassle. The only thing is that it also generates batch files, which need to be executed for compilation, debugging and testing purposes. I do not know about you, but I like to use ANT or Maven to script such tasks. So to get the project going I have set up my ANT script accordingly.

Then I have downloaded and configured Subclipse, which is an Eclipse Subversion client. This allowed me to share the code using Google Project Hosting services.

Same Origin Policy – the reason why I started dealing with JSON
Since all processing is to happen within the client web browser, I needed to find a way to fetch and use remote data. The usual way of doing it is to use AJAX methods and issue asynchronous requests for content. The problem is that the data fed into the client is retrieved from a different site than the one serving the CSEB, so it is subject to the restrictions defined by the Same Origin Policy (SOP). Under this policy the only thing I could do was to use the JSON format delivered through script tags. With this technique one dynamically changes the DOM and injects script tags. This asks the browser to download the remote script contents, and this way new data becomes available to your application. Additional information about JSON in GWT can be found here.

I have written a custom utility class to handle this. It is responsible for injecting the script element and managing its lifetime. The class does so by using the ID attribute within the page DOM, which gets populated with the request URL and therefore allows replacing the corresponding items when multiple requests are being issued.


Data Processing delegated to Yahoo Pipes
Although Delicious provides an API for tag and bookmark retrieval, the JSON APIs are subject to restrictions. You simply cannot fetch more than 100 items at a time. Such restrictions are good for the server side and for scalability, but they limit what can be done in the CSEB application. Another thing is that, to make sure all relevant bookmarks are retrieved, the client would need to issue as many requests for bookmarks as there are selected tags. This way one would have to handle a chain of asynchronous communications on the client. What happens if one of them fails? Additionally, it is possible to encounter a bookmark which is part of a few different categories, so there is a need to handle duplicates. It tends to become resource hungry == really ugly…

To sort this out, I thought of using custom-built Yahoo Pipes, so that the processing is delegated to the Yahoo Pipes service. The solution consists of two linked pipes.
  1. The first one is responsible for fetching JSON for the Delicious bookmarks owned by a user and associated with the category / tag provided.
  2. The second pipe uses up to N fetching pipes. It aggregates their output, removes duplicated sites and returns a JSON feed to the CSEB.
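
What the second pipe does can be illustrated with a few lines of plain Java. This is only a sketch of the aggregate-and-deduplicate idea, not how Yahoo Pipes implements it; the class and method names are hypothetical, and it assumes that two bookmarks are duplicates exactly when their URLs are identical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the aggregation step performed by the second pipe. */
final class PipeMergeSketch {

    /**
     * Merges the per-tag result lists into one list, dropping URLs that were
     * already seen. Assumption: duplicates are URLs that match exactly.
     */
    static List<String> mergeUnique(List<List<String>> perTagResults) {
        // LinkedHashSet keeps first-seen order while rejecting repeats.
        Set<String> seen = new LinkedHashSet<String>();
        for (List<String> result : perTagResults) {
            seen.addAll(result);
        }
        return new ArrayList<String>(seen);
    }

    public static void main(String[] args) {
        List<String> taggedJava = Arrays.asList("http://a.example/", "http://b.example/");
        List<String> taggedGwt = Arrays.asList("http://b.example/", "http://c.example/");
        // b.example carries both tags but appears only once in the output.
        System.out.println(mergeUnique(Arrays.asList(taggedJava, taggedGwt)));
    }
}
```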

The downside of this approach is that the more services are introduced, the more likely the application is to suffer from connectivity and service downtime issues.

Mind you, to achieve similar results you could now employ the worker pool from the recently released Google Gears. This would require a Gears download and some refactoring, though.

Saving mashed information into a local file.
Once all the data is fetched and transformed into XML, the end user is asked to create a file, so that it can be uploaded into the CSE itself in a dedicated separate step. This could be achieved in two ways: firstly by using the clipboard and a file editor (which is a two-step operation), or secondly by persisting the content into a local file straight from the browser (ideal). But such an operation is subject to the SOP again, as allowing local persistence would open the way for all types of malicious cross-site scripting attacks.

Trying to get my local persistence working, I have coded a widget displaying a generic, dialog-like page that lets the end user persist XML content using the browser's Save Page As command.

The whole idea is based on the fact that the browser will persist the text exposed by the currently visible DOM objects. So provided that your host page does not include textual fields, this widget will be able to control all the displayable elements. The widget contains several HTML widgets, styled not to be displayed, whose contents blend together to form a valid XML document when the page gets saved. The whole trick is to insert an XML comment just before the closing tag of the root element. This way the textual content of the page ends up inside the XML comment and the whole page can be persisted as valid XML (that is, if you do not mind some rubbish in comments ;). Here is an example of the generated XML:

<?xml version="1.0" encoding="UTF-8" ?>
<YourRootElem>
<YourElem... />

<!--
contents of the dialog goes here
-->
</YourRootElem>
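
One detail worth keeping in mind with this trick is that XML does not allow two consecutive hyphens inside a comment, so the visible page text has to be made comment-safe. Below is a minimal plain Java sketch of assembling such a document; the class and method names are hypothetical and it is not the widget code itself.

```java
/** Hypothetical sketch: wraps XML content and stray page text into one document. */
final class SaveAsXmlSketch {

    /**
     * Makes text safe to place inside an XML comment by separating every
     * run of hyphens with spaces, since "--" is illegal in a comment.
     */
    static String commentSafe(String text) {
        // Put a space after each hyphen that is directly followed by another one.
        return text.replaceAll("-(?=-)", "- ");
    }

    /** Builds the document: real content first, visible page text in a comment. */
    static String wrap(String rootElem, String xmlBody, String visiblePageText) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
                + "<" + rootElem + ">\n"
                + xmlBody + "\n"
                + "<!--\n"
                + commentSafe(visiblePageText) + "\n"
                + "-->\n"
                + "</" + rootElem + ">\n";
    }

    public static void main(String[] args) {
        System.out.print(wrap("YourRootElem", "<YourElem/>",
                "Save this page -- use Save Page As"));
    }
}
```

The newline before the closing comment marker also keeps a trailing hyphen in the page text from running into it.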


GWT History
To achieve local file persistence I needed to try out the GWT History subsystem. It allows maintaining the state of the application when switching between the populated main page and the Save as widget.
During unit testing I have found that, although you can call History methods to programmatically invoke the back and forward commands of the browser, this happens without raising the corresponding onHistoryChange event. I raised this as an issue on the GWT forum.

If you think you do not have to know HTML and JavaScript you might get disappointed
When coding CSEB, I have realized that it would be very difficult to develop an application without basic knowledge of HTML, CSS or JavaScript primitives. GWT is a stable but still very much evolving piece of software, following the evolution of browser technology.
This prototype required me to write a few custom widgets and implement a few lines of JSNI, just to get access to the functionality needed to provide the desired UI metaphors.

Unfortunately, due to limitations, I could not test all of the JSNI code against all supported browsers but only against two: Firefox and Internet Explorer.

When something did not work exactly according to plan, I used Firefox and its great Firebug add-on to debug the generated code and styles. So many thanks go to its author, Joe Hewitt.


Polishing the UI
Being of Polish origin, I wanted to have the CSEB available in my native language. To achieve this, I have localized the app, written all the necessary I18N code and developed a custom widget allowing the user to switch locale during runtime. I have written all relevant property files and things worked just fine.

GWT expects all locale-specific data to be nicely separated in Java. Then it gets compiled into optionally obfuscated JavaScript. This works perfectly, but such an approach does not allow language contributions without recompilation, which IMHO is a little inconvenient. I think it would be beneficial to be able to ask for language contributions which could be integrated with greater ease, for example by loading locale-specific data as XML from the same origin. Another reason is that the supported languages add to the complexity of the compilation process, as the GWT compiler generates files to cater for each language-browser combination of JavaScript. So removing the language from the combination would simplify the compilation process...


Testing, testing and once more testing
One of the things you have to love about GWT is the fact that you can use JUnit to test the functionality. It is even possible to emulate the execution of asynchronous code. I keep asking myself, though, how accurate it is to test Java code when what actually runs is the JavaScript generated by the GWT compiler. But the similarity between the allowed subset of Java and JavaScript should make one quite comfortable…

The availability of unit testing most definitely contributes to the better quality of the solutions, as there is now no excuse to leave your Java code in an untestable and highly coupled state. Now all the good coding practices can make their way into JavaScript territory, provided the GWT compiler makes the most of them:
  1. Introduction of proper layering and class design to facilitate
    • Single concern principle
    • Low Coupling
    • High Cohesion
  2. Removal of various code smells
  3. Documentation
In case you are new to testing, take a look at the Google Testing Blog for guidance. It is a really good read.


Architecture
The idea was to keep the subsystems small, loosely coupled and testable.
But following the rule that
A picture says a million words.

I have created a demonstration discussing the architecture on the canvas of UML diagrams. So check out:

  • The use case diagram for project functional requirements
  • The deployment diagram for deployment details
  • Sequence diagrams describing realization of the proposed use cases
  • Class diagrams for layering and internal infrastructure
  • Collaboration Diagrams for the algorithm descriptions


Here is the Flash movie:

or, if you are reading this in an RSS reader, try this link.

The model has been created in MagicDraw and uploaded here for your reference. You can easily view it using the free MagicDraw Reader.

Deployment
I have subscribed to Google Page Creator services and thought of deploying the application to my custom site. The unexpected issue I hit was that the file upload mechanism truncated the file names down to 41 characters max. This made me code another utility to shorten the compiled names. Let me know if you would like me to publish this as well.
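
To give an idea of what such shortening involves, here is a minimal plain Java sketch. It is not the utility mentioned above: the class and method names are hypothetical, and it assumes the limit is 41 characters including the extension and that cutting the tail of the name before the first dot is acceptable as long as no two names collide.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: plans renames so every file name fits the upload limit. */
final class NameShortenerSketch {

    static final int MAX_LENGTH = 41; // limit observed during upload

    /** Trims the part before the first dot so that the whole name fits. */
    static String shorten(String name) {
        if (name.length() <= MAX_LENGTH) return name;
        int dot = name.indexOf('.');
        String stem = dot < 0 ? name : name.substring(0, dot);
        String suffix = dot < 0 ? "" : name.substring(dot);
        int keep = MAX_LENGTH - suffix.length();
        if (keep < 1) {
            throw new IllegalArgumentException("suffix too long: " + name);
        }
        return stem.substring(0, keep) + suffix;
    }

    /** Maps each original name to its short form, failing on collisions. */
    static Map<String, String> plan(List<String> names) {
        Map<String, String> renames = new LinkedHashMap<String, String>();
        Set<String> used = new HashSet<String>();
        for (String name : names) {
            String shortName = shorten(name);
            if (!used.add(shortName)) {
                throw new IllegalStateException("collision on " + shortName);
            }
            renames.put(name, shortName);
        }
        return renames;
    }

    public static void main(String[] args) {
        System.out.println(plan(Arrays.asList(
                "0123456789ABCDEF0123456789ABCDEF.cache.html", // made-up name
                "csebuilder.html")));
    }
}
```

A real tool would also have to rewrite any references to the old names inside the other compiled files, which this sketch leaves out.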

So finally, the CSEB application can be found here for a test run.

License and Attributions.
The project, including sources, documentation etc., has been uploaded to and is hosted by Google Code. Check the main project page for license information.

Icons are provided by FamFamFam.

After dealing with enterprise software it was really great fun to delve into the world of internet technologies. Let me know what you think of this all.
