Removing Duplicates With Filtering
Merging, joining or grouping results from several services can result in duplicate items in the combined result. You can remove duplicates with the <filter> statement and a filter expression that uses the axis feature in XPath to compare preceding or following items. See XPath Axes for basic information on this XPath feature.
To remove duplicates in a mashup, first merge, join or group the results. If needed, sort the combined results on the key field that determines uniqueness, so that duplicates are contiguous.
Then use <filter> with a filter expression that compares the key value of the current 'item' with the key value of either the preceding or following 'item' to determine whether the current 'item' is unique. See Unique Filter Example for an example of removing duplicates.
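As a rough sketch of the filtering step, the fragment below keeps an item only when no later item has the same key. It assumes the combined results are RSS items held in a variable named $result and that title is the key field; both are illustrative choices rather than requirements.

```xml
<!-- Sketch: assumes merged RSS results in $result, with title as the key field -->
<!-- Keep an item only if no following sibling item has the same title -->
<filter inputvariable="$result" outputvariable="$result"
  filterexpr="/rss/channel/item[not(following-sibling::item/title = title)]" />
```

Comparing against following items keeps the last occurrence of each duplicate, while comparing against preceding items keeps the first.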
XPath Axes
Axes in XPath are a syntax that allows XPath expressions to refer to other nodes based on their relationship to the node that is the current context. Take a simple <filter> statement, such as this:
<filter inputvariable="$a" outputvariable="$a" filterexpr="/rss/channel/item[contains(title,'Java')]" />
As the filter is processed, it checks each item node in turn, and that node becomes the current context. The title node in this example implicitly uses the default XPath axis, the child axis. Because there is no other axis identifier, the filter looks for title as a child of item.
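To make the implicit axis visible, the same filter can be written with the child axis spelled out. This is only an illustrative rewrite of the statement above; both forms select the same nodes.

```xml
<!-- Equivalent to contains(title,'Java'): child:: is the default axis -->
<filter inputvariable="$a" outputvariable="$a"
  filterexpr="/rss/channel/item[contains(child::title,'Java')]" />
```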
There are many other axes in XPath that allow you to refer to previous nodes, following nodes, the parent node, ancestor nodes, descendant nodes and many more. See the XPath 2.0 specification for a complete list of valid axes.
To use a different axis than the default child axis, you add a prefix to each node name in the form:
axis-name::node-name
For example, preceding::item identifies any item node that comes before the current node and is not an ancestor. The path expression ancestor::channel identifies any channel node that is the parent of the current node or an ancestor at any higher level of the document. You can also use wildcards, such as following::* or following::node(), to identify all following nodes of any name.
To filter out duplicates, you typically use one of these axes:
preceding: any nodes that come before the current context in document order and are not ancestors of the current context.
following: any nodes that come after the current context in document order and are not descendants of the current context.
preceding-sibling: any nodes that have the same parent as the current context and occur before the current context in document order.
following-sibling: any nodes that have the same parent as the current context and occur after the current context in document order.
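The small document below illustrates what each of these axes would select. The feed content is hypothetical sample data, and the second item is assumed to be the current context.

```xml
<!-- Hypothetical sample data, with the second item as the current context -->
<rss>
  <channel>
    <title>Sample Feed</title>
    <item><title>Alpha</title></item>
    <item><title>Beta</title></item>   <!-- current context -->
    <item><title>Alpha</title></item>
  </channel>
</rss>
<!--
  preceding::title         selects the titles "Sample Feed" and "Alpha" (first item)
  preceding-sibling::item  selects the first item only
  preceding-sibling::*     selects the channel title and the first item
  following::title         selects the title "Alpha" of the third item
  following-sibling::item  selects the third item only
-->
```

Note that preceding::title reaches the channel title as well as earlier item titles, because the preceding axis is not limited to nodes with the same parent.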
Unique Filter Example
This example merges the results from two RSS services and then checks the title of each item to remove duplicates:
<mashup xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.openemml.org/2009-04-15/EMMLSchema
    ../schema/EMMLSpec.xsd"
  xmlns="http://www.openemml.org/2009-04-15/EMMLSchema"
  xmlns:macro="http://www.openemml.org/2009-04-15/EMMLMacro"
  name="MergeFeeds">
  <output name="result" type="document"/>
  <!-- invoke two RSS feeds -->
  <directinvoke outputvariable="$feed1"
    endpoint="http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml" />
  <directinvoke outputvariable="$feed2"
    endpoint="http://www.nytimes.com/services/xml/rss/nyt/World.xml" />
  <!-- merge the results -->
  <merge inputvariables="$feed1, $feed2" outputvariable="$result"/>
  <!-- filter for unique items based on title -->
  <filter inputvariable="$result" outputvariable="$result"
    filterexpr="/rss/channel/item[not(preceding::title = ./title)]" />
</mashup>
The filtering expression uses:
The not() XPath function to negate the comparison. It selects only items that do not have any preceding items with matching titles.
The preceding axis to check all previous titles in the merged feeds against the title for the current item.
Because of the structure of RSS results, you could also use preceding-sibling::item/title, which compares only against the titles of earlier items. If you sort the results based on item/title, you could instead check just the closest item title with preceding-sibling::item[1]/title to rule out duplicates.
The . in ./title is the short syntax to identify the current context node. This selects the child title for the current context to compare it to all previous titles.
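The alternative expressions mentioned above might look like the following fragment. It reuses $result from the example, and only one of the two filters would be used in a given mashup. The second variant assumes the items have already been sorted by title, a step that is not shown here.

```xml
<!-- Variant 1: compare only against the titles of earlier sibling items -->
<filter inputvariable="$result" outputvariable="$result"
  filterexpr="/rss/channel/item[not(preceding-sibling::item/title = title)]" />

<!-- Variant 2: assumes items are already sorted by title, so checking
     the nearest preceding item is enough -->
<filter inputvariable="$result" outputvariable="$result"
  filterexpr="/rss/channel/item[not(preceding-sibling::item[1]/title = title)]" />
```

On a reverse axis such as preceding-sibling, the position [1] refers to the nearest node before the current context, which is why the second variant checks only the immediately preceding item.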
Enterprise Mashup Markup Language (EMML) Documentation is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.
