Similarity Search for Content Matching

The TextWise SemanticHacker API provides a match service call that analyzes the text or Web page provided in the call and returns a Semantic Signature and a match. Matching documents via their Semantic Signatures is similar in principle to matching them via the terms they contain. Instead of using terms and their frequencies, our matching uses the dimensions in the documents’ Semantic Signatures and their associated weights, and it determines the distance between those signatures.

Much of the math involved in the comparison is performed in advance, while the semantic dictionary and the Semantic Signatures are generated. Matching involves the following steps:

  • Finding the semantic dimensions that are shared between two documents.
  • Calculating a similarity factor (a weight) for each shared semantic dimension.
  • Calculating a matching score. The higher the score, the more similar the documents are in the semantic space.
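The three steps above can be sketched in a few lines of Python. This is only an illustration, not the TextWise implementation: the representation of a signature as a mapping from dimension ID to weight, the names shared_factors and match_score, and the use of the product of the two weights as the similarity factor are all assumptions made for the example.

```python
from typing import Dict

# Hypothetical representation: dimension ID -> weight in the signature.
Signature = Dict[int, float]


def shared_factors(sig_a: Signature, sig_b: Signature) -> Dict[int, float]:
    """Steps 1 and 2: find shared dimensions and weight each one.

    The similarity factor used here (product of the two weights) is an
    assumption; the actual formula is not given in the text.
    """
    shared = sig_a.keys() & sig_b.keys()
    return {dim: sig_a[dim] * sig_b[dim] for dim in shared}


def match_score(sig_a: Signature, sig_b: Signature) -> float:
    """Step 3: combine the per-dimension factors into one score."""
    return sum(shared_factors(sig_a, sig_b).values())


# Two toy signatures that share dimensions 2 and 7.
doc_a = {1: 0.5, 2: 0.25, 7: 0.25}
doc_b = {2: 0.5, 7: 0.5, 9: 1.0}
score = match_score(doc_a, doc_b)
```

With this choice of factor the score behaves like a dot product over sparse vectors: documents with no dimension in common score zero, and the score grows as the shared dimensions carry more weight in both signatures.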

There are two options for matching when using this service:

  1. Custom content matching - You can maintain an index of your own content on our cloud and request highly relevant matches for your Web or enterprise content against that index. Contact us for details.
  2. Self-service content matching - Developers can match their Web or other documents against our constantly updated indexes at no charge (subject to daily query limits). We currently have over 9.5 million items in our index - a complete list can be found in the Documentation.

Matching and the Match Query

When searching in an index for the documents that best match an incoming document, the following operations are performed:

  • If the incoming document is a Web page, it is filtered to remove HTML and boilerplate.
  • A Semantic Signature is generated for the incoming document.
  • The Semantic Signature is converted to a weighted term query: each dimension ID is used as a query term, and the dimension’s weight in the signature is used as a weighting factor in the query.
  • The query is run against the index. Two documents must share at least one semantic dimension in order to have a match score greater than zero.
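The conversion of a signature into a weighted term query, and the search that follows, might look roughly like the Python sketch below. The in-memory inverted index, the function names, and the scoring rule (query weight times document weight, summed per document) are hypothetical and stand in for whatever the service actually does.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical representation: dimension ID -> weight in the signature.
Signature = Dict[int, float]
Query = List[Tuple[int, float]]  # (dimension ID as query term, weight)


def signature_to_query(signature: Signature) -> Query:
    """Use each dimension ID as a term and its weight as the term weight."""
    return sorted(signature.items(), key=lambda item: item[1], reverse=True)


def build_index(docs: Dict[str, Signature]) -> Dict[int, List[Tuple[str, float]]]:
    """Toy inverted index: dimension ID -> list of (document ID, weight)."""
    index: Dict[int, List[Tuple[str, float]]] = defaultdict(list)
    for doc_id, signature in docs.items():
        for dim, weight in signature.items():
            index[dim].append((doc_id, weight))
    return index


def search(index: Dict[int, List[Tuple[str, float]]], query: Query) -> List[Tuple[str, float]]:
    """Score only documents that share at least one dimension with the query."""
    scores: Dict[str, float] = defaultdict(float)
    for dim, query_weight in query:
        for doc_id, doc_weight in index.get(dim, []):
            scores[doc_id] += query_weight * doc_weight  # assumed scoring rule
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = build_index({"a": {1: 0.5, 2: 0.5}, "b": {3: 1.0}})
query = signature_to_query({1: 1.0, 4: 0.5})
results = search(index, query)
```

Document "b" shares no dimension with the query, so it never enters the result list, which is the zero-score case described in the last item above.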

To achieve adequate performance, the TextWise matching system limits the number of semantic dimensions used for each document to the top 30 and uses a nonstandard, weight-ordered index to perform the search.
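As a rough illustration of these two ideas, the Python sketch below keeps the 30 highest-weighted dimensions of a signature and stores each posting list in descending weight order. Only the limit of 30 comes from the text; the data layout and the function names are assumptions, and the actual index format is not described here.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical representation: dimension ID -> weight in the signature.
Signature = Dict[int, float]

MAX_DIMENSIONS = 30  # the limit stated in the text


def truncate_signature(signature: Signature, k: int = MAX_DIMENSIONS) -> Signature:
    """Keep only the k dimensions with the largest weights."""
    top = heapq.nlargest(k, signature.items(), key=lambda item: item[1])
    return dict(top)


def weight_ordered_postings(docs: Dict[str, Signature]) -> Dict[int, List[Tuple[float, str]]]:
    """Assumed layout: dimension ID -> [(weight, document ID)], heaviest first."""
    postings: Dict[int, List[Tuple[float, str]]] = {}
    for doc_id, signature in docs.items():
        for dim, weight in truncate_signature(signature).items():
            postings.setdefault(dim, []).append((weight, doc_id))
    for entries in postings.values():
        entries.sort(reverse=True)  # highest weight first
    return postings


# A signature with 40 dimensions; dimension d has weight d / 100.
long_signature = {d: d / 100 for d in range(1, 41)}
kept = truncate_signature(long_signature)
postings = weight_ordered_postings({"x": {1: 0.2}, "y": {1: 0.9}})
```

One plausible benefit of ordering postings by weight is that a search can read the strongest entries for each dimension first and stop early, although the text does not say how the index is traversed.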

The following is an example of a match between two Web pages about the Hubble space telescope:

sample matching screen shot 1

sample matching screen shot 2

 


Semantic Signature is a registered trademark - © 2010 TextWise, LLC. All rights reserved.