Semantic Comparison

The Semantic Similarity Engine measures how close in meaning two pieces of content are, regardless of differences in syntax or grammar. It can even compare content written in different languages, and it can be used to cluster documents by the information they contain.
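To make the clustering use case concrete, here is a minimal sketch in Python. It is not part of the engine: the function name `cluster_by_similarity`, the `similarity` callable and the threshold of 0.5 are all assumptions made for illustration. The `similarity` argument stands for anything that returns a score for two texts, for instance a wrapper around the REST call shown in the Usage Example section.

```python
from itertools import combinations


def cluster_by_similarity(documents, similarity, threshold=0.5):
    """Group documents whose pairwise score reaches the threshold (single linkage).

    `similarity` is a hypothetical callable taking two texts and returning a score;
    the default threshold is an arbitrary value, not a recommendation.
    """
    parent = list(range(len(documents)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of every pair that is similar enough.
    for i, j in combinations(range(len(documents)), 2):
        if similarity(documents[i], documents[j]) >= threshold:
            parent[find(i)] = find(j)

    # Collect documents by their cluster representative.
    clusters = {}
    for index, document in enumerate(documents):
        clusters.setdefault(find(index), []).append(document)
    return list(clusters.values())
```

Note that this compares every pair of documents, so the number of similarity calls grows quadratically with the number of documents. For large collections a different strategy would likely be needed.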

The Semantic Similarity Engine comes in two flavors. The language-dependent flavor assesses the similarity between texts written in the same language. The language-agnostic flavor (the LASER Semantic Similarity Engine) can compare content written in different languages; it uses the open-source LASER (Language-Agnostic SEntence Representations) library to calculate multilingual sentence embeddings. The language-agnostic flavor works well on short texts of around 100-200 tokens. The language-dependent flavor does not have this limitation.

Hardware Requirements

Semantic Similarity Engine (language dependent)

The models trained to detect semantic similarity between texts in the same language are large and need a large amount of RAM and powerful CPUs to perform reasonably well. These engines do not require a GPU.

Minimum recommended requirements:

  • 4xCPU
  • RAM: 6GB
  • HDD: 10GB

LASER Semantic Similarity Engine

Because the model is very large, we recommend running this engine in an environment equipped with an NVIDIA graphics processing unit (GPU).

Minimum recommended requirements for the GPU:

  • Type: NVIDIA Quadro 
  • RAM: 4041MB

Other minimum recommended requirements:

  • 4xCPU
  • RAM: 4GB
  • HDD: 10GB

Supported Languages

For the language-dependent flavor, we currently support the following languages (9): Arabic, English, Farsi, German, Hungarian, Italian, Polish, Romanian, Russian.

The LASER Semantic Similarity Engine supports the following languages (93): Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.

Usage Example

This is an example of calling the Semantic Similarity engine for the English language using curl:

curl -X POST "http://localhost:8989/rest/process" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"content1\": \"Plans for human Moon exploration began during the Eisenhower administration. In a series of mid-1950s articles in Collier's magazine, Wernher von Braun had popularized the idea of a crewed expedition to establish a lunar base.\", \"content2\": \"Wernher von Braun popularized plans for research of the Moon by humans mid-1950, during the time when Eisenhower was leading the administration.\", \"language\": \"eng\"}"

The response to the above request looks like this:

{"similarity": 0.5795811414718628}
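The same request can be made from a program. The following is a minimal Python sketch using only the standard library. It assumes the engine is reachable at the URL from the curl example and that the response has the shape shown above. The function names (`build_request`, `parse_similarity`, `semantic_similarity`) and the 30-second timeout are hypothetical choices for this example, not part of the engine.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust host and port for your deployment.
ENDPOINT = "http://localhost:8989/rest/process"


def build_request(content1, content2, language="eng", endpoint=ENDPOINT):
    """Build a POST request with the same headers and JSON body as the curl call."""
    payload = {"content1": content1, "content2": content2, "language": language}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def parse_similarity(response_body):
    """Extract the score from a response shaped like {"similarity": 0.57...}."""
    return float(json.loads(response_body)["similarity"])


def semantic_similarity(content1, content2, language="eng", endpoint=ENDPOINT, timeout=30):
    """Send both texts to the engine and return the similarity score as a float."""
    request = build_request(content1, content2, language, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_similarity(response.read().decode("utf-8"))
```

With an engine running locally, calling `semantic_similarity(text_a, text_b)` should return a float like the one in the response above. Letting `json.dumps` build the body also avoids the manual quote escaping that the curl command needs.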
