Helping AI Uncover the Mysterious Veil of Chinese Characters

Alibaba Tech
Published in ML Review · May 31, 2018

This article is part of the Academic Alibaba series and is taken from the paper entitled “Learning Chinese Word Embeddings with Stroke n-gram Information” by Shaosheng Cao, Wei Lu, Jun Zhou, and Xiaolong Li, accepted by the 2018 Conference of the Association for the Advancement of Artificial Intelligence. The full paper can be read here.

The field of natural language processing (NLP) has recently seen an increasing amount of attention given to word representation learning. Finding a way for AI to analyze text and identify semantically related words holds huge potential for downstream applications — but it is especially complicated for vast and complex scripts like Chinese.

An Infamously Difficult Writing System

Chinese is an ancient language that fascinates people the world over, with millions studying it as a second or third language. Famously difficult to master, Chinese and its varieties, including Mandarin and Cantonese, use logographic scripts that differ vastly from alphabetic scripts like English. Whereas letters in an alphabetic script represent the language at the phonetic level, Chinese characters represent it at the semantic level: not at the word level, however, but at the level of the morpheme.

The Alibaba tech team, in collaboration with the Singapore University of Technology and Design, has proposed a model called stroke n-grams for capturing and codifying Chinese semantics. The “stroke” in stroke n-grams refers to the fact that the system draws on Chinese handwriting conventions to identify semantically relevant graphic elements within a word.

Unlike characters, radicals, and components, strokes are not semantic elements of the script. However, by tracking stroke combinations and stroke sequences that recur across words, stroke n-grams can identify shared semantic structures within those words.

The Search for Meaning in Sub-word Structures

To explain why stroke n-grams are more effective than other approaches, let’s consider the drawbacks of those other approaches first (analysis by character, radical, and component).

Characters

Chinese characters are a useful point of reference for tracing the history and development of the Chinese language and script — but they offer little utility in indicating which words are semantically related. Simply put, there are far more words in Chinese that share semantic information than those which share one or more characters.

Related words often share no common characters

For example, the Chinese words for ‘timber’ and ‘forest’ share semantic roots, but a character-level analysis gives no indication that this is the case. Considering only character-level information is therefore superficial and potentially misleading.

Meanwhile, it is instantly apparent to anyone familiar with the Chinese script that the words timber and forest are related, even if they do not know the words in question. This is because the characters in both words share the common graphic element, wood “木”.

“Wood” features prominently in the words “timber” and “forest”

Radicals

Radicals have stood the test of time in terms of providing a means of organizing Chinese characters in dictionaries, and in some cases they do provide useful semantic information — timber and forest being prime examples. However, there are many instances where radicals are wholly incapable of identifying semantic information in a word.

Radical analysis yields limited insights

For example, the radical in the character for wisdom “智” is sun “日”. Even after studying the historical justification for this radical, it is difficult to claim a credible semantic connection between sun and wisdom.

Components

Unfortunately, looking beyond radicals to other components, defined as fundamental graphic elements at the same level of complexity as radicals, is ultimately a wasted effort. While the timber-forest example suggests that component analysis should offer fruitful results, this does not hold in all cases. To revisit the example above, alongside the sun “日” component used as the radical, the character for wisdom contains the basic components arrow “矢” and mouth “口”.

In some characters, no component indicates semantic meaning

Yet once again, anyone familiar with the Chinese script can instantly recognize that the words for “wisdom” and “knowledge” are semantically related, despite them sharing no common characters, radicals, or “components” in the defined sense.

There is a semantic relationship between “wisdom” and “knowledge”

The character for knowledge “知” appears as a sub-word graphic structure in the character for wisdom “智”. However, because it does not constitute a character, radical or component, none of the traditional means of classifying Chinese characters are able to produce a system that identifies this as a shared element. Meanwhile, attempting to identify and codify all graphic elements between the component and character levels that convey semantic information would be a monumental manual undertaking.

So how do stroke n-grams provide a minimalist solution that still ensures this information is systematically identified and stored?

Chinese Stroke Order and n-grams

Stroke n-grams rely on the fact that handwritten Chinese characters are always combinations of five basic stroke types (horizontal, vertical, left-falling, right-falling, and turning), and that characters are always written from top to bottom, left to right, one component at a time.

With n-grams, stroke numbers are used to create an ID sequence

To revisit the wisdom-knowledge example, this means that the sub-word structure knowledge “知” is written with the same stroke sequence wherever it appears. By giving each stroke type a number and then representing a combination of strokes as a numerical sequence, a system can identify the same sequence occurring in different contexts. This is why stroke n-grams are capable of capturing morphological and semantic information shared between words, even though a stroke by itself conveys no semantic information.
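This shared-sequence idea can be sketched in a few lines of Python. The stroke codes below are hand-derived for illustration; a real system would look them up in a stroke-order database.

```python
# Five basic stroke types, numbered for illustration:
# 1 = horizontal, 2 = vertical, 3 = left-falling,
# 4 = right-falling/dot, 5 = turning.
STROKES = {
    "知": "31134251",       # 矢 (arrow) + 口 (mouth)
    "智": "311342512511",   # 知 (knowledge) + 日 (sun)
}

# The full stroke sequence of "knowledge" occurs inside "wisdom",
# even though the two characters share no radical or component.
print(STROKES["知"] in STROKES["智"])  # True
```

A plain substring check is enough to surface a relationship that character-, radical-, and component-level analyses all miss.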

Converting words to n-grams

Chinese words are mapped into stroke n-grams using the following process:

1. Words (comprising one or more Chinese characters) are divided into their constituent characters.

2. The stroke sequence for each character is retrieved and concatenated together.

3. Each stroke is mapped to its numerical stroke ID, yielding a digit sequence for the word.

4. A sliding window of size n is imposed to generate stroke n-grams.

Generating n-grams via stroke sequence analysis

As shown in the above example for the word adult “大人”, the single 5-gram captures the stroke sequence for the entire word, while the 3-grams and 4-grams capture stroke sequences for sub-word graphic structures.
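The four-step conversion above can be sketched as follows, assuming a small hand-coded stroke table (a real system would use a full stroke-order database):

```python
# Per-character stroke ID sequences (1 = horizontal, 3 = left-falling,
# 4 = right-falling), hand-derived for this example.
STROKE_IDS = {"大": "134", "人": "34"}

def stroke_ngrams(word, n_min=3, n_max=5):
    # Steps 1-3: split the word into characters, retrieve each
    # character's stroke sequence, and concatenate the stroke IDs.
    seq = "".join(STROKE_IDS[ch] for ch in word)
    # Step 4: slide windows of size n over the ID sequence.
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [seq[i:i + n] for i in range(len(seq) - n + 1)]
    return grams

print(stroke_ngrams("大人"))
# ['134', '343', '434', '1343', '3434', '13434']
```

For the five-stroke word “大人”, the window sizes 3 through 5 yield three 3-grams, two 4-grams, and one 5-gram covering the whole word.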

Using stroke n-grams for learning word embeddings

Word embeddings, also known as word vectors, help computers comprehend words. Popularized by Google’s word2vec model, the technique maps a word’s semantic meaning into a low-dimensional vector space. Through this method, synonyms are identified by the measure of distance between their corresponding vectors.
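As a toy illustration of the distance measure, cosine similarity between vectors is a common choice. The 3-dimensional vectors below are made up for demonstration; real embeddings live in spaces of hundreds of dimensions.

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: related words should sit close together.
timber = [0.9, 0.1, 0.2]
forest = [0.8, 0.2, 0.3]
bank   = [0.1, 0.9, 0.1]

print(cosine(timber, forest) > cosine(timber, bank))  # True
```

A well-trained model would place ‘timber’ and ‘forest’ near each other in this space while keeping unrelated words like ‘bank’ farther away.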

To incorporate stroke n-grams into word embedding learning, the research team designed a simple yet effective mathematical model that helps computers learn Chinese-style word embeddings. The novel algorithm outperformed Google’s word2vec, Stanford’s GloVe, and Tsinghua’s CWE, among others, on public test datasets, and yielded better results on several Alibaba and Ant Financial tasks.
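While the full model involves training over a large corpus, the core scoring idea of representing a word through its stroke n-grams when judging how well it fits a context word can be sketched as below. The vectors here are random placeholders and the n-gram set is the hypothetical one for “大人”; training, negative sampling, and all other details are omitted.

```python
import random

random.seed(0)
DIM = 4  # tiny for illustration; real models use far more dimensions

# Each stroke n-gram and each context word gets its own vector.
NGRAMS = ["134", "343", "434", "1343", "3434", "13434"]
ngram_vec = {g: [random.uniform(-1, 1) for _ in range(DIM)] for g in NGRAMS}
context_vec = [random.uniform(-1, 1) for _ in range(DIM)]

def similarity(word_ngrams, ctx):
    # Score a (word, context) pair by summing the dot products of the
    # word's stroke n-gram vectors with the context word's vector.
    return sum(
        sum(q_i * c_i for q_i, c_i in zip(ngram_vec[g], ctx))
        for g in word_ngrams
    )

score = similarity(NGRAMS, context_vec)
```

During training, such a score would be pushed up for words observed near each other and down for random word pairs, so that words sharing stroke n-grams end up with similar embeddings.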

To learn more about how the team applied stroke n-grams to learning word embeddings, read the full paper here.

Alibaba Tech

First-hand and in-depth information about Alibaba’s latest technology → Search “Alibaba Tech” on Facebook
