Big Data Insight Empowered by Wisers’ Artificial Intelligence Text Analysis Technology

Source: Dr. Quincy Liang, Wisers AI Lab Senior Researcher    2021.01.12

The COVID-19 pandemic has taken Hong Kong on an unprecedented journey, from people feverishly snapping up face masks all the way to overcapacity in face mask production. In the midst of all the mayhem, some Hongkongers have turned their vacations into staycations, while the Government has been actively rolling out anti-epidemic measures. But how do we accurately grasp what citizens need? How does the Government know which measures have won more praise and which have met serious resistance?

With the prevalence of online social media platforms (such as Facebook and Instagram), they have become the main channels for people to gather information and share their opinions. Tens of millions of user posts, comments and likes are generated on social media every day, reflecting the most heated discussion topics in town. This allows us to decipher public opinion with big data.

But it is not an easy task. Online posts and comments are generally unstructured text data, which cannot be directly interpreted by a computer. We need to classify and extract information from unstructured text data in order to carry out in-depth statistical analysis. Let us take the interpretation of "keywords" as an example.

Most traditional text processing techniques are based on keywords, which must be pre-set and matched against the text. Applied to big data analysis, this approach has three main disadvantages: first, it cannot recognize new expressions, such as new slang or nouns, which lowers accuracy; second, it cannot process complex linguistic and grammatical logic; third, it is limited to pre-set keywords. The cost involved is high, and it can only pick out texts containing the analysts’ pre-set keywords, so new insights or patterns hidden in big data cannot be discovered. In view of these limitations, we have developed our intelligent text analysis technologies based on neural networks, which greatly improve the representation and generalization capabilities of the model.
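To make the first and third limitations concrete, here is a minimal sketch of keyword matching in Python. The keyword list, the function name keyword_match and the sample posts are all hypothetical and chosen only for illustration; this is not Wisers’ implementation.

```python
# Hypothetical pre-set keyword list; real lists would be far larger.
KEYWORDS = {"face mask", "quarantine"}

def keyword_match(text, keywords=KEYWORDS):
    """Return the pre-set keywords that literally occur in the text."""
    lowered = text.lower()
    return sorted(k for k in keywords if k in lowered)

posts = [
    "Queued two hours for a face mask today",
    # A newer expression that is not in the pre-set list, so it is missed.
    "Booked a staycation instead of flying out",
]
matches = [keyword_match(p) for p in posts]
```

The second post is about a pandemic-era trend, yet it matches nothing because "staycation" was never added to the list. Someone has to notice the new expression and update the keywords by hand.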

We hereby introduce four representative text analysis techniques based on artificial intelligence, namely sentiment analysis, named entity recognition, text classification, and hot word discovery.

Sentiment Analysis  

Sentiment analysis refers to the automatic identification of sentiment expressed in a text by an artificial intelligence system. We use two kinds of model: one identifies three types of sentiment (positive, negative and neutral), and the other identifies two (positive and negative). Because social media mixes different languages (colloquial writing, English, Chinese) and different forms of expression (emojis, slang, hashtags), it is challenging to identify sentiment accurately from the text. Let’s look at an example. For phrases like “the group tour has a packed itinerary” (using the Cantonese slang “chur”) or “heavy punishment”, we cannot simply judge the sentiment from the literal meaning, as the intended meaning may be the complete opposite. The Cantonese slang “chur” means “in a hurry”, which implies a negative sentiment, while the underlying meaning of "heavy punishment" is to encourage people to patronize a restaurant or shop, which is a positive sentiment. To identify sentiment accurately, we have adopted advanced machine learning techniques, including pre-training and weak supervision, to develop our sentiment analysis model. Wisers’ proprietary technologies can identify the overall sentiment of a text and also support subject-level analysis, that is, the sentiment towards a specific company or product mentioned in an article. This matters because an article may mention many companies or products, and the sentiment towards each of them may differ. Thanks to sentiment analysis, we can monitor positive and negative sentiment and the latest developments in discussion topics such as anti-epidemic measures, the budget, and more.


Named Entity Recognition

Named entity recognition addresses the disadvantages of traditional keyword-based text processing noted earlier. It can automatically identify from the text the names of companies, organizations and persons, as well as titles, times, places, brands, products and other user-defined entity information. There is no need to set any keywords in advance. This is because we have integrated a deep learning model that uses complex linguistic features with an online feedback-based learning mechanism, which continuously optimizes the model's accuracy. With named entity recognition, we can automatically discover the companies, institutions, brands and products that are being heatedly discussed, and support early alerts for contingencies as well as brand crisis tracking.


Text Classification

Text classification refers to the automatic tagging of a text according to its content. The classification system can be based on news categories (such as finance, sports and technology), industry categories (such as automobile, luxury and catering), or other customized classifications. Using deep learning and hundreds of millions of news articles and social media posts collected over the years, we have trained a massive Chinese semantic vector model with more than 13 million entries. On this basis, we have developed a semi-supervised text classification technology that combines the semantic vector model with deep learning. Empowered by text classification, we can analyze the share of voice and the hotness of discussions across different industries.


Hot Word Discovery

Hot word discovery refers to the extraction of high-frequency words from a large number of texts. Because online social media posts mix different languages and forms of expression, traditional Chinese word segmentation may split sentences incorrectly. We have therefore developed a new patented hot word discovery technology that can automatically extract words from the text with a high level of reliability. With hot word discovery, we can automatically discover the latest trending words online and analyze keywords from online comments.
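The basic idea of surfacing frequent terms without a pre-set list can be sketched with simple n-gram counting. This is only a naive frequency baseline under our own assumptions (whitespace tokens, a hypothetical candidate_hot_words function and made-up posts); it is not the patented technology described above, which also has to handle unsegmented Chinese text.

```python
from collections import Counter

def candidate_hot_words(posts, max_n=2, min_posts=2):
    """Count word n-grams by the number of posts they appear in."""
    doc_freq = Counter()
    for post in posts:
        tokens = post.lower().split()
        seen = set()  # count each n-gram at most once per post
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
        doc_freq.update(seen)
    # Keep only n-grams that occur in at least min_posts different posts.
    return [(g, c) for g, c in doc_freq.most_common() if c >= min_posts]

posts = [
    "staycation deals this weekend",
    "best staycation deals",
    "mask prices again",
]
hot = dict(candidate_hot_words(posts))
```

Counting by posts rather than by raw occurrences keeps one repetitive post from dominating the list. A practical system would add further filtering that this sketch omits, for example removing common function words.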

Figure 1: Deconstructing a social media message with text analysis technologies

Lastly, let’s look at another example of how unstructured text data can become "structured" with the help of text analysis technologies. Figure 1 shows a comment saying that "face masks from Company Z are outrageously expensive, especially X123!". With named entity recognition, the company name (Company Z) and product name (X123) are automatically identified from the text; text classification places the comment under the health industry; and sentiment analysis assigns it a negative sentiment tag. With structured labels, in-depth statistical and comparative analysis can be conducted on big data. We can, for example, analyze which product or company received the highest number of likes in the health industry. We can also find out the positive ratings of different products under the same company, and more.
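Once each comment carries such labels, the statistics mentioned above reduce to ordinary aggregation. The sketch below assumes a hypothetical TaggedComment record and takes the "positive rating" to be the share of positive comments per product, which is our assumption rather than a definition given here. Product "Y456" and the sample data are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class TaggedComment:
    company: str    # from named entity recognition
    product: str    # from named entity recognition
    industry: str   # from text classification
    sentiment: str  # from sentiment analysis: "positive", "negative" or "neutral"

def positive_share_by_product(comments):
    """Share of positive comments for each (company, product) pair."""
    totals = Counter()
    positives = Counter()
    for c in comments:
        key = (c.company, c.product)
        totals[key] += 1
        if c.sentiment == "positive":
            positives[key] += 1
    return {key: positives[key] / totals[key] for key in totals}

comments = [
    TaggedComment("Company Z", "X123", "health", "negative"),
    TaggedComment("Company Z", "X123", "health", "positive"),
    TaggedComment("Company Z", "Y456", "health", "positive"),  # hypothetical product
]
shares = positive_share_by_product(comments)
```

Because every field is a plain label, the same records can be regrouped by industry or company without reprocessing the original text.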


Here is a real-life example. In Wisers’ Online Epidemic Report, our analytical team used big data and smart text analysis technologies to delve into online posts and comments related to the Government's anti-epidemic measures. We found that among the nine anti-epidemic measures put in place by the Government, the press conferences held by the Centre for Health Protection received the most praise from netizens, with a positive rating of 29.6%. Dr Chuang Shuk-kwan, Head of the Communicable Disease Branch of the Centre for Health Protection, was the most popular government official among netizens. In contrast, the Government's distribution of free face masks was rated the worst anti-epidemic measure by netizens, with a negative rating of 69.4%. According to the data, netizens were concerned about the tendering process for mask manufacturers as well as the materials and design of the masks.


With the help of big data and state-of-the-art text analysis technologies, insights can be drawn from data and public opinion can be deciphered in real time with higher precision.