Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

TopicBERT: A Transformer transfer learning based memory-graph approach for multimodal streaming social media topic detection

TopicBERT: A Transformer transfer learning based memory-graph approach for multimodal streaming... Real time nature of social networks with bursty short messages and their respective large data scale spread among vast variety of topics are research interest of many researchers. These properties of social networks which are known as 5’Vs of big data has led to many unique and enlightenment algorithms and techniques applied to large social network- ing datasets and data streams. Many of these researches are based on detection and tracking of hot topics and trending social media events that help revealing many unanswered questions. These algorithms and in some cases software products mostly rely on the nature of the language itself. Although, other techniques such as unsupervised data mining methods are language independent but many requirements for a comprehensive solution are not met. Many research issues such as noisy sentences that adverse grammar and new online user invented words are challenging maintenance of a good social network topic detection and tracking methodology; The semantic relationship between words and in most cases, synonyms are also ignored by many of these researches. In this research, we use Transformers combined with an incremental community detection algorithm. Transformer in one hand, provides the semantic relation between words in di erent contexts. On the other hand, the proposed graph mining technique enhances the resulting topics with aid of simple structural rules. Named entity recognition from multimodal data, image and text, labels the named entities with entity type and the extracted topics are tuned using them. All operations of proposed system has been applied with big social data perspective under NoSQL technologies. In order to present a working and systematic so- lution, we combined MongoDB with Neo4j as two major database systems of our work. The proposed system shows higher precision and recall compared to other methods in three di erent datasets. Keywords: Memory–Graph, Deep Learning, Frequent Subgraph Mining, Transformer, Multimodal Learning, BERT 1. Introduction Internet and its applications are widely advancing towards social networks in which the user generated content plays a crucial role [1]. This content which is mostly generated by users, companies, news reporting agencies and etc., consists of text, image and video and has noise naturally [2, 3]. This media forms the foundation of social media delivery systems such as Facebook, Youtube, Instagram, Twitter, and etc. with di erent views and media orientations. Some of these social networks tend to be an online social TV such as Youtube. Content of these networks are generated in video format by users and posted in channels and communication between users is possible by commenting under these videos. Instagram was started as an image and short video sharing social network and the same communication method has been implemented. Facebook as another big social network, does not restrict users to post specific media types and also allows users to directly communicate with each other using direct chat which is recently available on Instagram too. Among these popular social media delivery systems, Twitter has gained much attention in recent years due to its simplicity and user friendly interface. It is also known as the best social media for discovering “what is happening right now”[4]. Twitter is easily accessible by almost any device connected to Internet through a web browser or a third-party/ocial application. The nature of twitter short posts that are known as tweets has restrictions in size and at first it was supposed to be a gentle replacement for SMS [5]. In other words, twitter is the best social media for discovering news and real world events [6]. Preprint submitted to arXiv August 19, 2020 arXiv:2008.06877v1 [cs.CL] 16 Aug 2020 Reports show that on a daily basis, 500 million new tweets and on a monthly basis, 300 new user sign-ups happened on twitter in 2013; In 2018, it is estimated that 700 million tweets will be posted each day and growth rate for number of tweets per year to be around 32% on 2013 [7]. These statistics show that users tend to produce large volumes of tweets on variety of types with high velocity and diverse veracity with di erent values; Mentioned characteristics of twitter streaming media takes it into account of social big data with respect to 5’V model [8]. Twitter has very rich data in form of tweets which can be accessed using Twitter API. This API made researchers capable of investigating its social big data in a streaming way or in a query based data retrieval fashion [9]. Docu- mentation about the API itself is available at twitter developer website . This API allows direct messaging, search, advertisement management, and account activity control. It has some restriction to prevent developers from abusing the service. For example, it has rate limit for user or application . This open and free dataset has been used by many researchers. Many advancements has been conducted in multiple research areas related to this social big data such as opinion mining, topic detection and tracking, user modeling, sentiment analysis, and etc. But the underlying conceptual perspective of researcher to the problem makes di erent outcomes with diverse real world applications. If the data is supposed to be treated as streaming social big data (as it is) then the problem would likely to be solved in an unsupervised or semi-supervised methodology. Some other researches try to solve the problem with respect to a static corpus but others tend to do so in an streaming big social data respect [10]. On the other hand, many of the previous works assumed the correctness of incoming data for granted which is not true in many cases. For example, in the case of analyzing tweets related to regular users, many noisy content such as mistypes and grammar errors exist while in the case of ocial tweets like a news agency tweet, this problem can be ignored. The most missing part of this analysis is related to images and videos of the tweets. To the best of our knowledge, most of the existing methods did not use multimodal data to detect and track topics. To overcome the mentioned problems, we present a novel approach for topic detection and tracking problem. Our work consists of three major parts: multimodal named entity recognizer, memory graph, and Transformer based se- mantic similarity assignment. The combination of these three parts provides the final results. Orientation of presented study is as follows: Section 2 covers related works of the problem at hand. Section 3 presents the proposed methodol- ogy for detecting and tracking topics. The experimental results at Section 4 investigates the performance of our work on three distinct datasets. Finally, Section 5 concludes the whole paper. 2. Related work This section describes various techniques proposed for topic detection and tracking from twitter. For reader curiosity and further information, Table 8 is inserted to cover most of related works in literature of TDT task on twitter. The first methods we investigate here, are using frequent pattern mining as building block. The frequent pattern mining problem is a well studied area by virtue of its various applications in multiple domains like clustering and classification. This problem is initiated as finding relationships between items in a transaction database [11], where the formal definition is given below: Definition 1. Given a database D consisting transactions T ; : : : ; T , determination of all possible patterns P with at 1 2 least s frequency over all of transactions, is known as Frequent Pattern Mining. With respect to this definition, a document collection can be noted as a database of transactions. Depending on this technical terminology, a frequent pattern mining approach applied to collection of documents discovers relationships between words with frequency of at least s [10, 12]. To have a better formalization and homogeneity, just like a database of transactions, we use notation D for collection of documents and P for possible patterns which are relationships between words. Abstract topic on the other hand, is defined in the following: https://developer.twitter.com/en https://developer.twitter.com/en/docs/basics/rate-limiting 2 Definition 2. Set of abstract topics (or shortly to pics)  in collection of documents D are bag of words W related to each other by occurrence in the same documents with high frequency. Summing up definitions 1 and 2, topic detection task from a series of documents is almost equivalent to frequent pattern mining in database of transactions if the semantic relation between words is discarded from problem. A major topic such as  2  usually consists of multiple minor topics like  , based on this assumption, a minor i j topic is defined as:  = fW j W  W g. With association rule mining in mind [13], statement    holds true j    j i j j i while the confidence c for a minor topic depends on detection algorithm but at any means is less than one. On the contrary, a real life event, or event for short is defined as [10]: Definition 3. An event e 2 E is a real-life topic that occurs in real life at specific time period . According to this definition, any event is also a topic but a topic  needs to happen in real life to be counted as an event e. For example, the Bongo Cat is not a real life event but it is surely a topic. For event detection tasks, it is crucial to have more information about topics rather than depending only on documents; online sources such as Wikipedia are very helpful for information gathering on this specific problem. These two problems, namely frequent pattern finding and topic detection, are dealt by a similar way. Frequent pattern mining has been a common way to solve the topic detection task over twitter data stream and some variants of formal methods has been proposed by researchers in [14–17]. In [14], authors designed a soft frequent pattern mining approach to overcome the topic detection problem. This research aims at finding word co-occurrence large than two, in order to do so, likelihood of each word is computed separately. Equation 1 shows likelihood of appearance for each word. N + p(wjcor pus) = P (1) ( N ) + n where N is the number of occurrences of word w, and  equals to 0:5. Denominator part in this equation is used to normalize the fraction. Grow rate of each word in corpus is computed to indicate the most discussed words. After finding top-K frequent words, a co-occurrence vector is formed to add co-occurring words to the top-K word vectors. This expansion forms the foundation of soft frequent pattern mining algorithm (SFPM) utilized in [14] to solve the topic detection problem. Results presented in main article shows superiority of this method to LDA [18] and a Gra ph based approach [19], which is assumed to be the baseline for this research. Like previous method, in [16] a named entity recognition weight has been added to boost scores for each word. This simple addition to original algorithm (SFPM) improved topic recall, keyword precision and keyword recall on two di erent datasets. Recently, high utility pattern mining (HUPM) has been introduced [20]. This approach for finding frequent item- sets with higher utility in twitter topic detection domain has been investigated in [15, 17]. The proposed high utility pattern clustering in [15] considered the utility of each pattern in order to extract high utility patterns, afterwards it takes these patterns and uses an incremental pattern clustering scheme. K N N and modularity based clustering employed by authors, try to identify new emerging patterns and coherent topics simultaneously. Finally, a group of words for each cluster of patterns are extracted to represent the topic of that specific cluster. All of mentioned steps together form HUPC framework (High Utility Pattern Clustering): text stream, text segments, top-K HUP mining, term indexing, HUP clustering, and post-processing. Like two previous methods, proposed scheme in [17] considered the problem of topic detection as a pattern mining problem and introduces a HUP mining algorithm in order to extract final topics. Internal, external, and transaction weighted utility of words has been used by authors to determine the utility of words. Minimum utility of each chunk of tweets is determined automatically by their algorithm. Also, for post-processing step, a T o pic tree was proposed to extract actual topics from candidate ones. Experimental results shows superiority of this work compared to five di erent methods. Bongo cat is a set of tweets identified by #BongoCat hashtag: https://www.twitter.com/hashtag/Bongocat Wikipedia is a free encyclopedia, written by the people who use it: https://www.wikipedia.org/ 3 Some research [21–24] used dynamic heart beat graph technique (DHG) for streaming data on Twitter. The proposed algorithms construct a sub-graph for each sentence. These bi-clique sub-graphs are added to the main graph and a graph operation on the whole graph is performed based on the frequency of co-occuring words (edges between nodes). Their proposed method shows a novel approach on how graph methods can be applied to streaming text but on the other hand their work lacks multimodality and comprehensiveness. However, in most of previous works the main contribution of authors is extracting topics with higher precision and recall, but none of these works except [17, 23] tried to take performance metrics in mind. An online topic detection and tracking scheme must fit in Definition 4 to be considered as a practical solution. Definition 4. New event detection (NED) is discovering new events from live stream L in real time with no delay or at least with an acceptable delay t . If the extraction and tracking method is applied to collection of oine documents, it will be considered a RED methodology, which completely di ers from NED. Definition 5 is a formalization of RED. Definition 5. Retrospective event detection (RED) is discovering previously unidentified events from collection of documents. Definitions 4 and 5 make a fair clarification of Table 8 and highlight di erences among diverse approaches. In Table 8 we provide an overview of related works in this field and categorized them based on the NED and RED terminology. 3. TopicBERT The three pillars of our proposed system (TopicBERT), namely, Transformer semantic similarity, word community detection, and multimodal named entity recognizer, are bounded together as it is presented in Figure 6. The incoming data from social media stream such as Twitter is streamed into the MongoDB that feeds multimodal data into di erent elements of the system. In this design, MongoDB acts as a cache for whole system in the case of any long delay or failure in any system parts. Delays happen because di erent parts have di erent complexity and speed metrics. On the other hand, the neo4j database is the repository of word graph in form of a memory graph. The following sections describe each part in more detail and the last section reviews how the whole system works in a harmony. 3.1. Memory-Graph Our approach identifies any document as a word graph such as G = (V ; E ), in which the nodes represent words, i i i and vertices are the co-occurrence relation of these nodes. A group of word graphs arriving in a streaming fashion form a larger graph such as G = (V; E). Figure 1 shows an example of such graphs. Preprocessing tasks like tokenization and stemming can be applied before forming word graphs. Time slots are used to identify arrival of each document and accordingly its addition to G. Upon arrival of each separate and independent G , the cumulative word graph G will be updated and each new node and edge will be added. Putting previous assumption in mind, each word graph has a time slot identifier label like t and must be noted as G . Discarding duplicated words in each document yields (i;t) to no further change in G but the G must be a weighted graph to uphold weight of relations. In our case, we use (i;t) similarity metric for computing comulative weights that is explained in Section 3.2. Evolving nature of cumulative word graph G indicates that streaming documents tend to complete it. In order to have a fading e ect on G, we also consider it as a memory-graph (MG) that can remember and forget. A control parameter %, controls the fading e ect on memory graph, meaning, each node and edge would fade in MG as time passes (new word graphs arriving). Forgetting and remembering properties of MG makes it reliable for extracting frequent patterns as relations that have certain properties in terms of MG’s memory representation. In other words, relations that withstand the evolution of MG are those that are being repeatedly added. Weights of MG show the importance of particular relations over https://neo4j.com/ 4 This is great That is not possible is that is this possible great not t t 0 1 that is possible this not great Figure 1: Word Graph formation example time, if any weight goes under a specific value, then the relation does not seem to be an important one anymore. If the weight shows a spiking growth, it means that the relation has been repeated with a high frequency in a short time. This property indicates an emerging relation if it disappears soon, otherwise, it will be assumed to be a common relation. All that is said draws a general assumption of MG which is core of current research to control the size of graph. Setting up forgetting rate can be done in diverse setups ranging from computational or cognitive approaches. In order to adapt a cognitive configuration bound with our presented MG, we implement forgetting curve [25]. According to Hermann Ebbinghaus’s work, who pioneered experimental study of memory, the forgetting curve is formulated as: R = e (2) where R is retrievability, % is stability of memory and t is time. % determines how fast retrievability falls over time in absence of any memory recall . Memory recall is defined as simply trying to remember items. However, there are two types of memory recall, free recall and serial recall. Free recall is an attempt to recall individual items regardless of their order while serial recall is attempting to recall the list of items in the ordered study. Overlearning e ect is practicing memorization beyond required repetitions. This e ect ensures that the memorized information will be more resistant to disruption or loss. Using this e ect and multiplying it to any values gives the fading e ect over it. For the t value in this equation, we will use the time that specific value last updated. If it has been a long time from its last update then the e ect is high and if it goes under a specific threshold it will be completely forgotten (deleted). However, the remembering and forgetting has its shortcomings too. For example, some words come from di erent contexts which need to be updated with di erent weights rather than just using +1 for any edge. On the other hand, words or phrases must have di erent importance because of their semantic nature; For example, a phrase related to a memory recall is di erent from recall 5 named entity such as person or location must be more important compared to a typical word. To enhance the described memory graph, subsection 3.2 and 3.3 describe our solutions. 3.2. Transformer masked language modelling Capturing the words and recording them in a graph form is explored in many previous works too. However, none of them tried to consider the semantic relation between words that appear in previous di erent contexts. For example, consider a word such as W , previously appearing in three di erent tweets of T ; T ; T and the word W in only one 1 2 3 tweet T , each of these two words had been on di erent contexts before, and no relation between them is available in our word graph. A new tweet arrives that has both of these words in the same context, a tweet such as T . The resulting sub-graph contains both of W and W now and according to previous methods there should be an edge with the weight of one between the two words. Apart from the consideration that this assumption is wrong, in this section we try to employ Transformers to improve it. 7 , 8 , 9 We use the BERT transformer pretrained model to compute the embedding for each incoming tweet and record it as an attribute for each node (word). The original version of BERT is not sucient because it does not capture sentence similarity, instead we used a pre-trained version called SentBERT [26]. If another tweet comes that has the word, embedding attribute of node is updated using a weighted approach between previous and new embedding. In the case of computing score for two di erent nodes in one tweet, we use frequency and similarity score. Frequency is the number of occurrence of two nodes in the same tweets, and score is calculated using (3). t t1 t t S = S + cos(M ; M ), if W and W 2 tweet (3) i j t i; j i; j i j tweet t t1 t M = M + (1 )M , if W 2 tweet (4) i t i i t t where, tweet is the tweet that arrives at time t, S is the score between words W and W , M is the embedding of t i j i: j i word W , is discount value for aggregating previous embedding and the new one (in the range of (0; 1)), and t is the time. According to (4), at each time a new tweet arrives and the embedding of nodes are updated with embedding provided from BERT (i.e., [CLS] token). The reason we have used [CLS] instead of each node embdding is that the context itself is more important than the word embedding and some words are tokenized into multiple tokens rather than one token which makes it harder to aggregate. Scoring equation shown in (3) provides a mechanism to score edges rather than naively using frequencies. This score equation at its best is equal to co-occurrence frequency, t t because larger values for cos(M ; M ) at each step equals means higher context similarity. However, this mechanism i j provides a smoother approach compared to frequency counting, but the online learning for fine tuning the underlying BERT language model has an undeniable importance. This fine tuning makes the language model more tuned to the task which the language is in form of tweets rather than formal essays or grammatically perfect articles. We adopt this assumption too, and use it in our setup to finetune the BERT at each step we gather batches of tweets. Utilizing MongoDB as a cache is important in this case. Additionally, if there is more than one server available, a dedicated server can be used for language modeling. 3.3. Multimodal Named Entity Recognition Categorization of items in the graph, based on their types provides key features to highlight the node significance. For example, in our study, words or phrases that point to some named entity such as a person is more important in forming a topic. To find the named entities, we use the deep multimodal model described in [27]. Figure 2 shows the multimodal named entity recognizer model that utilizes both image and text data from tweets. The multimodal nature of this approach provides better results in the presence of noise and it is able to perform recognition in the absence of image. The high performance of the model makes it perfect fit for the task at hand. Available at: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-512_A-8.zip Source code from google: https://github.com/google-research/bert SentBert: https://pypi.org/project/sentence-transformers/ 6 ... ... ... ... Text Conditional Random Field TD + GN + SineRELU ... T T T T T 0 1 2 3 n Conv (16x2) Bidirectional LSTM Bidirectional LSTM h h h ... h Conv (32x3) 0 1 2 n h h h ... h 0 1 2 n Conv (64x4) h h h ... h 0 1 2 n h h h ... h 0 1 2 n Conv (64x4)+ MP Conv (32x3)+ MP Forward LSTM Joint Word Embedding (500) h h h ... h 0 1 2 n Conv (16x2)+ MP fastText (300) GloVe (200) Character Embedding Feature Embedding Select Top 5 C C C C n n n n InceptionV3 W W W W 0 1 2 ... n ... C C C C 1 1 1 1 C C C C 0 0 0 0 Image Feature Extractor s =⟨ w , w , w , ..., w ⟩ Image 0 1 2 n Figure 2: Multimodal NER model, courtesy of [27]. 3.4. Graph update rules Updating the graph makes it larger and therefore, the memory graph approach is considered to shrink it to the desired size. However, adding multiple tweets (subgraphs) to it needs precautions, because each addition changes the shape and form of connections. Tracking changes and inferring topics according to small changes will be harder as the data stream get faster or the topic gets hotter. In order to detect and track topics, we propose using graph update rules that we obtained by getting motivated from [28]. They initially used this approach for community detection and we employ a di erent but similar method for topic detection. Each tweet converted to a complete subgraph, is part of a larger graph called memory graph. At each time step t, incoming tweets are converted to complete subgraphs such as G = (V ; E ). For simplicity, we see tweets coming in a specific time stamp one by one. Time stamp t is the t t t identifier for each complete subgraph such as G generated from tweet. The V symbol presents words or tokens and t t E presents the connection between them. E is complete for any tweet and is always determined according to (3) and t t (4). Rest of the notations are explained in Table 1. According to the assumptions above, we redefine each edge as a score in (3) and note the subgraph as G = (V ; S ) in which S  S . To understand the graph update rules, we first categorize the incoming subgraph type t t t i; j and according to its type, we decide which strategy to apply. Table 2 described each category. The first category t t U needs no precaution for addition because it will form a completely new topic cluster  <  independent of any other previous one. The strategy required for this category is to assign a new cluster id for it and add it to the set t t t of topic clusters at time t that is noted as  . In other sense, (8a)W 2 V ! (8i)W <  or (8i)V \  = ; is the a t a t i i mathematical definition of this category. In the case of category I, the new graph will be completely added to an t t t existing  because (9i)V   . This update will change the inner structure of  and has no impact on any other topic i i i clusters. The Multiple category orM is identified as words belonging to more than one topic cluster. Its mathematical 7 Table 1: Notations used in this paper Notation Description G = (V ; E ) Complete graph from twitter stream at time t with tokens as V and edges as E t t t S Cumulative similarity score between tokens i and j in time t (Eq. 3) i; j M Embedding of token i at time t (Eq. 4) Set of all topic clusters at time t ith topic cluster at time t Set of all topic clusters in all times ith topic at time t N (W ) Neighbourhood of W in time t t t definition is expressed as [(8a)W 2 V ! (9i)W 2  ] ^ [(8 j)V 1  ]. The last category, S, is expressed as a t a t i j t t t [(9a)W 2 V ! (8i)W <  ] ^ [(9b)W 2 V ! (9i)W 2  ] ^ [(8 j)V 1  ]. For the last two cases of these a t a b t b t i i j categorization, M and S, we need algorithmic approach to determine which word belongs to which topic cluster that is presented by Algorithm 1. In this algorithm, the topic cluster stickiness metric is used that is explained in (5). The function provides a metric to determine how much a single word W is related to a topic such as  . This function can be used in di erent approaches, utilization of this function is described by Algorithm 1. W ;W a b W 2N (W )\V t b a (W ;  ) = P (5) W ;W a b W 2N (W ) b a Algorithm 1: Update  for all the categories listed in Table 2 UpdateTopicClusters ( ; G ) Inputs :  and G = (V ; S ) t t t t+1 Output: 1 G = fW 2 V j(8i)W <  g; unseen t 2 G = V G ; seen t unseen t 0 3 S = fS jW 2 V and W 2 N(W )g; u pdate 0 t W;W 4 Apply S to related edges in update; u pdate 5 if G == ; then seen 6  = G ; t t K +1 7 end 8 else 9  = G ; t t K +1 10 foreach G 2 [G ; G ] do unseen seen 11 foreach W 2 G do t t t 12 assign W to  where i = argmax f(W;  )jN(W )\  , ;g; i i i i2f1;2;:::;K +1g 13 end 14 end 15 if  == ; then K +1 t t 16 remove  from  ; K +1 17 end 18 end 19 return  ; This algorithm provides a combination of all categorizations in a better understandable and ecient way. At 8 first, if the incoming G , consists of all new words, if it is type U, a new topic cluster is formed and assigned to it. Otherwise, for seen and unseen words in the G (first seen and then unseen ones), we calculate  for each W and assign W to the topic cluster which has the higher . This part of algorithm is required for the M and S categories while for the I no update is required, just updating the S values is enough. The reason for no required update in category I is that no new W will be added to topic clusters, just the S values will be updated. Table 2: Categorization of incoming tweets as subgraphs Category Symbol Description Unique U All words in tweet are new and did not appear in any Incessant I All words in tweet are previously merged into a single Multiple M All words in tweet are previously merged into more than one Subset S Some words in tweet are previously merged into one or more that one  and some are new 3.5. Topic Extraction The extraction part seems much easier having well-separated topic clusters which only contain related and con- nected words. However, using other features of the tweets is also helpful. By other features, we use how much tweet has been retweeted and liked. The other features are also available too, such as word frequency itself and the trend. We use all these features and store them in each node along with the embedding. Tagging words and grouping them as entities also improves the outcome. Here, we describe the methodology behind extracting topics from topic clusters at time t or the  . Equations (6) and (7) describe how we score each node in each time t. 0 1 B X C B C B C t t t 0 t B C B C (W ) = E   (W ) +  (W ) S (6) B 0C W;W @ A 0 t W 2N (W ) 0 1 B C B C B C t t t B C (W ) = F  BF (W ) + (L + R )C (7) i i @ A i=0 1:2; if W is tagged as named entity E = (8) 1; otherwise From the above equations, the outcome values for are converted to probabilities by (9). In this equation,  is the topic cluster that W belongs to. For each cluster, the probability of that topic cluster is in the topics is computed by (10) and also, for each topic cluster, the probability of each word being nominated is computed by (11). Multiplication of these two values gives the probability of each word being nominated in the overall outcome. Higher ranked words will appear first. (W ) P(Wj ) = P (9) t 0 (W ) 0 t W 2 (W ) W2 t t P( j ) = P (10) (W ) W2 t t t t P(Wj ) = P(Wj ) P( j ) (11) i i 9 3.6. Proposed system in a bounded harmony Although all parts and sections describe the system in little pieces but how this parts are bounded needs to be clarified. Figure 6 shows a visualization of the proposed system. The most important part of this systems is the data flow into the parts and keep it in a harmony. By harmony we mean how each part does its task and minimize the requirement for delays caused by slower parts. First, the connection to the twitter data stream is established and the incoming tweets about a topic of interest (this can be done by targeting specified keywords or just querying on specific key-phrases) are stored in the MongoDB as they are collected. These batches of data contain very valuable fields such text, media, hyperlinks and etc. which we just select the necessary fields and send them to each module that is indicated on the connection curves in the figure. Each batch containing 64 text samples are passed into BERT for fine-tuning. The masked language modeling using this transformer fine-tunes the embeddings in our use case. On the other hand, the graph strategies defined in Section 3.4 is used and applied by Algorithm 1. The named entity recognition is done by using both image and text from this multimodal data stream using what is described in Section 3.3. Graph strategies module adds data into the graph with doing topic clustering at the same time while the NER tags related words and combines them if there is an entity containing more than one words. We used Neo4j as our database because of its similicity in use and the ability to keep larger amounts of data while helping to access them using Cipher query language. This database is where we collect our findings in a graph form and share it between modules. The mining/tuning module is described in Section 3.5 which after mining the topics using equation (11), applies the decay defined by (2) to nodes. As hyperparameter for forgetting, we used a dynamic method that investigates how much RAM is available and according to that finds new %. If the memory usage goes over the threshold (90%) we start reducing the % by 1 until we get to the exact RAM usage after automatic elimination of nodes. If it goes under the threshold (80%) the % is increased by 1 until it is reached to the desired value.The initial value of % is set to 10 . The results are sent to the output at each step mining/tuning has been applied. This independence of modules makes them work concurrently or on di erent servers if needed. 4. Experiments and results For experimenting our system on various datasets we have used the datasets provided by [29] which was part of social sensor project. The three collected datasets along with results all of the methods that have been proposed upon these datasets are discussed in the following. 4.1. Datasets According to the TREC practice, releasing contents of tweets are not allowed because it violates the user privacy. The user might change his preferences to not seen by public in future time and releasing in past will violate his/her rights eventually. Therefore, the authors did not release the tweets, but instead they released the tweet id to be accessed if it is still available. However, in time, small part of this dataset was not accessible because of many reasons such as user changes to the privacy settings, user account closed or even restricted. This issue is not specific to this dataset and happens when the providers of datasets follow the TREC rules and respect user privacy. At the time we collected these datasets, the same issue happened and the quantitative results about the dataset itself is reported in table 3. Table 3: Quantitative information about the datasets, duration time format is HH:MM:SS Dataset Total Topics Number of Tweets Start Data End Date Duration FA CUP 13 189,034 2012/05/05 14:00:00 2012/05/05 19:59:30 05:59:30 SUPER TUESDAY 22 475,659 2012/03/06 17:00:00 2012/03/07 16:59:59 24:00:00 US ELECTIONS 64 1,992,260 2012/11/06 17:00:00 2012/11/08 04:59:59 36:00:00 10 A. B. C. Figure 3: WordCloud representation of datasets: A. US Elections; B. Super Tuesday; C. FA Cup. (A) 14:35:57 (B) 15:11:54 (C) 15:47:51 (D) 16:23:48 (E) 16:59:45 (F) 17:35:42 (G) 18:11:39 (H) 18:47:36 (I) 19:59:30 Figure 4: WordCloud representation of FA Cup dataset 4.2. Preprocessing Tweets often contain huge amounts of noise such as mistyped and user invetned words. This noise is a big problem in cases where no subword embedding is used. One of reasons we used BERT is because it can handle such unknown tokens. For the preprocessing part, we dropped stop words if they do not appear in the named entities. Other than dropping stopwords, we also dropped non-english tweets that contain various languages such as Spanish or Hindi and etc. We considered such languages out of our scope but our method is capable of handling these methods and we put 11 Table 4: Top-K topics-recall of detection algorithms compared to our proposed method on FA CUP dataset Top-K topic-recall Method 2 4 6 8 10 12 14 16 18 20 LDA .692 .692 .840 .840 .920 .920 .840 .840 .840 .750 Doc-P .769 .850 .920 .920 1 1 1 1 1 1 Gfeat-P .000 .308 .308 .375 .375 .375 .375 .375 .375 .375 SFPM .615 .840 .840 1 1 1 1 1 1 1 BNGram .769 .920 .920 .920 .920 .920 .920 .920 .920 .920 SVD+Kmean .482 .596 .710 .824 .938 .951 .951 .951 .951 .951 SNMF-Orig .100 .177 .254 .331 .389 .389 .389 .389 .389 .389 SNMF-KL .167 .334 .502 .670 .837 .837 .840 .850 .850 .924 Exempler .810 .838 .886 .908 .916 .916 .916 .916 .916 .916 EHG .379 .591 .727 .727 .864 .864 1 1 1 1 Ours .810 .951 .951 1 1 1 1 1 1 1 Table 5: Top-K topics-recall of detection algorithms compared to our proposed method on SUPER TUESDAY dataset Top-K topic-recall Method 2 10 20 30 40 50 60 70 80 90 100 LDA .000 .000 .000 .180 .130 .130 .180 .280 .280 .370 .227 Doc-P .227 .227 .310 .400 .460 .500 .500 .500 .540 .680 .680 Gfeat-P .046 .045 .085 .180 .227 .280 .280 .280 .280 .280 .280 SFPM .182 .182 .270 .325 .325 .325 .325 .325 .325 .325 .325 BNGram .500 .500 .540 .540 .540 .540 .540 .540 .540 .540 .540 SVD+Kmean .192 .236 .400 .488 .547 .580 .626 .666 .666 .666 .666 SNMF-Orig .000 .045 .100 .183 .277 .277 .277 .320 .320 .363 .453 SNMF-KL .000 .100 .183 .183 .318 .410 .366 .410 .453 .363 .410 Exempler .246 .463 .538 .572 .586 .597 .600 .617 .638 .638 .638 EHG .163 .408 .466 .540 .628 .674 .674 .699 .699 .711 .711 Ours .463 .381 .580 .628 .699 .699 .699 .711 722 .722 .722 into our future works to use multilingual contextual pretrained models . Cleaning other characters such as @ or - did not damage results more than 0.05 in various metrics but dropping these caharacter from tokens would yield in losing important tokens such as 1-0 in datasets such as FA CUP. Because of event itself, designing such preprocessing and engineering it is a hard thing to do and in some cases (rather than datasets we have tested and in some other languages) it may be beneficial but in overall it is logical to keep it as minimum as possible. 4.3. Evaluation metrics and results We use two metrics for evaluating our work, namely keyword-precision and topic-recall. Keyword-precision is defined as the number of correctly detected keywords divided by the number of ground-truth keywords. Topic-recall is the number of detected ground-truth topics divided by all ground truth topics. In order to evaluate and compare our work to others, we used the results obtained by other methods and also added our own on various top-K tests. For results of other methods, we used the reported values from [23]. Tables 4, 5, and 6 present results of our expriments on top-k topic-recall over all three datasets while table 7 provides results for all datasets with top-2 keyword-precision applied to these datasets. For the visualization part of our memory graph, we obtained results from our Neo4j database that is shown for an early moment before and after detecting communities. Figure 5 shows this visualization, in yellow nodes indicate the named entities while the blue ones are regular words. Pretrained multilingual BERT available at: https://github.com/google-research/bert/blob/master/multilingual.md 12 Table 6: Top-K topics-recall of detection algorithms compared to our proposed method on US ELECTIONS dataset Top-K topic-recall Method 2 10 20 30 40 50 60 70 80 90 100 LDA .109 .109 .185 .245 .220 .280 .325 .500 .475 .430 .460 Doc-P .234 .234 .415 .505 .560 .615 .615 .690 .690 .720 .740 Gfeat-P .078 .078 .140 .180 .180 .180 .180 .180 .180 .180 .180 SFPM .359 .359 .465 .525 .540 .540 .540 .540 .540 .540 .540 BNGram .480 .480 .495 .495 .495 .495 .495 .495 .495 .495 .495 SVD+Kmean .110 .216 .420 .522 .588 .608 .647 .700 .720 .720 .740 SNMF-Orig .075 .075 .154 .218 .439 .467 .483 .545 .563 .595 .595 SNMF-KL .154 .154 .326 .400 .547 .581 .562 .618 .600 .652 .622 Exempler .022 .142 .244 .364 .465 .532 .590 .628 .651 .662 .662 EHG .279 .608 .670 .688 .733 .746 .762 .772 .780 .796 .805 Ours .326 .628 .628 .733 .772 .772 .780 .805 .805 .813 .821 Table 7: Top-2 keyword-precision of detection algorithms compared to our proposed method on all three datasets Dataset Method FA CUP SUPER TUESDAY US ELECTIONS LDA .164 .000 .165 Doc-P .337 .511 .401 Gfeat-P .000 .375 .375 SFPM .233 .471 .241 BNGram .299 .628 .405 SVD+Kmean .242 .367 .300 SNMF-Orig .330 .241 .241 SNMF-KL .242 .164 .164 Exemplar .300 .485 .391 EHG .442 .812 .591 Ours .456 .851 .621 13 (A)                                                        (B) (C)                                                        (D) Figure 5: Outputs from Neo4j: (A) and (B) using Neo4j default visualization tool, yellow nodes are named entities and blues ones are words; (C) and (D) using GraphXR, all nodes are named entities 4.4. Future works In Figure 6, we propose a module in our system as computer vision. This module is our future work guide for researcher to obtain other visual entities such as people and objects in visual content such as videos and images. Clustering this objects using their embedding would yield to having them bounded with the topics they appear most. Thus, such results would give the end user much more information rather than text and this visual content would help automatic journalism from social media. 5. Conclusions In current research, we used Transformers combined with graph techniques to provide a topic detection system for social media. Fine-tuning the transformers in realtime and obtaining valuable semantic information rather than using just frequencies is another novelty of our work. Our proposed model is combined of various modules including BERT, graph strategies and multimodal named entity recognizer. This combination as a unique methodology is called memory graph that uses cognitive memorization in human brain. Using our hyper-parameters such as the rate of 14 Pretrained Transformer based Masked Language Model   Data Similarity Adding Fine-Tuning Updating Data Memory Graph Graph Strategies Tagging Text Tuning Data Stream Text Adding Twitter Data Stream Image Images Mining/Tuning Multimodal NER MongoDB Outputs Image Results Computer Vision Figure 6: Proposed System forgetting makes this system available for any stream size and any machine it is running. Modularity of our work makes it more usable in real life and corporate based environments when dealing with big social data.The results shows superiority of our proposed method compared to other state of the art techniques. References [1] T. U. T. Dallas, A. Susarla, Y. Tan, Social Networks and the Di usion of User- Generated Content : Evidence from YouTube Social Networks and the Di usion of User-Generated Content : Evidence from YouTube, Information Systems Research 23 (2012) 23–41. URL: https://doi.org/10.1287/isre.1100.0339. arXiv:isre.1100.0339. [2] F. Atefeh, W. Khreich, A survey of techniques for event detection in Twitter, Computational Intelligence 31 (2015) 133–164. URL: https: //doi.org/10.1111/coin.12017. arXiv:arXiv:1011.1669v3. [3] A. M. Kaplan, M. Haenlein, Users of the world, unite! The challenges and opportunities of Social Media, Business Horizons 53 (2010) 59–68. URL: https://doi.org/10.1016/j.bushor.2009.09.003. arXiv:1501.00994. [4] Twitter, Twitter is: whats happening in the world and what people are talking about right now., 2018. URL: http://twitter.com/about. [5] N. Carlson, The Real History Of Twitter, 2011. URL: http://www.businessinsider.com/how-twitter-was-founded-2011-4?op= [6] H. Kwak, C. Lee, H. Park, S. Moon, What is Twitter, a Social Network or a News Media?, Network (2010) 19–22. URL: https://doi. org/10.1145/1772690.1772751. arXiv:0809.1869v1. [7] I. L. Stats, Twitter usage statistics, 2018. URL: http://www.internetlivestats.com/twitter-statistics/. [8] G. Bello-Orgaz, J. J. Jung, D. Camacho, Social big data: Recent achievements and new challenges, Information Fusion 28 (2016) 45–59. URL: https://doi.org/10.1016/j.inffus.2015.08.005. arXiv:1505.06807. [9] T. API, Getting Started — Twitter Developers, 2018. URL: https://dev.twitter.com/start. [10] F. Atefeh, W. Khreich, A survey of techniques for event detection in twitter, Computational Intelligence 31 (2015) 132–164. [11] C. C. Aggarwal, J. Han, Frequent pattern mining, Springer, 2014. [12] R. Ibrahim, A. Elbagoury, M. S. Kamel, F. Karray, Tools and approaches for topic detection from twitter streams: survey, Knowledge and Information Systems (2017) 1–29. [13] C. Zhang, S. Zhang, Association rule mining: models and algorithms, Springer-Verlag, 2002. [14] G. Petkos, S. Papadopoulos, L. Aiello, R. Skraba, Y. Kompatsiaris, A soft frequent pattern mining approach for textual topic detection, in: Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14) - WIMS ’14, 2014, pp. 1–10. URL: http://dl.acm.org/citation.cfm?doid=2611040.2611068. [15] M. Huang, Jiajia and Peng, Wang, Topic Detection from Large Scale of Microblog Stream with High Utility Pattern Clustering, Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (2015) 3–10. [16] S. Gaglio, G. Lo Re, M. Morana, Real-time detection of twitter social events from the user’s perspective, in: IEEE International Conference on Communications, volume 2015-Septe, 2015, pp. 1207–1212. [17] H. J. Choi, C. H. Park, Emerging topic detection in twitter stream based on high utility pattern mining, Expert Systems with Applications 115 (2019) 27–36. [18] H. Sayyadi, M. Hurst, A. Maykov, Event detection and tracking in social streams, in: ICWSM - International Conference on Weblogs and Social Media, 2009, pp. 311–314. arXiv:1412.2188. 15 Type Task Ref. Detection Method Dataset Application Event Topic RED NED [30] Naive Bayse Classi- X X Twitter API, Handpicked Users Hot News Detection fier [31] BScore based BOW X X Twitter API (oine) Disaster and Story Clustering Detection [32] BOW Distance Simi- X X Twitter API FSD larity [33] BN-Gram and tf-idf X X Oine Datasets Topic Detection [34] Cross Checking via X X Twitter API, Wikipedia Hot News Detection Wikipedia [35] Formal Concept X X Replab 2013 dataset Topic Detection Analysis [36] FPM X X Twitter API Event Detection [14] FPM X X Super Tuesday/FA Cup/U.S. Elections Topic Detection [37] FPM (Hierarchical X X topic dataset from CLEar system Topic Detection Clustering) [16] FPM (tf-idf & n- X X Twitter API Event Detection gram improved) [38] GPU improved tf-idf X X oine dataset Topic Detection approximation [39] BOW Similarity X X oine dataset Topic Detection [40] Word Embedding SemEval Dataset Twitter Sentiment Classification [41] spatiotemporal X X oine Dataset Targeted-domain detection event detection [42] Clustering of Tempo- X X Twitter API Social Event Detec- ral & Spatial features tion [43] Geographical Regu- X X Twitter API Geo-Social Event larity Estimation Detection [44] BOW clustering X X Twitter API Event Detection & Analysis [45] Probabilistic Model- X X Twitter API Early Disaster Detec- ing tion [46] FPM X X oine dataset Event Detection [47] wavelet analysis X X Twitter API Event Detection Table 8: Twitter Topic/Event detection/tracking related studies 16 [19] D. Quercia, H. Askham, J. Crowcroft, TweetLDA: Supervised Topic Classification and Link Prediction in Twitter, in: WebSci, 2012, pp. 1–4. [20] M. Liu, J. Qu, Mining high utility itemsets without candidate generation, in: Proceedings of the 21st ACM international conference on Information and knowledge management, ACM, 2012, pp. 55–64. [21] Z. Saeed, R. A. Abbasi, A. Sadaf, M. I. Razzak, G. Xu, Text stream to temporal network-a dynamic heartbeat graph to detect emerging events on twitter, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2018, pp. 534–545. [22] Z. Saeed, R. A. Abbasi, M. I. Razzak, G. Xu, Event detection in twitter stream using weighted dynamic heartbeat graph approach, arXiv preprint arXiv:1902.08522 (2019). [23] Z. Saeed, R. A. Abbasi, I. Razzak, O. Maqbool, A. Sadaf, G. Xu, Enhanced heartbeat graph for emerging event detection on twitter using time series networks, Expert Systems with Applications 136 (2019) 115–132. [24] Z. Saeed, R. A. Abbasi, I. Razzak, Evesense: What can you sense from twitter?, in: European Conference on Information Retrieval, Springer, 2020, pp. 491–495. [25] P. Wozniak, ´ E. Gorzelanczyk, ´ J. Murakowski, Two components of long-term memory., Acta neurobiologiae experimentalis 55 (1995) 301–305. [26] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019). [27] M. Asgari-Chenaghlu, M. R. Feizi-Derakhshi, L. Farzinvash, C. Motamed, A multimodal deep learning approach for named entity recognition from social media, arXiv preprint arXiv:2001.06888 (2020). [28] Z. Zhao, C. Li, X. Zhang, F. Chiclana, E. H. Viedma, An incremental method to detect communities in dynamic evolving social networks, Knowledge-Based Systems 163 (2019) 404–415. [29] L. M. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, A. Jaimes, Sensing trending topics in twitter, IEEE Transactions on Multimedia 15 (2013) 1268–1282. [30] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, J. Sperling, Twitterstand: news in tweets, in: Proceedings of the 17th acm sigspatial international conference on advances in geographic information systems, ACM, 2009, pp. 42–51. [31] S. Phuvipadawat, T. Murata, Breaking news detection and tracking in twitter, in: Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, volume 3, IEEE, 2010, pp. 120–123. [32] S. Petrovic, ´ M. Osborne, V. Lavrenko, Streaming first story detection with application to twitter, in: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 181–189. [33] S. D. Tembhurnikar, N. N. Patil, Topic detection using bngram method and sentiment analysis on twitter dataset (2015) 1–6. [34] M. Osborne, S. Petrovic, R. McCreadie, C. Macdonald, I. Ounis, Bieber no more: First story detection using twitter and wikipedia, in: SIGIR 2012 Workshop on Time-aware Information Access, 2012. [35] J. Cigarran, ´ A. Castellanos, A. Garc´ ıa-Serrano, A step forward for topic detection in twitter: An fca-based approach, Expert Systems with Applications 57 (2016) 21–36. [36] M. Adedoyin-Olowe, M. M. Gaber, C. M. Dancausa, F. Stahl, J. B. Gomes, A rule dynamics approach to event detection in twitter with its application to sports and politics, Expert Systems with Applications 55 (2016) 351–360. [37] G. Dong, W. Yang, F. Zhu, W. Wang, Discovering burst patterns of burst topic in twitter, Computers & Electrical Engineering 58 (2017) 551–559. [38] U. Erra, S. Senatore, F. Minnella, G. Caggianese, Approximate tf–idf based on topic extraction from massive message stream using the gpu, Information Sciences 292 (2015) 143–161. [39] H. Becker, M. Naaman, L. Gravano, Beyond trending topics: Real-world event identification on twitter., ICWSM 11 (2011) 438–441. [40] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, B. Qin, Learning sentiment-specific word embedding for twitter sentiment classification., in: ACL (1), 2014, pp. 1555–1565. [41] T. Hua, F. Chen, L. Zhao, C.-T. Lu, N. Ramakrishnan, Automatic targeted-domain spatiotemporal event detection in twitter, GeoInformatica 20 (2016) 765–795. [42] H. Abdelhaq, C. Sengstock, M. Gertz, Eventweet: Online localized event detection from twitter, Proceedings of the VLDB Endowment 6 (2013) 1326–1329. [43] R. Lee, K. Sumiya, Measuring geographical regularities of crowd behaviors for twitter-based geo-social event detection, in: Proceedings of the 2nd ACM SIGSPATIAL international workshop on location based social networks, ACM, 2010, pp. 1–10. [44] R. Li, K. H. Lei, R. Khadiwala, K. C.-C. Chang, Tedas: A twitter-based event detection and analysis system, in: Data engineering (icde), 2012 ieee 28th international conference on, IEEE, 2012, pp. 1273–1276. [45] T. Sakaki, M. Okazaki, Y. Matsuo, Earthquake shakes twitter users: real-time event detection by social sensors, in: Proceedings of the 19th international conference on World wide web, ACM, 2010, pp. 851–860. [46] T. Snowsill, F. Nicart, M. Stefani, T. De Bie, N. Cristianini, Finding surprising patterns in textual data streams, in: Cognitive Information Processing (CIP), 2010 2nd International Workshop on, IEEE, 2010, pp. 405–410. [47] M. Cordeiro, Twitter event detection: combining wavelet analysis and topic inference summarization, in: Doctoral symposium on informatics engineering, 2012. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Computing Research Repository arXiv (Cornell University)

TopicBERT: A Transformer transfer learning based memory-graph approach for multimodal streaming social media topic detection

Loading next page...
 
/lp/arxiv-cornell-university/topicbert-a-transformer-transfer-learning-based-memory-graph-approach-a0TLmwbge0

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

eISSN
ARCH-3344
DOI
10.1016/j.chaos.2021.111274
Publisher site
See Article on Publisher Site

Abstract

Real time nature of social networks with bursty short messages and their respective large data scale spread among vast variety of topics are research interest of many researchers. These properties of social networks which are known as 5’Vs of big data has led to many unique and enlightenment algorithms and techniques applied to large social network- ing datasets and data streams. Many of these researches are based on detection and tracking of hot topics and trending social media events that help revealing many unanswered questions. These algorithms and in some cases software products mostly rely on the nature of the language itself. Although, other techniques such as unsupervised data mining methods are language independent but many requirements for a comprehensive solution are not met. Many research issues such as noisy sentences that adverse grammar and new online user invented words are challenging maintenance of a good social network topic detection and tracking methodology; The semantic relationship between words and in most cases, synonyms are also ignored by many of these researches. In this research, we use Transformers combined with an incremental community detection algorithm. Transformer in one hand, provides the semantic relation between words in di erent contexts. On the other hand, the proposed graph mining technique enhances the resulting topics with aid of simple structural rules. Named entity recognition from multimodal data, image and text, labels the named entities with entity type and the extracted topics are tuned using them. All operations of proposed system has been applied with big social data perspective under NoSQL technologies. In order to present a working and systematic so- lution, we combined MongoDB with Neo4j as two major database systems of our work. The proposed system shows higher precision and recall compared to other methods in three di erent datasets. Keywords: Memory–Graph, Deep Learning, Frequent Subgraph Mining, Transformer, Multimodal Learning, BERT 1. Introduction Internet and its applications are widely advancing towards social networks in which the user generated content plays a crucial role [1]. This content which is mostly generated by users, companies, news reporting agencies and etc., consists of text, image and video and has noise naturally [2, 3]. This media forms the foundation of social media delivery systems such as Facebook, Youtube, Instagram, Twitter, and etc. with di erent views and media orientations. Some of these social networks tend to be an online social TV such as Youtube. Content of these networks are generated in video format by users and posted in channels and communication between users is possible by commenting under these videos. Instagram was started as an image and short video sharing social network and the same communication method has been implemented. Facebook as another big social network, does not restrict users to post specific media types and also allows users to directly communicate with each other using direct chat which is recently available on Instagram too. Among these popular social media delivery systems, Twitter has gained much attention in recent years due to its simplicity and user friendly interface. It is also known as the best social media for discovering “what is happening right now”[4]. Twitter is easily accessible by almost any device connected to Internet through a web browser or a third-party/ocial application. The nature of twitter short posts that are known as tweets has restrictions in size and at first it was supposed to be a gentle replacement for SMS [5]. In other words, twitter is the best social media for discovering news and real world events [6]. Preprint submitted to arXiv August 19, 2020 arXiv:2008.06877v1 [cs.CL] 16 Aug 2020 Reports show that on a daily basis, 500 million new tweets and on a monthly basis, 300 new user sign-ups happened on twitter in 2013; In 2018, it is estimated that 700 million tweets will be posted each day and growth rate for number of tweets per year to be around 32% on 2013 [7]. These statistics show that users tend to produce large volumes of tweets on variety of types with high velocity and diverse veracity with di erent values; Mentioned characteristics of twitter streaming media takes it into account of social big data with respect to 5’V model [8]. Twitter has very rich data in form of tweets which can be accessed using Twitter API. This API made researchers capable of investigating its social big data in a streaming way or in a query based data retrieval fashion [9]. Docu- mentation about the API itself is available at twitter developer website . This API allows direct messaging, search, advertisement management, and account activity control. It has some restriction to prevent developers from abusing the service. For example, it has rate limit for user or application . This open and free dataset has been used by many researchers. Many advancements has been conducted in multiple research areas related to this social big data such as opinion mining, topic detection and tracking, user modeling, sentiment analysis, and etc. But the underlying conceptual perspective of researcher to the problem makes di erent outcomes with diverse real world applications. If the data is supposed to be treated as streaming social big data (as it is) then the problem would likely to be solved in an unsupervised or semi-supervised methodology. Some other researches try to solve the problem with respect to a static corpus but others tend to do so in an streaming big social data respect [10]. On the other hand, many of the previous works assumed the correctness of incoming data for granted which is not true in many cases. For example, in the case of analyzing tweets related to regular users, many noisy content such as mistypes and grammar errors exist while in the case of ocial tweets like a news agency tweet, this problem can be ignored. The most missing part of this analysis is related to images and videos of the tweets. To the best of our knowledge, most of the existing methods did not use multimodal data to detect and track topics. To overcome the mentioned problems, we present a novel approach for topic detection and tracking problem. Our work consists of three major parts: multimodal named entity recognizer, memory graph, and Transformer based se- mantic similarity assignment. The combination of these three parts provides the final results. Orientation of presented study is as follows: Section 2 covers related works of the problem at hand. Section 3 presents the proposed methodol- ogy for detecting and tracking topics. The experimental results at Section 4 investigates the performance of our work on three distinct datasets. Finally, Section 5 concludes the whole paper. 2. Related work This section describes various techniques proposed for topic detection and tracking from twitter. For reader curiosity and further information, Table 8 is inserted to cover most of related works in literature of TDT task on twitter. The first methods we investigate here, are using frequent pattern mining as building block. The frequent pattern mining problem is a well studied area by virtue of its various applications in multiple domains like clustering and classification. This problem is initiated as finding relationships between items in a transaction database [11], where the formal definition is given below: Definition 1. Given a database D consisting transactions T ; : : : ; T , determination of all possible patterns P with at 1 2 least s frequency over all of transactions, is known as Frequent Pattern Mining. With respect to this definition, a document collection can be noted as a database of transactions. Depending on this technical terminology, a frequent pattern mining approach applied to collection of documents discovers relationships between words with frequency of at least s [10, 12]. To have a better formalization and homogeneity, just like a database of transactions, we use notation D for collection of documents and P for possible patterns which are relationships between words. Abstract topic on the other hand, is defined in the following: https://developer.twitter.com/en https://developer.twitter.com/en/docs/basics/rate-limiting 2 Definition 2. Set of abstract topics (or shortly to pics)  in collection of documents D are bag of words W related to each other by occurrence in the same documents with high frequency. Summing up definitions 1 and 2, topic detection task from a series of documents is almost equivalent to frequent pattern mining in database of transactions if the semantic relation between words is discarded from problem. A major topic such as  2  usually consists of multiple minor topics like  , based on this assumption, a minor i j topic is defined as:  = fW j W  W g. With association rule mining in mind [13], statement    holds true j    j i j j i while the confidence c for a minor topic depends on detection algorithm but at any means is less than one. On the contrary, a real life event, or event for short is defined as [10]: Definition 3. An event e 2 E is a real-life topic that occurs in real life at specific time period . According to this definition, any event is also a topic but a topic  needs to happen in real life to be counted as an event e. For example, the Bongo Cat is not a real life event but it is surely a topic. For event detection tasks, it is crucial to have more information about topics rather than depending only on documents; online sources such as Wikipedia are very helpful for information gathering on this specific problem. These two problems, namely frequent pattern finding and topic detection, are dealt by a similar way. Frequent pattern mining has been a common way to solve the topic detection task over twitter data stream and some variants of formal methods has been proposed by researchers in [14–17]. In [14], authors designed a soft frequent pattern mining approach to overcome the topic detection problem. This research aims at finding word co-occurrence large than two, in order to do so, likelihood of each word is computed separately. Equation 1 shows likelihood of appearance for each word. N + p(wjcor pus) = P (1) ( N ) + n where N is the number of occurrences of word w, and  equals to 0:5. Denominator part in this equation is used to normalize the fraction. Grow rate of each word in corpus is computed to indicate the most discussed words. After finding top-K frequent words, a co-occurrence vector is formed to add co-occurring words to the top-K word vectors. This expansion forms the foundation of soft frequent pattern mining algorithm (SFPM) utilized in [14] to solve the topic detection problem. Results presented in main article shows superiority of this method to LDA [18] and a Gra ph based approach [19], which is assumed to be the baseline for this research. Like previous method, in [16] a named entity recognition weight has been added to boost scores for each word. This simple addition to original algorithm (SFPM) improved topic recall, keyword precision and keyword recall on two di erent datasets. Recently, high utility pattern mining (HUPM) has been introduced [20]. This approach for finding frequent item- sets with higher utility in twitter topic detection domain has been investigated in [15, 17]. The proposed high utility pattern clustering in [15] considered the utility of each pattern in order to extract high utility patterns, afterwards it takes these patterns and uses an incremental pattern clustering scheme. K N N and modularity based clustering employed by authors, try to identify new emerging patterns and coherent topics simultaneously. Finally, a group of words for each cluster of patterns are extracted to represent the topic of that specific cluster. All of mentioned steps together form HUPC framework (High Utility Pattern Clustering): text stream, text segments, top-K HUP mining, term indexing, HUP clustering, and post-processing. Like two previous methods, proposed scheme in [17] considered the problem of topic detection as a pattern mining problem and introduces a HUP mining algorithm in order to extract final topics. Internal, external, and transaction weighted utility of words has been used by authors to determine the utility of words. Minimum utility of each chunk of tweets is determined automatically by their algorithm. Also, for post-processing step, a T o pic tree was proposed to extract actual topics from candidate ones. Experimental results shows superiority of this work compared to five di erent methods. Bongo cat is a set of tweets identified by #BongoCat hashtag: https://www.twitter.com/hashtag/Bongocat Wikipedia is a free encyclopedia, written by the people who use it: https://www.wikipedia.org/ 3 Some research [21–24] used dynamic heart beat graph technique (DHG) for streaming data on Twitter. The proposed algorithms construct a sub-graph for each sentence. These bi-clique sub-graphs are added to the main graph and a graph operation on the whole graph is performed based on the frequency of co-occuring words (edges between nodes). Their proposed method shows a novel approach on how graph methods can be applied to streaming text but on the other hand their work lacks multimodality and comprehensiveness. However, in most of previous works the main contribution of authors is extracting topics with higher precision and recall, but none of these works except [17, 23] tried to take performance metrics in mind. An online topic detection and tracking scheme must fit in Definition 4 to be considered as a practical solution. Definition 4. New event detection (NED) is discovering new events from live stream L in real time with no delay or at least with an acceptable delay t . If the extraction and tracking method is applied to collection of oine documents, it will be considered a RED methodology, which completely di ers from NED. Definition 5 is a formalization of RED. Definition 5. Retrospective event detection (RED) is discovering previously unidentified events from collection of documents. Definitions 4 and 5 make a fair clarification of Table 8 and highlight di erences among diverse approaches. In Table 8 we provide an overview of related works in this field and categorized them based on the NED and RED terminology. 3. TopicBERT The three pillars of our proposed system (TopicBERT), namely, Transformer semantic similarity, word community detection, and multimodal named entity recognizer, are bounded together as it is presented in Figure 6. The incoming data from social media stream such as Twitter is streamed into the MongoDB that feeds multimodal data into di erent elements of the system. In this design, MongoDB acts as a cache for whole system in the case of any long delay or failure in any system parts. Delays happen because di erent parts have di erent complexity and speed metrics. On the other hand, the neo4j database is the repository of word graph in form of a memory graph. The following sections describe each part in more detail and the last section reviews how the whole system works in a harmony. 3.1. Memory-Graph Our approach identifies any document as a word graph such as G = (V ; E ), in which the nodes represent words, i i i and vertices are the co-occurrence relation of these nodes. A group of word graphs arriving in a streaming fashion form a larger graph such as G = (V; E). Figure 1 shows an example of such graphs. Preprocessing tasks like tokenization and stemming can be applied before forming word graphs. Time slots are used to identify arrival of each document and accordingly its addition to G. Upon arrival of each separate and independent G , the cumulative word graph G will be updated and each new node and edge will be added. Putting previous assumption in mind, each word graph has a time slot identifier label like t and must be noted as G . Discarding duplicated words in each document yields (i;t) to no further change in G but the G must be a weighted graph to uphold weight of relations. In our case, we use (i;t) similarity metric for computing comulative weights that is explained in Section 3.2. Evolving nature of cumulative word graph G indicates that streaming documents tend to complete it. In order to have a fading e ect on G, we also consider it as a memory-graph (MG) that can remember and forget. A control parameter %, controls the fading e ect on memory graph, meaning, each node and edge would fade in MG as time passes (new word graphs arriving). Forgetting and remembering properties of MG makes it reliable for extracting frequent patterns as relations that have certain properties in terms of MG’s memory representation. In other words, relations that withstand the evolution of MG are those that are being repeatedly added. Weights of MG show the importance of particular relations over https://neo4j.com/ 4 This is great That is not possible is that is this possible great not t t 0 1 that is possible this not great Figure 1: Word Graph formation example time, if any weight goes under a specific value, then the relation does not seem to be an important one anymore. If the weight shows a spiking growth, it means that the relation has been repeated with a high frequency in a short time. This property indicates an emerging relation if it disappears soon, otherwise, it will be assumed to be a common relation. All that is said draws a general assumption of MG which is core of current research to control the size of graph. Setting up forgetting rate can be done in diverse setups ranging from computational or cognitive approaches. In order to adapt a cognitive configuration bound with our presented MG, we implement forgetting curve [25]. According to Hermann Ebbinghaus’s work, who pioneered experimental study of memory, the forgetting curve is formulated as: R = e (2) where R is retrievability, % is stability of memory and t is time. % determines how fast retrievability falls over time in absence of any memory recall . Memory recall is defined as simply trying to remember items. However, there are two types of memory recall, free recall and serial recall. Free recall is an attempt to recall individual items regardless of their order while serial recall is attempting to recall the list of items in the ordered study. Overlearning e ect is practicing memorization beyond required repetitions. This e ect ensures that the memorized information will be more resistant to disruption or loss. Using this e ect and multiplying it to any values gives the fading e ect over it. For the t value in this equation, we will use the time that specific value last updated. If it has been a long time from its last update then the e ect is high and if it goes under a specific threshold it will be completely forgotten (deleted). However, the remembering and forgetting has its shortcomings too. For example, some words come from di erent contexts which need to be updated with di erent weights rather than just using +1 for any edge. On the other hand, words or phrases must have di erent importance because of their semantic nature; For example, a phrase related to a memory recall is di erent from recall 5 named entity such as person or location must be more important compared to a typical word. To enhance the described memory graph, subsection 3.2 and 3.3 describe our solutions. 3.2. Transformer masked language modelling Capturing the words and recording them in a graph form is explored in many previous works too. However, none of them tried to consider the semantic relation between words that appear in previous di erent contexts. For example, consider a word such as W , previously appearing in three di erent tweets of T ; T ; T and the word W in only one 1 2 3 tweet T , each of these two words had been on di erent contexts before, and no relation between them is available in our word graph. A new tweet arrives that has both of these words in the same context, a tweet such as T . The resulting sub-graph contains both of W and W now and according to previous methods there should be an edge with the weight of one between the two words. Apart from the consideration that this assumption is wrong, in this section we try to employ Transformers to improve it. 7 , 8 , 9 We use the BERT transformer pretrained model to compute the embedding for each incoming tweet and record it as an attribute for each node (word). The original version of BERT is not sucient because it does not capture sentence similarity, instead we used a pre-trained version called SentBERT [26]. If another tweet comes that has the word, embedding attribute of node is updated using a weighted approach between previous and new embedding. In the case of computing score for two di erent nodes in one tweet, we use frequency and similarity score. Frequency is the number of occurrence of two nodes in the same tweets, and score is calculated using (3). t t1 t t S = S + cos(M ; M ), if W and W 2 tweet (3) i j t i; j i; j i j tweet t t1 t M = M + (1 )M , if W 2 tweet (4) i t i i t t where, tweet is the tweet that arrives at time t, S is the score between words W and W , M is the embedding of t i j i: j i word W , is discount value for aggregating previous embedding and the new one (in the range of (0; 1)), and t is the time. According to (4), at each time a new tweet arrives and the embedding of nodes are updated with embedding provided from BERT (i.e., [CLS] token). The reason we have used [CLS] instead of each node embdding is that the context itself is more important than the word embedding and some words are tokenized into multiple tokens rather than one token which makes it harder to aggregate. Scoring equation shown in (3) provides a mechanism to score edges rather than naively using frequencies. This score equation at its best is equal to co-occurrence frequency, t t because larger values for cos(M ; M ) at each step equals means higher context similarity. However, this mechanism i j provides a smoother approach compared to frequency counting, but the online learning for fine tuning the underlying BERT language model has an undeniable importance. This fine tuning makes the language model more tuned to the task which the language is in form of tweets rather than formal essays or grammatically perfect articles. We adopt this assumption too, and use it in our setup to finetune the BERT at each step we gather batches of tweets. Utilizing MongoDB as a cache is important in this case. Additionally, if there is more than one server available, a dedicated server can be used for language modeling. 3.3. Multimodal Named Entity Recognition Categorization of items in the graph, based on their types provides key features to highlight the node significance. For example, in our study, words or phrases that point to some named entity such as a person is more important in forming a topic. To find the named entities, we use the deep multimodal model described in [27]. Figure 2 shows the multimodal named entity recognizer model that utilizes both image and text data from tweets. The multimodal nature of this approach provides better results in the presence of noise and it is able to perform recognition in the absence of image. The high performance of the model makes it perfect fit for the task at hand. Available at: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-512_A-8.zip Source code from google: https://github.com/google-research/bert SentBert: https://pypi.org/project/sentence-transformers/ 6 ... ... ... ... Text Conditional Random Field TD + GN + SineRELU ... T T T T T 0 1 2 3 n Conv (16x2) Bidirectional LSTM Bidirectional LSTM h h h ... h Conv (32x3) 0 1 2 n h h h ... h 0 1 2 n Conv (64x4) h h h ... h 0 1 2 n h h h ... h 0 1 2 n Conv (64x4)+ MP Conv (32x3)+ MP Forward LSTM Joint Word Embedding (500) h h h ... h 0 1 2 n Conv (16x2)+ MP fastText (300) GloVe (200) Character Embedding Feature Embedding Select Top 5 C C C C n n n n InceptionV3 W W W W 0 1 2 ... n ... C C C C 1 1 1 1 C C C C 0 0 0 0 Image Feature Extractor s =⟨ w , w , w , ..., w ⟩ Image 0 1 2 n Figure 2: Multimodal NER model, courtesy of [27]. 3.4. Graph update rules Updating the graph makes it larger and therefore, the memory graph approach is considered to shrink it to the desired size. However, adding multiple tweets (subgraphs) to it needs precautions, because each addition changes the shape and form of connections. Tracking changes and inferring topics according to small changes will be harder as the data stream get faster or the topic gets hotter. In order to detect and track topics, we propose using graph update rules that we obtained by getting motivated from [28]. They initially used this approach for community detection and we employ a di erent but similar method for topic detection. Each tweet converted to a complete subgraph, is part of a larger graph called memory graph. At each time step t, incoming tweets are converted to complete subgraphs such as G = (V ; E ). For simplicity, we see tweets coming in a specific time stamp one by one. Time stamp t is the t t t identifier for each complete subgraph such as G generated from tweet. The V symbol presents words or tokens and t t E presents the connection between them. E is complete for any tweet and is always determined according to (3) and t t (4). Rest of the notations are explained in Table 1. According to the assumptions above, we redefine each edge as a score in (3) and note the subgraph as G = (V ; S ) in which S  S . To understand the graph update rules, we first categorize the incoming subgraph type t t t i; j and according to its type, we decide which strategy to apply. Table 2 described each category. The first category t t U needs no precaution for addition because it will form a completely new topic cluster  <  independent of any other previous one. The strategy required for this category is to assign a new cluster id for it and add it to the set t t t of topic clusters at time t that is noted as  . In other sense, (8a)W 2 V ! (8i)W <  or (8i)V \  = ; is the a t a t i i mathematical definition of this category. In the case of category I, the new graph will be completely added to an t t t existing  because (9i)V   . This update will change the inner structure of  and has no impact on any other topic i i i clusters. The Multiple category orM is identified as words belonging to more than one topic cluster. Its mathematical 7 Table 1: Notations used in this paper Notation Description G = (V ; E ) Complete graph from twitter stream at time t with tokens as V and edges as E t t t S Cumulative similarity score between tokens i and j in time t (Eq. 3) i; j M Embedding of token i at time t (Eq. 4) Set of all topic clusters at time t ith topic cluster at time t Set of all topic clusters in all times ith topic at time t N (W ) Neighbourhood of W in time t t t definition is expressed as [(8a)W 2 V ! (9i)W 2  ] ^ [(8 j)V 1  ]. The last category, S, is expressed as a t a t i j t t t [(9a)W 2 V ! (8i)W <  ] ^ [(9b)W 2 V ! (9i)W 2  ] ^ [(8 j)V 1  ]. For the last two cases of these a t a b t b t i i j categorization, M and S, we need algorithmic approach to determine which word belongs to which topic cluster that is presented by Algorithm 1. In this algorithm, the topic cluster stickiness metric is used that is explained in (5). The function provides a metric to determine how much a single word W is related to a topic such as  . This function can be used in di erent approaches, utilization of this function is described by Algorithm 1. W ;W a b W 2N (W )\V t b a (W ;  ) = P (5) W ;W a b W 2N (W ) b a Algorithm 1: Update  for all the categories listed in Table 2 UpdateTopicClusters ( ; G ) Inputs :  and G = (V ; S ) t t t t+1 Output: 1 G = fW 2 V j(8i)W <  g; unseen t 2 G = V G ; seen t unseen t 0 3 S = fS jW 2 V and W 2 N(W )g; u pdate 0 t W;W 4 Apply S to related edges in update; u pdate 5 if G == ; then seen 6  = G ; t t K +1 7 end 8 else 9  = G ; t t K +1 10 foreach G 2 [G ; G ] do unseen seen 11 foreach W 2 G do t t t 12 assign W to  where i = argmax f(W;  )jN(W )\  , ;g; i i i i2f1;2;:::;K +1g 13 end 14 end 15 if  == ; then K +1 t t 16 remove  from  ; K +1 17 end 18 end 19 return  ; This algorithm provides a combination of all categorizations in a better understandable and ecient way. At 8 first, if the incoming G , consists of all new words, if it is type U, a new topic cluster is formed and assigned to it. Otherwise, for seen and unseen words in the G (first seen and then unseen ones), we calculate  for each W and assign W to the topic cluster which has the higher . This part of algorithm is required for the M and S categories while for the I no update is required, just updating the S values is enough. The reason for no required update in category I is that no new W will be added to topic clusters, just the S values will be updated. Table 2: Categorization of incoming tweets as subgraphs Category Symbol Description Unique U All words in tweet are new and did not appear in any Incessant I All words in tweet are previously merged into a single Multiple M All words in tweet are previously merged into more than one Subset S Some words in tweet are previously merged into one or more that one  and some are new 3.5. Topic Extraction The extraction part seems much easier having well-separated topic clusters which only contain related and con- nected words. However, using other features of the tweets is also helpful. By other features, we use how much tweet has been retweeted and liked. The other features are also available too, such as word frequency itself and the trend. We use all these features and store them in each node along with the embedding. Tagging words and grouping them as entities also improves the outcome. Here, we describe the methodology behind extracting topics from topic clusters at time t or the  . Equations (6) and (7) describe how we score each node in each time t. 0 1 B X C B C B C t t t 0 t B C B C (W ) = E   (W ) +  (W ) S (6) B 0C W;W @ A 0 t W 2N (W ) 0 1 B C B C B C t t t B C (W ) = F  BF (W ) + (L + R )C (7) i i @ A i=0 1:2; if W is tagged as named entity E = (8) 1; otherwise From the above equations, the outcome values for are converted to probabilities by (9). In this equation,  is the topic cluster that W belongs to. For each cluster, the probability of that topic cluster is in the topics is computed by (10) and also, for each topic cluster, the probability of each word being nominated is computed by (11). Multiplication of these two values gives the probability of each word being nominated in the overall outcome. Higher ranked words will appear first. (W ) P(Wj ) = P (9) t 0 (W ) 0 t W 2 (W ) W2 t t P( j ) = P (10) (W ) W2 t t t t P(Wj ) = P(Wj ) P( j ) (11) i i 9 3.6. Proposed system in a bounded harmony Although all parts and sections describe the system in little pieces but how this parts are bounded needs to be clarified. Figure 6 shows a visualization of the proposed system. The most important part of this systems is the data flow into the parts and keep it in a harmony. By harmony we mean how each part does its task and minimize the requirement for delays caused by slower parts. First, the connection to the twitter data stream is established and the incoming tweets about a topic of interest (this can be done by targeting specified keywords or just querying on specific key-phrases) are stored in the MongoDB as they are collected. These batches of data contain very valuable fields such text, media, hyperlinks and etc. which we just select the necessary fields and send them to each module that is indicated on the connection curves in the figure. Each batch containing 64 text samples are passed into BERT for fine-tuning. The masked language modeling using this transformer fine-tunes the embeddings in our use case. On the other hand, the graph strategies defined in Section 3.4 is used and applied by Algorithm 1. The named entity recognition is done by using both image and text from this multimodal data stream using what is described in Section 3.3. Graph strategies module adds data into the graph with doing topic clustering at the same time while the NER tags related words and combines them if there is an entity containing more than one words. We used Neo4j as our database because of its similicity in use and the ability to keep larger amounts of data while helping to access them using Cipher query language. This database is where we collect our findings in a graph form and share it between modules. The mining/tuning module is described in Section 3.5 which after mining the topics using equation (11), applies the decay defined by (2) to nodes. As hyperparameter for forgetting, we used a dynamic method that investigates how much RAM is available and according to that finds new %. If the memory usage goes over the threshold (90%) we start reducing the % by 1 until we get to the exact RAM usage after automatic elimination of nodes. If it goes under the threshold (80%) the % is increased by 1 until it is reached to the desired value.The initial value of % is set to 10 . The results are sent to the output at each step mining/tuning has been applied. This independence of modules makes them work concurrently or on di erent servers if needed. 4. Experiments and results For experimenting our system on various datasets we have used the datasets provided by [29] which was part of social sensor project. The three collected datasets along with results all of the methods that have been proposed upon these datasets are discussed in the following. 4.1. Datasets According to the TREC practice, releasing contents of tweets are not allowed because it violates the user privacy. The user might change his preferences to not seen by public in future time and releasing in past will violate his/her rights eventually. Therefore, the authors did not release the tweets, but instead they released the tweet id to be accessed if it is still available. However, in time, small part of this dataset was not accessible because of many reasons such as user changes to the privacy settings, user account closed or even restricted. This issue is not specific to this dataset and happens when the providers of datasets follow the TREC rules and respect user privacy. At the time we collected these datasets, the same issue happened and the quantitative results about the dataset itself is reported in table 3. Table 3: Quantitative information about the datasets, duration time format is HH:MM:SS Dataset Total Topics Number of Tweets Start Data End Date Duration FA CUP 13 189,034 2012/05/05 14:00:00 2012/05/05 19:59:30 05:59:30 SUPER TUESDAY 22 475,659 2012/03/06 17:00:00 2012/03/07 16:59:59 24:00:00 US ELECTIONS 64 1,992,260 2012/11/06 17:00:00 2012/11/08 04:59:59 36:00:00 10 A. B. C. Figure 3: WordCloud representation of datasets: A. US Elections; B. Super Tuesday; C. FA Cup. (A) 14:35:57 (B) 15:11:54 (C) 15:47:51 (D) 16:23:48 (E) 16:59:45 (F) 17:35:42 (G) 18:11:39 (H) 18:47:36 (I) 19:59:30 Figure 4: WordCloud representation of FA Cup dataset 4.2. Preprocessing Tweets often contain huge amounts of noise such as mistyped and user invetned words. This noise is a big problem in cases where no subword embedding is used. One of reasons we used BERT is because it can handle such unknown tokens. For the preprocessing part, we dropped stop words if they do not appear in the named entities. Other than dropping stopwords, we also dropped non-english tweets that contain various languages such as Spanish or Hindi and etc. We considered such languages out of our scope but our method is capable of handling these methods and we put 11 Table 4: Top-K topics-recall of detection algorithms compared to our proposed method on FA CUP dataset Top-K topic-recall Method 2 4 6 8 10 12 14 16 18 20 LDA .692 .692 .840 .840 .920 .920 .840 .840 .840 .750 Doc-P .769 .850 .920 .920 1 1 1 1 1 1 Gfeat-P .000 .308 .308 .375 .375 .375 .375 .375 .375 .375 SFPM .615 .840 .840 1 1 1 1 1 1 1 BNGram .769 .920 .920 .920 .920 .920 .920 .920 .920 .920 SVD+Kmean .482 .596 .710 .824 .938 .951 .951 .951 .951 .951 SNMF-Orig .100 .177 .254 .331 .389 .389 .389 .389 .389 .389 SNMF-KL .167 .334 .502 .670 .837 .837 .840 .850 .850 .924 Exempler .810 .838 .886 .908 .916 .916 .916 .916 .916 .916 EHG .379 .591 .727 .727 .864 .864 1 1 1 1 Ours .810 .951 .951 1 1 1 1 1 1 1 Table 5: Top-K topics-recall of detection algorithms compared to our proposed method on SUPER TUESDAY dataset Top-K topic-recall Method 2 10 20 30 40 50 60 70 80 90 100 LDA .000 .000 .000 .180 .130 .130 .180 .280 .280 .370 .227 Doc-P .227 .227 .310 .400 .460 .500 .500 .500 .540 .680 .680 Gfeat-P .046 .045 .085 .180 .227 .280 .280 .280 .280 .280 .280 SFPM .182 .182 .270 .325 .325 .325 .325 .325 .325 .325 .325 BNGram .500 .500 .540 .540 .540 .540 .540 .540 .540 .540 .540 SVD+Kmean .192 .236 .400 .488 .547 .580 .626 .666 .666 .666 .666 SNMF-Orig .000 .045 .100 .183 .277 .277 .277 .320 .320 .363 .453 SNMF-KL .000 .100 .183 .183 .318 .410 .366 .410 .453 .363 .410 Exempler .246 .463 .538 .572 .586 .597 .600 .617 .638 .638 .638 EHG .163 .408 .466 .540 .628 .674 .674 .699 .699 .711 .711 Ours .463 .381 .580 .628 .699 .699 .699 .711 722 .722 .722 into our future works to use multilingual contextual pretrained models . Cleaning other characters such as @ or - did not damage results more than 0.05 in various metrics but dropping these caharacter from tokens would yield in losing important tokens such as 1-0 in datasets such as FA CUP. Because of event itself, designing such preprocessing and engineering it is a hard thing to do and in some cases (rather than datasets we have tested and in some other languages) it may be beneficial but in overall it is logical to keep it as minimum as possible. 4.3. Evaluation metrics and results We use two metrics for evaluating our work, namely keyword-precision and topic-recall. Keyword-precision is defined as the number of correctly detected keywords divided by the number of ground-truth keywords. Topic-recall is the number of detected ground-truth topics divided by all ground truth topics. In order to evaluate and compare our work to others, we used the results obtained by other methods and also added our own on various top-K tests. For results of other methods, we used the reported values from [23]. Tables 4, 5, and 6 present results of our expriments on top-k topic-recall over all three datasets while table 7 provides results for all datasets with top-2 keyword-precision applied to these datasets. For the visualization part of our memory graph, we obtained results from our Neo4j database that is shown for an early moment before and after detecting communities. Figure 5 shows this visualization, in yellow nodes indicate the named entities while the blue ones are regular words. Pretrained multilingual BERT available at: https://github.com/google-research/bert/blob/master/multilingual.md 12 Table 6: Top-K topics-recall of detection algorithms compared to our proposed method on US ELECTIONS dataset Top-K topic-recall Method 2 10 20 30 40 50 60 70 80 90 100 LDA .109 .109 .185 .245 .220 .280 .325 .500 .475 .430 .460 Doc-P .234 .234 .415 .505 .560 .615 .615 .690 .690 .720 .740 Gfeat-P .078 .078 .140 .180 .180 .180 .180 .180 .180 .180 .180 SFPM .359 .359 .465 .525 .540 .540 .540 .540 .540 .540 .540 BNGram .480 .480 .495 .495 .495 .495 .495 .495 .495 .495 .495 SVD+Kmean .110 .216 .420 .522 .588 .608 .647 .700 .720 .720 .740 SNMF-Orig .075 .075 .154 .218 .439 .467 .483 .545 .563 .595 .595 SNMF-KL .154 .154 .326 .400 .547 .581 .562 .618 .600 .652 .622 Exempler .022 .142 .244 .364 .465 .532 .590 .628 .651 .662 .662 EHG .279 .608 .670 .688 .733 .746 .762 .772 .780 .796 .805 Ours .326 .628 .628 .733 .772 .772 .780 .805 .805 .813 .821 Table 7: Top-2 keyword-precision of detection algorithms compared to our proposed method on all three datasets Dataset Method FA CUP SUPER TUESDAY US ELECTIONS LDA .164 .000 .165 Doc-P .337 .511 .401 Gfeat-P .000 .375 .375 SFPM .233 .471 .241 BNGram .299 .628 .405 SVD+Kmean .242 .367 .300 SNMF-Orig .330 .241 .241 SNMF-KL .242 .164 .164 Exemplar .300 .485 .391 EHG .442 .812 .591 Ours .456 .851 .621 13 (A)                                                        (B) (C)                                                        (D) Figure 5: Outputs from Neo4j: (A) and (B) using Neo4j default visualization tool, yellow nodes are named entities and blues ones are words; (C) and (D) using GraphXR, all nodes are named entities 4.4. Future works In Figure 6, we propose a module in our system as computer vision. This module is our future work guide for researcher to obtain other visual entities such as people and objects in visual content such as videos and images. Clustering this objects using their embedding would yield to having them bounded with the topics they appear most. Thus, such results would give the end user much more information rather than text and this visual content would help automatic journalism from social media. 5. Conclusions In current research, we used Transformers combined with graph techniques to provide a topic detection system for social media. Fine-tuning the transformers in realtime and obtaining valuable semantic information rather than using just frequencies is another novelty of our work. Our proposed model is combined of various modules including BERT, graph strategies and multimodal named entity recognizer. This combination as a unique methodology is called memory graph that uses cognitive memorization in human brain. Using our hyper-parameters such as the rate of 14 Pretrained Transformer based Masked Language Model   Data Similarity Adding Fine-Tuning Updating Data Memory Graph Graph Strategies Tagging Text Tuning Data Stream Text Adding Twitter Data Stream Image Images Mining/Tuning Multimodal NER MongoDB Outputs Image Results Computer Vision Figure 6: Proposed System forgetting makes this system available for any stream size and any machine it is running. Modularity of our work makes it more usable in real life and corporate based environments when dealing with big social data.The results shows superiority of our proposed method compared to other state of the art techniques. References [1] T. U. T. Dallas, A. Susarla, Y. Tan, Social Networks and the Di usion of User- Generated Content : Evidence from YouTube Social Networks and the Di usion of User-Generated Content : Evidence from YouTube, Information Systems Research 23 (2012) 23–41. URL: https://doi.org/10.1287/isre.1100.0339. arXiv:isre.1100.0339. [2] F. Atefeh, W. Khreich, A survey of techniques for event detection in Twitter, Computational Intelligence 31 (2015) 133–164. URL: https: //doi.org/10.1111/coin.12017. arXiv:arXiv:1011.1669v3. [3] A. M. Kaplan, M. Haenlein, Users of the world, unite! The challenges and opportunities of Social Media, Business Horizons 53 (2010) 59–68. URL: https://doi.org/10.1016/j.bushor.2009.09.003. arXiv:1501.00994. [4] Twitter, Twitter is: whats happening in the world and what people are talking about right now., 2018. URL: http://twitter.com/about. [5] N. Carlson, The Real History Of Twitter, 2011. URL: http://www.businessinsider.com/how-twitter-was-founded-2011-4?op= [6] H. Kwak, C. Lee, H. Park, S. Moon, What is Twitter, a Social Network or a News Media?, Network (2010) 19–22. URL: https://doi. org/10.1145/1772690.1772751. arXiv:0809.1869v1. [7] I. L. Stats, Twitter usage statistics, 2018. URL: http://www.internetlivestats.com/twitter-statistics/. [8] G. Bello-Orgaz, J. J. Jung, D. Camacho, Social big data: Recent achievements and new challenges, Information Fusion 28 (2016) 45–59. URL: https://doi.org/10.1016/j.inffus.2015.08.005. arXiv:1505.06807. [9] T. API, Getting Started — Twitter Developers, 2018. URL: https://dev.twitter.com/start. [10] F. Atefeh, W. Khreich, A survey of techniques for event detection in twitter, Computational Intelligence 31 (2015) 132–164. [11] C. C. Aggarwal, J. Han, Frequent pattern mining, Springer, 2014. [12] R. Ibrahim, A. Elbagoury, M. S. Kamel, F. Karray, Tools and approaches for topic detection from twitter streams: survey, Knowledge and Information Systems (2017) 1–29. [13] C. Zhang, S. Zhang, Association rule mining: models and algorithms, Springer-Verlag, 2002. [14] G. Petkos, S. Papadopoulos, L. Aiello, R. Skraba, Y. Kompatsiaris, A soft frequent pattern mining approach for textual topic detection, in: Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14) - WIMS ’14, 2014, pp. 1–10. URL: http://dl.acm.org/citation.cfm?doid=2611040.2611068. [15] M. Huang, Jiajia and Peng, Wang, Topic Detection from Large Scale of Microblog Stream with High Utility Pattern Clustering, Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (2015) 3–10. [16] S. Gaglio, G. Lo Re, M. Morana, Real-time detection of twitter social events from the user’s perspective, in: IEEE International Conference on Communications, volume 2015-Septe, 2015, pp. 1207–1212. [17] H. J. Choi, C. H. Park, Emerging topic detection in twitter stream based on high utility pattern mining, Expert Systems with Applications 115 (2019) 27–36. [18] H. Sayyadi, M. Hurst, A. Maykov, Event detection and tracking in social streams, in: ICWSM - International Conference on Weblogs and Social Media, 2009, pp. 311–314. arXiv:1412.2188. 15 Type Task Ref. Detection Method Dataset Application Event Topic RED NED [30] Naive Bayse Classi- X X Twitter API, Handpicked Users Hot News Detection fier [31] BScore based BOW X X Twitter API (oine) Disaster and Story Clustering Detection [32] BOW Distance Simi- X X Twitter API FSD larity [33] BN-Gram and tf-idf X X Oine Datasets Topic Detection [34] Cross Checking via X X Twitter API, Wikipedia Hot News Detection Wikipedia [35] Formal Concept X X Replab 2013 dataset Topic Detection Analysis [36] FPM X X Twitter API Event Detection [14] FPM X X Super Tuesday/FA Cup/U.S. Elections Topic Detection [37] FPM (Hierarchical X X topic dataset from CLEar system Topic Detection Clustering) [16] FPM (tf-idf & n- X X Twitter API Event Detection gram improved) [38] GPU improved tf-idf X X oine dataset Topic Detection approximation [39] BOW Similarity X X oine dataset Topic Detection [40] Word Embedding SemEval Dataset Twitter Sentiment Classification [41] spatiotemporal X X oine Dataset Targeted-domain detection event detection [42] Clustering of Tempo- X X Twitter API Social Event Detec- ral & Spatial features tion [43] Geographical Regu- X X Twitter API Geo-Social Event larity Estimation Detection [44] BOW clustering X X Twitter API Event Detection & Analysis [45] Probabilistic Model- X X Twitter API Early Disaster Detec- ing tion [46] FPM X X oine dataset Event Detection [47] wavelet analysis X X Twitter API Event Detection Table 8: Twitter Topic/Event detection/tracking related studies 16 [19] D. Quercia, H. Askham, J. Crowcroft, TweetLDA: Supervised Topic Classification and Link Prediction in Twitter, in: WebSci, 2012, pp. 1–4. [20] M. Liu, J. Qu, Mining high utility itemsets without candidate generation, in: Proceedings of the 21st ACM international conference on Information and knowledge management, ACM, 2012, pp. 55–64. [21] Z. Saeed, R. A. Abbasi, A. Sadaf, M. I. Razzak, G. Xu, Text stream to temporal network-a dynamic heartbeat graph to detect emerging events on twitter, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2018, pp. 534–545. [22] Z. Saeed, R. A. Abbasi, M. I. Razzak, G. Xu, Event detection in twitter stream using weighted dynamic heartbeat graph approach, arXiv preprint arXiv:1902.08522 (2019). [23] Z. Saeed, R. A. Abbasi, I. Razzak, O. Maqbool, A. Sadaf, G. Xu, Enhanced heartbeat graph for emerging event detection on twitter using time series networks, Expert Systems with Applications 136 (2019) 115–132. [24] Z. Saeed, R. A. Abbasi, I. Razzak, Evesense: What can you sense from twitter?, in: European Conference on Information Retrieval, Springer, 2020, pp. 491–495. [25] P. Wozniak, ´ E. Gorzelanczyk, ´ J. Murakowski, Two components of long-term memory., Acta neurobiologiae experimentalis 55 (1995) 301–305. [26] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019). [27] M. Asgari-Chenaghlu, M. R. Feizi-Derakhshi, L. Farzinvash, C. Motamed, A multimodal deep learning approach for named entity recognition from social media, arXiv preprint arXiv:2001.06888 (2020). [28] Z. Zhao, C. Li, X. Zhang, F. Chiclana, E. H. Viedma, An incremental method to detect communities in dynamic evolving social networks, Knowledge-Based Systems 163 (2019) 404–415. [29] L. M. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, A. Jaimes, Sensing trending topics in twitter, IEEE Transactions on Multimedia 15 (2013) 1268–1282. [30] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, J. Sperling, Twitterstand: news in tweets, in: Proceedings of the 17th acm sigspatial international conference on advances in geographic information systems, ACM, 2009, pp. 42–51. [31] S. Phuvipadawat, T. Murata, Breaking news detection and tracking in twitter, in: Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, volume 3, IEEE, 2010, pp. 120–123. [32] S. Petrovic, ´ M. Osborne, V. Lavrenko, Streaming first story detection with application to twitter, in: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 181–189. [33] S. D. Tembhurnikar, N. N. Patil, Topic detection using bngram method and sentiment analysis on twitter dataset (2015) 1–6. [34] M. Osborne, S. Petrovic, R. McCreadie, C. Macdonald, I. Ounis, Bieber no more: First story detection using twitter and wikipedia, in: SIGIR 2012 Workshop on Time-aware Information Access, 2012. [35] J. Cigarran, ´ A. Castellanos, A. Garc´ ıa-Serrano, A step forward for topic detection in twitter: An fca-based approach, Expert Systems with Applications 57 (2016) 21–36. [36] M. Adedoyin-Olowe, M. M. Gaber, C. M. Dancausa, F. Stahl, J. B. Gomes, A rule dynamics approach to event detection in twitter with its application to sports and politics, Expert Systems with Applications 55 (2016) 351–360. [37] G. Dong, W. Yang, F. Zhu, W. Wang, Discovering burst patterns of burst topic in twitter, Computers & Electrical Engineering 58 (2017) 551–559. [38] U. Erra, S. Senatore, F. Minnella, G. Caggianese, Approximate tf–idf based on topic extraction from massive message stream using the gpu, Information Sciences 292 (2015) 143–161. [39] H. Becker, M. Naaman, L. Gravano, Beyond trending topics: Real-world event identification on twitter., ICWSM 11 (2011) 438–441. [40] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, B. Qin, Learning sentiment-specific word embedding for twitter sentiment classification., in: ACL (1), 2014, pp. 1555–1565. [41] T. Hua, F. Chen, L. Zhao, C.-T. Lu, N. Ramakrishnan, Automatic targeted-domain spatiotemporal event detection in twitter, GeoInformatica 20 (2016) 765–795. [42] H. Abdelhaq, C. Sengstock, M. Gertz, Eventweet: Online localized event detection from twitter, Proceedings of the VLDB Endowment 6 (2013) 1326–1329. [43] R. Lee, K. Sumiya, Measuring geographical regularities of crowd behaviors for twitter-based geo-social event detection, in: Proceedings of the 2nd ACM SIGSPATIAL international workshop on location based social networks, ACM, 2010, pp. 1–10. [44] R. Li, K. H. Lei, R. Khadiwala, K. C.-C. Chang, Tedas: A twitter-based event detection and analysis system, in: Data engineering (icde), 2012 ieee 28th international conference on, IEEE, 2012, pp. 1273–1276. [45] T. Sakaki, M. Okazaki, Y. Matsuo, Earthquake shakes twitter users: real-time event detection by social sensors, in: Proceedings of the 19th international conference on World wide web, ACM, 2010, pp. 851–860. [46] T. Snowsill, F. Nicart, M. Stefani, T. De Bie, N. Cristianini, Finding surprising patterns in textual data streams, in: Cognitive Information Processing (CIP), 2010 2nd International Workshop on, IEEE, 2010, pp. 405–410. [47] M. Cordeiro, Twitter event detection: combining wavelet analysis and topic inference summarization, in: Doctoral symposium on informatics engineering, 2012.

Journal

Computing Research RepositoryarXiv (Cornell University)

Published: Aug 16, 2020

There are no references for this article.