MCLC: Leiden Weibo Corpus

Denton, Kirk denton.2 at osu.edu
Tue Apr 24 12:24:11 EDT 2012


MCLC LIST
From: daan van esch (daanvanesch at gmail.com)
Subject: Leiden Weibo Corpus
******************************************************

Dear all,


It is my pleasure to announce to you today the Leiden Weibo Corpus (LWC),
an annotated 100-million-word linguistic corpus containing 5.1 million
messages from Sina Weibo, China's premier Twitter-like microblogging
service. 

The LWC is freely available online at http://lwc.daanvanesch.nl/. Data for
the LWC was collected in January 2012. As such, it contains many
linguistic phenomena that may not be found in older corpora, such as
suffixation with "-ing", an aspectual marker borrowed from English.

Furthermore, Sina Weibo messages come with valuable metadata, such as the
gender of the user and their location. This information allows the LWC to
calculate how often words are used in different provinces and cities
across China, which is useful for research into regional lexical
variation.

Naturally, the LWC also supports searching for single words or grammar
patterns, such as "any verb followed by an aspectual particle and then a
noun". This feature may also be of interest to students and teachers of
Mandarin who are looking for example sentences.

Please feel free to forward this announcement to anyone who might be
interested. Any feedback regarding the LWC would be greatly appreciated;
please send it to daanvanesch at gmail.com.

Best wishes,

Daan van Esch
graduate student in Chinese linguistics
Leiden University, the Netherlands



