In natural language processing (NLP), authorship attribution is a well-known task that consists of answering the following question from linguistic markers: who is the true author of a textual document? Given a set of candidate authors and a corpus of sample documents, the goal is to find who wrote a new, unseen document. Classical approaches to authorship attribution are based on statistical methods. Researchers have also applied neural network methods, with good results but at the cost of higher complexity and longer training times. Deep learning is known to be very effective for image classification. Applied to authorship attribution, however, classical convolutional neural networks (CNNs) have had more difficulty, while other neural models such as vanilla recurrent neural networks (RNNs), LSTMs and GRUs have reached state-of-the-art results. Since the rise of neural networks, few works have studied the behaviour and performance of recurrent neural networks on authorship attribution tasks. This thesis analyses the performance of different kinds of recurrent neural networks applied to several authorship attribution tasks, using different lexical (Word2Vec, characters) and syntactic (part-of-speech) features.
The goal of this project is to apply and compare several flavours of recurrent neural networks (RNNs):
- Classical Echo State Networks ;
- Deep Echo State Network architectures such as DeepESN and Stacked-ESNs [2, 3];
- Long Short-Term Memory ;
- Gated Recurrent Units ;
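For readers unfamiliar with Echo State Networks, the following is a minimal sketch of the reservoir state update in its standard textbook form, x(t) = tanh(W_in u(t) + W x(t-1)). It is not the code used in this project. The function names, the reservoir size, the density and the row-sum rescaling (used here in place of the usual spectral-radius rescaling, to stay within the standard library) are all assumptions made for illustration.

```python
import math
import random


def make_reservoir(n_inputs, n_units, scale=0.9, density=0.2, seed=0):
    """Draw fixed random weights; in an ESN these are never trained."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
            for _ in range(n_units)]
    w = [[rng.uniform(-1.0, 1.0) if rng.random() < density else 0.0
          for _ in range(n_units)]
         for _ in range(n_units)]
    # Assumption: rescale by the maximum absolute row sum rather than the
    # spectral radius. A row-sum norm below 1 makes the tanh update a
    # contraction, which is sufficient for the reservoir to forget its
    # initial state.
    norm = max(sum(abs(v) for v in row) for row in w)
    if norm > 0.0:
        w = [[scale * v / norm for v in row] for row in w]
    return w_in, w


def run_reservoir(w_in, w, inputs, state=None):
    """Apply x(t) = tanh(W_in u(t) + W x(t-1)) over a sequence of vectors."""
    x = list(state) if state is not None else [0.0] * len(w)
    for u in inputs:
        x = [math.tanh(sum(a * b for a, b in zip(w_in[i], u)) +
                       sum(a * b for a, b in zip(w[i], x)))
             for i in range(len(w))]
    return x


# Toy input: a short sequence of one-hot vectors standing in for symbols.
w_in, w = make_reservoir(n_inputs=3, n_units=20)
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]] * 5
final_state = run_reservoir(w_in, w, sequence)
```

In a complete ESN classifier, only a linear readout on top of these reservoir states is trained, which is what keeps training cheap compared with LSTMs and GRUs.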
The comparison is done on three tasks:
- Author profiling (Gender classification) on the PAN@CLEF17 dataset;
- Authorship Attribution on single-authored documents of the Reuters C50 dataset;
- Authorship Attribution on two-authored documents of the Reuters C50 dataset;
The models will be compared to the following baselines:
- Bayesian models ;
- Support Vector Machines ;
- Character-based Convolutional Neural Networks ;
The objectives of the project are to:
- Evaluate the performance of different recurrent architectures on a classical NLP task ;
- Determine which textual representation is the best for neural models ;
- Evaluate the possibility to use nonlinear transient computing for natural language processing ;
The models are evaluated on a wide range of features, including the following:
|Feature|Type|Dimension|
|---|---|---|
|Word embedding (Word2Vec)|Lexical|300|
|Part of Speech (POS)|Syntactic|150|
Reuters C50 dataset
The dataset is a subset of RCV1. This corpus has already been used in author identification experiments. The top 50 authors (with respect to total size of articles) were selected from among the authors of texts labelled with at least one subtopic of the class CCAT (corporate/industrial). This is intended to minimise the topic factor in distinguishing among the texts. The training corpus consists of 2,500 texts (50 per author) and the test corpus contains another 2,500 texts (50 per author) that do not overlap with the training texts.
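The split described above (50 authors with 50 texts each, per split) can be checked after download with a short script. This is only a sketch: it assumes each split is unpacked as one sub-directory per author containing `.txt` files, and the function names are hypothetical.

```python
from pathlib import Path


def count_documents(split_dir):
    """Return {author: number of .txt files} for one split directory."""
    return {d.name: sum(1 for _ in d.glob("*.txt"))
            for d in sorted(Path(split_dir).iterdir()) if d.is_dir()}


def check_split(split_dir, n_authors=50, docs_per_author=50):
    """True if the split has the expected number of authors and texts."""
    counts = count_documents(split_dir)
    return (len(counts) == n_authors and
            all(c == docs_per_author for c in counts.values()))
```

For example, `check_split("C50train")` and `check_split("C50test")` should both return `True` if the directories are named that way (an assumption about the archive layout) and are complete.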
PAN17 Author Profiling
The PAN17 organizers provided participants with a Twitter corpus annotated with the authors' gender and the specific variety of their native language:
- English (Australia, Canada, Great Britain, Ireland, New Zealand, United States)
- Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela)
- Portuguese (Brazil, Portugal)
- Arabic (Egypt, Gulf, Levantine, Maghrebi)
Download : https://www.uni-weimar.de/medien/webis/corpora/corpus-pan-labs-09-today/pan-17/pan17-data/pan17-author-profiling-training-dataset-2017-03-10-password-protected.zip, https://s3.amazonaws.com/autoritas.pan/pan17-author-profiling-test-2017-03-16.zip
Models are evaluated with precision, recall and F1 score.
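As a reminder of how these metrics are defined, here is a minimal sketch computing them per class from true and predicted labels. The macro-averaging over classes is an assumption made for illustration; the source does not state which averaging scheme is used. The function names are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(y_true, y_pred):
    """Unweighted mean of the per-class scores (assumed averaging scheme)."""
    labels = sorted(set(y_true))
    scores = [precision_recall_f1(y_true, y_pred, label) for label in labels]
    return tuple(sum(s[i] for s in scores) / len(labels) for i in range(3))
```

For authorship attribution with 50 candidate authors, each author is treated in turn as the positive class. For gender profiling there are only two classes.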
- Comparison of Neural Models for Gender Profiling, June 2018, JADT2018 ;
- Behaviours of Deep Echo State Network-based Reservoir Computing Models on Classification of Textual Documents (To be published) ;
- Authorship Attribution using Echo State Network-based Recurrent Neural Models (To be published)
- Jaeger, Herbert; Haas, Harald (2004). “Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication”. Science. 304 (5667): 78–80. doi:10.1126/science.1091277.
- Gallicchio, Claudio; Micheli, Alessio (2013). “Tree Echo State Networks”. Neurocomputing. 101: 319–337. doi:10.1016/j.neucom.2012.08.017.
- Gallicchio, Claudio; Micheli, Alessio; Pedrelli, Luca. “Deep reservoir computing: A critical experimental analysis”. Neurocomputing. doi:10.1016/j.neucom.2016.12.089.