Tools and Methods for Quantitative Text Analysis

Dr. Paul Nulty

Research Officer (QUANTESS project)

Department of Methodology, LSE

14th November 2014

Overview


  • Tools for accessing and managing text
  • Transforming text into data
  • Feature extraction and methods for scaling and classification

Disciplines and terminology


  • Computational linguistics, mathematical linguistics
  • Corpus linguistics
  • Natural Language Processing
  • Qualitative Content Analysis
  • Quantitative Text Analysis

Quantitative vs Qualitative



  • Quantitative when:
    • large volume of text to read or annotate
    • replicable analysis
    • surface words reflect variable of interest

Goals of quantitative analysis


  • Fast and cheap annotation of text data
  • Learn associations between text features and latent variables
  • Event detection from media or social media
  • Modelling communication in networks
  • Topic tracking in large text corpus
  • Many applications in industry

General computational tools

  • Text is unstructured data, in unstructured files
  • Statically typed programming languages
    • C, C++, Java
    • Fast, efficient, highly structured
    • Not easy to learn, slow to code in
  • Dynamically typed programming languages
    • Perl, Python, Ruby, R
    • 'Scripting' languages
    • Slower to execute, easier to write

Text-specific packages


  • Stanford CoreNLP, Mallet (Java)
  • NLTK, gensim, TextBlob (Python)
  • tm, quanteda (R)
  • Alceste, WordStat (QDA Miner)
  • NVivo, ATLAS.ti

Managing text data


  • Far less data-intensive than image, audio, or video
  • ASCII: 1 byte per character (\( 2^7 \) = 128 chars); UTF-8: 1 to 4 bytes per character, identical to ASCII for those 128
  • E.g., entire proceedings of European Parliament, 1996-2005, in 21 languages\( ^1 \) : 5.4GB
  • Many file formats, many file encodings


[1] Koehn, Philipp. “Europarl: A parallel corpus for statistical machine translation.” MT summit. Vol. 5. 2005
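
The relationship between characters and bytes can be checked directly in R. The sketch below is a minimal illustration, assuming the strings are converted to UTF-8 with `enc2utf8()`; the example strings and labels are made up for the purpose.

```r
# Example strings: plain ASCII, an accented Latin letter, the euro sign,
# and two CJK characters (written with \u escapes to stay portable).
texts <- enc2utf8(c("text", "caf\u00e9", "\u20ac", "\u6587\u672c"))

sizes <- data.frame(
  label = c("ASCII word", "accented Latin", "euro sign", "two CJK characters"),
  chars = nchar(texts, type = "chars"),  # number of characters
  bytes = nchar(texts, type = "bytes")   # number of bytes in UTF-8
)
print(sizes)
```

ASCII text costs one byte per character, while accented letters take two bytes and the euro sign and CJK characters take three each, so the size of a corpus on disk depends on its language and encoding.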

Sources of text data


  • Party manifestos, speeches
  • Parliamentary records (Hansard, Europarl)
  • Judicial opinions, amicus curiae
  • Newspaper articles (LexisNexis, Factiva)
  • Websites and social media

Retrieving text from the web


  • Web crawlers (spiders) download sites by traversing links
  • Python ('Beautiful Soup', Scrapy) and R ('rvest') libraries
  • cURL, wget, or other available tools ('HTTrack')
  • Social media: through an API
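
The two core steps of a crawler, finding the links to follow and pulling the text out of the markup, can be sketched in base R. This is only an illustration under simplifying assumptions: the page is a toy HTML string held in memory, `base_url` is a hypothetical site, and regular expressions stand in for the real HTML parser that a library such as rvest or Beautiful Soup would provide.

```r
# A toy HTML page held in memory, so the sketch runs without network access.
html <- '<html><body>
<h1>Speeches</h1>
<p>First <b>speech</b> text.</p>
<a href="/speeches/1.html">One</a>
<a href="/speeches/2.html">Two</a>
<a href="http://example.org/about">About</a>
</body></html>'

# Step 1: extract the href targets, i.e. the links a crawler would visit next.
href_attrs <- regmatches(html, gregexpr('href="[^"]*"', html))[[1]]
links <- sub('^href="', "", sub('"$', "", href_attrs))

# Resolve site-relative links against a base URL (hypothetical site).
base_url <- "http://example.org"
links <- ifelse(grepl("^https?://", links), links, paste0(base_url, links))

# Step 2: strip the tags and squeeze whitespace to recover plain text.
text <- gsub("<[^>]+>", " ", html)
text <- trimws(gsub("\\s+", " ", text))

print(links)
print(text)
```

Regular expressions break easily on real-world HTML (nested quotes, scripts, comments), which is why a dedicated parser is the usual choice in practice.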

Retrieving text from social media


  • API: Application Programming Interface
  • Twitter, Facebook, Google: all expose public web services
  • E.g., Twitter REST API
library(twitteR)
# Arguments, in order: consumer key, consumer secret, access token, access secret.
# Credentials redacted; supply the values from your own Twitter developer account.
setup_twitter_oauth('YOUR_CONSUMER_KEY',
                    'YOUR_CONSUMER_SECRET',
                    'YOUR_ACCESS_TOKEN',
                    'YOUR_ACCESS_SECRET')
[1] "Using direct authentication"

Retrieving text from social media


  • E.g., Twitter REST API, streaming API
searchTwitter('text analysis', n=50)
[[1]]
[1] "TTTTTJJJJJMMMMM: Text Dependent Analysis. That's All."

[[2]]
[1] "Libardopez: RT @KirkDBorne: Text Analysis - A Basic Overview: Bag of Words, Entities,.. http://t.co/VnkgGokecM #abdsc #Analytics #NLProc via @DataScien…"

[[3]]
[1] "dellahethcox: Thrilled and a little sad that it's Friday. This weekend looks like this: logic essay, logic text analysis, logic exercises."

[[4]]
[1] "CSC_Analytics: RT @KirkDBorne: Text Analysis - A Basic Overview: Bag of Words, Entities,.. http://t.co/VnkgGokecM #abdsc #Analytics #NLProc via @DataScien…"

[[5]]
[1] "KirkDBorne: Text Analysis - A Basic Overview: Bag of Words, Entities,.. http://t.co/VnkgGokecM #abdsc #Analytics #NLProc via @DataScienceCtrl"

[[6]]
[1] "prabhnair: \"@Dinosn: rootkit analysis methodology based on memory https://t.co/vuVwAlIAhw (Chinese original)\" #cybersecurity #forensic"

[[7]]
[1] "CompuLing: RT @linguistlist: 26.442, Calls: Romance, Computational Ling, Discourse Analysis, Lexicography, Text/Corpus Ling, Translation/France http:/…"

[[8]]
[1] "MarkEDeschaine: Microsoft acquires text analysis startup Equivio, plans to integrate machine ... - VentureBeat | @scoopit http://t.co/jHzIDK4yBJ"

[[9]]
[1] "siewlee1815: Text Tools: Create targeted content for any niche, semantic analysis suggests the best terms to use. http://t.co/lKyJ3xsJz1"

[[10]]
[1] "etuma360: Text analysis lessons from 20 million feedback messages http://t.co/nt6f3162FE"

[[11]]
[1] "cobaltcf: Microsoft Acquires Text Analysis Service Equivio http://t.co/zdgsyVv4sl"

[[12]]
[1] "OAPoliTorino: NEW FULL-TEXT: \"Integrated Design and Analysis of Intakes ...\" by Ferlauto M. et al. http://t.co/OUkQx40ugw"

[[13]]
[1] "lamehacker: rootkit analysis methodology based on memory [WooYun Drops]\n\nhttps://t.co/LRLO7L76kK"

[[14]]
[1] "Adypang: ESDM: Freeport Wajib Tambah Smelter Baru Dengan Kapasitas 2 Ton http://t.co/wYDeIoQIR7"

[[15]]
[1] "_purpledino: @retrogxy this is so weird????? Oh my god I literally want an essay analysis. U should do a tumblr text post it'll blow up"

[[16]]
[1] "yuliandriansyah: #SaveKPK Tuh kan. Kalau yang lain pada rame, pasti ada yang beginian. http://t.co/USGJhOTzaQ"

[[17]]
[1] "DublinTechTalk: RT @nlpbot: Getting started guide: AYLIEN Text Analysis SDK for Go  golang textmining NLProc http://t.co/nITYFiYYT0"

[[18]]
[1] "KevinBoylee: RT @nlpbot: Getting started guide: AYLIEN Text Analysis SDK for Go  golang textmining NLProc http://t.co/nITYFiYYT0"

[[19]]
[1] "peektastic_hq: Microsoft Acquires Text Analysis Service&Atilde&AElig&Atilde&acirc&Atilde&acirc&Atilde&Ac #peektastic http://t.co/Dgdu0UJVug"

[[20]]
[1] "PaulBriba2: Get your Proposal,Research Projects, IT, MBA,Thesis and Data Analysis . fabrianaresearchers@gmail.com\nCall / Text / Watsap 0717561799"

[[21]]
[1] "locazafynej: Elliott wave analysis on EUR/JPY and EUR/NZD div class=”separator” style=”clear both; text-align center;”a hr,,,"

[[22]]
[1] "ak1010: RT @Dinosn: rootkit analysis methodology based on memory https://t.co/xHkhA2201x (Chinese original)"

[[23]]
[1] "ANNEBELLE201525: RT @GirlfriendMAG: Wondering what that text from him means? We can help: http://t.co/UGo1LxQsRb x #CrushDecoder http://t.co/GsInL8yuWz"

[[24]]
[1] "Adypang: Waskita Sebut PMN Rp6,6T Dari Pemerintah Bisa Berikan Nilai Tambah Rp98,4T http://t.co/hgI24JB4oj"

[[25]]
[1] "Adypang: Suparni Jadi Dirut Baru Semen Indonesia http://t.co/HBAUZ2CHNE"

[[26]]
[1] "SelsRoger: RT @Dinosn: rootkit analysis methodology based on memory https://t.co/xHkhA2201x (Chinese original)"

[[27]]
[1] "bareksacom: Tahun 2015 Waskita Targetkan Rp20,8 T Kontrak Baru  http://t.co/8298v4FzJI http://t.co/TJI0MdrBfB"

[[28]]
[1] "RootMyCom: RT @Dinosn: rootkit analysis methodology based on memory https://t.co/xHkhA2201x (Chinese original)"

[[29]]
[1] "Wraz_deals_78f: Hot Deals : http://t.co/kjFzopZ0FI #7259\n\nScientific Analysis Tool Optimizes Your Content!\nCreate Targeted Cont... http://t.co/olI0N1Xn1s"

[[30]]
[1] "bareksacom: Pemerintah Perpanjang Kontrak Karya Freeport Indonesia  http://t.co/tbBiPl3O3e http://t.co/sDxSCe8sq4"

[[31]]
[1] "Adypang: Pemerintah Perpanjang Kontrak Karya Freeport Indonesia http://t.co/DFHeWuWaiY"

[[32]]
[1] "fygrave: RT @Dinosn: rootkit analysis methodology based on memory https://t.co/xHkhA2201x (Chinese original)"

[[33]]
[1] "BeyondGrappling: RT @JudoInside: Check the coolest video and text analysis by @oonyeoh of #JudoCrazy about @NaohiWtwc Takato http://t.co/t0VYFBM6f0 http://t…"

[[34]]
[1] "BSPO1348: What Have #Economists Been Doing... A Text Analysis of Published Academic Research from 1960–2010 http://t.co/XkLTZ6g67Q | @ej_economics"

[[35]]
[1] "hiamyunus: RT @bareksacom: Freeport Pilih Gresik Sebagai Lokasi Pembangunan Smelter  http://t.co/2GKCzhcIdo http://t.co/WyPee8hYWL"

[[36]]
[1] "VishnuGorantla: RT @Dinosn: rootkit analysis methodology based on memory https://t.co/xHkhA2201x (Chinese original)"

[[37]]
[1] "Homehealthcare4: How Paying for Elderly Home Care Can Be Made Easier You can only submit entirely new text for analysis once e  http://t.co/l6YmYd6vPd"

[[38]]
[1] "Sirinne: RT @linguistlist: 26.442, Calls: Romance, Computational Ling, Discourse Analysis, Lexicography, Text/Corpus Ling, Translation/France http:/…"

[[39]]
[1] "AndreNgoei: Hati2 thdp perusahaan yg tdk memberikan keterbukaan informasi kpd publik $INVS, http://t.co/7JJc6ziV2d http://t.co/nuPrNZZc7v\""

[[40]]
[1] "Dinosn: rootkit analysis methodology based on memory https://t.co/xHkhA2201x (Chinese original)"

[[41]]
[1] "linguistlist: 26.442, Calls: Romance, Computational Ling, Discourse Analysis, Lexicography, Text/Corpus Ling, Translation/France http://t.co/HtpLmez6ls"

[[42]]
[1] "ninnienicole: Do not want to write this analysis.. Someone text meeee"

[[43]]
[1] "DheerajBhaskar: Microsoft Acquires Text Analysis Service Equivio To Improve eDiscovery In Office 365 http://t.co/pju0UpTmKH"

[[44]]
[1] "Adypang: Freeport Pilih Gresik Sebagai Lokasi Pembangunan Smelter http://t.co/5nZnGSyHVx"

[[45]]
[1] "muskokathompson: RT @bbound1978: “Analysis of texts helps Ts go beyond the \"level\" to knowledge of the specific demands of the text on Rdrs.\" #FPGuidedReadi…"

[[46]]
[1] "Adypang: JP Morgan: Penurunan Harga Semen Adalah Inisiatif Semen Indonesia http://t.co/w6TPt0oe9b"

[[47]]
[1] "bareksacom: JP Morgan: Penurunan Harga Semen Adalah Inisiatif Semen Indonesia  http://t.co/fQhvT0sLKR http://t.co/vUy3APARju"

[[48]]
[1] "Adypang: OCBC NISP Terbitkan Obligasi Berkelanjutan Rp3 T http://t.co/cPYcDXLUPF"

[[49]]
[1] "Adypang: Induk MNCN Jual 572,17 Miliar Saham MNCN Untuk Investasi http://t.co/GThhlL1Ayj"

[[50]]
[1] "Adypang: Induk MNCN Jual 572,17 Miliar Saham MNCN Untuk Invesytasi http://t.co/5ZPYTIA4bF"

From natural language to data



  • Topic models (Latent Dirichlet Allocation)\( ^2 \)
  • Language models (Markov models)
  • Bag-of-words / document-term matrix

[2] Blei, David M. “Probabilistic topic models.” Communications of the ACM 55.4 (2012): 77-84.
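
As a rough illustration of the simplest language model in the list, a first-order Markov (bigram) model estimates the probability of each word given the word before it. The sketch below uses base R and a made-up sentence; real models are trained on large corpora and smooth the counts.

```r
tokens <- strsplit("the ball hit the window and the ball hit the bat", " ")[[1]]

# Pair each token with its successor to form bigrams.
first  <- head(tokens, -1)
second <- tail(tokens, -1)

# Bigram counts: rows are the current word, columns are the next word.
counts <- table(first, second)

# Normalise each row to estimate P(next word | current word).
probs <- prop.table(counts, margin = 1)

# Estimated distribution over the word following "the".
print(round(probs["the", ], 2))
```

In this toy text "the" is followed by "ball" twice and by "window" and "bat" once each, so the estimated probabilities are 0.5, 0.25, and 0.25.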

Statistical text models are not models of human language


"Nobody suggests studying bee communication by taking a massive corpus, you know, a huge library of video tapes of bees swarming around and doing statistical analysis of it, and getting some prediction about what they're likely to do next." - Noam Chomsky

  • 'Noam Chomsky on where artificial intelligence went wrong', The Atlantic, November 2011
  • 'On Chomsky and the Two Cultures of Statistical Learning', http://norvig.com/chomsky.html, 2012
  • Breiman, Leo. “Statistical modeling: The two cultures” Statistical Science 16.3 (2001): 199-231.

Standard approach - 'Bag of words'

  • Tokenize text (split on whitespace)
  • Some linguistics…
    • Phoneme: smallest contrastive linguistic unit
    • Morpheme: smallest meaningful unit
  • Count occurrences (tokens) of word types per document
  • Result: document-term matrix
    • Rows are documents, columns are word types, cells are counts
  • Example follows

Bag of words example

library(quanteda)
doc1 <- c('john hit the ball with the bat')
doc2 <- c('the ball hit the window')
doc3 <- c('john dropped the bat')
myCorp <- corpus(c(doc1,doc2,doc3))
myDfm <- dfm(myCorp)
Creating a dfm from a corpus ...
   ... indexing 3 documents
   ... tokenizing texts, found 16 total tokens
   ... cleaning the tokens, 0 removed entirely
   ... summing tokens by document
   ... indexing 8 feature types
   ... building sparse matrix
   ... converting to a dense matrix
   ... created a 3 x 8 dense dfm
   ... complete. Elapsed time: 0.025 seconds.
head(myDfm)
Document-feature matrix of: 3 documents, 8 features.
       features
docs    ball bat dropped hit john the window with
  text1    1   1       0   1    1   2      0    1
  text2    1   0       0   1    0   2      1    0
  text3    0   1       1   0    1   1      0    0
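
The same matrix can be rebuilt in base R, which makes the steps behind the bag-of-words approach explicit. This is a sketch of the idea only, not of how quanteda implements `dfm()`; the variable names are illustrative.

```r
docs <- c(text1 = "john hit the ball with the bat",
          text2 = "the ball hit the window",
          text3 = "john dropped the bat")

# Tokenize: split each document on runs of whitespace.
tokens <- strsplit(docs, "\\s+")

# The vocabulary is the sorted set of word types across all documents.
vocab <- sort(unique(unlist(tokens)))

# Count the tokens of each type per document. match() maps tokens to
# vocabulary positions; tabulate() counts them and keeps the zeros.
counts <- vapply(tokens,
                 function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
                 integer(length(vocab)))

# Transpose so that rows are documents and columns are word types.
dtm <- t(counts)
dimnames(dtm) <- list(docs = names(docs), features = vocab)
print(dtm)
```

The result has 3 rows and 8 columns and holds the same counts as the quanteda output above, with 16 tokens in total.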

Pre-processing options

  • Tokenization settings (hyphenation, contractions)
  • Lowercasing, removing punctuation and numbers
  • Stemming/lemmatization
  • Dictionary application
  • Document-term matrix, or document-feature matrix?
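
Several of these options can be sketched in a few lines of base R. The sentence and the dictionary below are invented for illustration, and stemming is left out because base R has no stemmer. Note how removing punctuation turns "government's" into "governments", which is exactly the kind of tokenization decision the first bullet refers to.

```r
raw <- "The Government's 2014 budget: spending UP, taxes down!"

# Lowercase, then remove digits and punctuation.
lower     <- tolower(raw)
no_digits <- gsub("[[:digit:]]+", "", lower)
no_punct  <- gsub("[[:punct:]]+", "", no_digits)

# Tokenize on whitespace after trimming the ends.
tokens <- strsplit(trimws(no_punct), "\\s+")[[1]]
print(tokens)

# Dictionary application: count tokens that fall into each category.
# The categories and entries are hypothetical.
dictionary <- list(economy   = c("budget", "spending", "taxes"),
                   direction = c("up", "down"))
dict_counts <- sapply(dictionary, function(words) sum(tokens %in% words))
print(dict_counts)
```

Applying a dictionary replaces thousands of word-type columns with a handful of category counts, which is why the resulting matrix is better described as document-feature than document-term.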

Properties of text

  • Frequency distribution (Zipf's law)
  • Type-token ratio
  • Readability metrics (e.g. Flesch-Kincaid)
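
These properties are simple to compute once a text is tokenized. The sketch below uses base R and a made-up sentence. The Flesch-Kincaid function takes word, sentence, and syllable counts as inputs, because counting syllables automatically needs a heuristic that is outside this sketch; `fk_grade` is an illustrative name, not a library function.

```r
text   <- "the cat sat on the mat and the dog sat on the cat"
tokens <- strsplit(text, " ")[[1]]

# Type-token ratio: distinct word types divided by total tokens.
n_tokens <- length(tokens)
n_types  <- length(unique(tokens))
ttr      <- n_types / n_tokens

# Rank-frequency table. Zipf's law says frequency is roughly proportional
# to 1 / rank, which only becomes visible on much larger texts than this.
freq      <- sort(table(tokens), decreasing = TRUE)
rank_freq <- data.frame(rank = seq_along(freq),
                        word = names(freq),
                        freq = as.integer(freq))
print(rank_freq)

# Flesch-Kincaid grade level from pre-computed counts.
fk_grade <- function(words, sentences, syllables) {
  0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
}
```

For example, `fk_grade(100, 5, 150)` scores a passage of 100 words, 5 sentences, and 150 syllables; longer sentences and longer words both push the grade level up.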

word-frequency distribution