Fun with Text Mining in R - Ch1: The Tidy Text Format (1)

Published 2019-04-17, updated 2022-09-04, category: R

A guided reading of Text Mining with R

Text mining can be done in R too. This series of posts introduces and walks through the e-book Text Mining with R.

Outline

The tidy text format
1.1 Contrasting tidy text with other data structures
1.2 The unnest_tokens function
Working with tidy data

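The tidy text format

What is tidy data?

Tidy data has a specific structure:

Each variable is a column
Each observation is a row
Each type of observational unit is a table

The tidy text format is therefore a table with one token per row. A token is a meaningful unit of text, most often what we would call a word. In tidy text mining the token stored in each row is usually a single word, but it can also be an n-gram, a sentence, or a paragraph.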

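Contrasting tidy text with other data structures

String: text can of course be stored as strings, that is, as character vectors.
Corpus: an object that holds the raw strings together with additional metadata and details.
Document-term matrix: a sparse matrix that describes a collection of documents, with one row for each document and one column for each term.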
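To make the contrast concrete, here is a minimal sketch that takes the same toy text through all three shapes: a character vector, the tidy one-token-per-row table, and a small document-term table. The two toy documents and the names `raw_text`, `docs`, `tidy_docs`, `counts`, and `dtm` are made up for illustration, and base R's `xtabs()` only stands in for a real sparse document-term matrix class. It uses `unnest_tokens()`, which is introduced in the next section.

```r
library(dplyr)
library(tidytext)

# Strings: two toy documents in a plain character vector (made-up data)
raw_text <- c(a = "the cat sat", b = "the dog sat down")

# Tidy text: one token (here, one word) per row
docs <- tibble(doc = names(raw_text), text = unname(raw_text))
tidy_docs <- docs %>%
  unnest_tokens(word, text)

# Document-term layout: one row per document, one column per term.
# xtabs() builds a dense table; it only illustrates the shape.
counts <- tidy_docs %>%
  count(doc, word)
dtm <- xtabs(n ~ doc + word, data = counts)

tidy_docs
dtm
```

Most cells of a document-term table are zero once the vocabulary grows, which is why real implementations store it as a sparse matrix.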

The `unnest_tokens` function

> text <- c("人生就是不斷的在後悔", 
+           "成功的道路是很遙遠的", 
+           "懂我的人一定能明白我的魅力所在",
+           "念書這種事情　就讓那些喜歡念書的人去做好了")
> text
[1] "人生就是不斷的在後悔"
[2] "成功的道路是很遙遠的"
[3] "懂我的人一定能明白我的魅力所在"
[4] "念書這種事情　就讓那些喜歡念書的人去做好了"

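This character vector is the set of strings we want to analyze. To turn it into a tidy text dataset, we first put it into a data frame with tibble(), which is available through the dplyr package.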

library(dplyr)
text_df <- tibble(line = 1:4, text = text)

text_df

The result:

# A tibble: 4 x 2
   line text                                      
  <int> <chr>                                     
1     1 人生就是不斷的在後悔                      
2     2 成功的道路是很遙遠的                      
3     3 懂我的人一定能明白我的魅力所在            
4     4 念書這種事情　就讓那些喜歡念書的人去做好了

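The tibble() function used here creates a tibble, a modern kind of data frame; it comes from the tibble package and is re-exported by dplyr. The text now sits in a tidy-friendly structure, but each row still holds a whole sentence. Next we break the text into individual tokens and turn it into a tidy data structure, using the unnest_tokens() function from tidytext.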

library(tidytext)

text_df %>%
  unnest_tokens(word, text)

The result:

# A tibble: 32 x 2
    line word 
   <int> <chr>
 1     1 人生 
 2     1 就是 
 3     1 不斷 
 4     1 的   
 5     1 在   
 6     1 後悔 
 7     2 成功 
 8     2 的   
 9     2 道路 
10     2 是   
# … with 22 more rows

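unnest_tokens() takes two main arguments, both of them column names: word is the name of the output column that will hold the tokens, and text is the input column the text comes from. In addition, the function:

Keeps the other columns (such as line)
Strips punctuation
Converts the tokens to lowercase by default, which makes them easier to compare or combine with other datasets (pass to_lower = FALSE to turn this off)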
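The behaviors listed above can be checked on a small example. This is a minimal sketch, assuming a made-up one-row data frame `demo_df`; the `token = "ngrams"` call is included only to show that a token does not have to be a single word.

```r
library(dplyr)
library(tidytext)

# A made-up sentence with mixed case and punctuation
demo_df <- tibble(id = 1L, text = "Hello, World! Tidy text is FUN.")

# Default: punctuation stripped, tokens lowercased, the id column kept
lower <- demo_df %>%
  unnest_tokens(word, text)
lower$word      # hello, world, tidy, text, is, fun

# to_lower = FALSE keeps the original case
keep_case <- demo_df %>%
  unnest_tokens(word, text, to_lower = FALSE)
keep_case$word  # Hello, World, Tidy, text, is, FUN

# Tokens can also be n-grams; here, pairs of consecutive words
bigrams <- demo_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigrams
```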

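With the data in this format, we can manipulate, process, and visualize it with the standard tidy tools (dplyr, tidyr, ggplot2). The figure below is a flowchart of a typical text analysis with tidy tools.

[Figure: flowchart of a typical text analysis with tidy tools]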

Working with tidy data

Here we load the full text of Jane Austen's six completed novels from the janeaustenr package and then convert it to the tidy format.

library(janeaustenr)
library(dplyr)
library(stringr)

original_books <- austen_books() %>%
  group_by(book) %>%
  # Add linenumber and chapter columns; linenumber records the line number in the original text
  # chapter uses a regular expression to find where each chapter starts
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()
original_books

This adds two columns, linenumber and chapter. The result:

# A tibble: 73,422 x 4
   text                  book                linenumber chapter
   <chr>                 <fct>                    <int>   <int>
 1 SENSE AND SENSIBILITY Sense & Sensibility          1       0
 2 ""                    Sense & Sensibility          2       0
 3 by Jane Austen        Sense & Sensibility          3       0
 4 ""                    Sense & Sensibility          4       0
 5 (1811)                Sense & Sensibility          5       0
 6 ""                    Sense & Sensibility          6       0
 7 ""                    Sense & Sensibility          7       0
 8 ""                    Sense & Sensibility          8       0
 9 ""                    Sense & Sensibility          9       0
10 CHAPTER 1             Sense & Sensibility         10       1
# … with 73,412 more rows
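The chapter column above comes from taking a running total over a logical vector: each line that looks like a chapter heading adds one, so every following line inherits that chapter number. A minimal sketch on a made-up vector (`toy_lines` is hypothetical data, not taken from the novels):

```r
library(stringr)

# Made-up lines: a title, two chapter headings, and some body text
toy_lines <- c("TITLE", "", "CHAPTER 1", "some text",
               "Chapter II", "more text", "a chapter in the middle")

# TRUE only where a line starts with "chapter" followed by a digit
# or a Roman numeral character, ignoring case
is_heading <- str_detect(toy_lines,
                         regex("^chapter [\\divxlc]", ignore_case = TRUE))

# Running total of headings seen so far gives the chapter number
chapter <- cumsum(is_heading)
chapter
# 0 0 1 1 2 2 2
```

Lines before the first heading get chapter 0, which matches the front matter rows in the output above.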

To work with this as a tidy dataset, we need to restructure it into the one-token-per-row format with unnest_tokens(). As noted earlier, punctuation is stripped along the way.

tidy_books <- original_books %>%
  unnest_tokens(word, text)
tidy_books

The result:

# A tibble: 725,055 x 4
   book                linenumber chapter word       
   <fct>                    <int>   <int> <chr>      
 1 Sense & Sensibility          1       0 sense      
 2 Sense & Sensibility          1       0 and        
 3 Sense & Sensibility          1       0 sensibility
 4 Sense & Sensibility          3       0 by         
 5 Sense & Sensibility          3       0 jane       
 6 Sense & Sensibility          3       0 austen     
 7 Sense & Sensibility          5       0 1811       
 8 Sense & Sensibility         10       1 chapter    
 9 Sense & Sensibility         10       1 1          
10 Sense & Sensibility         13       1 the        
# … with 725,045 more rows

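In this result each row holds one word, and some of those words carry little meaning for the analysis, such as the, of, to, and by. The tidytext package ships a stop_words dataset of such stop words, which we can remove with anti_join().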

data(stop_words)

tidy_books <- tidy_books %>%
  anti_join(stop_words)
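As a minimal sketch of what this step does, anti_join() keeps only the rows of the left table that have no match in the right table. The four-word table `toy_words` is made up for illustration. The last lines assume that the `lexicon` column of stop_words contains a value named "snowball", which could be used to filter against a single stop-word list instead of all of them.

```r
library(dplyr)
library(tidytext)

data(stop_words)

# Made-up tokens: two stop words and two content words
toy_words <- tibble(word = c("the", "house", "of", "emma"))

# Rows whose word appears in stop_words are dropped
kept <- toy_words %>%
  anti_join(stop_words, by = "word")
kept$word  # house, emma

# Assumption: "snowball" is one of the lexicons in stop_words
snowball_only <- stop_words %>%
  filter(lexicon == "snowball")
```

Passing `by = "word"` makes the join column explicit and avoids the message dplyr prints when it has to guess the column.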

We can use dplyr's count() to find the most common words across all the books.

tidy_books %>%
  count(word, sort = TRUE)

The result:

# A tibble: 13,914 x 2
   word       n
   <chr>  <int>
 1 miss    1855
 2 time    1337
 3 fanny    862
 4 dear     822
 5 lady     817
 6 sir      806
 7 day      797
 8 emma     787
 9 sister   727
10 house    699
# … with 13,904 more rows
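count() can take more than one column, so the same idea extends to per-book word counts. A minimal sketch on a made-up table (`toy_books` is hypothetical data with the same book and word columns as tidy_books):

```r
library(dplyr)

# Made-up tokens from two "books"
toy_books <- tibble(
  book = c("A", "A", "B", "B", "B"),
  word = c("miss", "time", "miss", "miss", "time")
)

# Overall frequency of each word, most common first
overall <- toy_books %>%
  count(word, sort = TRUE)
overall

# Frequency of each word within each book
per_book <- toy_books %>%
  count(book, word, sort = TRUE)
per_book
```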

We can also turn this into a visualization.

install.packages('ggplot2')  
library(ggplot2)

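# The gaps between the word counts above are hard to judge from numbers alone; a chart shows them more clearly
tidy_books %>%
  # For a bar chart, first count how often each word occurs
  count(word, sort = TRUE) %>%
  # Keep only words that occur more than 600 times; rarer words are ignored
  filter(n > 600) %>%
  # Order the words by their count so the bars come out sorted
  mutate(word = reorder(word, n)) %>%
  # Map word and n to the x and y axes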
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

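The result:

[Figure: horizontal bar chart of the words that occur more than 600 times in Jane Austen's novels]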
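The `mutate(word = reorder(word, n))` step in the plotting code is what sorts the bars. Without it, ggplot2 would order the words alphabetically. A minimal sketch of what reorder() returns, using three of the counts from the table above (the names `word`, `n`, and `ordered_word` are only local to this example):

```r
# Three words and their counts, taken from the count() output above
word <- c("miss", "time", "fanny")
n <- c(1855, 1337, 862)

# reorder() returns a factor whose levels are sorted by n, ascending
ordered_word <- reorder(word, n)
levels(ordered_word)
# "fanny" "time" "miss"
```

Because coord_flip() draws the first level at the bottom, the most frequent word ends up at the top of the chart.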

That wraps up the introduction to the tidy text format. The next post works through word frequencies using e-books. To be continued.
