You did not state your objective. But it involves text mining. Technically, Task Mining is the task of transforming unstructured text data into structure numerical data so that machine learning algorithms can be applied to large document databases. Converting text to numbers requires the use of techniques for handling text at the individual work/character depending on the objective of the mining task.
Briefly, the process to prepare textual data for analysis involves:
1.
Tokenization[
^]
2.
Stemming[
^]
3.
Stop words[
^]
4. Indexing - represent the documents in the form term-document matrix (TDM) using
"bag-of-words"[
^] approach.
It is not possible to explain in details here, so visit
Term Frequency and Inverse
Document Frequency[
^]
Finally, the whole document corpus will be turned into a TDM where the usual data mining techniques can then be applied to meet the mining objective.