Both of these techniques are very different from what we have learnt before. A set of data sometimes contains free-text attributes such as comments. Normally, this "comment" attribute is filtered away because a data mining tool like PASW Modeler 13 is not able to mine it.
Text mining basically means uncovering information hidden in text. It attempts to categorise textual data.
The three steps involved in text mining algorithms
According to Wikipedia, 'high quality' in text mining usually refers to some combination of relevance, novelty, and interestingness.
Text mining faces several challenges:
- Handling ambiguities such as spelling and grammar mistakes
- Text contains acronyms, abbreviations and misspellings (e.g. customer, cus, customar, csmr)
- Semantic analysis (e.g. book = to reserve something vs. book = a manual)
- Syntax analysis
Text mining can be applied in areas such as:
- Automatic detection of e-mail spam or phishing
- Automatic processing of messages or e-mails
- Analysis of warranty claims, help desk calls/reports, etc., to identify the most common problems and relevant responses
- Analysis of related scientific publications in journals
- Filtering and matching of resumes
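To make the misspelling challenge above more concrete (customer, cus, customar, csmr), here is a minimal Python sketch of how a user dictionary could map the variants to one word before any categorising is done. The `USER_DICTIONARY` table and the `normalise` function are made up for illustration and are not taken from PASW Modeler or the lecture.

```python
import re

# Hypothetical user dictionary: each known variant maps to one canonical term.
USER_DICTIONARY = {
    "cus": "customer",
    "customar": "customer",
    "csmr": "customer",
}


def normalise(comment):
    """Lower-case a comment, split it into words and replace known variants."""
    tokens = re.findall(r"[a-z]+", comment.lower())
    return [USER_DICTIONARY.get(token, token) for token in tokens]


print(normalise("Csmr called; customar wants refund"))
# ['customer', 'called', 'customer', 'wants', 'refund']
```

A real user dictionary would be much bigger and would probably need fuzzy matching as well, which is part of why the data preparation takes so much work.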
There are three types of web mining:
- Web content mining
- Web structure mining
- Web usage mining
Typical Web Server Log File
A web server log file captures information such as:
- User's IP address
- Date and time
- Request
- Status
- Bytes
- Previous website (referrer)
- Website the user requests to go to
- Internet browser used
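As a rough sketch, the fields listed above can be pulled out of one log line with a regular expression. This assumes the server writes the Apache "combined" log format, which the original slides do not confirm, and the sample line (IP address, URLs and browser string) is invented for illustration.

```python
import re

# Assumed layout: Apache "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Made-up sample line, not from a real server.
sample = (
    '192.0.2.10 - - [10/Oct/2009:13:55:36 +0800] '
    '"GET /products/page11.html HTTP/1.1" 200 2326 '
    '"http://www.example.com/index.html" "Mozilla/5.0"'
)

entry = parse_log_line(sample)
print(entry["ip"], entry["status"], entry["referrer"])
# 192.0.2.10 200 http://www.example.com/index.html
```

Every value comes back as a string, so the status and bytes would still need converting to numbers before any analysis.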
Session File
This session file is extracted from the web server log files. It shows the number of clicks a user needs to make in order to reach the page they want. This is somewhat similar to "purchase sequence analysis". If page 11 is the page that users most often go to, then the company may want to customise its website so that users do not need to click many times to get to page 11. This also makes the website more user-friendly and makes the pages users want more accessible.
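The page 11 idea can be sketched in a few lines of Python. The sessions below are invented, and the layout (one ordered list of page numbers per visit) is only an assumption about what a session file might hold. The sketch counts how many page views each visit needed to first reach page 11.

```python
from statistics import mean

# Hypothetical sessions: the ordered page numbers viewed in each visit.
sessions = [
    [1, 4, 7, 11],
    [1, 2, 11, 3],
    [1, 5, 6, 9, 11],
    [2, 3],
]


def clicks_to_reach(session, target):
    """Count page views up to and including the first visit to target."""
    if target not in session:
        return None  # this visit never reached the target page
    return session.index(target) + 1


depths = [clicks_to_reach(session, 11) for session in sessions]
reached = [depth for depth in depths if depth is not None]

print(depths)         # [4, 3, 5, None]
print(mean(reached))  # 4
```

If the average stays high for a popular page, that may be a hint that the page deserves a more direct link.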
Web mining also analyses users' behaviour. For example, web mining observes the buying patterns of a user and then makes recommendations to that user. This involves the marketing technique of cross-selling.
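One simple way to picture such recommendations is to count which items are bought together and suggest the most frequent partner. This is only a sketch: the baskets are made up, and real recommendation systems use more sophisticated methods than plain co-occurrence counting.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase history: one set of items per order.
baskets = [
    {"camera", "memory card", "tripod"},
    {"camera", "memory card"},
    {"camera", "case"},
    {"memory card", "card reader"},
]

# Count how often each pair of items appears in the same order.
pair_counts = Counter()
for basket in baskets:
    for first, second in combinations(sorted(basket), 2):
        pair_counts[(first, second)] += 1


def recommend(item, top_n=1):
    """Return the items most often bought together with the given item."""
    scores = Counter()
    for (first, second), count in pair_counts.items():
        if first == item:
            scores[second] += count
        elif second == item:
            scores[first] += count
    return [other for other, _ in scores.most_common(top_n)]


print(recommend("camera"))  # ['memory card']
```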
Example of cross-selling
This is an example of the personalisation of a website.
Personalisation of a website
Text mining and web mining require a lot of work, especially in the preparation of data, such as creating a user dictionary. However, more than 80% of organisational information is in unstructured textual form, which makes it an untapped gold mine of information.
(All images are taken from Temasek Polytechnic, BIT lecture slides)