
Friday, February 12, 2010

Week 13 - Text Mining and Web Mining

We were introduced to two new mining techniques this week: text mining and web mining.

Both techniques are very different from what we have learnt before. A data set sometimes contains attributes like comments. Normally, this "comment" attribute is filtered away, as a data mining tool like PASW Modeler 13 is not able to mine it.

Text mining basically means uncovering information hidden in text. It attempts to categorise textual data.
The 3 steps that text mining algorithms involve

According to Wikipedia, text mining is about deriving high-quality information from text, where 'high quality' usually refers to some combination of relevance, novelty, and interestingness.

There are some challenges to text mining.
  • Handling ambiguities such as spelling and grammar mistakes
  • Text contains acronyms, abbreviations, misspellings (E.g. customer, cus, customar, csmr)
  • Semantic analysis (E.g. book = to reserve something VS book = a manual)
  • Syntax analysis
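To make the abbreviation and misspelling challenge more concrete, here is a minimal sketch of how variants such as "cus", "customar" and "csmr" could be mapped back to "customer". The term list, the abbreviation table and the function names are my own assumptions for illustration (they are not from the lecture or from PASW Modeler), and the fuzzy matching uses Python's standard `difflib` module.

```python
import difflib

# Hypothetical user dictionary: canonical terms plus known abbreviations.
CANONICAL_TERMS = ["customer", "warranty", "refund"]
ABBREVIATIONS = {"cus": "customer", "csmr": "customer"}


def normalise_token(token, cutoff=0.8):
    """Map a raw token to a canonical term, or return it lower-cased."""
    word = token.lower()
    # Abbreviations are too short for fuzzy matching, so look them up directly.
    if word in ABBREVIATIONS:
        return ABBREVIATIONS[word]
    # Fuzzy matching catches misspellings such as "customar".
    matches = difflib.get_close_matches(word, CANONICAL_TERMS, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def normalise_comment(comment):
    """Normalise every whitespace-separated token in a comment."""
    return [normalise_token(token) for token in comment.split()]


print(normalise_comment("Cus says customar refund late"))
```

The cutoff of 0.8 is an arbitrary choice here. A lower value catches more misspellings but also risks merging words that are genuinely different.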
Still, once the challenges above are addressed, the patterns and trends found can be presented in graphs and used to help the organisation greatly, for example:
  • Automatic detection of e-mail spam or phishing
  • Automatic processing of messages or e-mails
  • Analysis of warranty claims, help desk calls/reports, etc., to identify the most common problems and relevant responses
  • Analysis of related scientific publications in journals
  • Filter and match resumes
Next, web mining. It basically does the same thing as text mining, except that it also analyses the log files of web sites.

There are three types of web mining:
  1. Web content mining
  2. Web structure mining
  3. Web usage mining
Typical Web Server Log File

A web server log file captures information such as:
  • User's IP address
  • Date and time
  • Request
  • Status
  • Bytes
  • Previous website (the referrer)
  • Website the user requests to go to
  • Internet browser used
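As a rough sketch of how these fields could be pulled out of a log file, the snippet below parses one line with a regular expression. It assumes the common Apache/NCSA "combined" log format, which may not be the format shown in the lecture, and the sample line, field names and function name are made up for illustration.

```python
import re

# Assumes the Apache/NCSA "combined" log format; real servers may differ.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" in the bytes column means no response body was sent.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


# Made-up sample line, for illustration only.
sample = (
    '192.0.2.10 - - [12/Feb/2010:10:15:32 +0800] '
    '"GET /products/page11.html HTTP/1.1" 200 5120 '
    '"http://www.example.com/index.html" "Mozilla/5.0"'
)

print(parse_log_line(sample))
```

Each named group corresponds to one item in the list above: the IP address, the date and time, the request, the status, the bytes sent, the previous website and the browser.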
Session File

This session file is extracted from the web server log files. It shows the number of clicks a user needs to make to reach the page they want. This is somewhat similar to "purchase sequence analysis". If page 11 is the most popular page among users, the company may want to customise its web site so that users do not need to click many times to get to page 11. This also makes the web site more user-friendly and the pages users want more accessible.
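The idea of counting clicks per session can be sketched in a few lines. Here each session is assumed to be an ordered list of page numbers that one user visited. The data and function names are hypothetical, not taken from the lecture's session file.

```python
from collections import Counter


def clicks_to_reach(session, target):
    """Clicks made after the entry page before reaching target, or None."""
    if target not in session:
        return None
    return session.index(target)  # 0 means the target was the entry page


def average_clicks(sessions, target):
    """Average clicks needed to reach target, over sessions that reach it."""
    counts = [clicks_to_reach(s, target) for s in sessions]
    counts = [c for c in counts if c is not None]
    return sum(counts) / len(counts) if counts else None


def most_visited(sessions):
    """Page that appears most often across all sessions, or None if empty."""
    visits = Counter(page for session in sessions for page in session)
    return visits.most_common(1)[0][0] if visits else None


# Hypothetical sessions: each list is the sequence of pages one user visited.
sessions = [[1, 4, 7, 11], [1, 11], [2, 5, 11, 3], [3, 4]]

print(most_visited(sessions))        # 11
print(average_clicks(sessions, 11))  # 2.0
```

If the most visited page also has a high average click count, that is a hint the page may deserve a more direct link from the entry pages.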

Web mining also analyses users' behaviour. For example, web mining observes the buying patterns of a user and then makes recommendations to that user. This involves the marketing technique of cross-selling.
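One simple way to picture such recommendations is to count which items are bought together and suggest the most frequent partners. This is only a sketch under my own assumptions (made-up baskets and function names, plain co-occurrence counting) and not necessarily how a real web site or the lecture's example works.

```python
from collections import Counter
from itertools import combinations


def build_co_purchase_counts(baskets):
    """Count how often each pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) gives each pair one consistent key per basket.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts


def recommend(item, counts, top_n=2):
    """Return up to top_n items most often bought together with item."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top_n)]


# Hypothetical purchase history, one list per order.
baskets = [
    ["camera", "memory card", "tripod"],
    ["camera", "memory card"],
    ["camera", "case"],
    ["tripod", "case"],
]

counts = build_co_purchase_counts(baskets)
print(recommend("camera", counts, top_n=1))
```

With these baskets, "memory card" appears with "camera" twice, more often than any other item, so it is the top suggestion for someone buying a camera.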

Example of cross-selling

This is an example of the personalisation of a web site.

Personalisation of web site

Text mining and web mining require a lot of work, especially in preparing the data, such as creating a user dictionary. However, more than 80% of organisational information is in unstructured textual form, which makes it an untapped gold mine of information.

(All images are taken from Temasek Polytechnic BIT lecture slides.)