
A Survey of Online Web Log Prediction Systems Based on Web Mining Techniques

Megha P. Jarkad1 and Prof. Mansi Bhonsle2,3
  1. Assistant Professor, Computer Department, GHRCEM-Wagholi, Pune, India
  2. G.H. Raisoni College of Engineering and Management, Wagholi, Pune, India
  3. Department of Computer Science, University of Pune, India


Abstract

Web usage mining is the mining of characteristic user access patterns from the Web. It can be defined simply as the discovery and analysis of user access patterns through the mining of web log files and associated data from a particular website. A great deal of work has been done in this area, but this paper focuses mainly on predicting a user's next request using web log data, click-stream records and user information. The purpose of this paper is to provide an evaluation of the past and present of web usage mining and an update on future request prediction. The paper also compares and summarizes the various approaches to future request prediction and their applications, giving an overview of research progress in this area.

Introduction

The Web is the world's largest knowledge repository. With the ever-increasing number of users, extracting knowledge from the web skillfully and effectively is becoming increasingly tedious. Users leave valuable information behind while surfing the web, and this information is collected in web log files. The data gathered in these web log files plays an important role in understanding users' web navigation behavior. With the help of this data, we can predict users' interests and their future requests. This paper presents different approaches to predicting a user's future requests and discusses their advantages and disadvantages. The rest of the paper is organized as follows: Section 2 gives the motivation for this work, Section 3 is a literature review of user next-request prediction, and Section 4 concludes the paper.

Motivation

The World Wide Web is becoming more and more popular, which results in an enormous number of users accessing the web all over the world. When a user visits a website, the server automatically collects a large amount of information related to that user, such as the IP address and the requested URL, and stores it in an access log file, since the user may visit the same web pages repeatedly. A web access pattern is the collection of all visited pages, and it plays an important role in judging user behavior. From this behavior we can predict a user's future access pattern, which helps to reduce browsing time and thereby reduces server load and saves the user's time. The main objective of this study is to understand what research has already been done on web usage mining for future request prediction.

Literature Review

The literature review focuses on studying and comparing existing techniques for predicting users' future moves.

Alexandros Nanopoulos et al. [1] focus on web prefetching because of its importance in reducing the latency perceived by users in every web-based application. With the growing popularity of the web, excessive traffic leads to delayed responses. Heavily loaded web servers, low bandwidth, network congestion, propagation delay and under-utilization of bandwidth are some of the causes of latency. One way to overcome this problem is to increase bandwidth, but that is not the best solution because the economic cost must also be considered. A technique was therefore proposed that reduces the latency of a client's future requests for web objects by transferring those objects into a local cache in the background, before an explicit request is made for them. Their paper identifies the order of dependencies between web document accesses, and the interleaving of requests that belong to patterns with random requests within a user transaction, as important factors affecting web prefetching [1].

V. Sujatha and Punithavalli [2] proposed Prediction of users' navigation patterns Using Clustering and Classification (PUCC) from web log data. The first phase is the cleaning phase, in which all the web log data is preprocessed: every entry that is of no use for analysis or mining is removed, and the unformatted data is converted into a form that can be applied directly to the web mining process. This phase also includes session and user identification. In the second phase: (1) all entries that accessed the robots.txt file are identified and removed; (2) all entries whose access time is midnight are detected and removed; (3) all entries with access method HEAD instead of GET or POST are removed; and (4) all entries whose browsing speed exceeds a threshold T1 are removed, where browsing speed is the ratio of the number of viewed pages to the session time. The potential users are then identified from the others using the cleaned data. From the potential users, a graph-partitioned clustering algorithm is used to find the navigation patterns: an undirected graph with connectivity between each pair of web pages is built, and a weight based on frequency and connectivity time is assigned to every edge of the graph. An LCS classification algorithm is then used to predict users' future requests [2].

S. Vigneshwari et al. [3] proposed a model for web information gathering using an ontology mining method. A Web Data Ontology (WDO) and a User Profiling Ontology (UPO) are used for extracting information from the web, and these ontologies are the building blocks of the proposed model. The WDO is considered the main ontology of the proposed web information gathering model and is constructed from the web site under consideration: the web sites are gathered and saved for extracting the ontology. Extracting concepts from the web documents is the initial stage of ontology construction, and the concepts are extracted based on the interrelationships between the keywords in the web documents. Stop words in the web documents are removed in an initial preprocessing step, after which the documents are passed to a stemming algorithm that extracts the root word of every keyword in each document. Duplicate keywords and redundancy are removed, so that each document contains only unique keywords (a small preprocessing sketch in this spirit is given below). The keywords are then subjected to concept extraction: the concepts are extracted based on balanced mutual information (BMI), also known as a semantic similarity measure, and all the concepts are arranged according to their BMI values to form the ontology. The second ontology, the UPO, is constructed from the tags users provide for web retrieval. The tags are processed in the same way as in the WDO model, although, since tags are simple keywords, much less preprocessing is needed to separate the exact content than in the WDO model. In the ontology mining method, these two ontologies are developed from the web log data and the web documents, the main input is the user's query, and the mining focuses on an interestingness measure between the ontologies to collect information from them; a cross-ontology similarity measure is used as the semantic similarity measure.
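To make the document preprocessing step of [3] concrete, the following Python sketch performs stop-word removal, stemming and de-duplication so that each document keeps only unique root keywords. The tiny stop-word list and the suffix-stripping "stemmer" are illustrative stand-ins invented for this sketch (a real system would use a proper stemmer such as Porter's); the BMI-based concept extraction itself is not shown.

    # Minimal sketch of keyword preprocessing for WDO construction (assumed details).
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}

    def toy_stem(word: str) -> str:
        """Very crude suffix stripping, used here only as a placeholder stemmer."""
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def unique_keywords(document: str) -> list[str]:
        """Return the unique stemmed keywords of a document, stop words removed."""
        seen = []
        for token in document.lower().split():
            token = token.strip(".,;:!?\"'()")
            if not token or token in STOP_WORDS:
                continue
            root = toy_stem(token)
            if root not in seen:          # drop duplicate/redundant keywords
                seen.append(root)
        return seen

    print(unique_keywords("Mining of web logs and mining of web documents"))
    # ['min', 'web', 'log', 'document']

In the actual WDO construction, these unique keywords would then be grouped into concepts using the BMI-based semantic similarity measure described above.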
The ontology mining technique is then designed to extract information using an interestingness measure computed over the two ontologies developed in the previous step. Finally, experiments were carried out on the extracted log data and web documents; the results show the information extracted from the ontologies, validating the information provided by the web documents and the user perspective [3].

Teena Skaria, Prof. T. Kalaikumaran and Dr. Karthik [4] proposed an ontology model for gathering web information. A personalized ontology is constructed using user profiles and a concept model. Identifying user context and arranging the contexts in a way that improves search accuracy are the two main challenges in personalizing information retrieval, so the user profile is developed as a hierarchical structure known as the user profile ontology. Web information can be gathered more accurately through clustering. Clustering is performed with the k-means algorithm, based on content and location, and k-means is employed to obtain a better partition of the input data set. It is used to learn the users' preferences and to gather information from the web according to those preferences [4].

Dilpreet Kaur and A.P. Sukhpreet Kaur [5] proposed the KFCM method of fuzzy clustering to predict users' future requests. The first step is to read the web log file. The second step is preprocessing, in which only the required attributes are selected from the log file, such as the user request, request method, IP address, date and time. Irrelevant entries, such as entries for .jpeg, .jpg and .gif files, robots files, error codes, and requests with method HEAD or POST, are removed from the log file, producing a cleaned log file. From the cleaned log file, unique users are identified according to unique web pages and IP addresses. After user identification, session identification is performed: sessions are identified for all users using a 30-minute time threshold, so pages visited within 30 minutes are put into one session, pages visited after 30 minutes are put into another session, and a unique session id is assigned to every session (a session-splitting sketch is given below). The third step is clustering, in which the whole set of web pages visited by each user, together with the user sessions and session ids, is put into an array to form clusters; the data is then divided into clusters using the Fuzzy C-Means and Kernelized Fuzzy C-Means algorithms, and the web pages with the highest degree of membership in each cluster are found. In the fourth step, a weight is assigned to each web page according to its degree of membership: a page with low weight has low membership, and a page with high weight has high membership. The user's future web page is then predicted using the Fuzzy C-Means and Kernelized Fuzzy C-Means algorithms [5].
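As an illustration of the 30-minute session identification step described for [5], the following Python sketch splits one user's visits into sessions. The input format (per-user lists of (timestamp, page) pairs) and the helper name split_sessions are assumptions made for this sketch, not details from the paper.

    # Minimal sketch of session identification with a 30-minute timeout.
    from datetime import datetime, timedelta

    SESSION_TIMEOUT = timedelta(minutes=30)

    def split_sessions(visits):
        """visits: list of (timestamp, page) tuples for one user, sorted by time.
        Returns a list of sessions, each a list of pages."""
        sessions = []
        current = []
        last_time = None
        for ts, page in visits:
            if last_time is not None and ts - last_time > SESSION_TIMEOUT:
                sessions.append(current)   # gap longer than 30 minutes: new session
                current = []
            current.append(page)
            last_time = ts
        if current:
            sessions.append(current)
        return sessions

    # Example with hypothetical data:
    visits = [
        (datetime(2014, 1, 1, 10, 0), "/index.html"),
        (datetime(2014, 1, 1, 10, 10), "/products.html"),
        (datetime(2014, 1, 1, 11, 5), "/contact.html"),  # >30 min gap: new session
    ]
    print(split_sessions(visits))  # [['/index.html', '/products.html'], ['/contact.html']]

Each resulting session would then receive a unique session id and feed the clustering step described above.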
X. Tao, Y. Li and N. Zhong [6] proposed an ontology model that offers a solution for emphasizing both local and global knowledge within a single computational model. A personalized ontology is used, and the model attempts to improve web information gathering performance through ontological user profiles. The model uses the user's local instance repository (LIR) and world knowledge: world knowledge is what people acquire from education and experience, while the LIR is a collection of the user's personal information. To analyze the concepts specified in the ontology, the model introduces a multidimensional ontology mining method. The personalized ontology is then populated from the user's LIR, and background knowledge is discovered. The model also contributes broadly to fields such as information systems, information retrieval and web intelligence. The user's personalized ontology is constructed by extracting world knowledge from the LCSH system and discovering the user's background knowledge from the user's local instance repository. In the evaluation phase, the experiments use a large testbed and standard topics [6].

Maryam Jafari, Shahram Jamali and Farzad Soleymani Sabzchi [7] proposed a novel algorithm named PD-FARM for the FP-tree mining process. To find fuzzy association rules, the proposed algorithm uses a fuzzy FP-tree. PD-FARM is used for pattern discovery and is based on Fuzzy Association Rules Mining (FARM). The mining algorithm has two phases: in the first phase, the FP-tree is constructed from the database, and in the second phase, frequent patterns are derived from the FP-tree. Fuzzy association rules are found from quantitative data by the fuzzy FP-tree mining algorithm. The tree structure for frequent fuzzy terms (regions) is generated using the fuzzy FP-tree construction algorithm. Quantitative attribute values in transactions are transformed into linguistic terms, and only the linguistic term with the maximum cardinality is used for each item. The frequent fuzzy items are derived from the fuzzy FP-tree and represented by linguistic terms. The paper uses a fuzzy partition method based on the CURE clustering algorithm. Next, FP-Growth is applied: in the first scan, the region with the maximum count for each visited page is found, and in the second scan, the FFP-tree is constructed. Finally, a recursive method is used to extract all fuzzy association rules satisfying the minimum fuzzy confidence (min FC). The fuzzy FP-tree structure handles page visit times efficiently and effectively, giving a deeper understanding of users' navigational behavior [7].

Mehrdad Jalali, Norwati Mustapha, Ali Mamat and Md. Nasir B. Sulaiman [8] proposed a novel contribution to clustering user navigation patterns, based on a graph partitioning algorithm. In the first step, web log data preprocessing is done, which includes session identification, user differentiation and data cleaning. After the preprocessing task, navigation modeling is performed. In this paper, the degree of connectivity of each pair of web pages depends on two main factors: the co-occurrence of the two pages in a session, and the time positions of the two pages in a session. They proposed an algorithm to model the page-access information as an undirected graph M = (V, E), and a weight measure to approximate the connectivity degree of each pair of web pages across sessions (an illustrative weighting sketch is given at the end of this section). User navigation patterns are modeled with a graph partitioning algorithm: an undirected graph based on the connectivity between each pair of web pages is established in order to mine the navigation patterns, and a novel formula is proposed for assigning weights to the edges of the graph. The experimental results show that this approach can improve the quality of clustering of user navigation patterns in web usage mining systems [8].

Kobra Etminani, Mohammad-R. Akbarzadeh-T. and Noorali Raeeji Yanehsari [9] proposed a new method to extract users' navigational patterns from web log data. For this purpose, ant-based clustering is used, for which a neighborhood function needs to be defined. Once the clustering is completed, alignment processing is applied to the extracted sequences in each cluster, and frequent patterns are identified [9].
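To make the navigation-graph construction of [8] concrete, the following Python sketch builds an undirected weighted graph over pages from a set of sessions. The paper's actual edge-weight formula is not reproduced here; the simple co-occurrence weight that decays with the distance between the pages' positions in a session is an assumed stand-in used only to show the overall structure.

    # Illustrative sketch of building the undirected page graph M = (V, E).
    from collections import defaultdict
    from itertools import combinations

    def build_weighted_graph(sessions):
        """sessions: list of sessions, each a list of pages in visit order.
        Returns a dict mapping frozenset({page_a, page_b}) -> accumulated weight."""
        weights = defaultdict(float)
        for session in sessions:
            for (i, a), (j, b) in combinations(enumerate(session), 2):
                if a == b:
                    continue
                # Co-occurrence contributes more when the two pages are visited
                # closer together in the session (assumed weighting).
                weights[frozenset((a, b))] += 1.0 / abs(i - j)
        return weights

    sessions = [["/a", "/b", "/c"], ["/a", "/c"]]
    for edge, w in build_weighted_graph(sessions).items():
        print(sorted(edge), round(w, 2))

In the actual approach, this weighted graph would then be partitioned to obtain clusters of related pages representing user navigation patterns.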

Conclusion

Conclusions about predicting users' future behavior were drawn from the literature review and the different studies. This paper has reviewed algorithms such as graph partitioning, LCS, k-means, fuzzy c-means, ant-based clustering and kernelized fuzzy c-means that are used to predict users' future behavior.


References


  1. R. Cooley, B. Mobasher and J. Srivastava, "Data Preparation for Mining World Wide Web Browsing Patterns", 1999.

  2. Alexandros Nanopoulos, Dimitris Katsaros and Yannis Manolopoulos, "Effective Prediction of Web-User Accesses: A Data Mining Approach", in Proc. WEBKDD Workshop, 2001.

  3. Yi-Hung Wu and L.P. Chen, "Prediction of Web Page Accesses by Proxy Server Log", World Wide Web: Internet and Web Information Systems, 5, 67-88, 2002.

  4. Mathias Gery and Hatem Haddad, "Evaluation of Web Usage Mining Approaches for User's Next Request Prediction", WIDM '03: Proceedings of the 5th ACM International Workshop on Web Information and Data Management, pp. 74-81, November 7-8, 2003.

  5. Gerd Stumme, Andreas Hotho and Bettina Berendt, "Semantic Web Mining: State of the Art and Future Directions", Institute of Information Systems, Humboldt University Berlin, Vol. 78, pp. 1-36, 2004.

  6. Bettina Berendt, Andreas Hotho and Dunja Mladenic, "A Roadmap for Web Mining: From Web to Semantic Web", Institute of Technical and Business Information Systems, Otto-von-Guericke University Magdeburg, Vol. 18, pp. 1-21, 2008.

  7. Andrew Clearwater, "The New Ontologies: The Effect of Copyright Protection on Public Scientific Data Sharing Using Semantic Web Ontologies", Vol. 10, pp. 182-205, 2010.

  8. Paul Buitelaar, Philipp Cimiano and Bernardo Magnini, "Ontology Learning from Text: An Overview", Language Technology Lab, University of Karlsruhe, Germany, Vol. 3, pp. 1-10, 2003.

  9. X. Tao, Y. Li and N. Zhong, "A Personalized Ontology Model for Web Information Gathering", IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 4, pp. 496-511, 2011.

  10. Catledge, L. and Pitkow, J., "Characterizing Browsing Strategies in the World Wide Web", Computer Networks and ISDN Systems, 1995, Vol. 27, No. 6, pp. 1065-1073.

  11. Cooley, R., Srivastava, J. and Mobasher, B., "Web Mining: Information and Pattern Discovery on the World Wide Web", Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence, 1997, pp. 558-567.

  12. Eirinaki, M. and Vazirgiannis, M., "Web Mining for Web Personalization", ACM Transactions on Internet Technology, 2003, Vol. 3, No. 1, pp. 1-27.

  13. http://en.wikipedia.org/wiki/Longest_common_subsequence_problem, last accessed 27-02-2011.

  14. Huysmans, J., Baesens, B. and Vanthienen, J., "Web Usage Mining: A Practical Study", Katholieke Universiteit Leuven, Dept. of Applied Economic Sciences, 2003.

