
A Study of Online Web Log Prediction Systems Using Web Mining Techniques: A Review

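P. Jarkad1, Prof. Mansi2 and Bhonsle3
  1. Assistant Professor, Department of Computer, GHRCEM-Wagholi, Pune, India
  2. G.H. Raisoni College of Engineering and Management, Wagholi, Pune, India
  3. Department of Computer Science, University of Pune, MH, India.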

Visit for more related articles at International Journal of Innovative Research in Computer and Communication Engineering

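Abstract

Web usage mining is the branch of web mining that extracts users' navigation patterns from the web. In simple terms, it can be defined as the discovery and analysis of user access patterns by mining the web log files and associated data of a particular website. A great deal of work has been done in this field, but this paper concentrates on predicting a user's next request from web log data, clickstream records and user information. The aim of the paper is to provide a past and current evaluation and an update of web usage mining for future request prediction. The paper also compares and summarizes the various methods for future request prediction together with their applications, which gives an overview of how the research has developed.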

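Introduction

The web is the world's largest knowledge repository. As the number of users grows, extracting knowledge from the web proficiently and effectively is becoming a tedious process. While browsing the web, users leave behind valuable information that is collected in web log files. The data collected in these files plays an important role in discovering users' web navigation behaviour. With the help of this data we can predict a user's interests and future requests. This paper presents different methods for predicting a user's future requests and discusses their advantages and disadvantages. The rest of the paper is organized as follows. Section II presents the motivation of the paper, Section III presents the literature review on next-request prediction, and Section IV gives the conclusion.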

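Motivation

The World Wide Web is becoming more popular day by day, and a large number of users around the world now access it. Whenever a user visits a website, a large amount of information about that user, such as the IP address and the requested URL, is automatically collected by the server and saved in access log files. A user may visit the same web pages repeatedly. A web access pattern, which is nothing but the sequence of all the pages visited, plays an important role in discovering user behaviour. With the help of this behaviour we can predict the user's future access pattern, which helps to reduce browsing time and thus lowers the server load and saves the user's time. The main aim of this study is to understand the research that has been done on web usage mining for future request prediction.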

Literature Review

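The literature review focuses on studying and comparing the available techniques for predicting a user's future actions.

Alexandros Nanopoulos et al. [1] focus on web prefetching because of its importance in reducing the user-perceived latency present in all web-based applications. As the popularity of the web has grown enormously, the Internet suffers from heavy traffic, which leads to delayed responses. Heavily loaded web servers, low bandwidth, network congestion, propagation delay and under-utilized bandwidth are some of the reasons for the delay. One way to solve this problem is to increase bandwidth, but this is not an optimal solution because the economic cost must also be considered. For this reason a technique is proposed that predicts a client's future requests for web objects and reduces latency by transferring those objects to the cache in the background, before an explicit request is made. The order of the dependencies between web document accesses and the interleaving, within user transactions, of requests that belong to patterns with random ones are the important factors affecting web prefetching that are discussed in their paper [1].

V. Sujatha and Punithavalli [2] proposed Prediction of User navigation patterns using Clustering and Classification (PUCC) from web log data. The first stage is the cleaning stage, in which all web log data is preprocessed: entries that are of no use in the mining analysis are removed, and unformatted data is converted into a form that can be applied directly to the web mining process. This stage also includes session and user identification. In the second stage:
  1. All entries that have accessed the robots.txt file are identified and removed.
  2. All entries whose access time is midnight are detected and removed.
  3. All entries with access mode HEAD instead of GET or POST are removed.
  4. All entries whose browsing speed exceeds a threshold T1 are removed. The browsing speed is the ratio of the number of viewed pages to the session time.
The potential users are then distinguished from the others using the cleaned data. For the potential users, a graph-partitioned clustering algorithm was used to find the navigation patterns. It uses an undirected graph that represents the connectivity between each pair of web pages. Every edge in the graph is assigned a weight based on frequency and connectivity time. An LCS classification algorithm was then used to predict the user's future requests [2].

S. Vigneshwari [3] proposed a model for web information gathering using an ontology mining method. In the proposed approach, a Web Data Ontology (WDO) and a User Profiling Ontology (UPO) are used to extract information from the web. These ontologies are the building blocks of the proposed web information gathering model. The web data ontology is considered the main ontology of the model. It is constructed from the web site under consideration: the web pages are gathered and saved so that the ontology can be extracted from them. Extracting the concepts from the web documents is the initial stage of ontology construction.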
Concepts are extracted on the basis of the interrelationships between the keywords in the web documents. Stop words are removed from the web documents in the initial preprocessing step, after which the documents are passed to a stemmer algorithm, which extracts the root words of the keywords in each document. Duplicate keywords and redundancy are removed from each document, so that each document contains only unique keywords. The keywords are then subjected to concept extraction. Concepts are extracted on the basis of balanced mutual information (BMI), which is also known as a semantic similarity measure. All the concepts are arranged according to their BMI values, and the result is the ontology. The second ontology, the user-profiling-based ontology, is constructed from user-given tags for web retrieval. The tags are processed in the same way as in the WDO model. As tags are simple keywords, they do not need as much preprocessing to separate out the exact content as in the WDO model. In the ontology mining method, two ontologies are thus developed from the web log data and the web documents. The main feature of the ontology mining method is the user-given query. Ontology mining focuses mainly on the interestingness measure between the ontologies in order to collect information from them. A cross-ontology similarity measure is used as the semantic similarity measure.
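The keyword preprocessing described above (stop-word removal, stemming, removal of duplicates) can be sketched roughly as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, `toy_stem` is a stand-in suffix stripper rather than the stemmer algorithm used in [3], and the function names are invented for this example. The BMI-based concept extraction is not shown because its formula is not given here.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption, not from [3]).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for", "on"}

# Toy suffix rules standing in for a real stemming algorithm (assumption).
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def unique_keywords(text):
    """Lower-case the text, drop stop words, stem, and keep each stem once."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = (toy_stem(t) for t in tokens if t not in STOP_WORDS)
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(stems))
```

Calling `unique_keywords` on a document yields the list of unique root keywords that a later concept extraction step would consume.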
An ontology mining technique that uses interestingness measures is then devised to extract information from the ontologies developed in the preceding steps. Finally, experiments were carried out using data extracted from the logs and the web documents. The results show the information extracted from the ontologies, validated against the information provided in the web documents and the user's perspective [3].

Teena Skaria, Prof. T. Kalaikumaran et al. [4] proposed an ontology model for web information gathering. A personalized ontology is constructed using user profiles and a concept model. Identifying user contexts and organizing them in a way that improves search accuracy are the two major challenges in personalized information retrieval. For this reason the user profiles are developed in a hierarchical structure called an ontological user profile. Clustering makes web information gathering more accurate. The clustering is done with the k-means algorithm and is performed on the basis of content and location. The k-means algorithm gives a better partitioning of the input data set. It is employed to learn the user's preferences and to gather information from the web according to those preferences [4].

Dilpreet Kaur and A.P. Sukhpreet Kaur [5] proposed the KFCM method of fuzzy clustering to predict the user's future request. The first step is to read the web log file. The second step is preprocessing, in which only the required attributes, such as user request, request method, IP address, date and time, are selected from the log file. Irrelevant entries, such as entries for .jpeg, .jpg and .gif files, robots files, entries with error codes, and entries with request method HEAD or POST, are removed from the log file to produce a cleaned log file. From the cleaned log file, unique users are identified according to unique web pages and IP address. After user identification, session identification is performed. In this step, sessions are identified for all users using a 30-minute time threshold: pages visited by a user within 30 minutes are put into one session, and pages visited after 30 minutes are put into another session. A unique session ID is assigned to every session. The third step is clustering, in which the web pages visited by each user, the user sessions and the session IDs are put into an array for clustering. The data is then divided into clusters using the Fuzzy C-Means and Kernelized Fuzzy C-Means algorithms, and the web pages with the highest grade of membership in each cluster are found. In the fourth step a weightage is assigned to each web page according to its grade of membership: a page with low weightage has low membership and a page with high weightage has high membership. The user's future web page is predicted using the Fuzzy C-Means and Kernelized Fuzzy C-Means algorithms [5].
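The log cleaning and session identification steps described for [5] can be sketched as follows. This is a minimal illustration, not the authors' implementation. The entry fields (`ip`, `time`, `method`, `status`, `url`), the function names, the treatment of status codes of 400 and above as error codes, the identification of users by IP address alone, and the reading of the 30-minute threshold as a window measured from the first request of a session are all assumptions made for this example.

```python
from datetime import datetime, timedelta

SESSION_WINDOW = timedelta(minutes=30)       # threshold stated in the text
IGNORED_SUFFIXES = (".jpeg", ".jpg", ".gif")  # image requests to discard


def is_relevant(entry):
    """Keep successful GET requests that are not images or robots files."""
    url = entry["url"].lower()
    return (
        entry["method"] == "GET"              # drops HEAD and POST
        and entry["status"] < 400             # assumption: >= 400 is an error
        and not url.endswith(IGNORED_SUFFIXES)
        and "robots" not in url
    )


def sessionize(entries):
    """Clean the entries, then group them into per-IP sessions.

    A new session starts when a request arrives more than 30 minutes
    after the first request of that user's current session.
    """
    sessions = []
    current = {}  # ip -> index of that user's open session
    cleaned = sorted(filter(is_relevant, entries), key=lambda e: e["time"])
    for entry in cleaned:
        ip = entry["ip"]
        idx = current.get(ip)
        if idx is None or entry["time"] - sessions[idx]["start"] > SESSION_WINDOW:
            sessions.append(
                {"id": len(sessions) + 1, "ip": ip,
                 "start": entry["time"], "pages": []}
            )
            idx = len(sessions) - 1
            current[ip] = idx
        sessions[idx]["pages"].append(entry["url"])
    return sessions
```

The resulting list of sessions, each with a unique `id` and an ordered list of pages, is the kind of input a clustering step such as Fuzzy C-Means would then work on.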
Xiaohui Tao, Yuefeng Li and Ning Zhong [6] proposed an ontology model that provides a solution emphasizing both local and global knowledge in a single computational model. The paper uses personalized ontologies. The ontology model attempts to improve web information gathering performance by using ontological user profiles. The model uses world knowledge and the user's local instance repository (LIR). World knowledge is the knowledge that people acquire from education and experience; an LIR is a collection of the user's personal information. A multidimensional ontology mining method is introduced in the proposed model to analyse the concepts specified in the ontologies. The users' LIRs are then used to populate the personalized ontologies and to discover background knowledge. The model also contributes broadly to the fields of information systems, information retrieval and web intelligence. User-personalized ontologies are constructed by extracting world knowledge from the LCSH system and discovering user background knowledge from the user's local instance repository. In the evaluation phase, the model was tested in experiments on a large set of standard topics [6].

Maryam Jafari, Shahram Jamali and Farzad Soleymani Sabzchi [7] proposed a novel algorithm named PD-FARM for the FP-tree mining process. The proposed algorithm uses a fuzzy FP-tree to find fuzzy association rules. PD-FARM is used for pattern discovery and is based on Fuzzy Association Rules Mining (FARM). The mining algorithm has two phases. In the first phase, an FP-tree is constructed from the database, and in the second phase frequent patterns are derived from the FP-tree. The fuzzy FP-tree mining algorithm finds fuzzy association rules from quantitative data. The tree structure for frequent fuzzy terms (regions) is generated using the fuzzy FP-tree construction algorithm. The quantitative values of the attributes in the transactions are transformed into linguistic terms, and only the linguistic term with the maximum cardinality is used for each item. The frequent fuzzy items are derived from the fuzzy FP-tree and represented by linguistic terms. The paper uses a fuzzy partition method based on the CURE clustering algorithm. FP-Growth is applied next: in the first scan the regions with the maximum count for each visited page are found, and in the second scan the FFP-tree is constructed. Finally, a recursive method is used to extract all fuzzy association rules according to min FC. The fuzzy FP-tree structure handles page visit time efficiently and effectively, which gives a deeper understanding of user navigational behaviour [7].

Mehrdad Jalali, Norwati Mustapha, Ali Mamat and Md. Nasir B Sulaiman [8] proposed a novel contribution to clustering user navigation patterns.
A graph partitioning algorithm is the basis of this approach. In the first step, web log data preprocessing is done, which includes session identification, user differentiation and data cleaning. After the preprocessing task, navigation modelling is done. In this paper the degree of connectivity of each pair of web pages depends on two main factors: the occurrence of the two pages in a session and the time position of the two pages in a session. The authors proposed an algorithm that models the page access information as an undirected graph M = (V, E). To approximate the degree of connectivity of each pair of web pages in the sessions they propose a weight measure. User navigation patterns are modelled using the graph partitioning algorithm. An undirected graph based on the connectivity between each pair of web pages is built in order to mine user navigation patterns, and a novel formula is proposed for assigning weights to the edges of the graph. The experimental results show that this approach can improve the quality of clustering of user navigation patterns in web usage mining systems [8].

Kobra Etminani, Mohammad-R. Akbarzadeh-T and Noorali Raeeji Yanehsari [9] proposed a new method for extracting user navigational patterns from web log data. Ant-based clustering is used for this purpose, which requires a neighbourhood function to be defined. Once the clustering is complete, alignment processing is applied to the extracted sequences in each individual cluster and frequent patterns are identified [9].
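The idea of modelling page accesses as a weighted undirected graph, as described for [8], can be illustrated with the sketch below. It is not the weight formula of [8], which is not reproduced here. As a simplifying assumption the edge weight is just the share of sessions in which two pages occur together (the time-position factor is left out), and clusters are taken as connected components after weak edges are dropped, a simple stand-in for a graph partitioning algorithm. The function names and the `min_weight` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(sessions):
    """Return {(page_a, page_b): weight} for an undirected page graph.

    Assumed weight: fraction of sessions that contain both pages.
    """
    counts = defaultdict(int)
    for pages in sessions:
        # Sorting makes (a, b) a canonical key for the undirected edge.
        for a, b in combinations(sorted(set(pages)), 2):
            counts[(a, b)] += 1
    total = len(sessions)
    return {pair: n / total for pair, n in counts.items()}


def clusters(graph, min_weight):
    """Connected components of the graph after removing edges below min_weight.

    Pages left with no remaining edge do not appear in any cluster.
    """
    adjacency = defaultdict(set)
    for (a, b), weight in graph.items():
        if weight >= min_weight:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, result = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        result.append(sorted(component))
    return result
```

Each returned cluster is a group of pages that tend to be visited together, which is the kind of navigation pattern the clustering step is meant to surface.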

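Conclusion

The literature review shows that a variety of studies have been carried out on predicting users' future behaviour. This paper reviewed the different algorithms used to discover that behaviour: graph partitioning, LCS, K-means, fuzzy c-means, ant-based clustering and Kernelized fuzzy c-means.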


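References

  1. R. Cooley, B. Mobasher and J. Srivastava, "Data Preparation for Mining World Wide Web Browsing Patterns", 1999.

  2. Alexandros Nanopoulos, Dimitris Katsaros and Yannis Manolopoulos, "Effective Prediction of Web-user Accesses: A Data Mining Approach", in Proc. WEBKDD Workshop, 2001.

  3. Yi-Hung Wu and Arbee L. P. Chen, "Prediction of Web Page Accesses by Proxy Server Log", World Wide Web: Internet and Web Information Systems, 5, pp. 67-88, 2002.

  4. Mathias Géry and Hatem Haddad, "Evaluation of Web Usage Mining Approaches for User's Next Request Prediction", WIDM '03: Proceedings of the 5th ACM International Workshop on Web Information and Data Management, pp. 74-81, November 7-8, 2003.

  5. Andreas Hotho, Gerd Stumme and Bettina Berendt, "Semantic Web Mining: State of the Art and New Directions", Institute of Information Systems, Humboldt University Berlin, Spandauer, Vol. 78, pp. 1-36, 2004.

  6. Bettina Berendt, Andreas Hotho and Dunja Mladenic, "A Roadmap for Web Mining: From Web to Semantic Web", Institute of Technical and Business Information Systems, Otto-von-Guericke University Magdeburg, Vol. 18, pp. 1-21, 2008.

  7. Andrew Clearwater, "The New Ontologies: The Effect of Copyright Protection on Public Scientific Data Sharing Using Semantic Web Ontologies", Vol. 10, pp. 182-205, 2010.

  8. Paul Buitelaar, Philipp Cimiano and Bernardo Magnini, "Ontology Learning from Text: An Overview", Language Technology Lab DFKI, AIFB, University of Karlsruhe, Vol. 3, pp. 1-10, 2003.

  9. Xiaohui Tao, Yuefeng Li and Ning Zhong, Senior Member, "A Personalized Ontology Model for Web Information Gathering", IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 4, pp. 496-511, 2011.

  10. Catledge, L. and Pitkow, J., "Characterizing browsing behaviours on the World Wide Web", Computer Networks and ISDN Systems, 1995, No. 6, pp. 1065-1073.

  11. Cooley, R., Srivastava, J. and Mobasher, B., "Web mining: Information and pattern discovery on the world wide web", Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence, 1997, Vol. 10, pp. 558-567.

  12. Eirinaki, M. and Vazirgiannis, M., "Web mining for web personalization", ACM Transactions on Internet Technology (TOIT), 2003, Vol. 3, Issue 1, pp. 1-27.

  13. Last visited http://en.wikipedia.org/wiki/Longest_common_subsequence_problem, 27-02-2011.

  14. Huysmans, J., Baesens, B. and Vanthienen, J., "Web Usage Mining: A Practical Study", Katholieke Universiteit Leuven, Dept. of Applied Economic Sciences (2003).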

