ABSTRACT: Users are increasingly pursuing complex
task-oriented goals on the web, such as making travel arrangements, managing
finances, or planning purchases. To this end, they usually break down the tasks
into a few codependent steps and issue multiple queries around these steps
repeatedly over long periods of time. To better support users in their
long-term information quests on the web, search engines keep track of their
queries and clicks while searching online. In this paper, we study the problem
of organizing a user’s historical queries into groups in a dynamic and
automated fashion. Automatically identifying query groups is helpful for a
number of different search engine components and applications, such as query
suggestions, result ranking, query alterations, sessionization, and
collaborative search. We go beyond approaches that rely on textual similarity
or time thresholds, and we propose a more robust method that leverages search
query logs. We experimentally study the performance of
different techniques, and showcase their potential, especially when combined




KEYWORDS: Energy efficient algorithm;
Manets; total transmission energy; maximum number of hops; network lifetime


As the size and
richness of information on the web grows, so does the variety and the
complexity of tasks that users try to accomplish online. Users are no longer
content with issuing simple navigational queries. Various studies on query logs
(e.g., Yahoo’s and AltaVista’s) reveal that only about 20 percent of queries
are navigational. The rest are informational or transactional in nature. This
is because users now pursue much broader informational and task-oriented goals
such as arranging for future travel, managing their finances, or planning their
purchase decisions. However, the primary means of accessing information online
is still through keyword queries to a search engine. A complex task such as
travel arrangement has to be broken down into a number of codependent steps
over a period of time. For instance, a user may first search on possible
destinations, timeline, events, etc. After deciding when and where to go, the
user may then search for the most suitable arrangements for air tickets, rental
cars, lodging, meals, etc. Each step requires one or more queries, and each
query results in one or more clicks on relevant pages.

One important
step toward enabling services and features that can help users during their
complex search quests online is the capability to identify and group related
queries together. Recently, some of the major search engines have introduced a
new “Search History” feature, which allows users to track their online searches
by recording their queries and clicks. One example is a portion of a user’s
history as it was shown by the Bing search engine in February 2010. This
history includes a sequence of four queries displayed in reverse chronological
order together with their corresponding clicks. In addition to viewing their
search history, users can manipulate it by manually editing and organizing
related queries and clicks into groups, or by sharing them with their friends.
While these features are helpful, the manual efforts involved can be disruptive
and will be untenable as the search history gets longer over time.

In fact,
identifying groups of related queries has applications beyond helping users
make sense of and keep track of the queries and clicks in their search history.
First and foremost, query grouping allows the search engine to better
understand a user’s session and potentially tailor that user’s search
experience according to her needs. Once query groups have been identified,
search engines can have a good representation of the search context behind the
current query using queries and clicks in the corresponding query group. This
will help to improve the quality of key components of search engines such as
query suggestions, result ranking, query alterations, sessionization, and
collaborative search. For example, if a search engine knows that a current
query “financial statement” belongs to a {“bank of america,” “financial
statement”} query group, it can boost the rank of the page that provides
information about how to get a Bank of America statement instead of the
Wikipedia article on “financial statement,” or the pages related to financial
statements from other banks.

Query grouping
can also assist other users by promoting task-level collaborative search. For
instance, given a set of query groups created by expert users, we can select
the ones that are highly relevant to the current user’s query activity and
recommend them to her. Explicit collaborative search can also be performed by
allowing users in a trusted community to find, share and merge relevant query
groups to perform larger, long-term tasks on the web.

Related work

Fig 1: Web Mining Structure

Web content mining [3] deals with the discovery of useful information from the
unstructured, semi-structured, or structured contents of web documents.
Unstructured documents comprise text, images, audio, and video; semi-structured
data includes HTML documents; and lists and tables represent structured
documents. The main aim of web content mining is to act as a tool for
retrieving information easily and quickly. It works by organizing a group of
documents into related categories, which helps web search engines extract
information more quickly and efficiently. Web structure mining [6, 7] mines
information by utilizing the link structure of web documents. It works at the
inter-document level and discovers hyperlink structure, which helps in
describing the similarities and relationships between sites. Web usage mining
[3] is a data mining technique that mines information by analyzing the log
files that contain user access patterns. It mines secondary data, which is
present in log files and derived from the interactions of users with the web.
Web usage mining techniques are applied to the data present in web server
logs, browser logs, cookies, user profiles, bookmarks, mouse clicks, etc. This
information is often gathered automatically in access logs by the web server.

2.1 Web Usage

Web usage mining concentrates on techniques that can predict the navigational
pattern of the user while the user interacts with the web. It is mainly divided
into two categories: general access pattern tracking and customized usage
tracking. In general access pattern tracking, information is discovered by
using the history of the web pages visited by users, while in customized usage
tracking, mining is targeted at a specific user. There are four main types of
data sources, which record usage data at different levels: client level
collection, browser level collection, server level collection, and proxy level
collection.

Client Level Collection: At this level, data is gathered by means of JavaScript
or Java applets. This data shows the behavior of a single user on a single
site. Client-side data collection requires user participation to enable
JavaScript or Java applets. The advantage of data collection at the client side
is that it can capture all clicks, including presses of the back or reload
button [2].

Browser Level Collection: The second method of data collection is modifying the
browser. It shows the behavior of a single user over multiple sites. The data
collection capabilities are enhanced by modifying the source code of an
existing browser. Modified browsers provide much more versatile data, as they
consider the behavior of a single user on multiple sites [2].

Server Level Collection: The web server log [5] stores the behavior of multiple
users over a single site. These log files can be stored in the common log
format or the extended log format. Server logs are not able to record cached
page views. Another technique used for usage data collection at the server
level is TCP/IP packet sniffing. Packet sniffers work by monitoring the network
traffic and retrieving usage data directly.
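
To make the server-level case concrete, the following is a minimal sketch of how one line in the common log format might be parsed into fields. The function name `parse_clf_line`, the field names, and the sample line are illustrative assumptions and are not part of the system described in this paper.

```python
import re
from typing import Optional

# Common log format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_clf_line(line: str) -> Optional[dict]:
    """Return the fields of one common-log-format line, or None if malformed."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means that no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    # The request field normally holds "METHOD URL PROTOCOL".
    parts = record["request"].split()
    record["url"] = parts[1] if len(parts) >= 2 else ""
    return record


# Hypothetical sample line, used only for illustration.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_clf_line(sample)
```

Each parsed record is a plain dictionary, so the status code and the requested URL are directly available to later preprocessing steps.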

Proxy Level Collection: Proxy servers are used by internet service providers to
provide World Wide Web access to customers. These servers store the behavior of
multiple users at multiple sites. They function like cache servers and are able
to produce cached page views. By predicting the usage patterns of visitors, web
usage mining improves the quality of e-commerce services, personalizes the web
[1], or enhances the performance of the web structure and web server.

data are data that are
collected from web servers; they include log files, cookies, and explicit user
input. Servers contain different types of logs, which are considered to be the
main data resource for web usage mining.


Problem  Definition

There are rich variants of browsing behaviour analysis techniques available,
but most of them suffer from the following issues:

Techniques based on web server access logs capture only partial user behaviour,
so the log management scheme needs to be improved.

More than one page is navigated at different times, so the correlation between
each user event and its corresponding web page is complex for an algorithm to
learn.

Huge data volumes lead to high time and space complexity.

The predictive methodology is inaccurate because few features are available on
the user navigation pattern.

Limitations of Existing System:

The accuracy of the system is quite low.

Time consumption increases as the dataset size increases.


Proposed Architecture

The framework consists of three levels.

Level 1: At this level, the basic features are generated from the web logs of
the servers on which the proposed system resides, and they are used to form web
log records for a well-defined time period. Monitoring and analysing the logs
concentrates the detection of malicious activities on only the relevant users
and sessions.

The aim is to provide the best protection for the targeted sessions. This also
enables our detector to provide the protection that best fits the targeted
users, because the legitimate user profiles used by the detector are developed
for a smaller number of logs.


Level 2: In this step, the analysis is applied: the user profile generation
module extracts the correlation between two separate features within an
individual log.

The distinct features come from Level 1 or from the “feature normalization
module” in this step. All the extracted correlations are stored and are then
used in place of the original logs to represent the web logs. This
differentiates between legitimate and illegitimate log data.


Level 3: The anomaly session identification mechanism is adopted in decision
making.

The Normal Profile Generation module generates profiles for various types of
web logs, and the generated normal profiles are stored in a database. The
“Tested Profile Generation” module is used in the “test phase” to build
profiles for individual observed web logs. Finally, the tested profiles are
handed over to the “session identification” module, which compares each tested
profile with the stored normal profiles.

This requires expertise in the targeted detection algorithm and is a manual
task. The Normal Profile Generation module is operated to generate profiles for
various types of legitimate log records, and the generated normal profiles are
stored in the database. The Tested Profile Generation module is used in the
test phase to build profiles for the individual observed log records. Next, the
tested profiles are passed to the session identification module, which compares
the individual tested profiles with the respective stored normal profiles. A
threshold-based classifier is employed in the session identification module to
classify logs [8].


A.  Data Cleaning

Input: log_table

Output: refine_log_table


1. Read records in log_table

2. For each record in log_table

3. Read fields (Status code)

4. If Status code = 200, Then get all fields.

5. If suffix.URL_Link = {*.gif, *.jpg, *.css, *.ico}

6. Remove suffix.URL_Link

7. Else save fields in refine_log_table.

End if

End if

8. Next record
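
The cleaning steps above can be sketched in Python as follows. This is a minimal illustration that assumes each record is a dictionary with `status` and `url` keys; those key names, and the decision to ignore query strings when checking the suffix, are assumptions made for the example rather than details given in the pseudocode.

```python
from typing import Dict, Iterable, List

# Suffixes named in the pseudocode; requests for these are treated as noise.
IGNORED_SUFFIXES = (".gif", ".jpg", ".css", ".ico")


def clean_log_table(log_table: Iterable[Dict]) -> List[Dict]:
    """Keep successful (status 200) requests that are not for ignored files."""
    refine_log_table = []
    for record in log_table:
        # Step 4: only successful requests are considered.
        if record.get("status") != 200:
            continue
        # Steps 5-6: drop requests whose URL ends in an ignored suffix.
        # Assumption: any query string is ignored for this check.
        path = str(record.get("url", "")).split("?", 1)[0].lower()
        if path.endswith(IGNORED_SUFFIXES):
            continue
        # Step 7: save the remaining record in the refined table.
        refine_log_table.append(record)
    return refine_log_table
```

For example, a table holding a successful page request, a successful image request, and a failed page request would be reduced to the single successful page request.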



B.  Detection Mechanism

In this section, we present a threshold-based anomaly detector whose normal
profiles are generated using purely legitimate web log records and are used
for future comparisons with newly arriving logs under investigation. The
difference between a newly arriving log record and the respective normal
profile is examined by the proposed detector. If the difference is greater than
a predetermined threshold, the log record is marked as a malicious session;
otherwise, it is marked as a legitimate session.
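
A minimal sketch of such a threshold test is given below. It assumes that the normal profile is summarised by the mean `mu` and standard deviation `sigma` of a distance measure over the legitimate training logs, and that the accepted range is `mu` plus or minus `alpha` times `sigma`. The function name `classify_session` and the default `alpha` are illustrative assumptions.

```python
def classify_session(distance: float, mu: float, sigma: float,
                     alpha: float = 3.0) -> str:
    """Label a log record by comparing its distance with the normal range.

    Assumption: a record is legitimate when its distance lies inside
    [mu - alpha * sigma, mu + alpha * sigma], and malicious otherwise.
    """
    lower = mu - alpha * sigma
    upper = mu + alpha * sigma
    return "legitimate" if lower <= distance <= upper else "malicious"
```

A smaller `alpha` narrows the accepted range, so more records are flagged as malicious; a larger `alpha` widens it and flags fewer.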


C.  Algorithm for User Profile Generation

In Algorithm 1, the normal user profile is built through density estimation
between the individual legitimate training web logs and the expectation of the
legitimate training web logs.

Step 1: Input web logs.

Step 2: Extract original features of individual logs.

Step 3: Apply the user profile concept to extract the geometrical correlation
between the j-th and k-th features in the vector x_i.

Step 4: User Normal profile generation

i.  Generate triangle area map of each log.

ii. Generate covariance matrix.

features between legitimate record’s value and input records value


standard deviation.

vi.      Return pro.

Step 5: Session identification.

i. Input:
observed logs, normal profile and alpha.

ii. Generate values for i/p logs

value between normal profile and i/p logs

If value < threshold, detect a normal session; else detect a malicious session.

In the training phase, we employ only the normal records. Normal profiles are
built with respect to the various types of legitimate logs using the algorithm
described above. Clearly, normal profiles and thresholds have a direct
influence on the performance of the threshold-based detector. A low-quality
normal profile causes legitimate logs to be characterized incorrectly.


D.  Algorithm for Session identification

This algorithm is used for classification.

Step 1: The task is to classify new features as they arrive, i.e., to decide
which class label they belong to, based on the currently existing log records.

Step 2: Formulate the prior probability, so that we are ready to classify a new
record.

Step 3: Calculate the number of points in the record belonging to each log
record.

Step 4: The final classification is produced by combining both sources of
information, i.e., the prior and the likelihood, to form a posterior
probability.


E.  Mathematical Modeling

Let S be the system used for session identification. We equip the proposed
detection system with capabilities for accurate characterization of log
behaviours and for detection of known and unknown attacks, respectively.

·   Input: an arbitrary dataset X = {x1, x2, ..., xn}

·   Output: DP (detected sessions): DP = {n, m}, where n is the normal sessions
and m is the malicious sessions.

Process: S = {D, mvc, NP, AD, DP}, where

S = system

D = dataset

mvc = multivariate correlation analysis

NP = normal profile generation

AD = session identification

DP = detected sessions


EXPERIMENT EVALUATION AND ANALYSIS

Evaluation of session identification is done using a web logs dataset. The
normal user profile is built using the same dataset. The threshold range is
generated using 'µ + σ * α' and 'µ - σ * α'. For a normal distribution, the
value of α ranges from 1 to 3. The detection rate and false positive rate are
evaluated for the different values of α.

Fig: Graph of false positive rate vs. detection rate

Advantages of Proposed System:

1. Accuracy is high

2. Time consumption is much lower than in previous systems

3. Classification accuracy is better than in previous systems

Disadvantages of proposed system:

1. Does not consider a real-time dataset

2. Processing speed depends on the machine configuration

Future Scope:

1. Can be implemented with other algorithms to check accuracy

2. A hybrid approach can also be implemented to improve accuracy

3. To be implemented using a real-world dataset


III. Conclusion

Web usage mining is one of the emerging areas of research and an important
sub-domain of data mining and its techniques. In order to take full advantage
of web usage mining and all its techniques, it is important to carry out the
preprocessing stage efficiently and effectively. This paper covers the areas of
preprocessing, including data cleaning, session identification, and user
identification. Once the preprocessing stage has been performed well, we apply
the data mining technique of classification. Web log mining is one of the
recent areas of research in data mining. Web usage mining has become important
because the quantity of data is continuously increasing. The above results show
that the detection rate of session identification is far better than that of
previous systems and that the false positive rate is very low. As the false
positive rate changes, there is a certain deflection in the detection rate as
well. The results thus indicate that our system performs better on the given
dataset and also on a real-time dataset generated with the Wireshark software
tool. We deal with web server logs, which maintain the history of page
requests, for applications of web usage mining such as business intelligence,
e-commerce, e-learning, personalization, etc.


References

[1] J. Teevan, E. Adar, R. Jones, and M.A.S. Potts, "Information Re-Retrieval: Repeat Queries in Yahoo's Logs," Proc. 30th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '07), pp. 151-158, 2007.

[2] A. Broder, "A Taxonomy of Web Search," SIGIR Forum, vol. 36, no. 2, pp. 3-10, 2002.

[3] A. Spink, M. Park, B.J. Jansen, and J. Pedersen, "Multitasking during Web Search Sessions," Information Processing and Management, vol. 42, no. 1, pp. 264-275, 2006.

[4] R. Jones and K.L. Klinkner, "Beyond the Session Timeout: Automatic Hierarchical Segmentation of Search Topics in Query Logs," Proc. 17th ACM Conf. Information and Knowledge Management (CIKM), 2008.

[5] P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, and S. Vigna, "The Query-Flow Graph: Model and Applications," Proc. 17th ACM Conf. Information and Knowledge Management (CIKM), 2008.

[6] D. Beeferman and A. Berger, "Agglomerative Clustering of a Search Engine Query Log," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), 2000.

[7] R. Baeza-Yates and A. Tiberi, "Extracting Semantic Relations from Query Logs," Proc. 13th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), 2007.

[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.

[9] W. Barbakh and C. Fyfe, "Online Clustering Algorithms," Int'l J. Neural Systems, vol. 18, no. 3, pp. 185-194, 2008.

[10] Lecture Notes in Data Mining, M. Berry and M. Browne, eds. World Scientific Publishing Company, 2006.

[11] V.I. Levenshtein, "Binary Codes Capable of Correcting Deletions, Insertions and Reversals," Soviet Physics Doklady, vol. 10, pp. 707-710, 1966.

[12] M. Sahami and T.D. Heilman, "A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets," Proc. 15th Int'l Conf. World Wide Web (WWW '06), pp. 377-386, 2006.

[13] J.-R. Wen, J.-Y. Nie, and H.-J. Zhang, "Query Clustering Using User Logs," ACM Trans. Information Systems, vol. 20, no. 1, pp. 59-81, 2002.

[14] A. Fuxman, P. Tsaparas, K. Achan, and R. Agrawal, "Using the Wisdom of the Crowds for Keyword Generation," Proc. 17th Int'l Conf. World Wide Web (WWW '08), 2008.

[15] K. Avrachenkov, N. Litvak, D. Nemirovsky, and N. Osipova, "Monte Carlo Methods in PageRank Computation: When One Iteration Is Sufficient," SIAM J. Numerical Analysis, vol. 45, no. 2, pp. 890-904, 2007.
