Xiao XU Weizhe ZHANG Hongli ZHANG Binxing FANG
Internet computing is proposed to exploit personal computing resources across the Internet in order to build large-scale Web applications at lower cost. In this paper, a DHT-based distributed Web crawling model based on the concept of Internet computing is proposed. Also, we propose two optimizations to reduce the download time and waiting time of the Web crawling tasks in order to increase the system's throughput and update rate. Based on our contributor-friendly download scheme, the improvement on the download time is achieved by shortening the crawler-crawlee RTTs. In order to accurately estimate the RTTs, a network coordinate system is combined with the underlying DHT. The improvement on the waiting time is achieved by redirecting the incoming crawling tasks to light-loaded crawlers in order to keep the queue on each crawler equally sized. We also propose a simple Web site partition method to split a large Web site into smaller pieces in order to reduce the task granularity. All the methods proposed are evaluated through real Internet tests and simulations showing satisfactory results.
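The queue-balancing idea in this abstract can be illustrated with a minimal sketch. It is not the authors' scheme: the function name `assign_task`, the `slack` threshold, and the in-memory `queues` dictionary are assumptions made purely for illustration. A task goes to its preferred crawler unless that crawler's queue is noticeably longer than the shortest queue.

```python
def assign_task(task, preferred, queues, slack=1):
    """Append task to a crawler queue and return the chosen crawler.

    queues maps crawler id -> list of pending tasks (hypothetical layout).
    The task is redirected to the lightest-loaded crawler only when the
    preferred crawler's queue exceeds the shortest queue by more than slack.
    """
    lightest = min(queues, key=lambda c: len(queues[c]))
    target = preferred
    if len(queues[preferred]) - len(queues[lightest]) > slack:
        target = lightest
    queues[target].append(task)
    return target


if __name__ == "__main__":
    queues = {"a": ["t1", "t2", "t3"], "b": [], "c": ["t4"]}
    print(assign_task("t5", "a", queues))  # redirected away from the long queue
    print(assign_task("t6", "c", queues))  # stays on its preferred crawler
```

A larger `slack` keeps more tasks on the preferred (for example, the nearest) crawler at the cost of less even queues.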
Ping WANG Binxing FANG Xiaochun YUN Jun ZHENG
We focus on the overall representation of network user behavior and observe that the number of destinations accessed by a network user is limited, which means that users have certain preferred haunts in networks. Moreover, the distribution of users closely matches a heavy-tailed distribution rather than a Poisson distribution.
Predicting the routing paths between any given pair of Autonomous Systems (ASes) is very useful in network diagnosis, traffic engineering, and protocol analysis. Existing methods address this problem by resolving the best path from a snapshot of BGP (Border Gateway Protocol) routing tables. However, due to route deficiencies, routing policy changes, and other causes, the best path changes over time. Consequently, existing methods for path prediction fail to capture route dynamics. To predict AS-level paths in dynamic scenarios (e.g. network failures), we propose a per-neighbor path ranking model based on how long the paths have been used, and apply this routing model to extract each AS's route choice configurations for the paths observed in BGP data. With route choice configurations covering multiple paths, we are able to predict the path under multiple network scenarios. We further build the model with strict policies to ensure routing convergence, formally prove that it converges, and discuss how path prediction captures routing dynamics by disabling links. By evaluating the consistency between our model's routing and the actually observed paths, we show that our model outperforms the state-of-the-art work [4].
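As a rough illustration of ranking paths by how long they have been used, the sketch below picks the longest-used surviving path after a set of links is disabled. It is a simplification and not the model from the paper: the function name `predict_path`, the `(path, seconds)` observation format, and the tie-free selection rule are all assumptions.

```python
from collections import defaultdict


def predict_path(observations, failed_links=()):
    """Return the longest-used path that avoids all failed links, or None.

    observations: iterable of (path, seconds); path is a sequence of AS
    numbers starting at the local AS, so path[1] is the next-hop neighbor.
    failed_links: iterable of (as_a, as_b) pairs treated as undirected.
    """
    usage = defaultdict(float)
    for path, seconds in observations:
        usage[tuple(path)] += seconds

    failed = {frozenset(link) for link in failed_links}

    def alive(path):
        return all(frozenset((a, b)) not in failed
                   for a, b in zip(path, path[1:]))

    # Keep the top-ranked surviving path for each neighbor.
    best_per_neighbor = {}
    for path, seconds in usage.items():
        if not alive(path):
            continue
        neighbor = path[1]
        current = best_per_neighbor.get(neighbor)
        if current is None or seconds > usage[current]:
            best_per_neighbor[neighbor] = path

    if not best_per_neighbor:
        return None
    # Among the neighbors' best paths, prefer the one used longest.
    return max(best_per_neighbor.values(), key=usage.get)


if __name__ == "__main__":
    obs = [((1, 2, 4), 100), ((1, 3, 4), 30), ((1, 2, 5, 4), 10)]
    print(predict_path(obs))
    print(predict_path(obs, failed_links=[(2, 4)]))
```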
Kaipeng LIU Binxing FANG Weizhe ZHANG
With the emergence of Web 2.0, social tagging systems have become highly popular in recent years, giving rise to the so-called folksonomies. Personalized tag recommendation in social tagging systems aims to provide a user with a ranked list of tags for a specific resource that best serves the user's needs. Many existing tag recommendation approaches assume that users are independent and identically distributed. This assumption ignores the social relations between users, which are increasingly common. In this paper, we investigate the role of social relations in the task of tag recommendation and propose a personalized collaborative filtering algorithm. In addition to the social annotations made by collaborative users, we inject the social relations between users and the content similarities between resources into a graph representation of folksonomies. To fully explore the structure of this graph, instead of computing similarities between objects using feature vectors, we compute similarities by random walks, which furthermore enables us to model a user's tag preferences with the similarities between the user and all the tags. We combine both the collaborative information and the tag preferences to recommend personalized tags to users. We conduct experiments on a dataset collected from a real-world system. The results of comparative experiments show that the proposed algorithm outperforms state-of-the-art tag recommendation algorithms in terms of prediction quality measured by precision, recall and NDCG.
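To make the random-walk similarity idea concrete, here is a minimal sketch of a generic random walk with restart on a small weighted graph. It is not the algorithm from the paper: the graph layout, the node names, the `restart` probability, and the function name are assumptions, and every node is assumed to have at least one outgoing edge.

```python
def random_walk_with_restart(adj, start, restart=0.15, iters=50):
    """Approximate visiting probabilities of a walk that restarts at start.

    adj maps node -> {neighbor: weight}; every node is assumed to have at
    least one outgoing edge, so probability mass is conserved.
    """
    score = {v: 0.0 for v in adj}
    score[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u, nbrs in adj.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += (1.0 - restart) * score[u] * w / total
        nxt[start] += restart  # restart mass returns to the start node
        score = nxt
    return score


if __name__ == "__main__":
    # Hypothetical folksonomy fragment: two users linked socially,
    # each linked to one tag they have used.
    graph = {
        "user:u1": {"tag:python": 2.0, "user:u2": 1.0},
        "user:u2": {"tag:java": 1.0, "user:u1": 1.0},
        "tag:python": {"user:u1": 2.0},
        "tag:java": {"user:u2": 1.0},
    }
    scores = random_walk_with_restart(graph, "user:u1")
    tags = sorted((n for n in scores if n.startswith("tag:")),
                  key=scores.get, reverse=True)
    print(tags)
```

In this toy graph the tag reached only through the social link still receives a nonzero score, which is the effect social relations are meant to contribute.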
Shuzhuang ZHANG Hao LUO Binxing FANG Xiaochun YUN
Scanning packet payloads at high speed has become a crucial task in modern network management due to its wide variety of applications in network security and application-specific services. Traditionally, deterministic finite automata (DFAs) are used to perform this operation in linear time. However, the memory requirements of DFAs are prohibitively high for patterns used in practical packet scanning, especially when many patterns are compiled into a single DFA. Existing solutions to this memory blow-up trade memory requirements against the number of memory accesses needed per input character. In this paper we propose a novel method to drastically reduce the memory requirements of DFAs while still maintaining high matching speed and providing worst-case guarantees. We remove duplicate transitions between states by dividing all the DFA states into a number of groups and making each group of states share a merged transition table. We also propose an efficient algorithm for transition sharing between states. The high efficiency in time and space makes our approach suitable for frequently updated DFAs. We performed several experiments on real-world rule sets. Overall, across all rule sets and approaches evaluated, our approach offers the best memory versus run-time trade-offs.
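The idea of letting a group of states share one merged transition table can be sketched as follows. This is an illustrative simplification, not the algorithm from the paper: the grouping is supplied by the caller, the merged table simply stores the most common next state per symbol, and the names `build_shared_tables`, `step`, and `stored_transitions` are assumptions. The DFA is assumed to be complete (every state has a transition for every symbol).

```python
from collections import Counter


def build_shared_tables(dfa, groups):
    """Split a complete DFA into per-group merged tables plus exceptions.

    dfa: {state: {symbol: next_state}}; groups: list of lists of states.
    Returns (group_of, merged, exceptions).
    """
    group_of, merged, exceptions = {}, [], {}
    for gid, members in enumerate(groups):
        symbols = set()
        for s in members:
            symbols.update(dfa[s])
        # For each symbol, the merged table keeps the most common target.
        table = {
            sym: Counter(dfa[s][sym] for s in members).most_common(1)[0][0]
            for sym in symbols
        }
        merged.append(table)
        for s in members:
            group_of[s] = gid
            # A state stores only the transitions that differ from the table.
            exceptions[s] = {sym: nxt for sym, nxt in dfa[s].items()
                             if table[sym] != nxt}
    return group_of, merged, exceptions


def step(state, sym, group_of, merged, exceptions):
    """One transition: at most one exception lookup and one table lookup."""
    exc = exceptions[state]
    if sym in exc:
        return exc[sym]
    return merged[group_of[state]][sym]


def stored_transitions(merged, exceptions):
    return sum(len(t) for t in merged) + sum(len(e) for e in exceptions.values())


if __name__ == "__main__":
    # DFA over {a, b} that reaches state 2 once "ab" has been seen.
    dfa = {0: {"a": 1, "b": 0}, 1: {"a": 1, "b": 2}, 2: {"a": 2, "b": 2}}
    tables = build_shared_tables(dfa, [[0, 1], [2]])
    state = 0
    for ch in "bbab":
        state = step(state, ch, *tables)
    print(state, stored_transitions(tables[1], tables[2]))
```

Each input character costs a bounded number of lookups regardless of the grouping, while the memory saved depends on how similar the states within a group are.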
Yanbin SUN Yu ZHANG Binxing FANG Hongli ZHANG
Information-Centric Networking (ICN) treats contents as first class citizens and adopts name-based routing for content distribution and retrieval. Content names rather than IP addresses are directly used for routing. However, due to the location-independent naming and the huge namespace, name-based routing faces scalability and efficiency issues including large routing tables and high path stretches. This paper proposes a universal Scalable Name-based Geometric Routing scheme (SNGR), which is a careful synthesis of geometric routing and name resolution. To provide scalable and efficient underlying routing, a universal geometric routing framework (GRF) is proposed. Any geometric routing scheme can be used directly for name resolution based on GRF. To implement an overlay name resolution system, SNGR utilizes a bi-level grouping design. With this design, a resolution node that is close to the consumer can always be found. Our theoretical analyses guarantee the performance of SNGR, and experiments show that SNGR outperforms similar routing schemes in terms of node state, path stretch, and reliability.
Chuanyi LIU Jie LIN Binxing FANG
Cloud computing is broadly recognized as the prevalent trend in IT. However, in the cloud computing model, customers lose direct control of their data and applications hosted by the cloud providers, which raises the issue of the providers' trustworthiness and hinders the widespread use of cloud computing. This paper proposes a trustworthiness verification and audit mechanism for cloud providers called T-YUN. It introduces a trusted third party to cyclically attest the remote clouds, which are instrumented with a trusted chain covering the whole architecture stack. According to the main operations of the clouds, remote verification protocols are also proposed in T-YUN, with a dedicated key management scheme. This paper also implements a proof-of-concept emulator to validate the effectiveness and performance overhead of T-YUN. The experimental results show that T-YUN is effective and that the extra overhead it incurs is acceptable.
Dongyang ZHAN Lin YE Binxing FANG Xiaojiang DU Zhikai XU
Protecting critical files in an operating system is very important to system security. With the increasing adoption of Virtual Machine Introspection (VMI), designing VMI-based monitoring tools has become a preferred choice with promising features, such as isolation, stealthiness and quick recovery from crashes. However, these tools inevitably introduce high overhead because they are operation-based. Specifically, they need to intercept certain file operations whenever those operations are executed, regardless of whether the files involved are critical or not. Because file operations occur at high frequency, operation-based methods often degrade performance seriously. Thus, in this paper we present CFWatcher, a target-based real-time monitoring solution that protects critical files by leveraging VMI techniques. As a target-based scheme, CFWatcher constrains monitoring to the operations that access target files defined by users. Consequently, the overhead depends on how frequently the target files are accessed rather than on accesses to the whole filesystem, which dramatically reduces the overhead. To validate our solution, a prototype system is built on Xen with full virtualization, which is not only able to monitor both Linux and Windows virtual machines, but can also take actions to prevent unauthorized access according to predefined policies. Extensive evaluations demonstrate that the overhead introduced by CFWatcher is acceptable. In particular, the overhead is very low when there are only a few target files.
We consider the problem of fast identification of high-rate flows in backbone links with possibly millions of flows. Accurate identification of high-rate flows is important for active queue management, traffic measurement and network security, such as the detection of distributed denial of service attacks. It is difficult to directly identify high-rate flows in backbone links because tracking possibly millions of flows requires correspondingly large high-speed memories. To reduce the measurement overhead, we adopt the deterministic 1-out-of-k sampling technique, which is also implemented in Cisco routers (NetFlow). Ideally, a high-rate flow identification method should have short identification time, low memory cost and low processing cost. Most importantly, it should be able to specify the identification accuracy. We develop two such methods. The first method is based on the fixed sample size test (FSST), which is able to identify high-rate flows with user-specified identification accuracy. However, since FSST has to record every sampled flow during the measurement period, it is not memory efficient. Therefore a second, novel method based on the truncated sequential probability ratio test (TSPRT) is proposed. Through sequential sampling, TSPRT is able to remove the low-rate flows and identify the high-rate flows at an early stage, which reduces the memory cost and the identification time respectively. Depending on how the parameters in TSPRT are determined, two versions of TSPRT are proposed: TSPRT-M, which is suitable when low memory cost is preferred, and TSPRT-T, which is suitable when short identification time is preferred. The experimental results show that, compared to previously proposed methods, TSPRT requires less memory and less identification time to identify high-rate flows while satisfying the accuracy requirement.
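The sequential test mentioned above can be illustrated with a generic Wald sequential probability ratio test that is cut off after a fixed number of samples. This is a minimal sketch, not the TSPRT-M or TSPRT-T parameterization from the paper: the hypotheses (a flow's share of sampled packets is `p0` versus `p1`), the truncation rule (compare the log-likelihood ratio with zero), and all names are assumptions.

```python
import math


def truncated_sprt(samples, p0, p1, alpha, beta, n_max):
    """Classify a flow as 'high' or 'low' rate from 0/1 samples.

    samples: iterable where 1 means the sampled packet belongs to the flow.
    p0, p1: assumed packet shares under the low-rate and high-rate
    hypotheses (0 < p0 < p1 < 1). alpha, beta: target false-positive and
    false-negative rates. n_max: truncation point (assumed rule: decide by
    the sign of the log-likelihood ratio when it is reached).
    Returns (decision, number_of_samples_used).
    """
    upper = math.log((1.0 - beta) / alpha)   # accept 'high' at or above this
    lower = math.log(beta / (1.0 - alpha))   # accept 'low' at or below this
    inc_hit = math.log(p1 / p0)
    inc_miss = math.log((1.0 - p1) / (1.0 - p0))

    llr, n = 0.0, 0
    for x in samples:
        n += 1
        llr += inc_hit if x else inc_miss
        if llr >= upper:
            return "high", n
        if llr <= lower:
            return "low", n
        if n >= n_max:
            break
    return ("high" if llr > 0 else "low"), n


if __name__ == "__main__":
    print(truncated_sprt([1] * 10, 0.01, 0.05, 0.01, 0.01, n_max=1000))
    print(truncated_sprt([0] * 1000, 0.01, 0.05, 0.01, 0.01, n_max=1000))
```

Because a decision is often reached long before `n_max`, per-flow state can be released early, which is the memory and time benefit the abstract attributes to sequential sampling.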
Xiao XU Weizhe ZHANG Hongli ZHANG Binxing FANG
The basic requirements of distributed Web crawling systems are short download time, low communication overhead and balanced load, all of which largely depend on the systems' Web partition strategies. In this paper, we propose a DHT-based distributed Web crawling system and several DHT-based Web partition methods. First, a new system model based on a DHT method called the Content Addressable Network (CAN) is proposed. Second, based on this model, a network-distance-based Web partition is implemented to reduce the crawler-crawlee network distance in a fully distributed manner. Third, by utilizing the locality of the link space, we propose the concept of link-based Web partition to reduce the communication overhead of the system. This method not only reduces the number of inter-links to be exchanged among the crawlers but also reduces the cost of routing on the DHT overlay. In order to combine the benefits of the above two Web partition methods, we then propose two distributed multi-objective Web partition methods. Finally, all the methods we propose in this paper are compared with existing system models in simulated experiments under different datasets and different system scales. In most cases, the new methods outperform the existing models.