The multichannel switch is an architecture widely used for ATM (Asynchronous Transfer Mode). It is known that fault tolerance can be incorporated into the multichannel crossbar switching fabric. For example, if a link belonging to a multichannel group fails, the remaining links can take over some of the traffic on the failed link. On the other hand, a fault in a switching element can lead to erroneous routing and sequencing in the multichannel switch. We investigate several fault localization algorithms for multichannel crossbar ATM switches with a view to early fault recovery. The optimal algorithm gives the best performance in terms of time to localization but is computationally complex, which makes it difficult to operate in real time. We develop an online algorithm that is computationally more efficient than the optimal one and evaluate its performance through simulation. The simulation results show that the performance of the online algorithm is only slightly suboptimal for both random and bursty traffic. There are cases where the proposed online algorithm cannot pinpoint a single fault. We explain the causes and enumerate those cases. Finally, a fault recovery algorithm is described which utilizes the information provided by the fault localization algorithm. The fault recovery algorithm adds extra rows and columns to allow cells to detour around the faulty element.
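The abstract does not spell out its localization algorithms, so the following is only a generic sketch of one way fault localization can work: intersect the paths of misrouted cells and discard elements that carried correct cells. It assumes a single persistent fault that corrupts every cell passing through it; the `(row, column)` addressing and the `localize` function are hypothetical names, not taken from the paper.

```python
from typing import FrozenSet, Iterable, Optional, Set, Tuple

Element = Tuple[int, int]  # hypothetical (row, column) address of a switching element


def localize(observations: Iterable[Tuple[FrozenSet[Element], bool]]) -> Set[Element]:
    """Return the elements that could be the single faulty one.

    Each observation is (elements the cell traversed, True if the cell
    arrived correctly). Assumes one fault that corrupts every cell it touches.
    """
    suspects: Optional[Set[Element]] = None
    cleared: Set[Element] = set()
    for path, ok in observations:
        if ok:
            cleared |= path          # elements on a correct path are assumed healthy
        elif suspects is None:
            suspects = set(path)     # first bad cell: every element on its path is suspect
        else:
            suspects &= path         # later bad cells narrow the suspects by intersection
    if suspects is None:
        return set()                 # no erroneous cell observed yet
    return suspects - cleared


obs = [
    (frozenset({(0, 0), (0, 1), (1, 1)}), False),
    (frozenset({(0, 1), (1, 1), (2, 1)}), False),
    (frozenset({(0, 0), (1, 0)}), True),
]
print(localize(obs))  # two candidates remain: (0, 1) and (1, 1)
```

In this toy trace the traffic never separates `(0, 1)` from `(1, 1)`, which illustrates how a localization procedure can end with more than one candidate; one further correct cell through `(0, 1)` would resolve it.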
Tadashi DOHI Hiroaki SUZUKI Kishor S. TRIVEDI
Software rejuvenation is a preventive and proactive solution that is particularly useful for counteracting the phenomenon of software aging. In this paper, we consider both periodic and non-periodic software rejuvenation policies under different dependability measures. As is well known, the steady-state system availability is the probability that the software system is operating in the steady state and, at the same time, is often regarded as the mean up rate in the system operation period. We show that the mean up rate should be defined as the mean value of the up rate, not as the mean up time per mean operation time. We numerically derive the optimal software rejuvenation policies which maximize the steady-state system availability and the mean up rate, respectively, for each periodic or non-periodic model. Numerical examples show that the real mean up rate is always smaller than the system availability in the steady state and that the availability overestimates the ratio of operative time of the software system.
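The distinction between "mean value of the up rate" and "mean up time per mean operation time" is the difference between a mean of ratios and a ratio of means. The sketch below shows that the two can differ, using two made-up, equally likely operation cycles; the data and function names are assumptions for illustration and are not the paper's rejuvenation models.

```python
from fractions import Fraction

# Hypothetical (up_time, down_time) pairs for two equally likely operation cycles.
cycles = [(9, 1), (1, 1)]


def ratio_of_means(cycles):
    """Mean up time divided by mean operation time (availability-style measure)."""
    total_up = sum(up for up, _ in cycles)
    total_time = sum(up + down for up, down in cycles)
    return Fraction(total_up, total_time)


def mean_of_ratios(cycles):
    """Mean of the per-cycle up rate up / (up + down)."""
    return sum(Fraction(up, up + down) for up, down in cycles) / len(cycles)


print(ratio_of_means(cycles))  # 5/6
print(mean_of_ratios(cycles))  # 7/10
```

For this toy data the mean of ratios (7/10) is below the ratio of means (5/6), in the same direction as the abstract's numerical finding; the example illustrates the distinction only and does not prove the general claim.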
Edidiong Uyai EKAETTE Behrouz Homayoun FAR
This paper proposes a framework for distributed network management that incorporates fault and performance management metrics in a hierarchical decision-making model. The goal of this research is to automate the fault management process. The fault management system is organized as a three-level information processing model. Correlation results from each level are provided as evidence to the next level. Causal and temporal relationships between monitored variables are captured using Dynamic Bayesian Networks. As evidence is gathered, the probability that a fault is present is either strengthened or weakened. The proposed model is used for proactive fault detection as well as fault isolation. A prototype implementing these ideas is presented.
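How evidence strengthens or weakens the probability of a fault can be sketched with a plain Bayes update on a single binary fault variable. This is a deliberate simplification, not a Dynamic Bayesian Network: there is no temporal transition model, and the prior and likelihood values are invented for illustration.

```python
def update_fault_belief(prior: float, p_obs_given_fault: float, p_obs_given_ok: float) -> float:
    """Posterior P(fault | observation) from Bayes' rule for a binary fault variable."""
    numerator = p_obs_given_fault * prior
    denominator = numerator + p_obs_given_ok * (1.0 - prior)
    if denominator == 0.0:
        raise ValueError("observation has zero probability under both hypotheses")
    return numerator / denominator


belief = 0.01  # assumed prior probability of a fault
# Evidence that is more likely under a fault strengthens the belief.
belief = update_fault_belief(belief, p_obs_given_fault=0.9, p_obs_given_ok=0.1)
print(round(belief, 4))  # 0.0833
# Evidence that is more likely without a fault weakens it again.
belief = update_fault_belief(belief, p_obs_given_fault=0.2, p_obs_given_ok=0.8)
print(round(belief, 4))  # 0.0222
```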
Tadashi DOHI Kazuki IWAMOTO Hiroyuki OKAMURA Naoto KAIO
Software rejuvenation is a proactive fault management technique that has been extensively studied in the recent literature. In this paper, we focus on the example of a telecommunication billing application considered in Huang et al. (1995) and develop discrete-time stochastic models to estimate the optimal software rejuvenation schedule. More precisely, two software availability models with rejuvenation are formulated via discrete semi-Markov processes, and the optimal software rejuvenation schedules which maximize the steady-state availabilities are derived analytically. Further, we develop statistically non-parametric algorithms to estimate the optimal software rejuvenation schedules, provided that the complete sample data of failure times are given. To this end, a new statistical device, called the discrete total time on test statistics, is introduced. Finally, we examine the asymptotic properties of the statistical estimation algorithms proposed in this paper through a simulation experiment.
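The idea of estimating a rejuvenation schedule directly from a complete sample of failure times can be sketched with a much simpler model than the paper's semi-Markov formulation. The sketch assumes each cycle ends either in a failure (if the failure time is at most `n`) followed by a repair of length `repair`, or in a rejuvenation at age `n` of length `rejuv`. It maximizes the empirical renewal-reward availability by brute force and does not use the total time on test statistics. All names and numbers are hypothetical.

```python
from fractions import Fraction


def availability(failure_times, n, repair, rejuv):
    """Empirical steady-state availability when rejuvenating at age n.

    Renewal-reward ratio: mean up time per cycle / mean cycle length,
    with the failure-time distribution replaced by the sample.
    """
    m = len(failure_times)
    mean_up = Fraction(sum(min(x, n) for x in failure_times), m)
    p_fail = Fraction(sum(1 for x in failure_times if x <= n), m)
    mean_down = repair * p_fail + rejuv * (1 - p_fail)
    return mean_up / (mean_up + mean_down)


def best_schedule(failure_times, repair, rejuv):
    """Rejuvenation age in 1..max(sample) with the highest empirical availability."""
    candidates = range(1, max(failure_times) + 1)
    return max(candidates, key=lambda n: availability(failure_times, n, repair, rejuv))


sample = [5, 6, 6, 7, 8]  # made-up complete sample of discrete failure times
n_star = best_schedule(sample, repair=10, rejuv=1)
print(n_star, availability(sample, n_star, repair=10, rejuv=1))  # 4 4/5
```

With this sample, rejuvenating at age 4, just before the earliest observed failure, avoids every costly repair, which is why it beats letting the system run to failure (availability 16/41).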
Hiroshi ISHII Hiroaki NISHIKAWA Yuji INOUE
This paper discusses and clarifies the effectiveness of a data-driven implementation of a protocol handling system for accessing TINA (Telecommunications Information Networking Architecture) networks and the Internet. TINA is a networking architecture that provides networking services and management ubiquitously for users and networks. Many TINA-related ACTS (Advanced Communication Technologies and Services) projects have been organized in Europe. In Japan, The TINA Trial (TTT), which aimed to achieve ATM network management and services based on TINA architectures, was conducted by NTT and several manufacturers from April 1997 to April 1999. In these studies and trials, much effort has been devoted to developing software based on the service architecture and network architecture being standardized in TINA-C (TINA Consortium). To achieve a TINA environment universally on both the customer and network sides, we have to consider how to deploy the TINA environment onto the user side and how to use access transmission capacity as efficiently as possible. Recent technology, e.g., Java, makes it easy to download applications and environments from the network side to the user side. In accessing the network, there are several possible bottlenecks in information exchange on the customer side, such as PC processing capability, access protocol handling capability, and intra-house wiring bandwidth. In parallel with the TINA software architecture study, the authors have been studying versatile requirements for the hardware platform of a TINA network. In those studies, we have clarified that the stream-oriented data-driven processor the authors have been studying and developing offers high reliability and high multiprocessing and multimedia information processing capability. Based on these studies, this paper first shows, through mathematical and emulation studies, that a von Neumann-based protocol handler is ineffective for multiprocessing. Then, we show by an emulation study that our data-driven protocol handling can effectively realize access protocol handling. Next, we describe the result of the first step of implementing data-driven TCP/IP protocol handling. This result demonstrates that our TCP/IP hub based on a data-driven processor is applicable not only to TINA/CORBA networks but also to normal Internet access. Finally, we show a possible customer premises network configuration that resolves the bottleneck in accessing a TINA network through ATM access.
Hassan HAJJI Behrouz Homayoun FAR
This paper discusses a framework for automating fault management using distributed software agents. The management function is distributed among multiple agents that can carry out advanced reasoning activities on the network domain. Network domain modeling using Bayesian networks is introduced. The agent detects and correlates the alarms generated in its domain and selectively seeks to derive a clear explanation of them. Depending on the network's degree of automation, the agent can even carry out local recovery actions. The ideas of the paper are implemented in software for inference in Bayesian networks. We identify the potential for learning in the agent model, and present the class of problems to be addressed.
Kiyohito YOSHIHARA Gen HATTORI Keizo SUGIYAMA Sadao OBANA
For backup of failed VPs (Virtual Paths) in ATM (Asynchronous Transfer Mode) networks, many self-healing algorithms have already been proposed. However, since the existing algorithms recover each failed VP with a single backup VP, they cannot necessarily give a failed VP with a higher recovery priority a larger recovery ratio, which is the ratio of the bandwidth of a backup VP to that of the failed VP. To solve this problem, this paper proposes a new self-healing algorithm which recovers each failed VP with one or more backup VPs. We also evaluate its availability by comparing it with an existing algorithm through simulations.
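The effect of allowing several backup VPs per failed VP can be sketched with a simple greedy allocation. This is not the paper's self-healing algorithm: it assumes each candidate backup route has its own independent pool of spare bandwidth (real routes share links), that a larger `priority` value means higher recovery priority, and that bandwidths are positive integers. `FailedVP`, `recover`, and the route names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FailedVP:
    name: str
    bandwidth: int     # bandwidth of the failed VP (must be positive)
    priority: int      # larger value = higher recovery priority (assumed convention)
    routes: List[str]  # candidate backup routes, hypothetical identifiers


def recover(failed: List[FailedVP], spare: Dict[str, int]) -> Dict[str, float]:
    """Greedily assign spare bandwidth, highest priority first.

    A failed VP may draw on several routes, so its recovery ratio is the
    total backup bandwidth obtained divided by its original bandwidth.
    """
    spare = dict(spare)  # work on a copy so the caller's data is untouched
    ratios: Dict[str, float] = {}
    for vp in sorted(failed, key=lambda v: v.priority, reverse=True):
        need = vp.bandwidth
        for route in vp.routes:
            take = min(need, spare.get(route, 0))
            spare[route] = spare.get(route, 0) - take
            need -= take
            if need == 0:
                break
        ratios[vp.name] = (vp.bandwidth - need) / vp.bandwidth
    return ratios


failed = [
    FailedVP("low", bandwidth=8, priority=1, routes=["r1"]),
    FailedVP("high", bandwidth=10, priority=2, routes=["r1", "r2"]),
]
print(recover(failed, {"r1": 6, "r2": 5}))  # {'high': 1.0, 'low': 0.0}
```

With a single backup VP the high-priority VP could recover at most 6 of its 10 units over `r1`; splitting across `r1` and `r2` raises its recovery ratio to 1.0, at the cost of the low-priority VP in this example.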
Takumi MORI Kohei OHTA Nei KATO Hideaki SONE Glenn MANSFIELD Yoshiaki NEMOTO
Network traffic contains many symptoms of various network faults. Symptoms of faults aggregate and are manifested in the aggregate traffic characteristics generally observed by a traffic monitor. It is very difficult for a manager or an NMS (Network Management Station) to isolate the symptoms manifested in the aggregate traffic characteristics. Transit networks in particular, such as backbone networks, deal with many types of traffic, so symptom isolation must be efficient. In this paper, we propose a powerful algorithm for symptom isolation, based on the popular SNMP-based RMON technology. Using dynamically constructed aggregates, fresh symptoms can be isolated efficiently. We apply the algorithm to two operational transit networks which connect several LANs and WANs, and evaluate it using trace data collected from these networks. The results show a significant improvement in fault management capability and accuracy. Furthermore, the characteristics of fault symptoms and the various factors for effective system configuration are discussed.
A new scheme based on hierarchical information organization and situation awareness is proposed to support the network manager in failure localization. This paper adapts situation theory to the needs of fault management in order to model states and events. As a result, the proposed information model includes four fault management viewpoints to support situational, functional, logical and physical analysis within the respective networks. Object-oriented analysis is applied to construct the information model. The correlation of network situations is derived by description logic. The proposed classification algorithm is applied to solve the situation awareness problem. With this proposal, the correlation performance is enhanced to logarithmic order.
Hiroshi ISHII Hiroaki NISHIKAWA Yuji INOUE
This paper describes the effectiveness of a stream-oriented data-driven scheme for achieving autonomous fault management of hyper-distributed systems such as networks based on the Telecommunications Information Networking Architecture (TINA). TINA, whose specifications are being finalized within the TINA Consortium, aims at achieving interoperability and reusability of telecom application software and independence from underlying technologies. However, to actually implement a TINA network, it is essential to consider the technology constraints. In particular, autonomous fault management at runtime is crucial for a distributed network environment because centralized control using global information is very difficult. So far, much work has been done on so-called off-line management, but runtime management of service failures seems immature. This paper proposes introducing stream-oriented data-driven processors for autonomous fault management at runtime in a TINA-based distributed network environment. It examines the features of distributed network applications and the technology requirements for achieving fault management of those distributed applications, such as effective multiprocessing of surveillance, testing, and reconfiguration in addition to ordinary processing.