Very technical, but lacks informative analysis. The paper seems to promote adaptive fault tolerance entirely and does not discuss issues or disadvantages that might come out of this approach (it does mention the disadvantages of how fault-tolerant systems are built today, though). The discussions of the representative systems using the proposed fault-tolerant "model" are very detailed but, again, do not contain much analysis. They read more like a statement of each system's features and how it supports the model, and neglect to discuss any potential shortcomings. Although this is not part of the thesis of the paper, it might help (perhaps during the presentation) to briefly give some basic background on CORBA (which is mentioned in Electra, AFTM, and Proteus) and why it is prevalent in fault-tolerant distributed systems, to familiarize those who do not know much about it.
--------
I thought this was a very well-written paper. The topic is relevant and important. However, it seemed to me that the authors didn't present much new material; it read more as a summary of existing implementations. While I thought it was good that they discussed the challenges involved in designing these systems, they didn't seem to propose any solutions of their own. However, this was a survey paper. They discussed some key ideas and evaluated current fault-tolerance systems very well, and all in all I thought this was a well-organized and good paper.
--------
The approach was great, but I would have liked to see some algorithm development, especially for group synchronization. Also, the initial approach up to section #4 is too general. I think a more technical explanation is needed. Section #4 (Systems) was great; Electra was very easy to understand and gave more insight into the authors' model. Although the separate models were described in great detail, there still seemed to be a lack of continuity between them; it read somewhat like a list, though the detail was great.
I would have liked more transitions, comparisons of the different systems, and a longer, more representative conclusion.
--------
This paper presented an interesting topic, as distributed computing plays an increasingly important role with the Internet. Good job detailing the systems and describing the various critical aspects of distributed computing. It might be better to also discuss the cross-cutting issues between the systems, though. The systems seemed to be presented completely separately from each other. Could different techniques from the various systems be used together for a better system? A little more on basic adaptive techniques might be good too. For instance, what happens in a failure? Does the system transparently adapt to the failure, or does it alert the applications? As for writing style, shorter sentences and active voice would make it easier to read.
--------
This paper is well structured and contains much relevant information. The authors concisely summarized the approaches to fault tolerance they researched without giving too many extraneous details. I found, however, that some of the explanations and definitions given in this paper did not give me a complete understanding of the topic described. As I have very little prior knowledge of this topic (only the papers we have read in this class), I found some of the definitions given early in the paper a little frustrating. For example, after reading the introduction, I still did not have a very good sense of what the main concept of an adaptive policy for fault tolerance is. While the explanations later in the paper made this clearer, it was a little frustrating not to have an idea of the direction of the paper as I read it. I think this paper might have benefited from a lengthier, simplified explanation of some of the key topics at the start.
Often, I find it helpful to have someone with a minimal understanding of the topic I am describing edit my papers and point out weak points. But I don't want to dwell on the negative aspects of this paper! The descriptions of the different fault-tolerant systems cleared up many of my misunderstandings of the paper. The analyses were coherent and logical and flowed well from the first part of the paper. Some comparisons between the different systems might have been nice.
--------
The list of topics in the adaptive fault-tolerant model seems very complete. The summaries provide good coverage of system details. The Chameleon system summary could be trimmed a little to match the amount of information provided on the other systems. It would help to have some comparison of the various approaches. How do they relate? Could they be used together, or would they interfere? The overuse of acronyms should be kept in check. In the conclusion, "Adaptive fault tolerance" starts a paragraph; it seems unwieldy to use AFT in the next sentence. It is unclear what an RTO.k object is.
--------
Adaptive Fault Tolerance in Distributed Systems, by Bharath, Dumas, Kurul
From section 2, the problem statement, I think the authors try to make their paper interesting by first defining the properties of an adaptive fault-tolerant model and then analyzing four exemplary systems, focusing on the properties they have observed. In my opinion, only the last example, Chameleon, provides a good explanation in terms of these properties. The others focus so much on system details that it prevents me from seeing the connection between the properties and the system itself.
--------
Should use a diagram to explain the AFT model. The figures in the Group Agreement section are unclear.
--------
Fault tolerance is an important topic in operating systems. A distributed environment makes it more challenging. The traditional approach to fault tolerance is duplication, which is very costly.
The adaptive approach is intended to utilize resources more efficiently. In section 3 of this paper, the authors present the problems and decisions involved in architecting an adaptive fault-tolerant system. This section is very clear. The authors then study current systems: four systems are described in section 4, and these systems are representative. The figures in the paper are hard to read. The description of the systems in section 4 is not well organized. It would be better if the authors explicitly compared these systems, especially the decisions each made on those important problems and the reasons behind those decisions. I chose the scores for the following reasons:
Important: 6. A fault-tolerant system is very desirable for large systems, and the adaptive approach seems promising.
Novelty: 4. This is basically a description and comparison paper.
Quality: 4. Section 4 is not well organized for each system. The figures are hard to read.
Overall: 5. For a survey paper, it is well done.
--------
This paper defines adaptive fault tolerance and its motivation, defines the characteristics of such a system, and describes some current implementations. The motivation for adaptive fault tolerance is that static allocation of resources to provide fault tolerance is very expensive, in that the system is always prepared for some "worst" case. Adaptive fault tolerance can provide more efficient use of resources. The characteristics of an AFT system are the timing it supports (real-time or not), how resources are replicated, how replicated processes are grouped, how group members communicate, and how faults are detected and dealt with transparently. Current AFT systems built on top of CORBA include Electra, which defines object groups to provide structure for redundancy; AFTM, which provides real-time support and uses a highly componentized architecture; and Proteus, which allows the user to dynamically control the redundancy configuration.
One thing I thought was good about this paper was how it brings in CORBA technology, which is fairly mainstream. I have some experience with CORBA, so I can understand the problem better. I also thought the environmental awareness part (3.9) was very interesting, but too short. I would have liked another paragraph or two. I am not clear on the degree to which existing systems do this. The wording in most of the paper is good. Most of it seems well polished. The paper is also very well organized. I knew which way the paper was going from the first time through it. A couple of things gave me some trouble. The second paragraph of section 3.5 confused me; I could have used a little more detail about the "majority voter" concept. Also, I had some minor trouble with the wording in the first half of section 3. Going over this part one more time would be beneficial. Lastly, I couldn't make out most of the text in the diagrams. It looked like the text was gray. Black would improve reproducibility.
--------
This was a very interesting paper. A few things I thought the paper could use were better analysis of the described systems and of their relevance to the adaptive model. Also, I found the included pictures hard to read, and references 1-3 were too vague (a URL would be useful).
--------
1) The paper provides a good summary of the adaptive fault tolerance model. I found the material interesting to read, especially since I hadn't read much on adaptive fault tolerance.
2) I would like an answer to this question: do we require adaptive fault tolerance to be a part of the OS? It can be provided through a layer of abstraction on top of the OS, and the OS need not even be aware of it. In fact, one of the examples they use, Chameleon, is also used to detect and recover from faults in the OS.
3) A lot of research is ongoing on adaptive fault tolerance, and the paper provides no insight into it, such as Wolfpack, used by Microsoft, or FRIENDS, which provides a reflection-oriented architecture for metaobjects or fault tolerance.
4) The important feature of Chameleon is the implementation flexibility that armors allow. For example, if checkpointing is not needed for a specific user application, the checkpointing armor need not be present. The paper fails to highlight this important point.
5) The paper talks about Electra with no reference to Piranha, which addresses the issue of service availability in distributed applications by using a sophisticated ORB that provides failure detection.
6) I don't find any reference to software-based approaches such as Delta-4, Isis, Horus, and Totem.
7) The paper fails to highlight the problems with the group communication paradigm. A process pair approach (Tandem's)
8) An adaptive design strategy can take into account available resources, deadlines, and observed faults, and notify the on-line scheduling mechanism about relative instances of tasks, their timing requirements, and both their worst-case and active usage of resources. The paper fails to highlight this aspect.
9) There is no information about how timing requirements are met in adaptive fault tolerance.
--------
It is not explained in 4.2 why the AFTM system must be real-time to manage resources. Perhaps this could be expanded to explain why the timing is necessary. Please label the figures, as they were confusing. Perhaps section 2.2 could be incorporated into section 2.3?