Reliability

Continuing to work correctly, even when things go wrong.

  1. The application performs the function that the user expected.
  2. It can tolerate the user making mistakes or using the software in unexpected ways.
  3. Its performance is good enough for the required use case, under the expected load and data volume.
  4. The system prevents any unauthorized access and abuse.

Fault vs Failure

  • Fault: one component of the system deviating from its spec.
  • Failure: the system as a whole stops providing the required service to the user.

Systems that anticipate faults and can cope with them are called fault-tolerant or resilient.

Types of faults

  1. Hardware faults: disk crashes, faulty RAM, power outages, lost network connections. - Add redundancy.
  2. Software errors. - Thorough testing; process isolation; letting processes crash and restart; measuring, monitoring, and analyzing system behavior in production.
  3. Human errors. - Minimize opportunities for error.
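A minimal sketch of tolerating a hardware fault through redundancy (the replica functions here are hypothetical stand-ins for real storage nodes): if one replica fails, the request is retried on another, so a fault in a single component does not become a failure of the whole system.

```python
def read_with_failover(replicas, key):
    """Try each replica in turn; only fail if all replicas fail."""
    last_error = None
    for replica in replicas:
        try:
            return replica(key)
        except IOError as err:   # a fault: one component deviating from its spec
            last_error = err     # tolerate it by trying the next replica
    raise last_error             # all replicas down: the fault becomes a failure

# Hypothetical replicas for illustration:
def broken(key):
    raise IOError("disk crashed")

def healthy(key):
    return {"answer": 42}[key]

print(read_with_failover([broken, healthy], "answer"))  # 42
```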


Scalability

A system’s ability to cope with increased load.

Describing Load


Load parameters - choice depends on the architecture of the system.

  • requests per second to a web server
  • the ratio of reads to writes in a database
  • the number of simultaneously active users in a chat room
  • the hit rate on a cache

Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.
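As a rough sketch, several of the load parameters above can be computed from a request log; the log format and numbers here are invented for illustration.

```python
from collections import Counter

# Hypothetical request log: (timestamp_seconds, operation, cache_hit)
log = [
    (0, "read", True), (0, "read", False), (0, "write", False),
    (1, "read", True), (1, "write", False), (1, "read", True),
]

duration_seconds = 2
ops = Counter(op for _, op, _ in log)

requests_per_second = len(log) / duration_seconds          # load on the web server
read_write_ratio = ops["read"] / ops["write"]              # load on the database
cache_hit_rate = sum(1 for _, _, hit in log if hit) / len(log)

print(requests_per_second)  # 3.0
print(read_write_ratio)     # 2.0
print(cache_hit_rate)       # 0.5
```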

Describing Performance

  • Throughput: the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.
  • Response time: the time between a client sending a request and receiving a response.

  • We need to think of response time not as a single number, but as a distribution of values that you can measure.
  • It is important to measure response times on the client side.
  • The median is a good metric if you want to know how long users typically have to wait.
  • High percentiles of response times, also known as tail latencies, are important because they directly affect users’ experience of the service.
  • The right way of aggregating response time data from several machines is to add the histograms, not to average the percentiles.
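The points above can be sketched with a naive nearest-rank percentile over raw samples (all numbers invented; production systems would use a streaming histogram such as HdrHistogram or t-digest rather than sorting every sample):

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of response times."""
    ranked = sorted(samples)
    index = min(len(ranked) - 1, int(p / 100 * len(ranked)))
    return ranked[index]

response_times = [12, 15, 15, 18, 20, 22, 25, 30, 95, 240]  # ms
print(percentile(response_times, 50))   # median: what a typical user sees
print(percentile(response_times, 99))   # tail latency: what the unluckiest see

# Aggregating across machines: merge the underlying distributions
# (add the histograms), don't average each machine's percentiles.
machine_a = [10, 11, 12, 13, 500]
machine_b = [10, 10, 11, 12, 13]
merged_p99 = percentile(machine_a + machine_b, 99)                      # correct
wrong_p99 = (percentile(machine_a, 99) + percentile(machine_b, 99)) / 2  # misleading
```

Averaging hides the tail: the merged distribution's p99 is dominated by machine_a's slow outlier, while the averaged value understates it.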

the customers with the slowest requests are often those who have the most data on their accounts because they have made many purchases—that is, they’re the most valuable customers.

Coping with Load

  • scaling up / vertical scaling / moving to a more powerful machine.
  • scaling out / horizontal scaling / distributing the load across multiple smaller machines.

Distributing load across multiple machines is also known as a shared-nothing architecture.

Elastic systems can automatically add computing resources when they detect a load increase. They are useful if load is highly unpredictable.
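A minimal sketch of an elastic scaling rule, assuming a hypothetical target of requests per second that each machine can comfortably serve (the function name and threshold are made up for illustration):

```python
import math

def desired_machines(current_machines, requests_per_second,
                     target_rps_per_machine=1000):
    """Scale out when load per machine exceeds the target, in when it falls."""
    needed = math.ceil(requests_per_second / target_rps_per_machine)
    return max(1, needed)  # never scale below one machine

print(desired_machines(4, 3500))  # 4: current capacity is sufficient
print(desired_machines(4, 5200))  # 6: load spike triggers scale-out
```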

Common wisdom until recently was to keep your database on a single node (scale up) until scaling cost or high-availability requirements forced you to make it distributed. It is conceivable that distributed data systems will become the default in the future.


Maintainability

  1. Operability - Making Life Easy for Operations.
    • Make it easy for operations teams to keep the system running smoothly.
  2. Simplicity - Managing Complexity.
    • Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)
  3. Evolvability - Making Change Easy.
    • Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.