Portal: Computer architecture

In computer engineering, computer architecture is the conceptual design and fundamental operational structure of a computer system.

It is a blueprint and functional description of requirements (especially speeds and interconnections) and design implementations for the various parts of a computer — focusing largely on the way in which the central processing unit (CPU) operates internally and accesses addresses in memory.

We should learn about the most common ways to transform simple streams of electrons and 'holes' into the picture you see on your monitor and the sound you hear in the background while reading this, and about how all these parts are interconnected and what makes them work.

We should also mention the kinds of devices that enable your computer to connect to a server halfway across the planet (devices that would have a tough time playing you a song, but exist just to route network traffic and are mighty good at it), the ones that control industrial machinery, and so on.

Computer architecture

In computer engineering, computer architecture is a set of rules and methods that describe the functionality, organization, and implementation of computer systems.

When building the computer Z1 in 1936, Konrad Zuse described in two patent applications for his future projects that machine instructions could be stored in the same storage used for data, i.e., the stored-program concept.

To describe the level of detail for discussing the luxuriously embellished computer, he noted that his description of formats, instruction types, hardware parameters, and speed enhancements was at the level of “system architecture” – a term that seemed more useful than “machine organization.”[7]

Computer architecture, like other architecture, is the art of determining the needs of the user of a structure and then designing to meet those needs as effectively as possible within economic and technological constraints.

Computer architecture prototypes were physically built in the form of a transistor–transistor logic (TTL) computer—such as the prototypes of the 6800 and the PA-RISC—tested, and tweaked, before committing to the final hardware form.

The purpose is to design a computer that maximizes performance while keeping power consumption in check, costs low relative to the amount of expected performance, and is also very reliable.

A good ISA compromises between programmer convenience (how easy the code is to understand), size of the code (how much code is required to do a specific action), cost of the computer to interpret the instructions (more complexity means more hardware needed to decode and execute the instructions), and speed of the computer (with more complex decoding hardware comes longer decode time).

For example, a computer capable of running a virtual machine needs virtual memory hardware so that the memory of different virtual computers can be kept separated.

Computer architectures usually trade off standards, power versus performance, cost, memory capacity, latency (the time it takes for information to travel from its source to its destination) and throughput.

The 'instruction' in the standard measurements is not a count of the ISA's actual machine language instructions, but a unit of measurement, usually based on the speed of the VAX computer architecture.

Other factors influence speed, such as the mix of functional units, bus speeds, available memory, and the type and order of instructions in the programs.

Performance is affected by a very wide range of design choices — for example, pipelining a processor usually makes latency worse, but makes throughput better.
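To make that trade-off concrete, here is a small worked calculation (the five-stage split, 1 ns per stage, and 0.2 ns of pipeline-register overhead are assumed numbers for illustration, not figures from this article):

t_{\text{unpipelined}} = 5\,\text{ns}, \qquad \text{throughput} = 1/(5\,\text{ns}) = 0.2\ \text{instructions/ns}

t_{\text{pipelined}} = 5 \times (1\,\text{ns} + 0.2\,\text{ns}) = 6\,\text{ns}, \qquad \text{throughput} = 1/(1.2\,\text{ns}) \approx 0.83\ \text{instructions/ns}

Each instruction now takes longer end to end (6 ns instead of 5 ns), but one instruction can complete every 1.2 ns, roughly a fourfold gain in throughput.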

For example, computer-controlled anti-lock brakes must begin braking within a predictable, short time after the brake pedal is sensed, or else the brakes will fail.

Furthermore, designers may target and add special features to their products, through hardware or software, that permit a specific benchmark to execute quickly but don't offer similar advantages to general tasks.

Publicly released clock rates have increased only slowly over the past few years, compared with the vast leaps made in reducing power consumption and meeting the demand for miniaturization.

This has led to new demands for longer battery life and reduced size, as mobile technology is produced at an ever greater rate.

This change in focus from greater clock rates to power consumption and miniaturization can be seen in the significant reductions in power consumption, as much as 50%, that Intel reported with the release of its Haswell microarchitecture; it is clear that the focus of research and development is shifting away from clock rates and towards consuming less power and taking up less space.

Figure captions from the accompanying datapath illustrations:

Schematic diagram of a modern von Neumann processor, where the CPU is denoted by a shaded box (adapted from [Maf01]).

Register file: (a) block diagram, (b) implementation of two read ports, and (c) implementation of the write port.

Schematic high-level diagram of the MIPS datapath from an implementational perspective. Note that the execute step also includes writing of data back to the register file, which is not shown in the figure for simplicity.

Schematic diagram of a composite datapath for R-format and load/store instructions [MK98].

Schematic diagram of a composite datapath for R-format, load/store, and branch instructions [MK98].

Schematic diagram of the composite datapath for R-format, load/store, and branch instructions (from Figure 4.11) with control signals and an extra multiplexer for WriteReg signal generation [MK98].

Schematic diagram of the composite datapath for R-format, load/store, and branch instructions (from Figure 4.12) with control signals [MK98].

Schematic diagram of the composite datapath for R-format, load/store, branch, and jump instructions, with control signals, for the multicycle datapath finite-state control.

(a) Fetch and decode states of the multicycle datapath; (b) jump instruction-specific states of the multicycle datapath. Figure numbers refer to figures in the textbook [Pat98, MK98].

Average CPI for the MIPS multicycle datapath, including exception handling [MK98]:

CPI = (#Loads · 5 + #Stores · 4 + #ALU instructions · 4 + #Branches · 3 + #Jumps · 3) / (Total Number of Instructions)
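As a quick concrete check of the CPI formula above, here is a minimal C sketch (the instruction mix in main is an assumed example, not measured data):

#include <stdio.h>

/* Weighted-average CPI for the MIPS multicycle datapath, using the
 * per-class cycle counts from the formula above (loads: 5, stores: 4,
 * ALU: 4, branches: 3, jumps: 3). */
static double multicycle_cpi(long loads, long stores, long alu,
                             long branches, long jumps)
{
    long total  = loads + stores + alu + branches + jumps;
    long cycles = 5 * loads + 4 * stores + 4 * alu + 3 * branches + 3 * jumps;
    return (double)cycles / (double)total;
}

int main(void)
{
    /* Assumed workload: 25% loads, 10% stores, 45% ALU, 15% branches,
     * 5% jumps.  Prints CPI = 4.05. */
    printf("CPI = %.2f\n", multicycle_cpi(25, 10, 45, 15, 5));
    return 0;
}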

Central processing unit

A central processing unit (CPU) is the electronic circuitry within a computer that carries out the instructions of a computer program by performing the basic arithmetic, logical, control and input/output (I/O) operations specified by the instructions.

Traditionally, the term 'CPU' refers to a processor, more specifically to its processing unit and control unit (CU), distinguishing these core elements of a computer from external components such as main memory and I/O circuitry.[2]

Principal components of a CPU include the arithmetic logic unit (ALU) that performs arithmetic and logic operations, processor registers that supply operands to the ALU and store the results of ALU operations, and a control unit that orchestrates the fetching (from memory) and execution of instructions by directing the coordinated operations of the ALU, registers and other components.

Since the term 'CPU' is generally defined as a device for software (computer program) execution, the earliest devices that could rightly be called CPUs came with the advent of the stored-program computer.

Both the miniaturization and standardization of CPUs have increased the presence of digital devices in modern life far beyond the limited application of dedicated computing machines.

While von Neumann is most often credited with the design of the stored-program computer because of his design of EDVAC, and the design became known as the von Neumann architecture, others before him, such as Konrad Zuse, had suggested and implemented similar ideas.[16]

The key difference between the von Neumann and Harvard architectures is that the latter separates the storage and treatment of CPU instructions and data, while the former uses the same memory space for both.[20]

Aside from facilitating increased reliability and lower power consumption, transistors also allowed CPUs to operate at much higher speeds because of the short switching time of a transistor in comparison to a tube or relay.[31]

With the increased reliability and dramatically increased speed of the switching elements (which were almost exclusively transistors by this time), CPU clock rates in the tens of megahertz were easily obtained during this period.[32]

Lee Boysel published influential articles, including a 1967 'manifesto', which described how to build the equivalent of a 32-bit mainframe computer from a relatively small number of large-scale integration (LSI) circuits.[37][38]

Since the introduction of the first commercially available microprocessor, the Intel 4004 in 1971, and the first widely used microprocessor, the Intel 8080 in 1974, this class of CPUs has almost completely overtaken all other central processing unit implementation methods.

Mainframe and minicomputer manufacturers of the time launched proprietary IC development programs to upgrade their older computer architectures, and eventually produced instruction set compatible microprocessors that were backward-compatible with their older hardware and software.

The overall smaller CPU size, as a result of being implemented on a single die, means faster switching time because of physical factors like decreased gate parasitic capacitance.[46][47]

These newer concerns are among the many factors causing researchers to investigate new methods of computing such as the quantum computer, as well as to expand the usage of parallelism and other methods that extend the usefulness of the classical von Neumann model.

After the execution of an instruction, the entire process repeats, with the next instruction cycle normally fetching the next-in-sequence instruction because of the incremented value in the program counter.

Such instructions are generally called 'jumps' and facilitate program behavior like loops, conditional program execution (through the use of a conditional jump), and existence of functions.[c]

Often, one group of bits (that is, a 'field') within the instruction, called the opcode, indicates which operation is to be performed, while the remaining fields usually provide supplemental information required for the operation, such as the operands.

Those operands may be specified as a constant value (called an immediate value), or as the location of a value that may be a processor register or a memory address, as determined by some addressing mode.
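As a concrete illustration of the opcode and operand fields, here is a C sketch that decodes the fixed fields of a 32-bit MIPS-style instruction word (the field positions follow the classic MIPS encoding; the sample word is just an example):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t instr = 0x012A4020;             /* MIPS "add $t0, $t1, $t2" */

    uint32_t opcode = (instr >> 26) & 0x3F;  /* which operation          */
    uint32_t rs     = (instr >> 21) & 0x1F;  /* first source register    */
    uint32_t rt     = (instr >> 16) & 0x1F;  /* second source register   */
    uint32_t rd     = (instr >> 11) & 0x1F;  /* destination register     */
    uint32_t funct  = instr & 0x3F;          /* ALU sub-operation        */

    /* Prints: opcode=0 rs=9 rt=10 rd=8 funct=32 */
    printf("opcode=%u rs=%u rt=%u rd=%u funct=%u\n",
           opcode, rs, rt, rd, funct);
    return 0;
}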

For example, if an addition instruction is to be executed, the arithmetic logic unit (ALU) inputs are connected to a pair of operand sources (numbers to be summed), the ALU is configured to perform an addition operation so that the sum of its operand inputs will appear at its output, and the ALU output is connected to storage (e.g., a register or memory) that will receive the sum.

When the clock pulse occurs, the sum will be transferred to storage and, if the resulting sum is too large (i.e., it is larger than the ALU's output word size), an arithmetic overflow flag will be set.
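A small C sketch of the overflow rule just described, assuming 32-bit two's-complement operands (the operand values are arbitrary): overflow occurs when both inputs have the same sign but the wrapped sum's sign differs.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static int32_t alu_add(int32_t a, int32_t b, bool *overflow)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;   /* wraps, as the hardware does */
    *overflow = ((a >= 0) == (b >= 0)) && (((int32_t)sum >= 0) != (a >= 0));
    return (int32_t)sum;
}

int main(void)
{
    bool v;
    int32_t r = alu_add(2000000000, 2000000000, &v);
    printf("result=%d overflow=%d\n", r, v);    /* wraps negative, flag set */
    return 0;
}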

A complete machine language instruction consists of an opcode and, in many cases, additional bits that specify arguments for the operation (for example, the numbers to be summed in the case of an addition operation).

Besides the instructions for integer mathematics and logic operations, various other machine instructions exist, such as those for loading data from memory and storing it back, branching operations, and mathematical operations on floating-point numbers performed by the CPU's floating-point unit (FPU).[55]

The result consists of both a data word, which may be stored in a register or memory, and status information that is typically stored in a special, internal CPU register reserved for this purpose.

Most high-end microprocessors (in desktop, laptop, server computers) have a memory management unit, translating logical addresses into physical RAM addresses, providing memory protection and paging abilities, useful for virtual memory.

By setting the clock period to a value well above the worst-case propagation delay, it is possible to design the entire CPU and the way it moves data around the 'edges' of the rising and falling clock signal.

One method of dealing with the switching of unneeded components is called clock gating, which involves turning off the clock signal to unneeded components (effectively disabling them).

While removing the global clock signal makes the design process considerably more complex in many ways, asynchronous (or clockless) designs carry marked advantages in power consumption and heat dissipation in comparison with similar synchronous designs.

Rather than totally removing the clock signal, some CPU designs allow certain portions of the device to be asynchronous, such as using asynchronous ALUs in conjunction with superscalar pipelining to achieve some arithmetic performance gains.

While it is not altogether clear whether totally asynchronous designs can perform at a comparable or better level than their synchronous counterparts, it is evident that they do at least excel in simpler math operations.

For example, some early digital computers represented numbers as familiar decimal (base 10) numeral system values, and others have employed more unusual representations such as ternary (base three).

In the case of a binary CPU, this is measured by the number of bits (significant digits of a binary encoded integer) that the CPU can process in one operation, which is commonly called word size, bit width, data path width, integer precision, or integer size.

As a result, smaller 4- or 8-bit microcontrollers are commonly used in modern applications even though CPUs with much larger word sizes (such as 16, 32, 64, even 128-bit) are available.

For example, even though the IBM System/360 instruction set was a 32-bit instruction set, the System/360 Model 30 and Model 40 had 8-bit data paths in the arithmetic logical unit, so that a 32-bit add required four cycles, one for each 8 bits of the operands. Similarly, even though the Motorola 68000 series instruction set was a 32-bit instruction set, the Motorola 68000 and Motorola 68010 had 16-bit data paths in the arithmetic logical unit, so that a 32-bit add required two cycles.
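The following C sketch mimics that behavior in software: a 32-bit add carried out one byte at a time through an 8-bit-wide "ALU", taking four passes just as the Model 30 took four cycles (the loop iterations stand in for hardware cycles; this is an illustration, not the machine's actual microcode):

#include <stdint.h>
#include <stdio.h>

static uint32_t add32_via_8bit_alu(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    unsigned carry = 0;
    for (int byte = 0; byte < 4; byte++) {            /* one "cycle" per byte */
        unsigned pa  = (a >> (8 * byte)) & 0xFF;
        unsigned pb  = (b >> (8 * byte)) & 0xFF;
        unsigned sum = pa + pb + carry;
        carry = sum >> 8;                             /* carry into next byte */
        result |= (uint32_t)(sum & 0xFF) << (8 * byte);
    }
    return result;
}

int main(void)
{
    printf("0x%08X\n", add32_via_8bit_alu(0x0001FFFFu, 0x00000001u)); /* 0x00020000 */
    return 0;
}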

To gain some of the advantages afforded by both lower and higher bit lengths, many instruction sets have different bit widths for integer and floating-point data, allowing CPUs implementing that instruction set to have different bit widths for different portions of the device.

For example, the IBM System/360 instruction set was primarily 32 bit, but supported 64-bit floating point values to facilitate greater accuracy and range in floating point numbers.[27]

Many later CPU designs use similar mixed bit width, especially when the processor is meant for general-purpose usage where a reasonable balance of integer and floating point capability is required.

This design, wherein the CPU's execution resources can operate on only one instruction at a time, can only possibly reach scalar performance (one instruction per clock cycle, IPC = 1).

Designs that are said to be superscalar include a long instruction pipeline and multiple identical execution units, such as load-store units, arithmetic-logic units, floating-point units and address generation units.[59]

The dispatcher needs to be able to quickly and correctly determine whether instructions can be executed in parallel, as well as dispatch them in such a way as to keep as many execution units busy as possible.

It also makes hazard-avoiding techniques like branch prediction, speculative execution, register renaming, out-of-order execution and transactional memory crucial to maintaining high levels of performance.

By attempting to predict which branch (or path) a conditional instruction will take, the CPU can minimize the number of times that the entire pipeline must wait until a conditional instruction is completed.
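A common textbook mechanism for this is the 2-bit saturating counter; the C sketch below shows a single predictor entry (this is the generic scheme, not the predictor of any particular CPU, and the branch-outcome history is invented):

#include <stdio.h>

/* States 0..3: 0-1 predict "not taken", 2-3 predict "taken".
 * Two consecutive mispredictions are needed to flip the prediction. */
static int counter = 1;                           /* start "weakly not taken" */

static int predict_taken(void) { return counter >= 2; }

static void train(int actually_taken)
{
    if (actually_taken) { if (counter < 3) counter++; }
    else                { if (counter > 0) counter--; }
}

int main(void)
{
    int history[] = {1, 1, 0, 1, 1, 1};           /* assumed branch outcomes */
    for (int i = 0; i < 6; i++) {
        printf("predict=%d actual=%d\n", predict_taken(), history[i]);
        train(history[i]);
    }
    return 0;
}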

Also, in the case of single instruction stream, multiple data stream—a case when a lot of data of the same type has to be processed—modern processors can disable parts of the pipeline so that when a single instruction is executed many times, the CPU skips the fetch and decode phases and thus greatly increases performance on certain occasions, especially in highly repetitive program engines such as video creation software and photo processing.

Intel's successor to the P5 architecture, P6, added superscalar capabilities to its floating point features, and therefore afforded a significant increase in floating point instruction performance.

Both simple pipelining and superscalar design increase a CPU's ILP by allowing a single processor to complete execution of instructions at rates surpassing one instruction per clock cycle.[i]

The strategy of the very long instruction word (VLIW) causes some ILP to become implied directly by the software, reducing the amount of work the CPU must perform to boost ILP and thereby reducing the design's complexity.

For several decades from the 1970s to early 2000s, the focus in designing high performance general purpose CPUs was largely on achieving high ILP through technologies such as pipelining, caches, superscalar execution, out-of-order execution, etc.

By the early 2000s, CPU designers were thwarted from achieving higher performance from ILP techniques due to the growing disparity between CPU operating frequencies and main memory operating frequencies as well as escalating CPU power dissipation owing to more esoteric ILP techniques.

CPU designers then borrowed ideas from commercial computing markets such as transaction processing, where the aggregate performance of multiple programs, also known as throughput computing, was more important than the performance of a single thread or process.

Later designs in several processor families exhibit CMP, including the x86-64 Opteron and Athlon 64 X2, the SPARC UltraSPARC T1, IBM POWER4 and POWER5, as well as several video game console CPUs like the Xbox 360's triple-core PowerPC design, and the PlayStation 3's Cell microprocessor with its one general-purpose core and seven active coprocessor cores.

Using Flynn's taxonomy, these two schemes of dealing with data are generally referred to as single instruction stream, multiple data stream (SIMD) and single instruction stream, single data stream (SISD), respectively.

The great utility in creating processors that deal with vectors of data lies in optimizing tasks that tend to require the same operation (for example, a sum or a dot product) to be performed on a large set of data.

Whereas a scalar processor must complete the entire process of fetching, decoding and executing each instruction and value in a set of data, a vector processor can perform a single operation on a comparatively large set of data with one instruction.
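The following C sketch contrasts the two approaches: a scalar loop that issues one add per element versus a single packed add over four floats using the widely available x86 SSE intrinsics (assuming an SSE-capable CPU and compiler support; this illustrates the idea rather than any one vector architecture):

#include <stdio.h>
#include <immintrin.h>   /* x86 SSE intrinsics */

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4], d[4];

    for (int i = 0; i < 4; i++)          /* scalar: four separate adds  */
        c[i] = a[i] + b[i];

    __m128 va = _mm_loadu_ps(a);         /* vector: one packed add over */
    __m128 vb = _mm_loadu_ps(b);         /* all four elements at once   */
    _mm_storeu_ps(d, _mm_add_ps(va, vb));

    printf("%g %g %g %g\n", d[0], d[1], d[2], d[3]);   /* 11 22 33 44 */
    return 0;
}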

Shortly after inclusion of floating-point units started to become commonplace in general-purpose processors, specifications for and implementations of SIMD execution units also began to appear for general-purpose processors.

The performance or speed of a processor depends on, among many other factors, the clock rate (generally given in multiples of hertz) and the instructions per clock (IPC), which together are the factors for the instructions per second (IPS) that the CPU can perform.[66]

Many reported IPS values have represented 'peak' execution rates on artificial instruction sequences with few branches, whereas realistic workloads consist of a mix of instructions and applications, some of which take longer to execute than others.
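In symbols, with illustrative numbers (the 2 IPC and 3 GHz figures are assumed, not taken from this article):

\mathrm{IPS} = \mathrm{IPC} \times f_{\mathrm{clock}}, \qquad \text{e.g.}\quad 2 \times (3 \times 10^{9}\,\mathrm{Hz}) = 6 \times 10^{9}\ \text{instructions per second}.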

Because of these problems, various standardized tests, often called 'benchmarks' for this purpose‍—‌such as SPECint‍—‌have been developed to attempt to measure the real effective performance in commonly used applications.

Due to specific capabilities of modern CPUs, such as hyper-threading and uncore, which involve sharing of actual CPU resources while aiming at increased utilization, monitoring performance levels and hardware use gradually became a more complex task.[69]

Microprocessor Design/Computer Architecture

Reprogramming a computer meant changing hardware switches manually, which took a long time and invited errors.

As the data travels to different parts of the datapath, the command signals from the control unit cause the data to be manipulated in specific ways, according to the instruction.

Many DSPs are modified Harvard architectures, designed to simultaneously access three distinct memory areas: the program instructions, the signal data samples, and the filter coefficients (often called the P, X, and Y memories).

In theory, such three-way Harvard architectures can be three times as fast as a Von Neumann architecture that is forced to read the instruction, the data sample, and the filter coefficient, one at a time.
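To see why three simultaneous accesses help, consider one output sample of an FIR filter, sketched in C below: every tap needs a data sample (X memory), a coefficient (Y memory), and the instruction stream itself (P memory). The sample and coefficient values are assumed for illustration.

#include <stdio.h>

#define TAPS 4

int main(void)
{
    double x[TAPS] = {0.5, 1.0, -0.25, 0.75};   /* signal samples (X)       */
    double h[TAPS] = {0.25, 0.25, 0.25, 0.25};  /* filter coefficients (Y)  */
    double y = 0.0;

    for (int k = 0; k < TAPS; k++)   /* one multiply-accumulate per tap:    */
        y += h[k] * x[k];            /* instruction, sample, and coefficient
                                        each come from a different memory
                                        on a three-way Harvard DSP          */

    printf("y = %f\n", y);
    return 0;
}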

However, a modern feature called 'paging' allows the physical memory to be segmented into large blocks of memory called 'pages'.

CISC systems actually have 'complex instructions', in the sense that at least one instruction takes a long time to execute -- for example, the 'double indirect' addressing mode inherently requires two memory cycles to execute, and a few CPUs have a 'string copy' instruction that may require hundreds of memory cycles to execute.

Other ISA types include DSPs, stack machines, VLIW machines, MISC machines, TTA architectures, massively parallel processor arrays, etc.

The control unit, as described above, reads the instructions, and generates the necessary digital signals to operate the other components.

The most general meaning is a 'hardware register': anything that can be used to store bits of information in such a way that all the bits of the register can be written to or read out simultaneously. Since registers outside of a CPU are also outside the scope of this book, we will only discuss processor registers, which are hardware registers that happen to be inside a CPU.

The programmer-visible registers, also called the user-accessible registers, also called the architectural registers, often simply called 'the registers', are the registers that are directly encoded as part of at least one instruction in the instruction set.

Some computers have highly specialized registers -- memory addresses always come from the program counter, or 'the' index register, or 'the' stack pointer;

Other computers have more general-purpose registers -- any instruction that accesses memory can use any address register as an index register or as a stack pointer;

Many designers choose to design a CPU with lots of physical registers, using them in ways that make the CPU execute the same given instruction set much faster than a CPU that lacks those registers.

The cache is used because reading external memory is very slow (compared to the speed of the processor), and reading a local cache is much faster.

Some computers order their data with the most significant byte of a word in the lowest address, while others order their data with the most significant byte of a word in the highest address.

Computers that order data with the least significant byte in the lowest address are known as 'Little Endian', and computers that order the data with the most significant byte in the lowest address are known as 'Big Endian'.

It is easier for a human (typically a programmer) to view multi-word data dumped to a screen one byte at a time if it is ordered as Big Endian.

When communicating over a network composed of both big-endian and little-endian machines, the network hardware (should) apply the Address Invariance principle, to avoid scrambling text (avoiding the NUXI problem). High-level software (should) be written as 'endian clean' -- always reading and writing 16-bit integers as whole 16-bit integers, 32-bit integers as whole 32-bit integers, etc. Software that is not 'endian clean' -- software that writes integers, but then reads them out as 8-bit octets or integers of some other length -- usually fails when re-compiled for another computer.
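A minimal C sketch of 'endian clean' I/O: the 32-bit value is written and read back as four explicit bytes in big-endian (network) order, so the byte layout no longer depends on the host's endianness.

#include <stdint.h>
#include <stdio.h>

/* Serialize a 32-bit integer most-significant byte first. */
static void put_u32_be(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)(v);
}

/* Deserialize it the same way -- never by casting a pointer. */
static uint32_t get_u32_be(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

int main(void)
{
    uint8_t buf[4];
    put_u32_be(buf, 0x12345678u);
    printf("0x%08X\n", get_u32_be(buf));   /* 0x12345678 on any host */
    return 0;
}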

Advanced CPU Designs: Crash Course Computer Science #9

So now that we've built and programmed our very own CPU, we're going to take a step back and look at how CPU speeds have rapidly increased from just a few ...

Intro to Computer Architecture

An overview of hardware and software components of a computer system.

Inside your computer - Bettina Bair

View full lesson: How does a computer work? The critical components of a computer are the ..

Computer Basics: Hardware

A desktop computer is composed of many diverse components ...

Registers and RAM: Crash Course Computer Science #6

Take the 2017 PBS Digital Studios Survey: Today we're going to create memory! Using the basic logic gates we ..

How a CPU Works

New Course from InOneLesson (Coming Soon): Uncover the inner workings of the CPU. Author's Website: ..

Lecture -18 Processor Design

Lecture Series on Computer Architecture by Prof. Anshul Kumar, Department of Computer Science & Engineering, IIT Delhi. For more details on NPTEL visit ...

How Computers Calculate - the ALU: Crash Course Computer Science #5

Take the 2017 PBS Digital Studios Survey: Today we're going to talk about a fundamental part of all modern computers

Lecture - 3 Introduction To System : Hardware

Lecture Series on Computer Organization by Prof. S. Raman, Department of Computer Science and Engineering, IIT Madras. For more details on NPTEL visit ...

Lecture - 16 CPU - Memory Interaction

Lecture Series on Computer Organization by Prof. S. Raman, Department of Computer Science and Engineering, IIT Madras. For more details on NPTEL visit ...