Programming Survival Skills

September 7, 2006


Why study programming? Ethical hackers should study programming and learn as much about the subject as possible in order to find vulnerabilities in programs and get them fixed before unethical hackers take advantage of them. It is very much a foot race: if the vulnerability exists, who will find it first? The purpose of this article is to give you the survival skills and the ability to find holes in software before the black hats do.

It should be said at the outset that programming is not something you learn in an article, a chapter or in one book, for that matter. There are professional and hobbyist programmers who spend years perfecting their skills. However, there are a few core concepts that can be picked up rather quickly. We will not try to turn you into a programmer. Instead, we hope that by the end of this article you won't be afraid to look at source code and that you've learned a few skills to "hack" up some code if you need to.

This chapter (CH 7) is excerpted from the book titled "Gray Hat Hacking: The Ethical Hacker's Handbook" By Shon Harris, et al, published by McGraw-Hill Osborne. ISBN: 0072257091; Published: November 2004; Pages: 434; Edition: 1st.

Programming

The Problem-Solving Process

Experienced programmers develop their own process of tackling problems. They will consolidate or even skip steps, but most of them will admit that when they first learned, they followed a more structured approach. Programming usually revolves around a problem-solving process as follows:

1 Define the problem. The first step in the process is to clearly define the problem. This step can be as simple as brainstorming; however, there is usually a need to develop requirements of some sort. In other words, "What is the problem we are trying to solve?" The next question to ask is "What are the high-level functions that will allow us to solve that problem?"

2 Distill the problem down to byte-sized chunks. At this point, the high-level functions may be hard to wrap your mind around. In fact, they may be very abstract, so the next step is to break the high-level functions into smaller ones through an iterative process until they are easy to digest.

3 Develop pseudo-code. The term pseudo means fake. Pseudo-code refers to the clear, concise description of logic or computer algorithms in plain English, without regard to syntax. In fact, there is no precise syntax for pseudo-code. That said, most programmers gravitate toward a format like their favorite programming language.

4 Group like components into modules. This is an important step. Modern programming languages rely on a concept called modularity. Simply put, modular means self-contained and reusable bits of code; after all, why reinvent the wheel if the module already exists? Over time, programmers develop their own software module library or learn to quickly find others and incorporate them into their own programs. Modules are loosely related to functions or procedures in most languages.

5 Translate to a programming language. At this point, you should have pseudocode that resembles your favorite programming language. It's a simple process to add the nuances of your favorite language, such as its syntax and format. Then run your favorite compiler to turn the source code into machine language.

6 Debug errors. The term bug (as it relates to hardware) is often attributed to Thomas Edison, but it was Navy Admiral Grace Hopper who made it famous when she found a bug (actually a moth) while cleaning the Harvard Mark II calculator in 1947. Many credit her with coining the term debugging, which means to remove any errors in either computer hardware or software. For our purposes, there are two types of programming errors or software bugs that require debugging: syntax and runtime.

- Syntax errors Syntax errors are caught by the compiler during the compiling process, and are the easiest type of error to find and fix. The compiler usually provides enough information to allow the programmer to find the syntax error and fix it.

NOTE: Many black and gray hat hackers will modify the source code for their exploits and purposefully add syntax errors to keep script kiddies from compiling and running them. The source code could be crippled by removing needed syntax (for example, missing ; or {}).

- Runtime errors Runtime errors are harder to find and fix. They are not caught by the compiler and may or may not be found during program execution. Runtime errors (or bugs) are problems that will cause the program to act in a manner not intended by the programmer. There are too many types of bugs to cover in this chapter; however, we will focus on input validation errors. These errors occur when the user provides input in a manner not foreseen or handled properly by the programmer. The program may crash (at the least) or may allow the attacker to gain control of the program and the underlying operating system with administrator or root level access (in the worst case). The rest of this chapter will focus on this type of programming error.

7 Test the program. The purpose of testing is to find the runtime errors while confirming the functionality of the program. In other words, does the program perform as planned without any unforeseen consequences? It's the latter part of this question that's the hardest to answer. The problem centers around the fact that a programmer is often the last person to find all of the runtime errors. The programmer is typically too close to the program to think outside the box. It is usually best to separate the testing and programming functions and have separate people perform them.

8 Implement production. Unfortunately, for too many software products, production begins as soon as the program compiles without syntax errors and without adequate testing. There's often a rush to market and a ready acceptance that bugs will be found and handled later.

Pseudo-code

In this section, we will elaborate on the concept of pseudo-code. In order to better prepare you for upcoming sections, we will aim for a C language format. The reason for this is to quickly move from pseudo to real code.

Though we don't typically think of it in those terms, humans naturally act like programs during the course of their day. To illustrate this point, the process of getting ready for work generally goes like this:

1 Wake up
2 Use toilet
3 Shower
4 Fix hair
5 Brush teeth
6 Shave/apply make-up
7 Select clothes to wear
8 Get dressed
9 Eat breakfast
10 Watch news
11 Grab keys and wallet/purse
12 Head for the door

To a programmer, this process would follow an algorithm like this:

Wake up ( )
Visit the bathroom ( )
Get dressed ( )
Prepare to leave ( )
Leave ( )

At this point, using ( ) indicates the use of modules. The important thing to note is that each step follows a logical progression. It would make no sense, for example, to leave without first getting dressed.

If we were to make a computer program to model this process, we would further refine each of the preceding modules by creating smaller modules, some nested within each other:

Prepare to leave ( ){
    Eat breakfast ( )
    Until it is time to leave {
        Watch news ( )
    }
    Grab keys and wallet ( )
}

The { and } symbols are used to mark the beginning and the end of a block of pseudocode. Don't worry about the symbols we've chosen here. They'll make more sense later. Further refined, Eat breakfast ( ) could look like this:

Eat breakfast ( ){
    Choose a brand of cereal ( )
    Prepare a bowl of cereal ( )
    Grab a large spoon ( )
    Eat the bowl of cereal ( )
}

And so on, and so on…

Programmers vs. Hackers

There are several differences between programmers and hackers, some of which are discussed here.

Order vs. Disorder

Simply put, programmers like order, whereas hackers thrive upon disorder. Programmers will develop complicated yet logical algorithms that are created to accomplish a particular task. The better ones will transform their pseudo-code into a formal mathematical language and attempt to prove the correctness and/or security of the algorithm, ensuring that the program will "never" crash (at least as best as they can tell). On the other hand, a hacker will think outside the box, and throw a million monkey-wrenches into your program to reduce it to a willing zombie-like slave that yields control of itself to the hacker.

Time

The main difference between the operating manner of programmers and hackers is time. Programmers are very limited in the area of time and usually have some type of deadline to meet. Less-experienced programmers will stay up all night removing the bugs until the code finally compiles, after which it is shipped. Hackers, on the other hand, operate within a nearly unlimited timeframe. I say "nearly" because if there is a vulnerability present, it's only a matter of time before some hacker (good or bad) will find it. You see, with hackers, other things become more important than time, such as the challenge. This problem will always exist, and the programmers will always have less time than the hackers to "test" software. That is one of the motivating factors of this book. We need more honorable, ethical hackers out there finding problems and getting them fixed before the attackers do.

Motive

Motive can be compared in two ways: defense vs. offense and money vs. bragging rights.

Defense vs. Offense – In general terms, the programmer normally plays a defensive position, performing a function while trying to protect the user from hackers. Meanwhile, attackers are on the offense, attacking the programmer's defenses and intending harm to the user. Following this analogy, any football player will tell you that it is much more physically demanding to play defense than to play offense. The problem is that the defender must defend against all possible attacks and the offender must only be good at a few. To make matters worse, each programmer that makes a server program with ports accessible from the Internet has to defend against millions of bright and dangerous attackers.

NOTE: This is where the ethical hacker can level the playing field. By acting on the defender's behalf, the ethical hacker can beat the attacker to the punch by identifying problems and getting them fixed.

Money vs. Bragging Rights – Although money is desirable, never underestimate the power and influence of bragging rights. There are many talented programmers contributing to open source and free software, but the vast majority of programmers program for a living. However, although some attackers are on the payroll of a "client," most are in the game to brag about their conquests and earn the respect of their peers. Besides, collecting a fee for maliciously attacking systems is always illegal and particularly hard to explain away before a judge as some kind of research venture. Ethical hackers, on the other hand, have the best of both worlds. Whether ethically hacking for a paying client or conducting research for free, they have nothing to hide; they earn the respect of their peers and get to brag about their exploits afterwards, too!

References

Grace Hopper Biography www.history.navy.mil/photos/pers-us/uspers-h/g-hoppr.htm

Explanation of pseudo-code www.cs.iit.edu/~cs561/cs105/pseudocode1/header.html

  

C Programming Language

The C programming language was developed in 1972 by Dennis Ritchie at AT&T Bell Labs. The language was heavily used in Unix and is therefore ubiquitous. In fact, many of the staple networking programs and operating systems are written in C.

Basic C Language Constructs

Although each C program is unique, there are common structures that can be found in most programs. We'll discuss these in the next few sections.

main()

All C programs contain a main structure (lowercase) that follows this format:

<optional return value type> main(<optional argument>) {
    <optional procedure statements or function calls>;
}

where both the return value type and arguments are optional. If you use command-line arguments for main(), use the format:

<optional return value type> main(int argc, char * argv[]){

where the argc integer holds the number of arguments and the argv array holds the input arguments (strings). The parentheses and curly brackets are mandatory, but white space between these elements does not matter. The curly brackets ({ and }) are used to denote the beginning and end of a block of code. Although procedure and function calls are optional, the program would do nothing without them. Procedure statements are simply a series of commands that perform operations on data or variables and normally end with a semicolon.

Functions

Functions are self-contained bundles of algorithms that can be called for execution by main() or other functions. Technically, the main() structure of each C program is also a function; however, most programs contain other functions. The format is as follows:

<optional return value type> function name (<optional function argument>){
}

The first line of a function is called the signature. By looking at it, you can tell if the function returns a value after executing or requires arguments that will be used in processing the procedures of the function.

The call to the function looks like this:

<optional variable to store the returned value =>function name (arguments if called for by the function signature);

Again, notice the required semicolon at the end of the function call. In general, the semicolon is used on all stand-alone command lines (not bounded by brackets or parentheses).

Functions are used to modify the flow of a program. When a call to a function is made, the execution of the program temporarily jumps to the function. After execution of the called function has completed, the program continues executing on the line following the call. This will make more sense during our later discussion of stack operation.

Variables

Variables are used in programs to store pieces of information that may change and may be used to dynamically influence the program.

Table 7-1 shows some common types of variables.

When the program is compiled, most variables are preallocated memory of a fixed size according to system-specific definitions of size. Sizes in the table are considered typical; there is no guarantee that you will get those exact sizes. It is left up to the compiler and the hardware platform to define these sizes. However, the sizeof operator can be used in C to find the actual size of a type or variable, so that the correct amount of memory is requested.


Table 7-1 Types of Variables

Variables are typically defined near the top of a block of code. As the compiler chews up the code and builds a symbol table, it must be aware of a variable before it is used in the code later. This formal declaration of variables is done in the following manner:

<variable type> <variable name> <optional initialization starting with "=">;

For example:

int a = 0;

where an integer (normally 4 bytes) is declared in memory with a name of a and an initial value of 0.

Once declared, the assignment construct is used to change the value of a variable. For example, the statement:

x=x+1;

is an assignment statement containing a variable x modified by the + operator. The new value is stored into x. It is common to use the format:

destination = source <with optional operators>

where destination is where the final outcome is stored.

printf

The C language comes with many useful constructs for free (bundled in the libc library). One of the most commonly used constructs is the printf command, generally used to print output to the screen. There are two forms of the printf command:

printf(<string>);
printf(<format string>, <list of variables/values>);

The first format is straightforward and is used to display a simple string to the screen. The second format allows for more flexibility through the use of a format string which can be comprised of normal characters and special symbols that act as placeholders for the list of variables following the comma. Commonly used format symbols are shown in Table 7-2.

These format symbols may be combined in any order to produce the desired output. Except for the \n symbol, the number of variables/values needs to match the number of symbols in the format string; otherwise, problems will arise, as shown in Chapter 9.

Table 7-2 printf Format Symbols

scanf

The scanf command complements the printf command and is generally used to get input from the user. The format is as follows:

scanf(<format string>, &<named variable>);

where the format string can contain format symbols as those shown in printf. For example, the following code will read an integer from the user and store it into the variable called number:

scanf("%d", &number);

Strictly speaking, the & symbol yields the address of number, so scanf stores the value into the memory location where number lives; that will make more sense when we talk about pointers later. For now, realize that you must use the & symbol before the name of any ordinary (non-array) variable with scanf. The command does not change types on the fly: if you were to enter a letter in response to the previous command, the conversion would fail, number would be left unchanged, and scanf would return the number of items it managed to store (here, 0). In addition, bounds checking is not done in regards to string size, which may lead to problems (as discussed later in Chapter 8).

strcpy/strncpy

The strcpy command is probably the most dangerous command used in C. The format of the command is

strcpy(<destination>, <source>);

The purpose of the command is to copy each character in the source string (a series of characters ending with a null character: \0) into the destination string. This is particularly dangerous because there is no checking of the size of the source before it is copied over the destination. In reality, we are talking about overwriting memory locations here, something which will be explained later. Suffice it to say, when the source is larger than the space allocated for the destination, bad things happen (buffer overflows). A much safer command is the strncpy command. The format of that command is

strncpy(<destination>, <source>, <width>);

The width field is used to ensure that only a certain number of characters are copied from the source string to the destination string, allowing for greater control by the programmer.

NOTE: It is unsafe to use unbounded functions like strcpy; however, most programming courses do not cover the dangers posed by these functions. In fact, if programmers would simply use the bounded alternatives (for example, strncpy) with a correct width, a large share of buffer overflow attacks would not exist. Keep in mind that strncpy does not append a null character when the source is as long as or longer than the width, so the programmer must still terminate the destination. Evidently, programmers continue to use the dangerous functions, because buffer overflows remain one of the most common attack vectors.

for and while Loops

Loops are used in programming languages to iterate through a series of commands multiple times. The two common types are for and while loops. for loops start counting at a beginning value, test the value for some condition, execute the statement, and increment the value for the next iteration. The format is as follows:

for(<beginning value>; <test value>; <change value>){
    <statement>;
}

Therefore, a for loop like:

for(i=0; i<10; i++){
    printf("%d", i);
}

will print the numbers 0 to 9 on the same line (since \n is not used), like this: 0123456789. With for loops, the condition is checked prior to the iteration of the statements in the loop, so it is possible that even the first iteration will not be executed. When the condition is not met, the flow of the program continues after the loop.

NOTE: It is important to note the use of the less-than operator (<) in place of the less-than-or-equal-to operator (<=). Using <= would allow the loop to proceed one more time, with i=10. This is an important concept, because confusing the two leads to off-by-one errors. Also, note the count was started with 0. This is common in C and worth getting used to.

The while loop is used to iterate through a series of statements until a condition is met. The format is as follows:

while(<conditional test>){
    <statement>;
}

Like the for loop, the while loop checks its condition before each iteration, so its statements may not execute at all. The related do/while loop checks the condition after each iteration and therefore always executes at least once. It is important to realize that loops may be nested within each other.

if/else

The if/else construct is used to execute a series of statements if a certain condition is met; otherwise, the optional else block of statements is executed. If there is no else block of statements, the flow of the program will continue after the end of the closing if block bracket (}). The format is as follows:

if(<condition>) {
    <statements to execute if condition is met>;
} <else>{
    <statements to execute if the condition above is false>;
}

Comments

To assist in the readability and sharing of source code, programmers include comments in the code. There are two ways to place comments in code: // or /* and */. The // indicates that any characters on the rest of that line are to be treated as comments and not acted on by the computer when the program executes. The /* and */ pair start and stop blocks of comment that may span multiple lines. The /* is used to start the comment, and the */ is used to indicate the end of the comment block.

Sample Program

You are now ready to review your first program. We will start by showing the program with // comments included, and will follow up with a discussion of the program.

//hello.c                         //customary comment of program name
#include <stdio.h>          //needed for screen printing
int main ( ) {                  //required main function
    printf("Hello haxor");   //simply say hello
}                                   //exit program

This is a very simple program that prints out "Hello haxor" to the screen using the printf function, declared in the stdio.h header. Now for one that's a little more complex:

//meet.c
#include <stdio.h>                                  // needed for screen printing
#include <string.h>                                 // needed for strcpy
void greeting(char *temp1,char *temp2){    // greeting function to say hello
    char name[400];                                 // string variable to hold the name
    strcpy(name, temp2);                          // copy the function argument to name (unbounded)
    printf("Hello %s %s\n", temp1, name);  //print out the greeting
}
int main(int argc, char * argv[]){                //note the format for arguments
    greeting(argv[1], argv[2]);                    //call function, pass title & name
    printf("Bye %s %s\n", argv[1], argv[2]); //say "bye"
}                                                            //exit program

This program takes two command-line arguments and calls the greeting() function, which prints "Hello" and the title and name given, followed by a newline. When the greeting() function finishes, control is returned to main(), which prints out "Bye" and the title and name given. Finally, the program exits.

Compiling with gcc

Compiling is the process of turning human-readable source code into machine-readable binary files that can be digested by the computer and executed. More specifically, a compiler takes source code and translates it into an intermediate set of files called object code. These files are nearly ready to execute but may contain unresolved references to symbols and functions not included in the original source code file. These symbols and references are resolved through a process called linking as each object file is linked together into an executable binary file. We have simplified the process for you here.

When programming with C on Unix systems, the compiler of choice is GNU C Compiler (gcc). gcc offers plenty of options when compiling. The most commonly used flags are shown in Table 7-3.


Table 7-3 Commonly Used gcc Flags

For example, to compile our meet.c program, you would type

$gcc -o meet meet.c

Then to execute the new program, you would type

$./meet Mr Haxor
Hello Mr Haxor
Bye Mr Haxor
$

References

C Programming Methodology www.comp.nus.edu.sg/~hugh/TeachingStuff/cs1101c.pdf

Introduction to C Programming www.le.ac.uk/cc/tutorials/c/

How C Works http://computer.howstuffworks.com/c.htm

 

  

Computer Memory

In the simplest terms, computer memory is an electronic mechanism that has the ability to store and retrieve data. The smallest amount of data that can be stored is 1 bit, which can be represented by either a 1 or a 0 in memory. When you put 4 bits together, it is called a "nibble," which can represent values from 0000 to 1111. There are exactly 16 binary values, ranging from 0 to 15 in decimal format. When you put two nibbles or 8 bits together, you get a "byte," which can represent values from 0 to (2^8 - 1), or 0 to 255 in decimal. When you put 2 bytes together, you get a "word," which can represent values from 0 to (2^16 - 1), or 0 to 65,535 in decimal. Continuing to piece data together, if you put two words together, you get a "double word" or "DWORD," which can represent values from 0 to (2^32 - 1), or 0 to 4,294,967,295 in decimal.

 

There are many types of computer memory; we will focus on random access memory (RAM) and registers. Registers are special forms of memory embedded within processors, which will be discussed later in this chapter in the "Registers" section.

Random Access Memory (RAM)

In RAM, any piece of stored data can be retrieved at any time; thus, the term random access. However, RAM is volatile, meaning that when the computer is turned off, all data is lost from RAM. When discussing modern Intel-based products (x86), the memory is 32-bit addressable, meaning that the address bus the processor uses to select a particular memory address is 32 bits wide. Therefore, addresses range from 0 to 4,294,967,295, and the most memory that can be addressed in an x86 processor is 4,294,967,296 bytes (4GB).

Endian

As Danny Cohen summarized Swift's Gulliver's Travels in 1980:

"Some notes on Swift's Gulliver's Travels: Gulliver finds out that there is a law, proclaimed by the grandfather of the present ruler, requiring all citizens of Lilliput to break their eggs only at the little ends. Of course, all those citizens who broke their eggs at the big ends were angered by the proclamation. Civil war broke out between the Little-Endians and the Big-Endians, resulting in the Big-Endians taking refuge on a nearby island, the kingdom of Blefuscu…"

He went on to describe a holy war that broke out between the two sides. The point of his paper was to describe the two schools of thought when writing data into memory. Some feel that the low-order bytes should be written first (called "Little Endian"), while others think the high-order bytes should be written first (called "Big Endian"). Which one applies depends on the hardware you are using. For example, Intel-based processors use Little Endian, whereas Motorola-based processors use Big Endian. This will come into play later as we talk about shellcode.

Segmentation of Memory

The subject of segmentation could easily consume a chapter itself. However, the basic concept is simple. Each process (oversimplified as an executing program) needs to have access to its own areas in memory. After all, you would not want one process overwriting another process's data. So memory is broken down into small segments and handed out to processes as needed. Registers, discussed later, are used to store and keep track of the current segments a process maintains. Offset registers are used to keep track of where in the segment the critical pieces of data are kept.

Programs in Memory

When processes are loaded into memory, they are basically broken into many small sections. There are six main sections that we are concerned with, and we'll discuss them in the following sections.

.text Section

The .text section basically corresponds to the .text portion of the binary executable file. It contains the machine instructions to get the task done. This section is marked as read-only and will cause a segmentation fault if written to. The size is fixed at runtime when the process is first loaded.

.data Section

The .data section is used to store initialized variables, such as:

int a = 0;

The size of this section is fixed at runtime.

.bss Section

The .bss section (the name is historical, from "block started by symbol") is used to store noninitialized variables, such as:

int a;

The size of this section is fixed at runtime.

Heap Section

The heap section is used to store dynamically allocated variables and grows from the lower-addressed memory to the higher-addressed memory. The allocation of memory is controlled through the malloc() and free() functions. For example, to declare an integer and have the memory allocated at runtime, you would use something like:

int *i = malloc (sizeof (int)); //dynamically allocates an integer, which
                                        //contains the pre-existing value of that memory

Stack Section

The stack section is used to keep track of function calls (recursively) and grows from the higher-addressed memory to the lower-addressed memory on most systems. As we will see, the fact that the stack grows in this manner allows the subject of buffer overflows to exist.

Environment/Arguments Section

The environment/arguments section is used to store a copy of system-level variables that may be required by the process during runtime. For example, among other things, the path, shell name, and hostname are made available to the running process. This section is writable, allowing its use in format string and buffer overflow exploits. Additionally, the command-line arguments are stored in this area. The sections of memory reside in the order presented. The memory space of a process looks like this:

Lower addresses                                              Higher addresses
.text | .data | .bss | heap ->      (unused)      <- stack | env/args

Buffers

The term buffer refers to a storage place used to receive and hold data until it can be handled by a process. Since each process can have its own set of buffers, it is critical to keep them straight. This is done by allocating the memory within the .data or .bss section of the process's memory. Remember, once allocated, the buffer is of fixed length. The buffer may hold any predefined type of data; however, for our purpose, we will focus on string-based buffers, used to store user input and variables.

Strings in Memory

Simply put, strings are just continuous arrays of character data in memory. The string is referenced in memory by the address of the first character. The string is terminated or ended by a null character (\0 in C).

Pointers

Pointers are special pieces of memory that hold the address of other pieces of memory. Moving data around inside of memory is a relatively slow operation. It turns out that instead of moving data, it is much easier to simply keep track of the location of items in memory (through pointers) and simply change the pointers. On 32-bit x86 systems, pointers are saved in 4 bytes of contiguous memory because memory addresses are 32 bits in length (4 bytes). For example, as mentioned, strings are referenced by the address of the first character in the array. That address value is called a pointer. So the variable declaration of a string in C is written as follows:

char * str; //this is read, give me 4 bytes called str which is a pointer
               //to a Character variable (the first byte of the array).

It is important to note that even though the size of the pointer is set at 4 bytes, the size of the string has not been set with the preceding command; therefore, this data is considered uninitialized and will be placed in the .bss section of the process memory.

 

As another example, if you wanted to store a pointer to an integer in memory, you would issue the following command in your C program:

int * point1; // this is read, give me 4 bytes called point1 which is a
                  //pointer to an integer variable.

To read the value of the memory address pointed to by the pointer, you dereference the pointer with the * symbol. Therefore, if you wanted to print the value of the integer pointed to by point1 in the preceding code, you would use the following command:

printf("%d", *point1);

where the * is used to dereference the pointer called point1 and display the value of the integer using the printf() function.
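
The following is a minimal sketch of the same idea with a pointer that actually points at something. The function names are hypothetical and exist only for illustration. Note that the size printed by pointer_demo() depends on the platform: 4 bytes on the 32-bit systems discussed here, and typically 8 bytes on a 64-bit system.

```c
#include <stdio.h>

/* Hypothetical helper: stores a value in the integer that target points to. */
void set_through_pointer(int *target, int value)
{
    *target = value;          /* dereferencing on the left side writes memory */
}

/* Hypothetical helper: reads the integer that source points to. */
int read_through_pointer(const int *source)
{
    return *source;           /* dereferencing on the right side reads memory */
}

/* Prints the address held by a pointer, the value found there,
   and the size of the pointer itself. */
void pointer_demo(void)
{
    int number = 5;
    int *point1 = &number;    /* point1 now holds the address of number */

    printf("address held by point1: %p\n", (void *)point1);
    printf("value at that address:  %d\n", *point1);
    printf("size of the pointer:    %zu bytes\n", sizeof point1);
}
```

Declaring a pointer only reserves room for the address. It must be given a valid address, as point1 is here, before it is safe to dereference.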

Putting the Pieces of Memory Together

Now that you have the basics down, we will present a simple example to illustrate the usage of memory in a program:

/* memory.c */             // this comment simply holds the program name
#include <stdlib.h>        // declares malloc()
#include <string.h>        // declares strncpy()
int idx = 5;               // integer stored in data (initialized)
char * str;                // string pointer stored in bss (uninitialized)
int nothing;               // integer stored in bss (uninitialized)
void funct1(int c){        // bracket starts function1 block
    int i=c;               // stored in the stack region
    str = (char*) malloc (10 * sizeof (char));  // reserves 10 characters in
                                                // the heap region
    strncpy(str, "abcde", 5);  // copies 5 characters "abcde" into str
    str[5] = '\0';             // strncpy copied no terminator here, so add one
}                          // end of function1
int main (){               // the required main function
    funct1(1);             // main calls function1 with an argument
    return 0;              // report success to the operating system
}                          // end of the main function

 

This program does not do much. First, several pieces of memory are allocated in different sections of the process memory. When main is executed, funct1() is called with an argument of 1. Once funct1() is called, the argument is passed to the function variable called c. Next, memory is allocated on the heap for a 10-byte string called str. Finally, the 5-byte string "abcde" is copied into the new variable called str. The function ends, and then the main() program ends.
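
To see the different regions for yourself, a sketch along the following lines prints one address from each of them. The names index_value and show_regions are our own, hypothetical choices, and the exact addresses (and even the section a compiler picks) vary with the compiler and operating system, so treat the labels in the output as typical rather than guaranteed.

```c
#include <stdio.h>
#include <stdlib.h>

int index_value = 5;     /* initialized global: typically placed in .data */
char *str;               /* uninitialized global: typically .bss, starts as NULL */
int nothing;             /* uninitialized global: typically .bss, starts as 0 */

/* Prints one address from each region.
   Returns 0 on success, or -1 if the heap allocation fails. */
int show_regions(void)
{
    int local = 1;                    /* local variable: lives on the stack */
    char *heap_buf = malloc(10);      /* 10 bytes requested from the heap */
    if (heap_buf == NULL)
        return -1;

    printf(".data  index_value at %p\n", (void *)&index_value);
    printf(".bss   nothing     at %p\n", (void *)&nothing);
    printf("heap   heap_buf -> %p\n", (void *)heap_buf);
    printf("stack  local       at %p\n", (void *)&local);

    free(heap_buf);                   /* give the heap memory back */
    return 0;
}
```

On a typical Linux system, the stack address is far higher than the others, which fits the picture of the stack sitting near the top of the process memory space.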

CAUTION: You must have a good grasp of this material before moving on in the book. If you need to review any part of this chapter, please do so before continuing.

References

Smashing the Stack…, Aleph One www.mindsec.com/files/p49-14.txt

How Memory Works http://computer.howstuffworks.com/c23.htm

Memory Concepts www.groar.org/expl/beginner/buffer1.txt

Little-Endian vs. Big Endian www.rdrop.com/~cary/html/endian_faq.html

 

  

Intel Processors

There are several commonly used computer architectures. In this chapter, we will focus on the Intel family of processors or architecture. Table 7-4 shows the highlights of the Intel processors.

NOTE: After the 80486, Intel decided to use more trademark-friendly names, such as Pentium, Xeon, and Itanium.

The term architecture simply refers to the way a particular manufacturer implemented their processor. Since the bulk of the processors in use today are Intel 80x86, we will focus on that architecture. All 80x86 processors have the following three functions in common:

  • They can do complex arithmetic.
  • They can move data around.
  • They can interpret instructions to make logic decisions and control other devices.

 

These functions are accomplished through the use of the following resources:

Registers

Registers are used to store data temporarily. Think of them as fast 8- to 32-bit chunks of RAM for use internally by the processor. Registers can be divided into four categories (32 bits each unless otherwise noted). These are shown in Table 7-5.

Arithmetic Logic Unit (ALU)

The arithmetic logic unit (ALU) is used to perform mathematical functions such as addition, multiplication, subtraction, and division. The ALU is also used to perform logical functions such as Boolean AND, OR, and NOT.
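
C exposes most of these ALU operations directly as operators. The sketch below uses hypothetical helper names and fixed 32-bit unsigned values to match the register width discussed in this chapter; a compiler will usually turn each function body into one or two machine instructions.

```c
#include <stdint.h>

/* Hypothetical helpers that mirror ALU operations on 32-bit values. */

uint32_t alu_and(uint32_t a, uint32_t b) { return a & b; }  /* Boolean AND */
uint32_t alu_or(uint32_t a, uint32_t b)  { return a | b; }  /* Boolean OR  */
uint32_t alu_not(uint32_t a)             { return ~a; }     /* Boolean NOT */

/* Unsigned addition wraps around modulo 2^32, just as a 32-bit
   register does when a result no longer fits. */
uint32_t alu_add(uint32_t a, uint32_t b) { return a + b; }

/* Unsigned subtraction wraps in the same way when b is larger than a. */
uint32_t alu_sub(uint32_t a, uint32_t b) { return a - b; }
```

The wraparound in alu_add() and alu_sub() is worth remembering, because arithmetic that silently wraps is a common source of bugs.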

Table 7-4 Features of Various Intel Processors

 

Table 7-5 Categories of Registers

Program Counter

The program counter is a special register used to store the address of the next instruction to be processed. On 32-bit Intel processors, this register is called the extended instruction pointer (EIP).

Control Unit

The control unit is the brains of the operation. It can be simplified into two components:

  • Instruction fetch/decoder unit: A set of latches, clocks, and buses that effectively fetches the next instruction to be processed, increments the program counter, and then decodes the instruction for execution.
  • I/O control unit: Responsible for interacting with external I/O devices.

 

Buses

Information flows around the processor and to external devices through a device called a bus. Much like the flat ribbon cables that can be seen inside a PC case, the internal buses of the processor are between 16 and 64 bits wide. The wider the bus, the faster the processor can operate. For our purposes, there are three buses worth knowing about:

  • Address bus: Used to select addresses to be read or written to in memory
  • Data bus: Used to move data around the processor and to/from memory
  • Control bus: Used to control external devices and execute instructions

Figure 7-1 shows how these elements work together.

 

References

x86 Registers www.mindsec.com/files/avoid.html#lfindex6

History of Processors http://home.si.rr.com/mstoneman/pub/docs/Processors%20History.rtf

Processors www.cs.princeton.edu/courses/archive/fall99/cs318/Files/pc-arch.html

  

Assembly Language Basics

Though entire books have been written about the ASM language, there are a few basics you can easily grasp to become a more effective ethical hacker.

Machine vs. Assembly vs. C

Computers only understand machine language, that is, a pattern of 1s and 0s. Humans, on the other hand, have trouble interpreting large strings of 1s and 0s, so assembly was designed to assist programmers with mnemonics to remember the series of numbers. Later, higher-level languages were designed, such as C and others, which remove humans even further from the 1s and 0s. If you want to become a good ethical hacker, you must resist societal trends and get back to basics with assembly.

AT&T vs. NASM

There are two main forms of assembly syntax: AT&T and Intel. AT&T syntax is used by the GNU Assembler (gas), contained in the gcc compiler suite, and is often used by Linux developers. Of the Intel syntax assemblers, the Netwide Assembler (NASM) is the most commonly used. The NASM format is used by many Windows assemblers and debuggers. The two formats yield exactly the same machine language; however, there are a few differences in style and format:

  • The source and destination operands are reversed, and different symbols are used to mark the beginning of a comment:
    • NASM format: CMD <dest>, <source> <; comment>
    • AT&T format: CMD <source>, <dest> <# comment>
  • AT&T format uses a % before registers; NASM does not.
  • AT&T format uses a $ before literal values; NASM does not.
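  • AT&T handles memory references differently than NASM: AT&T writes a memory operand as disp(base,index,scale), whereas NASM writes it as [base + index*scale + disp].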

 

In this section, we will show the syntax and examples in NASM format for each command. Additionally, we will show an example of the same command in AT&T format for comparison. In general, the following format is used for all commands:

<optional label:> <mnemonic> <operands> <optional comments>

The number of operands (arguments) depends on the command (mnemonic). Although there are many assembly instructions, you only need to master a few. These are shown in the following sections.

mov

The mov command is used to copy data from the source to the destination. The value is not removed from the source location.

An immediate (literal) value cannot be moved directly into a segment register. Instead, you must use a general-purpose register as an intermediate step. Note that the CS register is a special case: it cannot be loaded with mov at all. For example:

mov eax, 1234h ; store the value 1234 (hex) into EAX
mov ds, ax ; then copy the value of AX into DS.

add and sub

The add command is used to add the source to the destination and store the result in the destination. The sub command is used to subtract the source from the destination and store the result in the destination.

push and pop

The push command is used to place an item on top of the stack, and the pop command is used to remove the item currently on top of the stack.
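
The last-in, first-out behavior can be modeled in a few lines of C. This is only a toy model with hypothetical names (stack_mem, sp, toy_push, toy_pop), not how the processor is implemented, but it mimics the way the Intel stack grows toward lower addresses: a push moves the stack pointer down before storing, and a pop reads the value and then moves the stack pointer back up.

```c
#include <stdint.h>

#define STACK_WORDS 16

/* Toy model of a downward-growing stack (all names are hypothetical). */
static uint32_t stack_mem[STACK_WORDS];
static int sp = STACK_WORDS;     /* index just past the highest slot: empty */

/* push: move the stack pointer down one slot, then store the value.
   Returns 0 on success, or -1 if the stack is full. */
int toy_push(uint32_t value)
{
    if (sp == 0)
        return -1;
    stack_mem[--sp] = value;
    return 0;
}

/* pop: read the value on top, then move the stack pointer up one slot.
   Returns 0 on success, or -1 if the stack is empty. */
int toy_pop(uint32_t *out)
{
    if (sp == STACK_WORDS)
        return -1;
    *out = stack_mem[sp++];
    return 0;
}
```

Pushing 1 and then 2 and popping twice yields 2 first and 1 second, which is exactly the ordering that matters when a function's saved values are restored.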

xor

The xor command is used to conduct a bitwise logical "exclusive or" (XOR) function; for example, 11111111 XOR 11111111 = 00000000. Therefore, XOR value, value can be used to zero out or clear a register or memory location.
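
The same property is easy to check in C, where the ^ operator performs a bitwise XOR. The helper names below are hypothetical and serve only to illustrate the point.

```c
#include <stdint.h>

/* XOR of a value with itself: every bit is paired with an identical
   bit, so the result is always 0, whatever the input was. */
uint32_t xor_self(uint32_t value)
{
    return value ^ value;
}

/* XOR of two values: a result bit is 1 only where the inputs differ. */
uint32_t xor_bits(uint32_t a, uint32_t b)
{
    return a ^ b;
}
```

Applying xor_bits() twice with the same second argument returns the original value.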

jne, je, jz, jnz, and jmp

The jne, je, jz, jnz, and jmp commands are used to branch the flow of the program to another location. The conditional forms test the zero flag (ZF) in the EFLAGS register: jne/jnz will jump if ZF = 0; je/jz will jump if ZF = 1; and jmp will always jump.
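
To make the flag logic concrete, here is a small C model of it. It assumes the zero flag was set by a preceding cmp instruction, which is not covered in the text above: cmp subtracts its operands, discards the result, and sets ZF when that result is zero. All function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models ZF after "cmp a, b": the flag is set when a - b is zero,
   which happens exactly when the two operands are equal. */
bool zero_flag_after_cmp(uint32_t a, uint32_t b)
{
    return (a - b) == 0;
}

/* je/jz: the jump is taken when ZF = 1. */
bool je_taken(bool zf)  { return zf; }

/* jne/jnz: the jump is taken when ZF = 0. */
bool jne_taken(bool zf) { return !zf; }

/* jmp: the jump is always taken, whatever the flag holds. */
bool jmp_taken(bool zf) { (void)zf; return true; }
```

An if (a == b) test in C is commonly compiled into this kind of pair: a comparison that sets the flags, followed by a conditional jump.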

call and ret

The call command is used to call a procedure (not jump to a label). The ret command is used at the end of a procedure to return the flow to the command after the call.

inc and dec

The inc and dec commands are used to increment or decrement the destination.

lea

The lea command is used to load the effective address of the source into the destination.
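
A rough C analogy may help separate lea from mov. In the sketch below (hypothetical names, with a small array standing in for memory), element_address() only computes where an item lives, much as lea does, while element_value() goes on to read what is stored there, as a mov from memory would.

```c
#include <stdint.h>

/* A small array standing in for a region of memory (hypothetical). */
static uint32_t table[4] = {10, 20, 30, 40};

/* Like lea: compute the address of element i without reading it. */
uint32_t *element_address(int i)
{
    return &table[i];
}

/* Like mov from memory: read the value stored at that address. */
uint32_t element_value(int i)
{
    return table[i];
}
```

Both functions perform the same address arithmetic (base address plus index times element size); only the second one touches the memory at the resulting address.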

 

int

The int command is used to raise a software interrupt on the processor. The interrupt you will use most often is 0x80, which on Linux is used to signal a system call to the kernel.

Addressing Modes

In assembly, several methods can be used to accomplish the same thing. In particular, there are many ways to indicate the effective address to manipulate in memory. These options are called addressing modes and are summarized in Table 7-6.

 

Table 7-6 Addressing Modes

 

Assembly File Structure

An assembly source file is broken into the following sections:

  • .model: The .model directive is used to indicate the size of the .data and .text sections.
  • .stack: The .stack directive marks the beginning of the stack segment and is used to indicate the size of the stack in bytes.
  • .data: The .data directive marks the beginning of the data segment and is used to define the variables, both initialized and uninitialized.
  • .text: The .text directive is used to hold the program's commands.

 

For example, the following assembly program prints "Hello, haxor!" to the screen.

section .data                   ;section declaration
msg db "Hello, haxor!",0xa      ;our string with a trailing newline (line feed)
len equ $ - msg                 ;length of our string, $ means here
section .text                   ;mandatory section declaration
                                ;export the entry point to the ELF linker or
    global _start               ;loaders conventionally recognize
                                ; _start as their entry point
_start:
                                ;now, write our string to stdout
                                ;notice how arguments are loaded in reverse
mov edx,len                     ;third argument (message length)
mov ecx,msg                     ;second argument (pointer to message to write)
mov ebx,1                       ;load first argument (file handle (stdout))
mov eax,4                       ;system call number (4=sys_write)
int 0x80                        ;call kernel interrupt to perform the write
mov ebx,0                       ;load first syscall argument (exit code)
mov eax,1                       ;system call number (1=sys_exit)
int 0x80                        ;call kernel interrupt and exit

Assembling

The first step in assembling is to make the object code:

$ nasm -f elf hello.asm

Next, you will invoke the linker to make the executable (on a 64-bit Linux host, add -m elf_i386 to the command so that ld accepts the 32-bit object file):

$ ld -s -o hello hello.o

Finally, you can run the executable:

$ ./hello
Hello, haxor!
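
For comparison, here is a minimal C sketch of the same write, assuming a POSIX system where write() and STDOUT_FILENO are declared in <unistd.h>. The function name say_hello is our own, hypothetical choice. The three arguments line up with the registers loaded in the assembly version: the file handle, the pointer to the message, and the message length.

```c
#include <string.h>
#include <unistd.h>

/* Writes the greeting to stdout through the write() system call wrapper.
   Returns the number of bytes written, or -1 if the call fails. */
ssize_t say_hello(void)
{
    const char msg[] = "Hello, haxor!\n";   /* \n is the same byte as 0xa */

    /* file handle, pointer to message, message length */
    return write(STDOUT_FILENO, msg, strlen(msg));
}
```

Calling say_hello() from your own main() and then returning 0 mirrors the two system calls in the assembly program; the C library takes care of placing the arguments and entering the kernel.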

References

Art of Assembly Language Programming http://webster.cs.ucr.edu/

Notes on x86 Assembly www.ccntech.com/code/x86asm.txt

AT&T Assembly Syntax http://sig9.com/articles/index.php?section=asm&aid=19

  

Debugging with gdb

When programming with C on Unix systems, the debugger of choice is gdb. It provides a robust command-line interface, allowing you to run a program while maintaining full control. For example, you may set breakpoints in the execution of the program and monitor the contents of memory or registers at any point you like. For this reason, debuggers like gdb are invaluable to programmers and hackers alike.

gdb Basics

Commonly used commands in gdb are shown in Table 7-7.

Table 7-7 Common gdb Commands

To debug our example program, we first recompile it with debugging options (for example, gcc's -g flag) and then load the resulting executable into gdb.

Disassembly with gdb

To conduct disassembly with gdb, you need the two following commands:

set disassembly-flavor <intel/att>
disassemble <function name>

The first command selects either Intel (NASM) or AT&T format for the output. By default, gdb uses AT&T format. The second command disassembles the given function (including main, if that is the name given). For example, to disassemble the function called greeting in both formats, you would type:

set disassembly-flavor intel
disassemble greeting
set disassembly-flavor att
disassemble greeting

References

Debugging with NASM and gdb www.csee.umbc.edu/help/nasm/nasm.shtml

Smashing the Stack…, Aleph One www.mindsec.com/files/p49-14.txt

 

Summary

If you have a basic understanding of the following concepts, you are ready to move on.

Programming in general

o Based on an iterative refinement process from requirements to pseudo-code to modules

o The two main differences between programmers and hackers are purpose and time

C language programming

o Commonly used program constructs: main, functions, variables, loops, if/then, comments, brackets for blocks of code, arguments

o Compilation process: from source code to object code to executable code

Memory

o Main concepts: RAM, endian (Intel: little, Motorola: big), segmentation, buffers, strings, pointers

o Process memory space: .text, .data, .bss, heap, stack, environment

Intel processors

o Intel architecture: data bus, address bus, control bus, registers, ALU, control unit, external memory, external I/O devices

o Registers: general purpose, segment, offset, special purpose

ASM language basics

o Machine language (binary), assembly (mnemonics), C language (higher-level statements)

o AT&T vs. NASM, common commands, addressing modes

o Assembly process: ASM source code, assembler, linker

Debugging with gdb

o Common commands, breakpoints, stepping through the program, checking registers and memory

o Disassembling with both AT&T and NASM style

 

  

Questions

1. The most commonly used variable types in C are:

A. single, double, int, float

B. int, char, double, float

C. double, buffer, float, int

D. char, array, string, int

 

2. The memory structure called a stack can best be described as:

A. A first-in first-out data structure that grows from the highest to the lowest memory addresses on Intel architectures

B. A first-in last-out data structure that grows from the lowest to highest memory addresses on Intel architectures

C. A last-in first-out data structure that grows from the lowest to highest memory addresses on Intel architectures

D. A first-in last-out data structure that grows from the highest to lowest memory addresses on Intel architectures

 

3. Which of the following registers are used to control stacks by pointing to the bottom and top of the stack frame?

A. The offset registers: EBP and ESP, respectively

B. The general purpose registers: EAX and EBX, respectively

C. The offset registers: EDI and ESI, respectively

D. The segment registers: stack segment (SS) and extra space (ES), respectively

 

4. The statement mov eax, 16h can best be described as:

A. An AT&T format command that moves the value 38 decimal into the register eax

B. A NASM format command that moves the value 16 hex into the register eax

C. An AT&T format command that moves the value of eax into memory address 0x16

D. A NASM format command that moves the value 22 decimal into the register eax

 

5. The two most important commands when debugging with gdb are:

A. set disassemble-flavor <intel/att> and disassembly <function name>

B. disassembly-flavor set <intel/att> and disassembly <function name>

C. set disassembly-flavor <intel/att> and disassemble <function name>

D. set disassemble-flavor <intel/att> and disassemble <function name>

 

6. To compile a program, you would use something like:

A. gcc -o outputname inputname.c

B. gcc -d outputname inputname.c

C. gcc -l links -S simplename -o outputname.o

D. gcc -c inputname.c -o outputname.c

 

7. What is the main difference between a hacker and a software developer?

A. The hacker has a harder job than the software developer.

B. The hacker has unlimited time, whereas the software developer is constrained in time.

C. The software developer has unlimited time, whereas the hacker is usually competing with others and in a hurry.

D. Money is the major motivating factor that gives the software developer an edge.

 

8. Which of the following sets of gdb commands are the most useful when trying to inspect the values of the stack while debugging?

A. print esp

B. info reg esp

C. bt, up, down

D. print stack info

  

Answers

1. B. The int, char, double, and float variable types are the most commonly used in C. Each of the other options contains at least one name that is not a C variable type: single in answer A, buffer in answer C, and array and string in answer D.

2. D. A stack can best be described as a first-in last-out data structure that grows from the highest to lowest memory addresses on Intel architectures. Answer A describes a queue, not a stack; answers B and C are basically saying the same thing but with memory growing in the wrong direction.

3. A. The offset registers EBP and ESP are used to indicate the bottom or base of a stack and the top of a stack, respectively. The other options contained incorrect combinations.

4. D. The command mov eax, 16h can best be described as a NASM format command that moves the value 22 decimal into the register eax. 16h = 22 decimal, and the mov command used with eax is in NASM format, not AT&T.

5. C. The two most important commands when disassembling with gdb are:

set disassembly-flavor <intel/att>
disassemble <function name>

6. A. To successfully compile, you can use the following format:

gcc -o outputname inputname.c

Remember, -o outputname is optional. If you omit it, the compiler will just create an a.out file for execution.

7. B. The hacker has the easiest job, is not motivated by money, and often has unlimited time.

8. C. The backtrace (bt), up, and down commands allow you to fully inspect the contents of the stack while debugging.
