Sunday, April 17, 2011

Hoard memory allocator

What kind of applications will Hoard speed up?

Hoard generally improves the performance of multithreaded programs running on multiprocessors that make frequent use of the heap (calls to malloc/free or new/delete, as well as many STL functions). Because Hoard avoids false sharing, it also speeds up programs that only occasionally call heap functions but access these objects frequently.

I'm using the STL but not seeing any performance improvement. Why not?

In order to benefit from Hoard, you have to tell the STL to use malloc instead of its internal custom memory allocator:
typedef list<int, malloc_alloc> mylist;

typedef list<int, std::allocator<int> > mylist;

The first form works for most platforms using g++, while the second form is for Visual Studio (Windows).
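
For illustration, here is a minimal complete program using the second (standard) form; in typical implementations every list node is allocated through std::allocator's operator new, which in turn calls malloc, the call Hoard replaces:

#include <list>

typedef std::list<int, std::allocator<int> > mylist;

int main()
{
   mylist l;
   for (int i = 0; i < 1000; ++i)
      l.push_back(i);   // each node allocation goes through the heap
   return 0;
}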

What systems does Hoard work on?

Hoard is fully supported for the following platforms.

  • Windows NT/2000/XP/Server (32 and 64-bit)
  • Linux x86 (32 and 64-bit)
  • Solaris (Sparc, x86, and x86-64)
  • Mac OS X (Intel)

Hoard also runs on PowerPC systems (including IBM Power and Macs running Apple OS X).

Have you compared Hoard with SmartHeap SMP?

We tried SmartHeap SMP but it did not work on our Suns (due to an apparent race condition in the code).

Have you compared Hoard against mtmalloc or libumem?

Yes. Hoard is much faster than either. For example, here's an execution of threadtest on Solaris:
Default: 4.60 seconds
Libmtmalloc: 6.23 seconds
Libumem: 5.47 seconds
Hoard 3.2: 1.99 seconds       

Courtesy : http://www.hoard.org/

Static, Shared Dynamic and Loadable Linux Libraries

This tutorial discusses the philosophy behind libraries and the creation and use of C/C++ library "shared components" and "plug-ins". The various technologies and methodologies used, and insight into their appropriate application, are also discussed. In this tutorial, all libraries are created with the GNU compiler on Linux.

Why libraries are used:
This methodology, also known as "shared components" or "archive libraries", groups multiple compiled object code files into a single file known as a library. Typically, C functions and C++ classes and methods which can be shared by more than one application are broken out of the application's source code, compiled, and bundled into a library. The C standard libraries and the C++ STL are examples of shared components which can be linked with your code. The benefit is that each and every object file need not be stated when linking, because the developer can reference the single library instead. This simplifies the multiple use and sharing of software components between applications. It also gives application vendors a simple way to release an API for interfacing with an application. Large components can be built for dynamic use, so the library remains separate from the executable, reducing its size and thus the disk space used. The library components are then called by the various applications when needed.


Linux Library Types:
There are two Linux C/C++ library types which can be created:
  1. Static libraries (.a): Library of object code which is linked with, and becomes part of the application.
  2. Dynamically linked shared object libraries (.so): There is only one form of this library but it can be used in two ways.
    1. Dynamically linked at run time but statically aware. The libraries must be available during the compile/link phase. The shared objects are not included in the executable but are tied to it at execution.
    2. Dynamically loaded/unloaded and linked during execution (i.e. browser plug-in) using the dynamic linking loader system functions.

Library naming conventions:

Libraries are typically named with the prefix "lib". This is true for all the C standard libraries. When linking, the command line reference to the library will not contain the library prefix or suffix. Thus the following link command: gcc src-file.c -lm -lpthread
The libraries referenced in this example for inclusion during linking are the math library and the thread library. They are found in /usr/lib/libm.a and /usr/lib/libpthread.a.


Static Libraries: (.a)
How to generate a library:
  • Compile: cc -Wall -c ctest1.c ctest2.c
    Compiler options:
    • -Wall: include warnings. See man page for warnings specified.
  • Create library "libctest.a": ar -cvq libctest.a ctest1.o ctest2.o
  • List files in library: ar -t libctest.a
  • Linking with the library:
    • cc -o executable-name prog.c libctest.a
    • cc -o executable-name prog.c -L/path/to/library-directory -lctest
  • Example files:
    • ctest1.c
      void ctest1(int *i)
      {
         *i=5;
      }
                  
    • ctest2.c
      void ctest2(int *i)
      {
         *i=100;
      }
                  
    • prog.c
      #include <stdio.h>
      void ctest1(int *);
      void ctest2(int *);
      
      int main()
      {
         int x;
         ctest1(&x);
         printf("Valx=%d\n",x);
      
         return 0;
      }
                  
Historical note: After creating the library it was once necessary to run the command ranlib libctest.a, which created a symbol table within the archive. Ranlib is now embedded into the "ar" command. Note for MS/Windows developers: The Linux/Unix ".a" library is conceptually the same as the Visual C++ static ".lib" library.
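
Putting the steps together, a complete session with the example files above looks like this (the final line is the program's output):

    cc -Wall -c ctest1.c ctest2.c
    ar -cvq libctest.a ctest1.o ctest2.o
    cc -o prog prog.c -L. -lctest
    ./prog
    Valx=5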


Dynamically Linked "Shared Object" Libraries: (.so)
How to generate a shared object: (Dynamically linked object library file.) Note that this is a two-step process, plus an optional third step:
  1. Create object code
  2. Create library
  3. Optional: create default version using a symbolic link.
Library creation example:
gcc -Wall -fPIC -c *.c
    gcc -shared -Wl,-soname,libctest.so.1 -o libctest.so.1.0   *.o
    mv libctest.so.1.0 /opt/lib
    ln -sf /opt/lib/libctest.so.1.0 /opt/lib/libctest.so
    ln -sf /opt/lib/libctest.so.1.0 /opt/lib/libctest.so.1
          
This creates the library libctest.so.1.0 and symbolic links to it. Compiler options:
  • -Wall: include warnings. See man page for warnings specified.
  • -fPIC: Compiler directive to output position independent code, a characteristic required by shared libraries. Also see "-fpic".
  • -shared: Produce a shared object which can then be linked with other objects to form an executable.
  • -Wl: Pass options on to the linker (note: that is a lowercase "L", not the digit one).
    In this example the options to be passed on to the linker are: "-soname libctest.so.1". The name passed with the "-o" option is passed to gcc.
  • Option -o: Output of operation. In this case the name of the shared object to be output will be "libctest.so.1.0"
Library Links:
  • The link to /opt/lib/libctest.so allows the naming convention for the compile flag -lctest to work.
  • The link to /opt/lib/libctest.so.1 allows the run time binding to work. See dependency below.
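After the mv and ln commands above, the library directory should contain entries along these lines (an illustrative listing):

    /opt/lib/libctest.so -> /opt/lib/libctest.so.1.0
    /opt/lib/libctest.so.1 -> /opt/lib/libctest.so.1.0
    /opt/lib/libctest.so.1.0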
Compile main program and link with shared object library:
Compiling for runtime linking with a dynamically linked libctest.so.1.0:
gcc -Wall -I/path/to/include-files -L/path/to/libraries prog.c -lctest -o prog
Use:
    gcc -Wall -L/opt/lib prog.c -lctest -o prog
      
Where the name of the library is libctest.so. (This is why you must create the symbolic links or you will get the error "/usr/bin/ld: cannot find -lctest".) The libraries will NOT be included in the executable but will be dynamically linked during runtime execution.
List Dependencies:
The shared library dependencies of the executable can be listed with the command: ldd name-of-executable

Example: ldd prog
libctest.so.1 => /opt/lib/libctest.so.1 (0x00002aaaaaaac000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x0000003aa4e00000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003aa4c00000)
    
Run Program:
  • Set path: export LD_LIBRARY_PATH=/opt/lib:$LD_LIBRARY_PATH
  • Run: prog
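If the library path is not set up (via the environment variable above or ldconfig, described below), the loader fails at startup with an error similar to:

    ./prog: error while loading shared libraries: libctest.so.1: cannot open shared object file: No such file or directory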
Man Pages:
  • gcc - GNU C compiler
  • ld - The GNU Linker
  • ldd - List dependencies


Library Path:
In order for an executable to find the required libraries to link with during run time, one must configure the system so that the libraries can be found. Methods available: (Do at least one of the following)
  1. Add the library directories to be included during dynamic linking to the file /etc/ld.so.conf. Sample /etc/ld.so.conf:
    /usr/X11R6/lib
    /usr/lib
    ...
    ..
    /usr/lib/sane
    /usr/lib/mysql
    /opt/lib
                        
    Add the library path to this file and then execute the command (as root) ldconfig to configure the linker run-time bindings.
    You can use the "-f file-name" flag to reference another configuration file if you are developing for different environments.
    See man page for command ldconfig. OR




  2. Add specified directory to library cache: (as root)
    ldconfig -n /opt/lib
    Where /opt/lib is the directory containing your library libctest.so
    (When developing and just adding your current directory: ldconfig -n . Link with -L.) This will NOT permanently configure the system to include this directory. The information will be lost upon system reboot.
    OR




  3. Specify the environment variable LD_LIBRARY_PATH to point to the directory paths containing the shared object library. This will specify to the run time loader that the library paths will be used during execution to resolve dependencies.
    (Linux/Solaris: LD_LIBRARY_PATH, SGI: LD_LIBRARYN32_PATH, AIX: LIBPATH, Mac OS X: DYLD_LIBRARY_PATH, HP-UX: SHLIB_PATH) Example (bash shell): export LD_LIBRARY_PATH=/opt/lib:$LD_LIBRARY_PATH or add to your ~/.bashrc file:
    ...
    if [ -d /opt/lib ];
    then
       LD_LIBRARY_PATH=/opt/lib:$LD_LIBRARY_PATH
    fi
    
    ...
    
    export LD_LIBRARY_PATH
          

    This instructs the run time loader to look in the path described by the environment variable LD_LIBRARY_PATH, to resolve shared libraries. This will include the path /opt/lib.



Library paths used should conform to the "Linux Standard Base" directory structure.


Library Info:
The command "nm" lists symbols contained in the object file or shared library.
Use the command nm -D libctest.so.1.0
(or nm --dynamic libctest.so.1.0)
0000000000100988 A __bss_start
000000000000068c T ctest1
00000000000006a0 T ctest2
                 w __cxa_finalize
00000000001007b0 A _DYNAMIC
0000000000100988 A _edata
0000000000100990 A _end
00000000000006f8 T _fini
0000000000100958 A _GLOBAL_OFFSET_TABLE_
                 w __gmon_start__
00000000000005b0 T _init
                 w _Jv_RegisterClasses
      
Man page for nm
Symbol Type   Description
A             The symbol's value is absolute, and will not be changed by further linking.
B             Un-initialized data (BSS) section
D             Initialized data section
T             Normal code (text) section
U             Undefined symbol: used but not defined, i.e. a dependency on another library.
W             Weak symbol: if another library defines the symbol, that definition is used to resolve the reference.
Also see: objdump man page
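
For example, to list the symbols a program or library still expects some other library to provide, filter the dynamic symbol table for undefined ("U") entries:

    nm -D prog | grep ' U '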


Library Versions:
Library versions should be specified for shared objects if the function interfaces are expected to change (C++ public/protected class definitions), if functions are added to or removed from the library, if a function prototype changes (return data type or argument list), or if data types change (object definitions: class data members, inheritance, virtual functions, ...).
The library version can be specified when the shared object library is created. If the library is expected to be updated, then a library version should be specified. This is especially important for shared object libraries which are dynamically linked. It also avoids the Microsoft "DLL hell" problem of conflicting libraries, where a system upgrade which changes a standard library breaks an older application expecting an older version of the shared object function.
Versioning occurs with the GNU C/C++ libraries as well. This often makes binaries compiled with one version of the GNU tools incompatible with binaries compiled with other versions unless those versions also reside on the system. Multiple versions of the same library can reside on the same system due to versioning. The version of the library is included in the symbol name so the linker knows which version to link with.
One can look at the symbol version used: nm csub1.o
00000000 T ctest1
No version is specified in object code by default.
There is one GNU linker flag that explicitly deals with symbol versioning: specify the version script to use when building the shared library with the flag --version-script=your-version-script-file (passed through gcc as -Wl,--version-script=...).
Note: This is only useful when creating shared libraries. It is assumed that the programmer knows which libraries to link with when static linking. Runtime linking allows opportunity for library incompatibility.
For GNU/Linux, see examples of version scripts in the glibc source tree: sysdeps/unix/sysv/linux/Versions
Some symbols may also get version strings from assembler code which appears in glibc headers files. Look at include/libc-symbols.h.
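As an illustrative sketch (the file name and version tag are hypothetical, not part of the original example), a minimal version script for the libctest library could look like:

    /* libctest.map */
    CTEST_1.0 {
        global:
            ctest1;
            ctest2;
        local:
            *;
    };

and would be applied when creating the shared object:

    gcc -shared -Wl,--version-script=libctest.map -Wl,-soname,libctest.so.1 -o libctest.so.1.0 *.o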
Example: nm /lib/libc.so.6 | more
00000000 A GCC_3.0
00000000 A GLIBC_2.0
00000000 A GLIBC_2.1
00000000 A GLIBC_2.1.1
00000000 A GLIBC_2.1.2
00000000 A GLIBC_2.1.3
00000000 A GLIBC_2.2
00000000 A GLIBC_2.2.1
00000000 A GLIBC_2.2.2
00000000 A GLIBC_2.2.3
00000000 A GLIBC_2.2.4
...
..
      
Note the use of a version script. A library referencing a versioned library: nm /lib/libutil-2.2.5.so
..
...
         U strcpy@@GLIBC_2.0
         U strncmp@@GLIBC_2.0
         U strncpy@@GLIBC_2.0
...
..
      


Dynamic loading and un-loading of shared libraries using libdl:
These libraries are dynamically loaded/unloaded and linked during execution. This is useful for creating a "plug-in" architecture.
Prototype include file for the library: ctest.h
#ifndef CTEST_H
#define CTEST_H

#ifdef __cplusplus
extern "C" {
#endif

void ctest1(int *);
void ctest2(int *);

#ifdef __cplusplus
}
#endif

#endif
      
Use the notation extern "C" so the libraries can be used with both C and C++. This statement prevents the C++ compiler from mangling the function names and thus creating "unresolved symbols" when linking.
Load and unload the library libctest.so (created above), dynamically:
#include <stdio.h>
#include <stdlib.h>   /* for exit() */
#include <dlfcn.h>
#include "ctest.h"

int main(int argc, char **argv) 
{
   void *lib_handle;
   void (*fn)(int *);   /* matches the prototype of ctest1 */
   int x;
   char *error;

   lib_handle = dlopen("/opt/lib/libctest.so", RTLD_LAZY);
   if (!lib_handle) 
   {
      fprintf(stderr, "%s\n", dlerror());
      exit(1);
   }

   fn = (void (*)(int *)) dlsym(lib_handle, "ctest1");
   if ((error = dlerror()) != NULL)  
   {
      fprintf(stderr, "%s\n", error);
      exit(1);
   }

   (*fn)(&x);
   printf("Valx=%d\n",x);

   dlclose(lib_handle);
   return 0;
}
                 
gcc -rdynamic -o progdl progdl.c -ldl
Explanation:
  • dlopen("/opt/lib/libctest.so", RTLD_LAZY);
    Open shared library named "libctest.so".
    The second argument indicates the binding. See include file dlfcn.h.
    Returns NULL if it fails.
    Options:
    • RTLD_LAZY: resolve undefined symbols lazily, only when the code referencing them is executed.
    • RTLD_NOW: all undefined symbols are resolved when dlopen() is called.
    • RTLD_GLOBAL: make the library's symbols available for resolving symbols in subsequently loaded libraries.
  • dlsym(lib_handle, "ctest1");
    Returns the address of the function which has been loaded with the shared library.
    Returns NULL if it fails.
    Note: When using C++ functions, first use nm to find the "mangled" symbol name or use the extern "C" construct to avoid name mangling.
    i.e. extern "C" void function-name();
Object code location: Object code archive libraries can be linked with either the executable or the loadable library. Object code routines used by both should not be duplicated in each. This is especially true for code which uses static variables, such as singleton classes. A static variable is global and thus can only be represented once; including it twice will give unexpected results. The programmer can specify that particular object code be linked with the executable by using linker commands which are passed on by the compiler.
Use the "-Wl" gcc/g++ compiler flag to pass command line arguments on to the GNU "ld" linker.
Example makefile statement: g++ -rdynamic -o appexe $(OBJ) $(LINKFLAGS) -Wl,--whole-archive -L{AA_libs} -laa -Wl,--no-whole-archive $(LIBS)
  • --whole-archive: This linker directive specifies that the libraries listed following this directive (in this case AA_libs) shall be included in the resulting output even though there may not be any calls requiring its presence. This option is used to specify libraries which the loadable libraries will require at run time.
  • --no-whole-archive: This needs to be specified whether you list additional object files or not. The gcc/g++ compiler will add its own list of archive libraries, and you would not want all the object code in those archives linked in if not needed. It toggles the behavior back to normal for the rest of the archive libraries.
Man pages:
  • dlopen() - gain access to an executable object file
  • dlclose() - close a dlopen object
  • dlsym() - obtain the address of a symbol from a dlopen object
  • dlvsym() - obtain the address of a versioned symbol from a dlopen object
  • dlerror() - get diagnostic information


C++ class objects and dynamic loading:
C++ and name mangling:
When running the above "C" examples with the "C++" compiler, one will quickly find that "C++" function names get mangled and thus will not resolve unless the function declarations are protected with extern "C" {}.
Note that the following are not equivalent:
extern "C"
{
   int functionx();
}
      
extern "C" int functionx();
      
The following are equivalent:
extern "C"
{
   extern int functionx();
}
      
extern "C" int functionx();
      
Dynamic loading of C++ classes:
The dynamic library loading routines enable the programmer to load "C" functions. In C++ we would like to load class member functions; in fact, the entire class may be in the library, and we may want to load it and have access to the entire object and all of its member functions. Do this by exporting "C" class factory functions which instantiate and destroy the class.
The class ".h" file:
class Abc {

...
...

};

// Class factory "C" functions

typedef Abc* create_t();
typedef void destroy_t(Abc*);
      

The class ".cpp" file:
Abc::Abc()
{
    ...
}

extern "C"
{
   // These two "C" functions manage the creation and destruction of the class Abc

   Abc* create()
   {
      return new Abc;
   }

   void destroy(Abc* p)
   {
      delete p;   // Can use a base class or derived class pointer here
   }
}
      
This file is the source to the library. The "C" functions manage the instantiation (create) and destruction (destroy) of the class Abc defined in the dynamically loaded library.

Main executable which calls the loadable libraries:
// load the symbols
    create_t* create_abc = (create_t*) dlsym(lib_handle, "create");

...
...

    destroy_t* destroy_abc = (destroy_t*) dlsym(lib_handle, "destroy");

...
...
      

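A fuller sketch of the calling sequence (error handling omitted; the library path and header name "abc.h" are illustrative, not part of the original example):

#include <dlfcn.h>
#include "abc.h"   // hypothetical header declaring Abc, create_t, destroy_t

int main()
{
    void *lib_handle = dlopen("/opt/lib/libabc.so", RTLD_LAZY);

    create_t* create_abc = (create_t*) dlsym(lib_handle, "create");
    destroy_t* destroy_abc = (destroy_t*) dlsym(lib_handle, "destroy");

    Abc* obj = create_abc();   // instantiate Abc inside the library
    // ... call obj's member functions ...
    destroy_abc(obj);          // destroy it inside the library, matching its new

    dlclose(lib_handle);
    return 0;
}
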
Pitfalls:
  • The new/delete of the C++ class should both be performed by the executable or both by the library, but not split between them. This avoids surprises if new/delete is overloaded in one or the other.



Comparison to the Microsoft DLL:
The Microsoft Windows equivalent to the Linux / Unix shared object (".so") is the ".dll". The Microsoft Windows DLL file usually has the extension ".dll", but may also use the extension ".ocx". On the old 16 bit windows, the dynamically linked libraries were also named with the ".exe" suffix. "Executing" the DLL will load it into memory.
The Visual C++ .NET IDE wizard will create a DLL framework through the GUI, and generates a ".def" file. This "module definition file" lists the functions to be exported. When exporting C++ functions, the C++ mangled names are used. Using the Visual C++ compiler to generate a ".map" file will allow you to discover the C++ mangled name to use in the ".def" file. The "SECTIONS" label in the ".def" file defines the portions which are "shared". Unfortunately, the generation of DLLs is tightly coupled to the Microsoft IDE, so much so that I would not recommend trying to create one without it.
The Microsoft Windows C++ equivalent functions to libdl are the following functions:
  • ::LoadLibrary() - dlopen()
  • ::GetProcAddress() - dlsym()
  • ::FreeLibrary() - dlclose()
[Potential Pitfall]: Microsoft Visual C++ .NET compilers do not allow the linking control that the GNU linker "ld" allows (i.e. --whole-archive, --no-whole-archive). All symbols need to be resolved by the VC++ compiler for the loadable library and the application executable individually, and thus it can cause duplication of libraries when the library is loaded. This is especially bad when using static variables (i.e. used in singleton patterns), as you will get two memory locations for the static variable: one used by the loadable library and the other used by the program executable. This breaks the whole static variable concept and the singleton pattern. Thus you cannot use a static variable which is referenced by both the loadable library and the application executable, as the two will be unique and different. To use a unique static variable, you must pass a pointer to that static variable to the other module so that each module (main executable and DLL library) can use the same instantiation. On MS/Windows you can use shared memory or a memory-mapped file so that the main executable and DLL library can share a pointer to an address they both will use.
Cross platform (Linux and MS/Windows) C++ code snippet:
Include file declaration: (.h or .hpp)
class Abc{
public:
   static Abc* Instance(); // Function declaration. Could also be used as a public class member function.

private:
   static Abc *mInstance;      // Singleton. Use this declaration in C++ class member variable declaration.
   ...
};
      

C/C++ Function source: (.cpp)
/// Singleton instantiation
Abc* Abc::mInstance = 0;   // Use this declaration for C++ class member variable
                           // (Defined outside of class definition in ".cpp" file)

// Return unique pointer to instance of Abc or create it if it does not exist.
// (Unique to both exe and dll)

Abc* Abc::Instance() // Singleton ("static" is specified only in the class declaration, not repeated here)
{
#ifdef WIN32
    // If pointer to instance of Abc exists (true) then return instance pointer else look for 
    // instance pointer in memory mapped pointer. If the instance pointer does not exist in
    // memory mapped pointer, return a newly created pointer to an instance of Abc.

    return mInstance ? 
       mInstance : (mInstance = (Abc*) MemoryMappedPointers::getPointer("Abc")) ? 
       mInstance : (mInstance = (Abc*) MemoryMappedPointers::createEntry("Abc",(void*)new Abc));
#else
    // If pointer to instance of Abc exists (true) then return instance pointer 
    // else return a newly created pointer to an instance of Abc.

    return mInstance ? mInstance : (mInstance = new Abc);
#endif
}
      
The Windows linker will pull in two instances of the object: one in the exe and one in the loadable module. Specify one for both to use by employing a memory-mapped pointer, so that both the exe and the loadable library point to the same variable or object.
Note that the GNU linker does not have this problem.
For more on singletons see the YoLinux.com C++ singleton software design pattern tutorial.

Cross platform programming of loadable libraries:

#ifndef USE_PRECOMPILED_HEADERS
#ifdef WIN32
#include <direct.h>
#include <windows.h>
#else
#include <sys/types.h>
#include <dlfcn.h>
#endif
#include <iostream>
#endif

    using namespace std;

#ifdef WIN32
    HINSTANCE lib_handle;
#else
    void *lib_handle;
#endif

    // Where retType is the pointer to a return type of the function
    // This return type can be int, float, double, etc or a struct or class.

    typedef retType* func_t;  

    // load the library -------------------------------------------------
#ifdef WIN32
    string nameOfLibToLoad("C:\\opt\\lib\\libctest.dll");  // note: backslashes must be escaped
    lib_handle = LoadLibraryA(nameOfLibToLoad.c_str());    // ANSI variant; TEXT() only works on string literals
    if (!lib_handle) {
        cerr << "Cannot load library: " << nameOfLibToLoad << endl;
    }
#else
    string nameOfLibToLoad("/opt/lib/libctest.so");
    lib_handle = dlopen(nameOfLibToLoad.c_str(), RTLD_LAZY);
    if (!lib_handle) {
        cerr << "Cannot load library: " << dlerror() << endl;
    }
#endif

...
...
...

    // load the symbols -------------------------------------------------
#ifdef WIN32
    func_t* fn_handle = (func_t*) GetProcAddress(lib_handle, "superfunctionx");
    if (!fn_handle) {
        cerr << "Cannot load symbol superfunctionx: " << GetLastError() << endl;
    }
#else
    // reset errors
    dlerror();

    // load the symbols (handle to function "superfunctionx")
    func_t* fn_handle= (func_t*) dlsym(lib_handle, "superfunctionx");
    const char* dlsym_error = dlerror();
    if (dlsym_error) {
        cerr << "Cannot load symbol superfunctionx: " << dlsym_error << endl;
    }
#endif

...
...
...

    // unload the library -----------------------------------------------

#ifdef WIN32
    FreeLibrary(lib_handle);
#else
    dlclose(lib_handle);
#endif
      


Tools:
Man pages:
  • ar - create, modify, and extract from archives
  • ranlib - generate index to archive
  • nm - list symbols from object files
  • ld - Linker
  • ldconfig - configure dynamic linker run-time bindings
    ldconfig -p : Print the lists of directories and candidate libraries stored in the current cache.
    i.e. /sbin/ldconfig -p |grep libGL
  • ldd - print shared library dependencies
  • gcc/g++ - GNU project C and C++ compiler
  • man page to: ld.so - a.out dynamic linker/loader


Notes:

  • Direct the loader to preload a specific shared library before all others: export LD_PRELOAD=/usr/lib/libXXX.so.x; exec program. Preloads can also be specified in the file /etc/ld.so.preload and extended with the environment variable LD_PRELOAD. (See the interposer sketch at the end of these notes.)
  • Running Red Hat 7.1 (glibc 2.2.2) but compiling for Red Hat 6.2 compatibility.
    See RELEASE-NOTES
    export LD_ASSUME_KERNEL=2.2.5
            . /usr/i386-glibc21-linux/bin/i386-glibc21-linux-env.sh
        

  • Environment variable to highlight warnings, errors, etc: export CC="colorgcc"
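
As an illustration of what LD_PRELOAD makes possible, here is a minimal (hypothetical) interposer that logs every call to malloc. This is a sketch only: a production interposer must also cope with allocations made by dlsym itself.

/* malloc_trace.c -- hypothetical LD_PRELOAD interposer that logs malloc calls */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

void *malloc(size_t size)
{
    /* look up the real malloc in the next library in the search order */
    static void *(*real_malloc)(size_t) = NULL;
    static int inside = 0;               /* guard against recursion from fprintf */

    if (!real_malloc)
        real_malloc = (void *(*)(size_t)) dlsym(RTLD_NEXT, "malloc");

    void *p = real_malloc(size);
    if (!inside) {
        inside = 1;
        fprintf(stderr, "malloc(%zu) = %p\n", size, p);
        inside = 0;
    }
    return p;
}

Build and use:

    gcc -shared -fPIC -o libmalloc_trace.so malloc_trace.c -ldl
    LD_PRELOAD=./libmalloc_trace.so ls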
Courtesy : http://www.yolinux.com/TUTORIALS/LibraryArchives-StaticAndDynamic.html

    Saturday, April 16, 2011

    System Limits

    This chapter describes various limits on the size of files and file systems. These limits are imposed by either the Lustre architecture or the Linux VFS and VM subsystems. In a few cases, a limit is defined within the code and could be changed by re-compiling Lustre. In those cases, the selected limit is supported by Lustre testing and may change in future releases. This chapter includes the following sections:

    33.1 Maximum Stripe Count

    The maximum stripe count is 160. This limit is hard-coded, but is near the upper limit imposed by the underlying ext3 file system. It may be increased in future releases. Under normal circumstances, the stripe count is not affected by ACLs.


    33.2 Maximum Stripe Size

    For a 32-bit machine, the product of stripe size and stripe count (stripe_size * stripe_count) must be less than 2^32. The ext3 limit of 2TB for a single file applies for a 64-bit machine. (Lustre can support 160 stripes of 2 TB each on a 64-bit system.)


    33.3 Minimum Stripe Size

    Due to the 64 KB PAGE_SIZE on some 64-bit machines, the minimum stripe size is set to 64 KB.


    33.4 Maximum Number of OSTs and MDTs

    You can set the maximum number of OSTs by a compile option. The limit of 1020 OSTs in Lustre release 1.4.7 is increased to a maximum of 8150 OSTs in 1.6.0. Testing is in progress to move the limit to 4000 OSTs.
    The maximum number of MDSs will be determined after accomplishing MDS clustering.


    33.5 Maximum Number of Clients

    Currently, the number of clients is limited to 131072. We have tested up to 22000 clients.


    33.6 Maximum Size of a File System

    For i386 systems with 2.6 kernels, the block devices are limited to 16 TB. Each OST or MDT can have a file system up to 16 TB, regardless of whether 32-bit or 64-bit kernels are on the server.
    You can have multiple OST file systems on a single node. Currently, the largest production Lustre file system has 448 OSTs in a single file system. There is a compile-time limit of 8150 OSTs in a single file system, giving a theoretical file system limit of nearly 64 PB.
    Several production Lustre file systems have around 200 OSTs in a single file system. The largest file system in production is at least 1.3 PB (184 OSTs). All these facts indicate that Lustre would scale just fine if more hardware is made available.


    33.7 Maximum File Size

    Individual files have a hard limit of nearly 16 TB on 32-bit systems, imposed by the kernel memory subsystem. On 64-bit systems this limit does not exist, so file size is bounded only by the 64-bit offset range. Lustre imposes an additional size limit determined by the number of stripes, where each stripe is 2 TB. A single file can have a maximum of 160 stripes, which gives an upper single-file limit of 320 TB for 64-bit systems. The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped.


    33.8 Maximum Number of Files or Subdirectories in a Single Directory

    Lustre uses the ext3 hashed directory code, which has a limit of about 25 million files. On reaching this limit, the directory grows to more than 2 GB depending on the length of the filenames. The limit on subdirectories is the same as the limit on regular files in all later versions of Lustre due to a small ext3 format change.
    In fact, Lustre is tested with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB RAM, random lookups in such a directory are possible at a rate of 5,000 files / second.


    33.9 MDS Space Consumption

    A single MDS imposes an upper limit of 4 billion inodes. The default limit is slightly fewer inodes than the device size divided by 4 KB, meaning about 512 million inodes for a file system with a 2 TB MDS. This can be increased initially, at the time of MDS file system creation, by specifying the --mkfsoptions='-i 2048' option on the --add mds config line for the MDS.
    For newer releases of e2fsprogs, you can specify '-i 1024' to create 1 inode for every 1 KB of disk space. You can also specify '-N {num inodes}' to set a specific number of inodes. The inode size (-I) should not be larger than half the inode ratio (-i); otherwise, mke2fs will spin trying to write more inodes than can fit into the device.
    For more information, see Options for Formatting the MDT and OSTs.
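
    As a rough worked example (illustrative numbers): formatting a 2 TB MDT with '-i 2048' yields about 2 TB / 2 KB ≈ 1 billion inodes, roughly double what the default one-inode-per-4 KB ratio provides.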


    33.10 Maximum Length of a Filename and Pathname

    This limit is 255 bytes for a single filename, the same as in an ext3 file system. The Linux VFS imposes a full pathname length of 4096 bytes.


    33.11 Maximum Number of Open Files for Lustre File Systems

    Lustre does not impose a maximum number of open files, but in practice it depends on the amount of RAM on the MDS. There are no "tables" for open files on the MDS, as they are only linked in a list to a given client's export. Each client process probably has a limit of several thousand open files, depending on the ulimit.


    33.12 OSS RAM Size

    For a single OST, there is no strict rule for sizing the OSS RAM. However, as a guideline for Lustre 1.8 installations, 2 GB per OST is a reasonable RAM size. For details on determining the memory needed for an OSS node, see OSS Memory Requirements.

    Reference

    RedHat has sysstat preinstalled and collects system activity data routinely, logging it in the /var/log/sa directory. On Suse it needs to be installed manually.
    The Linux implementation is well described in the Linux.com article CLI Magic: Tracking system performance with sar.
    Sadc (system activity data collector) is the program that gathers performance data. It pulls its data out of the virtual /proc filesystem, then it saves the data in a file (one per day) named /var/log/sa/saDD where DD is the day of the month. Two shell scripts from the sysstat package control how the data collector is run. The first script, sa1, controls how often data is collected, while sa2 creates summary reports (one per day) in /var/log/sa/sarDD. Both scripts are run from cron. In the default configuration, data is collected every 10 minutes and summarized just before midnight.
    If you suspect a performance problem with a particular program, you can use sadc to collect data on a particular process (with the -x argument), or its children (-X), but you will need to set up a custom script using those flags.
    As Dr. Heisenberg showed, the act of measuring something changes it. Any tool that collects performance data has some overall negative impact on system performance, but with sar, the impact seems to be minimal. I ran a test with the sa1 cron job set to gather data every minute (on a server that was not busy) and it didn't cause any serious issues. That may not hold true on a busy system.
    Creating reports
    If the daily summary reports created by the sa2 script are not enough, you can create your own custom reports using sar. The sar program reads data from the current daily data file unless you specify otherwise. To have sar read a particular data file, use the -f /var/log/sa/saDD option. You can select multiple files by using multiple -f options. Since many of sar's reports are lengthy, you may want to pipe the output to a file.
    To create a basic report showing CPU usage and I/O wait time percentage, use sar with no flags. It produces a report similar to this:
    01:10:00 PM       CPU     %user     %nice   %system   %iowait     %idle
    01:20:00 PM       all      7.78      0.00      3.34     20.94     67.94
    01:30:00 PM       all      0.75      0.00      0.46      1.71     97.08
    01:40:00 PM       all      0.65      0.00      0.48      1.63     97.23
    01:50:00 PM       all      0.96      0.00      0.74      2.10     96.19
    02:00:00 PM       all      0.58      0.00      0.54      1.87     97.01
    02:10:00 PM       all      0.80      0.00      0.60      1.27     97.33
    02:20:01 PM       all      0.52      0.00      0.37      1.17     97.94
    02:30:00 PM       all      0.49      0.00      0.27      1.18     98.06
    Average:          all      1.85      0.00      0.44      2.56     95.14
    
    If the %idle is near zero, your CPU is overloaded. If the %iowait is large, your disks are overloaded.
    To check the kernel's paging performance, use sar -B, which will produce a report similar to this:
    11:00:00 AM  pgpgin/s pgpgout/s   fault/s  majflt/s
    11:10:00 AM      8.90     34.08      0.00      0.00
    11:20:00 AM      2.65     26.63      0.00      0.00
    11:30:00 AM      1.91     34.92      0.00      0.00
    11:40:01 AM      0.26     36.78      0.00      0.00
    11:50:00 AM      0.53     32.94      0.00      0.00
    12:00:00 PM      0.17     30.70      0.00      0.00
    12:10:00 PM      1.22     27.89      0.00      0.00
    12:20:00 PM      4.11    133.48      0.00      0.00
    12:30:00 PM      0.41     31.31      0.00      0.00
    Average:       130.91     27.04      0.00      0.00
    
    Raw paging numbers may not be of concern, but a high number of major faults (majflt/s) indicates that the system needs more memory. Note that majflt/s is only valid with kernel versions 2.5 and later.
    For network statistics, use sar -n DEV. The -n DEV option tells sar to generate a report that shows the number of packets and bytes sent and received for each interface. Here is an abbreviated version of the report:
    11:00:00 AM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s
    11:10:00 AM        lo      0.62      0.62     35.03     35.03
    11:10:00 AM      eth0     29.16     36.71   4159.66  34309.79
    11:10:00 AM      eth1      0.00      0.00      0.00      0.00
    11:20:00 AM        lo      0.29      0.29     15.85     15.85
    11:20:00 AM      eth0     25.52     32.08   3535.10  29638.15
    11:20:00 AM      eth1      0.00      0.00      0.00      0.00
    
    To see network errors, try sar -n EDEV, which shows network failures.
    Reports on current activity
    Sar can also be used to view what is happening with a specific subsystem, such as networking or I/O, almost in real time. By passing a time interval (in seconds) and a count for the number of reports to produce, you can take an immediate snapshot of a system to find a potential bottleneck.
    For example, to see the basic report every second for the next 10 seconds, use sar 1 10. You can run any of the reports this way to see near real-time results.
    Benchmarking
    Even if you have plenty of horsepower to run your applications, you can use sar to track changes in the workload over time. To do this, save the summary reports (sar only saves seven) to a different directory over a period of a few weeks or a month. This set of reports can serve as a baseline for the normal system workload. Then compare new reports against the baseline to see how the workload is changing over time. You can automate your comparison reports with AWK or your favorite programming language.
    In large systems management, benchmarking is important to predict when and how hardware should be upgraded. It also provides ammunition to justify your hardware upgrade requests.
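
    If you want to keep summaries past that seven-report window, a minimal cron-driven sketch (script name and archive directory are hypothetical) could be:

    #!/bin/sh
    # archive-sar.sh -- copy yesterday's sar summary report into a
    # long-term archive before sysstat's rotation removes it.
    ARCHIVE=/var/log/sa-archive
    DAY=`date -d yesterday +%d`
    STAMP=`date -d yesterday +%Y-%m-%d`
    mkdir -p ${ARCHIVE}
    cp /var/log/sa/sar${DAY} ${ARCHIVE}/sar-${STAMP}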

    Linux implementation of sar

    The SAR suite of utilities originated in Solaris. It became popular and now runs on most flavors of UNIX, including AIX, HP-UX, and Linux. The sysstat package is installed by default in a standard Red Hat installation. On Suse it is not installed by default, and you need to install the sysstat package manually (the package is provided by Novell).

    The reason for sar's creation was that gathering system activity data from vmstat and iostat is pretty time-consuming. If you try to automate the gathering of system activity data and the creation of periodic reports, you naturally arrive at a tool like sar. To avoid reinventing the wheel again and again, Sun engineers wrote sar (System Activity Reporter) and included it in the standard Solaris distribution. The rest is history.

    The Linux reimplementation is part of the sysstat package and, as in Solaris, sar pulls its data out of the virtual /proc filesystem (pioneered by Solaris).

        The sysstat package contains the sar, sadf, iostat, mpstat, and pidstat commands for Linux. The sar command collects and reports system activity information. The statistics reported by sar concern I/O transfer rates, paging activity, process-related activities, interrupts, network activity, memory and swap space utilization, CPU utilization, kernel activities, and TTY statistics, among others. The sadf command may be used to display data collected by sar in various formats. The iostat command reports CPU statistics and I/O statistics for tty devices and disks. The pidstat command reports statistics for Linux processes. The mpstat command reports global and per-processor statistics.

    In addition to sar, Linux package provides several other useful utilities:

        sadf(1) -- similar to sar but can write its data in different formats (CSV, XML, etc.). This is useful for loading performance data into a database, or importing it into a spreadsheet to make graphs.
        iostat(1) reports CPU statistics and input/output statistics for devices, partitions and network filesystems.
        mpstat(1) reports individual or combined processor related statistics.
        pidstat(1) reports statistics for Linux tasks (processes) : I/O, CPU, memory, etc.
        nfsiostat(1) reports input/output statistics for network filesystems (NFS).
        cifsiostat(1) reports CIFS statistics.

    As in Solaris there are two main binaries and two shell scripts that constitute sar package:

        Binaries
            /usr/lib64/sa/sadc  -- System activity data collector binary, a backend to the sar command. Writes binary log of kernel data to the /var/log/sa/sadd file, where the dd parameter indicates the current day
            /usr/bin/sar -- reporting utility
        Scripts
            /usr/lib64/sa/sa1
            /usr/lib64/sa/sa2

    The system activity data collector binary can monitor a large number of different metrics, selectable by options:

        Input / Output and transfer rate statistics (global, per device, per partition, per network filesystem and per Linux task / PID).
        CPU statistics (global, per CPU and per Linux task / PID), including support for virtualization architectures.
        Memory and swap space utilization statistics.
        Virtual memory, paging and fault statistics.
        Per-task (per-PID) memory and page fault statistics.
        Global CPU and page fault statistics for tasks and all their children.
        Process creation activity.
        Interrupt statistics (global, per CPU and per interrupt, including potential APIC interrupt sources, hardware and software interrupts).
        Extensive network statistics: network interface activity (number of packets and kB received and transmitted per second, etc.) including failures from network devices; network traffic statistics for IP, TCP, ICMP and UDP protocols based on SNMPv2 standards; support for IPv6-related protocols.
        NFS server and client activity.
        Socket statistics.
        Run queue and system load statistics.
        Kernel internal tables utilization statistics.
        System and per Linux task switching activity.
        Swapping statistics.
        TTY device activity.
        Power management statistics (CPU clock frequency, fans speed, devices temperature, voltage inputs).

    Script sa1 calls sadc to write stats to the daily data file:

    #!/bin/sh
    # /usr/lib64/sa/sa1.sh
    # (C) 1999-2006 Sebastien Godard (sysstat  wanadoo.fr)
    #
    umask 0022
    ENDIR=/usr/lib64/sa
    cd ${ENDIR}
    if [ $# = 0 ]
    then
    # Note: Stats are written at the end of previous file *and* at the
    # beginning of the new one (when there is a file rotation) only if
    # outfile has been specified as '-' on the command line...
            exec ${ENDIR}/sadc -F -L 1 1 -
    else
            exec ${ENDIR}/sadc -F -L $* -
    fi

    Script sa2 produces reports from the binary data file:

    #!/bin/sh
    # /usr/lib64/sa/sa2.sh
    # (C) 1999-2006 Sebastien Godard (sysstat  wanadoo.fr)
    #
    # Changes:
    # - 2004-01-22 Nils Philippsen
    #   make history configurable
    #
    HISTORY=7
    [ -r /etc/sysconfig/sysstat ] && . /etc/sysconfig/sysstat
    [ ${HISTORY} -gt 25 ] && HISTORY=25
    S_TIME_FORMAT=ISO ; export S_TIME_FORMAT
    umask 0022
    DATE=`date  +%d`
    RPT=/var/log/sa/sar${DATE}
    ENDIR=/usr/bin
    DFILE=/var/log/sa/sa${DATE}
    [ -f "$DFILE" ] || exit 0
    cd ${ENDIR}
    ${ENDIR}/sar $* -f ${DFILE} > ${RPT}
    find /var/log/sa \( -name 'sar??' -o -name 'sa??' \) -mtime +"${HISTORY}" -exec rm -f {} \;

    Here are files installed:

    /etc/init.d/sysstat
    /etc/sysstat
    /etc/sysstat/sysstat
    /etc/sysstat/sysstat.cron
    /etc/sysstat/sysstat.ioconf
    /usr/bin/iostat
    /usr/bin/mpstat
    /usr/bin/pidstat
    /usr/bin/sadf
    /usr/bin/sar
    /usr/lib64/sa
    /usr/lib64/sa/sa1
    /usr/lib64/sa/sa2
    /usr/lib64/sa/sadc  -- System activity data collector.
    /usr/sbin/rcsysstat
    /usr/share/doc/packages/sysstat
    /usr/share/doc/packages/sysstat/CHANGES
    /usr/share/doc/packages/sysstat/COPYING
    /usr/share/doc/packages/sysstat/CREDITS
    /usr/share/doc/packages/sysstat/FAQ
    /usr/share/doc/packages/sysstat/README
    /usr/share/doc/packages/sysstat/TODO
    /usr/share/doc/packages/sysstat/sysstat-8.0.4.lsm
    /usr/share/locale/af/LC_MESSAGES/sysstat.mo
    /usr/share/locale/da/LC_MESSAGES/sysstat.mo
    /usr/share/locale/de/LC_MESSAGES/sysstat.mo
    /usr/share/locale/es/LC_MESSAGES/sysstat.mo
    /usr/share/locale/fr/LC_MESSAGES/sysstat.mo
    /usr/share/locale/it/LC_MESSAGES/sysstat.mo
    /usr/share/locale/ja/LC_MESSAGES/sysstat.mo
    /usr/share/locale/ky
    /usr/share/locale/ky/LC_MESSAGES
    /usr/share/locale/ky/LC_MESSAGES/sysstat.mo
    /usr/share/locale/nb/LC_MESSAGES/sysstat.mo
    /usr/share/locale/nl/LC_MESSAGES/sysstat.mo
    /usr/share/locale/nn/LC_MESSAGES/sysstat.mo
    /usr/share/locale/pl/LC_MESSAGES/sysstat.mo
    /usr/share/locale/pt/LC_MESSAGES/sysstat.mo
    /usr/share/locale/pt_BR/LC_MESSAGES/sysstat.mo
    /usr/share/locale/ro/LC_MESSAGES/sysstat.mo
    /usr/share/locale/ru/LC_MESSAGES/sysstat.mo
    /usr/share/locale/sk/LC_MESSAGES/sysstat.mo
    /usr/share/locale/sv/LC_MESSAGES/sysstat.mo
    /usr/share/locale/vi/LC_MESSAGES/sysstat.mo
    /usr/share/man/man1/iostat.1.gz
    /usr/share/man/man1/mpstat.1.gz
    /usr/share/man/man1/pidstat.1.gz
    /usr/share/man/man1/sadf.1.gz
    /usr/share/man/man1/sar.1.gz
    /usr/share/man/man8/sa1.8.gz
    /usr/share/man/man8/sa2.8.gz
    /usr/share/man/man8/sadc.8.gz
    /var/log/sa


    To activate sar you need either to create a link in /etc/cron.d to /etc/sysstat/sysstat.cron (the approach taken by Suse) or to copy the file (the simpler approach used by Red Hat).

    On Suse, /etc/init.d/sysstat creates and deletes a symbolic link in cron.d on the start and stop commands:

        lrwxrwxrwx 1 root root 25 Sep 12 10:32 sysstat -> /etc/sysstat/sysstat.cron

    Here is the fragment of the script that does this:

    case "$1" in
        start)
            echo "Running sadc"
            /usr/lib64/sa/sa1 1>/dev/null 2>&1 \
                    && ln -fs /etc/sysstat/sysstat.cron /etc/cron.d/sysstat \
                    || rc_failed 1
            rc_status -v
            ;;

        stop)
            echo "Removing sysstat's crontab"
            rm -f /etc/cron.d/sysstat
            rc_status -v
            ;;

    This is a neat trick which permits running sadc only on certain runlevels, as well as the ability to enable/disable data collection at any time.

    On Red Hat, the /etc/init.d/sysstat script inserts the message LINUX RESTART and tells sar that the kernel counters have been reinitialized (/usr/lib64/sa/sadc -F -L -):

    case "$1" in
      start)
            echo -n "Calling the system activity data collector (sadc): "
            /usr/lib64/sa/sadc -F -L - && touch /tmp/sysstat.run

    # Try to guess if sadc was successfully launched. The difficulty
    # here is that the exit code is lost when the above command is
    # run via "su foo -c ..."
            if [ ! -f /tmp/sysstat.run ]; then
                    RETVAL=1
            else
                    rm -f /tmp/sysstat.run
            fi
            echo
            ;;
      stop|status|restart|reload)
            ;;
      *)
            echo "Usage: sysstat {start|stop|status|restart|reload}"
            exit 1
    esac
    exit ${RETVAL}

    The sysstat bootscript only needs to run at system startup, therefore only one symlink is required (in /etc/rc.d/rcsysinit.d).

    If the sar package is activated, the crontab for root should contain something like this (this is the actual Red Hat file /etc/cron.d/sysstat):

    # run system activity accounting tool every 10 minutes
    */10 * * * * root /usr/lib64/sa/sa1 1 1
    # generate a daily summary of process accounting at 23:53
    53 23 * * * root /usr/lib64/sa/sa2 -A;

    For 64-bit Suse, the content of the file /etc/sysstat/sysstat.cron looks like this:

    root@usrklxbck01:/etc/init.d # cat /etc/sysstat/sysstat.cron
    #crontab for sysstat

    # activity reports every 10 minutes everyday
    -*/10 * * * *     root  /usr/lib64/sa/sa1 -d 1 1

    # update reports every 6 hours
    0 */6 * * *       root  /usr/lib64/sa/sa2 -A

    # Generate a daily summary of process accounting at 23:53
    53 23 * * *       root  /usr/lib64/sa/sa2 -A


    As you can see, there are two shell scripts, sa1 and sa2. The first script invokes the utility sadc to collect the data; the second produces reports.

    System Activity Recorder can monitor several system functions related to overall system performance, for example:

        cpu utilization (it is pretty effective tool for spotting CPU bottlenecks)
        hard disk utilization
        terminal IO
        number of files open
        processes running

    SAR has many options and provides queuing, paging, CPU and many other metrics. The system maintains a series of system activity counters that record various activities and provide the data that sar reports. The command merely extracts the data from the counters and saves it based on the sampling rate and the number of samples specified to sar. The package consists of two programs, sadc and sar:

        The sadc command is the data-collecting part of the package. It writes its output in binary format to the specified file. To run sar in real time, type:

            sar -u 2 5

        In this case the sar command calls sadc to access system data.
        
        To report on previously captured data, type:
            sar -u -f filename > file
        Two shell scripts, /usr/lib/sa/sa1 and /usr/lib/sa/sa2, can be run by the cron daemon to provide daily statistics and reports.

    Side effects

    Any tool that collects performance data has some impact on system performance, but with sar, it seems to be minimal. Even one minute sampling usually does not cause any serious issues. That may not hold true on a system that is very busy.
    Alternatives

    There is also older and not currently maintained atsar project:

        The atsar command can be used to detect performance bottlenecks on Linux systems. It is similar to the sar command on other UNIX platforms. Atsar has the ability to show what is happening on the system at a given moment. It also keeps track of the past system load by maintaining history files from which information can be extracted. Statistics about the utilization of CPUs, disks and disk partitions, memory and swap, tty's, TCP/IP (v4/v6), NFS, and FTP/HTTP traffic are gathered. Most of the functionality of atsar has been incorporated in the atop project.

    http://www.softpanorama.org/Admin/Monitoring/Sar/linux_implementation_of_sar.shtml

    Friday, April 15, 2011

    Solaris Performance Monitoring & Tuning – iostat, vmstat, netstat

    Introduction to iostat , vmstat and netstat
    This document is primarily written with reference to Solaris performance monitoring and tuning, but these tools are available in other Unix variants as well, with slight syntax differences.
    iostat, vmstat and netstat are the three most commonly used tools for performance monitoring. They come built into the operating system and are easy to use. iostat stands for input/output statistics and reports statistics for I/O devices such as disk drives; vmstat gives statistics for virtual memory; and netstat gives network statistics.
    Following paragraphs describes these tools and their usage for performance monitoring.
    Table of content :
    1. Iostat
    * Syntax
    * example
    * Result and Solutions
    2. vmstat
    * syntax
    * example
    * Result and Solutions
    3. netstat
    * syntax
    * example
    * Result and Solutions
    Input Output statistics ( iostat )
    iostat reports terminal and disk I/O activity and CPU utilization. The first line of output is for the time period since boot, and each subsequent line is for the prior interval. The kernel maintains a number of counters to keep track of the values.
    iostat's activity class options default to tdc (terminal, disk, and CPU). If any other options are specified, this default is completely overridden; i.e., iostat -d will report only statistics about the disks.
    syntax:
    The basic syntax is: iostat [option] interval count
    option – lets you specify the device for which information is needed, such as disk, CPU or terminal (-d, -c, -t or -tdc). The -x option gives extended statistics.
    interval – is the time period in seconds between two samples. iostat 4 will give data at 4-second intervals.
    count – is the number of times the data is reported. iostat 4 5 will give data at 4-second intervals, 5 times.
    Example
    $ iostat -xtc 5 2
                              extended disk statistics       tty         cpu
         disk r/s  w/s Kr/s Kw/s wait actv svc_t  %w  %b  tin tout us sy wt id
         sd0   2.6 3.0 20.7 22.7 0.1  0.2  59.2   6   19   0   84  3  85 11 0
         sd1   4.2 1.0 33.5  8.0 0.0  0.2  47.2   2   23
         sd2   0.0 0.0  0.0  0.0 0.0  0.0   0.0   0    0
         sd3  10.2 1.6 51.4 12.8 0.1  0.3  31.2   3   31
    
    The fields have the following meanings:
          disk    name of the disk
          r/s     reads per second
          w/s     writes per second
          Kr/s    kilobytes read per second
          Kw/s    kilobytes written per second
          wait    average number of transactions waiting for service (Q length)
          actv    average number of transactions  actively being serviced
    (removed  from  the  queue but not yet completed)
          %w      percent of time there are transactions  waiting
                  for service (queue non-empty)
          %b      percent of time the disk is busy  (transactions
                      in progress)
    Results and Solutions
    The values to look from the iostat output are:
    * Reads/writes per second (r/s , w/s)
    * Percentage busy (%b)
    * Service time (svc_t)
    If a disk shows consistently high reads/writes, the percentage busy (%b) of the disk is greater than 5 percent, and the average service time (svc_t) is greater than 30 milliseconds, then one of the following actions needs to be taken:
    1.) Tune the application to use disk I/O more efficiently by modifying the disk queries and using the available cache facilities of application servers.
    2.) Spread the file system across two or more disks using the disk striping feature of a volume manager / disksuite etc.
    3.) Increase the system parameter value for the inode cache, ufs_ninode, which is the number of inodes to be held in memory. Inodes are cached globally (for UFS), not on a per-file-system basis.
    4.) Move the file system to a faster disk/controller, or replace the existing disk/controller with a faster one.

    Virtual Memory Statistics ( vmstat )
    vmstat – vmstat reports virtual memory statistics covering process, virtual memory, disk, trap, and CPU activity.
    On multi-CPU systems, vmstat averages the activity across CPUs in the output. Without options, vmstat displays a one-line summary of the virtual memory activity since the system was booted.
    syntax
    The basic syntax is: vmstat [option] interval count
    option – lets you specify the type of information needed, such as paging (-p), cache (-c), or interrupts (-i).
    If no option is specified, information about processes, memory, paging, disk, interrupts and CPU is displayed.
    interval – is the time period in seconds between two samples. vmstat 4 will give data at 4-second intervals.
    count – is the number of times the data is reported. vmstat 4 5 will give data at 4-second intervals, 5 times.
    Example
    The following command displays a summary of what the system is doing every five seconds.
    example% vmstat 5
    procs  memory          page             disk      faults        cpu
         r b w swap  free re mf pi p fr de sr s0 s1 s2 s3  in  sy  cs us sy id
         0 0 0 11456 4120 1  41 19 1  3  0  2  0  4  0  0  48 112 130  4 14 82
         0 0 1 10132 4280 0   4 44 0  0  0  0  0 23  0  0 211 230 144  3 35 62
         0 0 1 10132 4616 0   0 20 0  0  0  0  0 19  0  0 150 172 146  3 33 64
         0 0 1 10132 5292 0   0  9 0  0  0  0  0 21  0  0 165 105 130  1 21 78
    
    The fields of vmstat's display are
    procs
    r     in run queue
    b     blocked for resources I/O, paging etc.
    w     swapped
    memory (in Kbytes)
    swap -  amount  of  swap   space   currently   available
    free   - size of the free list
    
    page ( in units per second).
    re    page reclaims -  see  -S  option  for  how  this
    field is modified.
    mf    minor faults -  see  -S  option  for  how    this
    field is modified.
    pi    kilobytes paged in
    po    kilobytes paged out
    fr    kilobytes freed
    de    anticipated short-term memory shortfall (Kbytes)
    sr    pages scanned by clock algorithm
    disk  ( operations per second )
    There are  slots for up to four disks,
     labeled with a single letter and number.
    The letter indicates  the  type  of disk
     (s = SCSI, i = IPI, etc).
    The number is  the logical unit number.
    
    faults
    in    (non clock) device interrupts
    sy    system calls
    cs    CPU context switches
    
    cpu – breakdown of percentage usage of CPU time. On multiprocessors this is an average across all processors.
    us    user time
    sy    system time
    id    idle time
    Results and Solutions from vmstat
    A. CPU issues
    The following columns have to be watched to determine whether there is a CPU issue:
    1. Processes in the run queue (procs r)
    2. User time (cpu us)
    3. System time (cpu sy)
    4. Idle time (cpu id)
    procs      cpu
         r b w    us sy  id
         0 0 0    4  14  82
         0 0 1    3  35  62
         0 0 1    3  33  64
         0 0 1    1  21  78
    Problem symptoms
    A.) Number of processes in the run queue
    1.) If the number of processes in the run queue (procs r) is consistently greater than the number of CPUs on the system, the system will slow down, because there are more runnable processes than available CPUs (a quick check is sketched after this list).
    2.) If this number is more than four times the number of available CPUs, the system is facing a shortage of CPU power and processes on it will be greatly slowed down.
    3.) If the idle time (cpu id) is consistently 0 and the system time (cpu sy) is double the user time (cpu us), the system is facing a shortage of CPU resources.
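    As a quick check, this minimal sketch compares a sampled run-queue length against the CPU count. It assumes Solaris, where psrinfo prints one line per processor; on Linux, nproc could be substituted.
    # count the CPUs (one psrinfo line per processor on Solaris)
    NCPU=`psrinfo | wc -l`
    # take the run-queue column (procs r, the first field) from a 5-second vmstat sample
    RUNQ=`vmstat 5 2 | tail -1 | awk '{print $1}'`
    if [ "$RUNQ" -gt "$NCPU" ]; then
        echo "run queue ($RUNQ) exceeds CPU count ($NCPU)"
    fi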
    Resolution
    Resolving these kinds of issues involves tuning the application procedures to make efficient use of the CPU and, as a last resort, increasing CPU power or adding more CPUs to the system.
    B. Memory Issues
    Memory bottlenecks are determined by the scan rate (sr), the number of pages scanned by the clock algorithm per second. If the scan rate is continuously over 200 pages per second, there is a memory shortage; a simple watch on sr is sketched below.
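    This minimal sketch assumes the vmstat column layout shown earlier (sr is the 12th field) and uses the 200 pages/second threshold from the paragraph above.
    # warn whenever the page scan rate exceeds 200 pages per second
    vmstat 5 | awk '$12+0 > 200 { print "high scan rate:", $12, "pages/sec" }'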
    Resolution
    1. Tune the applications & servers to make efficient use of memory and cache.
    2. Increase system memory.
    3. Implement priority paging on pre-Solaris 8 versions by adding the line "set priority_paging=1" to /etc/system. Remove this line if upgrading from Solaris 7 to 8 while retaining the old /etc/system file.

    Network Statistics (netstat)
    netstat displays the contents of various network-related data structures, depending on the options selected.
    Syntax
    netstat [options] [interval]
    Multiple options can be given at one time.
    Options
    -a – displays the state of all sockets.
    -r – shows the system routing tables.
    -i – gives statistics on a per-interface basis.
    -m – displays information from the network memory buffers. On Solaris, this shows statistics for STREAMS.
    -p [proto] – retrieves statistics for the specified protocol.
    -s – shows per-protocol statistics. (Some implementations allow -ss to remove fields with a value of 0 (zero) from the display.)
    -D – displays the status of DHCP-configured interfaces.
    -n – do not look up hostnames; display only IP addresses.
    -d (with -i) – displays dropped packets per interface.
    -I [interface] – retrieves information about only the specified interface.
    -v – be verbose.
    interval – a trailing number causes the statistics to be redisplayed continuously at that interval, in seconds.
    Example
    $netstat -rn
    Routing Table: IPv4
        Destination           Gateway               Flags  Ref   Use   Interface
    -------------------- -------------------- ----- ----- ------ ---------
    192.168.1.0         192.168.1.11          U        1   1444      le0
    224.0.0.0           192.168.1.11          U        1   0            le0
    default             192.168.1.1           UG       1   68276
    127.0.0.1           127.0.0.1             UH       1   10497     lo0
    This shows the output on a Solaris machine whose IP address is 192.168.1.11, with a default router at 192.168.1.1.
    Results and Solutions
    A.) Network availability
    The command above is most useful in troubleshooting network accessibility issues. When the outside network is not accessible from a machine, check the following:
    1. whether the default router IP address is correct;
    2. whether you can ping it from your machine;
    3. if the router address is incorrect, change it with the route add command (see man route for more information).
    route command examples (using the default router 192.168.1.1 from the routing table above; the destination 192.0.2.32 is illustrative):
    $route add default 192.168.1.1
    $route add 192.0.2.32 192.168.1.1
    If the router address is correct but you still can't ping it, there may be a network cable/hub/switch problem, and you have to try to eliminate the faulty component.
    B.) Network Response
    $ netstat -i
    Name Mtu  Net/Dest Address  Ipkts  Ierrs  Opkts Oerrs  Collis  Queue
    lo0 8232  loopback localhost  77814  0  77814  0  0  0
    hme0 1500  server1 server1  10658  3  48325  0  279257  0
    This option is used to diagnose network problems when connectivity exists but responses are slow.
    Values to look at:
    * Collisions (Collis)
    * Output packets (Opkts)
    * Input errors (Ierrs)
    * Input packets (Ipkts)
    The above values can be used to work out:
    i. Network collision rate, as follows:
    Network collision rate = Output collision counts (Collis) / Output packets (Opkts)
    A network-wide collision rate greater than 10 percent indicates:
    * an overloaded network,
    * a poorly configured network, or
    * hardware problems.
    ii. Input packet error rate, as follows:
    Input packet error rate = Ierrs / Ipkts
    If the input error rate is high (over 0.25 percent), the host is dropping packets; the hub/switch, cables, etc. need to be checked for potential problems. A sketch that computes both rates from netstat -i follows.
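    The sketch below works both rates out directly from netstat -i. It is minimal and assumes the Solaris column layout shown above (Ipkts is field 5, Ierrs field 6, Opkts field 7, Collis field 9).
    # per-interface collision and input packet error rates, in percent
    netstat -i | awk 'NR > 1 && $5+0 > 0 && $7+0 > 0 {
        printf "%s  collision rate: %.2f%%  input error rate: %.2f%%\n",
               $1, 100 * $9 / $7, 100 * $6 / $5
    }'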
    C. Network socket & TCP connection state
    netstat gives important information about network socket and TCP connection state. This is very useful in finding out the open, closed, and waiting network TCP connections.
    The network states returned by netstat are the following:
    CLOSED       ----  Closed. The socket is not being used.
    LISTEN       ----  Listening for incoming connections.
    SYN_SENT     ----  Actively trying to establish a connection.
    SYN_RECEIVED ----  Initial synchronization of the connection under way.
    ESTABLISHED  ----  Connection has been established.
    CLOSE_WAIT   ----  Remote shut down; waiting for the socket to close.
    FIN_WAIT_1   ----  Socket closed; shutting down the connection.
    CLOSING      ----  Closed, then remote shutdown; awaiting acknowledgement.
    LAST_ACK     ----  Remote shut down, then closed; awaiting acknowledgement.
    FIN_WAIT_2   ----  Socket closed; waiting for shutdown from the remote end.
    TIME_WAIT    ----  Wait after close for the remote shutdown retransmission.
    Example
    #netstat -a
    Local Address  Remote Address  Swind    Send-Q  Rwind  Recv-Q  State
    *.*  *.*  0  0  24576  0  IDLE
    *.22  *.*  0  0  24576  0  LISTEN
    *.22  *.*  0  0  24576  0  LISTEN
    *.*  *.*  0  0  24576  0  IDLE
    *.32771  *.*  0  0  24576  0  LISTEN
    *.4045  *.*  0  0  24576  0  LISTEN
    *.25  *.*  0  0  24576  0  LISTEN
    *.5987  *.*  0  0  24576  0  LISTEN
    *.898  *.*  0  0  24576  0  LISTEN
    *.32772  *.*  0  0  24576  0  LISTEN
    *.32775  *.*  0  0  24576  0  LISTEN
    *.32776  *.*  0  0  24576  0  LISTEN
    *.*  *.*  0  0  24576  0  IDLE
    192.168.1.184.22  192.168.1.186.50457  41992  0  24616  0  ESTABLISHED
    192.168.1.184.22  192.168.1.186.56806  38912  0  24616  0  ESTABLISHED
    192.168.1.184.22  192.168.1.183.58672  18048  0  24616  0  ESTABLISHED 
    If you see a lot of connections in the FIN_WAIT states, the TCP/IP parameters have to be tuned: the connections are not being closed, and they keep accumulating until, after some time, the system may run out of resources. The TCP parameters can be tuned to define a timeout so that connections are released and reused by new connections (a sketch follows).
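    As a hedged example, on Solaris these timers are ndd-tunable. The sketch below shortens the TIME_WAIT and FIN_WAIT_2 timers; values are in milliseconds, and the exact parameter names vary between Solaris releases, so verify them on your system first.
    # inspect the current settings
    ndd -get /dev/tcp tcp_time_wait_interval
    ndd -get /dev/tcp tcp_fin_wait_2_flush_interval
    # shorten both to 60 seconds so stale connections are reaped sooner
    ndd -set /dev/tcp tcp_time_wait_interval 60000
    ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 60000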

    Oracle disk I/O on Linux Tips

    With Linux becoming the most popular OS for Oracle, many professionals have questions about how to manage disk I/O for Linux Oracle databases.  I've devoted over a hundred pages in my book "Oracle Tuning: The Definitive Reference" to Linux disk I/O management, but we still have the issue that super-large disks impose enqueues because the mechanical device can only position its read-write heads over a single cylinder at a time.
    On busy Oracle databases on a single disk spindle, the disk can shake like an out-of-balance washing machine as competing tasks enqueue for data service.  There are several ways to minimize disk I/O for Oracle on Linux:
    • Large data buffers - 64-bit Linux allows for super-large data buffers.  The new solid-state disks provide up to 100,000 I/Os per second, six times faster than traditional disk devices.
       
    • Multiple blocksizes - I/O segregation with multiple blocksizes (e.g., indexes in a 32k blocksize) provides additional I/O manageability.  This is especially important if you are doing full scans in Linux with multi-block reads.
       
    • Linux Direct I/O - Always make sure that you are using direct I/O.  Linux supports direct I/O on a per-filehandle basis (which is much more flexible) via the O_DIRECT open flag; see Kernel Asynchronous I/O (AIO) Support for Linux.  A quick demonstration follows.
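    As a quick, hedged demonstration, GNU dd can open its output file with O_DIRECT via oflag=direct; the file names and sizes below are illustrative, and the target filesystem must support O_DIRECT.
    # buffered write (goes through the Linux page cache)
    dd if=/dev/zero of=/u01/buffered.dat bs=1M count=256
    # direct write (bypasses the page cache via O_DIRECT)
    dd if=/dev/zero of=/u01/direct.dat bs=1M count=256 oflag=direct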
       

    Linux datafile I/O management for Oracle

    Understanding the Linux I/O schedulers (Completely Fair Queuing (CFQ), deadline, noop, and anticipatory) matters for large Oracle systems on Linux, and there are good write-ups on Linux kernel I/O for such systems.  Full-scan access speed is also aggravated by Oracle's willy-nilly block placement under Automatic Storage Management (ASM) and by bitmap freelists (Automatic Segment Space Management).
    The problem with most large Linux Oracle databases is that super-large disk devices have introduced seek-time latency as the read-write heads traverse between the cylinders.
    One author also notes this seek-latency issue in Linux and suggests that changing the I/O scheduler may be an option for very large Oracle Linux databases:
    "When Oracle is performing a full table scan using parallel query it is continually issuing read requests of around 1Mb (for example) for a large set of blocks that are contiguous. Hence there ought to be little or no latency due to disk head movement.
    When another parallel query slave, possibly for the very same query as the first, is also trying to retrieve a large set of contiguous data the danger is that the disk head will continually be flicking around between the two processes, incurring latency each time it does so.
    The most efficient scheduling method would therefore appear to me to be one that allows the second process to wait while satisfying more requests from the first process, thus reducing the disk head movement and increasing the rate of blocks read from disk."
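    On 2.6 kernels the scheduler choice is exposed per block device through sysfs, so experimenting with the deadline behavior described above is a one-line change. A minimal sketch, with sda as an illustrative device name (root privileges required):
    # show the available schedulers; the active one appears in brackets
    cat /sys/block/sda/queue/scheduler
    # switch this device to the deadline scheduler
    echo deadline > /sys/block/sda/queue/scheduler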
     

    Seek time (read-write head movement) remains the largest component of Linux I/O latency.  The Oracle professional can work around this issue by intelligently placing high-I/O, "hot" data files near the middle absolute track of the disk spindle, minimizing read-write head movement.

    Finding disk I/O bottlenecks in Linux

    The majority of the wait time in most large Linux Oracle databases is spent accessing data blocks.  You can run STATSPACK I/O queries to see Linux disk I/O details:
    Top 5 Timed Events
                                                          % Total
    Event                            Waits    Time (s) Ela Time
    --------------------------- ------------ ----------- --------
    db file sequential read            2,598       7,146    48.54
    db file scattered read            25,519       3,246    22.04
    library cache load lock              673       1,363     9.26
    CPU time                              44       1,154     7.83
    log file parallel write           19,157         837     5.68
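    The Top 5 Timed Events listing above is the kind of output a STATSPACK report produces. As a minimal sketch, the canned report can be run from the shell, assuming the PERFSTAT schema is installed (the script prompts for the snapshot range):
    # generate a STATSPACK report; ? expands to $ORACLE_HOME inside sqlplus
    sqlplus perfstat @?/rdbms/admin/spreport.sql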

    Far and away, the easiest way to spot hidden Linux I/O bottlenecks is with Ion, which makes the sources of Linux disk I/O contention immediately apparent.  Ion is especially useful because it tracks workload-related I/O bottlenecks that are often too transient to see with scripts.  I rarely recommend GUI tools, but Ion is an exception because it removes the tedium of running dozens of scripts to locate Linux disk I/O contention.

    References on Linux I/O for Oracle