PostgreSQL

Dear readers of our blog, we'd like to recommend you to visit the main page of our website, where you can learn about our product SQLS*Plus and its advantages.

SQLS*Plus - best SQL Server command line reporting and automation tool! SQLS*Plus is several orders of magnitude better than SQL Server sqlcmd and osql command line tools.

REQUEST COMPLIMENTARY SQLS*PLUS LICENCE

Enteros UpBeat offers a patented database performance management SaaS platform. It proactively identifies root causes of complex revenue-impacting database performance issues across a growing number of RDBMS, NoSQL, and deep/machine learning database platforms. We support Oracle, SQL Server, IBM DB2, MongoDB, Casandra, MySQL, Amazon Aurora, and other database systems.

9 June 2020

PostgreSQL is one of the most popular database management systems. The postgresql project itself has evolved from another project called Ingres. Formally, the development of postgresql began back in 1986. At that time it was called POSTGRES. And in 1996, the project was renamed PostgreSQL, which reflected a greater emphasis on SQL. And actually on July 8, 1996, the first release of the product took place.

What is PostgreSQL?

Since then, many versions of postgresql have been released. The current version is version 12, but regular versions are also released.

PostgreSQL is supported for all major operating systems – Windows, Linux, MacOS.

The official website of the project is postgresql.org.

PostgreSQL evolves as an open source. The source code of the project can be found in the Githaab repository at github.com.

PostgreSQL Development Story

The PostgreSQL project is derived from the POSTGRES project, which was developed under the direction of Michael Stonebraker, a professor at the University of California at Berkeley (UCB). I wanted to show a little bit more about the relationships of pedigree databases in order to better understand the place of PostgreSQL among the major players in the modern database market.

all the most visible RDBMS and the connections between them

I tried to graphically to display all the most visible RDBMS and the connections between them and approximately gave the dates of their creation and end. Crossing of objects means absorption, thus the absorbed object is more pale and not edged. The dollar sign means that the database is commercial. In doing so, I relied on information available on the Internet, particularly in Wikipedia, in scientific articles that I read and comments from direct database users that I received after publishing this picture on the Internet.

It must be said that despite the fact that the entire history of relational databases is less than 4 decades, many facts from the history of creation are treated differently, dates do not agree, and the participants in the events often just freely interpret the past. In addition, databases at that time were the subject of scientific research, so the priority of works is not the last argument when writing memoirs and interviews. Perhaps, given this complexity, the ACM Software System Award #6 was awarded simultaneously to two rival groups of researchers from IBM for their work on “System R” and Berkeley for INGRES, although Stonebraker received the ACM SIGMOD award (now named after Ted Codd, author of relational database theory) #1 in 1992 and Jim Gray (Microsoft) #2 in 1993.

So, as follows from the picture, you can see two branches of database development – one follows from “System R”, which was developed at IBM in the early 70’s, and the other from the project “INGRES”, which was managed by Stonebraker around the same time. These two projects began as a need to make practical use of the relational database model developed by Ted Codd of IBM in 1969 and 1970. It should be remembered that at that time there were two alternative database models – network and hierarchical, and behind them stood a powerful force – CODASYL Data Base Task Group (network) and IBM itself with its IMS (Information Management System with a hierarchical database model). A little aside is Oracle, whose rise is largely due to Ellison’s commercial talent to be in the right place at the right time, as Stonebraker said in an interview, although she, together with IBM, played a major role in creating and promoting SQL.

“System R” has played a big role in the development of relational databases, the creation of SQL language (originally SEQUEL, but because of problems with the existing brand had to throw out all the vowels). From “System R” developed SQL/DS and DB2. Actually, in IBM there were some more projects, but they were purely internal. You can read more about this branch in a very instructive document “The 1995 SQL Reunion: People, Projects, and Politics”, and Russian translation is also available.

INGRES (or Ingres89), unlike “System R”, developed quite in the spirit of Berkeley as an open database whose codes were distributed on tapes almost for free (postage and tape costs were paid). By 1980, about 1000 copies were distributed. The name stands for “INteractive Graphics (and) REtrieval System” and is associated with the French artist Jean Auguste Dominique Ingres by accident. The distinctive feature of this system was that it was developed for the UNIX operating system, which worked on the then widespread PDP 11, which predetermined its popularity, while “System R” worked only on large and expensive mainframes.

The QUEL query language was developed, which, as Stonebraker wrote, is similar to SEQUEL in that the programmer is free from knowledge of data structure and algorithms, which contributes to a significant degree of data independence. The availability of INGRES and the very liberal BSD license, as well as creativity, have contributed to a large number of relational databases, as shown in the figure.

Stonebraker personally contributed to their emergence, so he set up Ingres Corporation in the late 70’s (as he himself explains, he had to go for it because the University of Arizona demanded support), which released a commercial version of Ingres, in 1994 it was bought by CA (Computer Associates) and which in 2004 became open as Ingres r3.

Tandem Computers’ “NonStop SQL” was a modified version of Ingres that worked effectively on parallel computers and distributed data. It was able to execute queries in parallel and scaled almost linearly with the number of processors. Its authors were Berkeley graduates. Subsequently, Tandem Computers was bought by Compaq (2000) and then HP.

Sybase company was also organized by a man from Berkeley (Robert Epstein) and based on Ingres. It is known that Microsoft’s “SQL Server” database is nothing but a Sybase database, which was licensed for Windows NT. Since 1993, Sybase and Microsoft have separated and in 1995 Sybase renamed its database ASE (Adaptive Server Enterprise), and Microsoft began to further develop MS SQL.

Informix also originated from Ingres, but this time not from Berkeley, although Stonebraker still served as its CEO after Informix bought Ilustra in 1995 to add object-relationality and extensibility (DataBlade), which was organized by the same Michael Stonebraker as a result of the commercialization of Postgres in 1992. In 2001, it was bought by IBM, which acquired a significant number of Informix users and technology. Thus, DB2 also acquired a bit of object-relationality.

The Postgres project emerged as a result of the understanding of the Ingres errors and the desire to overcome the limitations of data types by allowing the definition of new data types. Work on the project started in 1985 and during 1985-1988, several articles were published describing the data model, the query language POSTQUEL, and the Postgres repository. POSTGRES is sometimes also referred to as a so-called Postgres DBMS. The limitations of the relational model have always been the subject of criticism, although everyone understood that this is a consequence of its simplicity and its merit. However, the penetration of computer technologies in all spheres of life required new applications, and from databases – support for new data types and features, such as support for inheritance, creation and management of complex objects.

Even when designing the original version of POSTGRES, the main attention was paid to extensibility and object-oriented capabilities. Even then, it was clear that DMBS needed to expand its functionality from data management to object management and knowledge management. In doing so, object functionality would allow for efficient storage and manipulation of non-traditional data types, while knowledge management would allow for storage and enforcement of a collection of rules that carry the semantics of the application. Stonebraker has defined the main task of POSTGRES as “to provide support for applications that require data, object and knowledge management services”.

One of the fundamental concepts of POSTGRES is class. Class is a named collection of instances of objects. Each instance has a collection of named attributes and each attribute has a specific type. Classes can be of three types – the main class, whose instances are stored in the database, the virtual (view) class, whose instances materialize only when requested (they are supported by the rule management system), and can be a version of another (parent) class.

The first version was released in 1989, followed by several other rewrites of the rule system. Note that the Ingres and Postgres codes had nothing in common ! In POSTGRES, support for types such as multidimensional arrays was implemented, which was already in conflict with the relational model, storing the object versionality (later, in version 6.3, this type was removed, because its support required a lot of effort, and versioning could be implemented on the application side using triggers). In 1992 the Illustra company was founded, and the project was closed in 1993 with the release of version 4.2. However, despite the official closure of the project, the open source and BSD license prompted Berkeley’s Andrew Yu and Jolly Chen graduates in 1994 to embark on further development. In 1995, they replaced POSTQUEL with the commonly used SQL query language, the project was named Postgres95, version numbering was changed, the project’s website was created, and many new users (including the author) appeared.

By 1996 it became clear that the name “Postgres95” would not stand the test of time and a new name was chosen – “PostgreSQL”, which reflects the relationship with the original project POSTGRES and the acquisition of SQL. Also, the old version numbering was returned, so the new version started as 6.0. In 1997, an elephant was offered as a logo, an e-mail from March 3, 1997, and a subsequent discussion in the mailing archives -hackers – were saved. The elephant was proposed by David Young in honor of Agatha Christie’s novel “Elephants can remember”.

Before that, the logo was a running leopard (jaguar). The project became big and was taken over by a small group of enterprising users and developers at the beginning, which was called PGDG (PostgreSQL Global Development Group). Further development of the project is fully documented and reflected in the archives of the mailing list – hackers.

What is PostgreSQL today ?

To date, released version of PostgreSQL v8 (19 January 2005), which is a significant event in the world of databases, as the number of new features added in this version, allows us to talk about the emergence of interest in large businesses both in the use and promotion. Thus, the largest company in the world, Fujitsu supported the work on version 8, released a commercial module Extended Storage Management. The liberal BSD license allows commercial companies to release their PostgreSQL versions under their own name and provide commercial support. For example, Pervasive has announced the release of Pervasive Postgres.

PostgreSQL is supported on all modern Unix systems (34 platforms), including the most common ones such as Linux, FreeBSD, NetBSD, OpenBSD, SunOS, Solaris, DUX, as well as under Mac OS X. Since version 8.X PostgreSQL works in “native” mode under MS Windows NT, Win2000, WinXP, Win2003. It is known that there are successful attempts to work with PostgreSQL under Novell Netware 6 and OS2.

PostgreSQL has been repeatedly recognized as the base of the year, for example, Linux New Media AWARD 2004, 2003 Editors’ Choice Awards, 2004 Editors’ Choice Awards.

PostgreSQL is used as a testing ground for a new type of flow-oriented databases – the TelegraphCQ project, which started in 2002 in Berkeley after the successful Telegraph project (the name of the main street in Berkeley). Interestingly, Streambase, which was founded by Mike Stonebraker in 2003 (originally “Grassy Brook”) to commercially promote this new generation of databases, is not in any way associated with the Berkeley project.

Key features and functionality of PostgreSQL

A full list of all features provided by PostgreSQL and a detailed description can be found in the extensive documentation (1300 pages).

Reliability of PostgreSQL

Reliability of PostgreSQL is a proven and proven fact and is provided by the following features:

Full compliance with ACID principles – atomicity, consistency, isolation, data integrity.
Atomicity – a transaction is considered as a single logical unit, all its changes are either saved entirely or rolled back completely.
Consistency – the transaction moves the database from one consistent state (at the moment of transaction start) to another consistent state (at the moment of transaction completion). A consistent database state is considered to be a state when all physical and logical integrity restrictions of the database are executed and integrity restrictions are violated during a transaction, but all integrity restrictions, both physical and logical, must be complied with at the time of transaction termination.
Isolation – data changes in competitive transactions are isolated from each other on the basis of the versioning system.
Durability – PostgreSQL takes care that the results of successful transactions are guaranteed to be saved to the hard disk regardless of hardware failures.
Multiversion Concurrency Control (MVCC) is used to maintain data consistency in competitive environments, while traditional databases use locks. MVCC means that each transaction sees a copy of the data (database version) at the time the transaction starts, even though the database state may have already changed. This protects the transaction from uncoordinated data changes that may have been caused by (another) competitive transaction and provides transaction isolation. The main benefit of using MVCC over lockout is that the lockout that MVCC puts in place for reading does not conflict with a write lockout and therefore reading never locks a write and vice versa. Competitive write operations “interfere” with each other only when working with the same write.
Write Ahead Logging (WAL) is a common mechanism for logging all transactions, which allows to restore the system after possible failures. The basic idea behind WAL is that all changes should be written to files on disk only after these log entries describing the changes are written to disk and guaranteed. This avoids dropping data pages to disk after each transaction is committed, as we know and are confident that we can always restore the database using the transaction log.
Point in Time Recovery (PITR) – the ability to restore the database (using WAL) at any point in the past, which allows for continuous backup of the cluster PostgreSQL.
Replication also improves the reliability of PostgreSQL. There are several replication systems, such as Slony (tested version 1.1), which is a free and most used solution, supports master-slaves replication. It is expected that Slony-II will support multi-master mode.
Data integrity is the heart of PostgreSQL. In addition to MVCC, PostgreSQL supports data integrity at the schema level – these are external keys (foreign keys), constraints (constraints).
PostgreSQL development model, which is absolutely transparent to anyone, as all plans, problems and priorities are openly discussed. Users and developers are in constant dialogue through mailing lists. All proposals and patches are thoroughly tested before they are accepted into the program tree. A large number of beta-testers help to test the version before release and clean up small bugs.
The openness of PostgreSQL codes means they are absolutely available to anyone, and the liberal BSD license does not impose any restrictions on the use of the code.

PostgreSQL performance

PostgreSQL performance is based on the use of indexes, intelligent query scheduler, thin locking system, memory buffer management and caching, excellent scalability in competitive work.

Index support

Standard indices are B-tree, hash, R-tree, GiST (generalized search tree).
Partial indices (partial indices) – you can create an index by a limited subset of values, for example,
create index idx_partial on foo (x) where x > 0;
Functional indexes (function indices) allow you to create indexes using function values from a parameter, for example, create index idx_functional on foo ( length(x) );

The request scheduler is based on the cost of different plans, taking into account many factors. It provides the user with the ability to debug requests and customize the system.
The Lockup System supports locks at a lower level, which helps maintain a high level of competitiveness while protecting data integrity. Lockouts are supported at the table and record level. At the lower level, locks for shared resources are optimized for a specific operating system and architecture.
Buffer management and caching uses sophisticated algorithms to maintain efficiency in the use of allocated memory resources.
Tablespaces to manage storage at the object level, such as databases, charts, tables and indices. This allows for flexible use of disk space and increases reliability, performance, and system scalability.
Scalability is based on the features described above. PostgreSQL’s low resource requirements and flexible lockout system provide scalability, while indexes and buffer management provide good system manageability even at high loads.

The extensibility of PostgreSQL

The extensibility of PostgreSQL means that the user can customize the system by defining new functions, aggregates, types, languages, indices and operators. The object-oriented PostgreSQL allows you to move the application logic to the database level, which greatly simplifies the development of clients, since the entire business logic is in the database. The functions in PostgreSQL are unambiguously defined by the name, number and types of arguments.

The figure shows the ER diagram of PostgreSQL system directory, which contains all the information about system objects, operators and methods of access to them. When initializing a PostgreSQL cluster (command initdb), two databases are created – template0 and template1, which contain a predefined by default set of functions. Any other database inherits template1, so you can add frequently used objects and methods to the template1 system directory.

ER diagram of PostgreSQL

PostgreSQL provides a command interface for working with the system directory, with which you can not only get information about the system objects, but also create new ones. For example, create databases using CREATE DATABASE, new domain – CREATE DOMAIN, operator – CREATE OPERATOR, data type – CREATE TYPE.

To create a new data type and index access methods are enough:

Write input/output functions and register them in the system directory using CREATE FUNCTION
Identify the type in the system directory using CREATE TYPE
Create operators for this data type with CREATE OPERATOR
Write comparison functions and register them in the system catalogue using CREATE FUNCTION.
Create a default operator that will be used to create an index by primary key – CREATE OPERATOR CLASS

The described scenario uses the existing index type. To create new indexes, you need to use GiST.
One of the remarkable features of PostgreSQL is a generalized search tree or GiST (project home page), which allows specialists in a particular field of knowledge to create specialized data types and provides index access to them without being database experts. The analogue of GiST is the technology DataBlade, which is now owned by IBM (see the historical reference above).

The idea of GiST was invented by Professor Berkeley Joseph M. Hellerstein and published in the article Generalized Search Trees for Database Systems. The original version of GiST was developed in Berkeley as a patch for POSTGRES and later incorporated into PostgreSQL. Later in 2001, the code was heavily modified to support variable length keys, multi-attribute indexes and secure NULL, and several bugs were fixed. So far, quite a few interesting GiST-based extensions have been written:

full text search module tsearch2. From version 8.3 full-text search will be built into the PostgreSQL kernel.
module for working with hierarchical data (tree-like) ltree
integer array module intarray

The PostgreSQL distribution in the contrib/ subdirectory contains a large number (about 80) of so-called contrib- modules implementing various additional functionality, such as full-text search, working with xml, mathematical statistics functions, error search, cryptographic modules, etc. Also, there are utilities to facilitate migration with mysql, oracle, for administrative work.

SQL support

SQL support, besides the basic features inherent in any SQL database, PostgreSQL supports:

Very high level of ANSI SQL 92, ANSI SQL 99 and ANSI SQL 2003 compliance. You can read more in the documentation.
Schemes that provide SQL level namespace. Schemes contain tables and can be used to define data types, functions and operators. Using the full object name, you can work with several schemes simultaneously.
Schemes allow organizing databases a set of several logical parts, each of which has its own access policy, data types. For applications that create new objects in the database, it is convenient and safe to create a separate schema (and include it in SEARCH_PATH) in order to avoid possible conflict with object names and the convenience of updating the application.
Subqueries – subqueries (subselects), full SQL92 support. Subqueries make the SQL language more flexible and often more efficient.
Outer Joins – external relations (LEFT, RIGHT, FULL).
Rules – rules according to which the original request is modified. The main difference from triggers is that the rule works at the request level and before the execution of the request, and the trigger is the system response to the data change, i.e. the trigger is launched during the query execution for each modified record (PER ROW). The rules are used to tell the system what actions to take when attempting to update the view.
Views – views, virtual tables. Real instances of these tables do not exist, they materialize only during the query. One of the main purposes of ‘view’ is to separate the access rights to the parent tables from ‘view’ and to ensure that the user interface remains consistent when the parent tables are changed. Updating ‘view’ (materialization) is possible in PostgreSQL using pl/pgsql.
Cursors – cursors that allow to reduce the traffic between client and server, as well as memory on the client, if you want to get not the whole result of the query, but only part of it.
Table Inheritance – inheritance of tables, which allows you to create objects that inherit the structure of the parent object and add their specific attributes. Inheritance – inherits the default attribute values (DEFAULTS) and integrity restriction (CONSTRAINTS). The parent table search automatically enables child objects to be searched, while retaining the ability to search only (only). Inheritance can be used to work with very large tables for partitioning emulation.
Prepared Statements are server-side objects that represent the original query after the PREPARE command, which has already passed the parser, modify the query (rewriting rules), and create a planner for the query, so you can use the EXECUTE command, which no longer requires these stages. For complex queries, this can be a big win.
Stored Procedures – Server (stored) procedures allow you to implement the business logic of the application on the server side. In addition, they can greatly reduce traffic between client and server.
Savepoints(nested transactions) – unlike “flat transactions”, which do not have intermediate commit points, the use of savepoints allows you to cancel the operation of the part of the transaction, for example, due to the incorrectly entered command, without affecting the rest of the transaction. This can be very useful for transactions that deal with large amounts of data.
System object access rights based on the privilege system. The owner of the object or a superuser can both allow access (GRANT) and cancel (REVOKE).
Messaging system between processes – LISTEN and NOTIFY allow you to organize an event model of interaction between the client and the server (the client receives the name of the event, assigned by the command notify and process PID).
Triggers allow managing the system response to data changes (INSERT, UPDATE, DELETE), both before the operation itself (BEFORE) and after (AFTER). During a trigger, special variables NEW (a record to be inserted or updated) and OLD (a record before the update) are available.
Cluster table – arranges the table records on the disk according to an index, which sometimes speeds up the query execution by reducing the disk access.

A rich set of PostgreSQL

A rich set of PostgreSQL data types includes:

Character types (character(n)) as defined in SQL standard and text type with almost unlimited length.
Numeric type supports arbitrary accuracy, which is very popular in scientific and financial applications.
Arrays according to SQL:2003 standard.
Large Objects allow storing binary data up to 2Gb in a database.
Geometric types (point, line, circle, polygon, box,…) allow working with spatial data on the plane.
GIS types in PostgreSQL are proof of the extensibility of PostgreSQL and allow you to work effectively with three-dimensional data. Details can be found at the PostGis project website.
Network types support inet data types for IPV4, IPV6 as well as cidr (Classless Internet Domain Routing) blocks and macaddr.
Composite types combine one or more elementary types and allow users to manipulate complex objects.
Time types (timestamp, interval, date, time) are implemented with very high precision.
The serial and bigserial pseudotypes allow organizing an ordered sequence of integers (AUTO_INCREMENT in some DBMS).
PostgreSQL has a very rich set of built-in functions and operators for working with data, the full list of which can be found in the documentation.
Support for 25 different character sets (charsets), including ASCII, LATIN, WIN, KOI8 and UNICODE, as well as support for locale, which allows you to work correctly with data in different languages.
NLS(Native Language Support) – Documentation, error messages are available in various languages, including Japanese, German, Italian, French, Russian, Spanish, Portuguese, Slovenian, Slovak, and several Chinese dialects.
Interfaces in PostgreSQL are implemented to access the database from a number of languages (C,C++,C#,python,perl,ruby,php,Lisp and others) and data access methods (JDBC, ODBC).
Procedural languages allow users to develop their functions on the server side, thus transferring the application logic to the database side, using programming languages other than built-in SQL and C. By now, PL/pgSQL, pl/Tcl, Pl/Perl and pl/Python are supported (included in the standard distribution). Besides them, there is support for PHP, Java, Ruby, R, shell.
Ease of use has always been an important factor for developers.

The psql utility (included in the distribution package) provides a user-friendly interface for working with the database, contains a brief SQL reference, facilitates entering commands (using the arrows for repeating and the tabulator for extension), supports history and query buffer, and allows you to work both interactively and in streaming mode.

phpPgAdmin (GPL license) provides the ability to administer a PostgreSQL cluster using a web browser.

pgAdmin III (GNU Artistic license) provides a convenient interface for working with PostgreSQL databases and runs on Linux, FreeBSD and Windows 2000/XP.

PgEdit is an application development environment and SQL editor, available for Windows and Mac.

Data security is also a critical aspect of any DBMS. In PostgreSQL it is provided by 4 security levels:

PostgreSQL cannot be run under a privileged user – system context
SSL,SSH traffic encryption between client and server – network context
A sophisticated authentication system at the host or IP address/subnet level. The authentication system supports passwords, encrypted passwords, Kerberos, IDENT, and other systems that can connect using the authentication plugin mechanism.
Detailed system of access rights to all database objects, which together with the scheme that provides isolation of object names for each user, PostgreSQL provides a rich and flexible infrastructure.

PostgreSQL limits

Title	Meaning
Maximum database size	Unlimited
Maximum table size	32 TB
Maximum recording length	400Gb
Maximum attribute length	1 GB
Maximum number of records in the table	Unlimited
Maximum number of attributes in the table	250 – 1600 depending on the type of attribute
Maximum number of indexes per table	Unlimited

Summary table of major relational databases

Title	ASE	DB2	FireBird	InterBase	MS SQL	MySQL	Oracle	PostgreSQL
License	$$$	$$$	IPL²	$$$	$$$	GPL/$$$	$$$	BSD
ACID	Yes	Yes	Yes	Yes	Yes	Depends¹	Yes	Yes
Referential integrity	Yes	Yes	Yes	Yes	Yes	Depends¹	Yes	Yes
Transaction	Yes	Yes	Yes	Yes	Yes	Depends¹	Yes	Yes
Unicode	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
Schema	Yes	Yes	Yes	Yes	No⁵	No	Yes	Yes
Temporary table	No	Yes	No	Yes	Yes	Yes	Yes	Yes
View	Yes	Yes	Yes	Yes	Yes	No	Yes	Yes
Materialized view	No	Yes	No	No	No	No	Yes	No³
Expression index	No	No	No	No	No	No	Yes	Yes
Partial index	No	No	No	No	No	No	Yes	Yes
Inverted index	No	No	No	No	No	Yes	Yes	Yes⁶
Bitmap index	No	Yes	No	No	No	No	Yes	No
Domain	No	No	Yes	Yes	No	No	Yes	Yes
Cursor	Yes	Yes	Yes	Yes	Yes	No	Yes	Yes
User Defined Functions	Yes	Yes	Yes	Yes	Yes	No⁴	Yes	Yes
Trigger	Yes	Yes	Yes	Yes	Yes	No⁴	Yes	Yes
Stored procedure	Yes	Yes	Yes	Yes	Yes	No⁴	Yes	Yes
Tablespace	Yes	Yes	No	?	No⁵	No¹	Yes	Yes
Title	ASE	DB2	FireBird	InterBase	MS SQL	MySQL	Oracle	PostgreSQL

Comments:

1 – InnoDB is required to support transactions and reference integrity (not a default table type).
2 – Interbase Public License
3 – Materialized views can be emulated on PL/pgSQL
4 – only in MySQL 5.0, which is the experimental version
5 – only in MS SQL Server 2005 (Yukon)
6 – GIN (Generalized Inverted Index) from version 8.2

What is expected in future versions of PostgreSQL

The full list of new features is given in the large TODO list, which has been supported by Bruce Momjian for many years, but priorities for version 8.1 have not yet been defined, moreover, the duration of the development cycle has not yet been defined. For now, it can be stated with sufficient confidence that in 8.1, in addition to bug fixes and improvements to existing functionality or the translation of the syntax to the SQL standard will:

Bitmap indexes (initial submit CVS)
Integration of autovacuum into the server process
Two phase commit JDBC driver
Support IN,OUT,INOUT parameters for pl/pgsql (CVS)
Increasing the limit of the maximum number of arguments for the function (100 by default) (CVS)
Optimization of MIN,MAX by using indices (CVS)
Support UTF-16
GiST Concurrency & Recovery ! (CVS)

Buffer management in PostgreSQL

Page caching, or storing the pages read from disk in memory, is very important for the efficient work of any DBMS, as the disk access and memory times differ by many orders of magnitude. Ideally, we want all pages that are accessed to be accessed to the memory, so that its subsequent use does not require access to the disk. However, since the amount of memory available is limited, there is a situation where you need to decide which page to release (replace) in order to put a new page in the cache. Almost all commercial systems use this or that variation of the LRU (Least Recently Used) algorithm, which frees up the page that has not been accessed for the longest time.

In its pure form this algorithm is not very good for using in the DBMS due to the large variety of query sequences, for example, it does not take into account the page access frequency, is not protected from “cache flooding” when only one single consecutive reading of a large number of pages (sequential scan) may fill the cache with pages that may not be accessed anymore, i.e., a complete loss of caching efficiency. Sometimes, the term “scan-resistant” is used when a good algorithm is said to be resistant to “cache flooding”.

PostgreSQL used a variant of this algorithm, known as LRU/K, implemented by Tom Lane. This algorithm uses the history of the K last accesses to the page (exactly the last, which allows this algorithm to adapt to changes in the query pattern, as opposed to the LFU algorithm), which allows you to distinguish popular pages from the long gone. To do this, build an ordered queue (priority queue) of pointers to pages in the cache based on the time to access the page by the rule: if the page P1 K-top reference (the penultimate, for the most important case K = 2, LRU/2 ) is more recent than P2, then P1 will be replaced after P2. The classic LRU algorithm can be seen as LRU/1, since it only used information about one (last) page access. The important thing is not that there was a single page access, but how popular this page was for some time. However, this algorithm required a non-trivial setting and the time for queuing grows logarithmically depending on the buffer size.

The ARC (Adaptive Replacement Cache) algorithm was attractive because it took into account not only how often the page was used, but also how recently this had happened and not how much CPU load it had, as was the case with the LRU/K algorithm. It dynamically maintains a balance between frequently used and recently used pages. This algorithm was implemented by Jan Wieck for version 7.5 (later 8.0), which was later slightly improved after the article describing the CAR (Clock with Adaptive Replacement) algorithm.

However, just two days before PostgreSQL 8.0 was released, it was discovered (see Neil Conway’s posting and subsequent discussion) that IBM had applied for the ARC algorithm back in 2002. Since it was already late to change anything, it was decided to release 8.0 as is, and then to deal with the problem. Despite the fact that IBM had not yet received a patent for the ARC algorithm, and that IBM had good practice of supporting OSS projects, and one could hope to get official permission to use it in PostgreSQL, as many suggested, it was decided to investigate the actual patent infringement and to find out if it was possible to replace the ARC algorithm with a “patent-pure” algorithm.

The main argument for replacing the algorithm was to keep PostgreSQL available for “any use” under a BSD license that allows commercial use of PostgreSQL without any license fees. In early February 2005, Tom Lane proposed a modified version of the ARC algorithm, close to 2Q and published in 1994 long before ARC, which solved the problem of “scan resistant” and did not require major changes in the code (mainly removing the code), which was implemented in version 8.0.2. The 2Q algorithm (Two Queue) is almost as efficient as LRU/K, but simpler, requires no configuration and faster. It achieves this by storing only “hot” pages in the main buffer, rather than clearing “cold” pages in the main buffer as LRU/2. Simplified algorithm looks like this: the first time the pointer to the page is placed in the queue A1 (FIFO), and if at the second turn the page was still in A1, the page is called hot and placed in the main buffer, which is already controlled as an LRU queue. If the page was not accessed while it was in A1, then the page is probably “cold” and the 2Q algorithm removes it from the buffer.