This is a collection of many performance tips that we've
gathered together at Affymetrix, and I thought it would be useful to post them
to the PostgreSQL news group.
The single most helpful trick has been the "Partial
index trick" at the bottom and the use of temp tables. Most of
these tricks came from either this news group, or from my colleagues in the
bioinformatics department, so I'd like to thank and acknowledge both
groups.
I'd like to thank Tom Lane, who
clearly has been working very hard on the Optimizer, and all the other people
who have worked on Postgres. Your efforts have been invaluable to us.
Keep up the good work!
We are currently working on a Transcriptome project,
which is a follow-on to the human genome project, in which we systematically
look across all parts of the genome to see what is expressed in the form of RNA.
It is publicly funded by the National Cancer Institute and the data is made
publicly available at: http://www.netaffx.com/transcriptome/
We currently have about 100GB of data and will soon
grow to a multi-terabyte system. We have tables of up to 1 billion rows
and have been able to get ~1 million row queries to run in about 5 min. We've
been very pleased with postgres. After a couple of major outages in our
server room, it came back up flawlessly each time. So it has been an
invaluable asset for this project. We run 7.2 on Red Hat on a 2-processor
machine with SAN, and we have a 128-node linux cluster which will make analysis
runs against the database.
Our main request is continued enhancement of the
optimizer for these heavy types of queries. Improved use of indexes,
ability to control execution plans explicitly, ability to use indexes for data
retrieval without touching the table in certain cases, and other such features
would be very useful. I'm also curious to hear about whether there
is any good clustering system for making a parallel postgres installation, and
if others have experience with creating such large databases.
We've been very happy and impressed with the
constant improvements to the system. Thank You!
This page
is a long detailed list of performance tips for doing heavy duty queries.
- Indexes 1. Indexes are critical. Create exactly
the combined (multi-field) indexes that are being joined in a particular
join. The order of fields in the index and in the join must match exactly.
- Indexes 2. Multi-Field Indexes. Having indexes on
individual columns as well as combinations of 2,3,and 4 columns can help.
Sometimes is uses the 3 version, and sometimes it uses one 2 and one
singlet index. This can be helpful, especially when seq scan is turned off
and you are using limit.
- Indexes 3. Remember that multiple-field indexes
must have the fields in the correct order as they are accessed in the
query. An index can only be used to the extent allowed by the keys. An index
over (A B C) can be used to find (A B), but not (B C).
- Vacuum. Always vacuum analyze the table(s) after
creating indices (or loading/deleting data).
- Limit and Order by. May have to use order by
and/or limit to use the indexes. May need to use order by with limit.
Sometimes order by increases speed by causing use of an index. Sometimes
it decreases speed because a Sort step is required. A where condition that
is sufficiently restrictive may also cause an index to be used.
- Join Order. Order of fields, joins, and order by
fields has a big impact.
- Casting 1. May have to explicitly cast things.
For instance where x=3 must become (where x=cast(3 as smallint)). This can
make a huge difference.
- Casting 2. Simply adding abs(destype)=(cast 111
as smallint) to my query and turning seq scans off seems to change the
query execution plan. Writing this as (destype=111 or destype=-111) makes
the cost over 7 times higher!!
- Seq Scans 1. Can you disable seq scans? Yes, you
can type "set enable_seqscan=no;" at the psql prompt and disable
it. Do not be surprised if this does not work though. You can also disable
merges, joins, nested loops, and sorts. Try this and attempt to enable the
correct combination that you want it to use.
- Seq Scans 2. In general you would like it to use
an index, but don't be afraid to try the seq scans if cost is say <
150,000 and see if it it finishes in a few minutes. For large joins with
no where clause, Postgres always uses seq scans. Try to add a where
clause, even a non-restrictive one, and use an index. However, remember
that postgres must go get the table data too, so this can be more costly. Postgres
cannot read data solely from an index (some commercial databases can).
- Seq Scans 3. Sometimes it is true that seq scans
are faster. It tries to use the analyzed statistics to decide which is
better. But don't always trust it, try it both ways. This is why analyzing
your table will produce different execution plans at after analysis -- The
analysis step will update the stats of the table. The change in estimated
costs might cause a different plan to be chosen.
- Explain Output. Reading the Explain output can be
confusing. In general, the numbers are a range. If you are trying to just
get some rows back, you'd like the left most number to be 0. This means
that the right-most number will probably not happen, because you will not
really have to search the entire table. The right-most number is an upper
bound. The numbers sum as you go up. What you don't want is a large number
for both the min and max. Sometimes a cost of about 100,000 takes about 3
minutes. Sometimes this is not accurate. Sometimes I was able to to see a
lower seq scan cost, but when I disable seq scans and used indexes, the
actual performance was faster. In general the cost is in
milliseconds. Use Explain Analyze which will run through they query
and produce actual times.
- SQL tricks. Remember the standard SQL tricks
which I will not cover here (get a good thick SQL book). For example using
Like, etc. can be slow. Remember that if there is no data in your table
for a given where clause, it must scan the entire result just to tell you
"no results found" so know your data in advance.
- Nested loops are probably the most expensive
operation.
- Having several merges and sorts can be way better
than having a single nestloop in your query.
- Explicit Joins. For more than 2 joined tables,
consider using explicit joins (see: http://www.ca.postgresql.org/users-lounge/docs/7.1/postgres/explicit-joins.html)
- News Groups. Try the postgres news groups: http://www.us.postgresql.org/users-lounge/index.html
- Hardware/Configuration changes. I won't go into a
lot of detail here as this page is more about the query optimizer, but you
can look at how much your CPU and memory is being taxed, and try running
postmaster with various flags to increase speed and memory. However, if
your query plan is not coming out right this will have little impact.
- Identities. You can try typing "and a.id=a.id"
and this will actually help encourage the query planner to use an index.
In one example, select with x=x and y=y order by x worked best (order by y
too made it worse!).
- Temp tables. You may want to explicitly control
the query by breaking it into several steps, with intermediate tables
being created along the way. You can make these true temp tables, which
will go away when you log out, or you may want to keep them around. You
might want to create a procedure or script that automates/hides this
process.
- Views. Views sometimes say that they are adding a
step to the query planner, but it does not seem to impact query speed. But
if you add more clauses to the view this may change the query plan in a
bad way, which is confusing to the user.
- Stored Procedures. Try writing a stored procedure
to more explicitly control the query execution. If you do this break out
SQL into many small cursors instead of 1 large cursor, otherwise you will
run up against the same problems.
- External programs. As above, breaking out a query
into a series of small, explicit nested loops in a C, Perl, or other
client program, may actually improve performance (especially if you want a
subset of results/tables).
- Monitor Query Progress. Alan Williams provided a
good trick to monitor the progress of a long running query. If you add to
the query a sequence (select nextval('sq_test'),...) then you can use
select currval('sq_test') to see how far the query has progressed.
- Partial Indices. You can use this feature to
force use of an index!!! (it is also useful as a true partial index).
Assume table1 below has no rows where field1=0. By doing the actions
below, it stores the clause field1<>0 in pg_index and when it sees
that predicate, it always uses the partial index. In this case we are
using it as a full index to trick it. Example:
create
index i on table1(field1) where field1 <> 0;
select * from table1 where field1<>0;
Shane Brubaker
BioInformatics Engineer
Affymetric, Inc.
|