Saturday, December 30, 2006

DISTINCT ON 300ms vs GROUP BY 1300ms for 10 entries

recentlycommentedposts = web.query('SELECT DISTINCT ON(comments.postid) comments.postid, comments.comment, posts.title, posts.id FROM comments JOIN posts on posts.id=comments.postid order by comments.postid DESC limit 10')
recentlycommentedposts = web.query('''select postid, max(created) as created
from comments
group by postid
order by created desc
limit 10;''')

select p.id, p.title, p.created as post_date, c.comment, a.created as comment_date
from posts p,
comments c,
(select postid, max(created) as created
from comments
group by postid
order by created desc
limit 10) a
where
p.id = a.postid
and c.postid = a.postid
and c.created = a.created
order by post_date desc;
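
If you want to double-check which form is faster on your own data, EXPLAIN ANALYZE both versions (just a sketch; the timings in the title are from my box and will differ elsewhere):

EXPLAIN ANALYZE
SELECT DISTINCT ON (postid) postid, comment
FROM comments
ORDER BY postid DESC
LIMIT 10;

EXPLAIN ANALYZE
SELECT postid, max(created) AS created
FROM comments
GROUP BY postid
ORDER BY created DESC
LIMIT 10;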

"htpasswd" and /etc/httpd/conf/httpd.conf

there is probably a
so anywhere after that

after that change, we need to make the /www/svn_new/main/auth.conf file
the "htpasswd" command does this
from shell,
cd /www/svn_new/main/
htpasswd -c auth.conf travis
(-c creates the file if it doesn't exist)
then to add future users, for example,
htpasswd auth.conf travis
5:27:38 pm
mark
htpasswd -c /etc/httpd/repo_passwd user

Postgresql VIEWs will save your day

oh yea, thats right mysql sucks ass for the complex queries, like even
select * from users where company_id in (select id from company where company_name='foo') ;
is that a join
Travis :
yea, it could be done with a join too
the above was running the output of one select into the input of a second.
mark :
i m doing a double join
i think it is super slow
2:04:33 pm
Travis
a join would be
select u.*
from users u, company c
where u.company_id = c.id
and c.company_name='foo'
i found with postgresql, if i was always running this sort of query
then i create a view
===================================
create or replace view v_company_users as
select u.*, c.company_name
from users u, company c
where u.company_id = c.id;
2:05:34 pm
mark
what is a view?
2:05:40 pm
Travis
then my app does
select * from v_company_users where company_name='foo'
so, instead of writing the join and its parameters every time, create a view; the view barfs out the joined columns and you just put the WHERE on the view.
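
For what it's worth, PostgreSQL expands the view definition at plan time, so filtering on the view should end up with the same plan as the hand-written join; comparing the two with EXPLAIN is an easy way to convince yourself (a sketch, using the v_company_users view defined above):

EXPLAIN SELECT * FROM v_company_users WHERE company_name = 'foo';

EXPLAIN SELECT u.*, c.company_name
FROM users u, company c
WHERE u.company_id = c.id AND c.company_name = 'foo';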

Rebuilding large GIN indexes

Rebuilding large GIN indexes in PostgreSQL takes time. A lot of time. With rebuilding, I mean when you for example have to UPDATE all your tsvector fields due to a change in tsearch2 configuration. (Yes, I had to do that because I had a slightly incorrect tsearch2 configuration for the archives search database.) So don't do it. Instead, use the fact that PostgreSQL has nice transactional DDL and do something like this:


BEGIN TRANSACTION;

CREATE TABLE messages_new AS SELECT id,txt,to_tsvector(txt) AS fti

FROM messages;

CREATE INDEX messages_new_fti ON messages_new USING gin(fti);

ANALYZE messages_new;

ALTER TABLE messages RENAME TO messages_old;

ALTER TABLE messages_new RENAME TO messages;

COMMIT;

DROP TABLE messages_old;

ALTER INDEX messages_new_fti RENAME TO messages_fti;


(apologies for any typos, I didn't bother to actually type these commands into the database again, and I lost my cut-and-paste of what I ran)

This way, the messages table can keep serving searches without any disruption while the new table and index are built. And creating the new index from scratch is a lot faster than updating the existing one when you have to touch every row.
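
If you are curious how big the new GIN index ends up, this is a quick way to check (my own addition, assuming the 8.1+ size functions are available):

SELECT pg_size_pretty(pg_relation_size('messages_fti'));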

Getting random rows from a database table

Selecting random rows from a table in your database is generally useful for two things: grabbing one or more rows to display and/or use somehow, and for selecting a random subset of your rows and performing some sort of statistical analysis on the data. While the standard way of using ORDER BY RANDOM() is occasionally useful, it is very slow, it is non-repeatable, and it does not scale well. I'll demonstrate some better methods to get random rows.

For this article, I'll be using a table named mydata which contains ten million rows of data, and a primary key named id, which is of type bigint.
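
For reference, a similar test table can be built along these lines if you want to follow along (my own sketch, not part of the original example):

CREATE TABLE mydata AS
SELECT g::bigint AS id FROM generate_series(1,10000000) g;

ALTER TABLE mydata ADD PRIMARY KEY (id);
ANALYZE mydata;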

First, it's important to distinguish between random and unordered. If you simply pull rows from your table without an ORDER BY clause, they may appear random, but they are not: they are simply in an undefined order. In PostgreSQL, they will be roughly in an order related to the last time they were updated or inserted. However, by "random" we really mean that any row in the table has as much chance of appearing as any other row within our SELECT statement. We'll need some way to accomplish this in SQL.


ORDER BY RANDOM()

How do we get something "random" into our query? Every modern computer has some way of generating a random number, and PostgreSQL has a way as well: the built in RANDOM() function, which generates a double precision number from 0.0 to 1.0:

SELECT RANDOM() FROM generate_series(1,5);

Running the above yields something like this:

random
-------------------
0.739388620946556
0.706028273329139
0.150622034911066
0.396196706686169
0.412912936415523

(Note the use of the nifty new generate_series() function to repeat a SQL command a certain number of times).

PostgreSQL also allows you to use RANDOM() in the ORDER BY clause, which is one way to get a random row from the database. Let's pull out three random values from our test table:

SELECT id FROM mydata ORDER BY RANDOM() LIMIT 3;

id
---------
3635389
9417084
8385175

This appears to work just fine, but it has a major drawback - it does not scale, and gets extremely slow as the table size increases. This is a consequence of how ORDER BY RANDOM() works - it basically assigns a random number to every row in the database, then orders the entire table by the random numbers, and then returns the rows you want. For small tables, this is not much of a problem, but this is a terrible solution as the tables grow in size. Here's a breakdown on speeds on my system for grabbing a single row by using the query SELECT id FROM mydata ORDER BY RANDOM() LIMIT 1:

Number of rows                     Time to run
One thousand (1,000)               3 milliseconds
Ten thousand (10,000)              40 milliseconds
One hundred thousand (100,000)     Half a second
One million (1,000,000)            7 seconds
Ten million (10,000,000)           149 seconds

Fortunately, there are much better ways to obtain random rows. There are two basic approaches to take - we can pick randomly from a range of values, or we can store a random number inside the table itself.


Range of Values

Let's keep using our mydata table, which has a primary key of id. If we know enough information about a column in the database, we can use that to get random rows by picking random values of that column. In this example, all we need to know is the minimum and maximum value of the id column, and we can have an external program generate a random number between the minimum and the maximum and put it into a query:

SELECT * FROM mydata WHERE id = '4012341';

We can also have the database help us choose the number, if we know there are a maximum of 10 million ids:

SELECT * FROM mydata
WHERE id =
(SELECT (RANDOM() * 10000000)::int OFFSET 0)
LIMIT 1;

Both run in under a second, as the primary key column id is indexed. (The use of OFFSET 0 is needed in the second query to force the planner to evaluate RANDOM() only one time).
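
For contrast, putting RANDOM() directly into the WHERE clause (no subselect, no OFFSET 0) lets the volatile function be re-evaluated for each row, so a query like the one below can match zero rows or several different ones; that is the behavior the subselect form is guarding against (my illustration, not from the original text):

SELECT * FROM mydata
WHERE id = (RANDOM() * 10000000)::int
LIMIT 1;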

There are a few problems with this approach, however. One obvious one is that the query above may fail if there are any "holes" in the range of numbers from min to max. Storing information about where the holes are is probably impractical, but we can get around it by finding the value that is closest to the random number we picked, like this:

SELECT * FROM mydata
WHERE id >= '4012341'
LIMIT 1;

While that query addresses the problem of holes, it has two additional problems: it does not guarantee that the same row is returned each time, and it sometimes runs very, very slowly. Running EXPLAIN shows us the reason for the speed difference:

QUERY PLAN
----------------------------------------
Limit (cost=0.00..0.02 rows=1)
-> Seq Scan on mydata
(cost=0.00..223040.00 rows=9996117)
Filter: (id >= 4012341::bigint)

The index is not being used. As a good rule of thumb, never use a LIMIT without an ORDER BY clause. Let's add one in, which will solve both of our problems. The index will be used, and the results will be predictable:

SELECT * FROM mydata
WHERE id >= '4012341'
ORDER BY id LIMIT 1;

This strategy of using ">= (value) ORDER BY (column) LIMIT 1" is one we will use a lot from this point forward.

Another problem is that we are not guaranteed to get the number of rows that we want. For example:

SELECT * FROM mydata
WHERE id >= '9999999'
ORDER BY id LIMIT 100;

This will only return 2 rows since our sample data has a maximum id of ten million. There are two ways around this problem: you can re-run the query with a new random number until you get the number of random rows you need, or you can adjust your random number (or your table) to make sure that you always have at least that many. For example, if your data has 100 rows and you want to pull 10 of them at random, then make sure you never ask for an id of more than 90. Alternatively, you could "pad" your table with 10 extra rows, and then safely use the numbers 1-100.
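
Here is a sketch of that second workaround, assuming the ids run densely from 1 to ten million as they do in this example table: cap the random starting point so that at least 100 rows always remain above it.

SELECT * FROM mydata
WHERE id >= (SELECT (RANDOM() * (10000000 - 100))::int OFFSET 0)
ORDER BY id LIMIT 100;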

There is one final problem: picking our own random values from a range will not produce truly random rows unless the data is perfectly uniformly distributed. Consider a table with two rows and values of 1 and 10. Our strategy above would cause the 10 value to appear more often than the 1 value, which is not the randomness we are looking for. In addition to the holes, if the values are not unique, then the distribution may not be uniform, and we once again lack true randomness. We need a way to combine the true randomness of ORDER BY RANDOM() with the speed of a Range of Values.


Random Column

The final and best solution is to create a new column in your database that stores random values. The table can then be sorted by this column to get back random rows in a fast, repeatable, and truly random way. What we are basically doing is emulating the effect of ORDER BY RANDOM(), which as you recall assigns a random value to each row in the database. Let's apply it to our test table.

First, we create a new column to hold the random values. Since RANDOM() returns the type "double precision", we create a new column of that type. We'll name it myrand:

ALTER TABLE mydata ADD myrand DOUBLE PRECISION;

Now we can populate that column with a random number from 0.0 to 1.0:

BEGIN;
UPDATE mydata SET myrand = RANDOM();
COMMIT;

This does take a non-trivial amount of time to run (372 seconds to populate all ten million rows), but it is a one-time cost. Since we'll be hitting this column to generate our random rows, we should put an index on it as well. But before we do that, we also have to ensure that our results are reproducible. In other words, the same query should return the exact same rows. Something like this is not guaranteed to get the same 10 rows each time it is run:

SELECT * FROM mydata
ORDER BY myrand LIMIT 10;

Why? Because there is no unique constraint on the myrand column, and it is possible (especially with our 10 million row example table) that two rows contain the same myrand value. As another rule of thumb, always make sure your ORDER BY clause specifies a unique set of rows. Our primary key, "id", is unique, so it makes a good tie-breaker for when two myrand values happen to collide. Our new query becomes:

SELECT * FROM mydata
ORDER BY myrand, id LIMIT 10;

Now we can create the index, on both of the columns in that ORDER BY. For good measure, we'll analyze the table as well:

CREATE INDEX myrand_randomhelp ON mydata(myrand,id);

ANALYZE VERBOSE mydata;

Before the index was in place, the query to grab a random row took over 180 seconds. Now that it is in place, the query runs in less than 1 second (126 milliseconds).
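
A quick EXPLAIN is an easy way to confirm that the planner is really using the new index (a sketch; the exact plan will depend on your version and settings):

EXPLAIN SELECT * FROM mydata ORDER BY myrand, id LIMIT 10;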

So that's our basic "Random Column" strategy: assign each row a random number, make sure it is linked to another unique column, and make an index across both of them. This allows us to get fast, repeatable, and truly random rows. You can also ensure that new rows get a new random value automatically added to them by doing this:

ALTER TABLE mydata
  ALTER myrand SET DEFAULT RANDOM(),
  ALTER myrand SET NOT NULL;

If you don't care about repeatability, and simply want to grab a random row, you can do this:

SELECT * FROM mydata
WHERE myrand >= (SELECT RANDOM() OFFSET 0)
ORDER BY myrand ASC LIMIT 1;

The ORDER BY clause is needed to ensure that our index is used. Note that Postgres has no problem using the index we previously created on both columns, because myrand is the first column in that index. The above query is basically what Wikipedia uses when you click on the "Random Page" link.

Another advantage to using a Random Column is that not only are the results reproducible, they are also resettable. Let's say that you are using this method to pull 100 random rows at a time out of a table with 1000 rows for statistical analysis. You also want to make sure that you never use the same row more than once, so you use an OFFSET:

SELECT * FROM mydata ORDER BY myrand, id
LIMIT 100 OFFSET 1;
SELECT * FROM mydata ORDER BY myrand, id
LIMIT 100 OFFSET 100;
SELECT * FROM mydata ORDER BY myrand, id
LIMIT 100 OFFSET 200;
etc...

(Note: although OFFSET starts at 0, we skip the first row, as OFFSET 100 is easier to read than OFFSET 99.) At some point, you want to run some more tests, but you don't want the same grouping as before. In other words, you want to reshuffle the deck of cards. Simple enough: just assign new values to the 'myrand' column:

UPDATE mydata SET myrand = RANDOM();

The only drawback to the whole Random Column strategy is the time and effort it takes to set it up, and the additional disk space needed to handle the extra column. Because of the extra column, INSERTS and UPDATES may run slightly slower.

Here's a summary of the three strategies to grab some random rows from a table:

Technique: ORDER BY RANDOM()
  Pros:
  • Completely random
  • No table changes needed
  • Easy to append to existing queries
  Cons:
  • Very slow: does not scale
  • Non-repeatable result sets
  When to use:
  • Quick ad-hoc queries and very small tables that will not grow large

Technique: Range of Values
  Pros:
  • Very fast
  • Uses existing columns
  Cons:
  • Not truly random
  • Must track minimum and maximum
  • Hard to get desired number of rows
  • Inserts can affect results
  When to use:
  • When data is very well-defined and stable (even then, be cautious)

Technique: Random Column
  Pros:
  • Very fast
  • Truly random
  • Reproducible
  • Automatically maintained
  • Resettable
  • Column data type matches random() for easy use
  Cons:
  • Takes up disk space
  • Initial setup cost
  • Inserts may be slowed by the extra column and the default value
  When to use:
  • Whenever possible

    Wednesday, December 27, 2006

    index of max value in a list PYTHON

    The max function returns the maximum value in the list rather than the index associated with that value.

    How do I return the index?


    l.index(max(l))

    l.index(max(l)) will give you the index of the first occurrence of the
    maximum.

    m = max(l)
    [ i for i,v in enumerate(l) if v==m ]

    will give you a list of all indices where the max occurs. (Putting the
    'max(l)' outside the list comprehension prevents it from being evaluated
    for each loop element.)

    Wednesday, December 20, 2006

    autoVACUUM in postgres

    stats_start_collector = on
    stats_row_level = on
    autovacuum = on
    autovacuum_naptime = 120

    in postgresql.conf, then kill -1 (SIGHUP) the postgres pid
    (you can get the pid from the postmaster.pid file, e.g. cat postmaster.pid)

    How to reload postgres config files without restarting db on the LIVE

    you can change it on the live service and kill -1 to make it take effect
    we should not kill postgres right
    yea, if you edit postgresql.conf then most settings take effect without stopping and starting (though a few, like max_connections, still need a full restart).
    database]$ cat postmaster.pid
    32623
    /work/database
    so pid 32623 is your postmaster pid
    so if you kill -1 32623
    after you have done the changes to postgresql.conf file
    that will make postmaster re-read the config file without shutting it down
    i.e. existing connections dont get dropped.
    ur sure it wont kill the db like last time
    and how to check if it has read the new postgresql.conf
    i guess if you tail -f the log file.
    the active postgresql log file (in pg_log folder) will say "reloading.."

    received SIGHUP, reloading configuration files
    if you had a really bad typo, then the database would either ignore the changes or safely shut itself down.
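
    One way to confirm the reload actually picked up the new settings (my own quick check, not from the chat):

    SELECT name, setting FROM pg_settings WHERE name LIKE 'autovacuum%';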

    Tuesday, December 19, 2006

    Fixing dbus problems with gaim

    rpm -ql dbus | grep lib
    /lib/dbus-1.0
    /lib/libdbus-1.so.3
    /lib/libdbus-1.so.3.2.0
    /var/lib/dbus
    cd /lib/
    sudo ln -s libdbus-1.so.3.2.0 libdbus-1.so.2
    sudo ldconfig
    gaim

    Sunday, December 17, 2006

    vi tab retab --- space to tab -- tab to space

    :set tabstop=4 " Force tabs to be displayed/expanded to 4 spaces (instead of default 8).
    :set softtabstop=4 " Make Vim treat key as 4 spaces, but respect hard Tabs.
    : " I don't think this one will do what you want.
    :set expandtab " Turn Tab keypresses into spaces. Sounds like this is happening to you.
    " You can still insert real Tabs as [Ctrl]-V [Tab].
    :set noexpandtab " Leave Tab keys as real tabs (ASCII 9 character).
    :1,$retab! " Convert all tabs to space or ASCII-9 (per "expandtab"),
    " on lines 1_to_end-of-file.
    :set shiftwidth=4 " When auto-indenting, indent by this much.
    " (Use spaces/tabs per "expandtab".)
    :help tabstop " Find out more about this stuff.
    :help vimrc " Find out more about .vimrc/_vimrc :-)

    Thursday, December 14, 2006

    Make python use UTF8 instead of ascii

    cd /usr/lib/python2.3/site-packages/
    vi sitecustomize.py

    (add the following lines to the file, save it)
    import sys, codecs
    sys.setdefaultencoding('utf-8')

    from
    http://plone.org/products/cmfcontentpanels/issues/1

    Monday, December 11, 2006

    OSCON 2005 PostgreSQL Presentations

    The OSCON 2005 PostgreSQL Presentations are up, but a bunch of them are in OpenOffice format. I went through the pain of installing OpenOffice to convert them to PDF. For completeness, I have also included the presentations that were already available as PDF.

    Chris Browne:

    Joe Conway:

    Lance Obermeyer:

    Bruce Momjian:

    Aaron Thul:

    Robert Treat:

    Sunday, December 10, 2006

    How I replace out divs in inserts to keep the layout

    i.Description = re.sub(r'</?div[^>]*?>', '', i.Description)  # strip opening and closing div tags

    Tuesday, December 05, 2006

    When div when span

    DIV is an arbitrary *block* element. It can contain other block
    elements (including other DIVs). P is a block element for
    paragraphs. P cannot contain other block elements; it cannot
    contain other Ps; it cannot contain DIVs. DIV does not create a
    new P. But DIV will close any P that is open.
    > I tend to use SPAN because it does not generate any line break.

    SPAN is an *inline* element. It cannot contain a block element.
    SPAN is closed whenever the block containing it is closed.

    Sunday, December 03, 2006

    Faster Inserts with PostgreSQL

    As I mentioned yesterday, I'm working on optimizing lots of inserts into a database, and I need solutions for different DB servers. Today I have been working with PostgreSQL. Although the PostgreSQL COPY command is recommended, I can't seem to get it to work with ColdFusion and cfquery. I've tried lots of different ways of doing it.
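
    For reference, the COPY form that is usually recommended looks roughly like this (just a sketch; the file path and CSV option are made up for illustration, and this is the piece I couldn't get working through cfquery):

    COPY tableName (intColumn, charColumn)
    FROM '/tmp/inserts.csv' WITH CSV;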

    Fast Insert Solution on PostgreSQL:

    So far the fastest solution I have come up with is using PREPARE to create a temporary prepared statement (yes, I'm aware of cfqueryparam, and this method inserts with the highest speed). So here's how you do it:

    PREPARE preparedInsert (int, varchar) AS
    INSERT INTO tableName (intColumn, charColumn)
    VALUES ($1, $2);
    EXECUTE preparedInsert (1,'a');
    EXECUTE preparedInsert (2,'b');
    EXECUTE preparedInsert (3,'c');
    DEALLOCATE preparedInsert;

    You're basically creating a prepared statement that allows you to pass variables to your insert. Inside the first set of parentheses you list the types of your variables; the variables are then referred to as $1, $2, etc. inside the statement.

    Next you can EXECUTE the statement as many times as you need to (this can all be done inside one SQL statement, inside one cfquery tag).

    Finally, when I'm done, I DEALLOCATE the prepared statement; otherwise, if you try to PREPARE a statement named preparedInsert again during the same connection session, you will get an error.
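
    One more thing that usually helps on top of this (my addition, not part of the original tip): wrap the whole batch in a single transaction, so each EXECUTE doesn't pay for its own commit.

    BEGIN;
    PREPARE preparedInsert (int, varchar) AS
    INSERT INTO tableName (intColumn, charColumn)
    VALUES ($1, $2);
    EXECUTE preparedInsert (1,'a');
    EXECUTE preparedInsert (2,'b');
    EXECUTE preparedInsert (3,'c');
    DEALLOCATE preparedInsert;
    COMMIT;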