Friday, March 04, 2011

Stack 2.0 - MongoDB, node.js, Redis, iOS

Node.js seems to be a viable backend for scalable chat and game services.


MongoDB seems to be the best tool for storing data for any backend game service or mobile app.

Compared to what was available in 2005 - Postgres/MySQL and Twisted/ejabberd - this is a quantum leap, and 10gen has made MongoDB the de facto backend, displacing Cassandra and many other stores.

I wonder where Redis fits into this ecosystem; I presume it would be good as a caching layer.
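For what it's worth, here is a minimal sketch of Redis as a cache layer, assuming redis-py and a local Redis server (load_profile_from_mongo is a made-up placeholder):

import json
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

def cached(key, ttl, compute):
    """Return the cached JSON value for key, computing and storing it on a miss."""
    val = r.get(key)
    if val is not None:
        return json.loads(val)
    result = compute()
    r.set(key, json.dumps(result))
    r.expire(key, ttl)  # drop the cache entry after ttl seconds
    return result

# e.g. profile = cached('user:42', 300, lambda: load_profile_from_mongo(42))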

Jinja2 is good for templating, and working with it is even nicer with the Jinja2 snippets in TextMate.
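For reference, basic Jinja2 usage from Python looks like this (standard jinja2 API; the template text is just an example):

from jinja2 import Template

tmpl = Template("Hello {{ name }}! You have {{ count }} new messages.")
print(tmpl.render(name='sam', count=3))
# -> Hello sam! You have 3 new messages.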

iOS is the winner among the smartphone OSes.

Friday, March 19, 2010

Apache mod_fcgid is the way to go for webpy

1. It automatically spawns more processes based on traffic. lighttpd, in contrast, spawns the maximum number of processes up front even when there is no traffic, which is not useful if your max-procs is 100.


2. You usually need to set PYTHONPATH, or append to sys.path inside code.py manually:
import os,sys
sys.path.append( os.path.dirname(os.path.abspath(__file__)))

3. This is how you hook app.run for webpy:

if __name__ == '__main__':
    if apache:
        web.wsgi.runwsgi = lambda func, addr=None: web.wsgi.runfcgi(func, addr)
        app.run(session_mw)
    else:
        app.run(session_mw)


Wednesday, February 25, 2009

Python: Accessing the caller's locals() from a callee method using sys._getframe

This is my first attempt at memcaching an HTML page rendered with Cheetah.
Since the Cheetah render needs locals(), I use getCallerInfo() to grab the caller's locals() and send the rendered page to memcached.
Let me know if there is a better way to do this.
Notice getCallerInfo below.

utils.py

import sys
import inspect
# web, clogger, log_time_func, memcachethis and globaldb.mc come from the rest of the app

@log_time_func
def renderpage(key, htmlfile, deleteafter=3600):
    from globaldb import mc
    try:
        page = mc.get(key)
    except:
        page = None
        clogger.info('except error mc.get ' + key)
    if not page:
        clogger.info(key + ' rendering cheetah page')
        terms = getCallerInfo(1)
        #print terms
        page = str(web.render(htmlfile, asTemplate=True, terms=terms))
        try:
            mc.set(key, page, deleteafter)
        except:
            clogger.info('except error mc.set ' + key)
    return page

@log_time_func
@memcachethis
def mcrenderpage(key, htmlfile, deleteafter=3600):
    terms = getCallerInfo(2)
    #print terms
    return str(web.render(htmlfile, asTemplate=True, terms=terms))

def getCallerInfo(decorators=0):
    '''Return the caller's locals using the caller's frame.
    Optionally pass the number of decorators to skip.
    From "Dig deep into python internals": http://www.devx.com/opensource/Article/31593/1954'''
    f = sys._getframe(2 + decorators)
    args = inspect.getargvalues(f)
    return args[3]
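The memcachethis decorator used above isn't shown; here is a minimal sketch of what such a decorator might look like (a guess, not the actual code - it caches the function's result in memcached under the key argument):

def memcachethis(func):
    def wrapper(key, *args, **kwargs):
        from globaldb import mc
        try:
            cached = mc.get(key)
        except:
            cached = None
        if cached is not None:
            return cached
        result = func(key, *args, **kwargs)
        # expiry is the deleteafter argument, as in renderpage above
        expiry = kwargs.get('deleteafter', args[1] if len(args) > 1 else 3600)
        try:
            mc.set(key, result, expiry)
        except:
            pass
        return result
    return wrapper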

Usage
key=facebookstuff.APP_NAME+'newstart'+str(uid)
return utils.renderpage(key, 'pick.html')

Saturday, January 10, 2009

Facebook-style Notifications in CouchDB

Goal: To prepare Facebook-style notifications using CouchDB.

Schema:

class Notification(Document):
    unread = BooleanField(default=True)
    activity = TextField()
    appname = TextField()
    image = TextField()
    type = TextField(default='notification')
    fromuser = TextField()
    touser = TextField()
    link = TextField()
    message = TextField()
    created = DateTimeField(default=datetime.datetime.now())
    time = TimeField(default=datetime.datetime.now())
    date = DateField(default=datetime.date.today())

def savenotifications(fromuser, tousers, link, message, activity, image, appname, *args, **kwargs):
    for touser in tousers:
        if fromuser == touser:
            continue
        notidoc = Notification(fromuser=fromuser, touser=touser, link=link, message=message,
                               activity=activity, image=image, appname=appname)
        notidoc.store(fbcouchdb)
    for touser in tousers:
        # query the unread view right away to force it to regenerate while the write is fresh
        r = fbcouchdb.view('_view/pop/unread', key=str(touser), count=1)
        for i in r:
            i

I even try to keep the views warm: whenever a new notification is inserted for a user, I pre-call the unread view for that user ahead of time in order to force view regeneration.
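The permanent view _view/pop/unread referenced above is not shown; a design document along these lines could be saved through couchdb-python (a sketch only - the database name and the map function are assumptions based on the schema above):

from couchdb import Server

fbcouchdb = Server()['fbapp']   # hypothetical database name

fbcouchdb['_design/pop'] = {
    'views': {
        'unread': {
            'map': """
                function(doc) {
                    if (doc.type == 'notification' && doc.unread)
                        emit(doc.touser, doc);
                }
            """
        }
    }
}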


def getnotifications(self, all=False):
    try:
        from couchmodel import fbcouchdb
        if not all:
            return [(x.id, x.value) for x in
                    fbcouchdb.view('_view/pop/unread', key=str(self.uid), count=5)]
        else:
            return [(x.id, x.value) for x in
                    fbcouchdb.view('_view/pop/notifications', key=str(self.uid))]
    except:
        return []


Problem: whenever I call getnotifications it is really slow, and the page takes forever to open. Is there any way to make CouchDB usable for this? The situation is that notifications are constantly being inserted as things happen to users in real time. Is this not a good use case for CouchDB?

Wednesday, January 07, 2009

Using CouchDB to build an email messaging solution with threaded inline replies

CouchDB-based messaging system.
The goal is to build an anonymous inbox with inline threaded replies, plus counts of the number of items in the inbox and sent folders. It works well; the only hard part is that the views are pretty slow the first time they are built.

Schema

class Anonmessage(Document):
    subject = TextField()
    read = BooleanField()
    fromread = BooleanField()
    toread = BooleanField()
    msg = TextField()
    background = TextField()
    type = TextField()
    reply = TextField()
    fromuid = LongField()
    replycount = IntegerField()
    touid = LongField()
    created = DateTimeField(default=datetime.datetime.now())
    time = TimeField(default=datetime.datetime.now())
    date = DateField(default=datetime.date.today())


Permanent view for inbox:
function(d) {
    if (d.type != 'reply') {
        emit(d.touid, d);
        if (d.touid != d.fromuid) {
            if (d.replycount > 0)
                emit(d.fromuid, d);
        }
    }
}

Count of number of messages in USER inbox

map:function(d){
if(d.type != 'reply'){
emit(d.touid, 1);
if(d.touid != d.fromuid){
if (d.replycount>0)
emit(d.fromuid, 1);
}
}
}

reduce:function(keys, values) { return sum(values)}

Count of Sent items
map
:function(doc) { emit(doc.fromuid, 1) }
reduce:function(keys, values) { return sum(values)}

View to get replies
map:function(d) { if (d.type=="reply" ) emit(d.reply,d); }

Usage from python
inbox_messages = [ (x.id, x.value) for x in anondb.view('_view/truthbox/inbox',key=z.uid) ]
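Reading the inbox count from Python looks much the same; the view name inbox_count below is a placeholder, since the count view isn't named above:

rows = anondb.view('_view/truthbox/inbox_count', key=z.uid)
counts = [x.value for x in rows]
inbox_count = counts[0] if counts else 0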

Saturday, January 03, 2009

Ten Reasons why CouchDB is better than MySQL

  1. No schema - a schemaless DB means you can develop at the speed of thought; you don't need a DB migration every time you add a column
  2. Everything is over plain HTTP - simple GET, POST, PUT and DELETE requests - which means it works with Varnish/Squid out of the box (I prefer Varnish because it is clean and simple)
  3. Attachments - you can store file attachments, e.g. your greeting card to grandma can carry images, music and Flash
  4. Map/Reduce - no more SQL queries; use scalable map-reduce based views. Views, once saved, are lightning fast
  5. Sexy Futon JavaScript interface - comes with a cool JS UI for displaying and editing data
  6. JavaScript view server using Mozilla SpiderMonkey to construct views - no need for PHP
  7. Zero-config replication - work from home with no internet connection
  8. A Python couchdb library
  9. Bulk updates and deletes - you can store 100,000 docs in one POST request (see the sketch after this list)
  10. Each CouchDB document is just a simple JSON-compatible doc; no cruft, just simple
  11. (Bonus) Written in Erlang, which means it scales on multicore, multiprocessor machines
  12. (Extra Bonus) Low memory requirements - takes 150MB compared to the 8GB taken by MySQL for a similar DB setup
  13. Similar to ZODB, but much cleaner and more intuitive
  14. Extremely friendly community and developers - Damien, JChris, Paul Davis, Noah Slater, Chris Anderson, Jan
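A rough sketch of point 9 using the python couchdb library - Database.update posts many documents in a single bulk request (the database name and document contents below are made up):

import couchdb

db = couchdb.Server()['cards']          # hypothetical database
docs = [{'type': 'greeting', 'n': i} for i in range(100000)]
results = db.update(docs)               # sent to CouchDB as one bulk POST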

Thursday, October 02, 2008

The best torrent search engine

I accidentally discovered this new torrent search engine. It is very clean, Google-like and really fast. When I say blazingly fast, I mean incredibly high speeds - about 0.02 seconds per search.



Do give it a spin

Recommendations : Highly Recommended

Sunday, September 16, 2007

python datetime guide

3. Dates and Times
Introduction

#-----------------------------
#introduction
# There are three common ways of manipulating dates in Python
# mxDateTime - a popular third-party module (not discussed here)
# time - a fairly low-level standard library module
# datetime - a new library module for Python 2.3 and used for most of these samples
# (I will use full names to show which module they are in, but you can also use
# from datetime import datetime, timedelta and so on for convenience)

import time
import datetime

print "Today is day", time.localtime()[7], "of the current year"
# Today is day 218 of the current year

today = datetime.date.today()
print "Today is day", today.timetuple()[7], "of ", today.year
# Today is day 218 of 2003

print "Today is day", today.strftime("%j"), "of the current year"
# Today is day 218 of the current year


Finding Today's Date

#-----------------------------
# Finding todays date

today = datetime.date.today()
print "The date is", today
#=> The date is 2003-08-06

# the function strftime() (string-format time) produces nice formatting
# All codes are detailed at http://www.python.org/doc/current/lib/module-time.html
print today.strftime("four-digit year: %Y, two-digit year: %y, month: %m, day: %d")
#=> four-digit year: 2003, two-digit year: 03, month: 08, day: 06

Converting DMYHMS to Epoch Seconds

#-----------------------------
# Converting DMYHMS to Epoch Seconds
# To work with Epoch Seconds, you need to use the time module

# For the local timezone
t = datetime.datetime.now()
print "Epoch Seconds:", time.mktime(t.timetuple())
#=> Epoch Seconds: 1060199000.0

# For UTC
t = datetime.datetime.utcnow()
print "Epoch Seconds:", time.mktime(t.timetuple())
#=> Epoch Seconds: 1060195503.0
# (strictly, time.mktime interprets the tuple as local time; for a true UTC
#  conversion use calendar.timegm(t.timetuple()) instead)

Converting Epoch Seconds to DMYHMS

#-----------------------------
# Converting Epoch Seconds to DMYHMS

now = datetime.datetime.fromtimestamp(EpochSeconds)
#or use datetime.datetime.utcfromtimestamp()
print now
#=> datetime.datetime(2003, 8, 6, 20, 43, 20)
print now.ctime()
#=> Wed Aug 6 20:43:20 2003

# or with the time module
oldtimetuple = time.localtime(EpochSeconds)
# oldtimetuple contains (year, month, day, hour, minute, second, weekday, yearday, daylightSavingAdjustment)
print oldtimetuple
#=> (2003, 8, 6, 20, 43, 20, 2, 218, 1)

Adding to or Subtracting from a Date

#-----------------------------
# Adding to or Subtracting from a Date
# Use the rather nice datetime.timedelta objects

now = datetime.date(2003, 8, 6)
difference1 = datetime.timedelta(days=1)
difference2 = datetime.timedelta(weeks=-2)

print "One day in the future is:", now + difference1
#=> One day in the future is: 2003-08-07

print "Two weeks in the past is:", now + difference2
#=> Two weeks in the past is: 2003-07-23

print datetime.date(2003, 8, 6) - datetime.date(2000, 8, 6)
#=> 1095 days, 0:00:00

#-----------------------------
birthtime = datetime.datetime(1973, 01, 18, 3, 45, 50) # 1973-01-18 03:45:50

interval = datetime.timedelta(seconds=5, minutes=17, hours=2, days=55)
then = birthtime + interval

print "Then is", then.ctime()
#=> Then is Wed Mar 14 06:02:55 1973

print "Then is", then.strftime("%A %B %d %I:%M:%S %p %Y")
#=> Then is Wednesday March 14 06:02:55 AM 1973

#-----------------------------
when = datetime.datetime(1973, 1, 18) + datetime.timedelta(days=55)
print "Nat was 55 days old on:", when.strftime("%m/%d/%Y").lstrip("0")
#=> Nat was 55 days old on: 3/14/1973

Difference of Two Dates

#-----------------------------
# Dates produce timedeltas when subtracted.

diff = date2 - date1
diff = datetime.date(year1, month1, day1) - datetime.date(year2, month2, day2)
#-----------------------------

bree = datetime.datetime(1981, 6, 16, 4, 35, 25)
nat = datetime.datetime(1973, 1, 18, 3, 45, 50)

difference = bree - nat
print "There were", difference, "minutes between Nat and Bree"
#=> There were 3071 days, 0:49:35 between Nat and Bree

weeks, days = divmod(difference.days, 7)

minutes, seconds = divmod(difference.seconds, 60)
hours, minutes = divmod(minutes, 60)

print "%d weeks, %d days, %d:%d:%d" % (weeks, days, hours, minutes, seconds)
#=> 438 weeks, 5 days, 0:49:35

#-----------------------------
print "There were", difference.days, "days between Bree and Nat."
#=> There were 3071 days between bree and nat

Day in a Week/Month/Year or Week Number

#-----------------------------
# Day in a Week/Month/Year or Week Number

when = datetime.date(1981, 6, 16)

print "16/6/1981 was:"
print when.strftime("Day %w of the week (a %A). Day %d of the month (%B).")
print when.strftime("Day %j of the year (%Y), in week %W of the year.")

#=> 16/6/1981 was:
#=> Day 2 of the week (a Tuesday). Day 16 of the month (June).
#=> Day 167 of the year (1981), in week 24 of the year.

Parsing Dates and Times from Strings

#-----------------------------
# Parsing Dates and Times from Strings

time.strptime("Tue Jun 16 20:18:03 1981")
# (1981, 6, 16, 20, 18, 3, 1, 167, -1)

time.strptime("16/6/1981", "%d/%m/%Y")
# (1981, 6, 16, 0, 0, 0, 1, 167, -1)
# strptime() can use any of the formatting codes from time.strftime()

# The easiest way to convert this to a datetime seems to be;
now = datetime.datetime(*time.strptime("16/6/1981", "%d/%m/%Y")[0:5])
# the '*' operator unpacks the tuple, producing the argument list.
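
# On Python 2.5 and later, datetime.datetime.strptime does the same thing
# in one step (no tuple slicing needed):
now = datetime.datetime.strptime("16/6/1981", "%d/%m/%Y")
#=> datetime.datetime(1981, 6, 16, 0, 0)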

Printing a Date

#-----------------------------
# Printing a Date
# Use datetime.strftime() - see helpfiles in distro or at python.org

print datetime.datetime.now().strftime("The date is %A (%a) %d/%m/%Y")
#=> The date is Friday (Fri) 08/08/2003

High-Resolution Timers

#-----------------------------
# High Resolution Timers

t1 = time.clock()
# Do Stuff Here
t2 = time.clock()
print t2 - t1

# 2.27236813618
# Accuracy will depend on platform and OS,
# but time.clock() uses the most accurate timer it can

time.clock(); time.clock()
# 174485.51365466841
# 174485.55702610247

#-----------------------------
# Also useful;
import timeit
code = '[x for x in range(10) if x % 2 == 0]'
eval(code)
# [0, 2, 4, 6, 8]

t = timeit.Timer(code)
print "10,000 repeats of that code takes:", t.timeit(10000), "seconds"
print "1,000,000 repeats of that code takes:", t.timeit(), "seconds"

# 10,000 repeats of that code takes: 0.128238644856 seconds
# 1,000,000 repeats of that code takes: 12.5396490336 seconds

#-----------------------------
import timeit
code = 'import random; l = random.sample(xrange(10000000), 1000); l.sort()'
t = timeit.Timer(code)

print "Create a list of a thousand random numbers. Sort the list. Repeated a thousand times."
print "Average Time:", t.timeit(1000) / 1000
# Time taken: 5.24391507859

Short Sleeps

#-----------------------------
# Short Sleeps

seconds = 3.1
time.sleep(seconds)
print "boo"

Program: hopdelta

#-----------------------------
# Program HopDelta
# Save a raw email to disk and run "python hopdelta.py FILE"
# and it will process the headers and show the time taken
# for each server hop (nb: if server times are wrong, negative dates
# might appear in the output).

import datetime, email, email.Utils
import os, sys, time

def extract_date(hop):
    # According to RFC822, the date will be prefixed with
    # a semi-colon, and is the last part of a received
    # header.
    date_string = hop[hop.find(';')+2:]
    date_string = date_string.strip()
    time_tuple = email.Utils.parsedate(date_string)

    # convert time_tuple to datetime
    EpochSeconds = time.mktime(time_tuple)
    dt = datetime.datetime.fromtimestamp(EpochSeconds)
    return dt

def process(filename):
    # Main email file processing
    # read the headers and process them
    f = file(filename, 'rb')
    msg = email.message_from_file(f)

    hops = msg.get_all('received')

    # in reverse order, get the server(s) and date/time involved
    hops.reverse()
    results = []
    for hop in hops:
        hop = hop.lower()

        if hop.startswith('by'):  # 'Received: by' line
            sender = "start"
            receiver = hop[3:hop.find(' ', 3)]
            date = extract_date(hop)

        else:  # 'Received: from' line
            sender = hop[5:hop.find(' ', 5)]
            by = hop.find('by ') + 3
            receiver = hop[by:hop.find(' ', by)]
            date = extract_date(hop)

        results.append((sender, receiver, date))
    output(results)

def output(results):
    print "Sender, Recipient, Time, Delta"
    print
    previous_dt = delta = 0
    for (sender, receiver, date) in results:
        if previous_dt:
            delta = date - previous_dt

        print "%s, %s, %s, %s" % (sender,
                                  receiver,
                                  date.strftime("%Y/%d/%m %H:%M:%S"),
                                  delta)
        print
        previous_dt = date

def main():
    # Perform some basic argument checking
    if len(sys.argv) != 2:
        print "Usage: mailhop.py FILENAME"

    else:
        filename = sys.argv[1]
        if os.path.isfile(filename):
            process(filename)
        else:
            print filename, "doesn't seem to be a valid file."

if __name__ == '__main__':
    main()

Sunday, July 15, 2007

svn deps

Cheetah
Psycopg2
PIL
mx egenix
sudo python setup.py build --skip install
aspell.so

Sunday, July 01, 2007

How to update to Fedora 7

This is how I updated my FC6 to F7:

wget ftp://download.fedora.redhat.com/pub/fedora/linux/releases/7/Fedora/i386/os/Fedora/fedora-release-7-3.noarch.rpm
wget ftp://download.fedora.redhat.com/pub/fedora/linux/releases/7/Fedora/i386/os/Fedora/fedora-release-notes-7.0.0-1.noarch.rpm
rpm -U fedora-release-7-3.noarch.rpm fedora-release-notes-7.0.0-1.noarch.rpm
yum update

In theory that’s the only thing you need to do, in practice it’s not that easy. I had some dependency issues with pidgin and liferea so I just removed them:

rpm -e pidgin liferea

Then yum complained about libexif being “not signed”. I tried editing /etc/yum.conf and setting gpgcheck=0; that didn't work, yum was still checking it. So I had to manually set gpgcheck=0 in all the repos in /etc/yum.repos.d/. Then it worked.

That’s it!

Wednesday, June 27, 2007

Adding a fast random row to a table in postgresql

ALTER TABLE posts ADD myrand DOUBLE PRECISION;
UPDATE posts SET myrand = RANDOM();
CREATE INDEX myrand_posts ON posts(myrand, id);
ANALYZE VERBOSE posts;
ALTER TABLE posts ALTER myrand SET DEFAULT RANDOM();
ALTER TABLE posts ALTER myrand SET NOT NULL;
--SELECT * FROM posts WHERE myrand >= (SELECT RANDOM() OFFSET 0) ORDER BY myrand ASC LIMIT 1;

Tuesday, June 26, 2007

Boolean column in a table in postgres

A partial index only indexes the rows matching its WHERE clause, which keeps the index small when you mostly query for one value of a boolean column:

CREATE INDEX polls_active_idx ON polls(done) WHERE done='t';

Monday, June 18, 2007

how to get auto align css centering ala mybloglog and facebook

body { width:99%;max-width:1024px; margin:auto;text-align:center; }
#wrap { margin: auto; width:85%; text-align:left; }
#main { float:left; width:74%; margin-top:1%; }
#sidebar { float:right;max-width: 25%;width:25%;min-width: 5%; padding:3px; }
#footer { background:#cc9; clear:both; }

Monday, April 09, 2007

Determining who's blocking who in Postgres

If you have databases like I do with lots of concurrent queries, you can sometime run into situations where you issue a query and it just hangs there blocked. Or, more likely somebody or something issues a query and then comes calling when it doesn't seem to be doing anything.

Of course, you have the handy pg_stat_activity and pg_locks views at your disposal, but when it comes to determining exactly which queries are blocking which others and on what table, querying those alone is a tedious way to get the answer. What you really need is a query that sums it all up, in one neat and tidy bundle. Well, my friends here is such a query:

SELECT
bl.procpid as blocked_pid,
bl.usename as user,
bl.current_query as blocked_query,
bl.query_start,
relname as blocked_on ,
lq.procpid as blocking_pid,
lq.usename as user,
lq.current_query as blocking_query,
lq.query_start,
pgl2.mode as lock_type
FROM pg_stat_activity bl, pg_locks pgl1,
pg_stat_activity lq, pg_locks pgl2, pg_class
WHERE bl.procpid = pgl1.pid
AND not pgl1.granted
AND pg_class.oid = pgl1.relation
AND pgl2.relation = pgl1.relation
AND pgl2.granted
AND lq.procpid = pgl2.pid;

In extended mode (\x) psql returns something along these lines for this query:

blocked_pid | 21418
user | sueuser
blocked_query | insert values ('foo', 'bar', 'baz')
into extremely_large_table;
query_start | 2007-02-13 15:14:06.77606-08
blocked_on | extremely_large_table
blocking_pid | 21417
user | joeuser
blocking_query | delete from extremely_large_table;
query_start | 2007-02-13 14:45:34.637675-08
lock_type | AccessExclusiveLock

python spell checker

How to Write a Spelling Corrector
In the past week, two friends (Dean and Bill) independently told me they were amazed at how Google does spelling correction so well and quickly. Type in a search like [speling] and Google comes back in 0.1 seconds or so with Did you mean: spelling. What surprised me is that I thought Dean and Bill, being highly accomplished engineers and mathematicians, would have good intuitions about statistical language processing problems such as spelling correction. But they didn't, and come to think of it, there's no reason they should. I figured they and many others could benefit from an explanation, and so on the plane back from my trip I wrote a toy spelling corrector, which I now share.

Let's get right to it. I figured that in less than a plane flight, and in less than a page of code, I could write a spelling corrector that achieves 80 or 90% accuracy at a rate of at least 10 words per second. And in fact, here, in 20 lines of Python 2.5 code, is the complete spelling corrector:

import re, string, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('Documents/holmes.txt').read()))

def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                              ## deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] +          ## transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in string.lowercase] +  ## alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in string.lowercase])   ## insertion

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    return max(known([word]) or known(edits1(word)) or known_edits2(word) or [word],
               key=lambda w: NWORDS[w])

This defines the function correct, which takes a word as input and returns a likely correction of that word. For example:

>>> correct('speling')
'spelling'
>>> correct('korrecter')
'corrector'

Tuesday, March 20, 2007

mencoder convert to flash

mencoder pasta_maken_311205.avi -o video.flv -of lavf -ovc lavc -oac lavc -lavcopts vcodec=flv:vbitrate=500:autoaspect:mbd=2:mv0:trell:v4mv:cbp:last_pred=3:predia=2:dia=2:precmp=2:cmp=2:subcmp=2:preme=2:turbo:acodec=mp3:abitrate=56 -vf scale=320:240 -srate 22050 -af lavcresample=22050

Monday, March 19, 2007

thread control clean

"""
An example of an idiom for controlling threads

Doug Fort
http://www.dougfort.net
"""

import threading

class TestThread(threading.Thread):
    """
    A sample thread class
    """

    def __init__(self):
        """
        Constructor, setting initial variables
        """
        self._stopevent = threading.Event()
        self._sleepperiod = 1.0

        threading.Thread.__init__(self, name="TestThread")

    def run(self):
        """
        overload of threading.Thread.run()
        main control loop
        """
        print "%s starts" % (self.getName(),)

        count = 0
        while not self._stopevent.isSet():
            count += 1
            print "loop %d" % (count,)
            self._stopevent.wait(self._sleepperiod)

        print "%s ends" % (self.getName(),)

    def join(self, timeout=None):
        """
        Stop the thread
        """
        self._stopevent.set()
        threading.Thread.join(self, timeout)

if __name__ == "__main__":
    testthread = TestThread()
    testthread.start()

    import time
    time.sleep(10.0)

    testthread.join()

Saturday, March 10, 2007

OpenPoker how it works

Writing low-pain massively scalable multiplayer servers

Another oldie. First penned for DevMaster over a year ago. Original article comes with a heated forum discussion!

I have since exited the poker business and removed related articles and links from this site. The Erlang source code is still available, though. I’ll dissect it in detail in a series of articles over the next few weeks.

Let me know if you have any trouble viewing this article.

Introduction

This article describes an alternative approach to building massively scalable online multiplayer systems using my OpenPoker project as an example. OpenPoker is a massively multiplayer poker server with fault-tolerance, load balancing and unlimited scalability built-in. The source code to OpenPoker is available from my site under the GPL and comes in under 10,000 lines of code of which about 1/3 are dedicated to testing.

I prototyped extensively before coming up with the final version of OpenPoker and tried Delphi, Python, C#, C/C++ and Scheme. I also wrote a full-blown poker engine in Common Lisp. While I did spend over 9 months on research and prototyping, the final rewrite only took about 6 weeks of coding. I attribute most of the time and cost savings to choosing Erlang as my platform.

By comparison, it took a team of 4-5 people about 9 months to build the old OpenPoker. The original team also built a Windows poker client, but even if I cut their development time in half to account for this, 4.5 months for a team of 4 is still about 18 man-months, far more than the 1.5 months I ended up with. In today's world of bloated game development budgets such savings are nothing to sneeze at!

What is Erlang

I suggest you browse through the Erlang FAQ before continuing but I'll give you a quick summary here…

Erlang is a functional, dynamically typed language with built-in support for concurrency. It was specifically designed by Ericsson for telecommunications applications such as controlling a switch or converting protocols, and thus is particularly suitable for building distributed, soft real-time concurrent systems.

Applications written in Erlang are often composed of hundreds or thousands of lightweight processes communicating via message passing. Context switching between Erlang processes is typically one or two orders of magnitude cheaper than switching between threads in a C program.

It's easy to write distributed applications in Erlang because its distribution mechanisms are transparent: programs need not be aware that they are distributed.

The Erlang runtime environment is a virtual machine (VM), much like the Java virtual machine. This means that code compiled on one architecture runs anywhere. The runtime system also allows code in a running system to be updated without interrupting the program and the byte code can be compiled to native code when you need that extra boost.

Please head to the Erlang site and check out the excellent resources in the Getting started, Documentation and Examples sections.

Why Erlang

The concurrency model built into Erlang makes it particularly suitable for writing online multiplayer servers.

A massively scalable multiplayer backend in Erlang is built as a "cluster" with different "nodes" dedicated to different tasks. An Erlang node is an instance of the Erlang VM and you can run multiple Erlang nodes/VMs on your desktop, laptop or server. One node per CPU is recommended.

Erlang nodes track all other nodes connected to them. All you need to do to add a new node to the cluster is point it to an existing node. As soon as the two nodes establish contact all other nodes in the cluster become aware of the new node.

Erlang processes send messages to other processes using a process id which encodes information about the node where the process is running. Processes need not be aware of where other processes are located to communicate with them. A bunch of Erlang nodes linked together can be viewed as a grid or supercomputing facility.

Players, NPCs and other entities in massively multiplayer games are best modelled as concurrently running processes but concurrency is notoriously hard to work with. Erlang makes concurrency easy.

Erlang's bit syntax makes it trivial to work with binary data and bests the structure packing/unpacking facilities of Perl and Python. This makes Erlang particularly suitable for handling binary network protocols.

The OpenPoker architecture

Everything in OpenPoker is a process. Players, bots, games, pots, etc. are all processes. For every poker client connected to OpenPoker there's a player "proxy" handling network messages. Depending on whether the player is logged in, some messages are ignored while others are passed to the process handling card game logic.

The card game process is an uber-state machine composed of state machine modules for every stage of the game. This lets me treat card game logic as a Lego constructor and add new card games by putting together the state machine building blocks. Take a look at the start function in cardgame.erl if you want to learn more about my approach.

The card game state machine lets different messages through depending on the game stage. It also uses a separate game process to handle the machinery common to all games such as keeping track of players, pots, limits and so on. When simulating 27,000 poker games on my laptop I found that I had about 136,000 players and close to 800,000 processes in total.

That said, I would like to focus on how Erlang makes it simple to implement scalability, fault tolerance and load balancing using OpenPoker as an example. My approach is not particular to poker or card games. The same approach can be used to quickly put together massively scalable multiplayer backends, do it cheaply and with a minimum amount of pain.

Scalability

I implement scalability and load-balancing by means of a multi-tier architecture. The first tier is represented by gateway nodes. Game server nodes form tier two and Mnesia "master" nodes can be thought of as the third tier.

Mnesia is the Erlang real-time distributed database. The Mnesia FAQ has a good explanation but Mnesia is basically a fast, replicating, in-memory database. There are no objects in Erlang but Mnesia can be thought of as object-oriented as it can store any Erlang data.

There are two types of Mnesia nodes: those that write to disk and those that do not. Regardless of this, all Mnesia nodes keep their data in memory. Mnesia master nodes in OpenPoker are nodes that write to disk. Gateways and game servers pick up their database from Mnesia masters upon startup and are memory-only nodes.

There's a handy set of command-line arguments that you can give to the Erlang VM and interpreter when starting up to tell Mnesia where the master database is located. After the new local Mnesia node establishes contact with the master Mnesia node, the new node becomes part of the master node’s cluster.

Assuming that the master nodes are located on hosts apple and orange, adding a new gateway, game server, etc. node to your OpenPoker cluster is as simple as

erl -mnesia extra_db_nodes \['db@apple','db@orange'\] -s mnesia start

where

-s mnesia start

is equivalent to starting Mnesia from the erlang shell like this

erl -mnesia extra_db_nodes \['db@apple','db@orange'\]
Erlang (BEAM) emulator version 5.4.8 [source] [hipe] [threads:0]

Eshell V5.4.8 (abort with ^G)
1> mnesia:start().
ok

OpenPoker keeps configuration information in Mnesia tables and this information is automatically downloaded by new nodes as soon as Mnesia starts. Zero configuration required!

Fault tolerance

OpenPoker lets me grow as high as I want by adding cheap Linux boxes to my server farm. Put together a couple of racks of 1U servers and you can easily handle 500,000 or even 1,000,000 players online. This would work just as well for a MMORPG as for poker.

I can dedicate some boxes to run gateway nodes and some to be database masters that write database transactions to disk. I can dedicate the rest of my boxes to run my game servers. I can limit game servers to accept a maximum of, say, 5000 simultaneous players so that no more than 5000 players are affected when my game server box crashes.

It's important to note that no information is lost when a game server crashes since all the Mnesia database transactions are replicated in real-time to all other nodes running Mnesia, game server nodes included.

In case of errors some assistance from the game client is required for the player to smoothly reconnect to the OpenPoker cluster. As soon as the poker client notices a network error it should connect to the gateway, receive a new game server address in a hand-off packet and reconnect to the new game server. What happens then is a little tricky as different types of reconnect scenarios need to be handled.

OpenPoker will handle the following reconnect scenarios:

  1. The game server crashed
  2. The client crashed or timed out due to a network error
  3. The player is online on a different connection
  4. The player is online on a different connection and is in a game

The most common scenario will probably be a poker client that disconnected due to a network error. A less likely but still possible scenario is a client reconnecting from one computer while already playing at another.

Each OpenPoker game buffers packets sent to players and every reconnecting poker client will first receive all the game packets since the game started before starting to receive packets as usual. OpenPoker uses TCP connections so I don't need to worry about packet ordering – packets will simply arrive in proper order.

Every poker client connection is represented by two OpenPoker processes: the socket process and the actual player process. A visitor process with restricted functionality is used until the player logs in. Visitors cannot join games, for example. The socket process will be dead after a poker client disconnects while the player process will still be alive.

A player process can notice a dead socket when attempting to forward a game packet and should put itself into auto-play mode or fold the hand. The login code will check for the combination of a dead socket and live player process when reconnecting. The code to determine the condition looks like this:

login({atomic, [Player]}, [_Nick, Pass|_] = Args)
  when is_record(Player, player) ->
    Player1 = Player#player {
                socket = fix_pid(Player#player.socket),
                pid = fix_pid(Player#player.pid)
               },
    Condition = check_player(Player1, [Pass],
                             [
                              fun is_account_disabled/2,
                              fun is_bad_password/2,
                              fun is_player_busy/2,
                              fun is_player_online/2,
                              fun is_client_down/2,
                              fun is_offline/2
                             ]),
    ...

whereas the conditions themselves will be determined by the following code:

is_player_busy(Player, _) ->
    {Online, _} = is_player_online(Player, []),
    Playing = Player#player.game /= none,
    {Online and Playing, player_busy}.

is_player_online(Player, _) ->
    SocketAlive = Player#player.socket /= none,
    PlayerAlive = Player#player.pid /= none,
    {SocketAlive and PlayerAlive, player_online}.

is_client_down(Player, _) ->
    SocketDown = Player#player.socket == none,
    PlayerAlive = Player#player.pid /= none,
    {SocketDown and PlayerAlive, client_down}.

is_offline(Player, _) ->
    SocketDown = Player#player.socket == none,
    PlayerDown = Player#player.pid == none,
    {SocketDown and PlayerDown, player_offline}.

Notice that the first thing the login function does is to fix up dead process ids. This makes processing simple down the road and is accomplished with the following bits of code:

fix_pid(Pid)
  when is_pid(Pid) ->
    case util:is_process_alive(Pid) of
        true ->
            Pid;
        _ ->
            none
    end;

fix_pid(Pid) ->
    Pid.

and

-module(util).

-export([is_process_alive/1]).

is_process_alive(Pid)
  when is_pid(Pid) ->
    rpc:call(node(Pid), erlang, is_process_alive, [Pid]).


A process id in Erlang includes the id of the node where the process is running. is_pid(Pid) tells me if its argument is a process id (pid) but cannot tell me if the process is alive or dead. Erlang’s built-in erlang:is_process_alive(Pid) tells me whether a local process (running on the same node) is dead or alive. There's no variant of is_process_alive for checking remote nodes.

Fortunately, I can use the Erlang rpc facility together with node(pid) to call is_process_alive() on the remote node. In fact, this will work just as well on the local node so the code above functions as a universal distributed process checker.

All that is left to do is to act on the various login conditions. In the simplest case where the player is offline I start a player process, connect the player to the socket and update the player record.

login(Player, player_offline, [Nick, _, Socket]) ->
    {ok, Pid} = player:start(Nick),
    OID = gen_server:call(Pid, 'ID'),
    gen_server:cast(Pid, {'SOCKET', Socket}),
    Player1 = Player#player {
                oid = OID,
                pid = Pid,
                socket = Socket
               },
    {Player1, {ok, Pid}}.

Should the player login information not match I can return an error and increase the number of bad login attempts. If this number exceeds a predefined maximum I disable the account like this:

login(Player, bad_password, _) ->
    N = Player#player.login_errors + 1,
    {atomic, MaxLoginErrors} =
        db:get(clusterconfig, 0, maxlogin_errors),
    if
        N > MaxLoginErrors ->
            Player1 = Player#player {
                        disabled = true
                       },
            {Player1, {error, ?ERRACCOUNTDISABLED}};
        true ->
            Player1 = Player#player {
                        login_errors = N
                       },
            {Player1, {error, ?ERRBADLOGIN}}
    end;

login(Player, account_disabled, _) ->
    {Player, {error, ?ERRACCOUNTDISABLED}};


Logging out the player involves finding the player process id using their Object ID (which is just a number), stopping the player process and updating the player record in the database. This is accomplished by the following bit of code:

logout(OID) ->
    case db:find(player, OID) of
        {atomic, [Player]} ->
            player:stop(Player#player.pid),
            {atomic, ok} = db:set(player, OID,
                                  [{pid, none},
                                   {socket, none}]);
        _ ->
            oops
    end.

With logout out of the way I can address the various reconnect conditions. If the player is online but idle, i.e. hanging out in the lobby or watching a game (drinking a Bud? Wazzup!) and is reconnecting from a different computer, I can just log them out and log them back in as if they were offline:

login(Player, player_online, Args) ->
    logout(Player#player.oid),
    login(Player, player_offline, Args);

If the player was idle when their poker client disconnected then all I need to do is replace the socket process id in the player record and tell the player process about the new socket.

login(Player, client_down, [_, _, Socket]) ->
    gen_server:cast(Player#player.pid, {'SOCKET', Socket}),
    Player1 = Player#player {
                socket = Socket
               },
    {Player1, {ok, Player#player.pid}};

If the player was in a game then we run the code above and tell the game to resend the event history.

login(Player, player_busy, Args) ->
    Temp = login(Player, client_down, Args),
    cardgame:cast(Player#player.game,
                  {'RESEND UPDATES', Player#player.pid}),
    Temp;

Overall, a combination of a real-time replicating database, a poker client that knows to reconnect to a different game server and some crafty login code allows me to provide a high degree of fault tolerance transparently to the player.

Load balancing

I can build my OpenPoker cluster from as many game server nodes as I want. I might want to allocate, say, 5000 players per game server and spread the load among the active game servers in my cluster. I can add new game servers at any time and they will automatically make themselves available to accept new players.

Gateway nodes spread the player load among the active game servers in the OpenPoker cluster. The job of a gateway node is to pick a random game server, ask it for the number of players connected and its address, host and port number where the game server is running. As soon as the gateway finds a game server where the number of players connected is less than the preset maximum it will return the address of that game server to the connected poker client and close the connection.

There's absolutely no load on gateway nodes and connections to them are extremely short-lived. You can have any cheap box acting as your gateway node.

Nodes should generally come at least in pairs so that if one node fails another one can take over. You would need a mechanism like Round-robin DNS to employ more than a single gateway node.

How do gateways learn about game servers?

OpenPoker uses the Erlang Distributed Named Process Groups facility to group game servers. The group is globally visible on all nodes; this happens automatically. New game servers join the game server group, and when a game server node goes down it is automatically removed from the group.

This is what the code to find a game server with a maximum capacity of MaxPlayers looks like:

find_server(MaxPlayers) ->
    case pg2:get_closest_pid(?GAME_SERVERS) of
        Pid when is_pid(Pid) ->
            {Time, {Host, Port}} = timer:tc(gen_server, call, [Pid, 'WHERE']),
            Count = gen_server:call(Pid, 'USER COUNT'),
            if
                Count < MaxPlayers ->
                    io:format("~s:~w: ~w players~n", [Host, Port, Count]),
                    {Host, Port};
                true ->
                    io:format("~s:~w is full...~n", [Host, Port]),
                    find_server(MaxPlayers)
            end;
        Any ->
            Any
    end.

pg2:get_closest_pid() returns a random game server process id since a gateway node is not expected to run any game servers. If a process id of the game server is returned I ask the game server for its address (host and port) as well the number of players connected. So long as the number of players connected is less than the maximum I return the game server address to the caller, otherwise I keep looking.

Multiple-outlet powerstrip middleware

OpenPoker is open source software and I have been pitching it to various poker vendors recently. All the vendors have the same problem with scalability and fault tolerance, even after several years of development. Some have recently finished major rewrites of their server software while others are just embarking on this journey. All of the vendors are heavily invested in their Java infrastructure and, understandably, do not want to switch to Erlang.

Still, it sounds to me like there is a need to be filled. The more I think about it the more it looks like Erlang can still be used to provide a cost-efficient solution while keeping things simple and straightforward. I see this solution as a multiple-outlet electrical power strip, just like the one you are probably using right now.

You write your game server as a simple socket-based server that uses a database backend. In fact, more likely than not this is how your game server is written now. Your game server is the standard electrical plug and multiple instances of your game server are plugged into my power outlets while players flow in through the other end.

You supply the game servers and I provide you with scalability, load balancing, and fault tolerance. I keep players connected to the power strip and monitor your game servers, restarting them as needed. I switch your players to another game server when one goes down and you can plug in as many game servers as you like into my power outlets.

The power strip middleware is a black box sitting between your players and your servers and likely won't even require any changes to your code. You will get all the benefits of a highly scalable, load-balanced, fault-tolerant solution while keeping your investment and modifying little of your existing infrastructure.

You can write this middleware in Erlang today, run it on a Linux box with a kernel specially tuned for a high number of TCP connections and put this box in a demilitarized zone while keeping your game servers behind a firewall. Even if you don't, I suggest that you take a close look at Erlang today and think about using it to simplify your massively multiplayer server architectures. And I will be here to help!

Sunday, March 04, 2007

10 useful bash utils

Bash Cures Cancer
Learn the UNIX/Linux command line

Title: 10 Linux commands you've never used
Published: 02-19-2007


It takes years, maybe decades, to master the commands available to you at the Linux shell prompt. Here are 10 that you have probably never heard of or used. They are in no particular order. My favorite is mkfifo.

  1. pgrep, instead of:
    # ps -ef | egrep '^root ' | awk '{print $2}'
    1
    2
    3
    4
    5
    20
    21
    38
    39
    ...

    You can do this:

    # pgrep -u root
    1
    2
    3
    4
    5
    20
    21
    38
    39
    ...
  2. pstree, list the processes in a tree format. This can be VERY useful when working with WebSphere or other heavy duty applications.
    # pstree
    init-+-acpid
    |-atd
    |-crond
    |-cups-config-dae
    |-cupsd
    |-dbus-daemon-1
    |-dhclient
    |-events/0-+-aio/0
    | |-kacpid
    | |-kauditd
    | |-kblockd/0
    | |-khelper
    | |-kmirrord
    | `-2*[pdflush]
    |-gpm
    |-hald
    |-khubd
    |-2*[kjournald]
    |-klogd
    |-kseriod
    |-ksoftirqd/0
    |-kswapd0
    |-login---bash
    |-5*[mingetty]
    |-portmap
    |-rpc.idmapd
    |-rpc.statd
    |-2*[sendmail]
    |-smartd
    |-sshd---sshd---bash---pstree
    |-syslogd
    |-udevd
    |-vsftpd
    |-xfs
    `-xinetd
  3. bc is an arbitrary precision calculator language. Which is great. I found it useful in that it can perform square root operations in shell scripts. expr does not support square roots.
    # ./sqrt
    Usage: sqrt number
    # ./sqrt 64
    8
    # ./sqrt 132112
    363
    # ./sqrt 1321121321
    36347

    Here is the script:

    # cat sqrt
    #!/bin/bash
    if [ $# -ne 1 ]
    then
    echo 'Usage: sqrt number'
    exit 1
    else
    echo -e "sqrt($1)\nquit\n" | bc -q -i
    fi
  4. split, have a large file that you need to split into smaller chunks? A mysqldump maybe? split is your command. Below I split a 250MB file into 2 megabyte chunks all starting with the prefix LF_.
    # ls -lh largefile
    -rw-r--r-- 1 root root 251M Feb 19 10:27 largefile
    # split -b 2m largefile LF_
    # ls -lh LF_* | head -n 5
    -rw-r--r-- 1 root root 2.0M Feb 19 10:29 LF_aa
    -rw-r--r-- 1 root root 2.0M Feb 19 10:29 LF_ab
    -rw-r--r-- 1 root root 2.0M Feb 19 10:29 LF_ac
    -rw-r--r-- 1 root root 2.0M Feb 19 10:29 LF_ad
    -rw-r--r-- 1 root root 2.0M Feb 19 10:29 LF_ae
    # ls -lh LF_* | wc -l
    126
  5. nl numbers lines. I had a script doing this for me for years until I found out about nl.
    # head wireless.h
    /*
    * This file define a set of standard wireless extensions
    *
    * Version : 20 17.2.06
    *
    * Authors : Jean Tourrilhes - HPL
    * Copyright (c) 1997-2006 Jean Tourrilhes, All Rights Reserved.
    */

    #ifndef _LINUX_WIRELESS_H
    # nl wireless.h | head
    1 /*
    2 * This file define a set of standard wireless extensions
    3 *
    4 * Version : 20 17.2.06
    5 *
    6 * Authors : Jean Tourrilhes - HPL
    7 * Copyright (c) 1997-2006 Jean Tourrilhes, All Rights Reserved.
    8 */

    9 #ifndef _LINUX_WIRELESS_H
  6. mkfifo is the coolest one. Sure you know how to create a pipeline piping the output of grep to less or maybe even perl. But do you know how to make two commands communicate through a named pipe?

    First let me create the pipe and start writing to it:

    pipe" title="Making the pipe and writing to it" class="screenshot" src="http://bashcurescancer.com/media/10-linux-commands-youve-never-used/mkfifo-write-to-pipe.png" height="357" width="640">

    Then read from it:

    cat pipe
  7. ldd, want to know which Linux thread library java is linked to?
    # ldd /usr/java/jre1.5.0_11/bin/java
    libpthread.so.0 => /lib/tls/libpthread.so.0 (0x00bd4000)
    libdl.so.2 => /lib/libdl.so.2 (0x00b87000)
    libc.so.6 => /lib/tls/libc.so.6 (0x00a5a000)
    /lib/ld-linux.so.2 (0x00a3c000)
  8. col, want to save man pages as plain text?
    # PAGER=cat
    # man less | col -b > less.txt
  9. xmlwf, need to know if a XML document is well formed? (A configuration file maybe..)
    # curl -s 'http://bashcurescancer.com' > bcc.html
    # xmlwf bcc.html
    # perl -i -pe 's@<br/>@<br>@g' bcc.html
    # xmlwf bcc.html
    bcc.html:104:2: mismatched tag
  10. lsof lists open files. You can do all kinds of cool things with this. Like find which ports are open:
    # lsof | grep TCP
    portmap 2587 rpc 4u IPv4 5544 TCP *:sunrpc (LISTEN)
    rpc.statd 2606 root 6u IPv4 5585 TCP *:668 (LISTEN)
    sshd 2788 root 3u IPv6 5991 TCP *:ssh (LISTEN)
    sendmail 2843 root 4u IPv4 6160 TCP badhd:smtp (LISTEN)
    vsftpd 9337 root 3u IPv4 34949 TCP *:ftp (LISTEN)
    cupsd 16459 root 0u IPv4 41061 TCP badhd:ipp (LISTEN)
    sshd 16892 root 3u IPv6 61003 TCP badhd.mshome.net:ssh->kontiki.mshome.net:4661 (ESTABLISHED)

    Note: OpenBSD 101 pointed out that "lsof -i TCP" a better way to obtain this same information. Thanks!


    Or find the number of open files a user has. Very important for running big applications like Oracle, DB2, or WebSphere:

    # lsof | grep ' root ' | awk '{print $NF}' | sort | uniq | wc -l
    179

    Note: an anonymous commenter pointed out that you can replace sort | uniq with "sort -u". This is true, I forgot about the -u flag. Thanks!

Thursday, March 01, 2007

SQL queries from javascript

http://trimpath.com/project/wiki/TrimQuery
That is really cool.
If this were combined with Berkeley DB, it would be awesome.

Monday, February 26, 2007

How to do form.fill

suppose you have a form like:

form = web.form.Form(
    web.form.Textbox('name', description='Name'),
    web.form.Textbox('subject', description='Subject'),
    web.form.Textarea('body', description='Message'),
)

You can pass a dictionary mapping field names to values, like:

form.fill(
    {
        'subject': 'This is my subject!',
        'body': 'This is the body!',
    }
)

and these values will fill your textboxes.

Tuesday, February 20, 2007

X-Sendfile for large file transfers

I read Lighttpd’s weblog (Lighty’s Life) regularly and I remember Jan talking about X-Sendfile. I thought it was interesting, but never really thought about using it…. Until today!

Basically, if you have a Ruby on Rails (or other environment) page that transfers a really massive file to the client, you should use X-Sendfile.

Here is what you’ll need:

*Lighttpd Configuration*

To the FastCGI section of your lighty config, along with host, port, etc., add “allow-x-send-file” => “enable”


fastcgi.server = ( ".fcgi" =>
  ( "server_1" => ( "host" => "10.0.1.1", "port" => 8000, "allow-x-send-file" => "enable" ) )
)
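
The post only shows the lighttpd side. On the application side the handler just sets the X-Sendfile header to the file's path and returns an empty body, and lighttpd (with allow-x-send-file enabled) streams the file itself. A rough sketch in web.py (the class name and path are made up):

import web

class download:
    def GET(self, name):
        web.header('Content-Type', 'application/octet-stream')
        # lighttpd picks the file up from disk and sends it to the client
        web.header('X-Sendfile', '/var/files/%s' % name)
        return ''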

Monday, February 19, 2007

Find shortest distance path postgis postgres linedata

You could consider using pgRouting (http://www.postlbs.org/) .

This extension to Postgres/PostGIS needs some topology, and topology support for PostGIS is still in its infancy and barely documented…



Now to implement ‘topology’ the *easy* way… ;-)

Use OpenJump! (http://openjump.org)

There’s a tool there called ‘Planar Graph’…

You can use it to get every line segment documented (start and end nodes for each line).

Add a ‘length’ field to your dataset.

OpenJump also has a tool to calculate areas and lengths… ;-)



Save your dataset from OpenJump into PostGIS.

The ‘length’ field acts as an initial cost for turning….



Then use the shortest_path() function from pgRouting and you’re on your way!



Also heard from the developers of pgRouting that support for turn restrictions is somewhere in the near future… ;-)



(OJ people: I'm posting this to the list as I think it's useful ;-) )



HTH,

Pedro Doria Meunier

Sunday, February 18, 2007

Making all columns in a table lower case

psql -c "\d tablename" > 1.txt
get the table structure
cat 1.txt | awk '{print $1}' > 2.txt
get the 1st word (column name)
for x in `cat 2.txt`; do
echo "alter table tablename rename \"$x\" to $x;" >> 3.txt
done
build a file of sql commands to run
then, paste these into psql

import zcta zip data to postgres postgis

Often you will receive data in a non-spatial form such as comma delimited data with latitude and longitude fields. To take full advantage of PostGIS spatial abilities, you will want to create geometry fields in your new table and update that field using the longitude latitude fields you have available.

General Note: All the command statements that follow should be run from the PgAdminIII Tools - Query Tool or any other PostGreSQL Administrative tool you have available. If you are a command line freak - you can use the psql command line tool packaged with PostGreSQL.

Getting the data

For this exercise, we will use US zip code tabulation areas instead of just Boston data. The techniques here will apply to any data you get actually.

First step is to download the data from US Census. http://www.census.gov/geo/www/gazetteer/places2k.html

Importing the Data into PostGreSQL

PostGreSQL comes with a COPY command that allows you to import data from a delimited text file. Since the ZCTA data is provided in fixed-width format, we can't import it easily without first converting it to a delimited format such as the default tab-delimited format that COPY works with. Similarly for data in other formats such as DBF, you'll either want to convert it to delimited using tools such as Excel, use a third-party tool that will convert from one format to another, or use one of my favorite tools, Microsoft Access, which allows you to link tables or do a straight import and export to any ODBC-compliant database such as PostGreSQL.

Create the table to import to

First you will need to create the table in Postgres. You want to make sure the order of the fields is in the same order as the data you are importing.


CREATE TABLE zctas
(
state char(2),
zcta char(5),
junk varchar(100),
population_tot int8,
housing_tot int8,
water_area_meter float8,
land_area_meter float8,
water_area_mile float8,
land_area_mile float8,
latitude float8,
longitude float8
)
WITHOUT OIDS;

Convert from Fixed-width to Tab-Delimited

For this part of the exercise, I'm going to use Microsoft Excel because it has a nice wizard for dealing with fixed-width data and a lot of Windows users have it already. If you open the zcta file in Excel, it should launch the Text Import Wizard. MS Access has a similarly nice wizard and can deal with files larger than Excel's 65,000-odd row limit. Note there are trillions of ways to do this step so I'm not going to bother going over the other ways. For non-MS Office users, other office suites such as OpenOffice probably have similar functionality.

  1. Open the file in Excel.
  2. Import Text Wizard should launch automatically and have Fixed-Width as an option
  3. Look at the ZCTA table layout spec http://www.census.gov/geo/www/gazetteer/places2k.html#zcta and set your breakouts the same as specified. For the above I broke out the Name field further into first 5 for zcta and the rest for a junk field.
  4. Next File->Save As ->Text (Tab delimited)(*.txt) -give it name of zcta5.tab
  5. Copy the file to somewhere on your PostGreSQL server.
  6. The COPY command

    Now copy the data into the table using the COPY command. Note the Copy command works using the PostGreSQL service so the file location must be specified relative to the Server.


    COPY zctas FROM 'C:/Downloads/GISData/zcta5.tab';

    Creating and Populating the Geometry Field

    Create the Geometry Field

    To create the Geometry field, use the AddGeometryColumn opengis function. This will add a geometry field to the specified table as well as adding a record to the geometry_columns meta table and creating useful constraints on the new field. A summary of the function can be found here http://postgis.refractions.net/docs/ch06.html#id2526109.

    SELECT AddGeometryColumn( 'public', 'zctas', 'thepoint_lonlat', 4269, 'POINT', 2 );

    The above code will create a geometry column named thepoint_longlat in the table zctas that validates to make sure the inputs are 2-dimensional points in SRID 4269 (NAD83 longlat).

    Populate the Geometry Field using the Longitude and Latitude fields


    UPDATE zctas
    SET thepoint_lonlat = PointFromText('POINT(' || longitude || ' ' || latitude || ')',4269)

    The above code will generate a Text representation of a point and convert this representation to a PostGis geometry object of spatial reference SRID 4269.

    There are a couple of things I would like to point out that may not be apparently clear to people not familiar with PostGreSQL or PostGis

    • || is a string concatenator. It is actually the ANSI-standard way of concatenating strings together. In MySQL you would do this using the CONCAT function and in Microsoft SQL Server you would use +. Oracle also uses ||. So what the inner part of the code would do is to generate something that looks like POINT(-97.014256 38.959448).
    • You can't just put any arbitrary SRID in there and expect the system to magically transform to that. The SRID you specify has to be the reference system that your text representation is in.

    Transforming to Another spatial reference system

    The above is great if you want your geometry in longlat spatial reference system. In many cases, longlat is not terribly useful. For example if you want to do distance queries with your data, you don't want your distance returned back in longlat. You want it in a metric that you normally measure things in.

    In the code below, we will create a new geometry field that holds points in the WGS 84 North Meter reference system and then updates that field accordingly.


    SELECT AddGeometryColumn( 'public', 'zctas', 'thepoint_meter', 32661, 'POINT', 2 );

    UPDATE zctas
    SET thepoint_meter = transform(PointFromText('POINT(' || longitude || ' ' || latitude || ')',4269),32661) ;

    Index your spatial fields

    One of the top reasons for poor query performance is lack of attention to indexes. Putting in an index can make as much as a 100-fold difference in query speed, depending on how many records you have in the table. For large updates and imports, you should add your indexes after the load: while indexes help query speed, updates and inserts against indexed fields are slower because index records must be created for the updated/inserted data. Below, we put GiST indexes on our spatial fields.


    CREATE INDEX idx_zctas_thepoint_lonlat ON zctas
    USING GIST (thepoint_lonlat);

    CREATE INDEX idx_zctas_thepoint_meter ON zctas
    USING GIST (thepoint_meter);

    ALTER TABLE zctas ALTER COLUMN thepoint_meter SET NOT NULL;
    CLUSTER idx_zctas_thepoint_meter ON zctas;

    VACUUM ANALYZE zctas;

    In the above, after we create the indexes, we put in a constraint to disallow nulls in the thepoint_meter field. The not-null constraint is required for clustering since, as of now, clustering is not allowed on GiST indexes that have null values. Next we cluster on this index. Clustering physically reorders the table in the order of the index. In general spatial queries are much slower than attribute-based queries, so if you do a fair amount of spatial queries, clustering on the spatial index can give a huge gain.

    In the above we vacuum analyze the table to ensure that index statistics are updated for our table.
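    To see the payoff of the meter-based column, here is a hedged sketch of a distance query run from Python. It assumes the psycopg2 driver, made-up connection parameters, a column named zcta holding the ZCTA code, and the legacy PostGIS 1.x function names distance() and expand() used elsewhere in this post (PostGIS 2.0+ renamed them ST_Distance and ST_Expand).

    # sketch: everything within 10 km of a given ZCTA, using thepoint_meter
    import psycopg2

    conn = psycopg2.connect("dbname=mydbname user=postgres")   # assumed parameters
    cur = conn.cursor()
    cur.execute("""
        SELECT z2.zcta,
               distance(z1.thepoint_meter, z2.thepoint_meter) AS meters
        FROM zctas z1, zctas z2
        WHERE z1.zcta = %s
          AND z2.thepoint_meter && expand(z1.thepoint_meter, 10000)  -- bbox filter, uses the GiST index
          AND distance(z1.thepoint_meter, z2.thepoint_meter) < 10000
        ORDER BY meters
    """, ('97062',))                                            # any ZCTA code
    for zcta, meters in cur.fetchall():
        print zcta, meters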

Friday, February 16, 2007

urlparse(url, scheme='', allow_fragments=1)

This is what we should use: urlparse(url, scheme='', allow_fragments=1)
Parse a URL into 6 components:
<scheme>://<netloc>/<path>;<params>?<query>#<fragment>
Return a 6-tuple: (scheme, netloc, path, params, query, fragment).
Note that we don't break the components up in smaller bits
(e.g. netloc is a single string) and we don't expand % escapes.
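A quick example of the 6-tuple it returns:

from urlparse import urlparse

scheme, netloc, path, params, query, fragment = \
    urlparse('http://www.example.com/over/there;type=a?name=ferret#nose')
print scheme, netloc, path, params, query, fragment
# http www.example.com /over/there type=a name=ferret nose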

Wednesday, February 14, 2007

pgsql users, and schemas

so have you discovered pgsql users and schemas yet?
psql -U postgres
psql> create user newuser password 'newuser';
psql> create schema newuser authorization newuser;
psql> \q
# psql -U newuser
psql> create table foo();
psql> \d
the table is then owned by the newuser user, in its own schema (like a namespace). this is somewhat how oracle user/schema-owned tables are done.
this lets you have one database and then have many users within that database, each in their own schema. it works because postgresql's default search_path is "$user", public, so unqualified table names resolve to the schema matching your user name first.
this gives much the same effect as having a different database for each project, i guess.

Adding a new id column Primary key for an existing table

10k rows. varchar types, no indexes. takes a while it seems.
ok j00 ready?
this is what i did:
alter table localeze_amacai_business add id integer;
create sequence localeze_amacai_business_id_seq;
create or replace function assign_localize_pk ()
returns integer as $_$
DECLARE
_id integer;
_count integer := 0;
_row record;
BEGIN
-- walk the table row by row, assigning the next sequence value to id
for _row in select * from localeze_amacai_business LOOP
select into _id nextval('localeze_amacai_business_id_seq');
update localeze_amacai_business
set id = _id
where "PERSISTENTRECORDID" = _row."PERSISTENTRECORDID";
_count := _count + 1;
END LOOP;
return _count;
END;
$_$ language plpgsql;
select assign_localize_pk();
drop function assign_localize_pk();
alter table localeze_amacai_business alter id set not null;
alter table localeze_amacai_business add constraint localeze_amacai_business_pk primary key (id);



but how do you set that sequence as the default for this new column?
11:18:30 am
Travis
so now it was built using the sequence:
select last_value from localeze_amacai_business_id_seq;
 last_value
------------
       9258
oh, that's easy
just:
alter table localeze_amacai_business alter id set default nextval('localeze_amacai_business_id_seq');
so now new inserts will invoke the sequence and you don't have to specify the id value

or you could always do
select into _id nextval('localeze_amacai_business_id_seq');
insert into ... (id, ...) values (_id, ...)

table
i do not understand
i do not want to specify id values
11:20:33 am
Travis
i modified the existing table, added that "id" column to it
and it defaults to the sequence now
so how do you relate to the table if you don't care what its id is?
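for what it's worth, the row-by-row loop above isn't strictly necessary: a single set-based UPDATE can backfill the id column, and then you wire up the default and the primary key. A minimal sketch, run through psycopg2 here just to keep it all in one script; the driver and connection parameters are assumptions, the table and sequence names are the ones above.

# sketch: same end result as the plpgsql loop, using one set-based UPDATE
import psycopg2

conn = psycopg2.connect("dbname=mydbname user=postgres")   # assumed parameters
cur = conn.cursor()
for sql in [
    "alter table localeze_amacai_business add id integer",
    "create sequence localeze_amacai_business_id_seq",
    "update localeze_amacai_business set id = nextval('localeze_amacai_business_id_seq')",
    "alter table localeze_amacai_business alter id set not null",
    "alter table localeze_amacai_business alter id set default nextval('localeze_amacai_business_id_seq')",
    "alter table localeze_amacai_business add constraint localeze_amacai_business_pk primary key (id)",
]:
    cur.execute(sql)
conn.commit()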

autoincrement sequence id

postgresql now has the insert into .. working
so create a table2 with all the columns that table1 (the original) has
hm, actually select into needs the target table to not exist
i guess it is possible in postgresql too, where you can have the original table, and a new table with the id pk column, and then make a plpgsql function that does
for _row in select * from table LOOP
select into _id nextval('a_sequence');
insert into new_table(id, ....) values (_id, .....);
END LOOP;

strip_phone_number

create or replace function strip_phone_number(
_in varchar
) returns varchar as $_$
DECLARE
_len integer;
_i integer;
_chr varchar;
_test varchar;
_result varchar;
BEGIN

select into _len length(_in);

_i := 0;
select into _result '';
while _i <= _len LOOP
_i := _i + 1;
select into _chr substring(_in from _i for 1);
select into _test substring(_chr from '[0-9]$');
--_test := _chr;
if (_test is not null) then
select into _result _result || _test;
end if;
raise debug '%:%:%:%', _i, _chr, _test, _result;
END LOOP;

return _result;
END;

$_$ language plpgsql;

(you will need to fix the tabs thing in MSN pastes)
select strip_phone_number('(123) 456-8909 x 1234');
strip_phone_number
--------------------
12345689091234
(1 row)
so that strips the non-digit characters from a string in pl/pgsql
but what good does that do?
don't you need the UI to undo that?
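for comparison, the same digit-stripping is a one-liner in Python with the re module:

import re

def strip_phone_number(s):
    # drop everything that is not a digit, like the pl/pgsql function above
    return re.sub(r'[^0-9]', '', s)

print strip_phone_number('(123) 456-8909 x 1234')   # 12345689091234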

Sunday, February 11, 2007

Import TIGER database to PostGIS

wget http://www.gdal.org/dl/fwtools/FWTools-linux-1.2.0.tar.bz2
tar xjf FWTools-linux-1.2.0.tar.bz2
cd FWTools-1.2.0/
./install.sh
export LD_LIBRARY_PATH=/www/ask/work/ogr/FWTools-1.2.0/lib
export GDAL_DATA=/www/ask/work/ogr/FWTools-1.2.0/share
./bin/ogr2ogr -update -append -f "PostGreSQL" PG:"host=localhost user=postgres dbname=mydbname password=password" /www/ask/work/tiger/TGR06001.RT1 layer CompleteChain -nln masuf -a_srs "EPSG:4269"
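One county file at a time gets tedious, so here is a hedged sketch that loops the same ogr2ogr command over every RT1 file in a directory. Everything in the command itself is copied from the line above; the directory layout and running it from the FWTools directory are assumptions.

# sketch: import every TIGER county file in a directory into the masuf table
import glob
import os
import subprocess

os.environ['LD_LIBRARY_PATH'] = '/www/ask/work/ogr/FWTools-1.2.0/lib'
os.environ['GDAL_DATA'] = '/www/ask/work/ogr/FWTools-1.2.0/share'

for rt1 in sorted(glob.glob('/www/ask/work/tiger/TGR*.RT1')):
    subprocess.call([
        './bin/ogr2ogr', '-update', '-append', '-f', 'PostGreSQL',
        'PG:host=localhost user=postgres dbname=mydbname password=password',
        rt1, 'layer', 'CompleteChain', '-nln', 'masuf', '-a_srs', 'EPSG:4269',
    ])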

Saturday, February 10, 2007

Installing postgis

yum install proj proj-devel

#get postgres source
cd contrib
svn co http://svn.refractions.net/postgis/trunk postgis
cd postgis
./autogen.sh
./configure --with-pgsql=/usr/local/pgsql/bin/pg_config
make && sudo make install
/usr/local/pgsql/bin/createlang plpgsql coupon
/usr/local/pgsql/bin/createlang plpgsql coupon -U postgres
/usr/local/pgsql/bin/psql -d coupon -f lwpostgis.sql -U postgres
/usr/local/pgsql/bin/psql -d coupon -f spatial_ref_sys.sql -U postgres

Vi Commands

Why multithreaded design was avoided

Multithreaded environments can be a headache. Experienced programmers know that and try to avoid threads, while inexperienced programmers find them quite attractive and usually make a mess of their applications. It all boils down to synchronization. Synchronization of threads can be very hard to get right and is fertile ground for a great number of bugs to grow. Add to that that race conditions and thread-related bugs can be extremely hard to hunt down, since the conditions to reproduce them may be unknown. The efficiency of threads is also a concern. The scripting engine for a game must be fast. The game world contains many actors that need to be updated at least every frame. You don’t want a scheduler to take up half of your CPU trying to decide which - of many, many actors - to run next. Also, if you have to spawn and delete bullet actors in the game (coming from a fast machine gun), you should start looking into thread pools and other techniques, since spawning a thread per bullet can take too long.

To sum it up, below is the list of reasons why multithreaded environments were avoided by game developers:

* Scheduling overhead
* Memory cost per thread
* Inefficient thread creation
* Synchronization problems
* More bug prone
* Difficult to debug


The main feature that makes Stackless attractive: Tasklets

The main feature that makes Stackless so attractive for use as a scripting language is the support for tasklets. Tasklets make it possible to create “micro-threads”, allowing the programmer to switch among several execution threads that only exist in the Python environment and have no dependencies on the underlying OS threads. Some would call these threads “green threads”. These threads have a very small footprint in memory and CPU. You can actually create hundreds of them with almost no overhead; every tasklet has only a few bytes of memory overhead. And scheduling takes O(1) time with a simple round-robin scheduling algorithm. If we were talking about native threads we would have almost 1MB of memory per thread and high-cost scheduling to do things we don’t need. On top of all that, the engine would behave very differently on different operating systems, even on different versions of the same operating system.
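A minimal sketch of what tasklets look like in code. It assumes the Stackless Python interpreter, where the stackless module is built in; the actor names and counts are made up.

import stackless

def actor(name, frames):
    for i in range(frames):
        print name, 'update', i
        stackless.schedule()   # hand control back to the round-robin scheduler

# tasklets are cheap: each is just a small Python object, no OS thread behind it
stackless.tasklet(actor)('player', 3)
stackless.tasklet(actor)('bullet', 3)
stackless.run()                # run the scheduled tasklets until they all finish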

Coroutines vs generators

Coroutines have a completely separate stack which is saved when they yield,
so you can have a load of nested function calls and yield from deep in the
middle of them.

Generators save only a single stack frame, so all yields must come directly
from the generator, not from functions which it calls.

You can use generators to get a similar effect to coroutines by nesting
generators and propagating the yields back up the chain, but this has to be
done explicitly at every level.

--
Duncan Booth duncan@rcp.co.uk
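A small example of that explicit propagation, since the distinction is easy to miss:

def inner():
    yield 1
    yield 2

def outer():
    # a generator cannot yield "through" a nested call; each level in the
    # chain has to re-yield the values itself
    for value in inner():
        yield value

print list(outer())   # [1, 2]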

Friday, February 09, 2007

Posting Flash Videos with FFmpeg and FlowPlayer

Posting Flash Videos with FFmpeg and FlowPlayer

Anna showing on FlowPlayer
Last night I posted my very first Flash video on the web — and it was Anna sitting there, watching her own video for 2 minutes (which probably would only interest the parents and grandparents). Anna’s video aside, I was also having fun figuring out how to get that video online.

There are many ways of putting videos online. You can either:

1. Upload your AVI/QuickTime/WMV files onto a folder somewhere inside your hosting account.
2. Use a third party video hosting service like Google Video or YouTube.

Personally I don’t like (2). You need to upload your videos to that 3rd party, and you have little control over how the final outcome will be encoded (bit rate, frame rate, quality, etc). Moreover, there are terms and conditions that you need to read through, let alone agree to. In the end, who owns the rights to the uploaded video?

Being a control freak (well, only over the systems that I need to manage), I have always preferred option (1), hosting video files inside my own accounts, which have some crazy amount of space and data transfer anyway. Except you don’t get that nice Flash applet which you can embed into your own pages so that visitors can watch the video without leaving the page. They don’t need to worry about saving it onto the desktop, which media player to use, whether the codec has been installed, etc. It Just Works™ — perfect for the grandparents :)

With a bit of time spent on research and mucking around, it turns out that you can easily achieve the effect of embedded Flash video and yet host the video files on your own server. And you don’t need to spend a penny — all of it can be done with these open source tools, FFmpeg and FlowPlayer.
The Basis

Here’s a summary of what needs to be done.

1. Convert the video file into a suitable format for Flash players.
2. Upload the converted file onto hosted account.
3. Upload the Flash player if that hasn’t been done already.
4. Paste HTML code snippet into the web page.

Flash players can only play video files encoded in the FLV (Flash Video) format, which is also the format used by Google Video and YouTube. The open source way to do the encoding is to use the universal encoder, FFmpeg.
FFmpeg

Installing FFmpeg is trivial — at least on my Gentoo boxes :) Make sure appropriate USE flags are used during emerge. For example I have:

USE="aac amr encode ogg vorbis x264 xvid zlib" emerge ffmpeg

Other Linux distribution? Not using Linux? Err. Good luck.

To convert a movie using FFmpeg, do the following:

$ ffmpeg -i movie.avi movie.flv

It will then convert the AVI file into FLV Flash Video. FFmpeg can also handle many different container types, for example QuickTime, WMV1 (not WMV3 at the moment), MPEG4, etc, so just throw the video at it and see whether it handles it.

There are many command line options that you can use to alter the encoding behaviour. For example, if I wish to rescale the movie to 320×240, at 15 frames/sec, with video at 250kbps and audio down-sampled to 22,050Hz at 48kbps, I just tell FFmpeg to do it on the command line:

$ ffmpeg -i movie.avi -s 320x240 -r 15 -b 250 -ar 22050 -ab 48 movie.flv

There are many more options so do check out their manual if you are interested.

There is another thing that we need to do — create a JPEG thumbnail for previewing. This will be displayed in the otherwise empty canvas of the Flash player before [Play] is pressed. For convenience’s sake, we’ll take the very first frame of the video.

$ ffmpeg -i movie.avi -f mjpeg -t 0.001 movie.jpg

FLVTool2

FLVTool2 is needed to calculate and update meta data in the FLV file. Well, you don’t really need it, as you can already play the FLV file that FFmpeg spits out, but because of the missing info the Flash player cannot show the buffering status, current playing position, etc.

I was hesitant to install FLVTool2 because (1) it depends on Ruby, which I would need to emerge, and (2) there is no ebuild for it. But anyway, getting it running is still trivial.

1. Make sure you already have Ruby installed.
2. Download the latest FLVTool2
3. Unpack the tarball, change into its directory, and run ruby setup.rb all as root.

Now just run

$ flvtool2 -U movie.flv

Well, installation is actually optional. You can pretty much run FLVTool2 from inside its unpacked directory, for example:

$ RUBYLIB=lib ruby bin/flvtool2 -U /movie.flv

Your FLV is ready to go! Upload both FLV and generated JPEG thumbnail onto your web hosting account. Make sure they are in the same folder.
FlowPlayer

FlowPlayer is an open source Flash video player that is light-weight (at around 22kb), and pretty easy to configure. Download the latest version from SourceForge.

Unpack the ZIP will give you the player file FlowPlayer.swf. Upload it somewhere on your website.

Now you need to cut and paste the FlowPlayer HTML snippet onto the web page where you wish to show the video: an object/embed tag that loads FlowPlayer.swf and passes the location of the FLV file and the JPEG thumbnail via flashvars (see the FlowPlayer documentation for the full markup).

[your site] is the URL to where you keep the FlowPlayer.swf. [base URL] is the directory where you keep the FLV and JPEG files. For example, the final URL to FLV file will be [base URL]/movie.flv.

Paste that onto your website, or into your blog post, and check whether it works!

Please check the FlowPlayer documentation for the options that go into flashvars.
Conclusion

In fact those steps can be easily automated with a bit of scripting. I shall be posting more movies on Anna’s website.
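For instance, here is a minimal sketch of such a script; the ffmpeg and flvtool2 options are the ones used above, and the file naming is just an assumption.

# sketch: convert a video to FLV, grab a thumbnail, and fix the FLV metadata
import subprocess
import sys

def publish(source):
    base = source.rsplit('.', 1)[0]
    flv, jpg = base + '.flv', base + '.jpg'
    # same options as above: 320x240, 15 fps, 250kbps video, 22050Hz/48kbps audio
    subprocess.call(['ffmpeg', '-i', source, '-s', '320x240', '-r', '15',
                     '-b', '250', '-ar', '22050', '-ab', '48', flv])
    # thumbnail taken from the very first frame
    subprocess.call(['ffmpeg', '-i', source, '-f', 'mjpeg', '-t', '0.001', jpg])
    # update the FLV meta data so the player can show buffering and position
    subprocess.call(['flvtool2', '-U', flv])
    # then upload flv and jpg to the same folder on the web host

if __name__ == '__main__':
    publish(sys.argv[1])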

If your hosting company is not very generous with quota (i.e. small-timers who can’t really oversell), or if you think your video will get dugg and slashdotted and become an overnight hit, then maybe having Google Video or YouTube host it for you is a wiser idea, just in case a huge hosting bill lands on your credit card statement.

Otherwise, you might choose to host those videos on your own account, and regain a bit of control.

Monday, February 05, 2007

scaling rails

I've said it before, but it bears repeating: There's nothing interesting about how Ruby on Rails scales. We've gone the easy route and merely followed what makes Yahoo!, LiveJournal, and other high-profile LAMP stacks scale high and mighty.

Take state out of the application servers and push it to database/memcached/shared network drive (that's the whole Shared Nothing thang). Use load balancers between your tiers, so you have load balancers -> web servers -> load balancers -> app servers -> load balancers -> database/memcached/shared network drive servers. (Past the entry point, load balancers can just be software, like haproxy).

In a setup like that, you can add almost any number of web and app servers without changing a thing.

Scaling the database is the "hard part", but still a solved problem. Once you get beyond what can be easily managed by a decent master-slave setup (and that'll probably take millions and millions of pageviews per day), you start doing partitioning.

Users 1-100K on cluster A, 100K-200K on cluster B, and so on. But again, this is nothing new. LiveJournal scales like that. I hear eBay too. And probably everyone else that has to deal with huge numbers.

So the scaling part is solved. What's left is judging whether the economics of it are sensible to you. And that's really a performance issue, not a scalability one.

If your app server costs $500 per month (like our dual Xeons do) and can drive 30 requests/second on Rails and 60 requests/second on Java/PHP/.NET/whatever (these are totally arbitrary numbers pulled out of my...), then you're faced with the cost of $500 for 2.6 million requests/day on the Rails setup and $250 for the same on the other one.

Now. How much is productivity worth to you? Let's just take a $60K/year programmer. That's $5K/month. If you need to handle 5 million requests/day, your programmer needs to be 10% more productive on Rails to make it even. If he's 15% more productive, you're up $250. And this is not even considering the joy and happiness programmers derive from working with more productive tools (nor that people have claimed to be many times more productive).
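A quick check of that arithmetic, using the post's own assumed numbers:

# 30 req/s on Rails vs 60 req/s elsewhere, $500/month per server, $5K/month programmer
rails_per_day = 30 * 86400             # 2,592,000, i.e. ~2.6 million requests/day per server
target = 5000000                       # requests/day to handle
extra_rails_servers = 1                # 2 Rails servers vs 1 on the faster stack
extra_cost = extra_rails_servers * 500
print rails_per_day, extra_cost, extra_cost / 5000.0   # 2592000 500 0.1 -> 10% of a programmer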

Of course, the silly math above hinges on the assumption that the whatever stack is twice as fast as Rails. That’s a very big if. And totally dependent on the application, the people, and so on. Some have found Rails to be as fast or faster than comparable “best-of-breed J2EE stacks” — see http://weblog.rubyonrails.com/archives/2005/04/04/justingehtland-is-back-with-numbers-to-back-it-up/

The point is that the cost per request is plummeting, but the cost of programming is not. Thus, we have to find ways to trade efficiency in the runtime for efficiency in the “thought time” in order to make the development of applications cheaper. I believe we’ve long since entered an age where simplicity of development and maintenance is where the real value lies.
David Heinemeier Hansson
Tuesday, July 12, 2005