One of the nastiest things about Google App Engine is its limited functionality in certain areas. Now, I’m a huge fan of the service, but I often tell friends that I don’t think it’s feasible for very large applications to use it (not that it isn’t brilliantly scalable, but I’m talking about an ‘application’ like Twitter, or a huge site like Amazon). An example of an area with such a lack of functionality, although remarkable in its own right, is the App Engine datastore.

The datastore is based on Google’s own, and quite brilliant, BigTable. The paper is over here if you’re into that sort of thing, but the gist of it is that it allows a database of, literally, petabytes of information to be sharded across thousands of machines in multiple locations, compressed, and accessed both accurately and remarkably fast.

It’s less useful to think of BigTable as a traditional database; if you’re designing for App Engine, think of it as being split up like a Rubik’s cube. The rows are identified by a key, which is only a few bytes in size (this is important to know later), the columns map out the different properties of the data, and a third dimension is used for time-stamps. An example scenario would be somebody commenting on this blog post. There is one row for the entity, and the row name is the entity’s key (accessed in GQL as “__key__”). Then there are the properties, such as the name of the commenter, the link to the commenter’s website, and the comment itself. If this comment were updated, there would be two versions of the entity, each with its own time-stamp. Keeping the time-stamps in a separate dimension allows for simultaneous updates, so two people could be changing the comment at the same time; a very useful feature for an application like Google Maps, where people are concurrently editing tags on map elements.

What this means for the Google App Engine user is that, because tables aren’t split up into distinct columns as in the relational databases most programmers are used to, there is some lost functionality. One such piece of functionality is the LIKE constraint for database queries.
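To make that concrete, here’s a minimal sketch of how such a comment entity might look in App Engine’s Python db API. The model name and properties are my own illustration, not code from this blog:

from google.appengine.ext import db

# a hypothetical comment entity: one BigTable row, identified by its key,
# with each property occupying its own column
class Comment(db.Model):
    author = db.StringProperty()   # name of the commenter
    website = db.LinkProperty()    # link to the commenter's website
    content = db.TextProperty()    # the comment itself

comment = Comment(author='graeme', content='Great post!')
key = comment.put()  # put() returns the entity's key
# that key is the same value "SELECT __key__ FROM Comment" would return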

Where in SQL we might write “SELECT content FROM posts WHERE title LIKE '%App Engine%'”, in Google App Engine this is not possible. First of all, one cannot query for a distinct column like ‘content’, and secondly, the ‘LIKE’ filter is not available. In this post I’m going to write about how I have managed to get around these limitations. You may notice that I recently installed a ‘Search’ function on this blog (just to your right), and since I write all my own apps with App Engine in Python, I had to come up with some interesting workarounds (in fact, I’m going to explain exactly how I created that search function in this post). While I was working on pythoughts.com, I decided to make a function that checks for new public messages for a user while the user is online, so that the user never has to futilely check the public folder looking for new messages. I came up with this idea when I was using the application to chat with a friend of mine late at night. Because I hadn’t implemented the function yet, I had to wait for GTalk’s email notification pop-up to tell me whenever I had a new message; this would not do. Instead, I made a JavaScript function that sent an AJAX request to query the datastore for new posts every five seconds. Because it was so late, and my thoughts may have been slightly unfocused, I quickly wrote a Python script similar to the one that follows:

def updatePublic(self):
    query = db.GqlQuery("SELECT * FROM Thought WHERE label = 'public' AND user = :1",
                        users.get_current_user())
    results = query.fetch(100)
    num = len(results)
    self.response.out.write(num)

Then I fed this integer back to my webpage, where the JavaScript checked it against an integer it was storing, and if the response integer was greater than the old one, it gave the appropriate notification. As you can imagine, when I woke up the next morning I had already exceeded my datastore access quota, before midday! By the way, you may be wondering why I use a GQL query instead of the somewhat simpler filter methods available (I’m not going to explain the other method here, as I prefer the GQL way, especially for beginners): it actually uses less CPU time this way. Now, to solve the problem above: although you can’t select just one property of an entity, you can select its key. As stated before, the key is roughly between 10 and 100 bytes in size; very small when you compare it to a whole bunch of properties, in my case including BlobProperties, TextProperties, etc. Another idea is to further specify the content. Although this takes more coding and makes for a slightly larger table, it’s worth it in the long run once your app gets a decent number of users and you start approaching your daily quotas. The correct way to solve this problem, and the way I quickly fixed it very early that morning, is as follows:

# create another property in the datastore to keep a 'read' or 'unread' status
# this is quite over-simplified, but it illustrates some datastore design techniques

class Thought(db.Model):
    user = db.UserProperty()  # needed for the user filter below
    content = db.TextProperty()
    time = db.DateTimeProperty(auto_now=True)
    label = db.StringProperty(multiline=False)
    read = db.BooleanProperty(default=False)

class UpdatePublic(webapp.RequestHandler):
    def get(self):
        s_query = ("SELECT __key__ FROM Thought "
                   "WHERE user = :1 AND label = 'public' AND read = FALSE")
        query = db.GqlQuery(s_query, users.get_current_user())
        results = query.fetch(100)
        num = len(results)
        self.response.out.write(num)

See, here I query only for the __key__ property, thus dramatically limiting the size of the request, and I also added an extra filter. Every time a user sends a message, the ‘read’ property defaults to False, and every time the ‘public’ label is accessed, it is set to True. The script then outputs the length of the array of unread messages for the JavaScript function to display dynamically on the webpage.
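For completeness, here is a rough sketch of the other half of that flow. The handler name is my own, and the batch db.put() is just one reasonable way to do it:

class PublicFolder(webapp.RequestHandler):
    def get(self):
        # fetch the user's unread public thoughts...
        query = db.GqlQuery(
            "SELECT * FROM Thought WHERE user = :1 AND label = 'public' AND read = FALSE",
            users.get_current_user())
        thoughts = query.fetch(100)
        # ...and mark them all as read in a single batch put
        for thought in thoughts:
            thought.read = True
        db.put(thoughts)
        # (render the public folder as usual here)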

The best way to get around the missing ‘LIKE’ filter is to use inequalities. Because of the structure of the App Engine datastore, it allows you to use inequalities on strings (in fact, this is the same trick App Engine uses for ‘not equal to’, as != shouldn’t technically be available in GQL given how BigTable works). This ‘hack’ can be used as follows:
Where in SQL you would usually query:

    "SELECT * FROM thought WHERE user LIKE 'gra%'"

we now have to use:

    db.GqlQuery("SELECT * FROM Thought WHERE user >= :1 AND user < :2",
                u'gra', u'gra' + u'\ufffd')

Imagine the table as being sorted alphabetically: user >= ‘gra’ includes any user-names that sort at or after ‘gra’. The next filter is a less-than inequality, so it filters out any values that sort at or above the upper bound, which we set as u’gra’ + u’\ufffd’. Because u’\ufffd’ is a very high Unicode character, any string that begins with the three characters ‘gra’ (formatted as unicode so it can be concatenated with u’\ufffd’) sorts below that bound, whatever characters follow. This limits the query to prefix matches: it will return a value such as ‘graeme’ if the search term is ‘grae’, but will not return ‘graham’, or, say, ‘tagrae’.
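Wrapped up as a small helper (the function name is my own, not an App Engine API), the trick looks like this:

def prefix_query(prefix):
    # matches every Thought whose user name starts with the given prefix
    return db.GqlQuery(
        "SELECT * FROM Thought WHERE user >= :1 AND user < :2",
        prefix, prefix + u'\ufffd')

# prefix_query(u'grae') will match 'graeme', but not 'graham' or 'tagrae'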

The next option is to filter the content in Python. Although this method is probably your most reliable, keep in mind that as your application becomes more popular, you’ll be more keen to keep datastore queries as efficient as possible to stay within quota. When I wrote the search function for this blog (in order to blog about it), I decided to do it completely in Python, but I also knew there would not be so many requests that my daily quota couldn’t handle them. Common sense will help you decide how to combine methods for the optimal balance of functionality and efficiency.

Now, Python comes with a vast array of string-comparison tools, and thank goodness for that, because it makes working with strings a pleasure (unlike, ahem, Java for instance). The following code should be understandable even for the intermediate, or even beginner, Python programmer (the expected readership of this blog). Besides, I will be with you all the way, commenting on every line that needs it. Okay, here we go:

import re

def search_function(term):
    # you might want to store all search queries in your datastore, to use
    # for improving content, etc. (Search is a simple model with a 'query'
    # StringProperty, defined elsewhere)
    new_search = Search()
    new_search.query = term
    new_search.put()

    # split the term into separate words
    terms = term.split()

    # get all the posts to search through from the datastore;
    # we can't use LIKE, so this will have to suffice
    query = db.GqlQuery("SELECT * FROM Post")
    results = query.fetch(10)

    # 'found' is the final results string for exact matches,
    # and 'partial' is for partial matches
    found = "<h3>Exact Matches</h3><br />"
    partial = "<h3>Partial Matches</h3><br />"

    # go through all posts
    for result in results:
        # split the post into words on spaces, periods, commas and dashes
        # (the delimiters themselves are discarded by re.split)
        words = re.split(r'[ .,-]', result.content)

        # once the function has found a match in this post, stop matching it again
        do_query = True

        for count in range(len(words)):
            # take a sequence of words the length of the search query,
            # to test for an exact match
            sear = words[count: count + len(terms)]

            # join the sequence into a single string so it can be
            # compared with the query
            temp = ""
            for p in sear:
                temp += p + " "

            # check whether the joined sequence is equal to the query
            if do_query and temp.lower() == (term.lower() + " "):
                # the query has been found in this post
                do_query = False

                # format the result with up to 10 extra words of context
                # (as a description of the post) on either side of the
                # match, and make the matched term bold
                formatted = ""
                for i in range(20):
                    if i < 10:
                        # words before the matched term; guard against
                        # stepping off the front of the post
                        if count - (10 - i) >= 0:
                            formatted += words[count - (10 - i)] + " "
                    elif i > 10:
                        # words after the matched term; there may be
                        # fewer than 10 of them
                        try:
                            formatted += words[count + (i - 10)] + " "
                        except IndexError:
                            pass
                    if i == 10:
                        # make the matched query bold on the results page
                        formatted += "<b> " + words[count] + " "
                    if i == (9 + len(terms)):
                        formatted += "</b>"

                # add the result to the formatted string of exact matches
                found += ("<a href='?p=" + result.title + "'>" + result.title +
                          "</a>" + "<p>..." + formatted + "...</p>")

        # now let's do some partial matches (any query word, in any order);
        # these add only the post's title to the formatted string
        do_query = True
        for x in terms:
            for y in words:
                if x.lower() == y.lower() and do_query:
                    do_query = False
                    partial += ("<a href='?p=" + result.title + "'>" +
                                result.title + "</a><br />")

    # a formatted string to output to HTML, including all matches
    res = ""

    # check if there are any matches; if none, return a friendly message
    if partial == "<h3>Partial Matches</h3><br />":
        return "No matches were found for your search query."
    else:
        if found != "<h3>Exact Matches</h3><br />":
            res += found
        res += partial

    # return the results!
    return res
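To give an idea of how this gets wired up, here’s a rough sketch of a request handler calling it. The handler and the ‘q’ parameter name are my own; your URL scheme may differ:

class SearchHandler(webapp.RequestHandler):
    def get(self):
        # read the search term from the query string and write out the results
        term = self.request.get('q')
        self.response.out.write(search_function(term))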

That covers most of what I consider important for beginning Google App Engineers to note when using the datastore. Hopefully it provides some value (I see on my Google Analytics that this blog’s readership has started to grow considerably, across 17 countries this week. Props for supporting Google App Engine!)

Go in peace and productivity,

PyThoughts
