One of the nastiest things about Google App Engine is its limited functionality in certain areas. Now, I’m a huge fan of the service, but I often tell friends that I don’t think it’s feasible for very large applications to the use it (not that it isn’t brilliantly scalable, but I’m talking about an ‘application’ like twitter, or a huge site like Amazon.) An example of an area with such lack of functionality, although remarkable in its own right, is the App Engine datastore.
The datastore is based on Google’s own, and quite brilliant, BigTable. The thesis is over here if you’re into that sort of thing – but the gist of is that it allows a database of, literally, petabytes of information to be sharded across thousands of computers in multiple locations, compressed, and accessed accurately and faster than any kind of data management available on the web. It’s less useful to think of BigTable as a database – if you’re designing for the App Engine, think of it as being split up like a Rubik’s cube. The rows that go across are defined by a key, which is only a few bytes in size (this is important to know later), the columns map out the different properties of the data, and another dimension is used for time-stamps. An example scenario could be somebody commenting on this blog post. There is one row for the entity – the row name for this entity is the entity’s key (accessed in GQL as “__key__”). Then there are the properties, such as the name of the commenter, the link to the commenter’s website, and the comment itself. If this comment were to be updated, then there would be two versions of the entity, each with its own time-stamp. By using the time-stamp in another dimension, it allows for simultaneous updates, so two people could be changing the comment at the same time, which is a very useful feature for an application like Google Maps where people are concurrently editing tags on table elements. So, what this means for the Google App Engine user, is that because the tables aren’t split up into distinct columns like in a relational database and like most programmers are used to, there is some lost functionality. One such functionality, is the LIKE constraint for database queries.
def updatePublic: query = db.GqlQuery ("SELECT * FROM Thought WHERE label='public' AND user=:1", users.get_current_user()) results = query.fetch(100) num = len(results) self.response.out.write(num)
# create another property in the datastore to keep as a 'read' or 'unread' status # this is quite over-simplified, but I'll illustrate some datastore design techniques class Thought(db.Model): content = db.TextProperty() time = db.DateTimeProperty(auto_now = True) label = db.StringProperty(multiline = False) read = db.BooleanProperty(default = False) def updatePublic (webapp.RequestHandler): s_query = "SELECT __key__ FROM Thought WHERE user=:1 AND label='public' AND read='False'" query = db.GqlQuery(s_query) results = query.fetch(100) num = len(results) self.response.out.write(num)
The best way to get around the missing ‘LIKE’ filter is to use inequalities. Because of the structure of the App Engine datastore, it actually allows for you to query inequlities with regards to strings (in fact, this is actually a hack that App Engine uses for ‘not equal to’, as != shouldn’t technically be available in the GQL language according to BigTable.) This ‘hack’ can be used as follows:
Where in sql you usually would query:
(“SELECT * FROM thought WHERE user LIKE gra%”)
We now have to use:
(“SELECT * FROM thought WHERE user >= ‘gra’ AND < :1″, u’gra’ + u’\ufffd'”)
Imagine the table as being sorted alphabetically, user >= ‘gra’ would include any user-names that come after, or equal, ‘G’, then ‘R’, then ‘A’ in the alphabet. The next filter is a less-than inequality, so it’s going to filter out any values that are greater than the specified term, which we set as u’gra’ + u’\ufffd’. The unicode character and ‘\ufffd’ code specify that the result must include the first three character, ‘gra’ (formatted as unicode in order to concatenate with “u’\ufff'”), as well as any other unicode characters following that. This will limit the query enough to return a value such as ‘graeme’, if the search term is ‘grae’, but will not return ‘graham’, or, say, ‘tagrae’.
The next option is to filter out the content using python. Although this method is probably your most reliable, you must keep in mind that as your application becomes more popular, you’re going to be more keen to keep datastore queries as efficient as possible to stay within quota. When I wrote the functionality for the search function for this blog (in order to blog about it ), I decided to do it completely in Python, but I also knew that there would not be such a large amount of requests that my daily quota wouldn’t handle. Common sense will help you here in deciding how to combine methods for optimal functionaliy vs. efficiency .
Now, Python comes with a vast array of string comparing tools, and thank goodness for that because it actually makes coding with series of strings a pleasure (unlike, ahem, Java for instance), but the following code is very understandable even for the intermediate, or even beginner, python programmer (the expected readership of this blog). Besides, I will be with you all the way commenting on every line that needs it . Okay, here we go:
def search_function: # you might want to store all search queries in your datastore, to use for improving content, etc.. - new_search = Search () new_search.query = term new_search.put() import re # split the term into seperate words. terms = term.split() # get all the posts to search from in the database, we can't use LIKE, so this will have to suffice. query = db.GqlQuery ("SELECT * FROM Post") results = query.fetch(10) # define found, this is the final results string, and partial - for partial matches. found = "<h3>Exact Matches</h3><br />" partial = "<h3>Partial Matches</h3><br />" # go through all posts. for result in results: # split posts into words, as well as periods, commas and dashes (so that "word." is split into ["word", "."]). words = re.split(r'[ .,-]', result.content) # set count to 0 (count is used to keep track of the current index in the words list). count = 0 # once the function has found a match in this result, we must break the loop. do_query = True for x in range(len(words)): # create a string that holds a sequence of words the lenth of the search query for exact matches. sear = words[x: x + len(terms)] # temp is used to compile this sequence into a string, for it to be comparable with the query. temp = "" # compile temp - for p in sear: temp += p + " " # check to see if the compiled temp is equal to the query - if do_query: if temp.lower() == (term.lower() + " "): # format the result in to a string, with 10 extra words (as description of the post) on either side of the result; make query term bold. formatted = "" # if the query has been found in this post - do_query = False for i in range(20): if i < 10: # words before the matched term; try statement in case there are less than 10 words before (hacky, but acceptable) try: formatted += words[count - (10-i)] + " " except: pass elif i > 10: try: formatted += words[count + (i-10)] + " " except: pass if i == 10: # make the matched query bold for the results page - formatted += "<b> " + words[count] + " " if i == (9 + len(terms)): formatted += "</b>" # add the results to a formatted string of exact matches - found += ("<a href='?p="+result.title+"'>"+result.title + "</a>" + "<p>..." +formatted + "...</p>") count += 1 # now let's do some partial matches (will match any words, in any sequence). Adds only title to the formatted string. do_query = True for x in terms: for y in words: if x == y: if do_query: do_query = False partial += "<a href='?p="+result.title+"'>"+result.title + "</a><br />" # a formatted string to output to HTML including all matches, exact and partial - res = "" # check if there are any matches, if none return a friendly message - if partial == "<h3>Partial Matches</h3><br />": return "No matches were found for your search query." else: if found != "<h3>Exact Matches</h3><br />": res += found res += partial #return the results! - return res
That covers most of what I consider important for beginning Google App Engineers to note when using the datastore. Hopefully it provides some value (I see on my Google Analytics that this blog’s readership has started to grow considerabley across 17 countries this week – props for supporting Google App Engine!)
Go in peace and productivity,