22 Aug 2011

MVC With Express and Node.js

Coming from a background in PHP and writing MVC-based applications using the Zend Framework, I was rather accustomed to the clean and organized separation of concerns: Models contained business logic, Views contained all the presentational HTML, and of course Controllers contained actions which facilitated the flow of data between Models and Views. Now in my move to node.js I've noticed a lack of support for the "controller" part of the MVC pattern (possibly due more to the nature of Javascript).

So far I've been very impressed with Express, which out of the box supports view rendering (complete with layout and partial support), and with the addition of Mongoose you could also have robust models. The one thing that seems to be missing is a well-formed system for supporting controllers and actions.

The example Express applications seem to place control logic (and perhaps some business logic as well) in the app.js file. Take a look at the code below (as seen in a typical app.js file) and notice how the route callback serves as a controller for a URL:

UPDATE: It was only after I wrote this blog post that I finally found the MVC example, but I hope that this post can still serve to help those transitioning from Zend Framework.

// myapp.com/users
app.get('/users',function(req,res){
        var users = ["Joe","Bob"];
        res.render("users", {"users":users});
});

// myapp.com/users/123
app.get('/users/:id',function(req,res){
        var user = "Joe";
        res.render("user",{"user":user});
});

This might be fine for supporting a few routes, but when supporting potentially dozens of routes this file can become very large, difficult to maintain, and virtually impossible to edit in a team environment. This approach also does not support "actions"—at least not in the way you may be accustomed to coming from a Zend Framework or Rails background.

The solution proved to be a simple matter of leveraging the node.js module system. By placing the control logic in a module, we now have controllers and their actions completely separated out into their own files. The code below demonstrates how to separate the controller/action logic from the route logic:

/* controllers/users.js
   this file holds all control logic
*/

// list action
exports.list = function(req,res){
        var users = ["Joe","Bob"];
        res.render("users", {"users":users});
};

// detail action
exports.detail = function(req,res){
        var user = "Joe";
        res.render("user",{"user":user});
};



/* app.js
   this file now only sets configuration and defines routes
*/ 
var users = require('./controllers/users');
app.get('/users', users.list); // map to list action of users module
app.get('/users/:id', users.detail); // map to detail action

If you are new to node.js, you may have wondered about the "exports" seen in the users.js file. Assigning to the exports object simply attaches the list and detail functions to the users module, which app.js then loads with require(). You can read more about the node.js module system here.

It is worth noting that the semantics here may be a little confusing for anyone coming from Zend Framework, which has its own support for "modules." The thing to remember is that a node.js module is completely different. It is basically a way to package things into an object (a Javascript object, that is) to be used for various purposes. Node.js modules do not map to a URI like they do in Zend Framework, at least not out of the box. In the example given above our "users" module is functioning as our controller. Other modules may handle session management or database abstraction.
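
Because an action is just a function of (req, res), it can be exercised without starting Express at all. Here is a minimal sketch of that idea, not code from the original app: the stub response object and the hard-coded data are assumptions made purely for illustration, and the controller is written as a plain object so everything fits in one file. The detail action reads req.params.id, which is where Express exposes the :id segment of the route.

```javascript
// A controller in the style of controllers/users.js. In a real app these
// functions would be attached to `exports` instead of a local object.
var users = {
    // list action
    list: function (req, res) {
        res.render("users", { users: ["Joe", "Bob"] });
    },
    // detail action: Express places the ":id" route segment on req.params
    detail: function (req, res) {
        res.render("user", { user: "Joe", id: req.params.id });
    }
};

// Hypothetical stand-in for the Express response object; it only records
// what the action asked to render.
function makeStubResponse() {
    return {
        rendered: null,
        render: function (view, locals) {
            this.rendered = { view: view, locals: locals };
        }
    };
}

var res = makeStubResponse();
users.detail({ params: { id: "123" } }, res);
console.log(res.rendered.view, res.rendered.locals.id);
```

Keeping actions free of any reference to app is what makes this possible: the route file decides which URL maps to which action, and the action only needs a request and a response.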

10 May 2011

CSS Positioning Explained - Part 1

CSS positioning may seem arcane at first, and sometimes even daunting, but like most things it's the lack of understanding that makes it seem this way. Many students from my class have asked for this explanation to be posted, so here I will attempt to show you the fundamental concepts of CSS positioning and explain its rather anti-intuitive nature so that you too can love it like I do. It really is quite useful, but only if you truly understand the various behaviors.

Position Values

The position property may take one of the following values:

  • static (default)
  • relative
  • absolute
  • fixed

Static

Every CSS property defaults to something, and in this case, the position property defaults to static. A static setting is what allows elements to follow page flow. If you are not already aware, "page flow" is the CSS principle by which the browser renders elements in the same order in which they were written in the HTML. So if in your HTML you have a heading followed by a paragraph, the browser will then display the paragraph below the heading. Static positioning simply means that the browser will render an element's position based on the size and position of the element(s) that preceded it in the HTML. It is also important to note that an element with static positioning will ignore top, right, bottom and left properties. Positioning with values other than static allows you to override page flow and force an element to be displayed in another position on the page.

Relative

The other three positions, relative, absolute, and fixed, make use of the top, right, bottom, and left properties. The key to understanding these properties is this pivotal question: "From what point(s) are these values measured?"

You can think of relative positioning as a kind of "offset," or a way to shift an element from its normal position. Relative positioning behaves just like static in that the position of an element is first determined by page flow (see above), but now you can use the top, right, bottom, and left properties to shift the element in a particular direction. Herein lies one of the anti-intuitive idiosyncrasies of CSS: with relative positioning, setting a positive value for any of the top, right, bottom, or left properties will shift the element in the opposite direction. Take a look at the following code:

<style>
    div {border:solid 1px green; width: 400px; margin:0 auto; padding:10px;}
    p {
        position:relative;
        border:solid 1px red;
        margin-bottom:30px;
    }
</style>
<div>
    <p>This paragraph still follows page flow.</p>
    <p style="right:40px;">This paragraph is <strong>shifted leftward</strong> by setting its <em>right</em> property to a <strong>positive</strong> value.</p>
    <p style="right:-40px;">This paragraph is <strong>shifted rightward</strong> by setting its <em>right</em> property to a <strong>negative</strong> value.</p>
</div>

In this example all three paragraphs still follow page flow; they are rendered in the same order that they were written in the HTML. The second one, however, is shifted by setting the right property to a positive value, while the third is shifted by setting the right property to a negative value. This will produce the following effect:

[Screenshot: the second paragraph shifted 40px to the left and the third shifted 40px to the right]

Copy the code above, paste it into an HTML document, and experiment by changing the right property to one of top, left, or bottom. Also try setting the values to positive or negative pixels (px) or percentages (%). You'll see that by using top instead of right, for example, the paragraph will shift downward or upward depending on whether you use positive or negative values respectively.

So to answer our key question from earlier: from what point(s) are these values measured? For relative positioning, you could say that the values for top, right, bottom, and left are measured from those respective sides of the element being positioned. Setting a right property will shift the element from the right side of said element. The direction will depend on positive or negative values.
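
If it helps to see the rule as arithmetic, here is a small sketch in javascript. The function name and the convention (positive dx means rightward, positive dy means downward) are my own assumptions for illustration; the function only models pixel offsets on a relatively positioned element in left-to-right text.

```javascript
// Convert the offsets of a position:relative element into a visual shift in
// pixels. Positive dx moves the element right, positive dy moves it down.
function relativeShift(offsets) {
    var dx = 0, dy = 0;
    // When both sides are set, left wins over right (in left-to-right text)
    // and top wins over bottom.
    if (typeof offsets.left === "number") dx = offsets.left;
    else if (typeof offsets.right === "number") dx = -offsets.right;
    if (typeof offsets.top === "number") dy = offsets.top;
    else if (typeof offsets.bottom === "number") dy = -offsets.bottom;
    return { dx: dx, dy: dy };
}

console.log(relativeShift({ right: 40 }));  // dx is -40: shifted leftward
console.log(relativeShift({ right: -40 })); // dx is 40: shifted rightward
```

In other words, each property pushes the element away from the side it names, which is why a positive right value moves the paragraph to the left.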

One More Idiosyncrasy

There is one more thing to keep in mind when using relative positioning to shift elements: when an element using relative positioning is shifted, the browser reserves the space the positioned element would have taken up under normal page flow. Therefore a relatively positioned element is never truly removed from page flow. Take a look at the following code and example:

<style>
    div {border:solid 1px green; width: 400px; margin:0 auto; padding:10px;}
    p {
        position:relative;
        border:solid 1px red;
        margin-bottom:40px;
    }
</style>
<div>
    <p>This paragraph still follows page flow.</p>
    <p>As does this paragraph.</p>
    <p style="top:-40px;">This paragraph is <strong>shifted upward</strong> by setting its <em>top</em> property to a <strong>negative</strong> value.</p>
    <p>This paragraph also follows page flow.</p>
</div>

In this code you can see that only the third paragraph has its top property set. Without any directional properties set, the other paragraphs just continue to follow normal page flow, and will look something like this:

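[Screenshot: the four paragraphs, with black arrows marking the gap below the first paragraph and the larger gap below the third]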

Notice the two gaps highlighted by the black arrows. The topmost gap is simply the first paragraph's bottom margin, which is set at 40 pixels. The bottommost gap however is 80 pixels. This is because the browser positioned the fourth paragraph according to where the third would have fallen had it been positioned with a static setting. Since the third paragraph also has a 40 pixel margin and was shifted upward by 40 pixels, the effective visual gap is 80 pixels. The browser reserved that space.
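
The gap arithmetic can be written out as a quick sketch. The two helper functions below are hypothetical and only cover this simple case: stacked paragraphs whose spacing comes from a bottom margin, with a vertical shift given in pixels (negative means upward).

```javascript
// Visible gap between a shifted element and the element that follows it.
// The follower stays where page flow put it, so an upward shift widens the gap.
function visibleGapBelow(marginBottom, shiftY) {
    return marginBottom - shiftY;
}

// Visible gap between a shifted element and the element that precedes it.
// An upward shift eats into the preceding element's bottom margin.
function visibleGapAbove(precedingMarginBottom, shiftY) {
    return precedingMarginBottom + shiftY;
}

console.log(visibleGapBelow(40, -40)); // 80: the gap under the third paragraph
console.log(visibleGapAbove(40, -40)); // 0: no gap between the second and third
```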

Also notice the gap, or lack thereof, between the second and third paragraphs. In this case, because the third paragraph is shifted upward by exactly the 40 pixels of the second paragraph's bottom margin, that margin is visually cancelled out.

So that's that for relative positioning. Almost. It does actually have one more caveat, but only when using absolute positioning on child elements. We'll cover that in part 2.

24 Apr 2011

I'm a "How" Kinda Guy

I'm not terribly original. In today's web industry, where so many are looking for or working on the latest startup, "unoriginal" is probably not a trait you'd hear many admit to. But I'm ok with being unoriginal because I've realized that my strengths lie in the "how," rather than in the "what." As a developer it is my job, my passion, my innate talent, to make manifest any idea. So often in conversation someone will express an idea, and instantly my mind is at work breaking it down into all its constituent parts, forming data relationships, writing schemas and stepping through code. I see a design and instantly I'm looking at each piece determining the most appropriate HTML elements, and how those elements would be grouped, floated, positioned, or otherwise styled. It's a modern symbiotic relationship: you are the visionary saying, "This is what we need to do." I am the engineer saying, "OK, this is how we do it." You dream in visions of a better product, a better world, an easier app, of problems solved, and millions of users. I dream in code.

5 Oct 2010

A More Elegant Solution to Search Engine Friendly AJAX Browsing

AJAX browsing is a beautiful thing; it provides the user with a much more responsive, fluid, and engaging browsing experience. But is it searchable? That is, can a search engine get the same content as would be delivered to users via an AJAX link? Google is very aware of the need to do something about this little conundrum, and has indeed offered a solution. It is however a bit ugly, so here is my slightly more elegant solution as seen on Callisto.fm and soon to be seen on another site I'm finishing.

The Problem

The user wants to navigate through the site, but why reload the entire page? Why interrupt playback of audio or video? Static elements such as the header, nav, and footer are already there, so why waste time and bandwidth serving them again?

Pulling content into the browser via AJAX is easy, but how do we make that same content visible to search engines? In effect, the following two URLs need to deliver the same content:

http://www.callisto.fm/#/browse/by/channel/
http://www.callisto.fm/browse/by/channel/

The Solution

The solution consists of three key steps:

1. Serve traditional relative URLs so that search engines can reach them
2. Use javascript to make select URLs AJAX driven on the client side
3. Have the server respond with only unique content for any AJAX request 

Let's first take a look at a portion of the raw HTML as sent by the server for a plain ol' GET of http://www.callisto.fm/ :

<ul class="nav main">
    <li class="listen"><a id="btnListen" class="hash selected" href="/">Listen</a></li>
    <li class="browse"><a id="btnBrowse" class="hash" href="/browse/by/channel/">Browse</a></li>
    <li class="search"><a id="btnSearch" class="hash" href="/search/">Search</a></li>
</ul>

Notice the anchors in the code above: the href attributes do not contain a hash tag. This allows search engines to reach that URL and get the appropriate content. Of course that's old stuff we learned in HTML 101, and we still need it, but we also need those URLs to be AJAX driven as far as the user is concerned.

Now notice that the anchors have the class "hash" assigned to them. We use jQuery and the following bit of javascript to dynamically insert the hash tag:

$('a.hash').each(function()
{
    this.href='/#'+$(this).attr('href');
});

I'm sure you've already figured out what this code is doing: javascript is grabbing every anchor with class="hash" and prepending the hash tag to the URL. Because search engines don't execute javascript, they keep seeing the plain URLs, while browsers get the hashed ones as soon as the code runs.
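
The string manipulation at the heart of that loop can be pulled out into a tiny function, which makes it easy to try without a page or jQuery. This is only a sketch: the function name is made up, and the guard against hashing a path twice is an addition of mine that the code above does not have.

```javascript
// Prefix a site-relative path with "/#" so the browser treats it as a hash URL.
// Paths that are already hashed are returned untouched, so the conversion is
// safe to run more than once (an extra guard, not in the original snippet).
function toHashUrl(path) {
    if (path.indexOf("/#") === 0) return path;
    return "/#" + path;
}

console.log(toHashUrl("/browse/by/channel/")); // "/#/browse/by/channel/"
```

A guard like this could matter if the conversion is ever re-run over content that was itself pulled in via AJAX and might contain links that were already converted.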

The Client-Side AJAX Request

Now that our href's have been modified to include the necessary hash, we need to implement the AJAX request and display the content. Take a look at the following code:

if("onhashchange" in window)
{
    $(window).bind('hashchange',function(){Ajax.onHashChange();});
}

Modern browsers make this really easy by supporting the onhashchange event of the window object. (The final code will also support older browsers by polling window.location.) In the code above we are binding the onhashchange event to the onHashChange() function of the Ajax object. (The complete code is available at the end of this post.)

With this in place, any time the hash tag in the browser's address bar changes, our code will be fired. It is this code that requests a hashed URL via AJAX:

onHashChange:function()
{
    this.hash=decodeURIComponent(window.location.hash); // safari returns the hash encoded
    if (this.hash!='') // only make a request if the hash is not empty
    {
        var page = this.hash.replace('#/','/');
        $.ajax(
        {
            type: "GET",
            url: page,
            dataType: "html",
            success:function(data, stat, Xhr)
            {
                $('#content').html(data);
                document.title=Xhr.getResponseHeader("X-XHR-Title");
            }
        });
    }
}

Here we are only making a request if the hash is not empty. The success callback is where the content gets displayed. In this example we are injecting the new HTML into an element with id="content". (Note: it is possible to make that dynamic using custom HTTP headers.) Now that our HTML has been requested and inserted, we need to update the browser's title bar, and we do that with a custom HTTP header called "X-XHR-Title". Setting that header server-side is easy enough, as you'll see below.

So, given this URL:

http://www.callisto.fm/#/browse/by/channel/

the hash value would be:

/browse/by/channel/

thus the AJAX request would perform a GET for:

http://www.callisto.fm/browse/by/channel/
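
The same mapping from hash to request path, written as a standalone function so it can be checked by hand. Treat it as a sketch: the function name is hypothetical, and returning null for an empty or bare "#" hash is my own simplification of the "only request if the hash is not empty" check.

```javascript
// Turn the value of window.location.hash into the path to request via AJAX.
// Returns null when there is nothing to request.
function hashToPath(hash) {
    var decoded = decodeURIComponent(hash); // some browsers return the hash encoded
    if (decoded === "" || decoded === "#") return null;
    return decoded.replace("#/", "/");
}

console.log(hashToPath("#/browse/by/channel/")); // "/browse/by/channel/"
```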

The Server-Side Response to an AJAX Request

Herein lies the trick that ties it all together. A normal request for the previous URL would deliver the entire page: header, nav, footer and all. But we don't want all those elements delivered again; we only want the content unique to that page, so we need to make the server aware of AJAX requests and modify our output accordingly.

The Zend Framework makes this very easy, as I'm sure do other frameworks like Symfony or Rails, so the concept here is mainly what I'm speaking to.

In a framework that utilizes a MVC approach and layouts, the unique content per page would be the view, and the more static elements such as header and footer would be a part of the layout. So when handling an AJAX request, we simply disable the layout and allow the view to be rendered and served all by itself. Voila. That's it. Here it is as written in PHP as part of a Zend Framework-based application:

if ($this->getRequest()->isXmlHttpRequest())
{
    $this->getHelper('layout')->disableLayout();
}

This little but crucial piece is usually placed in App_Controller_Action::init() but could easily be placed in the bootstrap or a front controller plugin.
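
Since the concept matters more than the framework, here is the same decision sketched in javascript. It is an illustration only: the layout function and its markup are made up, and the header lookup assumes lower-cased header names as node.js provides them. The check itself relies on the X-Requested-With: XMLHttpRequest header that jQuery adds to same-origin AJAX requests, which is also what Zend's isXmlHttpRequest() looks at.

```javascript
// True when the request carries the header jQuery adds to same-origin AJAX
// calls. Header names are assumed to be lower-cased already.
function isXmlHttpRequest(headers) {
    return headers["x-requested-with"] === "XMLHttpRequest";
}

// Hypothetical layout: wraps a view in the static header and footer.
function wrapInLayout(viewHtml) {
    return "<header>...</header>" + viewHtml + "<footer>...</footer>";
}

// AJAX requests get the bare view; everything else gets the full page.
function renderResponse(headers, viewHtml) {
    return isXmlHttpRequest(headers) ? viewHtml : wrapInLayout(viewHtml);
}

console.log(renderResponse({ "x-requested-with": "XMLHttpRequest" }, "<p>Channels</p>"));
console.log(renderResponse({}, "<p>Channels</p>"));
```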

As for the title HTTP header, have a look at this method, as found in App_Controller_Action:

protected function setAjaxTitle($title)
{
    // strip everything outside printable ASCII, which can cause issues in delivery and display of the title
    $title = preg_replace('/[^\x20-\x7E]+/','', $title);
    // set the response header
    $this->getResponse()->setHeader('X-XHR-Title',$title,true);
}

And you would call this function from a controller action, setting the title specifically for the associated view.
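
For comparison, here is the same clean-up as a javascript sketch, in case the header is being set from something other than PHP. The function name is hypothetical. It keeps only printable ASCII (space through tilde), which also removes any line breaks that would otherwise corrupt the header.

```javascript
// Remove every character outside printable ASCII so the title is safe to
// send as an HTTP header value.
function sanitizeHeaderTitle(title) {
    return title.replace(/[^\x20-\x7E]+/g, "");
}

console.log(sanitizeHeaderTitle("Browse by Channel\r\n")); // "Browse by Channel"
```

Note that accented and other non-ASCII characters are simply dropped, so a title like "Café" arrives as "Caf". If that matters, the title would need to be encoded on the server and decoded on the client instead.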

One Last Thing—The Bow on Top

All of this works perfectly well if the user enters the site from the home page. But we know that users need to be able to enter from any URL, including those AJAX permalinks they're sure to bookmark. So how do we get the site to behave properly in those cases? Take a look at the Ajax.init() function below:

// run when page is first loaded from the server, not when content is pulled via AJAX
init:function()
{
    this.convertLinks();
    this.hash=decodeURIComponent(window.location.hash); // get the initial hash value
    
    if("onhashchange" in window)
    {
        $(window).bind('hashchange',function(){Ajax.onHashChange();});
    }
    else
    {
        // requires the jquery timer plugin
        this.HashPoller=$.timer(20,function() // check the address every 20 milliseconds
        {
            if(Ajax.hash != decodeURIComponent(window.location.hash))
            {
                Ajax.onHashChange();
            }
        });
    }
    
    if(this.hash=='')
    {
        // load the home page
        window.location.hash='#/';
    }
    else
    {
        // there WAS a hash value when the page was first loaded. The user came in through an AJAX permalink
        this.onHashChange();
    }
}

Here we check whether the browser's address bar already contained a hash on the initial page load. If so, we call the onHashChange handler directly.
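
The polling fallback does not strictly need the timer plugin. Below is a rough sketch of the same idea using nothing but a closure; the names are made up, and it differs slightly from the code above in that the watcher remembers the last hash itself rather than comparing against Ajax.hash.

```javascript
// Returns a check() function that calls onChange(newHash) whenever the value
// produced by getHash() differs from the last value seen.
function makeHashWatcher(getHash, onChange) {
    var last = getHash();
    return function check() {
        var current = getHash();
        if (current !== last) {
            last = current;
            onChange(current);
        }
    };
}

// In a browser without onhashchange, the watcher could be driven by a timer:
//   var check = makeHashWatcher(function () { return window.location.hash; }, handler);
//   setInterval(check, 20);

// Simulated address bar, so the sketch runs outside a browser.
var fakeHash = "#/";
var seen = [];
var check = makeHashWatcher(function () { return fakeHash; }, function (h) { seen.push(h); });
check();                 // nothing changed yet
fakeHash = "#/search/";
check();                 // change detected
check();                 // same hash again, no second call
console.log(seen);
```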

Ah, but what if the user enters through an indexed page such as http://www.callisto.fm/browse/by/channel/ ? In that case, to make things all nice and neat, we would want to redirect them to the home page, with their requested page included as the hash. However, because search engines need to crawl these pages we cannot issue that redirect via the server generated HTTP location header; we must have the client redirect itself.

To do this, the server side would need to check for two things: that the page is not the homepage (or any other non redirected page), and that the request was not an AJAX request. If this is the condition, then the server can append a dynamic javascript enclosure. Here is the server side code as found in the dispatchLoopStartup method of a front controller plugin (note: you'll need Mojito_JsInit ):


if (!$this->getRequest()->isXmlHttpRequest())
{
    // an array of page requests that are never redirected to a hashed URL. Always the home page and those that are served via modal or iframe
    $nonRedirects = array ('index/index','account/signup','auth/login');
    
    $requestedPage = $this->getRequest()->getControllerName().'/'.$this->getRequest()->getActionName();
    
    if (!in_array($requestedPage,$nonRedirects))
    {
        Mojito_JsInit::getInstance()
            ->addMethod('Ajax.redirect','/#'.$this->getRequest()->getRequestUri()) // add the javascript call to redirect to the hashed url
            ->lock(); // prevent further methods from being added on this request
    }
}

and the related javascript component:

redirect:function(url)
{
    window.location.href=url;
}
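
The server-side decision boils down to two checks, which can be sketched as a plain function. This is an illustration in javascript rather than the actual plugin: the shape of the request object (isXhr, controller, action, uri) is an assumption made for the example.

```javascript
// Returns the hashed URL the client should be redirected to, or null when
// the request should be served as-is.
function hashRedirectTarget(request, nonRedirects) {
    if (request.isXhr) return null; // AJAX requests only want the view
    var page = request.controller + "/" + request.action;
    if (nonRedirects.indexOf(page) !== -1) return null; // home page, modals, iframes
    return "/#" + request.uri;
}

var nonRedirects = ["index/index", "account/signup", "auth/login"];
console.log(hashRedirectTarget(
    { isXhr: false, controller: "browse", action: "by", uri: "/browse/by/channel/" },
    nonRedirects
)); // "/#/browse/by/channel/"
```

Search engines never run the redirect, so they index the page at its plain URL, while a visitor arriving from a search result ends up on the hashed equivalent.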

And there you have it. Take a look at http://callisto.fm and see for yourself. Every URL that is served is also fully visible to search engines with the exact same content, all while keeping the URLs nice and pretty. Oh, and yes, those AJAX pageviews are still tracked in Google Analytics by calling pageTracker._trackPageview(). The javascript engine for this is posted below, but remember to include the jquery timer plugin. As for the server side, you'll need to piece that together yourself, but I have every confidence that you will!

var Ajax =
{
    HashPoller:null,
    hash:'',
    init:function()
    {
        this.convertLinks();
        this.hash=decodeURIComponent(window.location.hash);
        
        if ("onhashchange" in window) // use the onhashchange event
        {
            $(window).bind('hashchange',function(){Ajax.onHashChange();});
        }
        else this.HashPoller=$.timer(20,function() // otherwise poll the address
        {
            if (Ajax.hash!=decodeURIComponent(window.location.hash))
            {
                Ajax.onHashChange();
            }
        });
        
        if (this.hash=='')
        {
            window.location.hash='#/';
        }
        else
        {
            this.onHashChange();
        }
    },
    convertLinks:function(parent)
    {
        // use parent arg to define a parent element, thereby limiting the scope of anchor manipulation
        if (parent!=undefined)
        {
            parent=parent+' ';
        }
        else
        {
            parent='';
        }
        var selector=parent+'a.hash';
        $(selector).each(function()
        {
            this.href='/#'+$(this).attr('href');
        });
    },
    onHashChange:function()
    {
        this.hash=decodeURIComponent(window.location.hash); // safari returns hash encoded
        if (this.hash!='')
        {
            var page = this.hash.replace('#/','/');
            $.ajax(
            {
                type: "GET",
                url: page,
                dataType: "html",
                success:function(data, stat, Xhr)
                {
                    // set the X-XHR-Container header server side to determine which element will receive the requested content
                    // or just hard code it here. Must be a jQuery selector and might be something like '#content'
                    var container=Xhr.getResponseHeader("X-XHR-Container");
                    var $container = $(container);
                    $container.html(data);
                    document.title=Xhr.getResponseHeader("X-XHR-Title");
                    
                    // fire google analytics
                    if (typeof window.pageTracker=='object') pageTracker._trackPageview(page);
                    // woopra analytics
                    if (typeof window.woopraTracker=='object') woopraTracker.track(page,document.title);
                    // in case the content contains the FBML fb:like tag
                    if (typeof window.FB=='object') FB.XFBML.parse($container[0]);    
                }
            });
        }
    },
    redirect:function(url) // used when the client entered through other than the home page
    {
        window.location.href=url;
    }
}

 

13 May 2010

Simple Models with Zend_Db_Table

For simple data models Zend_Db_Table is a fine choice, as a full-blown ORM like Doctrine might be overkill if all you have is 2 or 3 resources. Here is an example of my implementation of Models based on the Zend_Db package:


class App_Model_Users extends Mojito_Model_Abstract
{    
    protected $_dbTableClass='App_Model_Users_Table';
    
    public function getByEmail($email)
    {
        $Select=$this->_DbTable->select()->where(new Zend_Db_Expr('LOWER(usrEmail)=?'),strtolower($email));
        $User=$this->_DbTable->fetchRow($Select);
        return $this->verifyRow($User);
    }

}

class App_Model_Users_Table extends Zend_Db_Table_Abstract
{
    protected $_name = 'users';
    protected $_primary = 'user_id';
    protected $_rowsetClass = 'App_Model_Users_Rowset';
    protected $_rowClass = 'App_Model_Users_Row';
}

class App_Model_Users_Rowset extends Zend_Db_Table_Rowset_Abstract
{
}

class App_Model_Users_Row extends Zend_Db_Table_Row_Abstract
{
    protected function _insert()
    {
        // pre-insert logic such as:
        $this->password = sha1($this->password);
    }
    
    protected function _postInsert()
    {
        // email user a welcome
    }
    
    protected function _postDelete()
    {
        // delete related files such as avatar
        // can also get a rowset of related many's to delete
    }
    
}

So this would go into application/models/Users.php and in my controller, I'd do something like this:


class UsersController extends Zend_Controller_Action
{
    public function init()
    {
        $this->Users = Mojito_Model::get('Users');
    }
    
    public function signupAction()
    {
        $Form = Mojito_Form::get('Signup');
        if ($this->getRequest()->isPost())
        {
            if ($Form->isValid($this->getRequest()->getPost()))
            {
                $User=$this->Users->create($Form->getValues());
                if ($User!==false) $this->_redirect('/welcome');
            }
        }
        $this->view->Form = $Form;
    }
}

Voila, you have a Users model, and are creating users directly from form data. You'll need the Mojito_Model library for it to work, and be sure to call Mojito_Model::setOptions() from your bootstrap so it knows where to find your model files. I'll flesh this example out a bit more later when I have more time.

Get the Mojito library from Google Code

 

7 May 2010

JSON Configuration Files for Zend Framework with TMJ_Config_Json

I love JSON, and prefer writing it over both INI and YAML. So here is my class that offers a Zend_Config object from a JSON-encoded text file. Note that there is no significant difference in performance between parse_ini_file() and json_decode(file_get_contents($file)).

 

class TMJ_Config_Json extends Zend_Config
{
    /**
     * Create Zend_Config object from json formatted file
     *
     * @param string $file
     * @param string [$section=null]
     * @param array [$options=array()]
     */
    public function __construct($file,$section=null,$options=array())
    {
        if (!file_exists($file)) throw new Zend_Config_Exception($file.' could not be found.');
        else
        {
            $jsonString = file_get_contents($file);
            $dataArray = Zend_Json::decode($jsonString);
            if ($dataArray==null) throw new Zend_Config_Exception($file.' is not formatted correctly.');
            
            // figure out inheritance
            foreach ($dataArray as $key => &$val)
            {
                if (isset($val['_inherits']))
                {
                    if (!isset($dataArray[$val['_inherits']])) throw new Zend_Config_Exception('Section '.$key.' could not find section '.$val['_inherits'].' from which to inherit.');
                    // note: array_merge_recursive() combines duplicate scalar keys into arrays instead of overriding them
                    $val=array_merge_recursive($dataArray[$val['_inherits']],$val);
                    unset($val['_inherits']);
                }
            }
            unset($val); // break the reference left over from the foreach
            
            // determine section to return
            if (!empty($section))
            {
                if (!array_key_exists($section,$dataArray)) throw new Zend_Config_Exception($section.' was not found in '.$file);
                else $dataArray = $dataArray[$section];
            }
            
            $allowModifications = isset($options['allowModifications'])?(bool) $options['allowModifications']:false;
            parent::__construct($dataArray, $allowModifications);
        }
    }
}

And here is an example of routes defined in JSON:

{
    "production":
    {
        "login":
        {
            "type":"Zend_Controller_Router_Route_Static",
            "route":"/login",
            "defaults":
            {
                "controller":"index",
                "action":"index"
            }
        },
        "signup":
        {
            "type":"Zend_Controller_Router_Route_Static",
            "route":"/signup",
            "defaults":
            {
                "module":"api",
                "controller":"students",
                "action":"post"
            }
        }
    },
    
    "staging":
    {
        "_inherits":"production"
    },

    "testing":
    {
        "_inherits":"production"
    },
    
    "development":
    {
        "_inherits":"production"
    }
}

Please try it out and let me know of any issues you find. I suspect there will be some issues regarding inheritance, and I'm quite certain that inheritance can be handled in a much better way.
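
One inheritance issue worth knowing about: PHP's array_merge_recursive() does not override a value that exists in both sections. It collects both values into an array, so a child section that redefines "route" would end up with two routes. The sketch below shows one possible alternative, an "override" merge, written in javascript to keep it short. It is not the class above and the function names are made up; it also follows chains of _inherits and rejects circular ones, which the class above does not attempt.

```javascript
function isPlainObject(value) {
    return value !== null && typeof value === "object" && !Array.isArray(value);
}

// Deep merge in which values from `child` override those from `parent`.
function mergeOverride(parent, child) {
    var result = {};
    Object.keys(parent).forEach(function (key) { result[key] = parent[key]; });
    Object.keys(child).forEach(function (key) {
        if (isPlainObject(result[key]) && isPlainObject(child[key])) {
            result[key] = mergeOverride(result[key], child[key]);
        } else {
            result[key] = child[key];
        }
    });
    return result;
}

// Resolve one section, following "_inherits" up the chain.
function resolveSection(config, name, seen) {
    seen = seen || [];
    if (!Object.prototype.hasOwnProperty.call(config, name)) {
        throw new Error("Section " + name + " was not found.");
    }
    if (seen.indexOf(name) !== -1) {
        throw new Error("Circular inheritance at section " + name + ".");
    }
    var section = config[name];
    if (section._inherits === undefined) return mergeOverride({}, section);
    var parent = resolveSection(config, section._inherits, seen.concat(name));
    var merged = mergeOverride(parent, section);
    delete merged._inherits;
    return merged;
}

var config = {
    production: { login: { route: "/login", defaults: { controller: "index", action: "index" } } },
    development: { _inherits: "production", login: { route: "/dev-login" } }
};
console.log(resolveSection(config, "development").login.route); // "/dev-login"
```

In PHP 5.3 and later, array_replace_recursive() gives similar override behaviour and may be the simplest way to get the same result inside the class.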

14 Apr 2010

Super Fast Fulltext Search with Sphinx and Zend Framework

As mentioned in earlier posts about installing and configuring Sphinx, I love it! All of it is easy and fast, and here is my approach to search as seen on callisto.fm :

The Sphinx PHP API

Sphinx ships with a decent PHP API that allows you to query searchd and returns an array of matched documents. Beware that it does not conform to Zend best practices and coding standards, but I have seen a mention of adding a proper search adapter to the Zend_Search core. In fact, I may just do it myself.

Anyway, the file sphinxapi.php is found in the api folder of the sphinx distribution you downloaded. I simply added this to my library which is a part of the include path and used require instead of depending on autoloading since it does not conform to the Zend Framework naming conventions either.

The Search Route

You can see in the route below that we are getting the search term from a parameter in the URL. This allows specific searches to be linked to from bookmarks, tweets, etc., and more importantly, we can track those search URLs with Google Analytics, and with Woopra see what people are searching for in real time.

The route for this is stupidly simple:

$Router->addRoute('search',new Zend_Controller_Router_Route('search/:term/*',array('controller'=>'search','action'=>'index')));

With that route all search requests are routed through the search index action, and the parameter "term" is made available. Notice the /* at the end of the route. That allows us to append other parameters like sorting and so forth as you will see in the controller below.
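
To make the /* part concrete: the router reads whatever follows the term as alternating key/value segments. Here is a simplified imitation of that parsing in javascript. The function name is hypothetical and this is not Zend's actual implementation; in particular, a dangling key with no value is simply ignored here.

```javascript
// Parse a path against the pattern "search/:term/*": the segment after
// "search" is the term, and the remaining segments are key/value pairs.
function parseSearchPath(path) {
    var segments = path.split("/").filter(function (s) { return s !== ""; });
    if (segments[0] !== "search" || segments.length < 2) return null;
    var params = { term: decodeURIComponent(segments[1]) };
    for (var i = 2; i + 1 < segments.length; i += 2) {
        params[decodeURIComponent(segments[i])] = decodeURIComponent(segments[i + 1]);
    }
    return params;
}

console.log(parseSearchPath("/search/jazz/sort-published/newest/page/2"));
```

So a URL like /search/jazz/sort-published/newest/page/2 yields the term "jazz" plus the sort-published and page parameters that the controller reads.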

The Search Controller

<?php

class SearchController extends Zend_Controller_Action
{
    public function indexAction()
    {
        $params=$this->getRequest()->getParams();
        
        if (!empty($params['term']))
        {
            // Prepopulate the input field.
            $this->view->term=$params['term'];
            
            // Setup the Sphinx client.
            require_once 'sphinxapi.php';
            $Sphinx = new SphinxClient();
            $Sphinx->SetServer('127.0.0.1',9312);
            $Sphinx->SetConnectTimeout(1);
            $Sphinx->SetArrayResult(true);
            $Sphinx->SetMatchMode(SPH_MATCH_BOOLEAN);
            
            // Set sort mode and field.
            $sort_mode = SPH_SORT_RELEVANCE;
            $sort_by = '';
    
            if (!empty($params['sort-published']))
            {
                $sort_mode=$params['sort-published']=='newest'?SPH_SORT_ATTR_DESC:SPH_SORT_ATTR_ASC;
                $sort_by='post_timestamp';
            }
            
            $Sphinx->SetSortMode($sort_mode, $sort_by);
            
            //  Pagination size and offset.
            $offset = 0;
            $limit = 10;
            if (!empty($params['page'])&&is_numeric($params['page'])) $offset = ($params['page']-1)*$limit;
            $Sphinx->SetLimits($offset,$limit);
            
            // Set the index. Perform the search.
            $index = strtolower(APPLICATION_ENV);
            $search = $Sphinx->Query($params['term'],$index);
            
            if ($search!==false)
            {
                if (!empty($search['matches']))
                {
                    $post_ids=array();

                    $this->view->Paginator = new Zend_Paginator(new Zend_Paginator_Adapter_Null($search['total']));
                    $this->view->Paginator->setItemCountPerPage($limit);
                    $this->view->Paginator->setCurrentPageNumber(!empty($params['page'])?$params['page']:1);

                    foreach ($search['matches'] as $match) $post_ids[]=$match['id'];

                    // $PostsModel is assumed to be your posts model, instantiated earlier (e.g. in init()).
                    $this->view->Posts=$PostsModel->getByArray($post_ids);
                }
            }
        }
    }
}

As you can see, the controller is rather slim and simple, containing only the index action. Basically we see if there is a term available and, if so, set up the client, determine how to sort the results, specify pagination rules, get the results, and finally get database records based on an array of IDs returned from search. For a longer, more detailed explanation, keep reading:

Setup the Sphinx Client

Most of this should be self-explanatory, but I'd like to note a few things. First, searchd defaults to listening on port 9312, but you can change that in sphinx.conf. Also, for this example I set the IP to localhost, which might be fine for smaller sites with one server running the entire LAMP stack. For callisto.fm however, since our PHP servers are load balanced and autoscaled, we actually use a central search server. We do this because it would be too easy for indexes on servers behind the load balancer to be out of sync, and we don't want a user's subsequent requests, like in the case of paginating, to return conflicting search results.

Lastly, we've set the match mode to boolean (SPH_MATCH_BOOLEAN) in order to support searches containing & or -, but we plan to support allowing the user to determine the match mode. Please see the Sphinx documentation for more on match modes.

Set sort mode and field

Sphinx will default to sorting by relevance (SPH_SORT_RELEVANCE), but if you recall, in the configuration we defined "post_timestamp" as a timestamp (integer) attribute so that we could sort the results by the date posts were created. Here we set relevance as the default, then check whether the parameter "sort-published" was set. If so, we set the order depending on its value ("newest" sorts descending, anything else sorts ascending), and we set post_timestamp as the attribute on which the sorting is done.
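
If you'd rather keep that decision out of the action, it can be pulled into a small helper. The sketch below is one way to do it, not the code callisto.fm runs: resolveSort is a hypothetical name, and the three define() calls exist only so the example runs without sphinxapi.php (the values are meant to mirror the ones in that file). Unlike the controller above, it falls back to relevance for any value other than "newest" or "oldest".

```php
<?php
// Stand-ins so the sketch runs on its own. If sphinxapi.php has already been
// loaded, the guards leave its definitions alone.
defined('SPH_SORT_RELEVANCE') || define('SPH_SORT_RELEVANCE', 0);
defined('SPH_SORT_ATTR_DESC') || define('SPH_SORT_ATTR_DESC', 1);
defined('SPH_SORT_ATTR_ASC')  || define('SPH_SORT_ATTR_ASC', 2);

// Returns array($sort_mode, $sort_by) for the given request parameters.
function resolveSort(array $params): array
{
    $sorts = array(
        'newest' => SPH_SORT_ATTR_DESC,
        'oldest' => SPH_SORT_ATTR_ASC,
    );

    $requested = (isset($params['sort-published']) && is_string($params['sort-published']))
        ? $params['sort-published']
        : '';

    if (isset($sorts[$requested])) {
        // Sort on the timestamp attribute defined in the Sphinx source.
        return array($sorts[$requested], 'post_timestamp');
    }

    // Missing or unrecognised values keep Sphinx's default ordering.
    return array(SPH_SORT_RELEVANCE, '');
}

list($sort_mode, $sort_by) = resolveSort(array('sort-published' => 'newest'));
```

The two returned values can then be handed straight to SetSortMode($sort_mode, $sort_by).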

Pagination Size and Offset

Pagination with Sphinx works exactly like it does with MySQL. You pass it the offset, which is the number of records to skip, and the limit, which is the number of records to return.
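
The arithmetic is simple enough to show on its own. Here is a small sketch with a hypothetical helper, pageToOffset, which treats anything that isn't a sensible page number as page 1:

```php
<?php
// Converts a 1-based page number into the number of records to skip.
function pageToOffset($page, int $limit = 10): int
{
    // Missing, non-numeric, or sub-1 page values all fall back to page 1.
    $page = is_numeric($page) ? max(1, (int) $page) : 1;

    return ($page - 1) * $limit;
}

echo pageToOffset(1), "\n"; // 0
echo pageToOffset(3), "\n"; // 20
```

With 10 results per page, page 3 skips the first 20 matches, which is the same thing LIMIT 20, 10 would do in MySQL.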

Set the Index. Perform the Search.

In the configuration for Sphinx we defined two indexes: "production" and "development". I named them that specifically so that I could refer to them by the environment the application is running in. So here we tell Sphinx which index to use for this search, and we call the Query method to actually perform the search and return a result. The result will only be false if an error occurred, and for the sake of keeping this simple I left out error handling, but it would be a good idea to trap and record those. If there were no errors, Query will return an array, even if no matches were found.
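
As a rough idea of what trapping and recording those errors could look like, here is a minimal sketch. querySphinx is a hypothetical wrapper of my own; it assumes $client is the SphinxClient from sphinxapi.php and relies on its GetLastError() and GetLastWarning() methods. Logging with error_log() is just one option.

```php
<?php
/**
 * Runs a query and turns a failed search into an exception.
 * $client is expected to be a SphinxClient from sphinxapi.php.
 */
function querySphinx($client, string $term, string $index): array
{
    $result = $client->Query($term, $index);

    if ($result === false) {
        // GetLastError() describes why the query failed.
        $message = $client->GetLastError();
        error_log('Sphinx query failed: ' . $message);
        throw new RuntimeException('Search is unavailable: ' . $message);
    }

    // Warnings do not fail the query but are still worth recording.
    $warning = $client->GetLastWarning();
    if ($warning !== '') {
        error_log('Sphinx warning: ' . $warning);
    }

    return $result;
}
```

In the action you would wrap the call in a try/catch and show the user a friendly message instead of an empty result page.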

If matches were actually returned, we set up Zend_Paginator and then loop through the matches to create an array of post IDs. With this array we can have our model return a rowset based on a SQL statement with a clause like "WHERE id IN (...)".

From there you have a rowset just like anywhere else in your site, and it can be displayed in the view just like you would normally.
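
One thing to watch for: MySQL does not return rows in the order the IDs are listed, so the ordering Sphinx worked out can be lost. The sketch below shows one way to keep it, using MySQL's FIELD() function. buildPostsQuery is a hypothetical helper and not the actual getByArray implementation; the "posts" table and "id" column are assumed from the schema used in this series.

```php
<?php
// Builds a query that fetches the matched posts in the order Sphinx returned them.
function buildPostsQuery(array $post_ids): string
{
    if (empty($post_ids)) {
        throw new InvalidArgumentException('At least one post ID is required.');
    }

    // Cast every ID to an integer so the list is safe to embed in SQL.
    $ids = implode(',', array_map('intval', $post_ids));

    // FIELD() gives each row its position in the list, preserving Sphinx's order.
    return "SELECT * FROM posts WHERE id IN ($ids) ORDER BY FIELD(id, $ids)";
}

echo buildPostsQuery(array(42, 7, 19)), "\n";
// SELECT * FROM posts WHERE id IN (42,7,19) ORDER BY FIELD(id, 42,7,19)
```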

14 Apr 2010

Configuring Sphinx Search

The really cool thing about Sphinx search, besides being easy to install, is that you don't need to write any code to build an index. It will connect to MySQL directly, perform the queries you specify, and index those results automatically. Let's cd over to /usr/local/etc and edit our Sphinx config file:

:~/sphinx-0.9.9$ cd /usr/local/etc
:/usr/local/etc$ sudo cp sphinx.conf.dist sphinx.conf
:/usr/local/etc$ sudo vim sphinx.conf

Defining the Data Sources

Take a look at the following definition of a data source that I've named "production":

source production
{
    #MySQL connection information   
   
    sql_host        = localhost
    sql_user        = SphinxUser
    sql_pass        = Sph!nxP@ss
    sql_db          = myDb_production
    sql_port        = 3306
   
    # Main document fetch query ($start and $end are filled in from the range query below)
    sql_query       = SELECT posts.id, posts.body, posts.keywords, unix_timestamp(posts.created) AS post_timestamp, users.id AS user_id, users.name FROM posts INNER JOIN users on posts.user=users.id WHERE posts.id>=$start AND posts.id<=$end

    # Range Query
    sql_query_range = SELECT MIN(id),MAX(id) FROM posts
    sql_range_step  = 1000

    #Index Attributes
    sql_attr_uint = user_id
    sql_attr_timestamp = post_timestamp
}

In the "production" source we've defined the following items:

  1. MySQL Connection Information
    Tells Sphinx how to connect to the database
  2. Main document fetch query
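    This pulls in all the fields that you wish to be indexed. Note that the first column must be a unique integer document ID, that only text columns are indexed for full-text search, and that other types need to be converted to integers and declared as attributes, which are used for sorting and filtering only.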
  3. Range Query
    The sql_query_range allows you to index records by sets defined in size by sql_range_step. This is helpful when you have tens of thousands of records or more and pulling all records at once would place the server under too high of a load.
  4. Index Attributes
    Here we use "user_id" as an unsigned integer so that we can later filter search matches by a user's ID, and we use "post_timestamp" as a timestamp attribute so we can sort search results based on publication date.
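
To give an idea of what that filtering looks like from PHP later on, here is a minimal sketch. filterByUser is a hypothetical helper; it assumes $client is the SphinxClient from sphinxapi.php and uses its SetFilter() method, which takes an attribute name and an array of integer values.

```php
<?php
/**
 * Restricts a search to posts written by one user.
 * $client is expected to be a SphinxClient from sphinxapi.php.
 */
function filterByUser($client, int $user_id): void
{
    // "user_id" must match the sql_attr_uint name in the source definition.
    $client->SetFilter('user_id', array($user_id));
}
```

Call it before Query() and only matches whose user_id attribute equals the given ID come back.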

Note that I've created a user specifically for Sphinx to use that has select privileges only on the entire "myDb" database. Also note that this source definition does not at all represent the entire set of options available—please see the Sphinx Documentation for more information on available options.

At this point we have our production server source, but what about a development server source? Sphinx configuration allows us to easily inherit options defined in other sources. Here is an example of defining a source with inherited options while overriding others.

source development : production
{
    sql_db = myDb_development
}

It's just that simple! Now with only 3 more lines, we've added an additional source that inherits all options from the "production" source, but overrides the database option. Now on to indexes!

Defining the Indexes

A source tells an index where to find its data, whereas an index tells Sphinx where to store the index data as well as how to handle things like morphology, stemming, and character sets. Take a look at the following definition for our production index:

index production
{
    # Data Source
    source            = production

    # Index Path
    path              = /var/data/indexes/production

    # Morphology
    morphology        = stem_en
    min_stemming_len  = 3

    # Word Length
    min_word_len      = 3

    # Character Set
    charset_type      = utf-8

    # Strip HTML tags
    html_strip        = 1
}

In the "production" index we've defined the following:

  1. Data Source
    Tells which source definition to use for this index
  2. Index Path
    Tells the indexer where to write the files that will contain all the index data
  3. Morphology
    Replaces different forms of the same word with the base form. For example, a search for "dogs" will match "dog" as well.
  4. Word Length
    This tells the indexer the minimum number of characters a word must have to be indexed. I chose 3 because in English anything shorter than three characters would be a word like "a", "an", or "in", and thus not a word that would meaningfully affect a search.
  5. Character Set
    Here we specified UTF-8, which is also the default character set in our database and PHP code.
  6. Strip HTML
    In this case we aren't searching for tags or elements, but rather just plain English, so to keep our indexes smaller and faster, I've chosen to strip all HTML tags from the data before it is indexed.

Again, as with the source, this index definition does not represent all of the available options. Please refer to the index configuration section of the Sphinx documentation for more information.

And of course we need a "development" index to go along with the development source. Just like the development source inherited from the production source, so too can the index inherit and override specific options like this:

index development:production
{
    source = development
    path = /var/data/indexes/development
}

Of course the sphinx.conf configuration file also contains settings for the indexer and the search daemon, but I found that the default settings worked fine. You however may want to double check on the available indexer options and searchd options.

Now to index our data. It's as simple as this (be sure that the paths listed in the above index definitions exist):

sudo indexer --all

And if everything ran fine, you should see something like this:

Sphinx 0.9.9-release (r2117)
Copyright (c) 2001-2009, Andrew Aksyonoff

using config file '/usr/local/etc/sphinx.conf'...
indexing index 'development'...
collected 13 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 13 docs, 986 bytes
total 0.021 sec, 46040 bytes/sec, 607.02 docs/sec
indexing index 'production'...
collected 10570 docs, 1.0 MB
sorted 0.1 Mhits, 100.0% done
total 10570 docs, 952546 bytes
total 0.576 sec, 1651781 bytes/sec, 18329.11 docs/sec
total 4 reads, 0.002 sec, 206.3 kb/call avg, 0.5 msec/call avg
total 14 writes, 0.005 sec, 143.6 kb/call avg, 0.3 msec/call avg

And to start the search daemon:

/usr/local/bin/searchd

Keep in mind, though, that you'll need to rebuild your indexes regularly to keep them from being out of sync with database records. You wouldn't want new records to be left out, or IDs for deleted records returned from search. Indexing can be done at any time, but to ensure that the search daemon is not offline while the indexes rebuild, use the following command, which builds new indexes alongside the ones currently being served and then signals the search daemon to switch over to them:

sudo indexer --all --rotate

So that's it for configuration. Now on to using PHP to perform a search with Sphinx.

20 Mar 2010

Installing Sphinx on Linux (Ubuntu 9.10 Karmic)

After using this installation process and the following configuration process and PHP code, I found Sphinx to be incredibly easy to set up and use, and best of all, incredibly fast! Sphinx is made up of three pieces of software: the indexer, which pulls data from a defined source and indexes all text-based fields; the search command line tool for testing and debugging; and searchd, which is the search daemon running in the background listening for search requests.

Installation:

At the time of this writing the latest version is Sphinx 0.9.9, which can be found here: http://www.sphinxsearch.com/downloads/sphinx-0.9.9.tar.gz

First download, extract and change directories:

:~$ wget http://www.sphinxsearch.com/downloads/sphinx-0.9.9.tar.gz
:~$ tar zxf sphinx-0.9.9.tar.gz
:~$ cd sphinx-0.9.9

Now we start the installation process by running ./configure, but the documentation says to run it with --help to list all of the options. I will only list the options I'm using, but you should use the options most appropriate to your particular situation.

First be sure you have the mysql-devel package installed so that the Sphinx installation can use the C header files. Keep in mind that mysql-devel has many names, and the one to use depends on your particular flavor of Linux. On Ubuntu, it looks like this:

:~/sphinx-0.9.9$ sudo apt-get install libmysql++-dev

Now let's try to run configure:

:~/sphinx-0.9.9$ ./configure --with-mysql

If it runs successfully, you'll get a message like this:

configuration done
------------------

You can now run 'make' to build Sphinx binaries,
and then run 'make install' to install them.

Otherwise you'll get an error, most likely one saying that a particular library was not found. If so, try apt-get to install that library.

Now we run the make and make install. Be sure you have g++ installed.

:~/sphinx-0.9.9$ make
:~/sphinx-0.9.9$ sudo make install

The binaries should now be in /usr/local/bin and you can verify by entering "search" and getting the following response:

:~/sphinx-0.9.9$ search
Sphinx 0.9.9-release (r2117)
Copyright (c) 2001-2009, Andrew Aksyonoff

Usage: search [OPTIONS] <word1 [word2 [word3 [...]]]>

Options are:
-c, --config <file>     use given config file instead of defaults
...
...

That's it! Sphinx is installed! Now we need to configure Sphinx to connect to MySQL and index records based on a SQL statement we define.

22 Dec 2009

MySQL ERROR 1018 (HY000): Can't read dir

Yep, that crazy message removed a few more hairs from my already balding head, and this is how it happened, and how I fixed it:

I wanted to upgrade Zend Server CE, so I backed up the mysql data folder like so:

$ sudo cp -rf  /usr/local/zend/mysql/data ~/mysqldata

I then uninstalled Zend Server:

$ sudo /usr/local/zend/bin/uninstall.sh

Copied the mysql data back, removing the old error logs first:

$ sudo rm -r ~/mysqldata/*.err
$ sudo cp -r ~/mysqldata/* /usr/local/zend/mysql/data/

Started Zend Server:

$ sudo /usr/local/zend/bin/zendctl.sh start

To ensure that my DBs were all there I opened phpmyadmin through the Zend Server admin tool,
selected "mydb", and lo and behold: NO TABLES!

WHAT?!
<insert string of random expletives here>
OH MY GOD MY SCHEMA'S GONE!
<and here>
IT'LL TAKE ME HOURS TO REBUILD IT!
<and here>

OK, calm down, let's see what's really going on here:

$ sudo /usr/local/zend/mysql/mysql
mysql> use mydb;
...
Database changed (ok the DB is there, but why aren't the tables showing up?)
mysql> show tables;
ERROR 1018 (HY000): Can't read dir ...

<insert hair pulling session here>

A quick skim through the Googles and I wonder about file permissions:

$ sudo ls -la /usr/local/zend/mysql/data/

Sure enough! Those folders that contain my DBs are owned by root! AHA!

$ sudo chown -Rf zend:wheel /usr/local/zend/mysql/data/

Restart Zend:

$ sudo /usr/local/zend/bin/zendctl.sh restart

Back to phpmyadmin, and voila!
My tables are back!
Oh my glorious schema, how much I feared I had lost thine precious optimization!