In addition to all the
built-in SQL functions, such as lower( ) and
upper( ), you can extend SQLite to include
functions of your own written in PHP. These are known as
user-defined functions, or
UDFs for short. With a UDF, you embed logic
into SQLite and avoid doing it yourself in PHP. This way, you take
advantage of all the features inherent in a database, such as sorting
and finding distinct entries.
There are two types of UDFs: standard and aggregate.
Standard UDFs
are one-to-one: when given a single row of data, they return a single
result. Functions that change case, calculate cryptographic hashes,
and compute the sales tax on an item in a shopping cart are all
standard functions. In contrast, aggregate functions are many-to-one: when
using an aggregate function, SQLite passes it multiple rows and
receives only a single value.
Although it is not a UDF, the most popular aggregate function is
count( ), which returns the number of rows passed
to it. Besides count( ), most aggregate functions
are related to statistics: finding the average, standard deviation,
or the maximum or minimum value of a set of data points.
4.6.1 Standard Functions
UDFs are good for chopping up strings so
you can perform nonstandard collations and groupings. For example,
you want to sort through a list of URLs, maybe from a referrer log
file, and create a list of unique hostnames sorted alphabetically.
So, http://www.example.com/directory/index.html
and http://www.example.com/page.html would both
map to one entry: http://www.example.com.
To do this in PHP, you need to retrieve all the URLs, process them
inside your script, and then sort them. Plus, somewhere in all that,
you need to do the deduping. However, if it weren't
for that pesky URL-conversion process, this could all be done in SQL
using the DISTINCT and ORDER BY
keywords.
With a UDF like the one shown in Example 4-3, you
foist all that hard work back onto SQLite where it belongs.
Example 4-3. Retrieving unique hostnames using an SQLite UDF
// CREATE table and INSERT URLs
$db = sqlite_open('/www/support/log.db');
$sql = 'CREATE TABLE access_log(url);';
$urls = array('http://www.example.com/directory/index.html',
'http://www.example.com/page.html');
foreach ($urls as $url) {
$sql .= "INSERT INTO access_log VALUES('$url');";
}
sqlite_query($db, $sql);
// UDF written in PHP
function url2host($url) {
$parts = parse_url($url);
return "$parts[scheme]://$parts[host]";
}
// Tell SQLite to associate PHP function url2host( ) with the
// SQL function host( ). Say that host( ) will only take 1 argument.
sqlite_create_function($db, 'host', 'url2host', 1);
// Do the query
$r = sqlite_query($db, 'SELECT DISTINCT host(lower(url)) AS clean_host
FROM access_log ORDER BY clean_host;');
// Loop through results
while ($row = sqlite_fetch_array($r)) {
print "$row[clean_host]\n";
}
http://www.example.com
To use a UDF, you first write a regular function in PHP. The
function's arguments are what you want to pass in
during the SELECT, and the function should
return a single value. The url2host(
) function takes a URL; calls the built-in PHP function
parse_url( ) to
break the URL into its component parts; and returns a string
containing the scheme, ://, and the host. So,
http://www.example.com/directory/index.html gets
broken apart into many pieces. http is stored into
$parts['scheme'] and
www.example.com goes into
$parts['host']. This
creates a return value of http://www.example.com.
The next step is to register url2host( ) with
SQLite using sqlite_create_function(
). This function takes four parameters: the database
handle, the name you want the function to be called inside SQLite,
the name of your function written in PHP, and the number of arguments
your function expects. The last parameter is optional, but if you
know for certain that your function accepts only a specific number of
parameters, providing this information helps SQLite optimize things
behind the scenes. In this example, the SQL function is
host( ), while the PHP function is
url2host( ). These names can be the same;
they're different here to make the distinction
between them clear.
Now you can use host( ) inside any SQL calls using
that database connection. The SQL in Example 4-3
SELECTs host(lower(url)) AS
clean_host. This takes the URL stored in the
url column, converts it to lowercase, and calls
the UDF host( ).
The function is not permanently registered with the database, and
goes away when you close the database. If you want to use it when you
reopen the database, you must reregister it. Also, the function is
registered only for that database; if you open up a new database
using sqlite_connect( ), you need to call
sqlite_create_function( ) again.
The returned string is then named AS clean_host;
this lets you refer to the results later on in the SQL query and also
access the value in PHP using that name. Since
you're still in SQLite, you can take advantage of
this to sort the list using ORDER BY host. This
sorts the results in alphabetical order, starting at
a.
Now that's cool, but it's not
that cool. What is cool is
SQLite's ability to call UDFs in the ORDER
BY clause. If you use the default alphabetical sort,
http://php.example.org and
http://www.example.org won't be
near each other, because "p" and
"w" aren't next to
each other in the alphabet. Yet both hosts are located under the
example.org domain, so it makes sense that they
should be listed together. Not surprisingly, another UDF saves the
day.
function reverse_host($url) {
list ($scheme, $host) = explode('://', $url);
return join('.',array_reverse(explode('.',$host)));
}
sqlite_create_function($db, 'reverse', 'reverse_host', 1);
The reverse_host( ) function takes a URL and chops
it into two bits, the scheme and host, by explode(
)ing on ://. You can do this because the
previous UDF, host( ), has specifically created
strings in this manner. Next, $host is passed
through a series of three functions that splits it up into its
component parts, reverses those parts, and then glues them back
together. This flips around the pieces of the host separated by
periods, but doesn't actually reverse the text. So,
www.example.org becomes
org.example.www and not
gro.elpmaxe.www or
www.elpmaxe.gro.
This reversed hostname is perfect for sorting. When you alphabetize
org.example.www, it nicely sits next to all its
brethren in the .org top-level domain, then sorts
by the other hosts inside example.org, and finally
orders the remaining subdomains. And that's exactly
what you want.
You then register reverse_host( ) in the exact
same way you registered url2string( ), using
sqlite_create_function( ).
Once that's done, you can call reverse(
) inside your SQL query:
$r = sqlite_query($db, 'SELECT DISTINCT host(lower(url)) AS clean_host
FROM access_log ORDER BY reverse(clean_host);');
Given the following list of URLs as input:
http://www.example.com
http://php.example.org
http://www.example.org
you get the following as output:
http://www.example.com
http://php.example.org
http://www.example.org
The URL containing php.example.com has filtered
down in the list below www.example.com, even
though php comes before www in
the alphabet.
In contrast, Example 4-4 shows what you need to do
to implement this in PHP without UDFs.
Example 4-4. Sorting unique hostnames without using SQLite UDFs
function url2host($url) {
$parts = parse_url($url);
return "$parts[scheme]://$parts[host]";
}
function reverse_host($url) {
list ($scheme, $host) = explode('://', $url);
return join('.',array_reverse(explode('.',$host)));
}
function host_sort($a, $b) {
$count_a = $GLOBALS['hosts'][$a];
$count_b = $GLOBALS['hosts'][$b];
if ($count_a < $count_b) { return 1; }
if ($count_a > $count_b) { return -1; }
return strcasecmp(reverse_host($a), reverse_host($b));
}
$hosts = array( );
$r = sqlite_unbuffered_query($db, 'SELECT url FROM access_log');
while ($url = sqlite_fetch_single($r)) {
$host = url2host($url);
$hosts[$host]++ ;
}
uksort($hosts, 'host_sort');
This process breaks down into many steps:
Make a database query for urls.
Retrieve url into $url using
sqlite_fetch_single( ).
Convert $url into a host and store it in
$host.
Place $url as a new element in the
$hosts array and increment that element by
1. This tracks the number of times each URL has
appeared.
Perform a user-defined key sort on the $hosts
array.
The sqlite_fetch_single( ) function returns the
first (and in this case only) column from the result as a string.
This allows you to skip the step of saving the result as an array and
then extracting the element, either by using list
or as the 0th index.
Doing $hosts[$host]++ is a old trick that allows
you to easily count the number of times each key appears in a list.
Since uksort( ) only passes array keys to the
sorting function, host_host( ) is not very
elegant, because it requires using a global variable to determine the
number of hits for each element.
Overall, compared to a UDF, this method requires more memory,
execution time, and lines of code, because you're
replicating database functionality inside PHP.
4.6.2 User-Defined Aggregate Functions
As discussed earlier, most aggregate
functions are statistical functions, such as AVG(
) or STDEV( ). People usually use
aggregate functions to return a single row from their query, but
that's not a requirement. You can use them to link
together a set of related rows, to compact your query and return one
row per group.
This extension to the earlier referrer log sorting example shows how
to use an aggregate function to provide the total number of hits per
hostname, in addition to everything in the previous section:
SELECT DISTINCT host(lower(url)) AS clean_host, COUNT(*) AS hits
FROM access_log GROUP BY clean_host ORDER BY hits DESC, reverse(clean_host)
The COUNT(*) function sums the total number of
rows per host. However, this won't work without
adding the GROUP BY host clause.
GROUPing rows allows COUNT(*)
to know which sets of entries belong together. Whenever you have an
aggregate function?such as COUNT( ),
SUM( ), or any function that takes a set of rows
as input and returns only a single value as its output?use
GROUP BY when you want your query to return
multiple rows. (If you're just doing a basic
SELECT COUNT(*) FROM host to find the total number
of rows in the table, there's no need for any
GROUPing.)
COUNT(*) is aliased to hits,
which allows you to refer to it in the ORDER BY
clause. Then, to sort the results first by total hits, from most to
least, and then alphabetically within each total, use ORDER
BY hits DESC, reverse(host). By putting
hits first, you prioritize it over
reverse(clean_host) and the
DESC keyword flips the sorting order to descending
(the default is ascending).
Using that query, this set of sites:
http://www.example.org
http://www.example.org
http://www.example.com
http://php.example.org
and this PHP code:
while ($row = sqlite_fetch_array($r)) {
print "$row[hits]: $row[clean_host]\n";
}
gives:
2: http://www.example.org
1: http://www.example.com
1: http://php.example.org
Furthermore, to restrict results to sites with more hits than a
specified amount, use a HAVING clause:
SELECT DISTINCT host(lower(url)) AS clean_host, COUNT(*) AS hits
FROM access_log
GROUP BY clean_host
HAVING hits > 1
ORDER BY hits DESC, reverse(clean_host)
You cannot use WHERE here, because
WHERE can only operate on data directly from a
table. Here the restriction hits > 1 compares
against the result of a GROUP BY, so you need to
employ HAVING instead.
You can define your own aggregate functions for SQLite in PHP. Unlike
standard UDFs, you actually need to define two functions: one
that's called for each row and one
that's called after all the rows have been passed
in.
The code in Example 4-5 shows how to create a basic
SQLite user-defined aggregate function that calculates the average of
a set of numbers.
Example 4-5. Averaging numbers using an SQLite aggregate function
// CREATE table and INSERT numbers
$db = sqlite_open('/www/support/data.db');
$sql = 'CREATE TABLE numbers(number);';
$numbers = array(1, 2, 3, 4, 5);
foreach ($numbers as $n) {
$sql .= "INSERT INTO numbers VALUES($n);";
}
sqlite_query($db, $sql);
// average_step( ) is called on each row.
function average_step(&$existing_data, $new_data) {
$existing_data['total'] += $new_data;
$existing_data['count']++;
}
// average_final( ) computes the average and returns it.
function average_final(&$existing_data) {
return $existing_data['total'] / $existing_data['count'];
}
sqlite_create_aggregate($db, 'average', 'average_step', 'average_final');
$r = sqlite_query($db, 'SELECT average(number) FROM numbers');
$average = sqlite_fetch_single($r);
print $average;
3
First, you define the two aggregate functions in PHP, just as you do
for regular UDFs. However, the first parameter for both functions is
a variable passed by reference that is used to keep track of the
UDF's state. In this example, you need to track both
the running sum of the numbers and how many rows have contributed to
this total. That's done in average_step(
).
In average_final( ), the final sum is divided by
the number of elements to find the average. This is the value
that's returned by the function and passed back to
SQLite (and, eventually, to you).
To formally create an aggregate UDF, use
sqlite_create_aggregate(
). It works like sqlite_create_function(
), but you pass both PHP function names instead of just
one.
4.6.3 Binary Data
SQLite is not binary safe by default.
Requiring PHP to automatically protect against problems caused by
binary data causes a significant reduction in speed, so you must
manually encode and decode data when it might be anything other than
plain text. If your UDFs only operate on text, this
isn't a problem.
Inside a UDF, use sqlite_udf_binary_decode(
) to convert data stored in SQLite into usable strings in
PHP:
function udf_function_encode($encoded_data) {
$data = sqlite_udf_binary_decode($encoded_data);
// rest of the function...
}
When you're finished, if the return value might also
be binary unsafe, re-encode it using
sqlite_udf_binary_encode(
):
function udf_function_decode($encoded_data) {
// rest of the function...
return sqlite_udf_binary_encode($return_value);
}