R SQLDF 详细教程 1
(2013-12-19 22:01:31)
转 载 ▼
To write it, it took three months; to conceive it – three minutes; to collect the data in it – all my
life. F. Scott Fitzgerald
Latest News There is now an sqldf discussion group to discuss sqldf (and other of my
packages).
sqldf is an R package for runing SQL statements on R data frames, optimized for
convenience. The user simply specifies an SQL statement in R using data frame names in
place of table names and a database with appropriate table layouts/schema is automatically
created, the data frames are automatically loaded into the database, the specified SQL
statement is performed, the result is read back into R and the database is deleted all
automatically behind the scenes making the database's existence transparent to the user who
only specifies the SQL statement. Surprisingly this can at
times be even faster than the corresponding pure R calculation (although the purpose of the
project is convenience and not speed). This link suggests that for aggregations over highly
granular columns that sqldf is faster than another alternative tried. sqldf is free software
published under the GNU General Public License that can be downloaded from CRAN.
sqldf supports (1) the SQLite backend database (by default), (2) the H2 java database, (3)
the PostgreSQL database and (4) sqldf 0.4-0 onwards also supports MySQL. SQLite, H2,
MySQL and PostgreSQL are free software. SQLite and H2 are embedded serverless zero
administration databases that are included right in the R driver packages, RSQLite and RH2,
so that there is no separate installation for either one. A number of high profile projects use
SQLite. (Also see this lecture.) H2 is a java database which contains a large collection of SQL
functions and supports Date and other data types. PostgreSQL is a client/server database
and unlike SQLite and H2 must be separately installed but it has a particularly powerful
version of SQL, e.g. its window functions, so the extra installation work can be worth it. sqldf
supports the RPostgreSQL driver in R. Like PostgreSQL, MySQL is a client server database
that must be installed independently so its not as easy to install as SQLite or H2 but its very
popular and is widely used as the back end for web sites.
The information below mostly concerns the default SQLite database. The use of H2 with sqldf
is discussed in FAQ #10 which discusses differences between using sqldf with SQLite and H2
and also shows how to modify the code in the Examples section to use sqldf/H2 rather than
sqldf/SQLite. There is some information on using PostgreSQL with sqldf in FAQ #12 and an
example in Example 17. Lag . The unit tests provide examples that can work with all five data
base drivers (covering four databases) supported by sqldf. They are run by loading whichever
database is to be tested (SQLite is the default) and running: demo("sqldf-unitTests")
Overview
Citing sqldf
For Those New to R
News
Troubleshooting
FAQ
1. How does sqldf handle classes and factors?
2. Why does sqldf seem to mangle certain variable names?
3. Why does sqldf("select var(x) from DF") not work?
4. How does sqldf work with "Date" class variables?
5. I get a message about the tcltk package being missing.
6. Why are there problems when we use table names or column names that are the
same except for case?
7. Why are there messages about MySQL?
8. Why am I having problems with update?
9. How do I examine the layout that SQLite uses for a table? which tables are in the
database? which databases are attached?
10. What are some of the differences between using SQLite and H2 with sqldf?
11. Why am I having difficulty reading a data file using SQLite and sqldf?
12. How does one use sqldf with PostgreSQL?
13. How does one deal with quoted fields in read.csv.sql ?
14. How does one read files where numeric NAs are represented as missing empty
fields?
15. Why do certain calculations come out as integer rather than double?
16. How can one read a file off the net or a csv file in a zip file?
Examples
Example 1. Ordering and Limiting
Example 2. Averaging and Grouping
Example 3. Nested Select
Example 4. Join
Example 5. Insert Variables
Example 6. File Input
Example 7. Nested Select
Example 8. Specifying File Format
Example 9. Working with Databases
Example 10. Persistent Connections
Example 11. Between and Alternatives
Example 12. Combine two files in permanent database
Example 13. read.csv.sql and read.csv2.sql
Example 14. Use of spatialite library functions
Example 15. Use of RSQLite.extfuns library functions
Example 16. Moving Average
Example 17. Lag
Example 17. MySQL Schema Information
Overview
sqldf is an R package for running SQL statements on R data frames, optimized for
convenience. sqldf works with the SQLite, H2, PostgreSQLor MySQL databases. SQLite
has the least prerequisites to install. H2 is just as easy if you have Java installed and also
supports Date class and a few additional functions. PostgreSQL notably supports Windowing
functions providing the SQL analogue of the R ave function. MySQL is a particularly popular
database that drives many web sites.
More information can be found from within R by installing and loading the sqldf package and
then entering ?sqldf and ?read.csv.sql. A number ofexamples are on this page and more
examples are accessible from within R in the examples section of the ?sqldf help page.
As seen from this example which uses the built in BOD data frame:
library(sqldf)
sqldf("select * from BOD where Time > 4")
with sqldf the user is freed from having to do the following, all of which are automatically
done:
database setup
writing the create table statement which defines each table
importing and exporting to and from the database
coercing of the returned columns to the appropriate class in common cases
It can be used for:
learning SQL if you know R
learning R if you know SQL
as an alternate syntax for data frame manipulation, particularly for purposes of speeding
these up, since sqldf with SQLite as the underlying database is often faster than
performing the same manipulations in straight R
reading portions of large files into R without reading the entire file (example 6b and
example 13 below show two different ways and examples 6e, 6f below show how to read
random portions of a file)
In the case of SQLite it consists of a thin layer over the RSQLite DBI interface to SQLite itself.
In the case of H2 it works on top of the RH2 DBI driver which in turn uses RJDBC and JDBC
to interface to H2 itself.
In the case of PostgreSQL it works on top of the RPostgreSQL DBI driver.
There is also some untested code in sqldf for use with the MySQL database using
the RMySQL DBI driver.
Citing sqldf
To get information on how to cite sqldf in papers, issue the R commands:
library(sqldf)
citation("sqldf")
For Those New to R
If you have not used R before and want to try sqldf with SQLite, google for single letter R,
download R, install it on Windows, Mac or UNIX/Linux and then start R and at R console
enter this:
# installs everything you need to use sqldf with SQLite
# including SQLite itself
install.packages("sqldf")
# shows built in data frames
data()
# load sqldf into workspace
library(sqldf)
sqldf("select * from iris limit 5")
sqldf("select count(*) from iris")
sqldf("select Species, count(*) from iris group by Species")
# create a data frame
DF <- data.frame(a = 1:5, b = letters[1:5])
sqldf("select * from DF")
sqldf("select avg(a) mean, variance(a) var from DF") # see example 15
To try it with H2 rather than SQLite the process is similar. Ensure that you have
the java runtime installed, install R as above and start R. From within R enter this ensuring
that the version of RH2 that you have is RH2 0.1-2.6 or later:
# installs everything including H2
install.packages("sqldf", dep = TRUE)
# load RH2 driver and sqldf into workspace
library(RH2)
packageVersion("RH2") # should be version 0.1-2-6 or later
library(sqldf)
#
sqldf("select * from iris limit 5")
sqldf("select count(*) from iris")
sqldf("select Species, count(*) from iris group by Species")
DF <- data.frame(a = 1:5, b = letters[1:5])
sqldf("select * from DF")
sqldf("select avg(a) mean, var_samp(a) var from DF")
News
March 28, 2012. sqldf 0.4-6.4 has been uploaded to CRAN. See NEWS file.
December 19, 2011. sqldf 0.4-6.1 has been uploaded to CRAN. It fixes a minor bug.
December 10, 2011. sqldf 0.4-6 has been uploaded to CRAN. See NEWS file.
December 1, 2011. Some changes to FAQ #4 have been made to incorporate the
improvements in RSQLite 0.11.0 .
November 28, 2011. RH2 0.1-2.8 has been uploaded to CRAN. It includes a new version,
1.3.162, of H2.
November 22, 2011. RPostgreSQL support has been added to sqldf in the sqldf development
version.
November 21, 2011. sqldf 0.4-5 is now on CRAN and should propagate to the mirrors shortly.
See NEWS.
November 15, 2011. sqldf 0.4-4 has been uploaded to CRAN. The primary new feature is the
inclusion of a gawk program, csv.awk, which can transform input files by removing quotes
surrounding fields, unescaping embedded quotes and replacing field separators with different
separators. See the example here and also see ?sqldf from within R. Added later: Note that
a bug was found in this awk program -- try thecsvfix program instead.
November 5, 2011. sqldf 0.4-3 has been uploaded to CRAN. This version allows
the file argument to be omitted in read.csv.sql if filter is specified and no file input is
needed. (Previously it had to be specified as "NUL" or "/dev/null" depending on OS.) Also, if
the file argument begins with "http:" or "ftp:" in those commands then it first downloads the
file before reading it into sqlite. See FAQ #16.
October 20, 2011. RH2 0.1-2.7 has been uploaded to CRAN. This version is a bug fix release.
August 8, 2011. sqldf 0.4-2 has been uploaded to CRAN. This version adds
the nrows and field.types arguments to read.csv.sql andread.csv2.sql.
July 30, 2011. RH2 0.1-2.6 has been uploaded to CRAN. This version corrects a
documentation bug.
July 23, 2011. RH2 0.1-2.5 is on on CRAN. It should appear on the mirrors shortly. A
significant change in RH2 is that it includes H2 1.3.158 which no longer requires that built in
function names be upper case.
July 23, 2011. sqldf 0.4-1.2 is on on CRAN. It should appear on the mirrors shortly. This
version is a bug fix version.
June 28, 2011. sqldf 0.4-1 is on CRAN. See NEWS for changes.
June 15, 2011. sqldf 0.4-0 is on CRAN. See NEWS for a list of changes.
May 24, 2011. The development version of sqldf now has MySQL support. It now also has a
unit test suite that can be used with svUnit. The test suite works with any of RSQLite, RH2,
RMySQL and RpgSQL driver packages.
May 11, 2011. A new version of the RpgSQL postgresql driver supported by sqldf is now on
CRAN. See the RpgSQL NEWS file.
March 7, 2011. A new version of the RH2 driver, version 0.1-2.3, has been uploaded to
CRAN. It includes a workaround for the problem that the RJDBC driver which RH2 uses reads
NULLs into R in numeric database fields as 0. This change fixes that so that they are read
into R as NA.
December 16, 2010. A new example has been added below. See Example 17. Lag .
October 2, 2010. A new version of the RpgSQL postgresql driver supported by sqldf is now on
CRAN. See the RpgSQL NEWS file.
August 30, 2010. The development source allows the to.df argument of sqldf to be a
function or the character string "name__class" (as well as the previously allowed values of
NULL, "raw" and "auto"). If "name__class" is specified then instead of the usual class
assignment heuristic sqldf uses the column names to determine class. Any column name of
the form "x__y" where y is some R class, e.g."mydate__Date", is converted to that class
and the suffix is removed. If a function is used as the value of the method argument then it is
called by sqldf passing the data frame prior to class conversion as its first argument. This
provides a way for user transformations to hook intosqldf. e.g.
> library(sqldf)
>
> DF <- data.frame(a_Date = 0:1, b_POSIXct = 0:1, c = 0:1)
> sqldf("select * from DF", method = "name_class")
a b c
1 1970-01-01 1970-01-01 00:00:00 0
2 1970-01-02 1970-01-01 00:00:01 1
>
> ## same
> options(sqldf.method = "name_class")
> sqldf("select * from DF")
a b c
1 1970-01-01 1970-01-01 00:00:00 0
2 1970-01-02 1970-01-01 00:00:01 1
>
> processDates <- function(data, ...) {
+ ix <- grepl("_date$", names(data))
+ names(data)[ix] <- sub("_date$", "", names(data)[ix])
+ data[ix] <- lapply(data[ix], as.Date, origin = "1970-01-01")
+ data
+ }
> DF2 <- data.frame(a_date = 0:1, c = 0:1)
> sqldf("select * from DF2", method = processDates)
a c
1 1970-01-01 0
2 1970-01-02 1
August 21, 2010. A new example has been added below. See Example 16. Moving Average .
June 5, 2010. A new example has been added below. See Example 15. Use of
RSQLite.extfuns package library functions .
June 5, 2010. Version 0.3-5 of sqldf has been uploaded to CRAN. See NEWS file.
April 16, 2010. Added example 4j Per Group Min and Max on this page.
March 16, 2010. gsubfn which sqldf depends on has come out with a new version, gsubfn
0.5-1, that can run without tcltk. That means sqldf can also run without tcltk now if tcltk is not
found. tcltk is still suggested and parsing of the SQL command will be faster if tcltk is
available.
March 15, 2010. sqldf discussed in this January 2010 Spanish language blog post (English
translation) .
March 12, 2010. this link has an sqldf example using SQLite and this link solves the same
problem also using sqldf but this time with PostgreSQL making use of PostgreSQL's
windowing functions.
February 13, 2010. New versions: sqldf version 0.3-4, and RH2 version 0.1-2 (DBI/RJDBC
driver for H2 database) have been uploaded to CRAN. Also a new package RpgSQL version
0.1-1 (DBI/RJDBC driver for PostgreSQL database) has been uploaded to CRAN. The default
action of sqldf (if sqldf's drv= argument is not used and if the "sqldf.driver" global option
is not used) is to use PostgreSQL if RpgSQL is loaded or H2 if RH2 is loaded or SQLite
otherwise. The main change in sqldf is that all H2 statements are now supported, not just
those statements that return results. The packages should become accessible from
the CRAN main site and the mirrors shortly.
February 7, 2010. New versions of sqldf version 0.3-3, and RH2 version 0.1-1 (R driver for H2
database) have been uploaded to CRAN. They are primarily bug fix versions. Notable bugs
that were eliminated were associated with the use of the persistence feature (using sqldf
without any arguments) and the use of the filter= argument.
Feburary 6, 2010. Added example Example 13c illustrating use of filter= argument
with read.csv.sql.
February 1, 2010. sqldf 0.3-2 is now in the svn repository and has been uploaded to CRAN. It
now also supports the H2 embedded java database. This database has some SQL
functions not available in SQLite. For more info see FAQ #10.
January 27, 2010. Added FAQ #9 on examining table layouts.
January 26, 2010. Added FAQ #8 on update.
January 24, 2010. Added FAQ #7 on MySQL.
January 22, 2010. Added FAQ #6 on case sensitivity.
January 15, 2010. sqldf listed in Drew Conway's top 10 Must-Have R Packages for Social
Scientists in a December 2009 post on his Zero Intelligence Agents blog. sqldf was also
mentioned in November in dataspora by Michael E. Driscoll and is the subject of a blog post
in Cerebral Mastication by J. D. Long. sqldf is also recommended for a particular application
in stackoverflow and Juliet Jacobson discusses why it fits in with her work flow here. Also
some recent tweets on sqldf can be found here and here.
December 28, 2009. New bug fix release sqldf 0.2-1 on CRAN. See NEWS file.
December 26, 2009. Folded the Bugs section into FAQ #4 since this is more of an
explanation of how to use dates in SQLite than a bug. That section has been further
expanded to show how to use SQLite date and time functions to solve some problems
involving the R Date class.
December 22, 2009. sqldf 0.2-0 has been released and is available on CRAN. It now
works with the latest version of DBI, DBI 0.2-5 (which quotes column names that are SQL
reserved words instead of appending __1 to their name so the mangling of column names
that are SQL reserved words is gone). Also sqldf 0.2-0 supports the libspatial-
1.dll SQLite loadable extension which gives the user access to several dozen new SQL
functions listed here: http://www.gaia-gis.it/spatialite/spatialite-sql-2.3.1.html. The user must
download this dll and place it in their path if they want to use these functions. (If this is not
done sqldf will still work but without those new functions.) Also newfilter= arg
on read.csv.sql and new read.csv2.sql command. For more details see
this announcement and the NEWS file.
December 9, 2009. Titus von der Malsburg posted on r-help peformance results of a problem
with about 8,000 rows comparing an sqldfsolution to 4 other solutions
using aggregate, summmaryBy, by and tapply, respectively, and found that
the sqldf solution was the fastest. Marek Jared posted a variation on the problem, which
included making it self-contained, and reached the same conclusion. (Added later: there are
also some performance results here.) Since sqldf must build a database, transfer data
frames to it, perform the operations, transfer the result back and destroy the database it
created we would not expect it to be the fastest possible solution nevertheless as these
performance tests show it is remarkably good and in those cases was actually faster than
anything else tried. (Note: if your queries are running slowly you can speed them up,
sometimes dramatically, by using indexing and ensuring that the queries are specified in such
a way that the created indexes are actually used. See example 4i on this page.)
September 25, 2009. A new version of sqldf is on CRAN. It contains bug fixes and can also
handle table names with a dot in the name provided the table name is enclosed in back
quotes in the SQL statement.
August 30, 2009. Added Example 4f temporal join to this page.
June 16, 2009. Added read.csv2.sql to development version. It is
like read.csv.sql except that sep defaults to ";" . See Example 13b at the end of this page.
June 7, 2009. Version 0.1-5 of sqldf is now on CRAN and should propagate to the mirrors
shortly. read.csv.sql is new. See Example 13below.
June 4, 2009. New command read.csv.sql.
May 16, 2009. Example 6g added below.
April 22, 2009. Added example 4e (left join) in the Examples section below. Example 4
section