logo资料库

SQLdf包.pdf

第1页 / 共31页
第2页 / 共31页
第3页 / 共31页
第4页 / 共31页
第5页 / 共31页
第6页 / 共31页
第7页 / 共31页
第8页 / 共31页
资料共31页,剩余部分请下载后查看
R SQLDF 详细教程 1 (2013-12-19 22:01:31) 转 载 ▼ To write it, it took three months; to conceive it – three minutes; to collect the data in it – all my life. F. Scott Fitzgerald Latest News There is now an sqldf discussion group to discuss sqldf (and other of my packages). sqldf is an R package for runing SQL statements on R data frames, optimized for convenience. The user simply specifies an SQL statement in R using data frame names in place of table names and a database with appropriate table layouts/schema is automatically created, the data frames are automatically loaded into the database, the specified SQL statement is performed, the result is read back into R and the database is deleted all automatically behind the scenes making the database's existence transparent to the user who only specifies the SQL statement. Surprisingly this can at times be even faster than the corresponding pure R calculation (although the purpose of the project is convenience and not speed). This link suggests that for aggregations over highly granular columns that sqldf is faster than another alternative tried. sqldf is free software published under the GNU General Public License that can be downloaded from CRAN. sqldf supports (1) the SQLite backend database (by default), (2) the H2 java database, (3) the PostgreSQL database and (4) sqldf 0.4-0 onwards also supports MySQL. SQLite, H2, MySQL and PostgreSQL are free software. SQLite and H2 are embedded serverless zero administration databases that are included right in the R driver packages, RSQLite and RH2, so that there is no separate installation for either one. A number of high profile projects use SQLite. (Also see this lecture.) H2 is a java database which contains a large collection of SQL functions and supports Date and other data types. PostgreSQL is a client/server database and unlike SQLite and H2 must be separately installed but it has a particularly powerful version of SQL, e.g. its window functions, so the extra installation work can be worth it. sqldf supports the RPostgreSQL driver in R. Like PostgreSQL, MySQL is a client server database that must be installed independently so its not as easy to install as SQLite or H2 but its very popular and is widely used as the back end for web sites. The information below mostly concerns the default SQLite database. The use of H2 with sqldf is discussed in FAQ #10 which discusses differences between using sqldf with SQLite and H2 and also shows how to modify the code in the Examples section to use sqldf/H2 rather than sqldf/SQLite. There is some information on using PostgreSQL with sqldf in FAQ #12 and an example in Example 17. Lag . The unit tests provide examples that can work with all five data base drivers (covering four databases) supported by sqldf. They are run by loading whichever database is to be tested (SQLite is the default) and running: demo("sqldf-unitTests")  Overview  Citing sqldf
 For Those New to R  News  Troubleshooting  FAQ   1. How does sqldf handle classes and factors?  2. Why does sqldf seem to mangle certain variable names?  3. Why does sqldf("select var(x) from DF") not work?  4. How does sqldf work with "Date" class variables?  5. I get a message about the tcltk package being missing.  6. Why are there problems when we use table names or column names that are the same except for case?  7. Why are there messages about MySQL?  8. Why am I having problems with update?  9. How do I examine the layout that SQLite uses for a table? which tables are in the database? which databases are attached?  10. What are some of the differences between using SQLite and H2 with sqldf?  11. Why am I having difficulty reading a data file using SQLite and sqldf?  12. How does one use sqldf with PostgreSQL?  13. How does one deal with quoted fields in read.csv.sql ?  14. How does one read files where numeric NAs are represented as missing empty fields?  15. Why do certain calculations come out as integer rather than double?  16. How can one read a file off the net or a csv file in a zip file?  Examples   Example 1. Ordering and Limiting  Example 2. Averaging and Grouping  Example 3. Nested Select  Example 4. Join  Example 5. Insert Variables  Example 6. File Input  Example 7. Nested Select  Example 8. Specifying File Format  Example 9. Working with Databases  Example 10. Persistent Connections  Example 11. Between and Alternatives
 Example 12. Combine two files in permanent database  Example 13. read.csv.sql and read.csv2.sql  Example 14. Use of spatialite library functions  Example 15. Use of RSQLite.extfuns library functions  Example 16. Moving Average  Example 17. Lag  Example 17. MySQL Schema Information Overview sqldf is an R package for running SQL statements on R data frames, optimized for convenience. sqldf works with the SQLite, H2, PostgreSQLor MySQL databases. SQLite has the least prerequisites to install. H2 is just as easy if you have Java installed and also supports Date class and a few additional functions. PostgreSQL notably supports Windowing functions providing the SQL analogue of the R ave function. MySQL is a particularly popular database that drives many web sites. More information can be found from within R by installing and loading the sqldf package and then entering ?sqldf and ?read.csv.sql. A number ofexamples are on this page and more examples are accessible from within R in the examples section of the ?sqldf help page. As seen from this example which uses the built in BOD data frame: library(sqldf) sqldf("select * from BOD where Time > 4") with sqldf the user is freed from having to do the following, all of which are automatically done:  database setup  writing the create table statement which defines each table   importing and exporting to and from the database coercing of the returned columns to the appropriate class in common cases It can be used for:   learning SQL if you know R learning R if you know SQL  as an alternate syntax for data frame manipulation, particularly for purposes of speeding these up, since sqldf with SQLite as the underlying database is often faster than performing the same manipulations in straight R  reading portions of large files into R without reading the entire file (example 6b and example 13 below show two different ways and examples 6e, 6f below show how to read random portions of a file) In the case of SQLite it consists of a thin layer over the RSQLite DBI interface to SQLite itself.
In the case of H2 it works on top of the RH2 DBI driver which in turn uses RJDBC and JDBC to interface to H2 itself. In the case of PostgreSQL it works on top of the RPostgreSQL DBI driver. There is also some untested code in sqldf for use with the MySQL database using the RMySQL DBI driver. Citing sqldf To get information on how to cite sqldf in papers, issue the R commands: library(sqldf) citation("sqldf") For Those New to R If you have not used R before and want to try sqldf with SQLite, google for single letter R, download R, install it on Windows, Mac or UNIX/Linux and then start R and at R console enter this: # installs everything you need to use sqldf with SQLite # including SQLite itself install.packages("sqldf") # shows built in data frames data() # load sqldf into workspace library(sqldf) sqldf("select * from iris limit 5") sqldf("select count(*) from iris") sqldf("select Species, count(*) from iris group by Species") # create a data frame DF <- data.frame(a = 1:5, b = letters[1:5]) sqldf("select * from DF") sqldf("select avg(a) mean, variance(a) var from DF") # see example 15 To try it with H2 rather than SQLite the process is similar. Ensure that you have the java runtime installed, install R as above and start R. From within R enter this ensuring that the version of RH2 that you have is RH2 0.1-2.6 or later: # installs everything including H2 install.packages("sqldf", dep = TRUE) # load RH2 driver and sqldf into workspace library(RH2) packageVersion("RH2") # should be version 0.1-2-6 or later library(sqldf) # sqldf("select * from iris limit 5") sqldf("select count(*) from iris") sqldf("select Species, count(*) from iris group by Species") DF <- data.frame(a = 1:5, b = letters[1:5])
sqldf("select * from DF") sqldf("select avg(a) mean, var_samp(a) var from DF") News March 28, 2012. sqldf 0.4-6.4 has been uploaded to CRAN. See NEWS file. December 19, 2011. sqldf 0.4-6.1 has been uploaded to CRAN. It fixes a minor bug. December 10, 2011. sqldf 0.4-6 has been uploaded to CRAN. See NEWS file. December 1, 2011. Some changes to FAQ #4 have been made to incorporate the improvements in RSQLite 0.11.0 . November 28, 2011. RH2 0.1-2.8 has been uploaded to CRAN. It includes a new version, 1.3.162, of H2. November 22, 2011. RPostgreSQL support has been added to sqldf in the sqldf development version. November 21, 2011. sqldf 0.4-5 is now on CRAN and should propagate to the mirrors shortly. See NEWS. November 15, 2011. sqldf 0.4-4 has been uploaded to CRAN. The primary new feature is the inclusion of a gawk program, csv.awk, which can transform input files by removing quotes surrounding fields, unescaping embedded quotes and replacing field separators with different separators. See the example here and also see ?sqldf from within R. Added later: Note that a bug was found in this awk program -- try thecsvfix program instead. November 5, 2011. sqldf 0.4-3 has been uploaded to CRAN. This version allows the file argument to be omitted in read.csv.sql if filter is specified and no file input is needed. (Previously it had to be specified as "NUL" or "/dev/null" depending on OS.) Also, if the file argument begins with "http:" or "ftp:" in those commands then it first downloads the file before reading it into sqlite. See FAQ #16. October 20, 2011. RH2 0.1-2.7 has been uploaded to CRAN. This version is a bug fix release. August 8, 2011. sqldf 0.4-2 has been uploaded to CRAN. This version adds the nrows and field.types arguments to read.csv.sql andread.csv2.sql. July 30, 2011. RH2 0.1-2.6 has been uploaded to CRAN. This version corrects a documentation bug. July 23, 2011. RH2 0.1-2.5 is on on CRAN. It should appear on the mirrors shortly. A significant change in RH2 is that it includes H2 1.3.158 which no longer requires that built in function names be upper case. July 23, 2011. sqldf 0.4-1.2 is on on CRAN. It should appear on the mirrors shortly. This version is a bug fix version. June 28, 2011. sqldf 0.4-1 is on CRAN. See NEWS for changes. June 15, 2011. sqldf 0.4-0 is on CRAN. See NEWS for a list of changes. May 24, 2011. The development version of sqldf now has MySQL support. It now also has a unit test suite that can be used with svUnit. The test suite works with any of RSQLite, RH2, RMySQL and RpgSQL driver packages.
May 11, 2011. A new version of the RpgSQL postgresql driver supported by sqldf is now on CRAN. See the RpgSQL NEWS file. March 7, 2011. A new version of the RH2 driver, version 0.1-2.3, has been uploaded to CRAN. It includes a workaround for the problem that the RJDBC driver which RH2 uses reads NULLs into R in numeric database fields as 0. This change fixes that so that they are read into R as NA. December 16, 2010. A new example has been added below. See Example 17. Lag . October 2, 2010. A new version of the RpgSQL postgresql driver supported by sqldf is now on CRAN. See the RpgSQL NEWS file. August 30, 2010. The development source allows the to.df argument of sqldf to be a function or the character string "name__class" (as well as the previously allowed values of NULL, "raw" and "auto"). If "name__class" is specified then instead of the usual class assignment heuristic sqldf uses the column names to determine class. Any column name of the form "x__y" where y is some R class, e.g."mydate__Date", is converted to that class and the suffix is removed. If a function is used as the value of the method argument then it is called by sqldf passing the data frame prior to class conversion as its first argument. This provides a way for user transformations to hook intosqldf. e.g. > library(sqldf) > > DF <- data.frame(a_Date = 0:1, b_POSIXct = 0:1, c = 0:1) > sqldf("select * from DF", method = "name_class") a b c 1 1970-01-01 1970-01-01 00:00:00 0 2 1970-01-02 1970-01-01 00:00:01 1 > > ## same > options(sqldf.method = "name_class") > sqldf("select * from DF") a b c 1 1970-01-01 1970-01-01 00:00:00 0 2 1970-01-02 1970-01-01 00:00:01 1 > > processDates <- function(data, ...) { + ix <- grepl("_date$", names(data)) + names(data)[ix] <- sub("_date$", "", names(data)[ix]) + data[ix] <- lapply(data[ix], as.Date, origin = "1970-01-01") + data + } > DF2 <- data.frame(a_date = 0:1, c = 0:1) > sqldf("select * from DF2", method = processDates) a c
1 1970-01-01 0 2 1970-01-02 1 August 21, 2010. A new example has been added below. See Example 16. Moving Average . June 5, 2010. A new example has been added below. See Example 15. Use of RSQLite.extfuns package library functions . June 5, 2010. Version 0.3-5 of sqldf has been uploaded to CRAN. See NEWS file. April 16, 2010. Added example 4j Per Group Min and Max on this page. March 16, 2010. gsubfn which sqldf depends on has come out with a new version, gsubfn 0.5-1, that can run without tcltk. That means sqldf can also run without tcltk now if tcltk is not found. tcltk is still suggested and parsing of the SQL command will be faster if tcltk is available. March 15, 2010. sqldf discussed in this January 2010 Spanish language blog post (English translation) . March 12, 2010. this link has an sqldf example using SQLite and this link solves the same problem also using sqldf but this time with PostgreSQL making use of PostgreSQL's windowing functions. February 13, 2010. New versions: sqldf version 0.3-4, and RH2 version 0.1-2 (DBI/RJDBC driver for H2 database) have been uploaded to CRAN. Also a new package RpgSQL version 0.1-1 (DBI/RJDBC driver for PostgreSQL database) has been uploaded to CRAN. The default action of sqldf (if sqldf's drv= argument is not used and if the "sqldf.driver" global option is not used) is to use PostgreSQL if RpgSQL is loaded or H2 if RH2 is loaded or SQLite otherwise. The main change in sqldf is that all H2 statements are now supported, not just those statements that return results. The packages should become accessible from the CRAN main site and the mirrors shortly. February 7, 2010. New versions of sqldf version 0.3-3, and RH2 version 0.1-1 (R driver for H2 database) have been uploaded to CRAN. They are primarily bug fix versions. Notable bugs that were eliminated were associated with the use of the persistence feature (using sqldf without any arguments) and the use of the filter= argument. Feburary 6, 2010. Added example Example 13c illustrating use of filter= argument with read.csv.sql. February 1, 2010. sqldf 0.3-2 is now in the svn repository and has been uploaded to CRAN. It now also supports the H2 embedded java database. This database has some SQL functions not available in SQLite. For more info see FAQ #10. January 27, 2010. Added FAQ #9 on examining table layouts. January 26, 2010. Added FAQ #8 on update. January 24, 2010. Added FAQ #7 on MySQL. January 22, 2010. Added FAQ #6 on case sensitivity. January 15, 2010. sqldf listed in Drew Conway's top 10 Must-Have R Packages for Social Scientists in a December 2009 post on his Zero Intelligence Agents blog. sqldf was also mentioned in November in dataspora by Michael E. Driscoll and is the subject of a blog post in Cerebral Mastication by J. D. Long. sqldf is also recommended for a particular application
in stackoverflow and Juliet Jacobson discusses why it fits in with her work flow here. Also some recent tweets on sqldf can be found here and here. December 28, 2009. New bug fix release sqldf 0.2-1 on CRAN. See NEWS file. December 26, 2009. Folded the Bugs section into FAQ #4 since this is more of an explanation of how to use dates in SQLite than a bug. That section has been further expanded to show how to use SQLite date and time functions to solve some problems involving the R Date class. December 22, 2009. sqldf 0.2-0 has been released and is available on CRAN. It now works with the latest version of DBI, DBI 0.2-5 (which quotes column names that are SQL reserved words instead of appending __1 to their name so the mangling of column names that are SQL reserved words is gone). Also sqldf 0.2-0 supports the libspatial- 1.dll SQLite loadable extension which gives the user access to several dozen new SQL functions listed here: http://www.gaia-gis.it/spatialite/spatialite-sql-2.3.1.html. The user must download this dll and place it in their path if they want to use these functions. (If this is not done sqldf will still work but without those new functions.) Also newfilter= arg on read.csv.sql and new read.csv2.sql command. For more details see this announcement and the NEWS file. December 9, 2009. Titus von der Malsburg posted on r-help peformance results of a problem with about 8,000 rows comparing an sqldfsolution to 4 other solutions using aggregate, summmaryBy, by and tapply, respectively, and found that the sqldf solution was the fastest. Marek Jared posted a variation on the problem, which included making it self-contained, and reached the same conclusion. (Added later: there are also some performance results here.) Since sqldf must build a database, transfer data frames to it, perform the operations, transfer the result back and destroy the database it created we would not expect it to be the fastest possible solution nevertheless as these performance tests show it is remarkably good and in those cases was actually faster than anything else tried. (Note: if your queries are running slowly you can speed them up, sometimes dramatically, by using indexing and ensuring that the queries are specified in such a way that the created indexes are actually used. See example 4i on this page.) September 25, 2009. A new version of sqldf is on CRAN. It contains bug fixes and can also handle table names with a dot in the name provided the table name is enclosed in back quotes in the SQL statement. August 30, 2009. Added Example 4f temporal join to this page. June 16, 2009. Added read.csv2.sql to development version. It is like read.csv.sql except that sep defaults to ";" . See Example 13b at the end of this page. June 7, 2009. Version 0.1-5 of sqldf is now on CRAN and should propagate to the mirrors shortly. read.csv.sql is new. See Example 13below. June 4, 2009. New command read.csv.sql. May 16, 2009. Example 6g added below. April 22, 2009. Added example 4e (left join) in the Examples section below. Example 4 section
分享到:
收藏