Preparations to use TPCH data on Postgresql

显示全部楼层 · 2012-5-21 10:19:41

Preparations to use TPCH data on PostgresqlFebruary 9th, 2010 Goto comments Leave a comment

dbgen puts a SEPARATOR symbol “|” at the end of each line, but I want to get rid of it, keeping only “|” ‘s between two columns. In other words, the original data format is:
col1|col2|col3|col4|
What I want is:
col1|col2|col3|col4
This may simplify some of my code which uses a “split”-like function to extract data lines.
So, first step, modify “dss.h”.

Locate the following line:
#define PR_END(fp) fprintf(fp, “\n”)
change it to:
#define PR_END(fp) {fseek(fp, -1, SEEK_CUR);fprintf(fp, “\n”);}
run “make”.
Now, you could run like “./dbgen -s 2″ to get 2GB dataset. After dbgen finishes, a bunch of “*.tbl” files will be generated. Move (mv) these files to “/tmp”. But before you can let postgresql manage your data, you have to create a database and corresponding tables first.
The command to create a new database is just “createdb”. And to create tables, just run “dss.ddl” that comes with TPCH package.
Now you can run the following code to import data into postgresql:
copy customer from ‘/tmp/customer.tbl’ WITH DELIMITER AS ‘|’;copy lineitem from ‘/tmp/lineitem.tbl’ WITH DELIMITER AS ‘|’;copy nation from ‘/tmp/nation.tbl’ WITH DELIMITER AS ‘|’;copy orders from ‘/tmp/orders.tbl’ WITH DELIMITER AS ‘|’;copy partsupp from ‘/tmp/partsupp.tbl’ WITH DELIMITER AS ‘|’;copy part from ‘/tmp/part.tbl’ WITH DELIMITER AS ‘|’;copy region from ‘/tmp/region.tbl’ WITH DELIMITER AS ‘|’;copy supplier from ‘/tmp/supplier.tbl’ WITH DELIMITER AS ‘|’;
You may move the “.tbl” files back from “/tmp” after importing is finished.
To speed up queries, you may also want to create some indexes:
CREATE INDEX n_nationkey_idx on nation (n_nationkey);CREATE INDEX r_regionkey_idx on region (r_regionkey);CREATE INDEX p_partkey_idx on part (p_partkey);CREATE INDEX s_suppkey_idx on supplier (s_suppkey);CREATE INDEX ps_partkey_idx on partsupp (ps_partkey);CREATE INDEX c_custkey_idx on customer (c_custkey);CREATE INDEX o_orderkey_idx on orders (o_orderkey);CREATE INDEX l_orderkey_idx on lineitem (l_orderkey);
Since Andras is interested, I have generated a postgres version of TPC-H queries for download. This single file contains all queries, please split them into seprate “.sql” files.

千问 · 2012-5-21 10:19:41

Picasso Database Query Optimizer Visualizer©Indian Institute of Science, Bangalore, India TPC-H data generation and loading

Note: Picasso can be used with any generic relational database schema and SQL queries. The examples in the Picasso documentation are with respect to the TPC-H benchmark, and the procedure for generating and loading the TPC-H database is given here.

1.
Download the TPC-H benchmark programs DBGEN and QGEN from the TPC-H site.

2.
Create the TPC-H schema in your database engine using the file dss.ddlasper these instructions: DB2
OracleSQL Server
Sybase ASE
PostgreSQL MySql

3.
Generate the data.

a.
Compile DBGEN and QGEN after creating a suitable makefile.
b.
Run it to produce the data files (.tbl files) and queries.

4.
Load the TPC-H data as per these instructions: DB2
Oracle
SQL Server
Sybase ASE
PostgreSQL MySql
Note:
While loading data into tables, it is preferable to initially not specify the integrity constraints (primary key, foreign key and index) since it might slow down the insertion process. These constraints can be added later as mentioned below after the data has been loaded into the tables.
5.
Apply the integrity constraints on the data using file dss.rias per these instructions:
DB2Oracle
SQL Server
Sybase ASE
PostgreSQL MySql

Note 1:
The dss.ddl and dss.ri files may have to be modified to suit the syntax requirements of particular database engines.
For example:
a.
The connect to statement should not be present for Oracle.
Also this statement is not required (for other engines) if a connection to the database has already been established.
The equivalent command for Sybase ASE (through the isql prompt) is use .
And for PostgreSQL (through the psql prompt), it is \c
b.
Sybase ASE (through the Interactive SQL prompt) requires that the script should not have a semi-colon (;) at the end of any statement.
c.
The schema name (qualifier to the table names) should be changed to reflect the actual schema name or the table names should be made un-qualified.
d.
While specifying foreign key constraints, Oracle requires the keyword constraint and the constraint name to appear before the keywords foreign key.
e.
While specifying any constraints, SQL Server, Sybase ASE and PostgreSQL require that these constraints should not be named.
f.
The isql prompt of Sybase ASE requires that there should be no semi-colon after each statement – instead there should be a go command.
g.
The commit work statements are not needed since any ddl command is generally automatically followed by a commit.
In the case of some engines, this could also lead to a syntax error.
h.
SQLServer uses the datetime data type in place of date.
Note 2:
The .tbl file may have to be modified to suit the syntax requirements of particular database engines.
For example,
a.
PostgreSQL requires that the delimiter should not appear at the end of every line and that string values should not be quoted.
b.
DB2 requires that the values should be comma-separated and the string values should be enclosed in double quotes (“”).

Documentation Home

千问 · 2012-5-21 10:19:41

Picasso Database Query Optimizer Visualizer©Indian Institute of Science, Bangalore, India
PostgreSQL(The following setup procedure is known to work with PostgreSQL 8 on Windows.)

Install PostgreSQL by running the .msi installer file.
The process will include creating two accounts: a service account (preferably called postgres) and an internal database account.
The service account is an OS user account, which will be created during this installation if it doesn’t exist.
It should not be a privileged (root or admin) user.

Grant full permissions to this service account on the PostgreSQL installation folder.
Login to the machine as the PostgreSQL service account.
The database is already initialized during installation.
You can optionally initialize a new instance (in a new data folder) by running this command in the command prompt:
initdb –D
For example, if the data folder is ‘data1’ in the current directory, the command is:
initdb –D data1
Start the service with the command:
postmaster –D
For example,
postmaster –D data1
Create a database that will contain the database tables.
The command is:
createdb
Open the PostgreSQL prompt with the command:
psql-U
Here, the db_username should be the internal database account.
You will be prompted for the password, after which the SQL prompt opens.

Note: If the –U option is not specified, this command assumes the current OS user as the db user, and prompts for the password.
Incidentally if the current OS user name is the same as the internal database account, you will be directly logged in.
Create the tables by entering appropriate SQL statements.
If the schema to be created is in a file, run the command: \i
in the psql prompt.
For loading tables into the database, use the COPY
utility of PostgreSQL. The command template is given below.
Run the command in the psql prompt.
COPY [table] FROM '' WITH DELIMITER AS '';
The filename should be given with the full absolute path.

For example, to load data file C:\nation.tbl which has pipe-separated data, into the NATION table, the command is:
COPY NATION FROM 'C:\\nation.tbl ' WITH DELIMITER AS '|';
Note:
Before loading large data files, it may be helpful w.r.t. response time to set the ‘checkpoint_segments’ to a higher value than the default in the file postgresql.conf in the data directory.
Create the integrity constraints (primary key, foreign key and indexes) now.
If the commands are in a file, run the command: \i in the psql prompt.
Create statistical summaries for each of the columns that may be used as Picasso predicates in the query templates using the following command in the psql prompt:

VACUUM ANALYZE;

Note: The diagrams on the Picasso web-site have been generated with statistical summaries created on all table columns.

Documentation Home