I have recently started learning Amazon Redshift. AWS Redshift is actually a revised version of ParAccel. There are voices in the industry shouting that Big Data is over-hyped, and now it looks like MPP (Massively Parallel Processing) is being offered as the answer to that hype. Of course, it depends on the definition of Big Data, but it is true that not every company needs Big Data solutions the way Google or Facebook does.
AWS Redshift is a type of MPP DBMS that supports a SQL interface, which enterprise data warehouse developers are familiar with. But there are some limitations as well. Redshift is often compared with Hadoop Hive (for example, in the slide here). I think that comparison is not fair, as Hive was designed as a batch processing platform from the beginning. It would be more interesting to see a comparison between Redshift and Hadoop HBase or Cassandra, since they are all column oriented databases. Furthermore, there are other near real time interactive analysis platforms in the industry: for example, Google BigQuery (commercial), Google Dremel (the internal system BigQuery is built on), and Apache Drill (open source). What they have in common is that they use a different parallel computing architecture instead of the Map/Reduce framework. There are many other choices of MPP databases as well, for example Netezza, Dataextreme, etc. Comparing all of these goes beyond just the technical details.
As a Redshift beginner, I am reading a book published by PacktPub called "Getting Started with Amazon Redshift". I think this is a good book for beginners. At fewer than 200 pages, it is a handy guide for quickly setting up an AWS cluster, loading data into Redshift, and practising on it. In particular, it highlights some features that Redshift DOES NOT support. I especially like those parts, as I am more interested in knowing what a new product cannot do than in hearing tons of marketing language promoting it. It is good learning material in addition to Amazon's online documentation. Below is the link to the book,
Also, here is a link from Amazon showing how to load data into Redshift.
http://oksoft.blogspot.com/2013/03/redshift-tips.html
Redshift Tips
# List all tables:
select db_id, id, name, sum(rows) as mysum from stv_tbl_perm where db_id = 100546 group by db_id, id, name order by mysum desc;
# list all running processes:
select pid, query from stv_recents where status = 'Running';
# describe table
select * from PG_TABLE_DEF where tablename='audit_trail';
select * from pg_tables where schemaname = 'public'
# Disk space used:
select sum(used-tossed) as used, sum(capacity) as capacity from stv_partitions
# Query log
select query, starttime , substring from svl_qlog where substring like '%tbl_name%' order by starttime desc limit 50;
# command history
select * from stl_ddltext where text like '%ox_data_summary_hourly_depot%' limit 10
# last load errors
select starttime, filename, err_reason from stl_load_errors order by starttime desc limit 100
select filename, count(*) as cnt from stl_load_errors group by filename
# create table from another table
select * into newevent from event;
# Check how columns are compressed
ANALYZE COMPRESSION
# ANALYZE and VACUUM
If you insert, update, or delete a significant number of rows in a table, run the ANALYZE and VACUUM commands against the table.
"analyze compression tbl_name" command produce a report with the suggested column encoding.
# To find and diagnose load errors for table 'event'
create view loadview as
(select distinct tbl, trim(name) as table_name, query, starttime,
trim(filename) as input, line_number, field, err_code,
trim(err_reason) as reason
from stl_load_errors sl, stv_tbl_perm sp
where sl.tbl = sp.id);
select * from loadview where table_name='event';
# Query to find blocks used
select stv_tbl_perm.name, count(*)
from stv_blocklist, stv_tbl_perm
where stv_blocklist.tbl = stv_tbl_perm.id
and stv_blocklist.slice = stv_tbl_perm.slice
group by stv_tbl_perm.name
order by stv_tbl_perm.name;
Load tips:
# While loading data you can specify "empty as null", "blanks as null", allow "max error 5", "ignore blank lines", "remove quotes", and "use gzip". Use the keywords: emptyasnull blanksasnull maxerror 5 ignoreblanklines removequotes gzip (see the sample COPY command after this list)
# use NULL AS '\000' to fix the import from specific files
# use BLANKSASNULL in the original COPY statement so that no empty strings are loaded into VARCHAR fields which might ultimately be converted to numeric fields.
# Use the NOLOAD keyword with a COPY command to validate the data in the input files before actually loading the data.
# use COMPUPDATE to enable automatic compression
# FILLRECORD to fill missing columns at the end with blanks or NULLs
# TRIMBLANKS Removes the trailing whitespace characters from a VARCHAR string.
# ESCAPE the backslash character (\) in input data is treated as an escape character. (useful for delimiters and embedded newlines)
# ROUNDEC a value of 20.259 loaded into a DECIMAL(8,2) column is rounded to 20.26; by default (without ROUNDEC) it is truncated to 20.25
# TRUNCATECOLUMNS truncates data in columns to the appropriate number of characters so that it fits the column specification
# IGNOREHEADER to ignore the first row (e.g. a header line) of the file
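Putting several of those options together, here is a hypothetical COPY command; the bucket path and credentials are placeholders, and the option list is just one plausible combination:
copy event from 's3://mybucket/data/event.txt.gz'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
gzip removequotes emptyasnull blanksasnull ignoreblanklines
ignoreheader 1 trimblanks truncatecolumns maxerror 5;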
_____
If you are using JDBC, try adding the keepalive option to your connect string. E.g.,
jdbc:postgresql://instance.amazonaws.com:8192/database?tcpkeepalive=true
You can have AUTOCOMMIT set in your Workbench client.
_____
In order to avoid timeout errors while using the Workbench client on Windows, use the following registry settings:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime 30000
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveInterval 1000
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpMaxDataRetransmission 10
_____
# Consider using DISTKEY and SORTKEY - there can be multiple SORTKEY columns but only one DISTKEY per table
# wlm_query_slot_count - raising this sets aside more memory for a query, which may avoid operations spilling to disk
# the isolation level for Redshift is SERIALIZABLE
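A minimal sketch of both ideas, using a made-up orders table (all names here are illustrative only):
create table orders (
  order_id    bigint,
  customer_id bigint,
  order_date  date
)
distkey (customer_id)   -- rows with the same customer_id land on the same slice
sortkey (order_date);   -- range-restricted scans on order_date read fewer blocks

set wlm_query_slot_count to 3;  -- give queries in this session more memory slots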
_____
// There is no equivalent of "show create table tbl_name"
Instead, select from the PG_TABLE_DEF catalog table to gather all the necessary schema information.
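For example, a query along these lines (tbl_name is the same placeholder used above) lists each column's type, encoding, and key flags:
select "column", type, encoding, distkey, sortkey
from pg_table_def
where tablename = 'tbl_name';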
// convert to and from unixtime
select extract (epoch from timestamp '2011-08-08 11:11:58');
select TIMESTAMP 'epoch' + starttime * INTERVAL '1 second' as starting from tbl_name;
// Update a joined table:
update abcd
set ser_area_code = abcd_update.ser_area_code,
    preferences = abcd_update.preferences,
    opstype = abcd_update.opstype,
    phone_type = abcd_update.phone_type
from abcd_update
where abcd.phone_number = abcd_update.phone_number;
http://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html#concept_upsert
_____
// install postgresql
yum install postgresql postgresql-server
chkconfig postgresql on
// You will now create a file where the redshift password will be stored.
vi ~/.pgpass
c.us-east-1.redshift.amazonaws.com:5439:mydb:root:Passwd
chmod 0600 ~/.pgpass
// load data to redshift
cat to_psql.txt | psql -hc.us-east-1.redshift.amazonaws.com -Uroot -p5439 mydb > to_save.csv
// send the file as an attachment
echo "report file attached. " | mutt -s "result data " -a to_save.csv -- some_address@gmail.com
// mysqldump command that will generate the required statements to be used in redshift
mysqldump db_name tbl_name --where='1=1 limit 10' --compact --no-create-info --skip-quote-names > to_psql.txt
_____
Amazon Redshift data types are different from MySQL's. For example, long text values can be saved only as a VARCHAR type, up to 65535 bytes.
http://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html
Here is a script that will do this conversion automatically.
https://gist.github.com/shantanuo/5115366
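As a small illustration of that mapping (the table and column names here are made up for the example), a MySQL TEXT column would land in Redshift as a VARCHAR at or below the maximum size:
create table articles (
  id   bigint,
  body varchar(65535)   -- MySQL TEXT has no direct equivalent; use the maximum VARCHAR size
);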
_____
If the postgresql client is installed, we can connect to Redshift using something like this...
# PGPASSWORD=Fly8946392085 psql -U fsb_user_85_22719249 -h flydata-sandbox-cluster.clroanynhqjo.us-east-1.redshift.amazonaws.com -p 5439 -d flydatasandboxdb
Welcome to psql 8.1.23 (server 8.0.2), the PostgreSQL interactive terminal.