Once every few months I am asked to support Amazon's RedShift as a backup database for CopyStorm. I doubted that RedShift would be a good choice given its design (column-based, built for fast searching rather than fast transactions), but decided to give it a try this week.
What did we do? Just for fun, we migrated our popular CopyStorm/PostgreSQL application to support Amazon’s RedShift.
(spoiler: We found that RedShift is a poor choice for a near real time backup database for Salesforce!)
Here are a few things we learned…
RedShift Is Not as Close to PostgreSQL as One Would Hope
Since…
- RedShift’s JDBC integration layer is based on PostgreSQL
- A real PostgreSQL driver works with RedShift
- RedShift’s own custom JDBC driver is based on the PostgreSQL driver
a person might think a programmer could pretend that RedShift is PostgreSQL.
However, if you need general PostgreSQL capabilities beyond SELECT, you may be in trouble. Here are a few issues we encountered.
Do not be too quick to trust the metadata from the Amazon RedShift/JDBC driver.
- The JDBC method DatabaseMetaData.storesLowerCaseIdentifiers() returns false. Too bad the truth is “true.”
This is kind of odd since the standard PostgreSQL driver returns the correct value.
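Since RedShift really does store unquoted identifiers in lower case, one defensive option (a sketch of our own, not from the RedShift documentation) is to normalize identifiers yourself instead of trusting the driver's metadata flag:

```java
import java.util.Locale;

public class IdentifierNormalizer {
    // RedShift stores unquoted identifiers in lower case even though its
    // JDBC driver's storesLowerCaseIdentifiers() reports false, so we
    // normalize identifiers ourselves before comparing or caching them.
    public static String normalize(String identifier) {
        return identifier.toLowerCase(Locale.ROOT);
    }
}
```

Run every table and column name you read from the database through a helper like this before comparing it with names in your own schema model.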
Forget about inserting a NULL value into a TEXT column.
- The RedShift/JDBC driver throws an “unsupported feature” error when a prepared TEXT parameter is set to null. You will need to use an empty string instead.
- Do not do this because it will fail: preparedStmt.setNull(n, java.sql.Types.CLOB)
- This is a substitute (but not really the same): preparedStmt.setString(n, "")
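A small helper can centralize the empty-string substitution (the helper name is ours; note that it deliberately loses the distinction between NULL and an empty string):

```java
public class RedshiftText {
    // The RedShift/JDBC driver rejects setNull() for TEXT parameters,
    // so substitute an empty string. NULL and "" become indistinguishable
    // in the backup copy -- a real limitation, not just a style choice.
    public static String toTextParam(String value) {
        return value == null ? "" : value;
    }
}
```

Usage: `preparedStmt.setString(n, RedshiftText.toTextParam(value))`.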
Do not try to use a real PostgreSQL database at the same time.
- The RedShift/JDBC driver supports both jdbc:redshift and jdbc:postgresql prefixes in database connection strings. This is bad if you use both the RedShift and PostgreSQL JDBC drivers, because at run time you may get the wrong driver when you really want the real PostgreSQL driver.
- You may get the wrong driver for jdbc:postgresql because of how the java.sql.DriverManager class interacts with Java class loaders. To find a driver for jdbc:postgresql, DriverManager loops through all registered drivers until it finds one that claims to support the URL. A problem occurs if the RedShift driver is asked about jdbc:postgresql BEFORE the real PostgreSQL driver.
- In our case, our development class loader in Eclipse worked great, but a vanilla Java 11 JRE crapped out and found the RedShift driver when we wanted the PostgreSQL driver.
- There are a couple of ways to fix this problem; we chose re-ordering the java.sql.Driver instances managed by DriverManager (a hack, but it works). Here is sample code.
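A sketch of that re-ordering hack (it assumes the real PostgreSQL driver classes live in the `org.postgresql` package, which is where the standard driver keeps them):

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

public class DriverOrderFix {
    /**
     * Deregister every JDBC driver, then re-register them so that drivers
     * whose class name starts with preferredPackage come first. Since
     * DriverManager tries drivers in registration order, the preferred
     * driver is now asked about each connection URL before the others.
     */
    public static void preferDrivers(String preferredPackage) throws SQLException {
        List<Driver> drivers = Collections.list(DriverManager.getDrivers());
        for (Driver d : drivers) {
            DriverManager.deregisterDriver(d);
        }
        // Re-register the preferred driver(s) first...
        for (Driver d : drivers) {
            if (d.getClass().getName().startsWith(preferredPackage)) {
                DriverManager.registerDriver(d);
            }
        }
        // ...then everything else (e.g. the RedShift driver).
        for (Driver d : drivers) {
            if (!d.getClass().getName().startsWith(preferredPackage)) {
                DriverManager.registerDriver(d);
            }
        }
    }
}
```

Call `DriverOrderFix.preferDrivers("org.postgresql")` once at startup, before the first `DriverManager.getConnection(...)` for a PostgreSQL URL.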
RedShift does not support indexes, so RedShift/JDBC should just silently ignore a CREATE INDEX; instead it complains and throws a SQLException.
Forget about a few common data types — they are not supported (or partially supported).
- NUMERIC (not supported)
- Use VARCHAR(MAX) rather than TEXT. TEXT is only 256 bytes long. VARCHAR(MAX) is 64K bytes.
- SERIAL (not supported). Use BIGINT IDENTITY
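Porting DDL therefore comes down to a small type substitution. A sketch of that mapping (the helper name is ours; NUMERIC has no drop-in replacement above, so it is left for the caller to handle):

```java
public class RedshiftTypeMap {
    // Map common PostgreSQL column types to RedShift-friendly ones,
    // per the limitations above. Unmapped types pass through unchanged;
    // NUMERIC in particular needs a per-application decision.
    public static String toRedshiftType(String pgType) {
        switch (pgType.toUpperCase()) {
            case "TEXT":   return "VARCHAR(MAX)";    // TEXT is only 256 bytes on RedShift
            case "SERIAL": return "BIGINT IDENTITY"; // SERIAL is not supported
            default:       return pgType;
        }
    }
}
```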
These are the main feature issues we found migrating our database application. The real problems, however, were not features but performance.
The Real Problem — Transaction Performance
When we started this experiment we expected to get a RedShift database with blazing data reading speeds with a measurable performance hit when loading data. This is not exactly what we discovered…
Since RedShift is not designed for transactional type work we expected SQL statements like:
- INSERT INTO Account(…lots of columns) VALUES(….), (…), …
- UPDATE Account SET name='Fred', industry='Carnival', … WHERE Id='001…..'
- SELECT id, SystemModStamp FROM Account WHERE Id IN (…)
to be a bit slower than talking to PostgreSQL. Unfortunately, they were not a bit slower — they were more than 2 orders of magnitude slower.
In our first experiment we talked to RedShift over the tier 2 fiber in our offices (2 – 5 ms from the backbone). We used a single-row prepared SQL INSERT statement to insert 200 rows in a single transaction. For a baseline, we ran the same inserts against a local PostgreSQL database and a MySQL database hosted on Amazon/RDS; the baseline said we should see performance in the range of 50,000 to 80,000 records per minute. What did we see?
We saw less than 60 records per minute committed to RedShift. We checked the query log in RedShift and confirmed what we were seeing. Yuch!
We thought that perhaps RedShift would do better if we used multi-row prepared SQL INSERT statements instead. Not all databases support this feature, but PostgreSQL does. We refactored some code so we could specify how many rows to include in each SQL INSERT and ran our tests again.
- The change made little performance difference for our local PostgreSQL database or our Amazon/RDS MySQL instance.
- The change made no difference to RedShift performance. We tried a variety of row counts per INSERT and it did not help. We were bummed.
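The refactored statements were built along these lines (a sketch; the helper name is ours, and parameter binding is omitted):

```java
import java.util.StringJoiner;

public class MultiRowInsert {
    // Build "INSERT INTO table (c1, c2) VALUES (?, ?), (?, ?), ..." with
    // one parenthesized placeholder group per row, so a single prepared
    // statement can insert rowCount rows in one round trip.
    public static String build(String table, String[] columns, int rowCount) {
        StringJoiner cols = new StringJoiner(", ");
        StringJoiner oneRow = new StringJoiner(", ", "(", ")");
        for (String c : columns) {
            cols.add(c);
            oneRow.add("?");
        }
        StringJoiner rows = new StringJoiner(", ");
        for (int i = 0; i < rowCount; i++) {
            rows.add(oneRow.toString());
        }
        return "INSERT INTO " + table + " (" + cols + ") VALUES " + rows;
    }
}
```

For example, `build("Account", new String[]{"Id", "Name"}, 2)` yields `INSERT INTO Account (Id, Name) VALUES (?, ?), (?, ?)`.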
Perhaps we should have given up at this point, but we are gluttons for punishment. Perhaps the problem was that we were running outside of an Amazon data center. Here is what we did:
- We launched an EC2 instance running Ubuntu (using vncserver to get a desktop environment).
- We repeated our tests.
We were disappointed. The results were nearly the same! At this point we came to the conclusion that the Amazon/Redshift JDBC driver may be good for many things, but it is not the way to load data into RedShift.
To be fair, RedShift documentation recommends using the PostgreSQL COPY extension to load CSV data files. Trust them…they really mean it.
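For reference, that recommended path is a COPY from files staged in S3. A minimal sketch (the bucket path, IAM role, and table name below are placeholders, not values from our tests):

```sql
-- Bulk-load a CSV file staged in S3 into a RedShift table.
-- 's3://your-bucket/account.csv' and the IAM role ARN are placeholders.
COPY Account
FROM 's3://your-bucket/account.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV;
```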
Conclusions
The RedShift/JDBC driver is a fairly faithful representation of PostgreSQL FOR THE TYPE OF DATA that a person would generally want in a data warehouse. However, forget about simply porting a transaction type application using JDBC to load and maintain data. The basic SQL INSERT/UPDATE operations are just too slow.
If we build a product to load Salesforce into RedShift, we would do what we have recommended other people do:
- Use a transactional database to backup your Salesforce.
- Write custom code to copy selected data from your transactional database to RedShift.
- Use CSV files like the RedShift team recommends.
- Have fun using RedShift.