A frequent technique in engineering and manufacturing is to implement a multi-tier data framework.
One way is to combine:
- a fast transactional front end
- a large data storage and retrieval back end.
Bearing in mind that the smaller the database, the faster the transactions, you should design your schema to retain only the data most likely to be used. That means your keys and indexes become the mechanism for telling apart data that is no longer used, or very unlikely to be used. The schema designer therefore has to be extremely familiar with the dimensional aspects of the data sources and destinations. The dimensions of highest significance that I have found are time, regional relevance and activity relevance, so it is often possible to implement a distributed data system along exactly those dimensions.
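To make that concrete, here is a rough Python sketch; the dimension values, the 90-day window and the key format are all invented for illustration, not taken from any real system.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION_WINDOW = timedelta(days=90)  # illustrative "still likely to be used" window

def partition_key(txn_time: datetime, region: str, activity: str) -> str:
    """Derive a shard/partition label from the three dimensions."""
    return f"{activity}:{region}:{txn_time:%Y-%m}"

def is_retained(txn_time: datetime, now: Optional[datetime] = None) -> bool:
    """Keep only rows young enough to be worth holding on the fast front end."""
    now = now or datetime.now(timezone.utc)
    return now - txn_time <= RETENTION_WINDOW

# A deposit from last week stays on the front end; a five-year-old record
# would fall outside the window and be flushed to the big back end.
recent = datetime.now(timezone.utc) - timedelta(days=7)
print(partition_key(recent, "northeast", "deposit"), is_retained(recent))
```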
I have found that a distributed schema, even within the same locality, is very effective in reducing database sizes and hence speeding up transactions. For example, if someone is making a deposit, why would you even want their mortgage schema available? Why bother putting that data on the same server at all? Distributing your schema and data obviously enables you to handle a tremendous volume of transactions. That is presuming your client has sufficient financial resources to segregate the servers and you have enough familiarity with your sources and destinations of information to dissect them cleanly.
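Here is a minimal sketch of that kind of routing, with separate SQLite files standing in for separate servers; the activity names and file names are made up.

```python
import sqlite3

# Hypothetical mapping of activity to its own database/server. Plain SQLite
# files keep the sketch runnable; in practice these would be distinct servers.
ACTIVITY_SERVERS = {
    "deposit":  "deposits.db",   # day-to-day transactional schema
    "mortgage": "mortgages.db",  # kept entirely elsewhere
}

def connection_for(activity: str) -> sqlite3.Connection:
    """Route a transaction to the server holding that activity's schema."""
    return sqlite3.connect(ACTIVITY_SERVERS[activity])

# A deposit transaction never touches the mortgage server at all.
with connection_for("deposit") as conn:
    conn.execute("CREATE TABLE IF NOT EXISTS deposits (acct TEXT, amount REAL)")
    conn.execute("INSERT INTO deposits VALUES (?, ?)", ("A-1001", 250.0))
```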
Another consideration is to lower your front-end transactional database's normalisation to "two and a half" normal form rather than 3NF or BCNF. Reducing normalisation obviously cuts down on table joins, but it also entails writing the same incoming data to multiple records.
Since it is not optimally normalised, and since it is only a front-end transaction db, it should not be treated as the final authority.
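A rough sketch of what that duplication looks like, with invented table and column names and in-memory SQLite standing in for the front-end db:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Two and a half" normal form: the customer's name is copied into every
# transaction row so a deposit can be shown without a join at read time.
conn.executescript("""
    CREATE TABLE customer (acct TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE txn (txn_id INTEGER PRIMARY KEY,
                      acct TEXT,
                      cust_name TEXT,   -- duplicated, denormalised copy
                      amount REAL);
""")

def record_deposit(acct: str, name: str, amount: float) -> None:
    # The same incoming data lands in more than one record.
    conn.execute("INSERT OR REPLACE INTO customer VALUES (?, ?)", (acct, name))
    conn.execute("INSERT INTO txn (acct, cust_name, amount) VALUES (?, ?, ?)",
                 (acct, name, amount))

record_deposit("A-1001", "J. Smith", 250.0)
# The read path needs no join:
print(conn.execute("SELECT cust_name, amount FROM txn").fetchall())
```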
The strategy I recommend, however, is to combine your no-frills front-end transactional server with a technology like Sybase Adaptive Server, which I believe is a "self-normalising server". I realise there are other products similar to Sybase Adaptive Server, but I am not familiar with the others. Sybase Adaptive Server has a very fast read response but a slow update response, so it is best built beforehand and not used for updates at all.
You should therefore have a minimal front-end transaction db whose schema is optimised for updates only and, as much as possible, not used for queries. There are times when you need to query for authoritative information before updating your front-end db, and that is where the Adaptive Server should come in.
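A sketch of that flow, with one in-memory SQLite database standing in for the read-only, pre-built store and another for the lean front-end db; all names and figures are hypothetical.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")  # stands in for the read-optimised server
frontend  = sqlite3.connect(":memory:")  # stands in for the update-only front end

warehouse.execute("CREATE TABLE balances (acct TEXT PRIMARY KEY, balance REAL)")
warehouse.execute("INSERT INTO balances VALUES ('A-1001', 900.0)")
frontend.execute("CREATE TABLE pending_txn (acct TEXT, delta REAL)")

def post_withdrawal(acct: str, amount: float) -> bool:
    # 1. Query the authoritative figure from the pre-built store.
    row = warehouse.execute(
        "SELECT balance FROM balances WHERE acct = ?", (acct,)).fetchone()
    if row is None or row[0] < amount:
        return False
    # 2. Update only the lean front-end db; the pre-built store is never written.
    frontend.execute("INSERT INTO pending_txn VALUES (?, ?)", (acct, -amount))
    frontend.commit()
    return True

print(post_withdrawal("A-1001", 250.0))  # True: the balance covered it
```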
You pull authoritative data from the Adaptive Server and cache it in the local temp db of your front-end transactional server, only for the duration of a particular transaction. Your cache schema is therefore oriented towards performing as few joins as possible and is highly denormalised. If you have heard of agile development, I would like to coin the term "agile transaction": do only what is necessary for the present transaction.
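A sketch of such an "agile transaction" cache, using a SQLite TEMP table as the stand-in; the flattened columns are invented.

```python
import sqlite3

frontend = sqlite3.connect(":memory:")
# One flat TEMP table holding only what this particular transaction needs,
# so the transaction itself performs no joins at all.
frontend.execute("""
    CREATE TEMP TABLE txn_cache (
        acct       TEXT PRIMARY KEY,
        cust_name  TEXT,
        balance    REAL,
        fetched_at TEXT
    )
""")

def cache_for_transaction(authoritative_row: tuple) -> None:
    """Copy one denormalised row pulled from the authoritative source."""
    frontend.execute(
        "INSERT OR REPLACE INTO txn_cache VALUES (?, ?, ?, datetime('now'))",
        authoritative_row)

# Pretend this row just came back from the read-optimised server.
cache_for_transaction(("A-1001", "J. Smith", 900.0))
print(frontend.execute("SELECT * FROM txn_cache").fetchall())

frontend.close()  # the TEMP table disappears with the session
```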
You need two aging mechanisms for the data you cache. The first determines whether a cached record is still likely to be useful or can be dropped and flushed out. The second determines whether the cached data is out-of-phase or out-of-date. By being familiar with the data sources, you can design a minimal signature field to use as a key for fast comparisons. Frequently you need more than the time of the transaction to tell whether a record has aged. A similar signature field might need to exist in the authoritative database, unless the signature can be designed to depend solely on incoming transaction information such as time, session id, account number and so on. Frequently I found I had to combine the authoritative signature with the front-end signature and with the incoming data to perform the check. The comparison could be a checksum, a series of subtractions and additions, or whatever operation you devise that is quick and accurate.
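A sketch of both mechanisms under those assumptions; the signature recipe (account, session id, minute-level time, shortened MD5) and the 15-minute threshold are one possible choice, not a prescription.

```python
import hashlib
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=15)  # illustrative flush threshold

def signature(acct: str, session_id: str, txn_time: datetime) -> str:
    """A minimal signature built solely from incoming transaction info."""
    raw = f"{acct}|{session_id}|{txn_time:%Y%m%d%H%M}"
    return hashlib.md5(raw.encode()).hexdigest()[:12]  # short, quick to compare

def is_droppable(last_used: datetime, now: datetime) -> bool:
    """Aging mechanism 1: is the cached record unlikely to be useful again?"""
    return now - last_used > STALE_AFTER

def is_out_of_phase(cached_sig: str, authoritative_sig: str) -> bool:
    """Aging mechanism 2: does the cache disagree with the authority?"""
    return cached_sig != authoritative_sig

now = datetime.now(timezone.utc)
sig = signature("A-1001", "sess-42", now)
print(is_droppable(now - timedelta(minutes=20), now), is_out_of_phase(sig, sig))
```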
The problem with using a pre-built or compiled database is the issue of it being out-of-phase. What if a transaction needed to query an activity that occurred ten minutes ago, but the Adaptive Server has a scheduled refresh interval of 30 minutes? You have to take all of this into consideration when designing the whole set of schemas from front to back.
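The kind of check that mismatch forces on you might look like the following; the 30-minute interval and the fallback rule are purely illustrative.

```python
from datetime import datetime, timedelta, timezone

REFRESH_INTERVAL = timedelta(minutes=30)  # hypothetical refresh schedule

def prebuilt_can_answer(activity_time: datetime, last_refresh: datetime) -> bool:
    """The pre-built store only knows about activity before its last refresh."""
    return activity_time <= last_refresh

now = datetime.now(timezone.utc)
last_refresh = now - timedelta(minutes=25)
activity_time = now - timedelta(minutes=10)  # the ten-minute-old activity

if prebuilt_can_answer(activity_time, last_refresh):
    print("query the pre-built copy")
else:
    # The activity is newer than the last refresh, so the transaction must
    # fall back to the authoritative transactional source instead.
    print("fall back to the authoritative source")
```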
I have heard that older banking systems use a network (CODASYL) database for the front end, with the schema tightly coupled to the transaction model or pattern. Whether that is true, I cannot verify, but I can verify that the approach is used in engineering and manufacturing. A CODASYL schema can be designed to link its data nodes in a way that is optimised for a limited set of queries and updates; ad hoc queries and updates would therefore perform very badly against a CODASYL schema that has been designed for a specific, limited set of events.
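CODASYL DDL will not run here, so the toy Python sketch below only mimics the idea: owner records hold direct links to their member records, the navigation the schema was designed for is a simple pointer walk, and anything ad hoc degenerates into a scan. The record names are invented.

```python
from dataclasses import dataclass

@dataclass
class Deposit:
    amount: float
    next: "Deposit | None" = None           # chained member of the set

@dataclass
class Account:
    acct: str
    first_deposit: "Deposit | None" = None  # owner -> first member link

    def add_deposit(self, amount: float) -> None:
        self.first_deposit = Deposit(amount, self.first_deposit)

    def deposits(self):
        node = self.first_deposit
        while node:                          # the one walk the schema expects
            yield node.amount
            node = node.next

acct = Account("A-1001")
acct.add_deposit(250.0)
acct.add_deposit(75.0)
print(list(acct.deposits()))  # fast for this walk; ad hoc access needs a full scan
```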