John Ryan is an experienced data warehouse architect, developer, and database administrator. He specializes in Kimball dimensional design on multi-terabyte Oracle systems and has more than 30 years of IT experience across industries as varied as mobile telephony and investment banking.
This article was first published as part of a series of articles on databases and big data.
01 The world has changed
The world has changed dramatically in the past 20 years. In 2000, only a few million people were online, connecting from a desktop PC through a 56k modem, and Amazon sold only books. Today, billions of people are online 24 hours a day, 7 days a week on smartphones and tablets, buying almost everything online and interacting through social apps such as Facebook, Twitter, and Instagram. The trend is unstoppable.
Expectations have changed too. If a page does not refresh within a few seconds, we lose patience and go elsewhere. If a website is down, we fear it is the end of civilization as we know it. If a major website is down, it makes global headlines.
"Instant gratification takes too long!"
(Ladawn Clare-Panton)
Note: If you are not an experienced database architect, you may want to read my earlier articles on scalability and database architecture first.
02 What has changed?
The changes above lead to the following requirements:
- Scalability: Traffic can grow explosively, so IT systems need to scale quickly to handle exponential growth.
- High availability: IT systems must run 24 hours a day, 7 days a week, and must be fault tolerant. (In 2011, a Bank of America outage affected 29 million customers for six days.)
- High performance: As systems scale, performance has to keep up and stay consistently fast. Amazon estimates that, in the extreme case, every additional second of page load time costs the company $1.6 billion a year.
- Velocity: More and more devices ship with networked sensors (smartphones are the obvious example), so there may be millions of transactions to process every second.
- Real-time analytics: Overnight batch processing and business intelligence are no longer enough. The line between analytical and operational processing is blurring, and the need for real-time decisions keeps growing.
"The Internet of Things is sending velocity through the roof!"
(Dr. Michael Stonebraker, MIT)
These demands gave rise to the wonderful marketing term "translytical database": a hybrid solution in which the same system handles both massive transaction volumes and real-time analytics.
03 What's the problem?
Delivering high performance while cutting costs (perhaps by using commodity servers) is a challenge for every database vendor. The requirements, however, conflict with one another:
- Performance: minimize latency and complete transactions in milliseconds.
- Availability: keep running even if one or more nodes fail or are cut off from the network.
- Scalability: keep scaling to meet the demands of massive data volumes and transaction velocity.
- Consistency: provide consistent, accurate results, especially in the event of a network failure.
- Durability: ensure that a change, once committed, is never lost.
- Flexibility: provide a general-purpose database solution that supports both transactional and analytical workloads.
The only realistic way to scale massively and incrementally is to deploy a scale-out distributed system. Typically, to maximize availability, a change made on one node is immediately replicated to two or more other nodes. However, once data is distributed across multiple servers, trade-offs have to be made.
For example:
Performance versus availability and durability
Many NoSQL databases replicate data to other nodes in the cluster to improve availability. If a database node crashes immediately after a write, the data is already held on other machines, so the change is durable. However, this requirement can be relaxed so that the write returns immediately, before replication. That maximizes performance, but risks losing the change: it may not be durable at all.
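The trade-off described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the behavior of any particular NoSQL product: the `Node` and `Cluster` classes and the `wait_for_replicas` flag are hypothetical names invented for illustration.

```python
class Node:
    """A hypothetical in-memory database node."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """One primary plus replicas; all names here are illustrative."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.pending = []  # acknowledged writes not yet copied to replicas

    def write(self, key, value, wait_for_replicas):
        self.primary.data[key] = value
        if wait_for_replicas:
            # Durable path: copy to every replica before acknowledging.
            for replica in self.replicas:
                replica.data[key] = value
        else:
            # Fast path: acknowledge now, replicate later.
            self.pending.append((key, value))
        return "ack"

    def replicate_pending(self):
        """Ship queued writes to the replicas (a background task in practice)."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()

    def crash_primary(self):
        """Simulate losing the primary: queued writes are lost with it."""
        self.primary.data.clear()
        self.pending.clear()

    def read_after_failover(self, key):
        """Read from the first replica, as a client would after failover."""
        return self.replicas[0].data.get(key)


cluster = Cluster(Node("a"), [Node("b"), Node("c")])
cluster.write("order-1", "paid", wait_for_replicas=True)
cluster.write("order-2", "paid", wait_for_replicas=False)
cluster.crash_primary()
```

After the simulated crash, `order-1` is still readable from a replica, while `order-2` was acknowledged to the client but no longer exists anywhere.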
▲ Geographically distributed systems
Consistency versus availability
NoSQL databases support eventual consistency. For example, in the diagram above, if the network connection to New York is temporarily lost, there are two options:
- Stop processing: but availability in New York suffers.
- Keep accepting reads and writes: and reconcile the differences once the connection is restored. The risk is that stale or incorrect results are served in the meantime, and write conflicts may need to be resolved afterwards.
Clearly, NoSQL databases trade consistency for availability.
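To make the write-conflict risk concrete, here is a minimal sketch of one common reconciliation policy, last write wins. It is an assumption chosen for illustration, not a policy the article attributes to any product; the London site, the timestamps, and the balances are invented.

```python
def reconcile_last_write_wins(site_a, site_b):
    """Merge two replicas that diverged during a network partition.

    Each replica maps key -> (timestamp, value). For every key the newest
    timestamp wins and the older write is silently discarded.
    """
    merged = dict(site_a)
    for key, (timestamp, value) in site_b.items():
        if key not in merged or timestamp > merged[key][0]:
            merged[key] = (timestamp, value)
    return merged


# Both sites start from a balance of 1000 and accept a withdrawal while
# the link between them is down.
london = {"balance": (10, 900)}    # t=10: withdrew 100
new_york = {"balance": (12, 950)}  # t=12: withdrew 50

merged = reconcile_last_write_wins(london, new_york)
```

The merged balance is 950 because New York's later write replaces London's, so the London withdrawal is lost. A correct balance would be 850. Avoiding this requires either refusing writes during the partition or a merge rule that understands the operations.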
Flexibility versus scalability
Compared with general-purpose relational databases such as Oracle and DB2, NoSQL databases are relatively inflexible and, for example, do not support join operations. Apart from the many that do not support SQL at all, some (such as Neo4J and MongoDB) are designed for a specific class of problem: graph processing and JSON data structures, respectively.
Even databases such as HBase, Cassandra, and Redis give up relational joins, and many also restrict access to a single primary key, with no support for secondary indexes.
"Many databases claim to support 100% ACID transactions. In reality, few provide formal ACID guarantees."
(Dr. Peter Bailis, Stanford University)
04 ACID versus eventual consistency
One of the main challenges in scaling a database solution is maintaining ACID consistency. Amazon, with its DynamoDB database, relaxed the consistency constraints in exchange for speed. This solved the performance problem and led to a wave of NoSQL databases.
In addition, even the most successful databases (including Oracle) do not provide true ACID isolation. A study of 18 databases found that only three (VoltDB, Ingres, and Berkeley DB) support serializability by default. The main reason is that it is hard to support serializability while maintaining performance.
"Eventual consistency is a particularly weak model. The system can return any data and still be eventually consistent."
(Dr. Peter Bailis, Stanford)
Eventual consistency, on the other hand, provides almost no consistency guarantee. The figure below illustrates the problem. One user deducts $1 million from a bank account, but before the change is replicated, another user checks the balance of the same account. The only guarantee is that, provided there are no further writes, the system will eventually return a consistent result. What use is that? It is hardly acceptable.
▲ Cassandra: eventual consistency
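The stale read in the bank example can be reproduced with a toy model. This is a sketch under simplifying assumptions: one primary copy, one replica, and replication triggered by hand. The class name and the opening balance are invented for illustration and do not model Cassandra's internals.

```python
class EventuallyConsistentAccount:
    """Toy account with a primary copy and a lagging replica."""

    def __init__(self, balance):
        self.primary = balance
        self.replica = balance
        self.unreplicated = []  # withdrawals not yet applied to the replica

    def withdraw(self, amount):
        """Apply the withdrawal on the primary only; replication happens later."""
        self.primary -= amount
        self.unreplicated.append(amount)

    def read_from_replica(self):
        """A second user may be routed to the replica and see old data."""
        return self.replica

    def replicate(self):
        """Bring the replica up to date with the primary."""
        for amount in self.unreplicated:
            self.replica -= amount
        self.unreplicated.clear()


account = EventuallyConsistentAccount(balance=5_000_000)
account.withdraw(1_000_000)
stale_balance = account.read_from_replica()   # read before replication
account.replicate()
final_balance = account.read_from_replica()   # read after replication
```

The first read still reports the full opening balance even though the withdrawal has been accepted. Only the second read, taken after replication and with no further writes, reflects it.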
05 Rethinking the OLTP database
Ten years ago, Dr. Michael Stonebraker wrote the paper "The End of an Architectural Era", arguing that the 1970s database architectures offered by Oracle, Microsoft, and IBM were out of date.
He proposed that an OLTP database should have the following characteristics:
- Dedicated to a single problem: fast execution of short, predefined (not ad hoc) transactions with relatively simple query plans. In short, a dedicated OLTP platform.
- ACID compliant: all transactions run single threaded, with full serializability by default.
- Always available: high availability through data replication (rather than a hot standby), at almost no extra cost.
- Geographically distributed: runs seamlessly on a grid of dispersed machines (further improving resilience and local performance).
- Shared-nothing architecture: multiple machines connected in a peer-to-peer grid share the load. Adding a machine is a seamless operation with no downtime, and losing a node causes only a slight drop in performance rather than taking down the whole system.
- Memory based: runs entirely in memory for absolute speed, with durability provided by in-memory replication to other nodes.
- Bottlenecks eliminated: the database internals are completely redesigned to run single threaded, removing redo logging and the need for locking and latching, which are among the most significant constraints on database performance.
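The single-threaded idea in the list above can be sketched as follows. This is a simplified illustration, not H-Store's actual design: the `Partition` class and the `transfer` procedure are hypothetical. The point is that when one thread owns a partition and runs predefined transactions one at a time, execution is serial by construction and no locks or latches are needed.

```python
from collections import deque


class Partition:
    """Data owned by exactly one thread, so no locks or latches are taken."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.queue = deque()

    def submit(self, procedure, *args):
        """Queue a predefined transaction for later execution."""
        self.queue.append((procedure, args))

    def run(self):
        """Execute queued transactions strictly one after another."""
        results = []
        while self.queue:
            procedure, args = self.queue.popleft()
            results.append(procedure(self.rows, *args))
        return results


def transfer(rows, source, target, amount):
    """A predefined transaction: move money if the source can cover it."""
    if rows.get(source, 0) < amount:
        return "rejected"
    rows[source] -= amount
    rows[target] = rows.get(target, 0) + amount
    return "committed"


partition = Partition({"alice": 100, "bob": 0})
partition.submit(transfer, "alice", "bob", 70)
partition.submit(transfer, "alice", "bob", 70)
results = partition.run()
```

The second transfer sees the effect of the first and is rejected, which is the outcome a serializable schedule requires. Scaling out then means adding partitions, not adding threads to one partition.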
To prove that this was possible, he built a prototype, the H-Store database, and showed that on the same hardware it ran the TPC-C benchmark 82 times faster than a commercial competitor. The H-Store prototype processed 70,000 transactions per second, while the commercial competitor managed only 850 despite considerable tuning effort by database administrators.
06 Nothing is impossible!
Dr. Stonebraker's results are impressive. The previous TPC-C world record was about 1,000 transactions per second per CPU core, yet H-Store, running on a dual-core 2.8GHz desktop computer, was 35 times faster than that record. His 2008 paper "OLTP Through the Looking Glass" explains why commercial databases (including Oracle) perform so poorly.
▲ Processing resource consumption of a relational database
As shown above, 93% of system overhead goes to traditional (legacy) database mechanisms, including locking, latching, and buffer management. Only 7% of machine resources are spent on the task at hand.
Simply by eliminating these bottlenecks and using memory-based rather than disk-based processing, H-Store achieved the seemingly impossible: full ACID transactional consistency together with a speedup of several orders of magnitude.
07 NewSQL database technology
VoltDB, first released in 2010, is the commercial product built on the H-Store prototype. It is a dedicated OLTP platform for web-scale transaction processing and real-time analytics. According to one infographic, of some 250 commercial database solutions, only 13 are classified as NewSQL technology.
Like other NewSQL databases, VoltDB is designed to run entirely in memory, with the option of taking periodic disk snapshots. It can run on premises on 64-bit Linux or on the AWS, Google, and Azure cloud services, and it uses a horizontally scalable architecture.
Traditional relational databases write changes to disk-based log files. VoltDB instead applies each change in memory on multiple machines at the same time. For example, a K-safety factor of 2 guarantees that no data is lost even if two machines fail, because the data is held in memory on at least three nodes.
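The arithmetic behind K-safety is simple enough to state as code. The sketch below encodes only the rule given above, that a factor of K means K + 1 in-memory copies. The function names are hypothetical and this is not VoltDB code.

```python
def copies_required(k_safety):
    """A K-safety factor of K keeps K + 1 copies of every partition."""
    return k_safety + 1


def survives(k_safety, failed_copies):
    """True if at least one copy of a partition remains.

    failed_copies counts failed nodes that all held a copy of the same
    partition, which is the worst case.
    """
    return copies_required(k_safety) - failed_copies >= 1
```

With a factor of 2, `copies_required(2)` is 3, so two failures are survivable and a third is not.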
Transactions are submitted as Java stored procedures, which can be executed asynchronously in the database. Data is automatically partitioned (sharded) across the nodes of the system, although reference data can be replicated to maximize join performance. Somewhat unusually, VoltDB also supports semi-structured data in the form of JSON data structures.
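Partitioning with replicated reference data can be illustrated in a few lines. This is a minimal sketch, assuming simple CRC32 hash routing. It is not VoltDB's actual partitioning scheme, and `ShardedCluster` and its methods are hypothetical names.

```python
import zlib


def partition_for(key, partition_count):
    """Route a key to a partition with a stable hash (CRC32 of its text form)."""
    return zlib.crc32(str(key).encode("utf-8")) % partition_count


class ShardedCluster:
    """Partitioned rows plus replicated reference data; names are illustrative."""

    def __init__(self, partition_count):
        self.partitions = [{} for _ in range(partition_count)]
        # Each partition holds its own full copy of the reference data.
        self.reference = [{} for _ in range(partition_count)]

    def insert(self, key, row):
        """Store the row on exactly one partition, chosen by its key."""
        self.partitions[partition_for(key, len(self.partitions))][key] = row

    def add_reference(self, code, description):
        """Copy a reference row to every partition."""
        for copy in self.reference:
            copy[code] = description

    def lookup_with_join(self, key):
        """Single-partition read: the row and its reference data sit together."""
        index = partition_for(key, len(self.partitions))
        row = self.partitions[index][key]
        return row["customer"], self.reference[index][row["country"]]


cluster = ShardedCluster(partition_count=4)
cluster.add_reference("GB", "United Kingdom")
for order_id in range(100):
    cluster.insert(order_id, {"customer": f"c{order_id}", "country": "GB"})
```

Because every partition carries a copy of the small reference table, the join in `lookup_with_join` never has to leave the partition that owns the row. That is the benefit of replicating reference data instead of sharding it.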
In terms of performance, a 2015 benchmark showed VoltDB processing at least twice as fast as the NoSQL database Cassandra, at only one sixth of the AWS cloud processing cost.
Finally, VoltDB version 6.4 passed the notoriously demanding Jepsen distributed safety tests.
By comparison, earlier tests of the NoSQL database Riak showed that it dropped 30-70% of writes even with the strongest consistency setting. Meanwhile, Cassandra lost up to 5% of writes when using lightweight transactions.
Like VoltDB, MemSQL is a horizontally scalable, in-memory distributed database designed for fast data ingestion and real-time analytics. It too can run on premises or in the cloud, partitions data automatically across nodes, and executes queries in parallel on every CPU core.
▲ Processing resource consumption of relational database
Despite the many similarities to VoltDB, the figure above shows an important difference. MemSQL tries to balance the conflicting demands of real-time transactions and data-warehouse-style processing of historical data. To do so, it holds data in an in-memory row store backed by a column-oriented disk store, so that real-time (recent) data can be combined with historical results.
This gives it a firm footing in both the OLTP and Data Warehouse fields, although both solutions target the market for real-time data ingestion and analytics.
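The row store and column store combination can be pictured with plain Python structures. The sketch below only illustrates the two layouts and a query that spans both. It is not how MemSQL is implemented, and the table contents and function names are invented.

```python
# Recent data: a row store keeps each record together, which suits
# single-row inserts and lookups.
recent_rows = [
    {"order_id": 101, "amount": 25},
    {"order_id": 102, "amount": 40},
]

# Historical data: a column store keeps each column together, which suits
# scans and aggregates over one column.
history_columns = {
    "order_id": [1, 2, 3],
    "amount": [10, 20, 30],
}


def insert_order(order_id, amount):
    """New transactions are appended to the row store."""
    recent_rows.append({"order_id": order_id, "amount": amount})


def total_amount():
    """An analytic query that combines recent rows with historical columns."""
    recent = sum(row["amount"] for row in recent_rows)
    historical = sum(history_columns["amount"])
    return recent + historical


total_before = total_amount()
insert_order(103, 5)
total_after = total_amount()
```

The aggregate over history touches a single list, while the insert touches a single record. Each layout is doing the job it is suited to.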
10 Which applications need NewSQL technology?
Any application that needs very fast ingestion and response (1-2 milliseconds on average) together with the transactional accuracy that ACID guarantees provide, for example customer billing.
Typical applications include:
- Real-time authorization: for example, validating, recording, and authorizing mobile phone calls for analysis and billing. Typically, 99.999% of database operations must complete within 50 milliseconds.
- Real-time fraud detection: runs complex analytic queries to accurately determine the likelihood of fraud before a transaction is authorized.
- Game analytics: adjusts game difficulty dynamically in real time based on a player's ability and typical behavior. The goal is to retain existing players and convert free customers into paying players. Working under requirements for speed, high availability, and accuracy, one customer used these techniques to increase player spending by 40%.
- Personalized web advertising: selects personalized web ads dynamically in real time, records ad impressions for billing, and records the results for later analysis.
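The real-time authorization requirement above (99.999% of operations within 50 milliseconds) is easy to misjudge from an average, so a small check is useful. This is a sketch with a hypothetical function name. It uses integer arithmetic to avoid floating-point rounding at the boundary.

```python
def meets_latency_target(latencies_ms, limit_ms=50, required_per_100k=99_999):
    """True if enough operations finished within the latency limit.

    required_per_100k=99_999 expresses the 99.999% target: at most one
    operation in every 100,000 may exceed limit_ms.
    """
    if not latencies_ms:
        raise ValueError("no latency samples supplied")
    within = sum(1 for latency in latencies_ms if latency <= limit_ms)
    return within * 100_000 >= len(latencies_ms) * required_per_100k


# One million operations at 2 ms each, with a handful of 80 ms outliers.
ten_slow = [2] * 999_990 + [80] * 10
eleven_slow = [2] * 999_989 + [80] * 11
```

Both samples have an average latency close to 2 milliseconds, yet only the first meets the target: ten slow operations in a million is exactly the allowance, and the eleventh breaks it.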
Compared with the vast majority of OLTP applications, none of this looks remarkable at first. But in a world that is online 24 hours a day, 7 days a week, these applications open a new frontier for real-time analytics, and the rise of the Internet of Things brings great opportunities.
Although Hadoop is more closely associated with big data and has received a lot of attention lately, database technology remains the cornerstone of any IT system.
Similarly, NoSQL databases offer a fast, scalable alternative to relational databases, but however tempting a license-free open source database may be, you still get what you pay for. Moreover, as VoltDB demonstrates, a NewSQL database may in fact be cheaper than the NoSQL options in the long run.
Overall, if you have web-scale, OLTP, and/or real-time analytics requirements, you should consider a NewSQL database.