Google Globally-Distributed Database

PDF Publication Title:

Google Globally-Distributed Database ( google-globally-distributed-database )

Previous Page View | Next Page View | Return to Search List

Text from PDF Page: 008

It then describes some refinements of the basic schemes as described. 4.2.1 Read-Write Transactions Like Bigtable, writes that occur in a transaction are buffered at the client until commit. As a result, reads in a transaction do not see the effects of the transaction’s writes. This design works well in Spanner because a read returns the timestamps of any data read, and uncommit- ted writes have not yet been assigned timestamps. Reads within read-write transactions use wound- wait [33] to avoid deadlocks. The client issues reads to the leader replica of the appropriate group, which acquires read locks and then reads the most recent data. While a client transaction remains open, it sends keepalive messages to prevent participant leaders from timing out its transaction. When a client has completed all reads and buffered all writes, it begins two-phase commit. The client chooses a coordinator group and sends a commit message to each participant’s leader with the identity of the coordinator and any buffered writes. Having the client drive two-phase commit avoids send- ing data twice across wide-area links. A non-coordinator-participant leader first acquires write locks. It then chooses a prepare timestamp that must be larger than any timestamps it has assigned to pre- vious transactions (to preserve monotonicity), and logs a prepare record through Paxos. Each participant then no- tifies the coordinator of its prepare timestamp. The coordinator leader also first acquires write locks, but skips the prepare phase. It chooses a timestamp for the entire transaction after hearing from all other partici- pant leaders. The commit timestamp s must be greater or equal to all prepare timestamps (to satisfy the constraints discussed in Section 4.1.3), greater than TT.now().latest at the time the coordinator received its commit message, and greater than any timestamps the leader has assigned to previous transactions (again, to preserve monotonic- ity). The coordinator leader then logs a commit record through Paxos (or an abort if it timed out while waiting on the other participants). Before allowing any coordinator replica to apply the commit record, the coordinator leader waits until TT.after(s), so as to obey the commit-wait rule described in Section 4.1.2. Because the coordinator leader chose s based on TT.now().latest, and now waits until that time- stamp is guaranteed to be in the past, the expected wait is at least 2 ∗ ε. This wait is typically overlapped with Paxos communication. After commit wait, the coordi- nator sends the commit timestamp to the client and all other participant leaders. Each participant leader logs the transaction’s outcome through Paxos. All participants apply at the same timestamp and then release locks. 4.2.2 Read-Only Transactions Assigning a timestamp requires a negotiation phase be- tween all of the Paxos groups that are involved in the reads. As a result, Spanner requires a scope expression for every read-only transaction, which is an expression that summarizes the keys that will be read by the entire transaction. Spanner automatically infers the scope for standalone queries. If the scope’s values are served by a single Paxos group, then the client issues the read-only transaction to that group’s leader. (The current Spanner implementa- tion only chooses a timestamp for a read-only transac- tion at a Paxos leader.) That leader assigns sread and ex- ecutes the read. For a single-site read, Spanner gener- ally does better than TT.now().latest. Define LastTS() to be the timestamp of the last committed write at a Paxos group. If there are no prepared transactions, the assign- ment sread = LastTS() trivially satisfies external consis- tency: the transaction will see the result of the last write, and therefore be ordered after it. If the scope’s values are served by multiple Paxos groups, there are several options. The most complicated option is to do a round of communication with all of the groups’s leaders to negotiate sread based on LastTS(). Spanner currently implements a simpler choice. The client avoids a negotiation round, and just has its reads execute at sread = TT.now().latest (which may wait for safe time to advance). All reads in the transaction can be sent to replicas that are sufficiently up-to-date. 4.2.3 Schema-Change Transactions TrueTime enables Spanner to support atomic schema changes. It would be infeasible to use a standard transac- tion, because the number of participants (the number of groups in a database) could be in the millions. Bigtable supports atomic schema changes in one datacenter, but its schema changes block all operations. A Spanner schema-change transaction is a generally non-blocking variant of a standard transaction. First, it is explicitly assigned a timestamp in the future, which is registered in the prepare phase. As a result, schema changes across thousands of servers can complete with minimal disruption to other concurrent activity. Sec- ond, reads and writes, which implicitly depend on the schema, synchronize with any registered schema-change timestamp at time t: they may proceed if their times- tamps precede t, but they must block behind the schema- change transaction if their timestamps are after t. With- out TrueTime, defining the schema change to happen at t would be meaningless. Published in the Proceedings of OSDI 2012 8

PDF Image | Google Globally-Distributed Database

PDF Search Title:

Google Globally-Distributed Database

Original File Name Searched:

spanner-osdi2012.pdf

DIY PDF Search: Google It | Yahoo | Bing

Cruise Ship Reviews | Luxury Resort | Jet | Yacht | and Travel Tech More Info

Cruising Review Topics and Articles More Info

Software based on Filemaker for the travel industry More Info

The Burgenstock Resort: Reviews on CruisingReview website... More Info

Resort Reviews: World Class resorts... More Info

The Riffelalp Resort: Reviews on CruisingReview website... More Info

CONTACT TEL: 608-238-6001 Email: greg@cruisingreview.com (Standard Web Page)