Hacker News
Inside Google Spanner, the Largest Single Database on Earth (wired.com)
204 points by Libertatea on Nov 26, 2012 | 70 comments


What kind of accuracy exists between servers inside the same data center? I assume there are some internal delays (OS stacks, switches, etc) when synchronizing time inside a server group.

I mean, even if you had a picosecond accurate clock available for use inside a server farm, you would still need a way to query it with a known (not necessarily zero; just known) latency to synchronize several machines. Servers are not known latency machines (unless specialized hardware is involved).

How is that accomplished?

And what happens when two transactions happen below the system accuracy limit? (like two transactions pertaining to the same data, 20ns apart, in different servers; impossible to order).

Surely they have solved this, I just wonder how.


Looking at the paper it looks like instead of dealing with exact moments in time (eg timestamps) they have ranges (they call them intervals).

From the Spanner paper: "TrueTime explicitly represents time as a TTinterval, which is an interval with bounded time uncertainty (unlike standard time interfaces that give clients no notion of uncertainty)"

By representing time as intervals you can tell if one interval is definitely before or after another interval. If the intervals overlap, however, then it means there's some uncertainty. I haven't quite got through the paper far enough to understand how this kind of thing is handled. :-)
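To make the interval comparison concrete, here's a rough Python sketch (my own naming, not the paper's actual TrueTime API) of ordering two uncertain timestamps:

```python
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # lower bound on the true time, in seconds
    latest: float    # upper bound on the true time

    def definitely_before(self, other: "TTInterval") -> bool:
        # True only if every instant in self precedes every instant in other.
        return self.latest < other.earliest

    def overlaps(self, other: "TTInterval") -> bool:
        # Overlapping intervals: the true ordering is unknowable.
        return not (self.definitely_before(other) or other.definitely_before(self))

a = TTInterval(10.000, 10.004)   # each carries ~4ms of uncertainty
b = TTInterval(10.006, 10.010)
c = TTInterval(10.003, 10.007)

print(a.definitely_before(b))  # True: the intervals are disjoint
print(a.overlaps(c))           # True: the ordering of a and c is uncertain
```

The overlap case is exactly the one Spanner has to design around.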

But I'm guessing that this kind of design means they're always consciously designing with error in mind. There's probably some acceptable amount of error in timing that they're able to carefully manage.

Edit: link to the paper for the lazy [pdf] http://static.googleusercontent.com/external_content/untrust...

Edit 2: It looks like the error typically ranges from about 1 to 7ms, averaging around 4ms. Outages can cause spikes in this error margin.


They handle uncertainty by delaying the transaction commit. Suppose you commit a transaction and at that moment, the error bar for time (on the server that handles your transaction) is 4 ms. The server will wait 4 ms to tell you the transaction completed. The result is that if two transactions commit near-simultaneously, it's random which will be assigned a lower timestamp, but guaranteed that no one will observe either transaction until after the timestamp at which that transaction is deemed to have committed. Basically, they handle the window of uncertainty by pausing access to the affected rows so that no one can peek into the window.
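A toy single-machine sketch of that commit-wait rule (the `tt_now` here is fabricated, with a fixed 4ms uncertainty; the real TrueTime is backed by GPS and atomic clocks):

```python
import time

EPS = 0.004  # assumed +/- 4ms clock uncertainty

def tt_now():
    # Fake TrueTime call: returns (earliest, latest) bounds on the true time.
    t = time.time()
    return (t - EPS, t + EPS)

def commit(apply_write):
    s = tt_now()[1]          # transaction timestamp: latest possible true time
    apply_write(s)
    # Commit wait: don't acknowledge the client until the true time has
    # definitely passed s, so nobody can observe the write "before" s.
    while tt_now()[0] <= s:
        time.sleep(0.001)
    return s

writes = []
ts = commit(writes.append)   # blocks roughly 2 * EPS before returning
```

The blocking loop is the "pausing access" described above: the larger the error bar, the longer every commit stalls, which is why Google cares so much about keeping the uncertainty small.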


Temporal row-level locking.


> I mean, even if you had a picosecond accurate clock available for use inside a server farm, you would still need a way to query it with a known (not necessarily zero; just known) latency to synchronize several machines. Servers are not known latency machines (unless specialized hardware is involved).

I think it's just that traditional NTP (which is itself based on atomic clocks and/or GPS) done over 'the Internet' is subject to so much latency that it becomes unreliable for their needs (e.g. Spanner).

By putting the equivalent of their own NTP master servers (based on GPS and atomic clocks) in each of their major data centres they solve that part of the problem. The possible LAN latency is much more controllable and reliable, to within tolerances that makes Spanner workable.

The clever bit isn't putting their own NTP master servers in each data centre, it's how Spanner works when using this info. The article puts far too much emphasis on the former whilst glossing over the latter.


Their alternative to NTP (and, it should be noted, NTP is never once mentioned in their paper), the TrueTime API, is actually pretty impressive. They are able to maintain <10ms clock uncertainty 99.9% of the time across data centers all over the world. That's not something you can easily do with NTP over the internet. Placing Stratum-0 devices (GPS receivers and atomic clocks) in their data centers isn't unheard of, but that alone won't get you the kind of global temporal synchronization that Google is able to achieve with the TrueTime API.

I'm very much looking forward to the paper on that protocol/system, which will hopefully be forthcoming - they tease it with this: "This section describes the TrueTime API and sketches its implementation. We leave most of the details for another paper: our goal is to demonstrate the power of having such an API."


I'd be worried about much more than 20ns. Someone emailed the ntp list not too long ago asking about getting nanosecond-resolution time, and one of the responses pointed out that light travels about a foot in a second. So do you want the time on this side of the room, or that side?



Presumably you meant, "travels about a foot in a nanosecond." From Google, "the speed of light / (10^9) = 0.299792458 m / s" -> about one foot.



Light travels about 300,000 km in a second. I think you meant nanosecond.


A foot a second is off by several orders of magnitude. It's closer to 300 thousand km per second in vacuum, and around 200 thousand km per second in fiber.


He obviously means a foot in a nanosecond.


Whoops, meant nanosecond. Thanks.


TrueTime measures a point in time plus/minus an error interval. The API allows a program to query for when a time/error pair has definitely passed. Each transaction acquires locks using 2-phase locking. At commit time, the time/error are read and the decision to commit is stored durably using Paxos. The locks are then held until the time of the transaction has definitely passed. This guarantees that no process which grabs the locks after they are released can commit at an earlier time than previous holders of the lock.
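Here's a toy single-process model of that scheme (threading locks standing in for Spanner's distributed locks, and the durable Paxos write omitted), showing why a later lock holder necessarily gets a later timestamp:

```python
import threading
import time

EPS = 0.004  # assumed clock uncertainty bound, seconds

def tt_after(t):
    # Has the true time definitely passed t, given the uncertainty?
    return time.time() - EPS > t

class Row:
    def __init__(self):
        self.lock = threading.Lock()
        self.commit_ts = 0.0

def commit_txn(rows, do_writes):
    for r in rows:                     # 2-phase locking: acquire everything
        r.lock.acquire()
    try:
        ts = time.time() + EPS         # latest possible true time at commit
        do_writes(ts)                  # (a durable Paxos write in the real system)
        while not tt_after(ts):        # hold the locks until ts has surely passed
            time.sleep(0.001)
        for r in rows:
            r.commit_ts = ts
    finally:
        for r in rows:
            r.lock.release()
    return ts

row = Row()
t1 = commit_txn([row], lambda ts: None)
t2 = commit_txn([row], lambda ts: None)
assert t2 > t1   # the wait guarantees monotonic commit timestamps per row
```

Since the locks aren't released until `time.time() - EPS > t1`, the next holder's timestamp `time.time() + EPS` is strictly greater than `t1`, which is the guarantee described above.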


It's interesting to see that the creators of BigTable and the early proponents of eventual consistency have invested the last 4.5 years building a system that adds back strong consistency guarantees.

If the Spanner paper is as important as BigTable, ACID may become the new goal for those building distributed systems.

Full disclosure: I'm with FoundationDB, which is a distributed NoSQL database with high performance cross-node ACID transactions. http://www.foundationdb.com


> It's interesting to see that the creators of BigTable and the early proponents of eventual consistency have invested the last 4.5 years building a system that adds back strong consistency guarantees.

I think it's a large misconception that these are competing ideas. They're not. They just represent different use cases: search results don't need to be consistent within seconds of events. Customer information, however, does.


I agree that different projects prefer different tools, but these data stores most certainly do compete with each other for developers' use within Google.

Per the paper: "At least 300 applications within Google use Megastore (despite its relatively low performance) because its data model is simpler to manage than Bigtable’s, and because of its support for synchronous replication across datacenters."

So, the current trade-off is ACID vs. performance, but I think the interesting point is that ACID is winning more and more now that a distributed ACID database is available to their developers.

Again, according to Google: "We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions."


That doesn't sound like competition; it sounds like they should have been using ACID all along.

In any case, I don't think Google will stop using BigTable any time soon, especially with Spanner's massive latencies.


I'm with FoundationDB, which is a distributed NoSQL database with high performance cross-node ACID transactions.

Which isn't at all comparable with Spanner, since Spanner works between clusters/datacenters, unlike FoundationDB.

"Distributed" is getting severely overloaded when it comes to databases. We need to stop calling databases with sub-millisecond latency to other nodes "distributed". They're not distributed, they're clustered, and the techniques you can use with them are vastly different.

Consistent transactions with a clustered database are a solved problem, full stop, and available to every developer with virtually any database backend, simply by building your transactions on top of Apache Zookeeper, and running whatever durable database you want underneath.

Doing consistent distributed transactions at a global scale, which is what Spanner does, is so far novel. Doing ACID transactions in a 24-node cluster (what is described at FoundationDB's website, for instance) isn't impressive these days, and if that's all Google had done, you'd see a collective yawn around the web.


From what I understand of this specific project, they "solved" the consistency problem (which is quite a feat, granted).

They don't say they have an ACID NoSQL database, which is, to me, an oxymoron: ACIDity is useful if you have a powerful query language.

In the end don't you fear that you might simply reinvent SQL? Or, am I missing something?


ACID and NoSQL are not mutually exclusive, although that is a common misconception due to the fact that almost all NoSQL databases gave up ACID in exchange for BASE (eventual consistency), making the development of their distributed databases much easier. However, ACID is simply a set of properties that allows you to group a set of reads & writes to a database into a transaction that is Atomic, Consistent, Isolated, and Durable. That set of reads & writes does not need to be SQL - in our case, it is a set of reads from and writes to a database of ordered keys and values.

For example, you could read the values associated with keys "a" and "b" and "c" and based on that information, make a change to the values associated with keys "c" and "d" and then commit the transaction. The transaction then either fails completely (if one of the keys you read had been written to since your read, in which case you re-try the transaction) or succeeds completely. That is an ACID transaction in a (NoSQL) key-value store.
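A minimal in-memory sketch of that fail-and-retry behavior (hypothetical names, not FoundationDB's actual client API):

```python
class Store:
    def __init__(self):
        self.data = {}      # key -> value
        self.version = {}   # key -> version number, bumped on each write

def run_transaction(store, txn_fn, retries=10):
    for _ in range(retries):
        read_versions = {}

        def read(key):
            # Record the version of every key we read.
            read_versions[key] = store.version.get(key, 0)
            return store.data.get(key)

        writes = {}
        txn_fn(read, writes)

        # Commit check: fail if any key we read was written since our read.
        if all(store.version.get(k, 0) == v for k, v in read_versions.items()):
            for k, v in writes.items():
                store.data[k] = v
                store.version[k] = store.version.get(k, 0) + 1
            return True
        # Otherwise loop and re-run the transaction function.
    return False

store = Store()
store.data.update({"a": 1, "b": 2})

def txn(read, writes):
    writes["c"] = read("a") + read("b")

run_transaction(store, txn)
print(store.data["c"])  # 3
```

In a single thread the conflict check always passes; the point is the shape of the loop, where a concurrent writer bumping a version forces a retry.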


That sounds useful, but not as useful as transactions on joins.

It's great to have transactions when you have to cross-reference data; on a key/value store, although you can have range queries, I submit that kind of operation is less frequent.

But still, it sure is great to be able to run a batch of operations with the confidence it will be transactional, I'm sure there are many use cases that can benefit from it.

Oddly enough, there aren't more NoSQL engines offering this feature, since once you have MVCC you've done the hard part, and AFAIK several NoSQL dbs have MVCC.


Yep, those types of operations are less frequent in k/v stores, but we believe it is because of the lack of ACID in most of them. When you have ACID and ordering you can use transactions to keep data and multiple indexes on that data in sync, and use range operations to query the indexes.

You can build all manner of higher level data structures with a transactional ordered k/v store, which is why our concept of "layers" is possible: www.foundationdb.com/#layers


It seems to me more like they have "solved" (or broken) the CAP theorem ... http://en.wikipedia.org/wiki/CAP_theorem

* Consistency (all nodes see the same data at the same time)

* Availability (a guarantee that every request receives a response about whether it was successful or failed)

* Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three.


The CAP theorem states that at any one point in time you can only have two of the three.

It doesn't say much more than the obvious. Obviously, if one node parts, you're either inconsistent or available.


Just to clarify what the CAP theorem says: If one node parts, then that node is either inconsistent or not available. A fault-tolerant database could potentially stay up with a single unavailable node.


No, they haven't broken the CAP theorem. If the network were to partition badly enough, Spanner would lose its quorum and would become unavailable, at least for guaranteed-consistent operations.

In practice, a network partition bad enough to bring down something like Spanner is pretty rare. (I worked at Google for four years, and I can't recall such a partition occurring.) But in theory it's certainly possible.


Which part of acid is relevant specifically to SQL (joins, I assume you mean)?


I had to chuckle when I read this:

"As Fikes points out, Google had to install GPS antennas on the roofs of its data centers and connect them to the hardware below."

This is usually one of the first things an enterprising sysadmin does at companies when they first start thinking about time - drop a GPS receiver on the roof (and they usually come up with a bunch of cool graphs showing where all the satellites are over time).

Soon thereafter, after a bit of reading about the NTP protocol, they realize that just adding:

  server 0.pool.ntp.org
  server 1.pool.ntp.org
  server 2.pool.ntp.org
  server 3.pool.ntp.org
to their ntp.conf is sufficient for 99.99% of all endeavors which require accurate time, outside of big physics, and, apparently Google's Spanner Database.

This part was a bit incomplete:

"Typically, data-center operators keep their servers in sync using what’s called the Network Time Protocol, or NTP. This is essentially an online service that connects machines to the official atomic clocks that keep time for organizations across the world. But because it takes time to move information across a network, this method is never completely accurate,"

Much of the purpose (and math) behind the NTP protocol is to deal with network lag. And it does a pretty good job doing so.

Reading about the True Time Api at: http://static.googleusercontent.com/external_content/untrust...

"This implementation keeps uncertainty small (generally less than 10ms) by using multiple modern clock references (GPS and atomic clocks)"

So - apparently 10ms is their breakpoint. 10ms is about the limit of what you can expect out of NTP, so I guess it makes sense that if Google needs to do better than 10ms, something of their own invention would be required. Cool graph in the paper showing that 99.9% of clock deviations across data centers thousands of kilometers apart are < 10ms.


Using NTP is ok until your network connection is broken for hours. Think backhoe into the fibre.


Data centers don't lose network connectivity - they're designed to have ingress and egress far enough apart that a local disaster can't take down the entire ring. Also, as I noted, one of the first things that enterprising sysadmins do is toss a Stratum-0 receiver (GPS receiver) on the roof, which feeds into the NTP protocol.

Finally - the entire purpose behind the TrueTime API is to sync up Spanner's commits so they are consistent across the replicated/sharded nodes. Without a network, that database capability would come to a halt way before network time became a problem.

The bigger issue is the <10ms requirements. NTP over the Internet does not get you that consistently, and certainly not at the 99.9% success that Google was able to achieve with TrueTime API.


View all pages: http://www.wired.com/wiredenterprise/2012/11/google-spanner-...

Also: > “We can commit data at two different locations — say the West Coast [of the United States] and Europe — and still have some agreed upon ordering between them,” Fikes says, “So, if the West Coast write happens first and then the one in Europe happens, the whole system knows that — and there’s no possibility of them being viewed in a different order.”

That's a large enough scale that you have to deal with relativity (light takes almost precisely 0.03 seconds to go from Palo Alto to Paris, eg). So in some sense there is no correct ordering. Anyone know how they deal with this? Have they just chosen some arbitrary point to make their reference frame, for purposes of ordering commits?
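(The 0.03 s figure checks out, assuming a great-circle distance of roughly 9,000 km:)

```python
c = 299_792_458        # speed of light in vacuum, m/s
distance_m = 8_980e3   # rough great-circle Palo Alto -> Paris, metres (assumed figure)
print(distance_m / c)  # one-way light travel time, about 0.03 seconds
```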


you're confusing finite speed of light and relativity - they're not really the same (although they are of course related). the relativistic sense of "no correct ordering" is irrelevant to the problem that google are solving (two observers, watching google's machines, from different, high-speed spaceships, may still disagree about the order of what happens in paris and the usa, or even in two machines in the same datacentre, but google doesn't care).

they've chosen gps time as a universal clock.


The entire underpinnings of GPS rely on the speed of light to calculate positions: the GPS satellites send their position, velocity, and a current time stamp, which can be used to compute a distance from the satellite. Through multiple distance readings from multiple satellites, a position can be determined. Relativity does in fact play a role: since the satellites are not geosynchronous, they are moving relative to Earth's reference frame. This creates non-negligible errors in GPS, which have been accounted for in the GPS equations; if they hadn't been, we would have much more error in GPS.


there are at least two relativistic effects involved in gps timings - http://www.aapt.org/doorway/tgru/articles/ashbyarticle.pdf - but they would not be solved if the satellites were in geosynchronous orbit as the earth itself does not have a single non-accelerating reference frame.


If you want more detail, this article links a research paper that describes the system in detail. Very clever focussing on the timebase as a way to improve distributed consistency; I'd always assumed NTP was sufficient. http://static.googleusercontent.com/external_content/untrust...

The part of the article that stood out to me is that Spanner is used in F1, the new backend datastore for AdWords. That's a significant vote of confidence.


Interesting article. Mainly on the timing aspect though. tl;dr version here http://tldr.io/tldrs/50b375dd52b89ec3440000df


Past and current Googlers that frequent HN are notoriously absent from this thread. Come on, guys! Surely your NDA must allow some vague commentary...


Well, I don't know the details of Spanner, but even if I did I wouldn't dare test the limits of that NDA. Job's too nice to lose over that :)


"[. . .] the company’s online ad system — the system that makes its millions [. . . ]"

That is an understatement.


Is there a breakdown of their income sources? I would have guessed this is the prime income earner by a large margin.


From their 2010 10-K:

"Advertising revenues made up 97% of our revenues in 2008 and 2009, and 96% of our revenues in 2010. We derive most of our additional revenues from offering display advertising management services to advertisers, ad agencies, and publishers, as well as licensing our enterprise products, search solutions, and web search technology."

2010 Revenue: 29.321 Billion

Therefore, advertising brought in >$28 Billion in 2010 (meaning they average $1 million in advertising revenue every ~19 minutes, hence the laughable understatement in the article)
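(The back-of-the-envelope arithmetic:)

```python
annual_ad_revenue = 28e9          # dollars, 2010 (96% of $29.321B, rounded down)
minutes_per_year = 365 * 24 * 60  # 525,600
print(1e6 / (annual_ad_revenue / minutes_per_year))  # minutes per $1M: ~18.8
```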

Source: http://investor.google.com/documents/20101231_google_10K.htm...


It's an expression, not an understatement.


Does this mean that Google datacenters are vulnerable to GPS jamming and/or spoofing now?


If you read the Spanner paper, there is a hierarchy of timers involved. GPS is one level, but every datacenter also has machines equipped with atomic clocks. I suspect even with GPS failure they could run entirely on atomic clocks, it would just increase their uncertainty, which would increase the commit times (Spanner is based on the idea that a transaction is committed when the time at which it was committed is guaranteed to have passed). As far as I know, it could actually run really slowly without special timing equipment.


>I suspect even with GPS failure they could run entirely on atomic clocks, it would just increase their uncertainty

Using GPS is an operational convenience, not a necessity, I would think. If GPS didn't exist, they would have to have a master atomic clock, say, in Mountain View. Then they would have to bring each remote clock to Mountain View, sync it with the master, and then ship it (running continuously) to its final destination.

I remember perusing an HP catalog in the late eighties. You could buy a portable "traveling clock" for around $40,000. It was intended to be used to synchronize remote clocks. Here is a 1965 article from the Hewlett-Packard Journal, which describes the process:

http://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1965-04.pdf


Please someone tell me where you can buy an atomic clock :-?



Agilent used to do them, but someone else does them now:

(http://www.home.agilent.com/en/pd-1000001383%3Aepsg%3Apro/pr...)

(http://www.symmetricom.com/)



The article shortened "GPS-disciplined clock" to "GPS clock", which makes it inaccurate. The GPS clock is simply used to keep another clock accurate. (I use this setup at home; a timing GPS produces an accurate pulse-per-second, but the system clock is responsible for maintaining the time between pulses.)

(And if you're wondering, 15ns-accurate GPS clocks are $30 on eBay.)


Which is why I listed spoofing in addition to jamming.


No, because they have the atomic clocks as a backup/double check.


Only from James Bond villains.

I am sure that the U.S. armed forces have a very secure scheme to prevent this.


They have atomic clocks as a backup



“We can commit data at two different locations — say the West Coast [of the United States] and Europe — and still have some agreed upon ordering between them,”

I'm a bit confused by this. How does this handle the situation where the first transaction renders a second transaction invalid? To keep it simple, say an account with only $10 and two transactions trying to withdraw $10 each.


It's been a while since I last read the paper, but I think the second transaction would fail (optimistic locking) and must be re-tried (and as such would see the updated balance).
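In other words, something like this optimistic read-check-retry loop (a toy sketch, not Spanner's actual client API):

```python
class Account:
    def __init__(self, balance):
        self.balance = balance
        self.version = 0   # bumped on every committed write

def withdraw(account, amount):
    while True:                        # retry loop
        bal, ver = account.balance, account.version
        if bal < amount:
            return False               # a retry sees the updated balance
        if account.version == ver:     # conflict check (atomic in a real DB)
            account.balance = bal - amount
            account.version += 1
            return True

acct = Account(10)
print(withdraw(acct, 10))  # True: the first withdrawal commits
print(withdraw(acct, 10))  # False: the second one sees the $0 balance
```

Whichever transaction commits first wins; the loser's version check fails, it re-reads, and the new balance makes it refuse the withdrawal.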


Argh, "And, yes, you do need two separate types of time keepers" - No, you must establish a quorum of time keepers. Almost everyone's advice when setting up high-reliability timekeeping systems is to use 1 clock, or >3. 2 is no better than 1.


> VC is Google shorthand for video conference

That is the case nearly everywhere, I think :-)


m( "Spanner" is the colloquial German word for voyeur. Not the best name for a database. :)


Almost everything means something unexpected in some language somewhere.


Spanner is also a kind of butterfly in German: http://de.wikipedia.org/wiki/Spanner_(Schmetterling) ;)


And the British use "spanner" where the Americans use "wrench." lucian1900 has it right. And because of that, you're either going to spend ages finding (or creating) words to market by, or you're going to just name the thing, get over it, and get back to work.


Also, keep in mind that this service was never meant to be public-facing. So, unlike Bing which was named with a Chinese interpretation in mind, Google doesn't need to care whether its internal service has a bad connotation in German colloquialism.


"Spanner" means 'foolish' or 'stupid' in the UK.


Are there any words in the UK that don't mean "foolish" or "stupid"?


Yes. "fag" means cigarette.


Idiomatic churn is rather high over there, isn't it? I guess at some point you run out of unique sound combinations.



