Google Is 2B Lines of Code, All in One Place

  • Just because people are talking about it: I work at MSFT, and the numbers Wired quotes for the lines of code in Windows are not even close to being correct. Not even in the same order of magnitude.

    Their source claims that Windows XP has ~45 million lines of code. But that was 14 years ago. The last time Windows was even in the same order of magnitude as 50 million LOC was in the Windows Vista timeframe.

    EDIT: And, remember: that's for _one_ product, not multiple products. An all-flavors build of Windows churns through a _lot_ of data to get something working.

    (Parenthetical: the Windows build system is correspondingly complex, too. I'll save the story for another day, but to give you an idea of how intense it is, in a typical _day_, the amount of data that gets sent over the network in the Windows build system is a single-digit _multiple_ of the entire Netflix movie catalog. Hats off to those engineers, Windows is really hard work.)

  • I'm a Google software engineer and it's nice to see this public article about our source control system. I think it has pluses and minuses, but one thing I'll say is that when you're in the coding flow, working on a single code base with thousands of engineers can be an intensely awesome experience.

    Part of my job (although it's not listed as a responsibility) is updating a few key scientific Python packages. When I do this, I get immediate feedback on which tests break, and I fix those problems for other teams alongside my upgrades. This sort of continuous integration has completely changed how I view modern software development and testing.
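
    For a feel of that loop, here's a rough sketch using Bazel-style tooling (Bazel is the open-source cousin of Google's internal Blaze; the package target below is hypothetical):

      import subprocess

      # Hypothetical Bazel-style target for the upgraded package.
      PKG = "//third_party/py/scipy"

      # Ask the build graph for every test that transitively depends on it.
      query = f"kind(test, rdeps(//..., {PKG}))"
      affected = subprocess.run(
          ["bazel", "query", query],
          capture_output=True, text=True, check=True,
      ).stdout.split()

      # Run the affected tests; the failures are exactly the breakages
      # to fix for other teams alongside the upgrade.
      subprocess.run(["bazel", "test", *affected], check=False)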

  • So a monolithic codebase makes it easier to make an organization-wide change, while microservices make it easier to have people work and ship in independent teams. The interesting thing is that you can have microservices with a monolithic codebase (as Google and Facebook are comprised of many services). But you can also have a monolithic service with many codebases (like our GitLab, which uses 800+ gems that live in separate codebases). And of course you can have a monolithic codebase with a monolithic service (a simple PHP app). And you can have microservices with diverse codebases (like all the hipsters are doing).

    I'm wondering if microservices force you to coordinate via the codebase, just like using many codebases forces you to coordinate via the monolithic service. Does the coordination have to happen somewhere? I wonder if early adopters of microservices in many codebases (SoundCloud) are experiencing coordination problems trying to change services.

  • I will say that I saw and experienced many things that changed my definition of 'large' at Google, but the most amazing was the source code control / code review / build system that kept it all together.

    The bad news was that it allowed people to say "I've just changed the API to <x> to support the <y> initiative, code released after this commit will need to be updated" and have that affect hundreds of projects, but at the same time, the project teams could adapt very quickly, with the orb on their desk telling them that their integration and unit tests were passing.

    I thought to myself, if there is ever a distributed world wide operating system / environment, it is going to look something like that.

  • Xoogler here. There were tons of benefits to Google's approach, but they were only viable with crazy amounts of tooling (code search, our own version control system, the aforementioned CitC, distributed builds that reused intermediate build objects, our own BUILD language, specialized code review tools, etc).

    I'd say the major downside was that this approach basically required a 'work only in HEAD' model, since the tooling around branches was pretty subpar (more like the Perforce model, where branches are second-class citizens). You could deploy from a branch but they were basically just cut from HEAD immediately prior to a release.

    This approach works pretty well for backend services that can be pushed early and often, but it's a bit of a mismatch for mobile apps, where you want more carefully controlled, manually tested releases given the turnaround time if you screw something up (especially since it's really inefficient to write useful automated tests around UI). It's also hard to collaborate on long-term features within a shipping codebase, which hurts exploration and prototyping.

  • It's interesting that they compare LoC with Windows. I suppose this article wants us to be amazed at those numbers. However, my experience with Google's products indicates a gradual decline in performance and a simultaneous gradual increase in memory bloat (Maps, Gmail, Chrome, Android). Which, ironically, FWIW, hasn't been the case with Windows: I have noticed zero difference in performance going from Windows 7 to 8 to 10.

  • "The two internet giants (Google and Facebook) are working on an open source version control system that anyone can use to juggle code on a massive scale. It’s based on an existing system called Mercurial. “We’re attempting to see if we can scale Mercurial to the size of the Google repository,” Potvin says, indicating that Google is working hand-in-hand with programming guru Bryan O’Sullivan and others who help oversee coding work at Facebook."

    Why Mercurial instead of Git?

  • Comparing "Google" to Windows isn't really a fair comparison. I'm sure all of the code that represents products that Microsoft has in the wild far exceeds 2B lines.

  • One humorous side-effect of having all that code viewable (and searchable!) by everyone was that the codebase contained whatever typo, error, or mistake you could think of (and convert into a regular expression).

    I remember seeing an internal page with dozens of links for humorous searches like "interger", "funciton", or "([A-Z][a-z]+){7,} lang:java"...
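
    Outside Google, you can crudely approximate those searches with a regex walk over a checkout (the root directory and patterns below are illustrative; the internal tool was an indexed code search, not a filesystem scan):

      import os
      import re

      # A few of the classic typo patterns from that internal page.
      TYPOS = re.compile(r"\b(interger|funciton|recieve)\b")

      for root, _, files in os.walk("src"):  # hypothetical checkout root
          for name in files:
              path = os.path.join(root, name)
              try:
                  with open(path, encoding="utf-8", errors="ignore") as f:
                      for lineno, line in enumerate(f, 1):
                          if TYPOS.search(line):
                              print(f"{path}:{lineno}: {line.rstrip()}")
              except OSError:
                  pass  # unreadable file; skip it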

  • Direct link to talk (The Motivation for a Monolithic Codebase):

    https://www.youtube.com/watch?v=W71BTkUbdqE

  • I am unable to believe that Google has 2B lines of original code written from scratch at Google.

    Maybe they are counting everything they use. Somewhere among those 2B lines is all the source code for Emacs, Bash, the Linux kernel, every single third-party lib used for any purpose, whether patched with Google modifications or not, every utility, and so on.

    Maybe this is a "Google Search two billion" rather than a conventional, arithmetic two billion. You know, like when the Google engine tells you "there are about 10,500,000 results (0.135 seconds)", but when you go through the entire list, it's confirmed to be just a few hundred.

  • "LGTM is google speak for Looks good to me" - actually common outside of Google.

  • In the spirit of "You didn't build that," I wonder how many lines of code make up the systems that Google's binaries run on? Windows, Linux, network stacks, Mercurial, etc, etc.

    I also wonder if there's a circular relationship anywhere in there.

  • The CitC filesystem is very interesting: local changes overlaid on top of the full Piper repository, with commits similar to snapshots of the filesystem. Sounds similar to https://github.com/presslabs/gitfs
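
    As a toy model of those semantics (purely illustrative, not how CitC is actually implemented), local writes shadow a read-only snapshot of the repository:

      class OverlayWorkspace:
          """Toy CitC-style workspace: local edits overlay a read-only
          snapshot of the full repository."""

          def __init__(self, repo_snapshot):
              self.repo = repo_snapshot  # the base snapshot, never modified
              self.local = {}            # only the files you've touched

          def read(self, path):
              if path in self.local:     # a pending local edit shadows the base
                  return self.local[path]
              return self.repo[path]

          def write(self, path, contents):
              self.local[path] = contents  # copy-on-write into the overlay

          def pending_change(self):
              return dict(self.local)      # a "snapshot" is just the delta

      ws = OverlayWorkspace({"README": "v1"})
      ws.write("README", "v2")
      assert ws.read("README") == "v2"  # the overlay wins over the snapshot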

  • I really wish there was a tendency to track all change/activity and not just total size; maybe like the graphs on GitHub. Removing things is key for maintenance and frankly if they haven't removed a few million lines in the process of adding millions more, they have a problem.

    Having a massive code base isn't a badge of honor. Unfortunately in many organizations, people are so sidetracked on the next thing that they almost never receive license to trim some fat from the repository (and this applies to all things: code, tests, documentation and more).

    It also means almost nothing as a measurement. Even if you believe for a moment that a "line" is a reasonably accurate measure (and it's tricky to come up with better ones), we have no way of knowing whether they're counting lots of copy/pasted duplicate code, massive comments, poorly-designed algorithms or other bloat.

  • The comparison of Windows to all of Google's services is pointless and misleading.

    It's like comparing the weight of a monster truck and the total weight of all the cars at a dealership...

  • Assuming these numbers are right...

    (15 million lines of code changed a week) / (25,000 engineers) = 600 LOC per engineer per week

    Is ~120 LOC per engineer per workday normal at other companies?
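
    The division itself checks out; whether the numerator counts machine-generated changes is another question:

      lines_per_week = 15_000_000
      engineers = 25_000

      per_engineer_week = lines_per_week / engineers  # 600.0
      per_engineer_workday = per_engineer_week / 5    # 120.0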

  • I imagine that there's a lot of Java and C++. I do like Go but it makes you wonder if a more expressive language that requires a fraction of the code would be helpful. Maybe Steve Yegge will see Lisp at Google after all.

  • Some questions that immediately come to my mind:

    - What is the disk size of a shallow clone of a repo (without history)?

    - Can each developer actually clone the whole thing, or do you do a partial checkout?

    - Does the VCS support a checkout of a subfolder (AFAIK mercurial, same as git, does not support it)?

    - How long does it take to clone the repo / update the repo in the morning?

    Since people are talking about huge across-repo refactorings, I guess it must be possible to clone the whole thing. (For the git side of these questions, see the sketch below.)

    Facebook faces scaling issues similar to Google's, so they wrote some Mercurial extensions, e.g. for cloning only metadata instead of the whole contents of each commit [1]. It would be interesting to know what exactly Google modified in hg.

    [1] https://code.facebook.com/posts/218678814984400/scaling-merc...
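
    For comparison, here is roughly what stock git offered at the time: shallow clones, plus sparse checkout of a subfolder (the repo URL and path are hypothetical; note that sparse checkout only trims the working tree, the full object database is still fetched):

      import subprocess

      REPO = "https://example.com/big-monorepo.git"  # hypothetical

      # Shallow clone: answers "what does it cost without history?"
      subprocess.run(["git", "clone", "--depth", "1", REPO, "monorepo"],
                     check=True)

      # Sparse checkout of one subfolder (supported since git 1.7 via
      # core.sparseCheckout; it limits the working tree only).
      subprocess.run(
          ["git", "-C", "monorepo", "config", "core.sparseCheckout", "true"],
          check=True,
      )
      with open("monorepo/.git/info/sparse-checkout", "w") as f:
          f.write("some/subfolder/\n")
      subprocess.run(["git", "-C", "monorepo", "read-tree", "-mu", "HEAD"],
                     check=True)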

  • What? This surpasses the mouse genome in complexity. See these charts for comparison: http://www.informationisbeautiful.net/visualizations/million...

  • Really interesting article. Sounds like a great solution to git's submodule problem. Definitely worth looking at. Thanks for posting, OP.

    IMO this system would best be suited for large companies, but I could see the VCS that they are developing being used by anyone if it gets a github-esque website.

  • Yeah, but it's only like ~200 lines rewritten in Lisp.

  • Looks like someone's going to have to update this: http://www.informationisbeautiful.net/visualizations/million...

  • This hurts just thinking about what the build, test and deploy systems must look like.

  • For those interested, the source analyzer Steve Yegge was working on called GROK has been renamed Kythe. I don't know how useful it turned out to be for those 2B LOC. http://www.kythe.io/docs/kythe-overview.html

    Steve Yegge, from Google, talks about the GROK Project - Large-Scale, Cross-Language source analysis. [2012] https://www.youtube.com/watch?v=KTJs-0EInW8

  • What I'd like to know and no one seems to mention:

    What's the experience like for teams that aren't running a Google service but instead interact with external users and contributors, e.g. the Go compiler or Chrome?

  • Really? This article sounds very oversimplified, but I haven't worked at Google so I wouldn't know. I'm assuming that if you want to change some much-depended-on library, there's a way to bump the version number so you don't hose all your downstream users. That's the way it worked at Amazon, at least. Also, I wonder why the people in the story think Google's codebase is larger than that of other tech giants, not that it really matters.

  • What are the best practices to follow in a single-repo-multiple-projects world? Some people recommend git submodule, others recommend subtree.

    How do you guys manage alerts and messages? Does every developer get a commit notification, or is there a way to filter out messages based on submodule?

    How does branching and merging work?

    I'm wondering what processes are used by non-Google/FB teams to help them be more productive in a monolithic repo world.
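
    For the submodule-vs-subtree part of the question, the mechanics look roughly like this (the library URL and paths are hypothetical, and in practice you'd pick one option per path):

      import subprocess

      LIB = "https://example.com/shared-lib.git"  # hypothetical dependency

      def add_as_submodule():
          # The library remains its own repo, pinned here by commit SHA;
          # cloners must remember `git submodule update --init`.
          subprocess.run(
              ["git", "submodule", "add", LIB, "vendor/shared-lib"],
              check=True,
          )

      def add_as_subtree():
          # The library's tree is merged into this repo, so plain clones
          # just work and its history can be squashed to one commit.
          subprocess.run(
              ["git", "subtree", "add", "--prefix=vendor/shared-lib",
               LIB, "master", "--squash"],
              check=True,
          )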

  • The comparison between the total LOC across all of Google's products and just one of Microsoft's is a bit unfair.

  • The comparison with Windows is really just there to give the casual reader something to compare against; it's not a very good one. An OS is a huge project, but Google has hundreds of different projects, APIs, libraries, frameworks... Even Unix, with an "unlimited" supply of developers, does not reach that point.

  • I remember somebody wise once said: "Every line of code is a constraint working against you."

  • How do the monolithic repository companies handle dependencies on external source code?

    Are libraries and large projects, e.g. an RDBMS, generally vendored/forked into the monolithic repositories, regardless of whether the initial intent is to make significant changes?

  • Is the source of Piper and the build tools also in the monorepo, and also developed/deployed off the head branch? It seems like a random engineer could royally fubar things if they broke a service that the build system depends on ...

  • Even if the numbers are off, the assumption that 40M lines of code take less effort to write than 2B lines rests on the fallacy that effort is proportional to line count. Come on, Wired, you can do better.

  • Is this article saying that all developer employees have access to the "holy" search algorithm internals? I can hardly believe that to be true, given that SEO is an entire industry.

  • How frequently does Google do code refactoring? https://en.wikipedia.org/wiki/Code_refactoring

  • Those are mind-boggling numbers.

    Although I kind of doubt that "almost every" engineer has access to the entire repo, especially when it comes to the search ranking stuff.

  • A giant repo works for Google, Facebook, and Microsoft, but it's bad for the development community at large.

    If you start centralizing your development you’re killing any type of collaboration with the outside world and discouraging such collaboration between your own teams.

    http://code.dblock.org/2014/04/28/why-one-giant-source-contr...

  • With 2 billion lines of code I would consider the problem of developers stepping on each other's toes essentially solved.

  • What I find most distressing is that their Python code indents with two spaces... This is so wrong, Google.

  • If they were to recompile all of it on a standard desktop PC how long would it take? A week?
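
    Back-of-envelope only, with the compile throughput a pure assumption (and linking, I/O and tests ignored):

      total_lines = 2_000_000_000
      lines_per_second = 10_000  # assumed C++-ish single-machine compile rate

      seconds = total_lines / lines_per_second
      print(seconds / 86_400)  # ~2.3 days of pure compute on those assumptions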

  • I wonder how close the "piper" system is to the code.google.com project.

  • Am I the only one who initially read 28 instead of 2B ? :)

  • What is a "line of code"? Out of the 2B lines of code Google has, how much of it was auto-generated? How many of those lines are config files? This is a very silly article that has little to no value.

  • So they do not suffer from git submodules, I guess.

  • Cool. How much of it is front end and how much is back end?

  • How many lines is duckduckgo? :-)

  • OMG it's all in one file? OMG OMG it's all on ONE LINE????!!!

  • Now I know why my google plus page takes half a day to load.

  • Is anyone else a little amazed that they have so many employees with access, yet their single codebase has not yet been leaked? It almost seems inevitable. They must have some contingency planning for such an event.

  • If they had done it in LISP, it would have only been 200K lines.

  • lol git clone http://urlto.google.codebase.git ...

    I wonder how much time it takes to clone the repo, provided they use git.

  • This explains quite some things.

    Still, this is not a very forward-thinking solution. Building and combining microservices – effectively UNIX philosophy applied to the web – is the most effective way to make progress.

    EDIT: Seems like I misunderstood the article – from the way I read it, it sounded like Google has a monolithic codebase, with heavily dependent products, deployed monolithically. As zaphar mentioned, it turns out this is just bad phrasing in the article and me misunderstanding that phrasing.

    I take everything back I said and claim the opposite.