The U.S. Department of Agriculture provides a free, annually updated database describing the chemical composition of a large variety of foods. To the best of my knowledge, this database is where every website that offers nutrition information ultimately gets its data. This site searches and displays data from release 28 of the database (downloaded November 3rd, 2016). Here's the USDA's suggested citation:
US Department of Agriculture, Agricultural Research Service,
Nutrient Data Laboratory.
USDA National Nutrient Database for Standard Reference, Release 28.
Version Current: September 2015, slightly revised May 2016.
Sure. The USDA's dump (linked above) is pretty awful -- you can choose between Microsoft Access and a text database format invented by someone who should not have been allowed to invent their own format. To make things a bit easier if you want to do your own analysis without dealing with the USDA's database hierarchy and choice of formats, you can download a several-megabyte CSV file containing my processed version of the data (one row per food with the first row containing human-readable column names). I've also imported the CSV file into a publicly-shared Google Docs spreadsheet.
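For reference, the ASCII files in the USDA dump separate fields with carets ("^") and wrap text fields in tildes ("~"). Here's a minimal sketch of parsing one such line; the example line and its field layout are illustrative, not a full description of any particular file:

```ruby
# Parse one line of the USDA's caret-delimited ASCII format:
# fields are separated by "^" and text fields are wrapped in "~".
def parse_usda_line(line)
  line.chomp.split("^", -1).map do |field|
    field.start_with?("~") ? field.delete("~") : field
  end
end

parse_usda_line("~01001~^~Butter, salted~^15.87\n")
# => ["01001", "Butter, salted", "15.87"]
```

From there, writing a clean CSV is just a matter of joining the resulting fields per the tables you care about.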
The USDA database can include a huge amount of detail for each data point — there are columns for "number of studies", "minimum value", "maximum value", "degrees of freedom", "lower 95% error bound", "upper 95% error bound", and, for good measure, a field for additional statistical comments. Because each value may be aggregated from a different set of studies, individual nutrients don't always sum to the totals you'd expect.
Additionally, most foods completely lack measurements for at least a few nutrients. For example, "Plantains, raw" includes measurements for carbohydrate, fiber, and sugars (31.89 g, 2.30 g, and 15 g per 100 g, respectively), but none for starch. In this particular case, I believe that it's safe to assume that the remaining 15 or so grams are starch.
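That back-of-the-envelope subtraction can be sketched as a small helper (a hypothetical function for illustration, not necessarily what this site does internally):

```ruby
# Estimate missing starch as the carbohydrate left over after
# subtracting fiber and sugars, clamped at zero (all values in g/100 g).
def estimated_starch(carbs, fiber, sugars)
  [carbs - fiber - sugars, 0.0].max.round(2)
end

estimated_starch(31.89, 2.30, 15.0)  # => 14.59
```

For the raw plantains above, that yields roughly the "15 or so grams" mentioned; note the USDA's carbohydrate figure is "by difference" and includes fiber, which is why fiber is subtracted here.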
In some cases (in particular, fatty acids), the database includes
overlapping fields. Consider food 12155, "Nuts,
walnuts, english". It's commonly known that walnuts contain a large
amount of ALA (alpha-linolenic acid, an omega-3 fatty acid). ALA is
given the designation "18:3 (n-3)", as it has an 18-carbon chain with
three double bonds ("n-3" is shorthand for "omega-3" and refers to the
position of the double bonds in the chain). The USDA database contains
three listings for 18:3 fatty acids: "18:3 n-3 c,c,c (ALA)" (F18D3CN3), "18:3 n-6 c,c,c", and "18:3 undifferentiated" (F18D3). Of these, the first one is
clearly ALA, the second is gamma-linolenic acid (an omega-6 isomer of
ALA), and the third is... an 18:3 fatty acid of some sort? As best as I
can tell, the "undifferentiated" value could be either ALA or GLA.
But back to walnuts. For
F18D3 (the "undifferentiated"
measure), the database lists 9.08 grams per 100 grams of walnuts. For
ALA and GLA, no values are present. Given the commonly-cited 4:1 ratio
of omega-3 to omega-6 fatty acids in walnuts, I've made the assumption
that all of the 18:3 listed here is ALA, but I'm unsure if that holds in
all cases. I've made similar assumptions when more precise information
isn't present: that undifferentiated 18:2 is linoleic acid, undifferentiated 20:3 is dihomo-gamma-linolenic acid (DGLA), and undifferentiated 20:4 is arachidonic
acid. The resulting numbers look plausible to me, but I'm a software
engineer, not a biochemist. If you have suggestions for improvements,
please let me know.
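The fallback described above amounts to "prefer the specific measurement, otherwise take the undifferentiated one". Sketched in Ruby, using the field names mentioned earlier (the `nutrients` hash of field name to grams per 100 g is a hypothetical representation):

```ruby
# Prefer the specific ALA field; fall back to the undifferentiated
# 18:3 value when no specific measurement is present.
def ala_grams(nutrients)
  nutrients["F18D3CN3"] || nutrients["F18D3"]
end

walnuts = { "F18D3" => 9.08 }  # no specific ALA value in the database
ala_grams(walnuts)             # => 9.08
```

The same two-line pattern covers the 18:2, 20:3, and 20:4 assumptions.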
There's a Ruby script that mangles the text files from the USDA into a bunch of JSON files that are downloaded by the browser (including a trie used for search autocompletion) and an 8-megabyte indexed data file listing the composition of each food as a JSON object. The script does some cleanup and aggregation (e.g. calculating omega-6-to-omega-3 ratios), too.
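The autocompletion trie is the standard nested-hash kind of structure; a minimal sketch (the real script's representation may differ):

```ruby
# Insert a word into a nested-hash trie; "$" marks a complete word.
def trie_insert(trie, word)
  node = word.each_char.reduce(trie) { |n, c| n[c] ||= {} }
  node["$"] = true
  trie
end

# Walk the trie along a prefix; any dead end means no completions.
def trie_has_prefix?(trie, prefix)
  node = trie
  prefix.each_char { |c| node = node[c] or return false }
  true
end

trie = {}
%w[walnut walleye water].each { |w| trie_insert(trie, w) }
trie_has_prefix?(trie, "wal")   # => true
trie_has_prefix?(trie, "wheat") # => false
```

Serialized as JSON, a structure like this lets the browser answer prefix queries without any server round trip.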
The index is loaded by a small Google App Engine app written in Go and is binary-searched to figure out which part of the data file needs to be read to answer a particular query.
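The idea behind that lookup, sketched here in Ruby rather than the Go the app actually uses (entry fields and values are illustrative): a sorted list of (key, offset, length) rows is binary-searched, and the match tells you which byte range of the data file to read.

```ruby
# Each index entry maps a food's key to a byte range in the data file.
Entry = Struct.new(:key, :offset, :length)

# Binary-search a key-sorted index; returns nil if the key is absent.
def find_entry(index, key)
  index.bsearch { |e| key <=> e.key }
end

index = [
  Entry.new("01001", 0, 120),
  Entry.new("12155", 120, 340),
]
entry = find_entry(index, "12155")
# then, roughly: f.seek(entry.offset); f.read(entry.length)
```

Since only one small slice of the 8-megabyte file is read per query, the app never needs the whole data file in memory.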
Feel free to email me at dan at eatnum dot com or visit my homepage.