2008-07-09: Yesterday I had my boxes steal a few gigabytes of XML data on gene expression in the human cortex. I wasn't aware how easy it would be to get this information, but now I'm stuck with a few gigabytes of content that isn't necessarily going to be doing any fancy tricks for me, and I'd like to fix that.


I cannot understand how I missed this.
http://humancortex.alleninstitute.org/ or http://brain-map.org/ (either one - second is more complete); I googlestalked some of the names behind the projects, and I ran into Alan Ruttenberg who was the individual responsible for page 19 of http://tinyurl.com/ysqm3z and "Harnessing the Semantic Web to Answer Scientific Questions" - http://esw.w3.org/topic/HCLS/Banff2007Demo?action=AttachFile&do=get&target=Banff2007Part2.pdf - pages 9, 10, 13, and 14.

Essentially these guys actually know what they are doing -- they built a Google Maps AJAX interface to the neuroscience information that they had been automatically collecting. The demo is offline now, but I've sent some inquiries about it, and for the past nine hours I've been running Perl scripts that have stolen the majority of the XML data from the Allen Institute, plus gene identification information from Entrez. So if it truly doesn't exist, I can implement it with my own data sets and throw up a really, really awesome page on the server.

Turns out that this is all because of the same guys behind Creative Commons.
http://sciencecommons.org/ "for the acceleration of the process of science" geeze, might as well be me
http://en.wikipedia.org/wiki/Creative_Commons "The project provides several free licenses that copyright owners can use when releasing their works on the Web. It also provides RDF/XML metadata that describes the license and the work, making it easier to automatically process and locate licensed works." ... which, if you notice, is a very fancy trick to get people to do documentation / metadata packaging. I wasn't aware of this ... I thought it was just something that's been around as long as "the public domain" licensing option.

So I joined them at the W3C and sent a hello/join message:
http://www.w3.org/2001/sw/hcls/
hrm, which isn't in the archives yet.

http://hcls.deri.ie/hcls_demo.html
"The following queries access a SPARQL (w3, Wp, sparql.org) endpoint hosted at DERI. The underlying triplestore contains over 325 million RDF triples of biomedical information. The information covers a large array of biomedical knowledge: from basic molecular biology over literature annotation up to anatomy and physiology."

# Query about dendrites.
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX go: <http://purl.org/obo/owl/GO#>
PREFIX obo: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select ?name ?class ?definition
from <http://purl.org/commons/hcls/20070416>
where
{ graph <http://purl.org/commons/hcls/20070416/classrelations>
{?class rdfs:subClassOf go:GO_0008150}
?class rdfs:label ?name.
?class obo:hasDefinition ?def.
?def rdfs:label ?definition
filter(regex(?name,"[Dd]endrite"))
}
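
A rough sketch of how a query like the one above could be sent from a script instead of the demo page, in Python. The endpoint URL is an assumption (the demo page only says the endpoint is hosted at DERI), and the helper names are made up for illustration. It uses the standard SPARQL protocol convention of a `query` GET parameter and the SPARQL JSON results layout.

```python
import json
import urllib.parse
import urllib.request

# ASSUMPTION: placeholder endpoint URL; the notes do not give the real one.
ENDPOINT = "http://hcls.deri.ie/sparql"


def build_request_url(endpoint, query):
    """Encode a SPARQL query as a GET request (the 'query' parameter)."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def parse_bindings(results_json):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    doc = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in doc["results"]["bindings"]
    ]


def run_query(endpoint, query, timeout=30):
    """Send the query and return the parsed rows (needs network access)."""
    request = urllib.request.Request(
        build_request_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

With the query text in a string, `run_query(ENDPOINT, query_text)` would return one dict per result row, keyed by `name`, `class` and `definition`, provided the endpoint is up and serves JSON results.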

http://hcls.deri.ie/deri_presentation.pdf

I am stuck. I grabbed ~1.2 GB of human cortex gene expression data from http://humancortex.alleninstitute.org/ and have clues as to interfacing this dataset with Google Maps, MeSH, maybe other bioinformatics databases. The data is just sitting here; it's not doing anything. I'm considering the construction of some tools to use the data set as a 'scaffold' for making assumptions about the user's brain, then setting up tools from a "system admin" perspective to generally relate techniques for biofeedback to certain regions, studies and methodologies that isolated different regions, etc. But all of that would require extensive programming on my part -- something like a few new scripts/programs per paper [that the authors probably neglected to share in the zip/PDF]. Is there an easier way to make this data useful?
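
One small step that doesn't depend on any particular paper might be to index the dump by brain region, so that later tools at least have something to query. Below is a minimal Python sketch that streams through a large XML file without loading it all into memory. The element name `expression` and the attributes `region`, `gene` and `level` are hypothetical; I haven't written down the real schema here, so they would need to be swapped for whatever the Allen Institute files actually use.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_expression(xml_source, record_tag="expression"):
    """Stream records and build {region: [(gene, level), ...]}.

    ASSUMPTION: record_tag and the attribute names below are placeholders,
    not the real schema of the downloaded files.
    """
    index = defaultdict(list)
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == record_tag:
            region = elem.get("region")
            gene = elem.get("gene")
            level = float(elem.get("level", "0"))
            index[region].append((gene, level))
            # Release this record's attributes and children once it is indexed.
            elem.clear()
    return dict(index)
```

`index_expression` accepts a filename or an open binary file object, so it can be pointed at each downloaded file in turn and the resulting dicts merged.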

The main issue I see is that so far it would require me to do some serious programming each time I see a new paper or a new model that I want to implement; and how would this be integrated into that dataset anyway? The system administrator of a brain might be interested in managing I/O, biofeedback / neurofeedback, fMRI, EROS, rTMS, acoustic feedback, whatever. The human cortex gene expression data set can be displayed over the Google Maps API or something very similar to it, and then this could be used for neurotagging and the correlation of information to specific regions of the map, while also giving users an interface for managing that information and selecting views. And what would this lead to? There's a gap between gene expression information, brain maps, and the neurofeedback that the system admin would be working with. ((One small idea would be to identify new potential molecular targets from the genetic information and the expected protein expression. But this is just "looking for potentially useful new tools" rather than any runtime integration / programming -- the integration still has to be done by the sysadmin. The sysadmin would be notified of gene expression metadata related to regions of the brain that are commonly in use (lots of logging and analysis over spans of time) and then would be given some links over to PubChem that might be relevant to modifying the locally operating systems. The sysadmin would have to somehow acquire the tools and machinery, either ordering the new equipment or making it himself. That's great, but it's neurofeedback -- surely there's a way at runtime to use the gene expression information, like watching for signal spikes in the blood of certain compounds and hormones, or monitoring regions of the brain and other variables that could be used to incorporate insights from the literature without spending one's full attention on each minor augmentation.
Maybe there's a way to automate these augmentations without having to code up each model? And without having to go to each scientist individually and beg them to release something more, uh, functional? I mention GAs and evolvability somewhere below, but it's sort of handwaving. Less handwavy might be to do linguistic analysis on the literature that isn't yet fully quantified/programmed and from there assemble parts of an architecture (like legos) to make those 'minor augmentations' (ready to be deployed) -- but is grammatical analysis sufficient? Just looking for keywords and using the same general models of the brain to figure out what interfaces are possible? That could completely ignore anything truly _new_ that each paper is supposedly bringing to the table ... for instance, before you know about piezoelectricity, how would your framework accommodate it? Can't. So spending time on each article and hard coding the information might be the only way forward. Hrm. I hope I am wrong.))

I've been considering John Ohno's xsublim scripts combined with other aspects of cached functionality, like having a cluster sitting around working on unused brain output information (the stuff that doesn't make it into speech or typing). Maybe there's a way to manage a farm of these clusters so that they are working on the data that's being fed in.

Again: the problem would be that I would have to write scripts each time I come across a new paper or some new model in the neurosci disciplines. That would suck immensely. I'm willing to do it, but the computational complexity of going about that, eh. Just not the best of ideas. Instead of hardcoding everything, what about using evolvable information processing algorithms on the cluster? This is where information output from the brain would be passed through the cluster to "filter" and "process" it in various ways. It will not look intelligent at first. But the more you glance at it, the more you will see the GAs and mutators doing weird things, perhaps useful things. I just need to quantify this idea and see if it makes any more sense.

So we have:

1) Hard coding various models and metadata from the literature. This would be significantly simpler if people would just upload their programming work with their papers.
2) Evolvable computational agents that are given data and can be somehow selected to play with the information in some way. I'd rather have them mutating the semantic information and the connections between them, like queries for information in the semantic databases that match up to the observed conditions or something, and then either pulling forth code or suggesting code be written to account for certain observed patterns (dopamine up after serotonin down according to thermal scans in sector 33 (ok, that's out there)).
3) uhh
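
To make option 2 a little less handwavy, here is a toy Python sketch of the evolvable-agent loop. Each candidate is just a set of query terms; mutation adds or drops one term; the fitter half of the population survives each generation. Everything here is an assumption for illustration: the term-set representation, the parameter defaults, and above all the `fitness` function, which the caller has to supply and which is the genuinely hard part (how well a query matches the observed conditions).

```python
import random


def mutate(candidate, vocabulary, rng):
    """Return a copy of the term set with one term added or dropped."""
    terms = set(candidate)
    if terms and rng.random() < 0.5:
        # sorted() makes the choice reproducible for a given seed.
        terms.discard(rng.choice(sorted(terms)))
    else:
        terms.add(rng.choice(vocabulary))
    return frozenset(terms)


def evolve(vocabulary, fitness, generations=30, population_size=20, seed=0):
    """Evolve sets of query terms, keeping the fitter half each generation."""
    rng = random.Random(seed)
    population = [
        frozenset([rng.choice(vocabulary)]) for _ in range(population_size)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: population_size // 2]
        children = [
            mutate(rng.choice(survivors), vocabulary, rng) for _ in survivors
        ]
        population = survivors + children
    return max(population, key=fitness)
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next. A real version would turn each term set into a query against the semantic databases and score what comes back, which this sketch does not attempt.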



SNP information from deCODEme/23andMe -- this is given to users in CSV format and can be used to cross-reference the 1.5 GB Entrez dbSNP records with the Google Maps API on the frontend for cross-lateral sharing of neuroinformatics between individuals, such as family members or friends on a website; I'm currently working on this (as of 2008-07-09).
ftp://ftp.ncbi.nih.gov/snp/database/organism_data/human_9606/
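
A minimal Python sketch of the cross-referencing step, under assumptions: the column names `rsid` and `genotype` are placeholders for whatever the deCODEme/23andMe export actually uses, and the dbSNP side is assumed to have been reduced already to a stream of `(rsid, annotation)` pairs by some separate parser. The user's file is the small side, so it is loaded into a dict, and the large dbSNP side is only streamed past it.

```python
import csv
import io


def load_genotypes(csv_file, id_column="rsid", genotype_column="genotype"):
    """Read a user's SNP export into {rsid: genotype}.

    ASSUMPTION: the column names are placeholders for the real export format.
    """
    return {
        row[id_column]: row[genotype_column] for row in csv.DictReader(csv_file)
    }


def cross_reference(genotypes, dbsnp_records):
    """Stream (rsid, annotation) pairs and yield those the user has data for."""
    for rsid, annotation in dbsnp_records:
        genotype = genotypes.get(rsid)
        if genotype is not None:
            yield rsid, genotype, annotation
```

Each yielded `(rsid, genotype, annotation)` tuple is the kind of record the map frontend could then display or share.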

Solution to the general "what to do with the data" problem -- architecture/framework for the management of psychometric testing and so on, especially when it comes to the testing of neural tissues, and hopefully also that of the user behind the consoles. Essentially this is like device driver programming for the kernel. It's exactly that, even, but with some more servers for streaming data and process management in the overhead -- like monitoring variables and scheduling and such.

http://video.google.com/videoplay?docid=-1828931383490669655&q=blender+conference+2006 'At the Blender conference, Alberto Cardona discussed his methods for using Blender to reconstruct serial sections of Drosophila brains and annelid ganglia/connectives. I don't begin to understand the algorithms, but it seemed promising (I am also a neuroscientist). Google for videos of "Blender Conference 2006" to view his presentation.'

re: Alberto Cardona's 3D reconstruction problems; this might be resolved with 'brainbow'.









Kevin Brockway
Mei Chi Chin - http://dils07.cis.upenn.edu/postworkshop/slides/Wed-session1/AllenBrainAtlas.pdf
Tim Fliss
Cliff Frensley - zero@speakeasy.org - http://www.speakeasy.org/~zero/
Reena Kawal - http://vulcan.com/ (related to Paul Allen) - "Reena Kawal Associate producer, ABCNEWS.com Seattle"
Kirk Larsen
Bryan Smith
Carey Teemer - Allen Institute for Brain Science - "Carey Teemer leads the Brain Atlas application development team while concurrently managing the technology programs group within Vulcan Inc.'s technology ..."
http://www.alleninstitute.org/content/core_team.htm