SQLplus query needed
-
First dibs on this question must go to Mathtalk, the author of the comments that go me this far. I would understand if they are not interested, but I do feel like I have to compensate him or her somehow. I need a query to build a map of gene collisions. >>> I am using Oracle9i and SQLplus The "known" genes in the genome are listed in table knownGene. For each gene (row) there is a genomic interval or locus with coordinates (chrN, Start position, end position). For each chromosome (1-22, X, Y) there is a table of entities called chains. These each have two intervals, target and query. A chain is a repeat. The plan is to make a list for each known gene. The list is named by its target gene, and populated by the names and descriptions of all the genes associated via the chains. A chain is a link, or edge, from one region to another in the genome. For each knowngene, all the chainSelf tables must be scanned for chains whose target coordinates indicate they would fall into the genes genomic interval. Each genes interval will have to be expanded by 1.5, but that can wait. With a gene's interval in mind, a list of chains that are enclosed by or that overlap it can be made. Each of these chains has a query locus associated with it. QNAME is the chromosome, QSTART, QEND define the interval. For each query interval, table knownGene is searched and every overlapping knowngene returned. These are the known genes that are linked to or "chain to" the initial target gene for which the list is named. They constitute the first of two columns in this is gene's list. The second column is populated by taking the name in knownGene and scanning table kgXref (were name is equivilent to kgID) to return all teh descriptions of the genes whose names we collected for this one target gene. I am not particularly clear on what the best final output is. Suffice to say I would want to sort the lists by size and see what genes chain to the greatest number of other genes, and then make a map of the network created by these associations using something like Cytoscape to show that the distribution of links follows a power law. >>>Here are the descriptions of the 3 table types: SQL> describe knownGene Name Null? Type ----------------------------------------- -------- ---------------------------- NAME NOT NULL VARCHAR2(255) CHROM NOT NULL VARCHAR2(255) STRAND NOT NULL CHAR(1) TXSTART NOT NULL NUMBER(10) TXEND NOT NULL NUMBER(10) CDSSTART NOT NULL NUMBER(10) CDSEND NOT NULL NUMBER(10) EXONCOUNT NOT NULL NUMBER(10) PROTEINID NOT NULL VARCHAR2(40) ALIGNID NOT NULL VARCHAR2(8) >knownGene looks like this, but with no headers. Values abbreviated to fit page here. >Table knownGene has 43232 rows total. #name chrom strand txStart txEnd cdsStart cdsEnd exonCnt proteinID alignID BC0634 chr1 - 4224 7502 4558 7173 2 AAH63682 24255 AL1373 chr1 - 4266 1936 7413 19346 2 Q9NSV7 41739 BC0544 chr1 - 4268 5808 4558 5808 1 Q96BN3 32430 SQL> describe chr10_chainSelf (there is one table for each chromosome 1-22, X, Y) Name Null? Type ----------------------------------------- -------- ---------------------------- BIN NOT NULL NUMBER(5) SCORE NOT NULL FLOAT(126) TNAME NOT NULL VARCHAR2(255) TSIZE NOT NULL NUMBER(10) TSTART NOT NULL NUMBER(10) TEND NOT NULL NUMBER(10) QNAME NOT NULL VARCHAR2(255) QSIZE NOT NULL NUMBER(10) QSTRAND NOT NULL CHAR(1) QSTART NOT NULL NUMBER(10) QEND NOT NULL NUMBER(10) ID NOT NULL NUMBER(10) >chr10_chainSelf looks like this with no headers. #bin score tName tSize tStart tEnd qName qSize qStrand qStart qEnd id 585 63520 chr10 135037 14000 16578 chr7 158545 - 96339 9634 204969 585 63670 chr10 135037 14000 16578 chr7 158545 + 62286 6228 204161 585 81668 chr10 135037 14000 17239 chr7 158545 + 62827 6283 133728 SQL> describe kgXref Name Null? Type ----------------------------------------- -------- ---------------------------- KGID NOT NULL VARCHAR2(40) MRNA VARCHAR2(40) SPID VARCHAR2(40) SPDISPLAYID VARCHAR2(40) GENESYMBOL VARCHAR2(40) REFSEQ VARCHAR2(40) PROTACC VARCHAR2(40) DESCRIPTION VARCHAR2(255) >kgXref looks like this but with no headers. Values have been truncated. >Table kgXref has 42026 rows total. #kgID mRNA spID spDisplayID geneSymbol refseq protAcc description AB0000 AB0000 P56555 DSR4_HUMAN DSCR4 NM_0057 NP_0058 Down syndrome gene 4 AB0001 AB0001 Q99983 OMD_HUMAN OMD NM_0050 NP_0050 osteomodulin AB0001 AB0001 Q99984 Q99984 C1orf29 NM_0068 NP_0068 mRNA expr in osteoblast >>>this is the query that was given to me. It can only check through one of the chainself tables instead of all 24 (chr1-22, X, Y). It returns the same results over and over. I dont think it ever stops. I have been sending the results to a spool file, and clearly the query doesnt do what I intended. I definately need it to check all chainSelf tables. /* query1.sql*/ select a.name, a.exoncount, d.description, c.name from knowngene a, chr1_chainSelf b, knownGene c, kgXref d where a.txstart <= b.tStart and a.txEnd >= b.tEnd and b.qstart >= c.Txstart and b.qEnd <= c.txend and c.name = d.kgId and a.chrom = 'chr1' ; Any direction would be greatly appreciated, but the answer is going to have to work. I can help develop it but it will essentially be the text in the query file, and thus syntax specific.
-
Answer:
Hi, iterative-ga: Let's outline the steps to take, and then go into the details (see below). 1. Create a table to hold each known genes, its chromosome, and the enlarged genomic intervals. Populate this table with an INSERT statement based on selecting records from table knownGenex. 2a. Create a table to hold pairs of known genes and chainSelf IDs, such that the start/end coordinates of the chainSelf target interval are within the enlarged genomic interval. Populate this table with several INSERT statements, one for each chromosome, joining the table from 1. and the respect chromosome's chainSelf table. 2b. Create a table to hold pairs of known genes and chainSelf IDs, such that the start/end coordinates of the chainSelf query interval are within the enlarged genomic interval. Populate this table with several INSERT statements, one for each chromosome, joining the table from 1. and the respect chromosome's chainSelf table. 3. Create a table to hold pairs of known genes and the chainSelf IDs that link them. Join the tables from 2a/2b by chromosome & chainSelf IDs, to produce the required "links" among known genes. It can be seen from this outline that the process can be done chromosome by chromosome, although we will be using tables that pool all the chromosomes' results together. But to prove the ideas, let's run through the process for chromosome 1. It may not only find a bug or two, but point out some indexing issues for performance. Please run through these steps just on chromosome 1 and let me know what the timings are like. regards, mathtalk-ga * * * * * * * * * * * * Step 1 ====== CREATE table knowngene_enlarge ( name varchar2(255) not null, chrom varchar2(255) not null, wxStart number(10) not null, wxEnd number(10) not null ); INSERT into knowngene_enlarge SELECT name, chrom, txStart - (txEnd - txStart)/4, txEnd + (txEnd - txStart)/4 from knowngenex; /* Note: We will probably want to add an index here on chrom, for the sake of the next steps. */ Step 2a ======= CREATE table knowngene_target ( name varchar2(255) not null, chrom varchar2(255) not null, chain_id number(10) not null ); /* for each chromosome, do like this: */ INSERT into knowngene_target SELECT a.name, a.chrom, b.id from knowngene_enlarge a, chr1_chainSelf b where a.chrom = 'chr1' and b.tStart between a.wxStart and a.wxEnd and b.tEnd <= a.wxEnd; /* Note: The hope is that the condition b.tStart <= a.wxEnd, although redundant, will help speed up the query by using an index on chrXX_chainSelf.tStart to limit the join. */ Step 2b ======= CREATE table knowngene_query ( name varchar2(255) not null, chrom varchar2(255) not null, chain_id number(10) not null, score float(126) not null ); /* for each chromosome, do like this: */ INSERT into knowngene_query SELECT a.name, a.chrom, b.id, b.score from knowngene_enlarge a, chr1_chainSelf b where a.chrom = 'chr1' and b.qStart between a.wxStart and a.wxEnd and b.qEnd <= a.wxEnd; /* Note: Very similar to above except using query interval instead of target interval, and also picking up the score field for later convenience. */ Step 3 ====== CREATE table knowngene_link ( name_t varchar2(255) not null, name_q varchar2(255) not null, chrom varchar2(255) not null, chain_id number(10) not null, score float(126) not null ); INSERT into knowngene_link SELECT a.name, b.name, b.chrom, b.chain_id, b.score from knowngene_target a, knowngene_query b where a.chrom = b.chrom and a.chain_id = b.chain_id; /* Note: To make this join efficient, we probably want to create a composite index on chrom & chain_id for knowngene_target and/or knowngene_query. */
iterative-ga at Google Answers Visit the source
Related Q & A:
- Why is sudo bash needed?Best solution by Super User
- How to convert sql query to Hibernate Criteria query?Best solution by Stack Overflow
- How to convert my SQL query to MS Access query?Best solution by Stack Overflow
- How can store sqlplus query in a file in formatted manner and display it to console?Best solution by Stack Overflow
- How to convert SQL query to LINQ query?Best solution by Stack Overflow
Just Added Q & A:
- How many active mobile subscribers are there in China?Best solution by Quora
- How to find the right vacation?Best solution by bookit.com
- How To Make Your Own Primer?Best solution by thekrazycouponlady.com
- How do you get the domain & range?Best solution by ChaCha
- How do you open pop up blockers?Best solution by Yahoo! Answers
For every problem there is a solution! Proved by Solucija.
-
Got an issue and looking for advice?
-
Ask Solucija to search every corner of the Web for help.
-
Get workable solutions and helpful tips in a moment.
Just ask Solucija about an issue you face and immediately get a list of ready solutions, answers and tips from other Internet users. We always provide the most suitable and complete answer to your question at the top, along with a few good alternatives below.