How to output data in XML?

Legacy data migration QA plan

  • Looking for sources on how to QA the output of a parser that converts monolithic legacy data files to individual XML snippets. The specs are too involved to include here. Books or white papers that address this problem in terms of QA planning and execution, i.e. how to sample the output and assess satisfaction of requirements (aside from XML validation) given a staggering variation of input data and loose specs, would be appreciated. (We've got the development side covered.)

  • Answer:

Hi, siam-ga: Your question asks about QA aspects of a legacy data migration project, but nearly all aspects of a project's planning can directly impact the QA tasks. So I think it best to widen our discussion! :-)

CAVEAT
======

My understanding of your project is very incomplete, but from what you have described, it seems to involve data cleanup at least as much as pure data conversion. You seem to emphasize the semantic content of the data more than its presentation, perhaps more strongly than your client would. Many legacy "markup" conversion projects focus mainly on presentation of results, and hence target PDF or HTML output, as for example:

[Minnesota Local Research Board]
http://www.lrrb.gen.mn.us/Guidelines/Appendices/appendixB.asp

Caveat: Your project is targeting XML output, but I know little about which application(s) will use this migrated data. Normally the target application would dictate a lot of things concerning the QA process. In the absence of knowing more about that, however, I'm thinking of the results of the project as being targeted for various potential (unwritten) future uses, making it in a narrow sense something of a "data warehousing" or "data mining" project instead of simply Web publishing. [Something of that duality is inherent in MathML, which allows either a content orientation or a presentation orientation in representing mathematical formulas.]

CARDINAL PRINCIPLES
===================

I have two cardinal principles for data mapping projects, and I want to throw these out in advance of telling you exactly how to apply them in your project:

1) Speak in complete sentences. The idea is that the units of conversion should resemble standalone statements, capable of being true or false (correct or incorrect) on their own. Of course this is not entirely the case, even in mathematics; there is always a context to what is being asserted. Nonetheless, in constructing your "XML snippets" be careful to avoid fragmenting the source data beyond the point where it can no longer be understood as "complete sentences", as this is a red flag that the converted data has lost its coherence.

2) Invent needed vocabulary. Your description of the specifications process echoes experiences I've had. Apparently there exist many basic patterns for the conversion "template" and probably even more "exceptions" to these patterns. In order to discuss the patterns and exceptions, and most importantly to be able to write them into the project specifications, I'm guessing that you will need to invent some new vocabulary. The discussion of critical issues can break down in the specification phase because the same imprecise words get used to describe a variety of truly distinct phenomena. Sometimes this is fortuitous and leads to deep insights into the similarity of tasks for the software to perform, but more often than not it results in a false sense of confidence on the client's part that difficulties have been ironed out.

A FEW WEB CITATIONS
===================

Okay, now that I've thrown out my two cents on the generalities, let me present a few papers I found in searching around the Web. While none describes a situation exactly like yours, each struck me as having some good ideas to contribute with respect to quality assurance in data conversion projects.

First up is a white paper by Colin J. White of Database Associates:

[An Analysis-Led Approach to Data Warehouse Design and Development]
http://www3.newmediasales.com/dl/1755/Colin_White_Evoke_in_DWH_V2.pdf

This paper has absolutely nothing to do with XML, but it champions the notion of letting data quality dictate the design of a "data warehouse". It presents some terminology that may be useful in selling your project planning to the rest of the project team, such as the importance of "staging areas" for data to minimize data quality and data integration problems. Note that this is "version 2" of his paper, so apparently it made a good enough impression on the first client he used it with to make it into a second version!

The second paper is by Robert Aydelotte:

[From EDI to XML]
http://www.posc.org/ebiz/xml_edi/edi2xml.html

The author describes project planning for converting "legacy" EDI data formats into XML/edi (XML-based EDI) but doesn't go into detail about test cases and QA. However, he gives a link to the ISIS European XML/EDI Pilot Project, where many seminar presentations and other project-specific documents are available. This was sufficiently far in the past that validation of XML was discussed solely in terms of DTDs, but what I found most interesting in this material was the discussion of "best practices" for creating those DTDs.

Third is a paper by Shazia Akhtar, Ronan G. Reilly, and John Dunnion:

[Automating XML Mark-up]
http://www.nyu.edu/its/humanities/ach_allc2001/papers/akhtar/

which may provide some cogent ideas toward selection of test cases. They describe using the "self-organizing map" (SOM) learning algorithm proposed by Kohonen to arrange documents in a two-dimensional map, so that similar documents are located close to one another. This idea might be applied in your project to the selection of test cases. Supposing that the 2777 XML snippets were mapped into a 2D diagram, test cases could then be selected so that a greater number of idiosyncratic documents are chosen for critical examination (at the expense of using only a relatively smaller number where very similar documents are densely clustered); see the sketch just below.
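To make that concrete, here is a minimal Python sketch, written entirely by me for illustration: the feature extraction is deliberately crude (bag-of-words), the SOM is a from-scratch toy rather than any particular library, and the sampling rule (a fixed quota per map cell) is just one way to over-weight the oddballs.

    # Sketch: pick "idiosyncratic" test cases with a tiny self-organizing map.
    # All names and parameters are illustrative assumptions.
    import numpy as np
    from collections import Counter, defaultdict

    def vectorize(snippets, vocab_size=200):
        # Crude bag-of-words features over the most common tokens.
        counts = Counter(tok for s in snippets for tok in s.split())
        vocab = [tok for tok, _ in counts.most_common(vocab_size)]
        index = {tok: i for i, tok in enumerate(vocab)}
        vecs = np.zeros((len(snippets), len(vocab)))
        for row, s in enumerate(snippets):
            for tok in s.split():
                if tok in index:
                    vecs[row, index[tok]] += 1
        norms = np.linalg.norm(vecs, axis=1, keepdims=True)
        return vecs / np.maximum(norms, 1e-9)

    def train_som(vecs, grid=8, iters=5000, lr=0.5, sigma=2.0, seed=0):
        # Kohonen-style online training on a grid x grid map.
        rng = np.random.default_rng(seed)
        weights = rng.random((grid, grid, vecs.shape[1]))
        coords = np.dstack(np.meshgrid(np.arange(grid), np.arange(grid),
                                       indexing="ij")).astype(float)
        for t in range(iters):
            x = vecs[rng.integers(len(vecs))]
            # Best-matching unit: the map cell closest to this document.
            bmu = np.unravel_index(np.argmin(((weights - x) ** 2).sum(axis=2)),
                                   (grid, grid))
            decay = np.exp(-t / iters)
            dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
            h = np.exp(-dist2 / (2 * (sigma * decay) ** 2))
            weights += lr * decay * h[:, :, None] * (x - weights)
        return weights

    def sample_test_cases(snippets, per_cell=2):
        vecs = vectorize(snippets)
        weights = train_som(vecs)
        cells = defaultdict(list)
        for i, x in enumerate(vecs):
            bmu = np.unravel_index(np.argmin(((weights - x) ** 2).sum(axis=2)),
                                   weights.shape[:2])
            cells[bmu].append(i)
        # A fixed quota per cell over-samples sparse cells (the oddballs)
        # relative to densely clustered, near-duplicate documents.
        return [i for members in cells.values() for i in members[:per_cell]]

The design choice worth noting is the last step: because every occupied cell contributes at most per_cell documents, a cell holding 300 near-identical snippets and a cell holding 3 oddballs get equal scrutiny.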
STARTING OVER AGAIN
===================

Having hit all these fragmentary insights at the outset, let me back up and divide the data migration process into three "quasi-sequential" phases:

  • Data Cleanup (rectification of original data)
  • Data Translation (data mapping and conversion)
  • Data Installation (provisioning of revised data to applications)

It would be nice if the three phases were truly sequential. In practice one allows a greater or smaller measure of parallel activity across these phases for the sake of speedy deployment. Understanding the interactions is a key to minimizing cost and risk.

Data Cleanup
============

In a classic waterfall process for data migration, the data cleanup is done on the front end of the project. The temptation to defer this upfront "intellectual effort" to a later point in the project calls into question the "integrity" of the conversion phase: if the data is not correct to begin with, how can a properly defined conversion process produce correct output? GIGO stood for "garbage in, garbage out", but it could also mean "good data in, good data out".

In this particular project you've said that specifications exist for the original "tagged" format of the data files. That this data is organized into 63 input files seems somewhat incidental to the structure of the entities represented by that data. As a conceptual aid I'm thinking of those files as being somewhat like 63 tables in a relational database (please feel free to speak up and give a better description), guessing that each of the 2777 output files (XML snippets?) would generically depend on the aggregate of all the input files.

You've also indicated that these specifications were abused, and that to an extent old "markup" practices blurred the lines between content and presentation. For example, you suggest that semantic relationships are sometimes "coded" merely as a pattern of contiguous presentation (ordering) in the layout. If it is meaningful to correct these "bad practices" in situ, then it would be advantageous to do so before trying to convert the data into "proper" XML output. For one thing, it sounds as if the client has more "resources" who understand the legacy format than who understand the target XML format. Of course advantage should be sought in using "tools" to assist in this data cleanup, though it may be that the legacy format is simply too "fragile" to support an aggressive cleanup effort. One simple tool of this kind is sketched below.
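Since I don't know your legacy tag syntax, here is only a shape of such a tool: a Python pass that audits every tag occurring in the input files against the documented tag inventory, so spec violations surface before conversion begins. The tag regex, the tag set, and the *.dat glob are all my invented assumptions.

    # Sketch: audit legacy "tagged" files against the documented tag inventory.
    # The <TAG>-at-line-start syntax is an assumption for illustration.
    import re
    import sys
    from collections import Counter
    from pathlib import Path

    DOCUMENTED_TAGS = {"TITLE", "BODY", "NOTE"}   # hypothetical; take from the spec
    TAG_RE = re.compile(r"^<([A-Z0-9]+)>")        # hypothetical legacy tag syntax

    def audit(paths):
        undocumented = Counter()
        for path in paths:
            text = path.read_text(errors="replace")
            for lineno, line in enumerate(text.splitlines(), 1):
                m = TAG_RE.match(line)
                if m and m.group(1) not in DOCUMENTED_TAGS:
                    undocumented[m.group(1)] += 1
                    print(f"{path}:{lineno}: undocumented tag <{m.group(1)}>")
        return undocumented

    if __name__ == "__main__":
        files = [p for d in sys.argv[1:] for p in Path(d).glob("*.dat")]
        audit(files)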
Data Translation
================

I suggested decoupling the "conversion" into a "naive translation" phase and a "forgetful" (throw stuff out) phase. This avoids confusing an intentional discarding of information that is obsolete for future purposes with the quite opposite objective of "adding value" by reconstructing explicit semantic relations from "implied patterns".

A naive translation phase would put the legacy data into a more robust XML format, in which you can hope to leverage lots of existing tools (version control, XSLT, schemas, etc.) that may have no useful counterparts in the legacy format. The "mission statement" for this naive translation phase would be to provide XML tagging that duplicates the existing data in a literal fashion, so that at least in principle the legacy data could be fully reconstructed from this intermediate form; a sketch of such a translation follows this section. Note that XML/XPath does provide syntax for the ordering of sibling nodes. In this sense I'd hope that the "patterns" of implied relationships could be as manifest in the naive XML translation as they are in the legacy format.

I'd anticipate that a number of issues with the original data would not be fully recognized until the conversion phase was well along. While it is currently hoped that many of the "exceptions" recognized late in the game will somehow fit neatly into the preconceived architecture of rules, it would be prudent to plan for some that do not.
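To illustrate what "literal" tagging might look like, here is a small Python sketch. The legacy line-tagged syntax is invented (I don't know yours); the point is that every field becomes an element in source order, nothing is interpreted, and the original file can be regenerated from the XML.

    # Sketch: "naive translation" of a hypothetical line-tagged legacy record
    # into literal, order-preserving XML. Reversible by construction.
    import re
    import xml.etree.ElementTree as ET

    FIELD_RE = re.compile(r"^<([A-Z0-9]+)>(.*)$")  # hypothetical legacy syntax

    def naive_translate(legacy_text):
        root = ET.Element("record")
        for line in legacy_text.splitlines():
            m = FIELD_RE.match(line)
            if m:
                field = ET.SubElement(root, "field", name=m.group(1))
                field.text = m.group(2)
            else:
                # Even unrecognized lines are preserved verbatim.
                ET.SubElement(root, "raw").text = line
        return root

    def reconstruct(root):
        # Inverse mapping: evidence that no information was discarded.
        lines = []
        for child in root:
            if child.tag == "field":
                lines.append(f"<{child.get('name')}>{child.text or ''}")
            else:
                lines.append(child.text or "")
        return "\n".join(lines)

    legacy = "<TITLE>Annual report\n<BODY>Some text\nstray continuation line"
    xml = naive_translate(legacy)
    assert reconstruct(xml) == legacy
    print(ET.tostring(xml, encoding="unicode"))

The round-trip assertion is the useful habit here: if reconstruct(naive_translate(x)) == x holds over all 63 input files, the naive phase has provably lost nothing, and all the risk is concentrated in the later "forgetful" phase where losses are intentional.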
Data Installation
=================

As previously mentioned, without knowing something about the target applications, it's hard to discuss their relative importance in the QA process. You did mention in one clarification that the XML is to be used to generate HTML, and that "the client wants to review final HTML output" whereas you "feel it's much more important to look at the XML output itself." Given the greater insight you have into the HTML output process than I have, I'm certainly willing to adopt your point of view and consider the XML and its correctness as the focus of this question. It sounds as if the XML-to-HTML translation might be simply a stylesheet transformation, although the designation of the XML output files as "snippets" makes me suspect that a lot of "includes" surround this process.

SPECIFIC QUESTIONS & ANSWERS
============================

Given this outline of the project, imperfect as only my imagination can make it, we can at least recap the questions you raised and discuss solutions:

1. What are some books and papers that address project planning for an exercise like this?

This is the all-encompassing question asked in the original post. Project plans are a means to an end, not the end in themselves: planning makes it more likely that you will reach the desired goal. As Gen. Dwight Eisenhower famously observed, while plans for battle are useless as soon as war begins, planning is indispensable.

You obviously have a good grip on the tools of Unix and XML, so I won't try to drive the discussion of project planning down to a technical level. However, here's a book on generic project planning that I like:

Project Management: How to Plan and Manage Successful Projects
by Joan Knutson and Ira Bitz

It's not extremely thick, about 200 pages, and I took a short course out of it a few years back, sponsored by the American Management Association. One of the key points that I took away from that course is that a project manager's role is that of facilitating, not doing, the project work. I can't say that I ever took that lesson to heart, because I'm the quintessential player-coach on a project team, but I really do appreciate the contributions made by project managers who take care of the issues log, update the schedule, draw feedback from the clients, etc. without involving themselves in technical accomplishments on a day-to-day basis.

For advice on software projects I can recommend the very readable Peopleware by Tom DeMarco and Timothy Lister (2nd ed.). I also find food for thought in the eXtreme Programming (XP) series. As a starting point I'd read:

Extreme Programming Explained: Embrace Change
by Kent Beck

2. How should we sample the output and assess satisfaction of requirements (aside from XML validation), given a staggering variation of input data and loose specs?

I mentioned an idea above for using Kohonen's self-organizing map (SOM) to assist in selection of the test cases. You have obviously had some discussions with the client about preparing artificial data for use in unit testing, so clearly as you develop the conversion code you are planning on stubbing out certain sections to allow for this unit testing. I might try using some "debug" code to profile which patterns are being identified/applied, and how often, as your development code runs against the entire input; a sketch of such profiling appears below. I'm unclear about whether the conversion will have to take all 63 files simultaneously as input, or whether it's more a matter of processing each one individually. But in any case, if you can identify "natural" test cases for each code pathway, these will serve as good test cases for unit testing. Asking the client to make up data for the sake of unit testing seems to me to carry some risk of wasted effort, and even of introducing conversion issues that never existed in the original data! Just a thought (probably a paranoid one).

Once the conversion software is complete enough to run in "integration" mode, you will want to consult that "debug" log to see what the main code pathways are, which "test cases" are good benchmarks (illustrating expected functionality), and which relate to open issues. I really feel that an automated testing suite is going to provide value on this project, despite the additional effort required of you, the lone developer. A major headache with late changes to specs, or even with bug fixes, is that the change needed to add A or resolve B winds up unexpectedly breaking C. In my experience a test harness always provides value, because it's better to discover that Murphy's Law has struck while the code changes are still fresh in your mind.
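By "debug" profiling I mean something as lightweight as the following Python sketch. The pattern names and regexes are invented for illustration; in real code each entry would be one of your parser's rules.

    # Sketch: profile which conversion patterns fire, and how often.
    import re
    from collections import Counter

    PATTERNS = {
        # Illustrative only; see the "Three Blind Mice" discussion below.
        "three_blind_mice": re.compile(r"(?s)<B>.*?(?:<FINE>.*?){3}<I>"),
        "bare_title": re.compile(r"^<TITLE>", re.M),
    }

    profile = Counter()

    def convert(record):
        for name, pattern in PATTERNS.items():
            if pattern.search(record):
                profile[name] += 1   # which pathway handled this record?
        # ... the actual conversion work would go here ...

    def report():
        # Rarely-fired patterns point at idiosyncratic records: prime test cases.
        for name, count in profile.most_common():
            print(f"{count:8d}  {name}")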
So as a proxy for doing something clever with the SOM map, I'd suggest using the "profiling" counts from the test harness to decide how to sample test cases. As the client's experts report conversion issues, meet with the project manager to decide how the issues need to be logged, i.e. spec change vs. bug in the code. Invent vocabulary as needed to update the specs with clarity for all concerned.

3. None of the project team except you is technical enough to understand actual code. How can the specs be made more specific without pseudocode that appears more confusing than the code itself (heavy use of regular expressions)? How can exceptions to specs that only the client is aware of be effectively documented (they keep popping up in conversations)?

This picks up where the last topic left off. Pattern matching is a key element of much declarative programming, but it can be tough sledding to give it a "literal translation" in the specs. This is where an astute use of jargon, specially invented for this project, can pay off. Give the patterns that need to be discussed in the specs colorful, even semi-humorous names. It makes them memorable and gives the rest of the project team a feeling of belonging, of being "in on the secret". Give a full-blown definition of the pattern _once_ in the appropriate section of the specs, but thereafter simply refer to it by name.

Suppose (merely for the sake of illustration) that in the documents there's a typical pattern in which you have a section of BOLDED text, followed by exactly three sections of fine print, followed by a section in italics. Regardless of what the actual purpose of this pattern is for the client's typesetting needs, you might aptly and humorously refer to it as the Three Blind Mice template. The lead paragraph might be called the Farmer, and the closing one the Farmer's Wife (since she "cuts off" the tail of the pattern). Or, if someone on the project team fancies him- or herself a chess aficionado, let them propose names like Queen's Gambit, etc. It's a chance for the non-technical but creative members of the project to make an expressive connection to the nitty-gritty details, and it usually enhances the commitment of the team as a whole to doing things the right way, rather than just producing something "of the form".

For each section of the specs that defines a "pattern" you can have a standard subsection that describes "known" or suspected exceptions. As the exceptions are more clearly identified and distinguished, some of them are likely to evolve into subvarieties of "patterns", with their own exceptions. Listing the known exceptions can help the project team prioritize the evolution of new patterns based on the depth and complexity of the existing patterns and exceptions.

I don't know what language you plan to implement with. You mention regular expressions (and a focus on correctness rather than speed), which leads me to think of interpretive languages like Perl or Awk. I prefer Prolog as a declarative language with strong pattern-matching features, but in working with XML source documents, of course, XSLT is a natural choice. Regardless of how the pattern matching will be coded, there needs to be an internally consistent vocabulary for all the variations that the project team can buy into; one way to keep that vocabulary aligned between spec and code is sketched below.
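Here is a Python sketch of that alignment (all names, section numbers, regexes, and exceptions are invented for illustration): register each spec-named pattern once, together with its defining spec section and its known exceptions, so the issue log, the spec, and the code all speak the same jargon.

    # Sketch: one registry tying spec vocabulary ("Three Blind Mice") to code.
    import re
    from dataclasses import dataclass, field

    @dataclass
    class SpecPattern:
        name: str              # the name used in the spec
        spec_section: str      # where the one full-blown definition lives
        regex: re.Pattern      # the machine-readable form of that definition
        known_exceptions: list = field(default_factory=list)

    REGISTRY = {
        "Three Blind Mice": SpecPattern(
            name="Three Blind Mice",
            spec_section="4.2.1",
            regex=re.compile(r"(?s)<B>.*?(?:<FINE>.*?){3}<I>"),
            known_exceptions=["Farmer's Wife missing in some older records"],
        ),
    }

    def match_patterns(record):
        # Report matches by their spec names, so QA discussions and issue
        # logs cite the same vocabulary the spec defines.
        return [p.name for p in REGISTRY.values() if p.regex.search(record)]

The design point is that the spec section number travels with the pattern, so an issue log entry can cite both the colorful name and the section where its one authoritative definition lives.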
4. The client wants to review final HTML output, which will be generated from the XML, but I feel it's much more important to look at the XML output itself and leave the transformation to smaller-scale testing. How should we divide attention between the two?

You have a clear instinct about this, which I would trust. But I think I'd try to adapt to the client's point of view in a way that makes it seem as though they are winning the argument. Specifically, I'm thinking of serving up the XML pages with a very thin stylesheet transformation, which in the limiting case might be the default stylesheet used by Internet Explorer to render generic XML. If I knew more about the target application, I might see more clearly what incremental transforms might bridge the gap between the "raw" XML and the ultimately desired HTML. If you are the only developer, then I guess you'd be in the best position to judge how to finesse the differences.

The presentation for testing will need to account for the size of the output documents. While "snippet" suggests a single page or so of XML, this may be wishful thinking on my part. If the documents are really big, one might use an "outlining" stylesheet that allows for "collapsible" sections of textual display to assist navigation within the document. This is something I should know more about than I do; if it's of interest, then make a Request for Clarification (RFC) with the button at the top of my Answer, and I'll put a demo together for you.

5. One more thing: how about adding interactive parser functionality that will accept manual input if it can't recognize a pattern despite exception handling? Or having the XML output documents edited manually, if a problem is too specific to warrant a parser change? Should this be allowed, given that ongoing revisions will require repeating this manual change?

Obviously, allowing for an XML output document to be edited manually wouldn't require much programming effort on your part, whereas the first option sounds to my uneducated ear as if it would require a lot of effort. You can accomplish the revision tracking for output documents more or less easily by logging them into a version control system. There are some issues with this. You'll need to come up with a naming convention for the output documents which reflects their "identity" across changes in the parser, and I have no clue how this might be done. You'll also need to come up with an extra "database" that identifies which output documents are being treated as "manual" exceptions, with the intention of "checking out", prior to a run, only those documents which are supposed to get automated treatment (a sketch of that bookkeeping follows). I don't think those are insuperable obstacles, and in fact I think the identification of the "exceptional" output documents ties in well with what I suggested above about having "exception" subsections in the specs. My only real objection to this sort of approach, which may be pragmatically best, is that in principle one would prefer to do the cleanup on the source data, rather than in ad hoc fashion as a post-processing phase.
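That "database" need not be elaborate. A minimal Python sketch (the manifest file name and layout are my invention): a list of manually-maintained output documents, consulted before each automated run so hand-edited snapshots are never overwritten.

    # Sketch: skip manually-edited output documents during automated runs.
    # The manifest format (one output filename per line) is an assumption.
    from pathlib import Path

    MANIFEST = Path("manual_exceptions.txt")   # hypothetical bookkeeping file

    def load_manual_exceptions():
        if not MANIFEST.exists():
            return set()
        return {line.strip() for line in MANIFEST.read_text().splitlines()
                if line.strip()}

    def run_conversion(records, convert, outdir=Path("out")):
        outdir.mkdir(exist_ok=True)
        manual = load_manual_exceptions()
        for record_id, record in records:
            target = outdir / f"{record_id}.xml"
            if target.name in manual:
                # The hand-edited snapshot is authoritative; leave it alone.
                continue
            target.write_text(convert(record))

Note this presumes the naming-convention problem is solved: record_id has to stay stable across parser revisions, or the manifest's entries lose their meaning.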
Perhaps for you the concept of interactively directing the parser has a fairly immediate and easily implemented meaning, one that is more restrictive than simply allowing the user to do whatever they please. One aspect of it that I'd drill down on is how the parser is to be "interrupted" to allow manual interaction. The exceptions are likely to include not only documents that fail to match patterns, but also documents that match patterns that were unintended. In the latter case it seems that it might be prohibitively slow to "set breakpoints" in the software that ask a user to decide in each circumstance whether to allow automated parsing to continue or to "interrupt" for manual interaction.

CONCLUSIONS
===========

I've been off thinking and talking to myself about these ideas for too long, but every time I went back to look over your notes in relation to my ideas, I got the feeling that my ideas had at least partial relevance to paths you'd already gone down. I need a reality check, so I'm putting together what I've got as well as I can tonight for you to take a look at, and I'm standing by for any further clarification!

regards, mathtalk-ga

siam-ga at Google Answers
