Tuesday, November 20, 2018

Doing some parsing

Today I worked on parsing a document for the lab. We're interested in various biochemical pathways and the enzymes involved. Much of the data we need already exists, but the web site that provides it has it in an inconvenient format, with several different kinds of data smashed together into one string. Some records were missing the gene name; those had a somewhat different format. Records of one kind seemed to be already fused to the corresponding records of a different kind, even though one of the latter can correspond to more than one of the former, so there was duplication. My script picked out the individual pieces of data (using regular expressions) and split the combined records into several separate, deduplicated lists.
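
For anyone curious what this kind of parsing looks like, here is a minimal sketch in Python. It is not the actual script, and the record layout is entirely made up for illustration: I'm assuming one record per line, fields separated by " | ", an optional gene name in the second field, and a pathway fused onto the end of each enzyme record. The names RECORD_RE and parse_records and the sample data are hypothetical too.

```python
import re

# Hypothetical record layout (NOT the real site's format):
#   "EC 1.1.1.1 | ADH1 | alcohol dehydrogenase | PWY-101: ethanol degradation"
# Records with no gene name simply omit the second field:
#   "EC 1.2.1.3 | aldehyde dehydrogenase | PWY-101: ethanol degradation"
RECORD_RE = re.compile(
    r"^EC (?P<ec>[\d.]+) \| "
    r"(?:(?P<gene>[A-Za-z0-9]+) \| )?"   # optional gene name field
    r"(?P<enzyme>[^|]+?) \| "
    r"(?P<pathway_id>PWY-\d+): (?P<pathway>.+)$"
)


def parse_records(lines):
    """Split fused enzyme/pathway records into separate, deduplicated tables."""
    enzymes = {}    # EC number -> (gene or None, enzyme name)
    pathways = {}   # pathway id -> pathway name
    links = set()   # (EC number, pathway id) pairs
    for line in lines:
        m = RECORD_RE.match(line.strip())
        if m is None:
            continue  # skip lines that don't fit the assumed layout
        # Dict keys deduplicate: a pathway repeated on many enzyme
        # records (or an enzyme repeated across pathways) is stored once.
        enzymes[m["ec"]] = (m["gene"], m["enzyme"])
        pathways[m["pathway_id"]] = m["pathway"]
        links.add((m["ec"], m["pathway_id"]))
    return enzymes, pathways, links


sample = [
    "EC 1.1.1.1 | ADH1 | alcohol dehydrogenase | PWY-101: ethanol degradation",
    "EC 1.2.1.3 | aldehyde dehydrogenase | PWY-101: ethanol degradation",
    "EC 1.1.1.1 | ADH1 | alcohol dehydrogenase | PWY-202: another pathway",
]
enzymes, pathways, links = parse_records(sample)
```

Keying the dictionaries on an identifier is what removes the duplication: each enzyme and each pathway is stored once, and the set of pairs records which goes with which.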
