Tag Archives: big data

DNA Replication – Frequent Words, Reverse Complement, Pattern Matching, Clump Finding, Skew, Mismatches

Genome replication is one of the most important tasks carried out in the cell. Before a cell can divide, it must first replicate its genome so that each of the two daughter cells inherits its own copy.

Replication begins in a genomic region called the replication origin (denoted oriC) and is carried out by molecular copy machines called DNA polymerases.

Locating oriC presents an important task not only for understanding how cells replicate but also for various biomedical problems. For example, some gene therapy methods use genetically engineered mini-genomes, which are called viral vectors because they are able to penetrate cell walls (just like real viruses). Viral vectors carrying artificial genes have been widely used in agriculture.
In the following problem, we assume that a genome has a single oriC.
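Two of the building blocks named in the post title, Reverse Complement and Frequent Words, are small enough to sketch here. This is only an illustration under stated assumptions, not the post's own code: the function names are hypothetical, and the input is assumed to be an uppercase string over A, C, G, T.

```python
from collections import Counter

# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def frequent_words(text, k):
    """Return the set of k-mers that occur most often in text.

    Overlapping occurrences are counted.
    """
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        return set()
    best = max(counts.values())
    return {kmer for kmer, n in counts.items() if n == best}
```

Counting every k-mer in one pass keeps the sketch short; it makes no attempt to handle mismatches or to search within a sliding window, which the clump-finding part of the title would need.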


Running a Hadoop Job

In my local directory, I have mapper.py and reducer.py, which contain the code for the mapper and the reducer.
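This excerpt does not show the contents of the two files, so the following is only a guess at their general shape. Hadoop Streaming feeds text lines to the mapper and reducer on standard input, expects tab-separated key/value lines on standard output, and sorts the mapper output by key before the reducer sees it. The record layout (date, store, amount) and the function names are assumptions made for illustration.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'store<TAB>amount' for each well-formed input record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:  # assumed layout: date, store, amount
            continue          # skip malformed records
        _date, store, amount = fields
        yield f"{store}\t{amount}"


def reduce_lines(lines):
    """Reducer: sum the amounts per store. Input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for store, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(float(amount) for _store, amount in group)
        yield f"{store}\t{total:.2f}"
```

In two real script files, each function would be driven by a loop over `sys.stdin` that prints every yielded line. Locally, the sort step between the two can be imitated with `sorted()` on the mapper output.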
Running a job

MapReduce – Mappers and Reducers

How data is processed with MapReduce.

Processing a large file serially from the top to the bottom could take a long time.

MapReduce is designed to process data in a highly parallel way: your input data is split into many pieces, and each piece is processed simultaneously.
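The split-then-process-simultaneously idea can be sketched on a single machine. This is a toy illustration, not how Hadoop is implemented: Hadoop distributes the pieces across many machines, while this sketch uses a thread pool, and in CPython threads mainly show the structure rather than give a real speed-up for pure-Python work. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_pieces(items, n_pieces):
    """Split a list into at most n_pieces contiguous, roughly equal slices."""
    size = max(1, -(-len(items) // n_pieces))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_piece(piece):
    """Stand-in for the per-piece work: here, just sum the numbers."""
    return sum(piece)


def parallel_total(items, n_pieces=4):
    """Process every piece concurrently, then combine the partial results."""
    pieces = split_into_pieces(items, n_pieces)
    with ThreadPoolExecutor(max_workers=n_pieces) as pool:
        partials = list(pool.map(process_piece, pieces))
    return sum(partials)
```

The important property is that each piece can be handled without looking at any other piece, so the partial results only need to be combined once at the end.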

Real-world scenario. A ledger contains all the sales from thousands of stores around the USA, organized by date. The task is to calculate the total sales generated by each store over the last year. The straightforward approach is to start at the beginning of the ledger and, for each entry, write down the store name with the amount next to it. For the next entry, if the store name is already there, add the amount to that store's total. If not, add the new store name with that first purchase. And so on, through the whole ledger.
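The by-hand procedure described above maps directly onto a dictionary keyed by store name. A minimal sketch, assuming the ledger is already available as (store, amount) pairs; the function name and the input shape are illustrative choices, not taken from the post.

```python
from collections import defaultdict


def total_sales_by_store(ledger):
    """Walk the ledger once, keeping a running total per store."""
    totals = defaultdict(float)  # an unseen store starts at 0.0
    for store, amount in ledger:
        totals[store] += amount  # add to an existing store or create it
    return dict(totals)
```

This works well while the whole table of stores fits in the memory of one machine and the ledger can be read in reasonable time. It is the serial baseline that the parallel approach is meant to improve on.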

Big Data

Every day, billions of gigabytes of high-velocity data are created in a variety of forms, such as social media posts, information gathered by sensors and medical devices, videos, and transaction records.

Not Big Data – Not everything is actually a big data problem. There are many cases where traditional systems can store, manage, and process your data. Examples are the order details for a single purchase at a store, or one person’s bank details.

What is Big Data? – Examples are all orders across hundreds of branches nationwide, or all bank transactions made in the state of Wisconsin during a year.

Definition of Big Data – Some consider a data set of terabytes or more to be ‘big data’. More usefully, it is data that can’t comfortably be processed on a single machine. Big Data is not just about the size of the data: the data is also created very fast, and data from different sources arrives in different formats. Most of this data is not worthless; it actually has a lot of value.