CodeFest and BOSC 2015 – Lots of workflows and Docker

People working at BOSC Codefest 2015

This year’s BOSC is being held in Dublin, Ireland. For the third year in a row I am attending my favourite conference and, as usual, the two-day hackathon (or Codefest) that precedes it.

Starting with the hackathon: fantastic organization as always, thanks to Curoverse and Bina for sponsoring and providing food and coffee (essential for working, as is well known). There were, as usual, several projects going on, but this year I was pleasantly surprised by the size of the Common Workflow Language group; all the people in the header picture are working on this project. I got involved in the project as well – I had actually been interested in it before, but couldn’t see how to contribute.

To start with I should try to explain what the project is about; citing the documentation:

The Common Workflow Language (CWL) is an informal, multi-vendor working group consisting of various organizations and individuals that have an interest in portability of data analysis workflows. Our goal is to create specifications that enable data scientists to describe analysis tools and workflows that are powerful, easy to use, portable, and support reproducibility.

CWL builds on technologies such as JSON-LD and Avro for data modeling and Docker for portable runtime environments.

CWL is designed to express workflows for data-intensive science, such as Bioinformatics, Chemistry, Physics, and Astronomy.

Basically, CWL tries to define a standard way of writing workflows, so that different platforms, e.g. Radix or Arvados, can later write implementations of it. The main idea is to create portable and reproducible workflows. CWL is a project that was born at the previous hackathon in Boston last year – that’s a message for those who don’t have much faith in hackathons 😉

During the hackathon we tried to run some examples and found that, when Docker was listed as a requirement, the workflows were not running on our MacOS computers because of how Docker runs on Mac. Docker is based on Linux containers and, as Mac is not Linux, to run Docker you need something like boot2docker, which creates a lightweight Linux VM that runs Docker inside it and talks to the host OS, in this case Mac. The problem is that when a tool within a workflow runs in a Docker container, it creates intermediate files that the host OS cannot see, because the container file system is not the same as the host file system, breaking the workflow. After pinpointing the problem we opened a Pull Request that was merged after some review by Peter Amstutz, one of the main CWL contributors, making us (me, Robin, Roman and Sinisa) official CWL contributors, yay! So that was a profitable codefest for us.

About the conference itself, what can I say; very interesting as always. I won’t make a detailed summary of the talks as I usually do, for two reasons:

  1. You can follow up-to-date, live “what is happening” information on Twitter via the hashtag #BOSC2015
  2. Thanks to Google, the talks are being recorded and will soon be available on the BOSC site (linked above)

The only remarkable thing I want to comment on is the great idea of letting the audience ask the speakers questions via Twitter; IMHO it created open and very active participation… kudos to the organizers!

What I take home from these four days of coding and talks is… well, a lot of things, but to begin with: my interest in CWL keeps growing, and I really want to understand the project better and get involved! Also, that I really have to catch up on the Docker ecosystem – I think I’ve heard the word Docker in every single talk during the conference.

Hope you enjoyed reading! I feel a CWL-specific post coming up soon 😉

PyCon Sweden – Day 2

Me (left) and Robin Andeer about to start our talk!

Today was the second day of PyCon Sweden 2015, which marked the end of this fantastic conference. After an excellent set of talks on the first day, expectations were high! I have to say, though, that since I was giving a talk today together with Robin Andeer, I could not be as focused as yesterday. Here is my best attempt at summarising the talks I found most interesting.

1. Keynote – Kate Heddleston

More about Kate here.

The title of the talk was “The ethics of being a programmer”. Kate gave a wonderful talk about the state of privacy and freedom in the digital world, which, at the end of the day, is the personal world as well. The talk started with the story of how her father saves lives as a doctor: he has the legal and moral obligation to do his absolute best to save those lives.

This can be extrapolated to us as developers as well: maybe we cannot directly save the life of one person who has had an accident, but we can reach thousands of people, and as Kate said, “The power of reaching thousands of people comes with a responsibility and ethics”.

The talk revolved around two case studies: Twitter harassment and Snapchat pornography “sharing”. Unfortunately this is one of those talks where you have to be there in order to really get what she was talking about.

There was a very nice discussion at the end of the talk, as there is every time morality or ethics come up in a conversation (believe me, I’m a vegan…).

2. Why Django sucks – Emil Stenström

More about Emil here.

The talk started with a disclaimer that drew my attention:

It’s not that I hate Django, but I think it’s important to know its not-so-good parts as well; you can’t just love something blindly.

I think all developers will agree on this: how many times have you been involved in a discussion about, let’s say, frameworks, where the two sides were blindly defending their framework, ignoring or playing down its defects? I have, both listening and blindly participating. I think we should sometimes just be more objective :).

The talk focused on three main problems that Emil finds in Django.

  1. Shared templates: Rendering everything in JS is bad practice, and so is rendering everything server side. You need to find a balance and render/process a bit on both sides, and to do so you need to be able to process templates on both sides. Django does not have a good solution for this.
  2. Server push: It is not possible with Django (at least out of the box) to send notifications to the clients; it’s always the client that needs to send a request to the server to get any information. In order to build any efficient application that requires real-time notifications (a chat, some kind of timeline, etc.), you need to be able to do this.
  3. Template components: In order to add an external widget to your application, for example, you need to modify lots of bits of code: add the JS, add the CSS, and link to those (and place them correctly) wherever you have to use them. It would be more convenient and cleaner if adding components were easier and more independent.

3. How Python Drives the Analysis of Billions of DNA Sequences – Guillermo Carrasco and Robin Andeer

More about me… nah, you’re already reading my blog. More about Robin here.

So that was our talk! I’ll let others judge its quality and will say only two things. One is that you can find everything about our talk in this repository, from the slides to the transcript of what we said (maybe not 100% up to date, but informative enough).

Second is the feeling of the talk, which was great! Despite the jitters, the talk was relaxed and people seemed interested. We got some very interesting questions at the end and, even after the question time, people came up to congratulate us and/or continue the discussion, which was very encouraging!

What to say… thank you very much 🙂 It was great talking to everyone, and of course I am always open to discussions! Just leave a comment or send me a mail.

4. How to build a web application with Flask and Neo4j – Nicole White

More about Nicole here. GitHub here.

This talk was my favourite of the day, not only because the topic was interesting (I had never seen a graph database in action before), but also because of the format of the presentation.

Nicole started by describing the differences between a relational database and a graph database. As an example she used a very simple schema with users and posts, which she then showed as a graph where the join table disappeared in favour of nodes pointing to each other.

During the next part of the presentation Nicole followed the Flaskr tutorial but, instead of using SQLite, she used Neo4j. The coolest thing about the presentation is that she actually deployed the app on Heroku for us in the audience to sign up and publish posts. This was risky but paid off, as we could see in real time how the database was being populated and how the relations between the nodes of the graph were being created. At the same time, Nicole was showing the important bits of code that did the work.
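For those curious about what that code could look like, here is a rough sketch using py2neo; the connection details and the node/relationship names are my own assumptions, not necessarily what Nicole used:

from py2neo import Graph, Node, Relationship

# Assumes a local Neo4j instance; URL and credentials are placeholders
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# A user publishes a post: two nodes and one relationship, no join table needed
user = Node("User", username="nicole")
post = Node("Post", title="Hello PyCon", text="Graph databases are fun")
graph.create(Relationship(user, "PUBLISHED", post))

# Cypher query: list every post published by each user
for record in graph.run("MATCH (u:User)-[:PUBLISHED]->(p:Post) RETURN u.username, p.title"):
    print(record)

The nice part is that the data model above maps one-to-one to how you would draw it on a whiteboard.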

Amazing presentation, kudos to Nicole ^^

Conclusions

Wrapping up two days of conference is difficult, but I hope that my summaries of these two days are enough to give you an idea of how interesting it has been and how much I have learned.

I just want to thank the conference staff and volunteers for letting us talk and for the amazing organization: this is one of the very few conferences where I haven’t starved for being vegan, fantastic!

Also thanks a lot to Robin for bearing with me before and during the presentation; we worked hard on this together 🙂

Remember, share if you liked it! Thanks!

PyCon Sweden 2015 – Day 1

Today was the first day of PyCon Sweden 2015. I’ve had the luck of attending for the first time! And this means that I’ve attended 50% of all PyCon Sweden editions! (Yes, it’s only its second year… 😛) Joking aside, even though it’s a young fork of the main PyCon conference, the Python community here in Sweden is very big and active, which makes this conference very interesting.

There were a lot of nice talks today, many of them centred on data science and related topics, which was awesome, as it is a very interesting field and, in some sense, close to bioinformatics, which is what I work with.

I’ve tried to take some notes on the talks I attended, and here are my thoughts/summaries, with some links of interest.

1. Keynote – Ian Ozvald

Read more about Ian here.

Ian’s talk focused on data science in general. One of the points he emphasised a lot is the value of the data that a company owns per se. It doesn’t matter if your competitors know what machine learning techniques or algorithms you use, because what matters is what your data tells you about your customers; your competitors won’t be able to replicate your results. The idea to keep in mind is: data == business value. That said, he also noted that even though “data mining”, “data science” and so on seem to be a trend, companies need to be careful, because you may just not need it.

Ian went through lots of examples of Python projects that help with data science, e.g. scikit-learn and textract, among others.

Among the projects he has developed, I found one particularly interesting: using Optical Character Recognition (OCR), he built an app that takes pictures of the Latin names of the plants in a London botanic garden and links them to their Wikipedia entries. If you think about it, this is pretty cool: you basically don’t need a database with plant information, because you already have it!
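As a hedged illustration of the concept (my own reconstruction, not Ian’s actual code), something similar could be put together with pytesseract and Wikipedia’s public REST API; the image file name is a placeholder:

from PIL import Image
import pytesseract
import requests

# OCR the Latin name from a photo of the plant's label (file name is made up)
latin_name = pytesseract.image_to_string(Image.open("plant_label.jpg")).strip()

# Look the species up through Wikipedia's REST summary endpoint
url = "https://en.wikipedia.org/api/rest_v1/page/summary/" + latin_name.replace(" ", "_")
summary = requests.get(url).json()
print(summary.get("extract", "No article found"))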

One last remark from the talk was a guide for your first data science project:

  • Iterate on
    • Visualize your data
    • Create milestones
    • K.I.S.S!
    • Think + hypothesize + test
    • Communicate results
      • ipython notebook

2. Analyzing data with Pandas – Robin Lindeborg

Robin’s GitHub here.

Robin gave a very nice introduction to Pandas, so if that is what you are looking for, you should definitely check his slides, which are available on his GitHub.

The topics covered were quite broad, including data filtering, arithmetic on data frames and series, how to deal with missing values, etc. Lots of code examples; I highly recommend going through his slides.

At the end of the talk, he did a live demo using data on military spending in both the USA and Sweden from 1988 to 2015. I’ll let you guess who has been spending more money on armament ;-). Nice demo, nothing failing, which is quite surprising!
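To give a flavour of the kind of operations covered, here is a minimal sketch with made-up numbers standing in for the military spending dataset:

import numpy as np
import pandas as pd

# Toy data frame indexed by year; the figures are invented for illustration
df = pd.DataFrame(
    {"usa": [300.0, 320.0, np.nan, 350.0], "sweden": [5.0, np.nan, 5.5, 6.0]},
    index=[1988, 1989, 1990, 1991],
)

recent = df[df.index >= 1990]      # filtering rows
ratio = df["usa"] / df["sweden"]   # arithmetic between Series
clean = df.fillna(df.mean())       # one way of dealing with missing values

print(ratio)
print(clean)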

3. Docker and Python at Spotify – Belhorma Bendebiche

Belhorma’s GitHub here.

The talk started with a description of what Spotify was using before Docker. Basically, he mentioned things like Debian packaging, heavy Puppet configurations and ClusterSSH, used for deployment with all the risks that implies (it basically replicates the same command across multiple hosts over SSH). These solutions had a lot of problems: configuration mismatches, hard rollbacks, network issues, human errors, etc.

Now they use Docker intensively. Docker basically creates a small Linux container that is easily configurable. A Linux container can be defined, very roughly, as a process that “thinks” it is an isolated OS, having its own file system and everything it needs. A Docker build is configured using a Dockerfile, which contains the commands to build the image, like dependencies and requirements. Each command is a layer, and images can have parents.

The presentation finished with a nice demo of what a Dockerfile and a Docker image look like and how they are run. Unfortunately I can’t find his slides. Maybe some kind reader has? Please leave a comment!

4. Deep learning and deep data science – Roelof Pieters 

Roelof’s GitHub here.

A very genuine presentation, in which Roelof started by asking us whether we were a “cat person” or a “dog person”. The talk was driven by this question, describing how he built a classifier that takes pictures of cats and dogs and tells them apart.

Some of the tools and packages he mentioned or used are scikit-learn, Caffe, Theano and the IPython notebook. Caffe seems to be a very nice deep learning library, where you can find pre-trained models already available.
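Just to illustrate the classification part of the idea (not Roelof’s actual pipeline), a tiny scikit-learn sketch could look like this, with random numbers standing in for the image features a network like Caffe would extract:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(200, 128)        # 200 "images", 128 features each (placeholder data)
y = rng.randint(0, 2, 200)    # 0 = cat, 1 = dog (placeholder labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))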

One message to take home: the more features you have to classify your data, the better… but you have to be careful as well: more features imply an exponential growth in the amount of data you need.

The talk finished with some examples of deep learning used for audio and image recognition, and some face recognition techniques based on how the brain actually works.

He will publish the slides soon, hopefully on GitHub; otherwise I’ll update this blog post when I find out where they’re published.

5. Hacking Human Language – Hendrik Heurer

Slides here, very nice ones!

This was one of my favourites :). Hendrik is doing his MSc thesis on Natural Language Processing (NLP) at KTH in Sweden, and during the talk he went through some of the projects and studies he has been carrying out.

To start with, he showed what seemed to be a map of Europe, which turned out to be a 2D plot of the GPS tags of thousands of Flickr photos. Pretty neat :-). He continued with lots of examples of how you can use Python for NLP, like article content analysis or sentiment analysis, for which he showed some examples (check the slides).

A good source to start learning is the free ebook Natural Language Processing with Python.

Some other topics he covered were word tokenization, stemming (finding the root of a word), part-of-speech tagging (is the word a noun? a verb?…) and named entity recognition (is the word a name, a place, a date, etc.?).
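These building blocks are all available in NLTK; here is a small sketch (the example sentence is mine, and the models have to be downloaded once with nltk.download):

import nltk
from nltk.stem import PorterStemmer

# One-time downloads, e.g.:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

sentence = "Hendrik presented his thesis at KTH in Stockholm."

tokens = nltk.word_tokenize(sentence)               # word tokenization
stems = [PorterStemmer().stem(t) for t in tokens]   # stemming
tagged = nltk.pos_tag(tokens)                       # part-of-speech tagging
entities = nltk.ne_chunk(tagged)                    # named entity recognition

print(tagged)
print(entities)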

It was very interesting how he described the process of converting words into vectors in an n-dimensional space, so that you can then do linear algebra on those vectors to get what you want. For example, by comparing the vector spaces of two different languages, you can translate from one to the other just by looking at the position of a word in the original language’s space and looking at the same position in the other language’s space. (That may be a bit confusing…)
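The word-vector arithmetic itself is easy to play with in gensim; this is only a toy corpus, so the numbers mean nothing, but it shows the mechanics (gensim 4 API assumed):

from gensim.models import Word2Vec

# A toy corpus; real experiments use millions of sentences or pre-trained vectors
sentences = [["king", "man", "throne"], ["queen", "woman", "throne"]]
model = Word2Vec(sentences, vector_size=50, min_count=1)

# Linear algebra on word vectors: closest words to (king - man + woman)
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"]))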

He ended up graphing his Google searches, and closed with a funny observation: the words “KTH” and “lazy” end up really close together in the space.

6. Python: How a Notebook is Changing Science – Juan Luis Cano

Juan Luis blog here (only Spanish). GitHub here.

A nice talk describing how useful the IPython Notebook is becoming in science. He used Russell’s paradox to highlight the need to show both the results and the procedure used to obtain them.

The IPython notebook has proven to be a very effective tool for that, making it possible to combine code and rich explanations in a single place.

Several use-case examples were shown in the presentation, all of them illustrating different aspects of IPython Notebooks which, from now on, will be named Jupyter – a name made from Julia + Python + R, the languages/platforms behind the project. The big new thing about Jupyter is that it aims to be language agnostic, allowing you to run even something called IMatlab (?!).

Conclusions after the first day

Very nice presentations, all of them. I am very pleased to have attended and I am looking forward to tomorrow! A bit nervous about the talk I am giving, but really eager to do it!

About motivation

This blog post is a bit different than usual, in the sense that it is not technical at all; it is just my reflections on how to get motivated, and my thoughts about why I think our (my?) motivation drops from time to time.

The idea of this blog post comes from a small “motivational crisis” I suffered right after my Christmas vacations.

My GitHub contribution chart

“A picture speaks a thousand words”, they say, so just take a look at my GitHub contribution chart to see what I mean by “motivational crisis”.

From mid-December to mid-February I entered some kind of vicious circle of demotivation: I was unmotivated at work, so I was less productive, which made me feel more unmotivated, which made me less productive, which… After some time I managed to leave this circle and start working regularly again, but the fact is that this is not the first time it has happened to me, and I know it is a common issue, especially among programmers, so I started thinking about the cause of these demotivation dips. This is just a compilation of my thoughts, with some real science as well. Brace yourselves.

Why do I get unmotivated?

There are of course multiple reasons for this, and it is very personal and particular to each individual, but these are the three factors I have found really affect my motivation at any given time:

1. Repetitive, mechanical or non-challenging tasks

There are times when one “has to do what he/she is told”, or times when something just needs to get done and you don’t get anything out of it. These situations really exasperate me, but this is something we have to deal with as grown-ups ;-). If you’ve read the subtitle of my blog, you will understand this a little better. I always try to automate everything that can be automated, both for the sake of reliability and reproducibility and because I find it pointless to do the exact same task more than once. Whenever I can’t automate something for whatever reason, but I still have to do it, it really hurts my productivity.

2. Getting stuck on something you can’t do anything about

Occasionally, you get stuck on something that you just can’t do anything about. The classic example is when your work depends on third parties. When that happens and I can’t continue with my work or, even worse, I have to stay attentive to their actions on the matter, preventing me from focusing 100% on another task, I get really frustrated. Frustration and motivation are not good friends…

3. Uncontrolled procrastination

Procrastination is a real problem among people who work 100% of the time in front of a computer. Especially among developers, I would say. Do you recognise this situation? You’re working on something and you start Googling some issue you’re having – “Hmm, this conversation is interesting, and that framework/library/tool they’re talking about… I’m going to read about it, just to grasp the general idea” and… boom! Suddenly the day is gone and you’ve done nothing but reading.

When that happens you feel bad because you haven’t been productive, which makes you lose motivation.

How to get out from that demotivation vicious circle?

However you got unmotivated, once you are there it can sometimes be difficult to get out. Here are some tips that help me stay motivated.

If you have to do it, do it as soon as possible

Delaying “must” tasks is not a good idea. If you have to do something that you don’t like, the worst thing you can do is keep down-prioritizing it in your TODO list. All the time these unwanted tasks sit on your list is time spent with that voice in the back of your head telling you, “you should be doing that task… you will have to at some point”.

It is difficult, but be strong, sit down and work on that task as soon as possible. When you finish it you will have double satisfaction: getting something done, and getting rid of something you don’t like. Think about it: the sooner you finish that task, the sooner you will be able to work on something you really like.

If you can wait, they can wait

This is about the second cause of demotivation. If you have to wait for someone in order to be able to continue your work, just do it. The key is not to obsess about it. Schedule some time for following up with third parties and don’t stress yourself waiting for an answer; for example, send reminders or answer related mails only at the end or beginning of the day. What you have to avoid is feeling bad about it: if there is nothing you can do, that’s it, don’t worry. Use the waiting time to work on something that motivates you.

Procrastination is not always a bad thing

In her book A Mind For Numbers, Barbara Oakley talks a lot about motivation and procrastination. The main idea is that we procrastinate because it gives us an immediate feeling of satisfaction, a “reward” for our brain. When you procrastinate, your brain is satisfied because you are moving your thoughts to something more pleasant. In the long term, however, because you know that you should be doing something else, you feel bad about it. In order to stop feeling bad, you move your thoughts again to another, more pleasant matter, and so on. This is why when you procrastinate you basically keep going back and forth between several things: reading a bit of this, a bit of that.

To avoid this, what I do is try to use this immediate satisfaction in my favour. I try to follow a routine where I work focused for a period of time and, when this time has passed, I give myself a reward: a coffee, some reading, checking Facebook, a walk, etc. If you’re not disciplined enough to control that routine by yourself, try using something like the Pomodoro Technique for a while. The important part is to manage this “work done implies reward” flow. It is difficult at the beginning, but it pays off.

“Gamify” your work

Whenever possible, try to make it fun! One thing that I find fun and motivating right now, for example, is the GitHub longest streak. I am about to reach my longest contribution streak and, even though I know that how continuously you contribute is not important at all, I find it challenging and motivating. This also pushes me to work on personal projects, which is good! Try to “compete” with friends, and get inspired by what other developers you admire are doing. Ultimately, have fun!

Know yourself better

Finally, but by no means least important: know yourself better. This may sound poetic, but knowing yourself will help you understand why you are unmotivated and, more importantly, how to recharge your batteries.

A couple of months ago a good friend of mine introduced me to the topic of personalities. Thanks to him I’ve gotten to understand my personality better and have learned a couple of things that help me keep my energy up. If you’re interested in this kind of stuff, this blog post is a very good introduction to the topic.

Being true to yourself is important, and will help you keep a good mood and energy. If you are more of an extrovert and enjoy social activities, being with people and going to dinners and parties, go and do it. If you are more of an introvert and enjoy having time for yourself, reading or going for a walk, go and do it. No matter your personality, the important thing is that you know what makes you feel better and that you find the time for those things. Being in a good mood is 90% of what you need to stay motivated.

What are your thoughts about motivation? Do you also have these motivation valleys? What do you do when that happens? Comment below for a nice discussion 🙂

Thank you for reading and sharing!

Genomics reference data: The fragmentation problem (part II)

This blog post is the continuation of “Genomics reference data: The fragmentation problem (part I)“.

In part I, we saw that the majority of research groups work with several species and need to download lots of reference data. We also saw that the sources for this data are diverse and unstructured. With the remaining questions I try to find out how people are fetching and structuring reference data.

Questions 4, 5 and 6: How do you fetch your reference data? How do you structure your reference data? How do you keep your reference data up to date?

I decided to group these questions because they’re closely related: you fetch your reference data and store it somehow, and every now and then you have to update it. If you fetch data from several places and the way you get it (the structure) is different for every place, you’ll probably end up creating your own structure.

Here are some results:

As expected, the majority of people use a combination of automated and manual work. The proportion of people who download the data manually themselves is quite surprising: 29.3%. Summing up, a total of 82.9% of people do some kind of manual work for something that, IMHO, should be completely automatic. We are, of course, in the red group as well.

One particularity is that, if we group the answers by “Do you work only with one species, or several species’ genomes?” and “How do you fetch the reference data?”, we can see that only 2/7 of the people who answered “Only one species” also answered “I use an automated pipeline for fetching the data”. I would expect the opposite relation here, as fetching data for only one species should be easier to automate.

How do you structure the reference data?

I am actually quite surprised by this result. I honestly thought that a higher percentage of people would use their own structure, given the amount of people who fetch reference data for different species and from different places. At least that is what we try to do: unify everything under the same directory tree, which implies merging datasets from different places in a single place.

I also asked the question “What motivates you to use your own structure for your reference data?”. This was an open-ish question and thus difficult to plot, so I’ll just paste the relevant answers here. Again, remember that you can obtain the responses from the GitHub repository for this blog:

Free-text answers on why people use their own structure for reference data

Indeed, the major reason for structuring the data in a particular way is unification and convenience. The “I think it’s more logical” answers evidence the fact that, in some particular sources of reference data, you can feel very lost at the beginning…

Question 7: How do you keep your reference data up to date?

This is also interesting, and the answers to this question were quite disappointing, in the sense that there doesn’t seem to be any nice way to do this… Again I will just paste the set of answers here:

- check at start of new projects
- I infrequently check for new updates and fetch them manually
- do manual check : Most of the time we stick to stable version e.g. we are still using hg19 
- I don't
- regular downloads quarterly
- custom cron scripts
- I use one of the aforementioned tools to automatically check for new versions of the reference data
- custom tool
- following the corresponding mailing lists
- Generally only want to use one version for a given project.  Check for new versions if applicable when starting something new.
- cron+wget/rsync for most sources
- I regularly check for new updates and fetch them manually 
- when i remember or someone needs a newer version
- prefer working with one version across the project 
- only when need
- I update data when I start a new project
- Staying consistent with versions of references  is more important than having the latest references.
- Update as projects demand

As you can see, the responses are quite diverse, and they involve a lot of manual work. Lots of answers are along the lines of “when a project starts” or “on demand”, again making it evident that the difficulty of automating this process makes it hard to keep up with.

Question 8: Do you use any of these tools for downloading reference data?

There are lots of tools to help with the task of fetching reference data. Cloudbiolinux seems to be the most used among the participants, with 5 users, whilst for the rest of the tools apparently everyone uses their own. NOTE: There were lots of empty answers to this question, so if you add up the numbers they won’t match the total number of participants.

Cloudbiolinux is indeed a very complete tool that helps a lot with fetching reference data. We use it all the time, but there are still some types of data or organisms that are not available there.

It would be so nice to have a single tool for this task…

Conclusions

It is clear that something is wrong with the organization of genomics reference data. As I have said several times in this post, it should be fairly easy to download and update such an important part of a genomics analysis.

The problem seems to be that organizations need to fetch data from several places and then structure it in some way adapted to their needs. Also, the data is available only on some particular servers (UCSC, ENSEMBL, etc.), so people have to adapt to them. This is a problem for convenience, availability and reliability: what happens if a service is down, or if they decide to change the data structure? Something needs to be done here…

BioMart seems to be a good approach. Federation of the data would be a big win for both convenience and reliability. But we still have the problem of availability: it is the labs who host the data, so if someone decides to stop hosting it, no one can access it any more – unless, of course, someone else has previously downloaded it and wants to publish it again in BioMart. Wouldn’t it be fantastic if that distribution were automatic? This scenario reminds me quite a lot of the BitTorrent protocol, doesn’t it? That’s what Roman Valls and I think, at least.

  • Someone (UCSC, ENSEMBL, etc.) publishes some reference data and starts distributing it through this protocol.
  • Inherently to the protocol, the data starts spreading across the clients that download it, becoming more available and reliable.

A top layer could be added to create logical or custom directory structures that still point to the same torrent files, so that users can download data and store it directly in a particular structure, if desired. One could also implement another layer to automatically build indexes for several aligners using some kind of virtualization like Docker, for example.

Of course this is just an idea and would need a lot of design and thinking first, but I would love to know your thoughts about something like this, or about the situation in general. One also needs to bear in mind that BitTorrent is a stigmatized protocol in HPC centres, so that would have to be dealt with as well.

And that is all for the survey. I must apologise again because the format of the survey (totally my fault) made it quite difficult to parse and extract conclusions but, as I said in Part I, one learns from these kinds of mistakes 🙂

Please share if you found this post useful, and feel free to comment!

Genomics reference data: The fragmentation problem (part I)

If you work with genomics data, you know what I’m talking about. In order to perform any kind of bioinformatics analysis, you need to have at least some of the following reference data:

  • Reference genome for alignment, as well as index files for the alignment tool you’ll be using, e.g. BWA, Bowtie, Bowtie2, etc.
  • Variant Calling data (VCF files)
  • Gene Transfer Data (GTF files)
  • Annotation data (snpEff, VEP, Annovar)
  • Genetic Variation data (dbSNP, etc)

Maybe even more, depending on the type of analysis you’re doing. The fact is that most of this data is dispersed across several servers and institutions. Not only that, the data is stored without any particular standard across the different sources where you can find it. For example, if you want a particular piece of reference data for a particular organism, you may have to follow a different tree structure and different naming conventions than if you want another piece of reference data for the same organism, just because that last piece of data is stored on another server.

There are solutions that try to solve this problem by making the source of the reference data transparent to the user (e.g. Cloudbiolinux or COSMID). However, due to the aforementioned lack of standards in the storage and maintenance of this data, these tools cannot always help.

Last July I was in Boston to attend the BOSC conference and the hackathon that preceded it, and I thought it would be a good opportunity to ask around about what people were doing to solve this problem in their research groups. I set up a survey for participants to fill in, and I also tweeted it so that other people could answer. This is the survey; it is now unlinked from the responses sheet, so it is basically not accepting new entries.

In the following lines I’ll try to summarise and make some sense of the responses I obtained. While I was writing, I realised that I have a lot to say, so I decided to split it into (two?) parts. This is part I; part II will come soon 🙂

Participation 

I got a total of 42 responses from different organisations, including Harvard Medical School, Princeton University and MIT, among other research groups. The survey was open for one week (2014-07-07 to 2014-07-12), and I got most of the participation on the first day.

A fair number of research groups, I think. I want to thank everyone who participated in the survey.

The questions

I must say that the format of the survey made it a bit complicated to parse the results and extract information; I guess that’s something one learns with experience as well. My intention with the questions in the survey was to determine:

  1. Are we doing something really wrong, or is it really a fragmentation problem with the reference data?
  2. What are other people doing? What solutions are they implementing?
  3. If the problem is real… what can we do?

Question 1: Do you work only with one species, or several species’ genomes?

I considered this question important because it is not the same to sync data for only one species as to get data for several species. Even if different data for the same species is located in different places, automating it is not difficult if you don’t have to plug in and/or update new species’ data continuously. It turns out that most of the groups are working with several species.

This is our case as well, and I actually find this group more interesting for the purpose of this survey, so it is great that it is the clear majority of cases.

Question 2: What kind of reference data do you use for your research?

Reference data usage among survey participants

From the plot above one can see that there is a subset of reference data whose use is common to almost everyone: the reference genome and index files, with the reference genome being used by everyone who answered the survey (well, except one…). This is expected, given that this data is used in the alignment step, the most common step in a typical genomics analysis. Other data commonly used across groups is GTF, VCF, GVF and annotation data.

Other custom-ish data, like BED-detail or hand-made annotation tab-delimited files, is, as expected, less common.

Here at the National Genomics Infrastructure in Sweden we use the reference genome, index files for several aligners and all the data in the more common groups.

Question 3: Where do you fetch your data from?

This is the question I was most interested in, because what made me think of this survey was the fact that:

  1. We always need to download new data and update existing data
  2. We do it in a semi-automated way because we don’t know any better…

The first thing I noticed is that the three most used places for downloading data are UCSC, ENSEMBL and BioMart. I think that everyone knows both ENSEMBL and UCSC, where you can download lots of reference data, but what I was not aware of was BioMart (shame on me?). BioMart seems to be an effort to federate scientific data. What they say is:

  1. Set up your own data source with a click of a button
  2. Expose your data to a world wide scientific community through BioMart Portal.
  3. Federate your local data with data from other community members

Basically, one can set up a node for sharing data; this node or database can then be listed in BioMart, together with the datasets it contains, and you can later download those datasets (applying some filters if you want to). The idea is good, but we still have the initial problem: what do you do if you can’t find a dataset that contains all the data you need? You look for it in another dataset and keep going… Also, it is very dependent on their software; I would like to see more common protocols being used.
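To give an idea of what querying BioMart programmatically looks like, here is a hedged sketch against the martservice endpoint; the dataset and attribute names are illustrative and may differ between marts and versions:

import requests

# XML query asking for gene IDs and names on chromosome 21 (names are assumptions)
xml_query = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="0">
  <Dataset name="hsapiens_gene_ensembl">
    <Filter name="chromosome_name" value="21"/>
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="external_gene_name"/>
  </Dataset>
</Query>"""

response = requests.get(
    "http://www.ensembl.org/biomart/martservice", params={"query": xml_query}
)
print(response.text[:500])  # first rows of the TSV result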

In this survey, up to 17 reference data sources were listed… it is clear that there is a need for unification.

Conclusions (part I)

First of all, please feel free to explore the data yourself: you can find the parsed responses (without personal data) here, together with the IPython notebook I’m using to analyze the data.

With this first part of the survey’s responses I could see that we are basically a common use case: we use data from several species, and download lots of different data from different places. I also discovered new reference data sources, like BioMart, a very good initiative, IMHO.

There are lots of places to download the data you need from, and if you start reading about them, it is quite frustrating to see how much they diverge. Something that should be simple, because it is the basis of any analysis, can become very time consuming.

In the second part(s) of this survey, I’ll go through the following questions:

  • How do you fetch the reference data?
  • What tools do you use for fetching reference data?
  • How do you keep your reference data up to date?
  • How do you structure the reference data?
  • Where do you store your reference data?
  • Comments

Seems like I still have some homework to do.

I would love to start a discussion about this, so feel free to put your comments below. As I said, part II will come soon.

Hope you found this interesting!

“Mining” twitter for my “Twitteraniversary”

Twitter congratulations mail

So apparently it has been three years today since I created my Twitter account, yay! However, I feel like it was just a year ago that I started using it seriously… maybe two? Was I even using it before? I definitely think I had a “disconnection” period from Twitter at some point…

I was asking myself these questions on the bus today, so I decided that it could be a fun exercise to actually answer them with facts. How? Well, fortunately you can pick up all the information you want from Twitter using their complete API, and that’s what I did :-).

Just with a first call to the /account/verify_credentials.json API endpoint, you already get a lot of information:

//NOTE: Some info has been removed from the original response!
{   'created_at': u'Sun Sep 25 09:09:01 +0000 2011',
    'description': u'Computer scientist. Passionate about new technologies, programming languages and geeky stuff in general. Very interested in bioinformatics.',
    'favourites_count': 37,
    'followers_count': 68,
    'friends_count': 72,
    'lang': u'en',
    'listed_count': 2,
    'location': u'Stockholm',
    'name': u'Guillermo Carrasco',
    'screen_name': u'guillemch',
    'status': {   'created_at': u'Fri Sep 26 06:39:32 +0000 2014',
                  'favorite_count': 1,
                  'favorited': False,
                  'hashtags': [u'Twitterversary'],
                  'id': 515390196389801984,
                  'lang': u'en',
                  'retweeted': False,
                  'source': u'Twitter for Android',
                  'text': u"#Twitterversary it's 3 years in twitter today! (Only one using it actually xD)",
                  'truncated': False},
    'statuses_count': 376,
    'time_zone': u'Stockholm',
    'utc_offset': 7200}

Isn’t it cool? On a day like today in 2011 I created my account, and I already have some of the numbers I wanted:

  • 37 favorited tweets
  • 68 followers
  • 376 tweets in total
  • A mean of 0.34 tweets/day! You can’t complain that I’m too verbose if you follow me…

This line chart represents the number of tweets over time:

Tweets per day over the last 3 years

As I suspected… there was actually a period when I didn’t tweet anything: from February 24th 2012 to August 26th 2012. Not a year, but a good few months.

I find my tendency not to tweet more than once per day interesting. This is something I actually do consciously: I find it annoying when people tweet too much and flood your timeline. Yes, okay, sometimes I can tweet 2, 3 or even go crazy and throw out 4 tweets in a day, but not too often.

Now, we may want to be a bit more precise, right? For example, that number, 376 tweets… how many are actually mine, and not retweets? The Twitter API documentation says:

Retweets can be distinguished from typical Tweets by the existence of a retweeted_status attribute

So if we group the tweets by the field “retweeted_status”, it turns out that only 173 tweets are mine! That’s slightly less than 50% of them. I think that’s okay, a good balance between speaking and listening.
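For the curious, this is roughly how those numbers can be pulled with tweepy; the keys are placeholders, and a real script would paginate beyond the 200-tweet limit of a single user_timeline call:

import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

me = api.verify_credentials()                    # the big blob of account info above
tweets = api.user_timeline(screen_name=me.screen_name, count=200)

# Retweets carry a retweeted_status attribute, as the docs say
own = [t for t in tweets if not hasattr(t, "retweeted_status")]
print(len(tweets), "tweets fetched,", len(own), "written by me")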

It is actually fun to play around with the Twitter API; you can keep answering silly questions like the one before:

  • What’s my most favorited tweet? 4 favorites, and it goes to the tweet for my last blog post, so let’s see if I can improve on it this time 😉
  • What’s my most retweeted tweet? A tweet where I asked Genomics people for help filling in a survey (that I still have to parse, shame on me!), with 8 retweets

I think you get the idea. I highly recommend playing around with this, it is quite fun! And you don’t have to code anything if you don’t want to; just use the IPython notebook that I wrote for my analysis, either the raw code or the final result. There are lots of explanations of what I’m doing 🙂

Hope you have fun, if you did, share!

Using Celery to scale bioinformatics analysis

Celery is an asynchronous task queue/job queue based on distributed message passing

Yes, I know, there are tons of tutorials on how to run Celery out there, but I just wanted to showcase how we use it in our production environment – a real-life example.

What is celery? How does it work?

Celery is an asynchronous task queue/job queue based on distributed message passing. Plainly speaking, and leaving complexities aside, this means that you will have a queue of messages produced by someone we will call producers. Then you will have someone else, whom we will call workers, reading these messages and doing some work. The following picture represents this workflow:

Basic Celery architecture

This is the most basic Celery architecture you can have. Celery can work with several message queue systems, called brokers; we use RabbitMQ in our production environment, but you can use others.

As I said at the beginning, there are tons of tutorials out there, the official one being very good, so I will skip the “how-to” and just describe our environment.

How do we use Celery?

Bcbio-nextgen is the genomics pipeline we use at Science For Life Laboratory for the analysis of our samples. It is based on Python and developed by Brad Chapman. Celery is also written in Python, so integrating the pipeline tasks with Celery is straightforward.

A basic analysis would go as follows:

  1. The raw data that comes from the sequencing machines (BCL image files) needs to be converted to something we can work with, i.e. FASTQ files. At the same time, a demultiplexing process of the samples is carried out by Illumina software.
  2. Once this is done, several steps compose an analysis, which is done using this FASTQ representation of the sequenced samples. These steps can be atomic, though we try to run the whole analysis in one go. Summarising, the steps of a complete basic analysis are: sequence alignment, contaminant removal, sample merging, duplicate marking and variant calling.

NOTE: Sorry if you are not familiar with genome sequence analysis; the important thing to note here is that these tasks can be atomic.

Step one is done locally on our processing servers because it is very I/O intensive and our disks are fast. But step two is very CPU/memory intensive, and we need to do the analysis somewhere else. Here is where Celery comes into play. We are using an HPC center where we have our workers. These workers listen to different message queues. When the preprocessing finishes on our servers (and the necessary data has been sent to the HPC), these servers send an “analyze” message to one of these queues. When the workers pick up this message, the complete analysis starts. This figure illustrates our architecture:

Celery architecture at Science For Life Laboratory

You may think that we are not using Celery properly, as even if the analysis can be split into several steps, we are sending a message that basically says: do… everything! Well, this is partly true. Actually, the pipeline is designed to be able to restart the analysis at any point; it’s just that we almost never have to do that. Take a look at all the tasks we have defined. Also, the main task, “analyze”, basically launches a program that takes care of atomically sending jobs to SLURM, the queueing system used in our HPC.
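As a rough sketch (module name, broker URL and task body are illustrative, not the actual bcbio-nextgen integration code), the “analyze” task boils down to something like this:

from celery import Celery

# RabbitMQ is our broker; the URL here is a placeholder
celery = Celery("tasks", broker="amqp://guest@rabbitmq-host//")

@celery.task
def analyze(run_dir):
    """Kick off the full analysis for a finished sequencing run."""
    # In our setup this launches the pipeline, which submits its steps to SLURM
    print("Starting analysis for", run_dir)

The producer side, running on the processing servers, then only needs analyze.delay("/path/to/run") once the preprocessing and the data transfer are done.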

The benefit of using Celery is that it can scale to as many workers as you want. Each worker will pick up a task and return the result if required, all of this asynchronously (or polling if you want to). Bear in mind, though, that workers and producers should scale evenly. Here is a very good presentation by Nicolas Grasset on things you should consider when working with Celery.

I hope you enjoyed the read; the intention of this blog post was, as I said, just to show a practical example of how Celery can be used.

NGS Data Processing Hackathon in Sofia, Bulgaria


hackathon (also known as a hack day, hackfest or codefest) is an event in which computer programmers and others involved in software development, including graphic designers, interface designers and project managers, collaborate intensively on software projects – Wikipedia

Over the past two days, a hackathon on NGS data processing was held in Sofia, Bulgaria. The aim of this hackathon was to put the developers of NGS data processing projects together and try to give those projects a push during two days of intense coding. Hackathons like this are a very good opportunity to get to know not only cutting-edge bioinformatics projects, but also the amazing people behind them.

Roman Valls and I have been working on a project called FACS, which stands for Fast and Accurate Classification of Sequences. Like the widely used fastq_screen, FACS is a sequence classifier that tells you whether a given sequence read belongs to a reference organism or not, so it can be used as a contamination-checking tool.

The advantage of FACS over fastq_screen is that it uses a completely different algorithm, based on Bloom filters. As it does not require an alignment step, it is faster than fastq_screen. A previous version of FACS, written in Perl, was published some time ago; we are now working on a new C implementation. For this hackathon we planned to work on three main tasks:

  1. Work on the automation of a complete test suite for testing and benchmarking FACS and fastq_screen in a reproducible way.
  2. Fix the memory leaks reported by Coverity Scan.
  3. Implement paired-end support for FACS
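To illustrate the Bloom filter idea mentioned above (a toy Python version, nothing like FACS’s actual C implementation): k-mers from the reference are hashed into a bit array, and a read whose k-mers are all “present” probably belongs to that reference.

import hashlib

SIZE = 10000     # number of bits in the filter
N_HASHES = 3     # hash functions per element

def _positions(kmer):
    """Yield N_HASHES positions in the bit array for a given k-mer."""
    for i in range(N_HASHES):
        digest = hashlib.md5((str(i) + kmer).encode()).hexdigest()
        yield int(digest, 16) % SIZE

def add(bits, kmer):
    for pos in _positions(kmer):
        bits[pos] = True

def maybe_contains(bits, kmer):
    return all(bits[pos] for pos in _positions(kmer))

bits = [False] * SIZE
for kmer in ["ACGTACGT", "GGGTTTAA"]:   # k-mers from a reference genome
    add(bits, kmer)

print(maybe_contains(bits, "ACGTACGT"))  # True (could be a false positive)
print(maybe_contains(bits, "TTTTTTTT"))  # False means definitely not in the reference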

So far, in two days we were able to work on tasks 1 and 2. There was already an automated test suite, but there was a problem when automating the testing of fastq_screen with the bowtie2 aligner: the indexes are not available on the public Galaxy server, where all the reference data for the tests is downloaded, so if the user didn’t already have them, the benchmark would crash. We implemented a simple method to build the indexes when they are not found.

After that last fix and a bit of refactoring in the tests, running “make benchmark” in the FACS root directory will build FACS, download reference data, generate test data with SimNGS, and run tests against this data with both FACS and fastq_screen (which will also be automatically downloaded and installed). If you have a CouchDB instance, the test results will be uploaded to it, into databases named facs and fastq_screen, as JSON documents like this one:

{
   "_id": "6454e0822e48c768b02b036a4472690c",
   "_rev": "1-a9730c20e82fc579e912ff78290f6985",
   "contamination_rate": 0.333,
   "memory_usage": [
       25.18359375,
       47.51953125,
       47.53515625
   ],
   "p_value": 0.9999885,
   "total_read_count": 9000,
   "contaminated_reads": 2997,
   "sample": "facs/tests/data/synthetic_fastq/simngs.mixed_eschColi_K12_dm3_3000vs6000.fastq",
   "threads": 16,
   "bloom_filter": "facs/tests/data/bloom/eschColi_K12.bloom",
   "total_hits": 252986,
   "begin_timestamp": "2014-04-02T19:54:17.217+0200",
   "end_timestamp": "2014-04-02T19:54:17.245+0200"
}
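For reference, pushing a document like the one above from Python is only a few lines with the couchdb package; the server URL is an assumption about a local instance:

import couchdb

server = couchdb.Server("http://localhost:5984/")
db = server["facs"] if "facs" in server else server.create("facs")

doc = {
    "contamination_rate": 0.333,
    "total_read_count": 9000,
    "threads": 16,
    "sample": "simngs.mixed_eschColi_K12_dm3_3000vs6000.fastq",
}
doc_id, rev = db.save(doc)   # CouchDB assigns the _id and _rev
print("stored", doc_id, rev)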

You can use this information later for plotting the benchmarking results. We also have an IPython notebook (WIP) for doing that, mostly Roman’s work. We already have plots for accuracy and speed; our intention for the near future is to also benchmark and compare memory usage. We are using memory_profiler for this.

On fixing memory leaks, we have to thank Ognyan Kulev for his collaboration. He helped us fix some problems, and his knowledge of C was very useful for the task. There is still work to do, though.

In the end we couldn’t work on implementing paired-end read support for FACS, but collaborations are welcome! 🙂

Overall it has been a great experience; as always at these events I’ve learnt a lot and heard about awesome projects going on. If you find this interesting, take a look at all the tasks proposed for the hackathon and navigate through the related projects.

Hope you enjoyed the read!

Best Practices on Development Workshop

Yesterday I held a workshop at Science For Life Laboratory (SciLife) which I called “Best Practices on Development”.

The aim of this workshop was to show the attendees (scientists in their majority) how to develop software in a better way. In my years working in scientific environments, I have seen quite a strong tendency not to give software quality the importance it deserves, prioritizing only the desired results. This blog post from Chris Parr describes the situation quite well.

IMHO we are doing a good job in the production team at SciLife: all our code is publicly available on GitHub, some of our software is integrated with Travis-CI, and we have a good collaborative workflow (Trello, Pull Requests, GH Issues, etc.). For this reason, I decided it would be good to share these practices and this experience with the rest of the scientists at SciLife, as well as with anyone else who wanted to attend.

Here you can find all the material of the workshop. On that repository, you can find:

  • A lot of material on Git & GitHub, Python, styling, testing and debugging
  • Two branches called exercises and solutions.
  • The slides I used (which are not very informative as it was quite interactive :-P)

I tried to structure the workshop so that the people could solve this problem step by step:

It’s your first day, and you’ve been told that you have to find a bug in a piece of software, fix it, document it, and make sure that this doesn’t happen again (a.k.a. write tests!)
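For the “write tests” part, the goal was something as simple as this (the divide function below is just a stand-in for the buggy code used in the exercise):

import unittest

def divide(a, b):
    """Return a / b, failing loudly instead of silently on a zero divisor."""
    if b == 0:
        raise ValueError("b must not be zero")
    return a / b

class TestDivide(unittest.TestCase):
    def test_regular_division(self):
        self.assertEqual(divide(6, 3), 2)

    def test_zero_divisor_raises(self):
        with self.assertRaises(ValueError):
            divide(1, 0)

if __name__ == "__main__":
    unittest.main()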

The experience was awesome; people were very collaborative, even discovering some unintended bugs in my code due to last-minute changes (yep, practice what you preach…). Despite the mixture of backgrounds and levels, I am very satisfied with how people followed the workshop; all the effort put into it was definitely worth it.

I just hope that this leads to better development practices.

And needless to say, you’re free to re-use whatever you want from my repository!