If I were Howard: improving upon the Phage Hunters project

Undergraduate labs are notoriously dismal, certainly from the student point of view and often from faculty’s as well. One obvious reason is that the design and delivery must to some extent be cookie-cutter because materials and set-up must be planned for what may be dozens of sections, delivered to hundreds of students, and most often constrained to once-a-week, two- or three-hour meetings. The Howard Hughes Medical Institute (HHMI) has sponsored the HHMI Phage Hunters (SEA-phage) project (and related UT Austin Freshman Research Initiative) with the goal of addressing several shortcomings of classical Intro Bio labs. I believe that it’s on the verge of being something truly awesome (at least from the point of view of a structure/function discovery-driven instructor) and that greatness for ‘my’ target group could be achieved by following the core thinking to different ends.

Any curriculum design must begin with a concrete series of goals. The webpage cited above lists

ownership of a project
a chance to publish and contribute to the scientific community
regular milestones to measure progress
authentic scientific discovery
(implied, I think: gain familiarity with tools/techniques of modern molecular biology)

Looking at these, I think they’re sound as stated. The places where I think the project can be improved upon concern optimization of ownership, the expected readership in the scientific community, and that killer word ‘authentic’. Briefly summarized, the Phage Hunters project has students collect soil samples, isolate phage, generate genome sequences, deposit the resulting data in public databases, and annotate them.

CAVEAT: I’m a molecular biologist and geneticist, and some of my concerns may arise simply because I’m not the target audience. Nonetheless, my proposals apply to departments serving molecular/genetic folks.

What is ownership/does HHMI Phage Hunters hit the mark?

One of the overwhelming weaknesses of classic ‘cookbook’ college labs is that they are not actually investigations, they are demonstrations–“Please do what thousands have done before, see what thousands have seen before. The TAs have a guide of your expected results that is 10 years old, because nothing ever changes.” On the surface, I think the HHMI Phage Hunters project addresses this, in that your phage sequence may be different than my phage sequence. So at one level, yes–I truly don’t know what I am going to find, and it will be ‘fun’ to compare to what you find.

But now we get to the first aspect of ‘authentic’. Authentic research is driven by a need-to-know, and this should apply both to the field in general (the final arbiter being granting agencies 🙂 ) as well as to the individual pursuing the results. From my point of view, I think most current ‘large scale student engagement’ projects are succeeding here only if we take a rather paternalistic attitude toward students pursuing the project. These are the critical questions that I think constitute the ‘sniff test’ for any such project:

Do students know what hypotheses are being tested by their sequencing quests? I think this may be the highest hurdle; the easiest efforts to design are bug hunts (and my own proposal below isn’t immune to this criticism)
Is there a risk that the project will have to end because sufficient data will allow a conclusion to be drawn?
Do the previous year’s results drive a change in future year protocols based on new information?
Do students at one institution eagerly await sequences from other institutions?
Do faculty at the institutions offering the course (but who are not THEMSELVES involved) eagerly read the resulting work?
Do students who complete the course return periodically to the databases to see how the research in which they played a role is unrolling?

My worry is that by and large, the answers to these questions distinguish this work from ‘authentic’ research. Even after leaving a field, most of us follow it for years. There are several reasons–general interest, the fact that we have favored hypotheses and wish to know the outcomes as they are tested, projects that we initiated and wish to see resolved. I am not confident that, for the students I work with, the issues addressed by the Phage Hunters project would achieve this goal.

Publication and contribution

Again, I am 100% on board with the goal here. But my question is this: if I, as an individual scientist, regularly submitted additional installments of phage sequences from soil samples I gathered on the grounds of the participating institutions to research journals, what would the fate of my paper be? If the outcome hinges on the submission arising from students in the HHMI SEA Phage/Phage Hunters project, then my feeling is that we are to some extent belittling them, akin to letting them sit at the adults’ table for work that might not get the adults a similar seat. While I support providing venues for student work to be displayed and recognized on its own merits, I think we need to be forthright about what we are and are not offering. And I believe it is legitimate to ask “How close can we come to hitting the target squarely?”

Authentic scientific discovery

This is one of my favorite words; my recent job applications are peppered with it. But setting it as a goal is not the same as delivering it as an outcome. So here are some bullet points of my own that define the target ‘authentic scientific discovery for students’ from my point of view:

student fully understands the question/hypothesis that work will address
student can predict outcomes and their consequences to the question/hypothesis
results new/novel and not wholly predictable, but also not ‘random’
students in different groups approach different aspects of the question
students in one location are informed by and interested in results at another location
future iterations of the project build upon previous results
project susceptible to being forced to change because questions are answered, discoveries are made

[Moment of irony: I wish that last standard more often applied to faculty work as well…]

The elephant in the room

Most of the discussion of the various attempts seems to focus on the protocols, the data, the students–all important elements. But there’s a key ingredient without which we’re all doomed to failure: the mentor-scientist-teacher who is coordinating the discussion and thinking. Unless this individual really ‘gets it’–understands the learning goals, challenges student thinking, ties the data to concepts and hypotheses, allows students to lead, guides discussions along fruitful paths–everything else falls to ashes. I’ve directed introductory biology labs at a large undergrad institution for a decade, and long ago came to the realization that in many regards I’m the least relevant member of the team. Unless the ‘boots on the ground’–the Lab Instructors doing the interactive guiding of students–understand philosophically what we’re trying to do, grasp the structure of the exercises, are skilled in leading discussion, have themselves been mentored in teaching, and perhaps even have that mystical ‘talent’, nothing I put out there ‘causes’ science and learning to happen in the classrooms. That’s a long list of demands, and it exists completely outside of experimental design, equipment, scientific question, etc. If we can’t engage, train, and support those delivering these programs, we’re done.

Massing for structure/function characterization

I believe the general goals of the HHMI Phage Hunters project as I see them can be achieved (reiterating caveat: for folks interested in structure/function and molecular/genetic projects) by considering how powerful the tools for mutagenesis and functional analysis are in both E. coli and Saccharomyces cerevisiae (henceforth ‘yeast’, with apologies to all those Schizosaccharomyces pombe researchers out there). My key argument is this: what if we re-phrased the genesis of the Phage Hunters project as “What are the most interesting sequences to determine/use cases where students would be applying tools of molecular biology?” and “What research could students do where the overall structure is ‘mass produced’, but results are nonetheless individual and informative (and require each investigator to analyze their finding(s)), and the body of results enlightens all participants?”

Following is a general approach to structure/function studies that could be applied to any gene product that has a phenotype in yeast (and recA+ E. coli, though I must leave details to those more familiar with it 🙂 ), noting that the project could be expanded still further because gene fusions and cleverness can bestow phenotypes on a vastly wider spectrum of processes still. I’ll close by listing general outlines for a few such systems that I’ve been thinking about (because I’m contemplating rolling up my sleeves and directing such projects in undergraduate labs and/or rolling back the clock and seeking Assistant Professorships).

Saturation mutagenesis: How hard, and why bother?

For my general case, I’m going to discuss a hypothetical protein and assume that it has a selectable phenotype (i.e. when it works, good things happen and those cells able to do those good things outlive/outgrow those that cannot). Further, let’s assume for now that the general issues of folding and stability are things we could communicate to students and that exploring the ‘folding space’ for a given protein is worthy work. In the examples at the end of this essay, I’ll give much more concrete discussions of proteins to study and what might be ‘interesting’ about each.

Yeast is the king of homologous recombination. Show it a free end of DNA and if there is a match in a plasmid or the genome, sequence replacement will happen. In other words, to do gene or part-of-a-gene replacement in yeast is essentially the same as introducing DNA. And there are myriad ways of achieving that. Here’s an electroporation protocol from yeaki. This capability allows for the following generic sequence of events (each of these would need to be done only once for an entire series of projects, and all are straightforward):

If gene being studied is non-essential, delete it from the yeast genome utterly. (If not, replace endogenous form with one that is tagged by changing sequence to contain recognizable elements such as restriction sites)
Clone gene under study onto one of the thousands of handy yeast plasmid vectors
Engineer a restriction site near any region of the gene where mutagenesis will be undertaken

In order to introduce/replace any region of a gene being introduced, the steps would be

Generate primers flanking the region to be studied by mutagenesis
Perform ‘error-prone PCR’ in order to create DNA fragments with a relatively high frequency of altered sequences relative to wild type
Co-transform the PCR products and plasmid DNA that encodes the gene under study, which has been cut at an engineered restriction site in a region overlapped by the PCR fragment. Published example here.
Select for transformants using a marker built in to the plasmid
Profit! OK, actually select for whatever phenotype represents an interesting outcome
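
A quick back-of-envelope sketch of why the ‘relatively high frequency’ in the error-prone PCR step matters. If we assume mutations land independently along the fragment, the number of changes per fragment is roughly Poisson-distributed, and we can ask what fraction of fragments carry zero, one, or several changes. The fragment length and error rate below are made-up illustrative numbers, not values from any particular protocol, and the function name is my own.

```python
import math

def mutation_load(fragment_bp, errors_per_kb):
    """Poisson model: return (P(no mutation), P(exactly one), P(two or more))."""
    lam = fragment_bp * errors_per_kb / 1000.0  # expected mutations per fragment
    p0 = math.exp(-lam)
    p1 = lam * math.exp(-lam)
    return p0, p1, 1.0 - p0 - p1

# Hypothetical numbers: a 300 bp target region, 2 errors per kb
p0, p1, p_multi = mutation_load(fragment_bp=300, errors_per_kb=2.0)
print(f"unmutated: {p0:.2f}, single: {p1:.2f}, multiple: {p_multi:.2f}")
```

Single changes are the easiest to interpret, so one would presumably tune conditions so that the ‘single’ class is well populated without the ‘multiple’ class taking over.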

To make my life easy, let’s consider the following case: our protein of interest is only necessary under conditions where yeast is growing on glycerol (again, the purpose here is to speak concretely, not to knock your socks off). So as long as we’re living on glucose, the lack of a functional version of our gene-under-study has no consequences. Secondly, let’s say that we would like to study protein stability, so we are hunting for proteins that function under low and normal temperatures, but fail to function at high temperature (such ‘conditional’ mutations are wonderful to work with, in that hunting under such conditions immediately excludes premature stop codons, failed recombination events, and a zillion other dead ends).

To identify our “grows at low, not at high” transformants, we simply

Take the plates we generated by transformation, which would have perhaps 200-300 yeast colonies per plate
Replica plate (or alternate figure) onto our glycerol-only media at LOW and HIGH temperatures
Look for colonies that grow up on LOW temp glycerol (our gene product is capable of function in cool weather) but not on HIGH temp glycerol (gene product falls apart when it gets hot out)
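
The scoring logic in the steps above is simple enough to write down, which might be handy if students record their plates in a shared spreadsheet. This is only a sketch: the colony labels and the function name are hypothetical, and it assumes each colony on the master plate can be matched to the same position on both replicas.

```python
def score_colonies(master, grew_low, grew_high):
    """Classify each master-plate colony by growth on LOW and HIGH temp glycerol."""
    low, high = set(grew_low), set(grew_high)
    calls = {}
    for colony in master:
        if colony in low and colony not in high:
            calls[colony] = "temperature-sensitive"  # the class we are hunting for
        elif colony in low and colony in high:
            calls[colony] = "functional"
        elif colony not in low and colony not in high:
            calls[colony] = "non-functional"
        else:
            calls[colony] = "recheck"  # grew HIGH but not LOW: unexpected in this screen
    return calls

# Hypothetical plate positions
master = ["A1", "A2", "A3", "B1", "B2"]
calls = score_colonies(master, grew_low=["A1", "A2", "B1"], grew_high=["A1"])
print([c for c, call in calls.items() if call == "temperature-sensitive"])
```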

Extracting sufficient DNA from yeast is trivial, and one can proceed from there to PCR amplify the DNA and send it out to sequence.

So formally, we’re now at a point of parallelism to the HHMI Phage Hunters–a question has been identified, we have a pool of DNAs that we anticipate will be different and therefore ‘interesting’ to learn more about, we have isolated them and engaged the sequencing facilities of the world. What should students do, mentally and experimentally, with the data above?

Form of the information

The sequences returned from the investigations will contain (primarily) single amino acid changes that have destabilized the protein at high temperatures, but allow it to function at lower ones. One potential branch is to ‘hand off’ the protein to undergraduates in upper division biochemistry labs for expression and analysis. Obviously, if the protein cannot be readily purified or engagingly analyzed this is a non-starter… but it’s part of the massive Venn diagram that could be generated around the question “what genes should be targeted in this approach.” Those that are interesting for and amenable to biochemical or structural analysis are certainly exciting candidates; below I will argue that almost all are suitable for further genetic analysis.

In and of themselves, the discovered mutations allow for thoughtful consideration by undergraduate investigators. In most forms of the proposal, proteins for which a crystal structure exists (or for which a structure for a close homolog is available) are vastly stronger candidates, so I’ll assume one exists here. Where in the structure does the change occur? How does the alteration change the presumed stability; can we explain phenotype from change in amino acid properties? Note that this is the basis for one key component of the Genetic Disease inquiry available at thinkBio, and so would dovetail smoothly with a similar curriculum. Further, hypotheses generated in this phase give rise to predictions for subsequent rounds of investigation by these or next-generation students.

Of course, a ‘trivial’ side effect of characterizing these mutants is that students would become increasingly fluent in interpreting DNA sequence data, translating DNA sequence to amino acid sequence, thinking in terms of reading frame and DNA => RNA => protein processes within the cell, comparing amino acid properties and considering their roles… all part of most Intro Bio curricula.
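
To make the ‘DNA sequence to amino acid change’ step concrete, here is a minimal sketch of what students would be doing by hand (or could script themselves). It uses the standard genetic code, assumes both sequences are already aligned, in frame, and the same length, and reports substitutions in the usual ‘G2V’ style. The sequences are made up, and the function names are my own.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (third base varies fastest)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate an in-frame coding sequence; trailing partial codons are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def amino_acid_changes(wild_type, mutant):
    """List substitutions such as 'G2V' (wild-type residue, position, mutant residue)."""
    wt, mut = translate(wild_type), translate(mutant)
    return [f"{a}{pos}{b}" for pos, (a, b) in enumerate(zip(wt, mut), start=1) if a != b]

wild_type = "ATGGGTCTGAAA"   # hypothetical fragment: M G L K
print(amino_acid_changes(wild_type, "ATGGTTCTGAAA"))  # GGT -> GTT changes a residue
print(amino_acid_changes(wild_type, "ATGGGCCTGAAA"))  # GGT -> GGC is silent
```

The second call is the useful teaching moment: a change in the DNA that leaves the protein untouched.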

Ownership and Community

Since the mutations available in any pool of PCR fragments would be random (though part of a finite conceptual pool), it is reasonable to expect that different groups would discover different mutations. This is especially true if the ‘mutation space’ includes dozens of mutants that confer the sought after characteristics and each investigating group followed up on only one or two. So there is a strong reason to believe that in most implementations, what the group working at bench #3 discovered would not be identical to that found by group #4. Even if they did find the same mutation, this result would be confirmatory and would begin an argument that there are only a small number of changes that can satisfy the selection criteria applied in the mutant hunt.

The statistical findings of the COMMUNITY would be of great importance in thinking about the overall question (again, for this example: “what positions contribute to the thermal stability of protein X”). Once the pool had grown sufficiently large to contain many duplicates, statistical analysis could be applied to determine whether the search was nearing ‘completion’ or to estimate the total set size–this in a ‘need to know’ context that might actually engage students in basic stats skills. Equally important, the hypothesis formed by any group about ‘their’ mutation makes predictions about what other changes are likely to turn up in the screen–instantly making the findings of all others engaged in similar pursuit interesting.
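
One way the ‘estimate the total set size’ idea could look in practice: ecologists face the same problem when estimating how many species live in a pond from repeated catches, and the bias-corrected Chao1 estimator is a common tool there. I’m borrowing it here as an illustration, not claiming it is the right statistic for a mutant screen (it does not account for some mutations being far more likely than others to arise by PCR error). The mutation names below are invented.

```python
from collections import Counter

def chao1(observations):
    """Bias-corrected Chao1 estimate of the total number of distinct mutations."""
    counts = Counter(observations)
    s_obs = len(counts)                                   # distinct mutations seen so far
    f1 = sum(1 for n in counts.values() if n == 1)        # seen exactly once
    f2 = sum(1 for n in counts.values() if n == 2)        # seen exactly twice
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical community database: one entry per independently isolated mutant
isolates = ["A12T", "L45P", "L45P", "L45P", "R90C", "S101F", "S101F", "D33N", "D33N"]
print(f"observed: {len(set(isolates))}, estimated total: {chao1(isolates):.1f}")
```

When nearly everything has been seen at least twice, the estimate collapses onto the observed count, which is the ‘we are nearing completion’ signal.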

Moving forward… because we’re making progress

As indicated in the preceding section, one of the outcomes of sufficient turns of the wheel would be that investigators would begin to saturate the mutation space–they would reach a point where most ‘new’ findings were increasingly duplicates of older ones. At this point, the relatively trivial question “What amino acid alterations cause phenotype X” is answered, though the much more ‘authentic’ question of “What distinguishes members of the ‘in’ club vs. those that we did not find, and that presumably do not confer the phenotype?” comes increasingly into focus. This question is open to every group that has contributed to the data set and wishes to think about it; discussion boards could knit the community and organizers could enjoy watching the ebb and flow of ideas.

Equally importantly, the end of Phase I constitutes the foundation for Phase II. Recall that one of the criteria for selecting the initial gene product was “it confers some selectable advantage on the cell.” We now have mutants of the protein that allow cell growth on glycerol at LOW temperature, but not HIGH. The exact same procedures and materials outlined above are now used to look for solutions to the problems generated by the first round of mutagenesis: we seek secondary mutations that restore function to our limping protein.

A few notes–first, to those asking “what is the evidence that such fixes are even possible?” The short answer is ‘nothing’, but the better answer is: recall that the first round mutants are only ‘a little bit broken’–they work at low temperature, just not at high. So it is reasonable to assume that many/most of our changes could be brought back to function with minor rather than major overhauls. Second, there is the question “certainly it is true that restoring the wild-type sequence will restore function?” This is true; part of the answer will have to come with experience: do we get solutions other than true wild type (point of genetic terminology: the ‘fixed because it’s wild type again’ is called a ‘revertant’; the ‘fixed in phenotype but by additional changes’ would be a ‘pseudorevertant’).

Again, this work is not embarked upon as a “let’s just do some stuff” approach; students’ hypotheses about why an initial mutation had a phenotype generate predictions about what changes will fix it.

Imagining concretely

This has gotten pretty abstract despite my best efforts, so let me paint a more concrete picture. Students do the suggested mutagenesis. As they populate the communal database with alterations, some of them notice that not only does their mutant put a larger hydrophobic residue in place of a smaller one in the hydrophobic core of the protein, but other positions nearby in the 3D structure do the same. The community hypothesis emerges that disrupting the core by disrupting the close packing of residues is a cause of the heat-induced loss of activity. Strong prediction: shrinking nearby members will in some cases restore the ability of the sidechains to fit together. So another bunch of students inherits the project at this point and isolates mutants with restored function at high temperature. Are these largely restricted to (or do they at least include) downsizing of hydrophobic sidechains in the neighborhood? If they do, are there other changes that clearly do not fall into that category? If so, by what mechanism do we hypothesize that those are acting? Are those hypotheses testable by further rounds of function => non-function => function cycling?
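
The ‘bigger in round one, smaller in round two’ prediction can be checked against a communal list of mutations with very little machinery. The sketch below uses approximate side-chain volumes (rounded, in cubic angstroms) for a handful of hydrophobic residues plus glycine; treat the numbers as illustrative and check them against a proper reference table before leaning on them. The mutation lists and function name are hypothetical.

```python
# Approximate residue volumes in cubic angstroms (illustrative, rounded values)
VOLUME = {"G": 60, "A": 89, "V": 140, "L": 167, "I": 167, "M": 163, "F": 190, "W": 228}

def size_change(mutation):
    """Classify a substitution like 'A45L' as 'larger', 'smaller', 'same', or 'unknown'."""
    wt, mut = mutation[0], mutation[-1]
    if wt not in VOLUME or mut not in VOLUME:
        return "unknown"  # residue not in our small hydrophobic table
    delta = VOLUME[mut] - VOLUME[wt]
    return "larger" if delta > 0 else "smaller" if delta < 0 else "same"

first_round = ["A45L", "V72F", "G30A"]    # hypothetical temperature-sensitive mutants
second_round = ["L48A", "F70V", "S12T"]   # hypothetical suppressors
print([size_change(m) for m in first_round])
print([size_change(m) for m in second_round])
```

Suppressors that come back ‘unknown’ or ‘larger’ are exactly the ones that would demand a second hypothesis.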

I’m not making this stuff up

Back in the day (the day being the mid-late 90s) I did this sort of work with Dictyostelium myosin. The initial question was “can we isolate cold-sensitive mutants of the myosin motor,” with the rationale being that it had recently been crystallized, and what was needed was novel forms of myosin that would be more stable at heretofore uncharacterized and/or ephemeral points in its mechanical cycle. Taking advantage of some unique phenotypic characteristics of the Dictyostelium/myosin system, I identified 19 such mutations. Several of these highlighted regions that had been identified by biochemists based on other techniques and other criteria, and I embarked on suppressor hunts for one of the mutants in particular, a change at position 680 from glycine to valine. This yielded 19 secondary alterations that restored function. As an assistant professor, I employed a number of undergraduates who pursued these same strategies, with each having a project defined by an initial mutation or cluster. They successfully isolated changes, extracted DNA, sequenced it, and brought interesting hypotheses to me for consideration. Alas, I failed to push the project hard enough or to the appropriate granting agency, and it ground slowly to a halt. But as far as proof-of-principle goes, it’s all there. And this was Dictyostelium discoideum–other than having wonderful myosin phenotypes, there is nothing in Dicty that isn’t 20x easier in yeast :-).

So… where do we begin?

In case anyone out there has their gears turning, I’m going to throw a little grist into the mill as I’ve been toying with the idea myself. As I mentioned above, some of the features of the ‘ideal gene product’ might include

has convenient, selectable phenotype
conditions exist under which gene product is NOT required (cells can be propagated without it)
ideally: can also easily identify DEFECT in function
methods exist for biochemical purification and characterization of key aspect(s)
crystal structure exists
biologically interesting; medically interesting
functions in yeast or has homolog in yeast

Some concrete pieces

This is mostly top-of-head stuff, or stolen from my recent proposals for research I would lead in a lab hosting undergraduates. If a project like this were to go Big as HHMI Phage Hunters has done, there should be a comment period where everyone strained their brain to meet as many of the criteria above as possible (and perhaps decided how many of the candidates to proceed with).

I personally love everything about the yeast mating type pathway. Yeast can live without sex, so no matter how defective any functional element is, no problem. And there are 2 sexes; what is required for the goose may be unnecessary for the gander. Further, one can select for the ability to have sex (short term: complementation; longer: each of two genomes has both a necessary feature and is lacking one–only the duo can live once appropriate conditions are applied, so it’s mate [fuse, in the case of yeast] or die) and one can select for cells that refuse to have sex (basically, a poison pill approach–have the prospective mates encode a protein that is potentially toxic, such as a membrane channel that passes a poison; only those that eschew mating opportunities will survive drug application).

Not only is the yeast mating pathway genetically interesting; there are some famous homologs in play. The mating pheromone produced by one of the cell types is exported by a transporter (Ste6) that is a homolog of the multi-drug resistance P-glycoprotein, mutants of which are the foundation of evolved drug resistance of many cancers. This pheromone is matured through addition of a lipid (farnesyl) group. The signaling pathway within a cell that informs it that mating partners are near involves a G-protein coupled receptor pathway. The mating pheromone not already mentioned is matured through proteolytic cleavage.

That’s one argument. In the Intro Bio teaching I have done, we talk about Hemoglobin a lot. Yeast makes a globin homolog, and if you expose yeast cells that lack their globin to enough nitric oxide, they die while their wild type counterparts live. What would the phenotypes of the dozens of hemoglobinopathies known to man be if transferred to this structurally similar protein? Wouldn’t it be interesting to see what secondary changes to structure could FIX those that lead to phenotypes in the yeast system? [Note: quite possible that there is too much distance between the systems for some mutations to have cognate phenotypes, but fun just to find out]

p53 expressed in yeast can be assayed for its transcriptional activities, and some dominant-negative alleles retain that aspect of their phenotype.

For fun, here’s a page entitled “The Yeast homologues of human diseases-associated genes” (note: this shouldn’t be taken to imply all of these have unique or useful or even any phenotype in yeast!)

It’s easy in yeast

Introducing DNA to yeast: electroporation protocol from yeaki.

Construction of mutagenized genes in vitro in yeast here.

To get a look at sequences of created/discovered mutants: yeast colony PCR protocol
