Modulation of Gene Expression Through DNA-Binding Proteins: Is There a Regulatory Code?

M. Beato 1 Hämatol. Bluttransf. Vol 29

1 Physiologisch-Chemisches Institut I der Philipps-Universität, Emil-Mannkopff-Str. 1-2, 3550 Marburg, FRG

Introduction

The information stored in the DNA of a fertilized egg can be divided into two classes: structural information, required for the synthesis of all macromolecules that build up the organism, and regulatory information, needed to modulate the expression of the structural information in time and space, that is, during the development of the different tissues. The connection between the two types of information is provided by regulatory macromolecules, which are themselves encoded in the structural information and regulate its expression through interaction with regulatory elements of the DNA, thus closing the information cycle (Fig. 1). The structural information is stored in the DNA in the form of the genetic code, which was unraveled in the 1960s. The signals for initiation and termination of transcription and translation, as well as the signals for RNA modification and splicing, are also part of the structural information. By contrast, little is known about the molecular mechanisms by which regulatory information is stored in the DNA. The general idea, however, is that recognition of specific features of the DNA molecule by regulatory DNA-binding macromolecules is essential for regulation. What exactly is recognized on the DNA and how the interaction modulates gene expression are the questions to be answered. During the past decade, several DNA-binding regulatory proteins from prokaryotes have been purified to homogeneity, and their structure as well as their interaction with DNA have been studied in great detail. A comparison of the amino acid sequences of 13 DNA-binding regulatory proteins reveals two regions of homology overlapping the known DNA-binding domains (Fig. 2; [1, 2]). Interestingly, mutations that disturb the binding of the lac repressor to the operator are clustered around these two regions [1].
The secondary, tertiary, and quaternary structures of several DNA-binding regulatory proteins from bacteria and bacteriophages exhibit striking similarities in their DNA-binding domains [2]. Not only are these proteins symmetric dimers or tetramers, but they also contain a pair of twofold-related α-helices connected by a β-turn that are responsible for most of the contacts with the B-form of the DNA double helix.

Fig. 1. The information cycle

Fig. 2. Regions of homology among 13 prokaryotic DNA-binding regulatory proteins

One α-helix fits into the major groove of the DNA while the other lies across it, holding it in position. Viewed along their longitudinal axis, the relevant α-helices show a clear polarity in the orientation of the amino acid side chains: the nonpolar side chains are oriented toward one side of the α-helix, whereas the polar and charged side chains are oriented toward the other side. The latter would be the side that contacts the DNA major groove. This brief summary of the structure of prokaryotic regulatory proteins suggests that a basic protein structure has arisen in evolution that can fulfill the requirements for DNA recognition. The actual function of a particular regulatory protein may depend on other domains of the protein that mediate the interaction with different modulator molecules. The DNA sequences recognized by the regulatory proteins also show considerable homology. Two types of conserved sequences can be derived from a comparison of 23 sites recognized by 13 regulatory proteins [1].

I. TGTGT N6-10 ACACA
II. CAC N5-10 GTG
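
A site with twofold rotational symmetry reads the same on both strands, that is, its second conserved block is the reverse complement of the first. The following is a minimal sketch that checks this property for the two consensus sequences listed above. The function names and the use of "N" for an unspecified spacer base are illustrative choices, not part of the original analysis.

```python
# Complement table for the four DNA bases; "N" (any base) pairs with "N".
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def has_twofold_symmetry(left_block: str, right_block: str) -> bool:
    """True if the right block is the reverse complement of the left block."""
    return reverse_complement(left_block) == right_block.upper()


def build_site(left_block: str, spacer_length: int, right_block: str) -> str:
    """Join two conserved blocks with a spacer of unspecified bases (N)."""
    return left_block + "N" * spacer_length + right_block


if __name__ == "__main__":
    # Consensus I and II from the text; the spacer length of 8 is one
    # arbitrary value within the stated N6-10 range.
    print(has_twofold_symmetry("TGTGT", "ACACA"))
    print(has_twofold_symmetry("CAC", "GTG"))
    site = build_site("TGTGT", 8, "ACACA")
    print(site, reverse_complement(site) == site)
```

Because the spacer is written as N, the check says nothing about symmetry within the variable region; it only confirms that the conserved blocks are related by the twofold axis.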

Both consensus sequences show a twofold rotational symmetry, as expected for DNA sites recognized by dimeric or tetrameric proteins. They are also similar in that each is composed of two short blocks (3-5 base pairs) of well-conserved nucleotides separated by a more variable region (7-8 base pairs on average). This structure of the binding sites is compatible with a model in which the regulatory proteins contact the DNA from one side only and interact with two consecutive turns of the double helix (see later discussion). Most of the mutations that prevent binding of a particular regulatory protein to its binding site are located within the strictly conserved regions. It is striking that the homology between different binding sites for the same regulatory protein is not necessarily better than the homology between sites for different proteins, independent of whether they function as positive or negative modulators of transcription. In fact, the cyclic AMP receptor protein (CAP) of Escherichia coli can bind not only to its own sites in the regulated promoters, but also to the lac and ara operators [3, 4]. Thus, it appears that the mechanism by which regulatory proteins recognize their binding sites on DNA is similar regardless of the functional consequences of the interaction. In higher organisms, several DNA-binding regulatory proteins have been described. The best characterized are probably the T antigens of DNA tumor viruses such as SV40 and polyoma. The behavior of these proteins is reminiscent of that found in the repressor systems of lambda bacteriophages. By binding to three adjacent sites on the DNA, they can act as inhibitors of transcription from the early promoter or as activators of the late promoter [5]. I will concentrate on another group of regulatory proteins that have been extensively studied in our and other laboratories during the past 20 years, namely the receptors for steroid hormones.
It is now well established that steroid hormones exert their effects on gene expression

Fig. 3a-c. Structure of the glucocorticoid-binding sites of MMTV and hMTIIA. Computer graphic representation of the DNA double helix containing the nucleotide sequences of a MMTVI; b MMTVIIA; and c hMTIIA (shown in Fig. 5). The sites of contact with the receptor are indicated by open triangles. Those positions hypermethylated in the presence of receptor are marked by full triangles. The receptor molecules are represented as broken circles. Numbers refer to the distance from the "cap" site

through interaction with intracellular receptors, which in turn recognize regulatory elements in the neighborhood of the regulated promoters. Regulatory elements are defined as DNA sequences that, in addition to being required for receptor binding, are needed for the hormonal regulation of transcription in gene transfer experiments. They were first reported in the long terminal repeat (LTR) region of mouse mammary tumor virus (MMTV), which contains the main promoter for proviral transcription [6-8]. Glucocorticoids were known to induce viral transcription in different cell lines [9], and gene transfer experiments with deletion mutants in the LTR region showed that the sequences relevant for hormonal regulation are located between 50 and 400 base pairs upstream of the initiation of transcription [7, 10-12]. Within this region, several binding sites for the glucocorticoid receptor of rat liver have been described [8, 13]. Using a cloned proviral DNA from GR mice [8], we found four binding sites that share the hexanucleotide

5'-TGTTCT-3'
3'-ACAAGA-5'
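
Since receptor-binding sites occur in either orientation (Fig. 4), a search for this hexanucleotide has to consider both strands. The following is a minimal sketch of such a search. The function names, the 0-based coordinate convention, and the example sequence are illustrative assumptions; the example sequence is invented and is not taken from any of the genes discussed here.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
HEXANUCLEOTIDE = "TGTTCT"  # upper-strand sequence given in the text


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def find_motif_both_strands(sequence: str, motif: str = HEXANUCLEOTIDE):
    """List (position, strand) for every occurrence of the motif.

    position is the 0-based index, on the upper strand, of the leftmost
    base of the match; strand is "+" when the motif reads 5' to 3' on the
    upper strand and "-" when it reads 5' to 3' on the lower strand.
    """
    sequence = sequence.upper()
    motif = motif.upper()
    motif_rc = reverse_complement(motif)
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        if window == motif:
            hits.append((start, "+"))
        if window == motif_rc and motif_rc != motif:
            hits.append((start, "-"))
    return hits


if __name__ == "__main__":
    # Hypothetical test sequence containing one site in each orientation.
    example = "CCTGTTCTAAAGAACAGG"
    print(find_motif_both_strands(example))
```

An exact-match search of this kind only locates candidate sites; whether the receptor actually binds there has to be established experimentally, as in the footprinting and methylation protection studies cited in the text.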

Methylation protection studies have shown that both G residues in the hexanucleotide (one on each strand) are in direct contact with the receptor [14]. In the binding site with the highest affinity for the receptor, further contacts are located in both strands 9-10 base pairs upstream of the hexanucleotide. These findings suggest an interaction of a dimer of the receptor with one side of the double-stranded DNA involving the major groove in two successive turns of the helix [14]. Such a model (Fig. 3a) is very similar to

Fig. 4. Position and orientation of the binding sites for the glucocorticoid receptor in hormonally regulated genes. The binding sites are indicated by open boxes. The horizontal arrows show the orientation: to the right for the upper strand; to the left for the lower strand. The abbreviations are as follows: MMTV-LTR long terminal repeat region of mouse mammary tumor virus; hMT-IIA human metallothioneine IIA; chLYS chicken lysozyme; hGH-I human growth hormone; rUG rabbit uteroglobin; hPOMC human proopiomelanocortin; chVIT chicken vitellogenin; chOV chicken ovalbumin; rPC3(1) rat prostatic protein C3(1); pSC7 Drosophila inducible gene at locus 74F; rTO rat tryptophan oxygenase; CAP initiation of transcription; PR progesterone receptor; ER estrogen receptor; AR androgen receptor; ER ecdysone receptor

that already mentioned for prokaryotic DNA-binding regulatory proteins. An analysis of other glucocorticoid-regulated genes showed that the presence of a regulatory element is not an exclusive property of the retroviral genome. The human metallothioneine IIA gene (hMTIIA), which has been shown to be induced by glucocorticoids in many different cell lines, contains a glucocorticoid regulatory element about 250 base pairs upstream of the initiation of transcription [15]. This element is very similar to the strong binding site found in the LTR region of MMTV (compare a and c in Fig. 3). In addition, there is a weak binding site in the hMTIIA promoter located around 320 base pairs upstream of the initiation of transcription [15]. As with the weak binding site in the LTR region of MMTV (Fig. 3b), the shorter footprint and methylation protection pattern in the weak binding site of hMTIIA suggest binding of a receptor monomer. Interestingly, this weak site at -320 can be deleted without influencing the hormonal inducibility of hMTIIA [15]. Thus, it could be that a functional interaction requires binding of a receptor dimer to a strong site on the DNA. In the meantime, we have identified binding sites for the glucocorticoid receptor in several hormonally regulated genes. A summary of these results along with data from the literature is shown in Fig. 4. The promoter of the chicken lysozyme gene (chLYS) contains two binding sites for the glucocorticoid receptor, located around 180 and 60 base pairs upstream of the initiation of transcription [16]. The upper binding site, which has a lower affinity for the glucocorticoid receptor, coincides with sequences required for hormone-dependent expression of the gene in oviduct cells [16]. In fact, these sequences mediate not only glucocorticoid regulation, but also induction by progesterone in microinjection experiments [16].
Interestingly, the partially purified progesterone receptor from rabbit uterus binds to the same sites as the glucocorticoid receptor, although with different affinity. Thus, it appears that the binding sites for the receptors of two different steroid hormones may be identical or at least share common sequences. That these similarities may not be limited to the progesterone and glucocorticoid receptors is suggested by studies with genes regulated by other steroid hormones (Fig. 4). The chicken vitellogenin II gene, which is induced by estrogens in the liver, contains a binding site for the estrogen receptor around 600 nucleotides upstream of the transcription initiation site [17]. An analysis of the nucleotide sequences in this region reveals an element almost identical to the binding sites for the glucocorticoid receptor (Fig. 5). A review of the literature showed that a rat gene for a prostatic protein that is

Fig. 5. Consensus sequence for the glucocorticoid regulatory element. The nucleotide sequences of the main binding sites for the glucocorticoid receptor are aligned to yield maximal homology. Abbreviations are as in Fig. 4

known to be induced by androgens, rPC3(1), also contains a sequence homologous to the binding site for the glucocorticoid receptor some 140 nucleotides upstream of the initiation of transcription ([18]; Figs. 4 and 5). Finally, an ecdysone-inducible gene of Drosophila (pSC7) also contains a binding site for the glucocorticoid receptor some 330 nucleotides upstream of the transcription initiation site ([19]; Figs. 4 and 5). These findings, taken together, suggest that the regulatory elements for different steroid hormone receptors may be similar or at least overlap. The rabbit uteroglobin gene is induced by glucocorticoids in the lung and by estrogen and progesterone in the endometrium [20]. We have looked for binding sites for the glucocorticoid receptor and found none in the neighborhood of the promoter. The closest binding region detected is located 2700 nucleotides upstream of the cap site and is composed of three binding sites showing sequence homology to other glucocorticoid regulatory elements (Figs. 4 and 5). That this site may be relevant for regulation in vivo is suggested by the finding of a DNase I hypersensitive site in this region only in chromatin of hormonally stimulated endometrium (unpublished results). The human growth hormone gene (hGHI) is induced by glucocorticoids in several cell lines [21]. In fact, gene transfer experiments with a chimeric gene suggested that a fragment of DNA containing 500 base pairs upstream of the initiation of transcription is sufficient for hormonal regulation [22]. In binding experiments with the glucocorticoid receptor, however, we found a main binding site located around position +100, within the first intron (Figs. 4 and 5). If this site is involved in transcriptional regulation in vivo, it would mean that the regulatory element can act even when located downstream of the regulated promoter.
Taken together, the data in Fig. 4 show that the regulatory elements for steroid hormones share some of the properties of the so-called enhancer elements [23]. They can act at variable distances from the regulated promoters, both upstream and downstream, and in both orientations. There is in fact direct experimental evidence for an enhancer function of the glucocorticoid regulatory element in the LTR region of MMTV [7]. A comparison of the nucleotide sequences of ten different binding sites for the glucocorticoid receptor yields the consensus sequence shown in Fig. 5. Thus, the glucocorticoid regulatory elements have been conserved in evolution between chicken, rodents, and humans. The best-conserved regions include all those sites that are involved in direct contacts with the receptor [14]. The symmetry in the element

     4 5 6 7           12 13 14 15
5'   A A C A   N8-11    T  G  T  T   3'
T G

is reminiscent of the binding sites for prokaryotic regulatory proteins, suggesting that molecular mechanisms similar to those operating in bacteria may be responsible for DNA recognition in higher organisms. What could this mechanism be? And how can a regulatory protein accommodate so much sequence variation in the central part of the recognition site? A model like the one shown in Fig. 3 requires the binding sequence to be preserved only in the two nucleotide blocks that are the sites of contact between the relevant α-helices and the major groove of the double helix. This would explain the tolerance in the central part of the element, but what kind of interactions take place in the conserved regions? Certainly most of the overall energy of binding is sequence independent and originates from ionic interactions with the phosphate backbone of the helix [2, 24]. This explains why all DNA-binding regulatory proteins also interact nonspecifically with DNA. In addition, specific base recognition is based on a complementary network of hydrogen bonds between amino acid side chains in the relevant α-helices and DNA base pair atoms exposed in the major groove of the double helix [24]. In fact, several amino acid side chains, such as those of Arg, Lys, Gln, and Asn, can form multiple hydrogen bonds with paired bases on the DNA [25]. It has been proposed that if the regulatory protein moves a few angstroms away from the DNA, most of these hydrogen bonds would be broken or would not be formed, but many of the ionic interactions would be preserved. This mechanism may be utilized by the proteins for sliding along the DNA in search of their target sites [26]. If we consider the base pairs in the major groove in terms of their ability to form hydrogen bonds, we realize that an AT

Fig. 6. Pattern of hydrogen bond donor-acceptor sites in the major groove of the DNA double helix in and around the receptor-binding sites. The conserved nucleotide sequences of twelve binding sites for the glucocorticoid receptor, with the flanking base pairs on each side, have been analyzed for the pattern of hydrogen bond donor-acceptor sites in the major groove. Only those positions showing more than 90% conservation are shown (open circles acceptor sites; full circles donor sites). Arrows point to the conserved N-7 positions of guanines that represent sites of contact with the receptor [14]

base pair has the structure acceptor-donor-acceptor and is therefore symmetric, whereas a GC base pair has the structure acceptor-acceptor-donor (Fig. 6). If one now compares the ten glucocorticoid receptor binding sites with their flanking sequences in terms of this hydrogen bond pattern, one observes a very good preservation of the donor-acceptor structure around the contact sites, with very little agreement outside the binding region (Fig. 6). A certain symmetry can be detected centered at position 10: two well-preserved blocks, 3 to 8 and 12 to 17, separated by less-preserved positions and interrupted at the symmetric positions 5 and 15. Of course, other interactions are probably implicated in recognition, but the network of hydrogen bonds seems to be an essential part of the code in which regulatory information is stored in DNA. A precise understanding of the molecular mechanisms by which the regulatory code is read could derive from the fine structural analysis of cocrystals containing the DNA-binding domains of regulatory proteins bound to the corresponding nucleotide sequences [27, 28]. Only then will it be possible to decide whether there is a general rule underlying the mechanism of sequence-specific recognition by regulatory proteins.
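
The comparison of binding sites in terms of their hydrogen bond pattern can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the patterns for AT and GC pairs are those given in the text, the pattern for a CG pair is assumed to be the GC pattern read in the opposite direction, and the function names and the short test sites are hypothetical and are not the sequences analyzed in Fig. 6.

```python
from collections import Counter

# Major-groove pattern across each base pair, starting from the base on the
# strand being read. A = hydrogen bond acceptor, D = hydrogen bond donor.
GROOVE_PATTERN = {
    "A": "ADA",  # A.T pair: acceptor-donor-acceptor (from the text)
    "T": "ADA",  # T.A pair: identical, because the pattern is symmetric
    "G": "AAD",  # G.C pair: acceptor-acceptor-donor (from the text)
    "C": "DAA",  # C.G pair: assumed to be the G.C pattern reversed
}


def groove_pattern(seq: str) -> str:
    """Translate a DNA sequence into its donor/acceptor pattern."""
    return "".join(GROOVE_PATTERN[base] for base in seq.upper())


def pattern_conservation(sites):
    """Fraction of sites sharing the most common symbol at each position.

    The sites must already be aligned and of equal length.
    """
    patterns = [groove_pattern(site) for site in sites]
    if len({len(p) for p in patterns}) != 1:
        raise ValueError("sites must be non-empty and of equal length")
    fractions = []
    for column in zip(*patterns):
        top_count = Counter(column).most_common(1)[0][1]
        fractions.append(top_count / len(patterns))
    return fractions


if __name__ == "__main__":
    # Two hypothetical sites that differ by an A/T exchange at position 1.
    print(groove_pattern("TGTTCT"))
    print(groove_pattern("AGTTCT"))
    print(pattern_conservation(["TGTTCT", "AGTTCT"]))
```

In this representation an exchange of A for T leaves the pattern unchanged, which illustrates how the donor-acceptor structure can be better preserved than the nucleotide sequence itself.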

Acknowledgments.

The experimental work reviewed here has been supported by grants from the Deutsche Forschungsgemeinschaft and the Fonds der Chemischen Industrie.

References

1. Gicquel-Sanzey B, Cossart P (1982) EMBO J 1:591-595

2. Takeda Y, Ohlendorf DH, Anderson WF, Matthews BW (1983) Science 221:1020-1026

3. Ogden S, Haggerty D, Stoner CM, Kolodrubetz D, Schleif R (1980) Proc Natl Acad Sci USA 77:3346-3350

4. Schmitz A (1981) Nucleic Acids Res 9:277-291

5. Tjian R (1978) Cell 13:165-179

6. Geisse S, Scheidereit C, Westphal HM, Hynes NE, Groner B, Beato M (1982) EMBO J 1:1613-1619

7. Chandler VL, Maler BA, Yamamoto KR (1983) Cell 33:489-499

8. Scheidereit C, Geisse S, Westphal HM, Beato M (1983) Nature 304:749-752

9. Ringold GM (1979) Biochim Biophys Acta 560:487-508

10. Hynes NH, van Ooyen AJJ, Kennedy N, Herrlich P, Ponta H, Groner B (1983) Proc Natl Acad Sci USA 80:3637-3641

11. Majors J, Varmus HE (1983) Proc Natl Acad Sci USA 80:5866-5870

12. Buetti E, Diggelmann H (1983) EMBO J 2:1423-1429

13. Payvar F, deFranco DF, Firestone GL, Edgar B, Wrange Ö, Okret S, Gustafsson JA, Yamamoto KR (1983) Cell 35:381-392

14. Scheidereit C, Beato M (1984) Proc Natl Acad Sci USA 81:3029-3033

15. Karin M, Haslinger A, Holtgreve H, Richards RI, Krauter P, Westphal HM, Beato M (1984) Nature 308:513-519

16. Renkawitz R, Schütz G, von der Ahe D, Beato M (1984) Cell 37:503-510

17. Jost JP, Seldran M, Geiser M (1984) Proc Natl Acad Sci USA 81:429-433

18. Parker M, Hurst H, Page M (1984) J Steroid Biochem 20:67-71

19. Moritz T, Edström JE, Pongs O (1984) EMBO J 3:289-295

20. Beato M, Arnemann J, Menne C, Müller H, Suske G, Wenz M (1983) In: McKerns KW (ed) Regulation of gene expression by hormones. Plenum, New York, pp 151-175

21. Martial JA, Baxter JD, Goodman HM, Seeburg PH (1977) Proc Natl Acad Sci USA 74:1816-1820

22. Robins DM, Paek I, Seeburg P, Axel R (1982) Cell 29:623-631

23. Banerji J, Rusconi S, Schaffner W (1981) Cell 27:299-308

24. Seeman NC, Rosenberg JM, Rich A (1979) Proc Natl Acad Sci USA 72:804-808

25. Rein R, Kieber-Emmons T, Haydock K, Garduno-Juarez R, Shibata M (1983) J Biomol Struct Dyn 1:1051-1079

26. Berg OG, Winter RB, von Hippel PH (1981) Biochemistry 20:6929-6948

27. Anderson J, Ptashne M, Harrison SC (1984) Proc Natl Acad Sci USA 81:1307-1311

28. Frederick CA, Grable J, Melia M, Samudzi C, Jen-Jacobson L, Wang BC, Greene P, Boyer HW, Rosenberg JM (1984) Nature 309:327-331