In molecular biology, a CCAAT box (also written CAAT box) is a distinct pattern of nucleotides with the consensus sequence GGCCAATCT that occurs 60–100 bases upstream of the transcription start site. The CAAT box signals the binding site for the RNA transcription factor and is typically accompanied by a conserved consensus sequence. It is an invariant DNA sequence located about 70 base pairs upstream of the origin of transcription in many eukaryotic promoters. Genes that have this element seem to require it in order to be transcribed in sufficient quantities. It is frequently absent from genes that encode proteins used in virtually all cells. This box, along with the GC box, is known for binding general transcription factors. Both of these consensus sequences belong to the regulatory promoter. Full gene expression occurs when transcription activator proteins bind to each module within the regulatory promoter. Activation of the CCAAT box requires binding by specific proteins, known as CCAAT box binding proteins or CCAAT box binding factors. A CCAAT box is frequently found before eukaryotic coding regions but is not found in prokaryotes.
Consensus sequence
In the direction of transcription of the template strand, the consensus sequence (the calculated order of the most frequent residues) for the CAAT box was 3'-TG ATTGG -5'. The use of parentheses denotes that either base is present, but their relative frequencies are not specified. For example, "" would mean that either thymine or cytosine is preferentially selected for. Within metazoa, the core binding factor (CBF)-DNA complex retains a high degree of conservation within the CCAAT binding motif, as well as in the sequences flanking this pentameric motif. The CCAAT motif in plants differs slightly from that in metazoa in that it is actually a CAAT binding motif: the promoter lacks one of the two C residues of the pentameric motif, and the artificial addition of the second C has no significant effect on binding activity. Some sequences lack the CAAT box completely. In addition, the surrounding nucleotides in plants do not match the consensus sequence above, which was determined by Bi et al.
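The idea of a consensus as "the calculated order of the most frequent residues" can be illustrated with a minimal Python sketch. The function name `consensus` and the aligned sites below are hypothetical and invented purely for illustration; they are not data from Bi et al. or from any real promoter set.

```python
from collections import Counter


def consensus(aligned_sites):
    """Return the most frequent residue at each column of equal-length sites."""
    if not aligned_sites:
        raise ValueError("need at least one site")
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    residues = []
    for column in zip(*aligned_sites):
        # Count residues in this column and keep the most frequent one.
        # On a tie, Counter keeps the residue that was seen first.
        residues.append(Counter(column).most_common(1)[0][0])
    return "".join(residues)


# Hypothetical aligned binding sites (assumption, for illustration only).
toy_sites = ["GGCCAATCT", "AGCCAATCA", "GGCCAATGT", "GACCAATCT"]
toy_consensus = consensus(toy_sites)
print(toy_consensus)
```

A plain majority vote like this discards the information that two bases may be nearly equally frequent at a position, which is why consensus notations often list alternative bases at a single position.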
Core promoter
The CAAT box is described as a core promoter element. The core promoter, also known as the basal promoter or simply the promoter, is a region of DNA that initiates transcription of a particular gene. For the CAAT box, this region is located about 60–100 bases upstream of the transcription start site of a eukaryotic gene, and no less than 27 base pairs away from it; here a complex of general transcription factors binds with RNA polymerase II prior to the initiation of transcription. It is essential for transcription that the core binding factors (CBFs) are able to bind to the CCAAT motif. Experiments in many laboratories have shown that mutations to the CCAAT motif that cause a loss of CBF binding also decrease transcriptional activity in these promoters, suggesting that CBF-CCAAT complexes are essential for optimum transcriptional activity.
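As a rough illustration of the positional description above, the following Python sketch searches a promoter sequence for the CCAAT pentamer (or its reverse complement, ATTGG) within 60–100 bases upstream of the transcription start site. The function names, the convention of measuring distance from the first base of the motif, and the toy sequence are all assumptions made for this example, not an established tool or real genomic data.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_ccaat_boxes(sequence, tss_index, window=(60, 100)):
    """Find CCAAT pentamers upstream of the transcription start site (TSS).

    sequence  -- coding strand, written 5' to 3'
    tss_index -- 0-based index of the TSS within `sequence`
    window    -- (nearest, farthest) allowed distance upstream, in bases
    Returns sorted (distance, strand) tuples; distance is counted from the
    first base of the motif to the TSS (an assumed convention).
    """
    seq = sequence.upper()
    nearest, farthest = window
    motifs = (("CCAAT", "+"), (reverse_complement("CCAAT"), "-"))
    hits = []
    for motif, strand in motifs:
        start = seq.find(motif)
        while start != -1:
            distance = tss_index - start
            if nearest <= distance <= farthest:
                hits.append((distance, strand))
            start = seq.find(motif, start + 1)
    return sorted(hits)


# Toy promoter (assumption): filler bases around the GGCCAATCT consensus.
toy_upstream = "T" * 30 + "GGCCAATCT" + "T" * 70
toy_promoter = toy_upstream + "ATGAAA"
toy_tss = len(toy_upstream)
toy_hits = find_ccaat_boxes(toy_promoter, toy_tss)
print(toy_hits)
```

An exact pentamer match is only a first-pass filter; it says nothing about whether the flanking bases favour binding by the core binding factors.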
Binding
In an experiment with core binding factor (CBF)-DNA complexes, researchers determined the preferred promoter sequences in the region over and immediately adjacent to the CAAT box, and in two regions on either side of it. Using a PCR-mediated random binding selection process, they showed that the sequence "3' - G ATTGG - 5'", which includes the positions immediately flanking the ATTGG region, was preferentially selected on the coding strand. This was shown using an oligonucleotide that contained 27 random nucleotides flanked by a defined 20-nucleotide sequence on each side. While no single nucleotide was selected in every clone on either side of the ATTGG motif, several positions showed nucleotides selected with high frequency. Most notable in the sequence above was the G residue towards the 5' end of the ATTGG. The other residues listed were also notable, but at those positions selection was split between two residues. The same experiment yielded the same sequence when using a different oligonucleotide that contained an ATTGG core flanked by 12 random nucleotides on the 5' side and 10 random nucleotides on the 3' side. The two sequences are very similar and were confirmed in multiple experiments. Flanking the ATTGG motif with two adenine residues on its 5' end and a G on its 3' end seems to have inhibited formation of the CBF-DNA complex, and this arrangement occurred in only 1% of the promoter sequences. Another experiment examined the major late promoter (MLP) of adenoviruses from a variety of host species, looking at mutation of the CAAT box and CCAAT sequence, which is thought to play a pivotal role in the MLP of subgroup C human adenoviruses, in species with a deficient CAAT sequence. Transcription initiation at mutant MLP species was significantly reduced compared with that of the wild type or species in which there was a CAAT mutant.
The failure to restore the normally functional adenoviruses, exhibited by a CAAT box, is consistent with the idea that the CAAT box plays a vital role in the adenovirus MLP and is preferred over other transcriptional elements.
CCAAT in plants
The core binding factors, also called nuclear factors (NF-Y), are composed of three subunits: NF-YA, NF-YB, and NF-YC. Whereas in animals each NF-Y subunit is encoded by a single gene, in plants the subunits have diversified in both structure and function, and each subunit family has between eight and 39 members. This diversification is largely the result of gene duplications and tandem duplications, which have contributed to the larger NF-Y family sizes in plants compared with the singly encoded animal nuclear factors. Each subunit contains an evolutionarily conserved part (the C-terminal region of NF-YA, the central part of NF-YB, and the N-terminal region of NF-YC), and more than 70% of each of these regions remains conserved across species. Neighboring regions, however, are generally not conserved.
NF-YA subunit
The NF-YA family encodes transcription factors that are variable in length. NF-YA proteins are generally characterized by two domains that are strongly conserved in all higher eukaryotes investigated to date. The first domain (A1) contains 20 amino acids that form an alpha helix, which appears to be important for interactions with NF-YB and NF-YC. The second domain (A2), joined to the A1 domain by a conserved linker sequence, is a stretch of 21 amino acids that is vital for specific binding of DNA at the CCAAT box. The A1 and A2 domains lie towards the C-terminus in mammals but occupy a more central region in plant NF-YA subunits. In plants, the NF-YA subunit has evolved to regulate the development of a facultative root organ only present in leguminous plants, and it has been shown to be expressed in root tissue. NF-YA has also been linked to drought resistance: it is upregulated during drought stress in the roots and leaves of Arabidopsis, NF-YA mutants have shown a loss of function and hypersensitivity to drought-like conditions, and, conversely, overexpression of NF-YA has resulted in drought resistance.
NF-YB subunit
Like NF-YA, the NF-YB family is variable in length, but its members are on average much smaller than NF-YA subunits. NF-YB proteins have a structure and amino acid composition similar to the histone fold motif (HFM), which is composed of three alpha helices separated by two beta strand-loop domains. Like NF-YA, NF-YB has been shown to improve drought resistance when overexpressed; it has also been shown to promote flowering in Arabidopsis.
NF-YC subunit
NF-YC proteins are intermediate in size between NF-YA and NF-YB proteins and also contain the histone fold motif (HFM) that is prevalent in NF-YB proteins. NF-YC has also been shown to be involved in flowering time in certain plants, where its influence is potentially regulated by the binding of the protein CONSTANS to the NF-YC subunit.
NF-Y complexes
Because of the evolutionary diversification of NF-Y encoding genes in plants, plants have a large range of potential trimeric complexes. For example, in Arabidopsis, 36 NF-Y transcription factor subunits have been identified, which could theoretically form 1690 unique complexes. This number is higher than the number of complexes that actually form, since some subunits have specific binding patterns. Functional analyses of NF-Y encoding genes in plants have shown that, as a result of their evolutionary diversification relative to their animal counterparts, they have acquired diverse specific functions in processes such as embryo development, flowering time control, ER stress, drought stress, and nodule and root development. This may be only a small portion of their capabilities, since the number of theoretical combinations of NF-Y complexes is so large and only a small portion can actually be formed.
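The figure of 1690 theoretical complexes can be reproduced with a short calculation, since a trimer takes one member from each of the three subunit families. The split of the 36 subunits used below (10 NF-YA, 13 NF-YB, 13 NF-YC) is an assumption: the text gives only the totals, and this split is chosen because it is consistent with both of them. The subunit labels produced by `enumerate_trimers` are likewise hypothetical placeholders, not real gene names.

```python
from itertools import product

# Assumed family sizes (not stated in the text): 10 + 13 + 13 = 36 subunits.
ASSUMED_FAMILY_SIZES = {"NF-YA": 10, "NF-YB": 13, "NF-YC": 13}


def count_trimers(family_sizes):
    """Number of trimers with exactly one subunit from each family."""
    total = 1
    for size in family_sizes.values():
        total *= size
    return total


def enumerate_trimers(family_sizes):
    """List every theoretical trimer as a tuple of placeholder labels."""
    families = [
        [f"{name}{index}" for index in range(1, size + 1)]
        for name, size in family_sizes.items()
    ]
    return list(product(*families))


total_subunits = sum(ASSUMED_FAMILY_SIZES.values())
total_trimers = count_trimers(ASSUMED_FAMILY_SIZES)
print(total_subunits, total_trimers)
```

This count treats every combination as equally possible; the number of complexes that actually form is smaller because some subunits have specific binding patterns.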
CCAAT enhancer binding proteins (C/EBPs)
Another aspect of the CCAAT binding motif is the CCAAT/enhancer binding proteins (C/EBPs). They are a group of six highly conserved transcription factors that bind to the CCAAT motif. While research on these binding proteins is relatively recent, they have been shown to have vital roles in cellular proliferation and differentiation, metabolism, inflammation, and immunity in various cells, particularly hepatocytes, adipocytes, and hematopoietic cells. For example, in adipocytes this has been shown in a variety of experiments with mice: ectopic expression of these C/EBPs was able to initiate the differentiation program of the cell (the differentiation of preadipocytes to adipocytes), even in the absence of adipogenic hormones. In addition, an overabundance of these C/EBPs causes an accelerated response. Furthermore, cells lacking C/EBP and C/EBP-deficient mice are both unable to undergo adipogenesis, which results in the mice dying from hypoglycemia or in reduced lipid accumulation in adipose tissue. The C/EBPs share a basic-leucine zipper domain at the C-terminus and are able to form dimers with other C/EBPs or with other transcription factors. This dimerization allows the C/EBPs to bind specifically to DNA through a palindromic sequence in the major groove of DNA. They are regulated through various means, including hormones, mitogens, cytokines, nutrients, and other factors.
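In DNA, a "palindromic" site is one that reads the same 5' to 3' on both strands, that is, one that equals its own reverse complement; this symmetry is what lets each half of a dimer contact an equivalent half-site. A minimal Python sketch of that check follows. The function names and the test sequences are assumptions chosen for illustration and are not taken from the text.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def is_reverse_palindrome(site):
    """True if the site equals its own reverse complement."""
    site = site.upper()
    return site == reverse_complement(site)


# Illustrative inputs (assumptions): two symmetric sites and one asymmetric one.
for candidate in ("ATTGCGCAAT", "GAATTC", "GGCCAATCT"):
    print(candidate, is_reverse_palindrome(candidate))
```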