Causal Modelling in Fertility Research: A Review of the Literature and an Application to a Parental Leave Policy Reform

This paper reviews empirical studies that have examined the causal determinants of fertility behaviour. In particular, we compare the approaches adopted in the different disciplines to improve our understanding of how birth dynamics are influenced by changes in female employment and changes in family policies. The wide array of panel data that have become available in recent years provide great potential for advanced causal modelling in this field. Event history modelling has been a dominant approach in sociology and demography. However, researchers are increasingly turning to other methods to unravel causal effects, such as fixed-effects modelling, the regression discontinuity approach, and statistical matching. We summarise selected studies, and discuss the advantages and the shortcomings of the different approaches. In an empirical section, we analyse the impact of the German 2007 policy reform on birth behaviour to illustrate the difficulties involved in isolating policy effects. The final chapter concludes by underscoring that even simple modelling strategies may be beneficial for improving our understanding of how policy effects shape demographic behaviour, and for laying the groundwork for more fine-grained causal investigations.


Introduction
While most Western governments may shy away from formulating any clear-cut pro-natalist policy goals, they nevertheless have a vested interest in understanding the effects that family policies can have on birth dynamics. Scholars have approached this topic from different perspectives. Thus, the modelling strategies difficulties researchers face when seeking to isolate policy effects in an empirical investigation. The final section summarises and evaluates the pros and cons of the different approaches. Such an overview can never be complete. It is limited to the two abovementioned broad research streams, and is restricted to studies that use the potential of longitudinal micro-level data to unravel causal effects, which may be panel surveys, retrospective surveys, or process-produced administrative data. 1 Furthermore, most of the literature we discuss below addresses behaviour in European countries since the 1990s.
2 Female employment and fertility

Contributions and limitations of event history modelling
Despite claims that the heyday of regression analysis has long since passed (Morgan/Winship 2015), event history modelling has remained a dominant method for studying fertility behaviour, at least in demographic and sociological research. Unlike in simple cross-sectional OLS regression, "time" is at the heart of these methods. Through its use of time-varying covariates, event history analysis lives up to one of the main principles of causal analysis, namely that the cause should always precede the effect (Bhrolcháin/Dyson 2007). Thus, a large body of literature emerged that examined how women's work (operationalised by a time-varying covariate distinguishing between different employment states) is related to first-order and higher-order birth risks (for a meta-analysis, see Alderotti et al. 2021; Matysiak/Vignoli 2008). Because the moment of childbirth is not equivalent to the decision to become a parent, these studies conventionally backdated the date of childbirth by several months. Important policy-relevant results were generated from this type of research, which illustrated that the links between female employment and fertility can vary greatly across policy contexts, socio-economic statuses, birth parities, cohorts, and time periods (Andersson et al. 2014; Matysiak/Vignoli 2008; Schröder/Brüderl 2008; Wood/Neels 2017). Micro-level investigations also opened up the opportunity to operationalise employment status in a more sophisticated manner. While most of the macro-level studies were confined to simple summary indicators, such as the female employment rate, micro-level studies could operationalise women's work with a fine-grained measure that differentiated between various employment categories, including educational participation, gainful employment, unemployment, and other activities.
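The data layout behind this type of analysis can be illustrated with a minimal sketch. It is not taken from any of the studies cited above: the function names, the three invented biographies, and the nine-month backdating interval are assumptions made purely for illustration (the studies themselves backdate by "several months"). Each woman's record is split into person-months, employment status is attached as a time-varying covariate, and the event is moved from the month of birth to the approximate month of conception.

```python
from collections import defaultdict

# Assumption for illustration: the event is shifted back by nine months.
BACKDATE_MONTHS = 9


def person_months(start, censor, birth, spells):
    """Yield (month, employment_state, event) rows for one woman.

    start, censor: first and last month of observation (integers).
    birth: month of the first birth, or None if no birth is observed.
    spells: list of (from_month, state) sorted by from_month; the first
            spell must begin at `start`, and a state holds until the next entry.
    """
    event_month = birth - BACKDATE_MONTHS if birth is not None else None
    end = censor if event_month is None else min(censor, event_month)
    for month in range(start, end + 1):
        # Time-varying covariate: the state in force in this month.
        state = [s for m, s in spells if m <= month][-1]
        yield month, state, int(month == event_month)


def occurrence_exposure(women):
    """Return {state: (events, person_months_at_risk)}."""
    exposure = defaultdict(int)
    events = defaultdict(int)
    for w in women:
        for _, state, event in person_months(**w):
            exposure[state] += 1
            events[state] += event
    return {s: (events[s], exposure[s]) for s in exposure}


# Three invented biographies (months counted from the start of observation).
women = [
    dict(start=0, censor=59, birth=40,
         spells=[(0, "education"), (24, "employed")]),
    dict(start=0, censor=59, birth=None,
         spells=[(0, "employed")]),
    dict(start=0, censor=59, birth=21,
         spells=[(0, "employed"), (10, "unemployed")]),
]

rates = occurrence_exposure(women)
for state, (n_events, n_months) in sorted(rates.items()):
    print(state, n_events, n_months)
```

The resulting occurrence and exposure counts per employment state are the raw material for a piecewise-constant hazard model. An actual analysis would add duration splines and further covariates, which this sketch omits.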
While this level of granularity enabled researchers to cast a more nuanced light on the micro-level correlations between female employment and fertility, it also raised broader questions about the meaning of women's employment status. This was particularly evident for the category of "housewife" that was included in some of the datasets. It was deemed obvious that housewives would experience elevated first birth rates, based on the assumption that their life course concept was closely intertwined with the idea of having a family. Although the time-varying modelling of women's employment status guaranteed a correct ordering of cause and effect, it was also evident that it did not entirely eliminate reverse causality. People plan their lives ahead and they base their current actions on their anticipated future behaviour (Hoem/Kreyenfeld 2006). An individual's current employment status is therefore more than just an indicator of his/her present circumstances. Instead, it is the product of choices that factor in anticipated future behaviour and thus the individual's "personal imaginaries" or "narratives of the future" (Bazzani et al. 2020; Vignoli et al. 2020a/b).
Another source of bias in event history modelling comes from the modelling of employment as a time-varying covariate. In such cases, endogenous selection may bias the coefficient for past employment status, if past employment affects the confounders of current employment (Elwert 2013; Elwert/Winship 2014; Sharkey/Elwert 2011). Multi-process models or simultaneous equation event history models have been proposed in an effort to address endogeneity (Lillard 1993). The advances in such methods have been closely connected to the development of the software package aML (Lillard/Panis 2000), which led to an upsurge in the use of these methods in family research, particularly in the early and mid-2000s (e.g. Aassve et al. 2006; Upchurch et al. 2002; Kravdal 2001; Kulu/Steele 2013; Schnor 2014). This modelling approach is based on the premise that partnership formation, marriage, educational attainment, and fertility behaviour are transitions that may influence each other as these processes unfold. Multi-process modelling tries to capture this endogeneity by letting the outcome of one process (such as the formation of a partnership) influence the other process (such as educational attainment), and by including heterogeneity components in each process. These heterogeneity components are usually assumed to follow a joint normal distribution. A methodological limitation of these models is that they hinge on this underlying distributional assumption. 2 While this assumption may be relaxed, employing alternative ones often leads to practical problems, such as the models failing to converge (see e.g. Kravdal 2001: 198). A more substantive limitation of these models is that the unobserved heterogeneity components remain a black box, with researchers being unable to pinpoint the exact mechanisms that lead to a particular behaviour.
Furthermore, even if employment and fertility can be modelled as mutually related processes, such efforts will not eliminate the bias that arises because people are acting on their anticipated behaviour in the future (the abovementioned "personal imaginaries").
Endogenous selection bias is conceptually different from the more widely known problem of common-cause confounding, whereby the bias comes from having failed to include a variable that causes both the variable of interest (employment) and the outcome variable (fertility) (Elwert 2013; Elwert/Winship 2014; Hernán et al. 2002). A typical example would be the omission of migration background, which may affect both fertility behaviour and employment. Standard regression analysis usually tries to combat this problem by including additional variables in the model. Thus, the main strategy for assessing causality in regression models is to eliminate "alternative effects" by controlling for confounders. This strategy has several implications. It hinges on the availability of variables, whereas some variables may simply not be available in a given dataset. In addition, the results may vary depending on the number and the type of variables that are included in the regression. This is also one of the main conclusions reached by Matysiak and Vignoli (2008) in their meta-analysis of 30 journal articles that used event history modelling to assess the effects of female employment on fertility. The authors concluded that the prior results were often unstable and heavily influenced by the inclusion of available covariates. While this is undoubtedly a valid conclusion to draw from this meta-analysis, it can leave researchers in some despair. Scholars may be torn between including further covariates and ending up with a grossly over-specified model, or choosing a parsimonious modelling strategy while running the risk of being criticised for a model that does not sufficiently account for all possible confounders.

Fixed-effects modelling for fertility research?
The self-selection into "treatment" is one of the main reasons why observational studies will always remain limited in their ambitions to make causal claims. While experimental designs, such as randomised controlled trials, are usually considered the gold standard for causal analysis, conducting them remains an elusive goal for most demographic and sociological researchers, as ethical reasons generally prohibit the use of such designs (Bhrolcháin/Dyson 2007). Nonetheless, most of the progress made in recent decades in the development of sociological research methods has been geared towards mimicking an experimental design approach (Gangl 2010). The terminology of a "control group" and a "treatment group" clearly signals such a commitment. The rise of the use of fixed-effects regression modelling has been informed by this broader aim (Brüderl/Ludwig 2015; Hill et al. 2020). Fixed-effects modelling is based on the idea that a person can be used as his/her "own control group" if longitudinal data are available (Allison 2009; Brüderl/Ludwig 2015). Thus, fixed-effects panel regressions make causal claims by exploiting the variations that exist within a person over time. A classical demographic example that demonstrates the power of this method is the investigation of the relationship between marriage and life satisfaction (Brüderl/Ludwig 2015). In contrast to a standard regression approach that compares the life satisfaction of married and unmarried people, this method follows a person throughout his/her life course, and assesses how the individual's life satisfaction changes when s/he experiences the event of marriage. In this context, marriage is considered the "treatment", while the "control group" is the individual before s/he experienced the event. Technically, this method eliminates all time-constant heterogeneity that commonly distorts standard regression analysis (Brüderl/Ludwig 2015).
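The logic of the within-person comparison can be made concrete with a small numerical sketch. The panel below is invented and free of noise: each of four hypothetical persons has a stable baseline level of life satisfaction, marriage raises it by exactly 0.5 points, and persons with a higher baseline are the ones who marry. The function names are ours and do not refer to any particular software package.

```python
from collections import defaultdict


def pooled_slope(x, y):
    """OLS slope of y on x, ignoring the panel structure."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def within_slope(ids, x, y):
    """Fixed-effects (within) slope: demean x and y per person, then OLS."""
    groups = defaultdict(list)
    for i, a, b in zip(ids, x, y):
        groups[i].append((a, b))
    xd, yd = [], []
    for rows in groups.values():
        mx = sum(a for a, _ in rows) / len(rows)
        my = sum(b for _, b in rows) / len(rows)
        for a, b in rows:
            xd.append(a - mx)  # deviation from the person's own mean
            yd.append(b - my)
    return sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)


# Invented panel: four persons, four waves each.
# satisfaction = person-specific baseline + 0.5 * married
ids = ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + ["D"] * 4
married = [0, 0, 1, 1,  0, 0, 0, 0,  0, 1, 1, 1,  0, 0, 0, 0]
satisfaction = [7, 7, 7.5, 7.5,  5, 5, 5, 5,  8, 8.5, 8.5, 8.5,  4, 4, 4, 4]

print("pooled:", pooled_slope(married, satisfaction))
print("within:", within_slope(ids, married, satisfaction))
```

In this constructed example, the pooled comparison of married and unmarried person-years yields a coefficient well above 2, because it mixes the marriage effect with the baseline differences between persons. The within estimator recovers the 0.5 that was built into the data. Persons B and D, who never marry, contribute nothing to the within estimate, which is the sense in which each person serves as his/her own control.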
Many studies have employed fixed-effects panel regression to study different demographic and sociological themes, such as the demographic determinants of life satisfaction and other dimensions of well-being. Moreover, childbirth has been used regularly as an independent variable in these models (e.g. Myrskylä/Margolis 2014). It is noticeable, however, that only a few studies have employed this method to examine childbearing behaviour as an outcome variable. In the few cases in which the method was employed for fertility research, the results were mainly published in economic journals (Boca et al. 2005; Huttunen/Kellokumpu 2016; Michaud/Tatsiramos 2011). 3 Have demographers and sociologists been too slow to catch up with this methodological innovation? Or are there sound reasons why this method has not yet been widely adopted in fertility research?
A possible answer to these questions is that the study of fertility from a parity-specific point of view is almost an imperative in demographic and sociological research. The decision to have a first child is regarded as a distinct choice that is determined by factors that are different to those that result in higher-order births. If we subscribe to this idea, we need to model the transition to first parenthood as a unique event. Fixed-effects panel regression runs into a problem in such cases, as it is based on the premise that the event (i.e. the birth) should be observed before and after a treatment (e.g. a change of employment status). If an event is observed only once in a lifetime, there is a fundamental flaw in the logic as there is no before and after. If the event of the first birth occurred before the "treatment", it cannot be repeated afterwards.
To get around this problem, some researchers have treated the number of children as a continuous variable (see e.g. Boca et al. 2005; Huttunen/Kellokumpu 2016; Michaud/Tatsiramos 2011). Obviously, this approach conflates different birth parities. But there are other concerns as well. Births are infrequent events in contemporary European societies, and people rarely have more than two or three children. As such, there is little variation in the outcome variable, and some of the variation comes from the more "unusual" cases of people with larger families. Moreover, employment and fertility are events that strongly influence each other. The birth of a first child often results in drastic changes in a woman's work pattern. A large share of women do not work shortly after giving birth, but re-enter the labour market at some later point in time. In other instances, women may space their first and second children closely together to minimise their family-related employment interruptions. The complex interplay that exists between fertility and women's employment gets completely lost in these models, such that it ultimately becomes difficult to grasp the meaning of the underlying behavioural patterns. Thus, the use of fixed-effects panel regression is not a panacea for causal analysis (Collischon/Eberl 2020; Hill et al. 2020) and it is clearly not the best way to move forward in the analysis of fertility behaviour. 4

Fertility intentions and panel data
While fixed-effects modelling may not be the preferred method for examining fertility behaviour, it can be applied more effectively to the analysis of fertility preferences. The opportunities to analyse fertility preferences with European panel data accelerated in the early 2000s, when data became available from sources such as the Generations and Gender Survey, the British Understanding Society study, and the German Family Panel, which included recurrent items on the respondents' fertility desires, ideals, and intentions. 5 Kuhnt, Kreyenfeld, and Trappe (2017) used data from the German Family Panel (pairfam) to study changes in the ideal number of children across the life course by means of fixed-effects modelling. The authors did not find any association between women's employment status and their ideal number of children. The only variable that was shown to have a substantial effect was "having (further) children", which suggests that the respondents were adjusting their ideals to align them with their behaviour. Kuhnt, Minkus, and Buhr (2020) employed the same data, but used the intention to have a child (within the next two years) as an outcome variable in a multinomial fixed-effects regression model (with the categories "certainly yes", "uncertain", and "certainly no"). While the authors did not find that a woman's employment status affected her fertility intentions, the model showed that men's worries about finding a suitable job increased the likelihood of being uncertain about the plans for having a child within the next two years. A non-trivial problem that tends to arise in this type of investigation is that some respondents may realise their fertility intentions and have a child during the study period. As such, there is selective drop out from the study population. 6 Other studies have focused more specifically on the link between intentions and fertility behaviour.
Berrington and Pattaro (2014) used data from the UK data "Understanding Society" to compare women's fertility intentions at age 23 with their actual number of children at age 46. This study found that highly educated women were less likely to realise their intentions. However, the results also showed that a woman's employment status at age 23 did not seem to be related to her chances of realising her early fertility intentions. Other studies examined how a woman's fertility intentions as measured in a given year determined the probability of the woman realising those intentions by the next year or the next round of data collection (e.g. Hanappi et al. 2017; Kuhnt/Trappe 2016; Riederer et al. 2019; Spéder/Kapitány 2009).
These studies have greatly advanced our understanding of the obstacles women face in realising their fertility plans. Ultimately, however, these models do not solve the problem that a woman may self-select into a certain employment status across her life course.

4 Schröder and Brüderl (2008) used an event history model to study first birth rates. Instead of controlling for employment status, they controlled for whether a person had changed his/her employment status (i.e. moved from non-employment to employment). In doing so, the authors used standard event history analysis, but also exploited the within-variation.
5 The collection of large-scale longitudinal data that included items that measured fertility desires, intentions, and ideals was initiated in the US as far back as the 1970s. Much of the work on fertility intentions was motivated by studies based on US data (see e.g. Hayford 2009; Thomson 1997).
6 A drawback of these investigations is that all birth parities were pooled. Thus, the fertility intentions of a particular woman who is observed across time may refer to different birth parities. Furthermore, problems of selectivity may arise, because women who did not state a preference, or who were pregnant at the time of the interview, are excluded from the sample. In a similar vein, the realisation of an intention (childbirth) is not adequately accounted for in this model.

Natural experiments
Our interest in understanding the causal effects of female employment on fertility stems from the fact that in economic and demographic theories of fertility, women's labour market participation has been seen as one of the prime factors contributing to declining or low fertility rates. Refuting the hypothesis that women's employment leads to lower fertility not only challenges the predictions from conventional theories; indeed, in the narrow economic sense, it means that the "income effect" has ultimately trumped the "substitution effect". More broadly, it implies that women's role in society, the meaning of female work, and the economic foundations of the family have all shifted. As these are strong implications, it is imperative that we rule out the possibility that selection into employment has produced the observed outcomes.
A standard approach used in economics to establish causality is to search for natural experiments; i.e. events that affect the variable of interest (female employment, non-employment, or unemployment), but that are beyond the control of the individual. Examples of such natural experiments are large-scale lay-offs, bankruptcies, and firm closures. These events have been used previously in labour market research to examine how unemployment, operationalised via these exogenous unemployment shocks, alters people's behaviour and well-being. However, this approach has seldom been used in fertility research (see, however, Del Bono et al. 2015; Hofmann et al. 2017; Huttunen/Kellokumpu 2016).
With regard to the small number of existing studies that adopted this approach, most were conducted by labour economists who employed matching techniques. The underlying idea of this approach is that true causal analysis requires a comparison of a control group and a treatment group. Standard observational studies suffer from the problem that the comparison group may be selective, and is thus not comparable with the treatment group. To ensure comparability, matching techniques start with the sample of "treated" individuals and search the pool of the control group for those units that best correspond to the treatment group. This "search procedure" has been refined in recent years by employing sophisticated algorithms, most recently from machine learning (Lee et al. 2010). However, despite recent refinements in these approaches, researchers using matching techniques struggle with problems similar to those faced by researchers using conventional regression models. To ensure a good match, they need a sufficient number of covariates. If these covariates are not included, the matching techniques will result in poor matches and, by extension, in biased results.
In addition, matching techniques are not always easy to apply to demographic processes. Hofmann and colleagues (2017) examined, for example, the causal effects of unemployment on first birth risks. The paper compared the subsequent birth risks of individuals who lost their jobs after having been laid off by a firm with those of a control group of women who were not subject to a firm closing. To make the control and the treatment groups comparable, firm-level information was required. As a result, the sample was limited to respondents who had been employed with the same firm for at least 1.5 years. Women who had been working at firms with more than 2,000 employees or at very small firms had to be excluded as well, because it would have been difficult to find suitable matches for these women. In the end, after the restrictions were imposed, the original sample of 101,910 individuals shrank to 3,286 treated and 17,836 untreated women. Although the sample restrictions were clearly spelled out and clear justification given for them, the results from such an analysis are inevitably limited to a restricted subpopulation. In this case, one of the restrictions required by the sample was that the women were employed with the same firm for at least 1.5 years. This limited the results to women who were firmly established in the labour market. Thus, while this model may have generated carefully identified causal effects, the external validity of the results was limited.
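The "search procedure" behind matching can be sketched in a few lines. The example below is a deliberately simplified illustration and not the procedure used by Hofmann and colleagues (2017): it performs one-to-one nearest-neighbour matching with replacement on an unstandardised squared Euclidean distance, and all records, variable names, and values are invented.

```python
def nearest_neighbour_att(treated, controls, covariates, outcome):
    """Average outcome difference between each treated unit and its
    closest control (matching with replacement, squared Euclidean distance)."""

    def distance(a, b):
        return sum((a[c] - b[c]) ** 2 for c in covariates)

    diffs = []
    for t in treated:
        match = min(controls, key=lambda c: distance(t, c))
        diffs.append(t[outcome] - match[outcome])
    return sum(diffs) / len(diffs)


# Invented records: women affected by a firm closure (treated) and women
# who were not (controls); `birth` marks a first birth in the follow-up period.
treated = [
    {"age": 28, "tenure": 2.0, "birth": 1},
    {"age": 34, "tenure": 6.0, "birth": 0},
]
controls = [
    {"age": 27, "tenure": 2.5, "birth": 0},
    {"age": 35, "tenure": 5.5, "birth": 0},
    {"age": 45, "tenure": 20.0, "birth": 0},
    {"age": 22, "tenure": 1.5, "birth": 1},
]

att = nearest_neighbour_att(treated, controls, ["age", "tenure"], "birth")
print(att)
```

The sketch also makes the dependence on covariates visible: the match is only as good as the variables passed in `covariates`, and treated units without a reasonably close control would in practice have to be dropped, which is the source of the sample restrictions discussed above.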

Narratives of the future
A new impetus for understanding the causal link of the "female employment and fertility nexus" comes from studies that have tried to integrate future uncertainties into models of rational decision-making. These models start from the presumption that many decisions, including the decision to have children, are long-term commitments. It is therefore argued that actors need to take into account the uncertainties regarding the future implications of their choices. Some future states of the world may be extrapolated from the "shadow of the past" (i.e. it may be assumed that a woman's husband will continue to have a high income if he is currently earning well). Other states of the world are more difficult to predict, and actors may "play" with imaginaries to deal with these uncertainties (Beckert/Bronk 2018).
"Narratives" are standard concepts in psychology and sociology, and have recently also been employed in economics (Bénabou et al. 2018; Shiller 2020) and demographic research (Vignoli et al. 2020a). It has been argued that while conventional fertility models have primarily dealt with the effects of present states, fertility choices are governed by narratives of the future, which stem from the "human capacity to place oneself in an imagined situation that cannot be deduced from present conditions" (Vignoli et al. 2020b). In other words, the decision to have a child depends on how a woman visualises the future; for example, whether she can imagine being a "working mother", or whether she can only imagine being a "homemaker". This area of research is still evolving, and among the major challenges scholars have faced is that of generating suitable item sets that can capture these "narratives of the future" (Vignoli et al. 2020b) or "images of the future self" (Bachrach/Morgan 2016: 466). What is interesting about this debate is that it shifts attention away from developing sophisticated econometric techniques for studying the causal effect of employment on fertility, and instead focuses on finding appropriate item sets to measure future "models" or "narratives" of the world. It has also been proposed that more qualitative research should be conducted, including longitudinal qualitative research that helps to elucidate how these "narratives" are formed and altered across time (Bernardi 2021; Vignoli et al. 2020b).
The concept of "future narratives" resonates well with the considerations developed by Friedman, Hechter, and Kanazawa (1994). They provided a rationale for why some women may decide to have children in a seemingly unstable and insecure economic situation, and why this decision may still be regarded as rational. On the one hand, childbirth tends to block biographical "alternatives". Thus, having children while unemployed may limit a woman's chances of swiftly returning to the labour market and finding a job with good career prospects. On the other hand, this limitation may be perceived as a relief and a source of "uncertainty reduction", as it structures a woman's otherwise uncertain life course. 7 For women who foresee that they will not be able to succeed in the labour market, taking on the role of a homemaker may be regarded as a meaningful and socially accepted "biographical alternative" to an uncertain labour market trajectory. Clearly, the societal context matters here because it defines the available "narratives" or "biographical alternatives". For example, policies may encourage or discourage a certain earner model; or the normative fabric of a society may define gender roles, attitudes towards non-parental care, and, more generally, attitudes towards parents' roles as earners and carers.

3 Policy change and fertility

Isolating policy effects
In response to declining and low birth rates in European countries, policy-makers have become increasingly interested in the question of whether policies could halt the downward trend in period fertility and reduce childlessness, particularly among highly educated and work-oriented women. In this debate, it was argued that the expansion of publicly financed child care and the implementation of earnings-related parental leave benefits are key measures that can simultaneously support higher levels of gender equality and fertility. Against this background, the European Commission issued recommendations to expand public day care and to implement suitable parental leave regulations (Annesley 2007). Although many national governments followed their own agenda and did not necessarily comply with the EU recommendations, most European governments have scaled up their public day care services and implemented parental leave regulations since the 1990s (Daly/Ferragina 2018). Child care services have been gradually expanded, and the speed of this expansion has depended in part on the capacities and conditions at the local municipality level. Researchers have thus been able to use the regional variations in these child care expansion efforts to isolate the possible policy effects (e.g. Bauernschuster et al. 2016; Hank/Kreyenfeld 2003; Krapf 2014; Kravdal 1996). 8 However, as parental leave benefit regulations were generally implemented at the national level, scholars had to use other strategies to examine the impact of these measures on birth behaviour. Much of the early demographic research into the causal impact of policy changes on fertility behaviour has dealt with developing economies, and thus with high-fertility settings. For these regions, randomised controlled trials have been regularly employed to evaluate the effectiveness of family planning programmes (Mwaikambo et al. 2011).
For advanced economies, these methods are applied less commonly in the area of fertility behaviour. There have, however, been investigations of how sex education (and other "moderate" interventions) has affected teenage pregnancy rates (for an overview, see Bennett/Assefi 2005). Ethical issues, including concerns about equal access to public services, often prevent the implementation of randomised controlled trials to evaluate the impact of more coercive policies on fertility behaviour (Bhrolcháin/Dyson 2007).
However, the programme evaluation literature offers a battery of techniques for isolating policy effects with observational data (LaLonde 1986). A standard method that is often used is the difference-in-difference (DiD) approach, which is commonly combined with the abovementioned matching techniques. This approach suffers from the same drawbacks as fixed-effects modelling (see Section 2). DiD requires an observation before and after the intervention to isolate the causal effect of a policy. As the first childbirth is an absorbing event, DiD cannot be employed to study first birth processes. Researchers could combine all birth orders so that they would be able to observe events before and after treatment (a policy reform). However, this would violate a basic understanding of demographic research: namely that first and higher-order births are distinct processes.
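For reference, the basic two-group, two-period form of the DiD estimator can be written as follows. The notation is ours and is introduced only for illustration: $\bar{Y}$ denotes the mean outcome, $T$ and $C$ the treatment and comparison groups, and "pre" and "post" the periods before and after the reform.

```latex
\hat{\delta}_{\mathrm{DiD}}
  = \bigl(\bar{Y}_{T,\mathrm{post}} - \bar{Y}_{T,\mathrm{pre}}\bigr)
  - \bigl(\bar{Y}_{C,\mathrm{post}} - \bar{Y}_{C,\mathrm{pre}}\bigr)
```

The first bracket requires that the outcome of the treated units can be measured both before and after the intervention, which is exactly what an absorbing event such as the first birth rules out. The estimate has a causal interpretation only under the usual assumption that both groups would have followed parallel trends in the absence of the reform.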
In light of these challenges, researchers have increasingly turned to the regression discontinuity approach to examine how policy reforms impact behaviour. The regression discontinuity approach is based on the idea that causal inference can be generated by comparing the behaviour of individuals immediately before and after a cut-off date of a policy reform. This method has, for example, been used to examine how changes in parental leave regulations have affected labour supply (Ginja et al. 2020), living arrangements (Cygan-Rehm et al. 2018), as well as childbirth (e.g. Cygan-Rehm 2016; Farré/González 2019; Tamm 2013). While these models are central to the field of policy evaluation, they have high data demands. In order to generate significant results, large sample sizes that include reasonable numbers of events that are observed immediately before and after a policy reform are needed. While this need for large amounts of data is a general concern when using these types of models, it is a particularly pressing problem in fertility research as births are rare events. Therefore, most survey data will not provide a suitable basis for applying the regression discontinuity approach to birth behaviour. Consequently, this method can mainly be used in fertility research if large-scale register data are available.

8 The assumption here is that regional and temporal differences in child care slots depend on the capacities and willingness of local actors to increase child care services. If municipalities offer child care services based on expected demand, regional changes in provision rates cannot be used as a suitable exogenous policy measure, as they already reflect the usage of care services.
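The comparison around a cut-off date, and the reason why it is so demanding in terms of data, can be illustrated with a minimal sketch. It replaces the local regressions used in applied work with a plain difference in means inside a symmetric window, and the cut-off date, the records, and the outcome coding are invented for illustration.

```python
from datetime import date


def rd_difference(records, cutoff, bandwidth_days):
    """Difference in mean outcomes just after versus just before a cut-off.

    records: iterable of (eligibility_date, outcome) pairs.
    Returns (difference, n_before, n_after).
    """
    before, after = [], []
    for when, outcome in records:
        gap = (when - cutoff).days
        if -bandwidth_days <= gap < 0:
            before.append(outcome)
        elif 0 <= gap < bandwidth_days:
            after.append(outcome)
    if not before or not after:
        raise ValueError("no observations on one side of the cut-off")
    diff = sum(after) / len(after) - sum(before) / len(before)
    return diff, len(before), len(after)


# Invented records: date that determines eligibility under the new rules,
# and whether a further birth was observed in the follow-up period.
records = [
    (date(2006, 12, 5), 0),
    (date(2006, 12, 20), 1),
    (date(2006, 12, 30), 0),
    (date(2007, 1, 2), 1),
    (date(2007, 1, 15), 1),
    (date(2007, 1, 28), 0),
    (date(2006, 6, 1), 1),  # falls outside the window and is ignored
]

diff, n_before, n_after = rd_difference(records, date(2007, 1, 1), 31)
print(diff, n_before, n_after)
```

With a narrow window, only a handful of cases remain on either side of the cut-off, so the estimate rests on very few events. Widening the window increases the number of cases but weakens the claim that the two groups differ only in their exposure to the reform.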

Criteria and considerations for the analysis of policy effects
In order to study policy effects, suitable counterfactuals are needed. The Nordic countries have been ideal for the counterfactual approach. These countries are at the vanguard of progressive parental leave regulations, having already overhauled their family policies in the 1970s and 1980s. Furthermore, as these countries have large-scale register data suitable for individual-level fertility analysis, studies conducted in these countries were among the first to generate solid evidence of how parental leave policies could alter fertility patterns (Andersson 1999; Andersson et al. 2006; Björklund 2006; Neyer/Andersson 2008). A stylised finding from this body of research is that the introduction of the so-called "speed-premium" 9 in the Swedish parental leave system led to an increase in second birth rates, and a shortening of birth intervals. This evidence was generated based on simple parity-specific birth rates (i.e. from event history models) for Sweden that were compared with patterns in other Nordic countries. In the Nordic context, where there are only minor differences in the cultural and economic developments between countries, it seems straightforward to analyse Swedish fertility and to use neighbouring Norway or Denmark as counterfactuals. Other European countries often lack such clear-cut comparison groups, because the data are often less comparable across countries; and because countries differ on multiple dimensions, including in their cultural and economic conditions, and in their fertility patterns prior to a reform.
9 Parental leave benefits in Sweden cover roughly 80 percent of prior income. If the first and second births are spaced closely apart, women will receive the same parental leave benefit for the second child as for the first one, even if they had reduced their employment and earnings between the two births. The same applies to higher-order births (Andersson et al. 2006).

In addition to the challenge of guaranteeing similar prior conditions, Bhrolcháin and Dyson (2007) have listed further criteria for performing meaningful causal analysis of demographic outcomes. In particular, they have emphasised the importance of contiguity: i.e. that the effect should follow shortly after the cause (Bhrolcháin/Dyson 2007: 8). Moreover, the regression discontinuity approach (see above) capitalises on the idea that a policy reform can be regarded as a sudden external shock that leads to an immediate reaction. Sharp cut-off dates for eligibility for certain benefits are very important tools in understanding how policies operate. However, not all policy reforms are implemented in a clear-cut fashion. Furthermore, there may be a mismatch between the de jure regulations and the de facto implementation of a policy. Some groups may be eligible for a service or payment, but may not be able to take advantage of it, either because they are simply not aware of their eligibility or because they are unable to master the bureaucratic hurdles involved in applying for a state benefit. Moreover, in family law, there is often a "margin of appreciation" of how legal texts are interpreted and enforced in practice. This is most evident in the regulation of child custody or alimony, in which judges have a certain leeway in their rulings. There is only a small margin of appreciation in the case of parental leave regulations, as eligibility is clearly defined in law. However, for some families, bureaucratic hurdles may have indeed been a barrier to taking advantage of these programmes.
The abovementioned parental leave benefit reform requires parents to document their earnings of the last 12 months. The German Family Ministry has only recently intensified efforts to make the procedure easier (by retrieving the earnings information automatically from the pension registers). Returning to the abovementioned principle of contiguity, we observe that this principle seems to conflict at least somewhat with the ways in which some policies play out in practice. Thus, there may not be an immediate "reform effect" because it can take time for the knowledge of how to apply for certain benefits to trickle down to all layers of society. In addition, governments may need time not only to optimise their systems and thus remove bureaucratic hurdles, but also to make the application procedure easier so that access is guaranteed even for hard-to-reach populations, such as those who do not speak the native language.
Also relevant in this context is the policy process. From the ministerial draft, to the discussion in parliament, to the enactment of the law, and to its ultimate implementation, this process can easily stretch across several months, or even years. Depending on the magnitude and relevance of the reform, this debate may receive media coverage. As a result, even if we are able to pinpoint the exact date when a policy reform came into force, the reform may have entered public awareness much earlier. For example, Germany drastically curbed ex-spousal maintenance after divorce in 2008 (Radenacker 2020). As this policy process was covered in the media, it is likely that women who foresaw that they would get divorced had altered their employment behaviour well in advance of the actual date of the implementation of that reform. The parental leave benefit reform, which will be analysed below, came into force on 1 January 2007. The reform bill was passed on 20 June 2006. 10 It is possible that some couples foresaw that this policy would come into effect, and acted accordingly. Thus, births in January 2007 may have been planned in anticipation of the implementation of the reform. Therefore, an immediate reaction to the reform may have been observed. However, given that births are not that easy to time for most couples, a more realistic scenario is that there was a certain time lag between the implementation of this reform and its effects on fertility.
Policies are rarely formulated in a vacuum, and how they operate also depends on the context. In some cases, a policy may not succeed because people do not use it. Levels of policy usage may vary for different reasons. For example, eligible individuals may not consider applying for a transfer for fear of being stigmatised if they collected the benefit. Similarly, parents may not take advantage of an available child care slot because societal norms sanction the use of day care at young ages. Neyer and Andersson (2008: 702) argued that certain policies may be deemed to have failed as a result of a lack of coherence and a mismatch with the broader societal system. For example, the German parental leave benefit reform was basically copied from the Swedish system. Unlike Sweden, which had already abolished joint taxation in the 1970s, Germany retained its system of joint taxation. The German system continues to provide tax relief if the "second earner" (usually the woman) works less. Thus, the system does not provide a clear commitment to the "dual earner model", but instead provides strong incentives to opt for the single earner model. Kalwij (2010: 704) concluded that researchers should never analyse policies in isolation, nor should they "simply sum up the various family policies". Instead, researchers should consider the interplay of the different measures and how they align and resonate with the overall logic of the system.
Neyer and Andersson (2008) introduced the concept of critical junctures into this debate. This terminology was originally used in the political economy literature to depict turning points in institutional developments. Critical junctures are policy reforms that have a lasting effect on the entire welfare state framework because they determine its future pathway and can "lock" the system into moving in a certain direction (Pierson 2000; Thelen 1999). It also follows that the constellation of different policies generates a unique and separate dimension. In addition to pointing to this idiosyncratic nature of policies, the authors suggested analysing possible interaction effects between the existing policy measures and the newly introduced ones. For example, it is conceivable that parental leave policies would only be effective if they were combined with an increase in the provision of child care. While child care policies could be integrated by exploiting regional variations, other policies are more difficult to integrate (such as the effects that come from the tax and transfer system). The use of large-scale cross-national data, combined with information on the features of the different welfare states, could provide some important insights into how policies operate in different contexts (Wesolowski et al. 2020).

How did the German parental leave benefit reform affect first birth behaviour?

The 2007 parental leave benefit reform
Below, we use the German parental leave benefit reform of 2007 to illustrate the difficulties involved in isolating policy effects. The German parental leave benefit reform came into force on 1 January 2007. The reform, which was copied from the parental leave benefit programmes of the Swedish system, was regarded in the academic community as marking a major departure from the previous system in Germany (Fleckenstein 2011; Spieß/Wrohlich 2008). The prior regulations provided a flat-rate benefit of 300 euros per month for the duration of two years, while the new regulations provided a shorter term of parental leave of only 12 months. Moreover, the benefit was now earnings-related, with the net earnings in the year prior to childbirth serving as the basis for the calculation of the benefit. 11 Although the term during which benefits were received was shortened, the new regulations generally led to an increase in monthly payments, one exception being the unemployed and persons who were not integrated into the labour market. The benefits of these individuals were often lower under the new system. Some scholars characterised the reform as a major and radical shift and a clear departure from the logic of the previous system, which was often classified as a conservative and familialistic welfare state regime (Fleckenstein 2011). In addition, the new regulations provided strong incentives for women to become established in the labour market before having children. Thus, a clear-cut hypothesis follows from these observations: namely, that the association between women's employment status and earnings, on the one hand, and first birth rates, on the other, should have changed after the reform, with women becoming more likely to postpone childbearing until they were firmly established in the labour market. This hypothesis will be tested below based on large-scale register data.

Data and variables
Data for this investigation come from the German Pension Registers. We use the "VSKT 2015", which is a sample drawn from the registers and which includes persons with German citizenship who had an active pension account in 2015, i.e. persons who were not yet retired (Stegmann 2018). The main benefits of using this dataset are its large sample size and its detailed and reliable employment, earnings, and fertility biographies (for a validation of the fertility histories, see Kreyenfeld/Mika 2006). There are, however, several disadvantages to using this dataset. First, the pension data cover only about 90 percent of the resident population in Germany, as opposed to the total population. People in certain professions (farmers, civil servants) are not included in the registers. An additional drawback is that it contains only a few variables that can be employed for the investigation. For example, partner information or information at the household level is not available in the dataset. Thus, the analyses must rely on a "sparse model" with a limited number of covariates. Another drawback is that the data only include individuals born between 1948 and 1985. This means that coverage of younger ages for the very recent time period is poor.

11 There is an earnings cap for the calculation of the benefit (currently 1,800 euros).
The data have been converted into a person-month dataset, and have been restricted to women aged between 20 and 40 in the years 2005 to 2010. We do not analyse earlier years because unemployment is one of our main covariates, and because measures of unemployment are not fully comparable for longer time periods due to a major policy reform of the unemployment benefit scheme that came into force in 2005 (the so-called Hartz IV reforms). We do not analyse later years because coverage of very young women for the period after 2010 is poor. Sensitivity analyses were conducted that cover longer time periods for older women (see Fig. A2 in the appendix).
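The restriction described above can be illustrated with a minimal sketch that expands one person into monthly records. The function name `person_months`, the record layout, and the treatment of age 40 as still inside the window are assumptions made for this illustration; they are not taken from the register documentation.

```python
def person_months(person_id, birth_year, birth_month,
                  start=(2005, 1), end=(2010, 12), min_age=20, max_age=40):
    """Return one record per calendar month in which the person is aged min_age..max_age."""
    rows = []
    year, month = start
    while (year, month) <= end:
        # Age in completed years; the birthday month counts as already reached.
        age = year - birth_year - (1 if month < birth_month else 0)
        if min_age <= age <= max_age:
            rows.append({"id": person_id, "year": year, "month": month, "age": age})
        # Advance to the next calendar month.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return rows

# Hypothetical woman born in June 1985: she enters the window when she turns 20.
rows = person_months("w1", birth_year=1985, birth_month=6)
print(len(rows), rows[0], rows[-1])
```

In an actual analysis, each monthly record would additionally carry the time-varying covariates and an indicator for whether the event occurred in that month.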
The main outcome variable is the first childbirth, which was backdated by nine months. Women who had a child before the observation window are not part of this investigation. Cases are censored at "last clearance", which is the date when the German pension fund contacts a person to verify the information provided in the registers. The total number of women in the analytical sample is 48,843, which corresponds to 2,833,078 person-months. These women gave birth to 13,913 first children in the observation window (see Table A1 in the appendix for the sample statistics).
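As a rough orientation, the sample totals reported above imply a crude first birth rate of about 4.9 per 1,000 person-months. The sketch below shows this arithmetic together with the backdating of a birth month; the helper names `backdate` and `crude_rate_per_1000` are hypothetical and serve only to illustrate the calculation.

```python
def backdate(year, month, months=9):
    """Shift a calendar month back by a number of months (nine by default)."""
    index = year * 12 + (month - 1) - months
    return index // 12, index % 12 + 1

def crude_rate_per_1000(events, person_months):
    """Events per 1,000 person-months of exposure."""
    return 1000 * events / person_months

# A birth recorded in March 2007 is assigned to June 2006.
conception_month = backdate(2007, 3)

# Sample totals reported in the text: 13,913 first births in 2,833,078 person-months.
rate = crude_rate_per_1000(13_913, 2_833_078)
print(conception_month, round(rate, 2))
```

Note that a birth in October 2007 is backdated to January 2007, so conceptions under the new regime correspond to births from autumn 2007 onwards.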
The main covariate is the employment status. Employment status is available on a monthly basis in the registers. We distinguish between women who were in (1) schooling, (2) employment, (3) unemployment, and (4) others. Schooling includes any educational episode, including school attendance, vocational training, and university education. Not all educational episodes are equally relevant for pension payments. For this reason, university education (as well as further education) is not fully recorded in the registers. 12 Thus, some educational episodes cannot be identified, and they will appear in the "other" category (see below). Employment includes "regular" employment. It does not include any form of informal or marginal employment. 13 Unemployment includes periods of registered unemployment (including ALG II). The "other" category is heterogeneous. It comprises some episodes of university education, marginal employment, or other episodes in which a person was out of the labour market for other reasons.

12 Vocational training generates "pension points", while university education is not immediately "pension-relevant". Although an individual does not acquire any "pension points" while in university education, these periods count as "pension-relevant periods". An individual needs a certain minimum of pension-relevant periods before s/he can claim a pension. The maximum duration of time that can be claimed was reduced to 96 months in 1992.

13 Marginal employment is partially or fully exempt from social security contributions and income tax. Since 1999, the marginally employed are required to pay contributions to the public pension fund, but they are still exempt from other social security payments (such as contributions to the unemployment insurance system).
In order to capture the effect of the policy reform, we control for the time period. Even with large register data, the sample sizes are too small to conduct the analysis by single years, ages, and employment status and earnings. For that reason, we have grouped the calendar years into the following three broad categories: the years 2005-2006, 2007-2008, and 2009-2010. Our aim is not to measure precisely the effects of the reform; instead, we seek to map the time trends around the reform. We will return to this limitation later on.
We control for major socio-demographic characteristics. We include a binary variable that indicates whether a person was living in eastern or western Germany. This variable is a time-constant variable and denotes whether the person was living in eastern Germany (including East Berlin) or western Germany at the time at which they were last contacted by the German pension fund. Age is the baseline hazard in the event history model. It is included as a categorical variable in the model that distinguishes between ages 20-24, 25-29, 30-34, and 35-40. These cut-points have been chosen arbitrarily, but the results remain largely unchanged if more fine-grained ones are used. Moreover, other model specifications that do not make any parametric assumption regarding the baseline hazard, like the Cox model, lead to similar results.
A considerable disadvantage of this dataset is the lack of variables that could be used as control variables. While the data include information on level of education, this information is incomplete since it is provided by the employer on a voluntary basis only. However, a significant advantage of the register data is that they provide very detailed monthly earnings histories. Earnings information is stored in terms of pension points, with one pension point constituting the average earnings in a given year. Based on this information, we have generated a variable that gives information on each individual's monthly earnings, distinguishing between low earnings (up to 50 percent of average earnings), medium earnings (50 percent to less than 100 percent of average earnings), and high earnings (100 percent or more of average earnings). This variable is a time-varying covariate that changes its value as people progress through time. As earnings generally increase with age, this variable is strongly correlated with age. We control for age, but we also provide analyses by age group.
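The earnings classification can be sketched as follows. The conversion from monthly pension points to a share of average earnings (multiplying by 12) and the assignment of exactly 50 percent to the low category are assumptions made for this illustration; the function names are hypothetical.

```python
def earnings_category(relative_earnings):
    """Classify earnings expressed as a share of average earnings (1.0 = average)."""
    if relative_earnings < 0:
        raise ValueError("relative earnings cannot be negative")
    if relative_earnings <= 0.5:  # assumption: exactly 50 percent counts as low
        return "low"
    if relative_earnings < 1.0:
        return "medium"
    return "high"

def monthly_points_to_share(points_in_month):
    """Assumed conversion: one point per year corresponds to average annual earnings."""
    return points_in_month * 12

# Three hypothetical person-months with different pension points.
categories = [earnings_category(monthly_points_to_share(p)) for p in (0.03, 0.0625, 0.1)]
print(categories)
```

Because the category is recomputed for every person-month, the same woman can move between categories over time, which is what makes the covariate time-varying.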

Modelling strategies
The empirical investigation we conduct below relies on conventional event history modelling. In order to specify the baseline hazard, we use a piecewise constant model. Compared to the more widely used Cox model, there are several benefits to using the piecewise constant model. Most importantly, it generates estimates for the baseline hazard, which means that it provides straightforward measures for the effect of the main process time (age). 14 We start with a simple model that controls for the main socio-demographic covariates (age, year, employment, earnings, region). In the next step, we examine the effect of the policy reform by including an interaction term of the calendar year and the main covariate of interest (women's employment and earnings). The policy reform provides strong incentives to postpone childbearing until an individual's earnings are reasonably high. Thus, we assume that the association between women's employment and earnings and their first birth rates differs before and after the reform, i.e. it is assumed that stable employment and high earnings have become a prerequisite for family formation in recent years.

Table 1 displays the results from the event history models, with employment (Model 1) and earnings (Model 2) as the major covariates, and first birth (or rather first pregnancy) as the outcome variable. All results are provided as relative risks. The models show that birth risks were lower at younger (20-24) and older ages (35-40), but were fairly similar at ages 25-29 and 30-34. The models also indicate that the first birth rates were higher in eastern than in western Germany, reflecting the earlier family formation tendencies among eastern German women. Turning our attention to the effects of calendar time, the model suggests that the first birth rates had increased somewhat since 2005-2006. Model 1 shows that educational participation lowered the first birth rates by about 65 percent, which is in line with earlier evidence for Germany (e.g. Andersson et al. 2014; Blossfeld/Huinink 1991; Schmitt 2012). Unemployment (versus employment) does not seem to influence birth behaviour. It should, however, be noted that this effect varied by region, with eastern German women being more likely than their western German counterparts to postpone parenthood during unemployment (see Table A1 in the appendix).
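To illustrate how relative risks arise in a piecewise constant specification, consider the simplest case with age group as the only covariate. In that case, the maximum likelihood estimate of the rate in each age group is the number of events divided by the exposure time, and the relative risk is the ratio of that rate to the rate in the reference group. The counts below are invented and the function name `relative_risks` is hypothetical; the models reported in Table 1 include further covariates and are not reproduced by this sketch.

```python
def relative_risks(events, exposure, reference):
    """Occurrence-exposure rates per group, expressed relative to a reference group."""
    rates = {group: events[group] / exposure[group] for group in events}
    return {group: rates[group] / rates[reference] for group in rates}

# Invented first births and person-months of exposure by age group.
events = {"20-24": 40, "25-29": 100, "30-34": 90, "35-40": 30}
exposure = {"20-24": 10_000, "25-29": 10_000, "30-34": 10_000, "35-40": 10_000}

rr = relative_risks(events, exposure, reference="25-29")
for group, value in rr.items():
    print(group, round(value, 2))
```

With several covariates, the same multiplicative model is commonly estimated as a Poisson regression on the event counts with the logarithm of exposure as an offset.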

Determinants of first birth risks
Model 2 includes a combination factor of earnings and employment, and shows a positive gradient; i.e. that higher earnings increased the first birth transition rates. However, we cannot rule out the possibility that the strong positive effects we observe are attributable to an acceleration of birth intensities among high income earners at advanced ages. This would mean that the proportionality assumption of the model was violated, and that the effects of earnings on the birth rates varied by age. It is therefore possible that there were interaction effects of earnings and age. This aspect will be addressed in the next step of the analysis, which also focuses more narrowly on how the effects of earnings change over time.
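The proportionality concern can be stated compactly. The notation in the following sketch is our own shorthand for this illustration and does not appear in the model tables: $h(a \mid x)$ denotes the first birth rate at age $a$, $\lambda_j$ the baseline rate in age group $j$, $x$ an indicator for high earnings, and $\beta$ and $\gamma_j$ illustrative coefficients.

```latex
\begin{align}
% Proportional specification: the earnings effect is identical in every age group j
h(a \mid x) &= \lambda_j \exp(\beta x), \qquad a \in \text{age group } j \\
% The relative risk of high versus low earnings then does not depend on age
\frac{h(a \mid x = 1)}{h(a \mid x = 0)} &= \exp(\beta) \\
% Non-proportional alternative: the earnings effect is allowed to vary by age group
h(a \mid x) &= \lambda_j \exp\bigl((\beta + \gamma_j)\, x\bigr)
\end{align}
```

If the $\gamma_j$ differ from zero, a single relative risk for earnings averages over age groups with different effects, which is one reason to estimate the models separately by age.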

Effect heterogeneity
The next step of the analysis contains more refined analyses by age group and calendar year. For that purpose, we have split the sample into a group of younger (ages 20-29) and older persons (ages 30-40). Furthermore, we have used the calendar year in interaction with women's earnings/employment. While an alternative strategy would be to use a three-way interaction, this would be more difficult to display. Figure 1 visualises the results separately for the younger and the older age groups (for the regression table, see Table A3 in the Appendix). The model corroborates earlier findings (see e.g. Andersson et al. 2014) showing that unemployment is positively associated with having the first birth at younger ages (below age 30), but is negatively associated with having children later in the life course. An important finding from the results is that the association between female employment, earnings, and first birth rates changed over time. After 2007, when the parental leave benefit reform came into force, the first birth rates increased among women with high earnings, while they declined or stagnated among women with lower earnings. Furthermore, the fertility pattern of unemployed women changed dramatically, with the first birth rates of younger unemployed women decreasing by 27 percent from 2005-2006 to 2009-2010. While unemployment had been conducive to childbearing at younger ages before the parental leave reform came into force, it appears that this was no longer the case. We also see a change in patterns at older ages. In this age group, the first birth rates of unemployed women had dropped by 16 percent over the same time period. The results of the model provide strong evidence that the prerequisites of family formation have shifted over time in Germany. It has become increasingly important for women to have decent earnings and regular employment before starting a family.

Note: Date of childbirth was backdated by nine months. * p<0.05; ** p<0.01; *** p<0.001. Source: VSKT 2015, own estimates
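When a status variable is interacted with the calendar period, the relative risk of a given status in a given period (compared with the reference status in the reference period) is obtained by combining the two main effects with the interaction term on the log scale. The following sketch shows this arithmetic with invented coefficients; the function name `combined_relative_risk` and all values are assumptions for illustration and are not taken from Table A3.

```python
import math

def combined_relative_risk(b_status, b_period, b_interaction):
    """Relative risk of a status-by-period cell versus the reference cell.

    All arguments are coefficients on the log scale; for the reference period
    or the reference status, the corresponding terms are zero.
    """
    return math.exp(b_status + b_period + b_interaction)

# Invented coefficients, written as logs of relative risks for readability.
b_unemployed = math.log(1.2)   # unemployed versus reference status, first period
b_later_period = math.log(1.1) # later period versus first period, reference status
b_interaction = math.log(0.7)  # additional shift for the unemployed in the later period

rr_first_period = combined_relative_risk(b_unemployed, 0.0, 0.0)
rr_later_period = combined_relative_risk(b_unemployed, b_later_period, b_interaction)
print(round(rr_first_period, 3), round(rr_later_period, 3))
```

Re-estimating the model with a different reference category, as is often done for such tables, changes which of these products can be read off directly but not the fitted rates themselves.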

Convincing evidence?
Can we attribute the pattern we observed to the parental leave benefit reform? While the findings allude to a decisive societal change, clearly attributing that shift to the parental leave regulations is more cumbersome. Causal modelling requires that "all reasonable alternative explanations (including confounders) must be ruled out" (Bhrolcháin/Dyson 2007: 9). We obviously cannot rule out the possibility that other policies, such as the increase in day care services for children that was initiated in 2005, are the more important drivers of the changes in behavioural patterns. An alternative explanation may also be the effects of the global financial crisis, which hit Europe at almost the same time as Germany enacted new parental leave regulations. The growing sense of economic insecurity in the wake of the crisis could have sharpened people's awareness that the single-earner model is a fragile family model, which may, in turn, have led to a reduction in first birth risks among unemployed women.
Would a more explicitly causal modelling strategy be better able to discern causal effects? A model that mapped the monthly birth rates shortly before and after the reform could provide more convincing support for the claim that there was a true "reform effect". Indeed, these kinds of analyses have been conducted using data from the German birth registers. The results showed that there was an immediate increase in births at the cut-off date of 1 January 2007 (Tamm 2013). Despite the beauty of such set-ups, these findings remain limited. They may not even provide clear-cut evidence that prospective parents acted in response to the reform. Instead, they probably indicate that medical doctors and midwives have some limited leeway to influence the dates of childbirth. Thus, given that the mechanisms at the cut-off date were likely to have been very specific, these results may not be generalisable. While intricate identification strategies can easily "throw the baby out with the bathwater" (for the same analogy, see Hill et al. 2020: 363), the simple event history modelling approach, as illustrated above, fails to provide clear-cut causal effects, but delivers more accessible and generalisable results.

Conclusions
The aim of this paper was to provide an overview of recent attempts to perform causal fertility modelling based on longitudinal data. In particular, we focused on research that had tried to investigate the causal effects of female employment and family policies on birth behaviour in European countries since the 1990s. The empirical section of the paper paid special attention to the German policy reform of 2007, and raised the question of how this reform affected first birth patterns. The "reform effect" was modelled via a time-varying variable that depicted the calendar year. A main finding from this investigation was that birth rates during unemployment declined over time, whereas first birth risks increased at higher earnings levels. These results provide important insights into fertility dynamics in Germany, as they illustrate that the prerequisites of family formation have shifted rather dramatically in recent years. These shifts may be attributed to the parental leave policy reform. However, the analysis also revealed how difficult it is to rule out "alternative explanations" with simple methods, considering that the expansion of day care as well as the global financial crisis occurred around the same time as the policy was implemented, and could have caused a similar pattern. Event history analysis is a concept that has evolved partially from demographic research, in which the aim is to portray the fertility behaviour of a given population. Event history modelling also shares the general understanding of sociological life course research (e.g. Elder 1985) that vital events must be situated in time and studied along major life course dimensions (i.e. by age when first births are analysed, and by duration since the last birth when higher order births are the focus of attention). A commitment to the life course approach is often absent from many of the more causal modelling strategies.
Thus, standard event history models are better able to deliver on the "life course dimension", and to provide more accessible "descriptions of the social world" (Brüderl/Ludwig 2015: 353). These descriptions are of considerable value, and should probably be used as the basis before moving to the next step of analysing causal effects. But what are the reasonable next steps for causal fertility research?
The literature overview demonstrated that causal modelling approaches, and fixed-effects methods in particular, have increasingly seeped into sociological and demographic research. The wide array of European panel data, such as data from the Generations and Gender Survey, the British Understanding Society study, or the German Family Panel, have clearly fuelled this development. However, demographic and sociological scholars have been very careful when applying fixed-effects methods to fertility processes. While these methods have been regularly used to study the question of how demographic events influence other outcomes, such as life satisfaction and well-being, they have rarely been used to examine birth dynamics as outcome variables. This is because in family sociology and demography, taking a parity-specific view is almost an imperative. This includes the notion that a first birth is a unique event. The analysis of first births is therefore not amenable to fixed-effects modelling since a requirement of this method is that the behaviour being studied can be repeated by the same individual under different conditions.
Other econometric methods from the programme evaluation literature, such as the regression discontinuity approach, have been employed to examine the causal effects of policy reforms on fertility behaviour. While this technique may be highly valuable in terms of advancing causal modelling in demographic research, some caution seems warranted when considering its use. It is important to keep in mind that these methods have been developed mainly for the evaluation of labour market programmes in which the outcome variables are usually frequent events, such as entering employment or receiving welfare benefits. This is clearly not the case for demographic events, which are rare. Thus, there may be only a few birth events around a cut-off date available in the survey data, which makes it difficult to generate significant results. We may intuitively prefer statistically insignificant, but well-specified results over significant and biased ones. However, an underpowered model can also raise concerns. After all, model results that are nowhere near conventional levels of significance will be difficult to interpret and hard to publish, despite all discussions and complaints about the publication bias in peer-reviewed journals. While conventional models jump to conclusions too quickly, these types of models may be characterised as being too cautious. Hill et al. (2020) pointed out that a "no-effect result", which is based on a seemingly solid causal investigation, can be harmful if it leads to the discontinuation of a potentially meaningful policy measure, for instance. What seems like a trivial and purely data-driven concern is often of considerable practical relevance, because the sample size can be a serious constraint, and because vital events are rare events.
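The practical weight of the rare-events problem can be illustrated with the standard large-sample approximation for the standard error of a log rate ratio, which depends only on the number of events on each side of the comparison. The event counts below are invented to contrast a survey-sized with a register-sized sample; the function name `rate_ratio_ci` is hypothetical.

```python
import math

def rate_ratio_ci(d_after, t_after, d_before, t_before, z=1.96):
    """Rate ratio with an approximate 95 percent confidence interval.

    d_*: number of events; t_*: exposure time (e.g. person-months).
    Uses the approximation SE(log ratio) = sqrt(1/d_after + 1/d_before).
    """
    ratio = (d_after / t_after) / (d_before / t_before)
    se_log = math.sqrt(1 / d_after + 1 / d_before)
    return ratio, ratio * math.exp(-z * se_log), ratio * math.exp(z * se_log)

# Same underlying 10 percent increase, observed with few and with many events.
small = rate_ratio_ci(22, 4_000, 20, 4_000)
large = rate_ratio_ci(2_200, 400_000, 2_000, 400_000)
print([round(v, 2) for v in small])
print([round(v, 2) for v in large])
```

In the small sample, the interval comfortably includes a ratio of one, so the same increase that is clearly detectable in the large sample would be reported as a "no-effect result".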
While acknowledging that a causal perspective is important for arriving at meaningful conclusions, the paper also emphasised that the choice of method should be carefully considered when studying demographic processes. We argued that fertility research can be moved forward through more careful reflection on the origins of the bias. The observation that people act on their anticipated behaviour was regarded as a core reason why simple models that study the relationship between women's work and birth behaviour often cannot be interpreted in a causal manner. As we noted, several scholars have proposed considering the role of "narratives", which also currently seem to be gaining ground in economic research (e.g. Bénabou et al. 2018; Shiller 2020). Instead of applying ever more sophisticated econometric modelling techniques, the future of causal fertility research in this area may lie in careful reflection on how suitable item sets could be developed to map these "imaginaries of the future self".
This overview paper has many limitations. It was restricted to fertility research based on studies that relied on longitudinal, prospective, or retrospective surveys, as well as on administrative data that were mostly collected in European countries. The paper discussed common-cause confounding and endogenous selection bias as prime sources of bias in studies that dealt with the relationship between female employment and fertility. In terms of the isolation of policy effects, several reasons were cited as to why standard tools from the programme evaluation literature cannot be used for studying fertility behaviour. Beyond the concerns listed in this paper, there are other reasons why an investigation may result in biased results. In particular, data quality may affect model results. We addressed the limitations related to small sample sizes when vital events are investigated based on survey data. However, there are numerous other aspects that revolve around the topic of "data quality". For example, recall bias in retrospective surveys can lead to biased models if events in the distant past are not remembered with the same precision as events that happened more recently. Furthermore, more salient events (childbirth) may be easier to remember than less salient ones (like events in a person's employment biography). While retrospective surveys suffer from recall bias and related problems, prospective studies suffer from attrition, which raises concerns about the selectivity of the drop-outs. Administrative data can have other problems depending on the registration system in a given country. For example, a problem of Nordic register data is that most of the information, such as information on employment status or childbirth, is only available on a yearly basis. For other countries, administrative data may not include the entire resident population. 
Data quality has not been addressed in this overview, but it is certainly an important additional and often overlooked aspect that warrants more attention.
Wisconsin) for his support in helping me to wrap my head around endogenous selection bias. Thanks also to Kai Baron (WZB) for his suggestions to engage with the economic literature on "narrative economics". For valuable comments on matching techniques in demographic research, I would also like to thank Daniel Brüggmann (German Pension Fund). Many thanks must also go to Tatjana Mika for sharing her expertise and knowledge on the German pension data. For language editing I would like to thank Miriam Hils. All remaining errors are my own.

1.19*** 1.30***
Employment: High earnings (100% and more) 1.11 1.68***

Note: Date of childbirth was backdated by nine months. Two separate models for younger ages (age 20-29) and older ages (age 30-40) were estimated. Furthermore, employment status was used in interaction with calendar year. The interaction model was re-estimated with different reference categories, so that the effect of unemployment (versus low income) could more easily be compared for a given year. Further covariates in the model are region (East/West), age (categorised and time-varying) and educational participation and other employment status in interaction with the time period. * p<0.05; ** p<0.01; *** p<0.001. Source: VSKT 2015, own estimates